2015-12-06 49 views
0

我已经在优化Eratosthenes Sieve的代码方面取得了一些进展,但我需要进一步提高指令数量和数据缓存命中率。任何帮助表示赞赏。优化MIPS中Eratosthenes的筛选

.data   # the data segment to store global data 
space: .asciiz " "  # whitespace to separate prime numbers 

    .text   # the text segment to store instructions 
    .globl main  # define main to be a global label 
main: li $s0, 0x00000000 # initialize $s0 with zeros 
    nor $s1, $s0, $s0 # saves one ALU count over using li $s1, 0x11111111 
    li $t9, 200 # find prime numbers from 2 to $t9 

    add $s2, $sp, 0 # backup bottom of stack address in $s2 

    li $t0, 2  # set counter variable to 2 

init: sw $s1, ($sp) # write ones to the stackpointer's address 
    add $t0, $t0, 1 # increment counter variable 
    sub $sp, $sp, 4 # subtract 4 bytes from stackpointer (push) 
    bne  $t0, $t9, init # take loop if $t0 != $t9, changed ble to bne 
    addi $t8, $t0, 15 # approximate square root of 200 
    li $t0, 1  # reset counter variable to 1 

outer: add  $t0, $t0, 2 # increment counter variable (start at 2) 
    mul $t1, $t0, $t0 # squaring $t0 and save it to $t1 
    bgt $t1, $t8, print # start printing prime numbers when $t1 > $t8, changed so only bgt if $t1 > square root 

check: add $t2, $s2, 0 # save the bottom of stack address to $t2 
    sll $t3, $t0, 2 # calculate the number of bytes to jump over 
    sub $t2, $t2, $t3 # subtract them from bottom of stack address 
    add $t2, $t2, 8 # add 2 words - we started counting at 2! 

    lw $t3, ($t2) # load the content into $t3 

    beq $t3, $s0, outer # only 0's? go back to the outer loop 

inner: add $t2, $s2, 0 # save the bottom of stack address to $t2 
    sll $t3, $t1, 2 # calculate the number of bytes to jump over 

    add  $t4, $t1, 2 # save $t1 + 2 into $t4, added 
    sll  $t5, $t4, 2 # mul by 4, added 

    sub $t2, $t2, $t3 # subtract them from bottom of stack address 

    add $t2, $t2, 8 # add 2 words - we started counting at 2! 

    sw $s0, ($t2) # store 0's -> it's not a prime number! 

    add $t1, $t1, $t0 # do this for every multiple of $t0 

    add $t1, $t1, $t0 # adding $t0 to $t1, added 
    sub $t2, $t2, $t5 # save $t2 - $t5 into $t2, added 
    add $t2, $t2, 8 # adding 8 to $t2, added 
    sw $s0, ($t2) # store contents of $s0 at address contained in $t2, added 
    add $t4, $t4, $t0 # adding $t0 to $t4, added 

    blt $t1, $t9, inner # every multiple done? go back to outer loop, changed to blt and branching to inner 

    j outer  # some multiples left? go back to inner loop, changed to branching to outer 

print: li $t0, 1  # reset counter variable to 1 

    # hard coding a 2 
    li $v0, 1 
    addi $a0, $a0, 2 
    syscall 

    # hard coding a space 
    li $v0, 4 
    la $a0, space 
    syscall 

count: add $t0, $t0, 2 # increment counter variable (start at 2), skipping even numbers 

    bgt $t0, $t9, exit # make sure to exit when all numbers are done (branch to exit if $t0 > $t9) 

    add $t2, $s2, 0 # save the bottom of stack address to $t2 
    sll $t3, $t0, 2 # calculate the number of bytes to jump over 
    sub $t2, $t2, $t3 # subtract them from bottom of stack address 
    add $t2, $t2, 8 # add 2 words - we started counting at 2! 

    lw $t3, ($t2) # load the content into $t3 
    beq $t3, $s0, count # only 0's? go back to count loop 

    add $t3, $s2, 0 # save the bottom of stack address to $t3 

    sub $t3, $t3, $t2 # substract higher from lower address (= bytes) 
    srl $t3, $t3, 2 # changed div to srl 
    add $t3, $t3, 2 # add 2 (words) = the final prime number! 

    li $v0, 1  # system code to print integer 
    add $a0, $t3, 0 # the argument will be our prime number in $t3 
    syscall   # print it! 

    li $v0, 4  # system code to print string 
    la $a0, space # the argument will be a whitespace 
    syscall   # print it! 

    bne $t0, $t9, count # take loop when $t0 != $t9, changed ble to bne 

exit: li $v0, 10  # set up system call 10 (exit) 
    syscall 

回答

0

第一个优化是在C/C++中编写算法并应用尽职调查(清理/收紧代码)。如果这还不够快,开始应用通常的优化工作埃拉托色尼的筛:

  • 使用打包位阵列,而不是由一个字节或字代表了许多
  • 使用的赔率只有筛(拉素2选自需要时薄空气)
  • 筛在小型,高速缓存友好段(例如在许多CPU 32KB L1高速缓存)
  • 记住段之间的偏移,以避免昂贵的模分割
  • 初始化片段与位模式的相当于通过一小撮筛子筛分ES
  • 用于并行

Sieve of Eratosthenes - segmented to increase speed and range筛分到所有可用核调度段示出了清洁和贫执行分段筛的并解释这些最佳化(除了多线程);它还具有指向这些优化的可编译测试程序的链接。

除非C编译器特别差,否则程序集编码可以将筛网的普通C实现的速度提高仅10%左右。在实际的数字中,这意味着C版本可能需要15秒才能筛选出高达2^32的数字,而汇编器版本可能需要13秒。应用上面列表中的前五个优化将C版本降低到大约2秒,并且添加最后一个可以让您少于一秒(如果您有8个空闲内核,则大约为0.3)。

如果速度不够快,那么汇编编码可能没有帮助,因为现在的优化更多的是减少争用,避免深度内存缓存层次结构的延迟,并保持管道充分供应指令,以准备执行(而不是等待依赖关系)。即,优化战争在算法和架构层面赢得了胜利,而不是装配微观优化。当前的编译器倾向于非常擅长自动应用微观优化,如指令调度,避免分支误预测处罚等。pp。当手工编译程序集需要很难达到编译代码的速度时,更不用说超越它了。

优化的下一个级别是将可能性筛选的概念扩展到排除更多的小素数(3,5,7等)。这通常被称为“轮子”。但是,增加辐条的收益越来越小。双辐轮(好几率筛)已经减少了一半的工作量,但在图片中增加3则只会进一步减少三分之一等等,同时使代码更加复杂。复杂性可能会使代码慢于仅赔率版本,所以您需要转到更高阶的车轮才能看到显着的改进。在任何情况下,无论如何,优化都需要用高级语言进行原型和参考实现,这些东西是人类可读的,例如, C/C++,并不需要MIPS的高级牧师解释。