我应该平行执行多少代码执行？

4

关键规则是“只有在分岔开销比分岔工作量小得多时才分岔”。由于分配开销是您使用的特定技术的一个属性，因此您需要努力完成这项工作，您在某种程度上必须凭经验确定。您可能最终会在代码中使用一些阈值调整常量来表示这种折衷。

你会发现在实践中发现可分离的工作块实际上很难。如果您将工作块设置得很小，则它没有太多依赖关系，并且只要所有输入数据流都准备就绪，您就可以安排它。但小块通常意味着小工作，而分配开销通常会抵消收益。如果你试图让这些块很大，他们有很多的依赖，你不能将它们分开来安排它们。

有些人很幸运，可以找到这么大的块;我们称这些人中的大多数是物理学家和/或Fortran程序员，他们正在利用将世界分割成尽可能多的小块的数据并行性。

我知道的唯一象样的治疗方法是使用一种极其快速的分叉机制，以便您可以找到最小的实用块。不幸的是，提供这种操作的并行库是动态调用的库，具有相应的动态调用开销。包含并行原语的典型库需要100到数千个周期才能实现“分叉”;如果你的大量工作是100条机器指令，这是一个坏消息。

我坚信为了得到如此快速的分叉机制，语言编译器必须知道你在做分叉，例如“fork”（拼写为:-)在语言中有一个关键字。然后编译器可以看到叉，并预先分配所需的所有内容以最小化完成此操作的时间，并生成特殊代码来管理分叉（和连接）步骤。

我设计的PARLANSE语言，以及我们在语义设计中使用的语言就是这样一种语言。它是一种类似于Lisp的语言（但不包含语义）。它的并行操作符拼写为“（|| ...）”。你可以在下面的日常使用的Quicksort模块中看到它。您还可以看到经验确定的显式QuickSortParallelThreshold值。此Quicksort在Intel x86系统上线性扩展至8个内核。

(define QuickSort 
    (module 
    (;; (define Value nu) 
     (compileifthen (~ (defined QuickSortWithParlanseBuiltInOrderingOfNu)) 
      (define QuickSortWithParlanseBuiltInOrderingOfNu ~f) ; use PARLANSE comparison operators 
     )compileifthen 
     (compileifthen (~ (defined QuickSortParallelThreshold)) 
      (define QuickSortParallelThreshold 100) 
     )compileifthen 
     (compileifthen (~ (defined QuickSortThreshold)) 
      (compileifthenelse QuickSortWithParlanseBuiltInOrderingOfNu 
      (define QuickSortThreshold 16) 
      (define QuickSortThreshold 8) 
     )compileifthenelse 
     )compileifthen 
     (compileifthenelse (~ (defined QuickSortWithCompareByReference)) 
      (define QuickSortWithCompareByReference ~f) 
      (compileifthen QuickSortWithParlanseBuiltInOrderingOfNu 
      (define QuickSortWithCompareByReference ~f) 
     )compileifthen 
     )compileifthenelse 
     (define SortRange 
      (action (procedure (structure (compileifthen (~ QuickSortWithParlanseBuiltInOrderingOfNu) 
              (compileifthenelse (~ QuickSortWithCompareByReference) 
              [compare (function (sort integer (range -1 +1)) (structure [value1 Value] [value2 Value]))] 
              [compare (function (sort integer (range -1 +1)) (structure [value1 (reference Value)] [value2 (reference Value)]))] 
             )compileifthenelse 
             )compileifthen 
             [a (reference (array Value 1 dynamic))] 
             [from natural] 
             [to natural] 
          )structure 
       )procedure 
      (local (;; (define quicksort 
         (action (procedure (structure [l integer] [r integer]))) 
         )define 

         (define quicksort 
         (action (procedure (structure [l integer] [r integer])) 
          (ifthenelse (<= (- r l) (coerce integer QuickSortThreshold)) 
          (do [i integer] (++ l) r +1 
           (local (= [exch Value] a:i) 
           (block exit_if_inserted 
            (;; (do [j integer] (-- i) l -1 
             (ifthenelse (compileifthenelse QuickSortWithParlanseBuiltInOrderingOfNu 
                 (> a:j exch) 
                 (compileifthenelse (~ QuickSortWithCompareByReference) 
                 (== (compare a:j exch) +1) 
                 (== (compare (. a:j) (. exch)) +1) 
                 )compileifthenelse 
                )compileifthenelse 
              (= a:(++ j) a:j) 
              (;; (= a:(++ j) exch) 
               (exitblock exit_if_inserted) 
              );; 
             )ifthenelse 
             )do 
             (= a:l exch) 
            );; 
           )block 
           )local 
          )do 
          (local (;; (= [i integer] l) 
             (= [j integer] r) 
             (= [p integer] l) 
             (= [q integer] r) 
             [exch Value] 
            );; 
           (;; 
            `use middle element as pivot': 
            (local (= [m integer] (// (+ l r) +2)) 
             (;; (= exch a:m) 
              (= a:m a:r) 
              (= a:r exch) 
            );; 
            )local 
            `4-way partitioning = < > =': 
            (loop exit_if_partitioned 
             (;; 
             `find element greater than pivot': 
              (loop exit_if_greater_than_found 
              (;; (compileifthenelse QuickSortWithParlanseBuiltInOrderingOfNu 
                (ifthenelse (< a:i a:r) 
                (consume ~t) 
                (ifthenelse (> a:i a:r) 
                 (exitblock exit_if_greater_than_found) 
                 (;; (ifthen (>= i j) 
                  (exitblock exit_if_partitioned) 
                  )ifthen 
                  (= exch a:p) 
                  (= a:p a:i) 
                  (= a:i exch) 
                  (+= p 1) 
                 );; 
                )ifthenelse 
                )ifthenelse 
                (case (compileifthenelse (~ QuickSortWithCompareByReference) 
                  (compare a:i a:r) 
                  (compare (. a:i) (. a:r)) 
                 )compileifthenelse 
                -1 
                 (consume ~t) 
                +1 
                 (exitblock exit_if_greater_than_found) 
                else (;; (ifthen (>= i j) 
                   (exitblock exit_if_partitioned) 
                  )ifthen 
                   (= exch a:p) 
                   (= a:p a:i) 
                   (= a:i exch) 
                   (+= p 1) 
                 );; 
                )case 
               )compileifthenelse 
               (+= i 1) 
              );; 
              )loop 
             `find element less than to pivot': 
              (loop exit_if_less_than_found 
              (;; (-= j 1) 
               (ifthen (>= i j) 
                (exitblock exit_if_partitioned) 
               )ifthen 
               (compileifthenelse QuickSortWithParlanseBuiltInOrderingOfNu 
                (ifthenelse (< a:j a:r) 
                (exitblock exit_if_less_than_found) 
                (ifthenelse (> a:j a:r) 
                 (consume ~t) 
                 (;; (-= q 1) 
                  (= exch a:j) 
                  (= a:j a:q) 
                  (= a:q exch) 
                 );; 
                )ifthenelse 
                )ifthenelse 
                (case (compileifthenelse (~ QuickSortWithCompareByReference) 
                  (compare a:j a:r) 
                  (compare (. a:j) (. a:r)) 
                 )compileifthenelse 
                -1 
                 (exitblock exit_if_less_than_found) 
                +1 
                 (consume ~t) 
                else (;; (-= q 1) 
                   (= exch a:j) 
                   (= a:j a:q) 
                   (= a:q exch) 
                 );; 
                )case 
               )compileifthenelse 
              );; 
              )loop 
             `move found elements to proper partitions': 
              (;; (= exch a:i) 
               (= a:i a:j) 
               (= a:j exch) 
              );; 
             `increment index': 
              (+= i 1) 
            );; 
            )loop 
            `3-way partitioning <=>': 
            (;; 
             `move pivot to final location': 
             (;; (= exch a:i) 
              (= a:i a:r) 
              (= a:r exch) 
              (= j (-- i)) 
              (= i (++ i)) 
             );; 
             `move elements equal to pivot to final locations': 
             (;; (do [k integer] l (-- p) +1 
               (;; (= exch a:k) 
                (= a:k a:j) 
                (= a:j exch) 
                (-= j 1) 
               );; 
              )do 
              (do [k integer] (-- r) q -1 
               (;; (= exch a:i) 
                (= a:i a:k) 
                (= a:k exch) 
                (+= i 1) 
               );; 
              )do 
             );; 
            );; 
            `sort partitions not equal to pivot': 
            (ifthenelse (<= (- r l) (coerce integer QuickSortParallelThreshold)) 
             (;; (quicksort l j) 
              (quicksort i r) 
            );; 
             (|| (quicksort l j) 
              (quicksort i r) 
            )|| 
            )ifthenelse 
           );; 
          )local 
          )ifthenelse 
         )action 
         )define 

        );; 
       (;; (quicksort (coerce integer from) (coerce integer to)) 
        (ifdebug (do [i integer] (coerce integer from) (-- (coerce integer to)) +1 
          (trust (compileifthenelse QuickSortWithParlanseBuiltInOrderingOfNu 
             (<= a:i a:(++ i)) 
             (compileifthenelse (~ QuickSortWithCompareByReference) 
             (<= (compare a:i a:(++ i)) +0) 
             (<= (compare (. a:i) (. a:(++ i))) +0) 
            )compileifthenelse 
            )compileifthenelse 
            `QuickSort:Sort -> The array is not sorted.' 
          )trust 
          )do 
       )ifdebug 
      );; 
      )local 
     )action 
     )define 

     (define Sort 
      (action (procedure (structure (compileifthen (~ QuickSortWithParlanseBuiltInOrderingOfNu) 
              (compileifthenelse (~ QuickSortWithCompareByReference) 
              [compare (function (sort integer (range -1 +1)) (structure [value1 Value] [value2 Value]))] 
              [compare (function (sort integer (range -1 +1)) (structure [value1 (reference Value)] [value2 (reference Value)]))] 
             )compileifthenelse 
             )compileifthen 
             [a (reference (array Value 1 dynamic))] 
          )structure 
       )procedure 
      (compileifthenelse (~ QuickSortWithParlanseBuiltInOrderingOfNu) 
       (SortRange compare a (coerce natural (lowerbound (@ a) 1)) (coerce natural (upperbound (@ a) 1))) 
       (SortRange a (coerce natural (lowerbound (@ a) 1)) (coerce natural (upperbound (@ a) 1))) 
      )compileifthenelse 
     )action 
     )define 

    );; 
)module 
)define

来源

2010-02-06 21:44:10

2

它取决于线程间通信的开销。我用图像处理测试了openMP，并且有一行像素很方便，同时也提供了很好的加速。我的图像是百万像素，所以有1000个任务，这可能足以让今天的许多核心机器忙碌起来。你也不需要限制自己只需要一秒钟左右的工作。在这个例子中，10毫秒量级的作业加速清晰可见。

现在，这是一个令人愉快的算法，因为它不是递归的，所以一个任务对另一个任务没有依赖关系，并且所有任务自动具有相同的大小。

由于任务大小不同，排序算法将变得更加困难。你希望能够尝试这一点，也许可以选择一种更容易并行化的排序。

来源

2010-02-06 21:38:24

1

参加并行和并行编程的一些课程。学习几种技术，如简单的旧叉，忘记或“手动”多线程（Java线程或pthreads），MPI，OpenMP，BSP，甚至可能是CUDA或OpenCL。然后或者决定成为专家，或者让专家设计和实现高效和正确的并行算法。 “平行”部分很容易，当需要两者时，“高效”和“正确”部分不是。即使是由专家设计和实施的Java并发矢量集合，在第一个版本中也没有缺陷。内存模型的简单定义在Java标准的第一个版本中并不清楚！

最简单的规则：使用现成可用的组件设计，由专家实现，不要试图同时实现正确性和高效设计自己的并行算法，除非你是一个专家。

来源

2010-02-06 22:54:04

1

以编程方式解决此问题是并行计算的圣杯之一，并且有许多库可以近似特定问题（例如，Data Parallel Haskell）的最佳并行性。

总之，通过手工做到这一点，你需要了解：

您要并行算法（难道是并行）
数据的特征，例如，大小，位置（在磁盘上，在内存中）等。
您正在运行的硬件，例如，数量核心，内存延迟，缓存大小/行/结合等。
实现语言（协程，绿色线程，OS线程）和操作系统。
线程之间产卵和上下文切换的代价。

假设算法是可并行化的，您的目标是找到线程的数量和数据的相对块大小，以便您可以优化使用硬件来生成解决方案。

这是很难做到没有大量的实验。我的首选方法是运行大量基准测试，并将性能数据作为以下一项或多项组合的函数：

线程数。
缓冲区大小（如果数据不在RAM中）以一些合理的值（例如，块大小，数据包大小，缓存大小等）递增
变化块大小（如果可以逐渐处理数据）。
用于操作系统或语言运行时的各种调节旋钮。
将线程固定到CPU以改善局部性。
等等

总之，这是不容易的任务，并且有一些工具和函数库，有助于为可能出你的并行问题，你挤尽可能多的性能。通过对数据，代码和运行时环境有良好的理解，您可以正确地做到这一点。

来源

2010-02-07 01:49:19 0xfe

我应该平行执行多少代码执行？

回答

相关问题