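There is a file in the Cilk runtime that you might find interesting: cilk-sysdep.h, which contains the system-specific mappings for memory barriers. I have extracted the small section that is relevant to your question for x86, i.e. i386.
File: cilk-sysdep.h (the numbers on the left-hand side are the line numbers in that file)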
252 * We use an xchg instruction to serialize memory accesses, as can
253 * be done according to the Intel Architecture Software Developer's
254 * Manual, Volume 3: System Programming Guide
255 * (http://www.intel.com/design/pro/manuals/243192.htm), page 7-6,
256 * "For the P6 family processors, locked operations serialize all
257 * outstanding load and store operations (that is, wait for them to
258 * complete)." The xchg instruction is a locked operation by
259 * default. Note that the recommended memory barrier is the cpuid
260 * instruction, which is really slow (~70 cycles). In contrast,
261 * xchg is only about 23 cycles (plus a few per write buffer
262 * entry?). Still slow, but the best I can find. -KHR
263 *
264 * Bradley also timed "mfence", and on a Pentium IV xchgl is still quite a bit faster
265 * mfence appears to take about 125 ns on a 2.5GHZ P4
266 * xchgl apears to take about 90 ns on a 2.5GHZ P4
267 * However on an opteron, the performance of mfence and xchgl are both *MUCH MUCH BETTER*.
268 * mfence takes 8ns on a 1.5GHZ AMD64 (maybe this is an 801)
269 * sfence takes 5ns
270 * lfence takes 3ns
271 * xchgl takes 14ns
272 * see mfence-benchmark.c
273 */
274 int x=0, y;
275 __asm__ volatile ("xchgl %0,%1" :"=r" (x) :"m" (y), "0" (x) :"memory");
276 }
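What I like about this is that xchgl appears to be faster :) at least on the Pentium 4 figures quoted above (on the Opteron, mfence comes out ahead). You should really implement it and measure it on your own hardware, though.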
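If you want to try the xchgl trick yourself, here is a minimal sketch of how it might be wrapped, assuming GCC or Clang inline assembly. The names full_fence, publish, consume, data_value and ready_flag are hypothetical and not taken from Cilk. Two things differ from the excerpt and are my own assumptions: the memory operand is declared read-write ("+m"), since xchg writes to it, and non-x86 targets fall back to the __sync_synchronize() builtin.

```c
/* Sketch only: a full memory barrier in the style of the cilk-sysdep.h
 * excerpt. All names here are hypothetical, not Cilk's. */
static inline void full_fence(void)
{
#if defined(__i386__) || defined(__x86_64__)
    int x = 0, y = 0;
    /* xchg with a memory operand is implicitly locked, so it serializes
     * outstanding loads and stores. The "memory" clobber additionally
     * stops the compiler from reordering accesses across this point. */
    __asm__ volatile ("xchgl %0, %1" : "+r" (x), "+m" (y) : : "memory");
#else
    /* Assumption: on other targets use the GCC/Clang full-barrier builtin. */
    __sync_synchronize();
#endif
}

static int data_value;          /* payload written before the flag */
static volatile int ready_flag; /* set once data_value is valid */

/* Writer: store the payload, fence, then raise the flag. */
static inline void publish(int v)
{
    data_value = v;
    full_fence();
    ready_flag = 1;
}

/* Reader: returns the payload, or -1 if nothing has been published yet. */
static inline int consume(void)
{
    if (!ready_flag)
        return -1;
    full_fence();
    return data_value;
}
```

To compare it against mfence, you could time a loop of full_fence() calls on your own machine; the numbers in the excerpt are for specific P4 and Opteron parts and will not carry over.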
What does 'lwsync' do? – 2010-08-17 10:52:48