玩了一段时间后导致了另一种方法,使我们可以用模板元编程结合C++ 11的可变参数模板和lambda功能展开的for循环所需的数字:
template <unsigned int DIMENSION>
inline unsigned long tuple_to_index_fixed_dimension(const_tup_t tup, const_tup_t shape) {
unsigned long res = 0; unsigned int k;
for (k=0; k<DIMENSION−1; ++k) {
res += tup[k];
res *= shape[k+1];
}
res += tup[k];
return res;
}
template <unsigned int DIMENSION, unsigned int CURRENT>
class ForEachFixedDimensionHelper {
public:
template <typename FUNCTION, typename ...TENSORS>
inline static void apply(tup_t counter, const_tup_t shape, FUNCTION function, TENSORS & ...args) {
for (counter[CURRENT]=0; counter[CURRENT]<shape[CURRENT]; ++counter[CURRENT])
ForEachFixedDimensionHelper<DIMENSION−1, CURRENT+1>::template apply<FUNCTION, TENSORS...>(counter, shape, function, args...);
}
};
template <unsigned int CURRENT>
class ForEachFixedDimensionHelper<1u, CURRENT> {
public:
template <typename FUNCTION, typename ...TENSORS>
inline static void apply(tup_t counter, const_tup_t shape, FUNCTION function, TENSORS & ...args) {
for (counter[CURRENT]=0; counter[CURRENT]<shape[CURRENT]; ++counter[CURRENT])
function(args[tuple_to_index_fixed_dimension<CURRENT+1>(counter, args.data_shape())]...); /* tensor.data_shape() is an accessor for returning the shape member. */
}
};
template <unsigned char DIMENSION>
class ForEachFixedDimension {
public:
template <typename FUNCTION, typename ...TENSORS>
inline static void apply(const_tup_t shape, FUNCTION function, TENSORS & ...args) {
unsigned long counter[DIMENSION];
memset(counter, 0, DIMENSION*sizeof(unsigned long));
ForEachFixedDimensionHelper<DIMENSION,0>::template apply<FUNCTION, TENSORS...>(counter, shape, function, args...);
}
};
还要注意元组值和形状可以安全地声明为__restrict,这意味着它们指向不同的内存位置,因为它们将专门构建用于迭代,然后解除分配。当另一个指针被解除引用和更改时(“指针别名”问题),由这些指针索引的值不需要从内存中重新读取。当调用ForEachFixedDimension :: template apply时,可以在编译时根据张量参数的内容推断出typename FUNCTION(可能是一个lambda函数)和模板参数包typename ... TENSORS(variadic support)参数类型的功能。的展开for循环
所需的数量可以在运行时抬头:
typedef unsigned int TEMPLATE_SEARCH_INT_TYPE;
template <TEMPLATE_SEARCH_INT_TYPE MINIMUM, TEMPLATE_SEARCH_INT_TYPE MAXIMUM, template <TEMPLATE_SEARCH_INT_TYPE> class WORKER>
class LinearTemplateSearch {
public:
template <typename...ARG_TYPES>
inline static void apply(TEMPLATE_SEARCH_INT_TYPE v, ARG_TYPES && ... args) {
if (v == MINIMUM)
WORKER<MINIMUM>::apply(std::forward<ARG_TYPES>(args)...);
else
LinearTemplateSearch<MINIMUM+1, MAXIMUM, WORKER>::apply(v, std::forward<ARG_TYPES>(args)...);
}
};
template <TEMPLATE_SEARCH_INT_TYPE MAXIMUM, template <TEMPLATE_SEARCH_INT_TYPE> class WORKER >
class LinearTemplateSearch<MAXIMUM, MAXIMUM, WORKER> {
public:
template <typename...ARG_TYPES>
inline static void apply(TEMPLATE_SEARCH_INT_TYPE v, ARG_TYPES && ... args) {
assert(v == MAXIMUM);
WORKER<MAXIMUM>::apply(std::forward<ARG_TYPES>(args)...);
}
};
注意,在这里,尽管模板递归时,尺寸需要不直到运行时是已知的。这基本上是通过使用模板作为即时(JIT)编译形式,为所有感兴趣的维度预先计算策略,然后在运行时查找正确的策略来实现的。
所以这些方法用Benchmark进行了测试。在基准1中,数据被从形状的张量(2 ,2 ,)复制到形状的张量(2 ,2 ,2 )。在基准2,和形状的两个张量之间的内积(2 ,2 ,2 )(2 ,2 ,2 )被计算(仅来访由两者共享的元组索引)。将模板递归的实现与其他替代方法进行比较:元组迭代;元组迭代,其中维在编译时已知;整数重新索引;整数重新索引,其中轴限制为2的幂; numpy的; C风格for循环(硬编码);矢量化的Fortran代码; Go中的循环。
事实证明模板递归比元组索引,并且该方法通过升压使用更快:
灰色数字表示平均运行时和误差棒的分和最大。这里是本方法的那些实施了用于基准1为每个方法:
// Tuple iteration (DIMENSION must be compile−time constant): vector<unsigned long> t(DIMENSION);
t.fill(0);
unsigned long k;
for (k=0; k<x.flat.size(); advance_tuple_fixed_dimension<DIMENSION>(&t[0], &x.data_shape()[0]), ++k)
x[k] = y[tuple_to_index_fixed_dimension<DIMENSION>(&t[0], &y.data_shape()[0])];
// boost:
x[ boost::indices[range(0, x.shape[0])][range(0,x.shape[1])][range(0,x.shape[2])] ] = y[ boost::indices[range(0,x.shape[0])][range(0,x.shape[1])][range(0,x.shape[2])] ];
! Fortran 95
x = y(1:2**5,1:2**9,1:2**9)
// Hard−coded for loops in C: unsigned long k;
for (k=0; k<x.data_shape()[0]; ++k) {
for (unsigned long j=0; j<x.data_shape()[1]; ++j) {
unsigned long x_bias = (k*x.data_shape()[1] + j)*x.data_shape()[2];
unsigned long y_bias = (k*y.data_shape()[1] + j)*y.data_shape()[2];
for (unsigned long i=0; i<x.data_shape()[2]; ++i)
x[x_bias + i] = y[y_bias + i];
}
}
// Integer reindexing:
unsigned long k;
for (k=0; k<x.flat.size(); ++k)
x[k] = y[reindex(k, &x.data_shape()[0], &y.data_shape()[0], DIMENSION)];
// Integer reindexing (axes are powers of 2):
unsigned long k;
for (k=0; k<x.flat.size(); ++k)
x[k] = y[reindex_powers_of_2(k, &x_log_shape[0], &y_log_shape[0], DIMENSION)];
// Tuple iteration (DIMENSION unknown at compile time):
vector<unsigned long> t(DIMENSION);
t.fill(0);
unsigned long k;
for (k=0; k<x.flat_size(); advance_tuple(&t[0], &x.data_shape()[0], DIMENSION), ++k)
x[k] = y[t];
# numpy (python):
x_sh = x.shape.
x = np.array(y[:x_sh[0], :x_sh[1], :x_sh[2]])
// Go:
for i:=0; i<1<<9; i++ {
for j:=0; j<1<<9; j++{
for k:=0; k<1<<5; k++{
x[i][j][k] = y[i][j][k]
}
}
}
// TRIOT (DIMENSION unknown at compile time):
apply_tensors([](double & xV, double yV) {
xV = yV;
},
x.data_shape(),
x, y);
令人惊讶地,整数重新索引(即使轴是2的幂)基本上不是使一个元组计数器慢。与模板递归版本有时更快(包括比boost更快30%,即使boost :: multi_array必须知道编译时的维数)。
这里是你将如何使用这个嵌套的循环招用模板递归一个又一个例子:
double dot_product(const Tensor & x<double>, const Tensor<double> & y) { // This function written for homogeneous types, but not unnecessary
double tot = 0.0;
for_each_tensors([&tot](double xV, double yV) {
tot += xV * yV;
},
x.data_shape(), /* Iterate over valid tuples for x.data_shape(); as written, this line assumes x has smaller shape*/
x, y);
return tot;
}
,并通过元组迭代多维卷积的实现,该版本与模板递归和numpy的进行了比较通过卷积两个矩阵,每个矩阵具有形状(2 ,2 )。
Tensor<double> triot_naive_convolve(const Tensor<double> & lhs, const Tensor<double> & rhs) {
assert(lhs.dimension() == rhs.dimension());
Tensor<double> result(lhs.data_shape() + rhs.data_shape() − 1ul);
result.flat().fill(0.0);
Vector<unsigned long> counter_result(result.dimension());
enumerate_for_each_tensors([&counter_result, &result, &rhs](const_tup_t counter_lhs, const unsigned int dim_lhs, double lhs_val) {
enumerate_for_each_tensors([&counter_result, &result, &rhs, &counter_lhs, &lhs_val](const_tup_t counter_rhs, const unsigned int dim_rhs, double rhs_val) {
for (unsigned int i=0; i<dim_rhs; ++i)
counter_result[i] = counter_lhs[i] + counter_rhs[i];
unsigned long result_flat = tuple_to_index(counter_result, result.data_shape(), dim_rhs);
result.flat()[result_flat] += lhs_val * rhs_val;
},
rhs.data_shape(),
rhs);
},
lhs.data_shape(), lhs);
return result;
}
这些基准是定时的2.0 GHz的英特尔Core i7的芯片上的优化(-std = C++ 11 -Ofast -march =天然 - mtune中=天然-fomit帧指针)。所有Fortran实现都以相反顺序使用轴,并以缓存优化方式访问数据,因为Fortran使用列主数组格式。 可在此small journal article中找到详细信息和源代码(一个简单的多维数组库,其中在编译时不需要知道维数)。