您可以想出一些方法来优化这段代码吗?它意味着在ARMv7处理器(Iphone 3GS)中执行:优化C++代码的性能
4.0% inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols)
{
0.7% float *data = (float *) img->imageData;
1.4% int step = img->widthStep/sizeof(float);
// The subtraction by one for row/col is because row/col is inclusive.
1.1% int r1 = std::min(row, img->height) - 1;
1.0% int c1 = std::min(col, img->width) - 1;
2.7% int r2 = std::min(row + rows, img->height) - 1;
3.7% int c2 = std::min(col + cols, img->width) - 1;
float A(0.0f), B(0.0f), C(0.0f), D(0.0f);
8.5% if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1];
11.7% if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2];
7.6% if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1];
9.2% if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];
21.9% return std::max(0.f, A - B - C + D);
3.8% }
所有此代码取自OpenSURF库。这里的功能的情况下(有些人要求的上下文中):
//! Calculate DoH responses for supplied layer
void FastHessian::buildResponseLayer(ResponseLayer *rl)
{
float *responses = rl->responses; // response storage
unsigned char *laplacian = rl->laplacian; // laplacian sign storage
int step = rl->step; // step size for this filter
int b = (rl->filter - 1) * 0.5 + 1; // border for this filter
int l = rl->filter/3; // lobe for this filter (filter size/3)
int w = rl->filter; // filter size
float inverse_area = 1.f/(w*w); // normalisation factor
float Dxx, Dyy, Dxy;
for(int r, c, ar = 0, index = 0; ar < rl->height; ++ar)
{
for(int ac = 0; ac < rl->width; ++ac, index++)
{
// get the image coordinates
r = ar * step;
c = ac * step;
// Compute response components
Dxx = BoxIntegral(img, r - l + 1, c - b, 2*l - 1, w)
- BoxIntegral(img, r - l + 1, c - l * 0.5, 2*l - 1, l)*3;
Dyy = BoxIntegral(img, r - b, c - l + 1, w, 2*l - 1)
- BoxIntegral(img, r - l * 0.5, c - l + 1, l, 2*l - 1)*3;
Dxy = + BoxIntegral(img, r - l, c + 1, l, l)
+ BoxIntegral(img, r + 1, c - l, l, l)
- BoxIntegral(img, r - l, c - l, l, l)
- BoxIntegral(img, r + 1, c + 1, l, l);
// Normalise the filter responses with respect to their size
Dxx *= inverse_area;
Dyy *= inverse_area;
Dxy *= inverse_area;
// Get the determinant of hessian response & laplacian sign
responses[index] = (Dxx * Dyy - 0.81f * Dxy * Dxy);
laplacian[index] = (Dxx + Dyy >= 0 ? 1 : 0);
#ifdef RL_DEBUG
// create list of the image coords for each response
rl->coords.push_back(std::make_pair<int,int>(r,c));
#endif
}
}
}
一些问题:
这是个好主意,该功能是内联? 会使用内联汇编提供显着的加速吗?
对您的问题的单一正确答案是:措施。 – dirkgently 2010-09-08 14:02:09
是的,看看最近的C++问题 - 有一个关于向量与数组的速度 - 代码显示了如何使用升压定时器进行性能分析。您也可以查看graphics.stanford.edu/~seander/bithacks.html - 这里的许多小黑客可以提供更快捷的处理方式。内联汇编 - 也许 - 我不知道那个CPU,所以不能说。 – 2010-09-08 14:16:54
r1,r2,c1,c2中的任何一个为负数?那些测试应该都是多余的。 – phkahler 2010-09-08 14:45:10