Faster convolution on iOS

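I'm trying to convolve an image with a generated 16x16 kernel. I used OpenCV's FilterEngine class, but it only runs on the CPU, and I'm trying to speed up the application. I know OpenCV also has FilterEngine_GPU, but my understanding is that it isn't supported on iOS. GPUImage lets you perform a convolution with a generated 3x3 filter. Is there any other way to speed up the convolution? A different library that runs on the GPU?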

Source

2013-10-14 RamBracha

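You can use GPUImage to do a 16x16 convolution, but you'll need to write your own filter to do so. The 3x3 convolution in the framework samples the pixels in a 3x3 area around each pixel of the input image and applies the matrix of weights you feed in. The GPUImage3x3ConvolutionFilter.m source file within the framework should be reasonably easy to read, but I can provide a little background in case you want to go beyond what I have there.

The first thing I do is use the following vertex shader: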

attribute vec4 position; 
attribute vec4 inputTextureCoordinate; 

uniform float texelWidth; 
uniform float texelHeight; 

varying vec2 textureCoordinate; 
varying vec2 leftTextureCoordinate; 
varying vec2 rightTextureCoordinate; 

varying vec2 topTextureCoordinate; 
varying vec2 topLeftTextureCoordinate; 
varying vec2 topRightTextureCoordinate; 

varying vec2 bottomTextureCoordinate; 
varying vec2 bottomLeftTextureCoordinate; 
varying vec2 bottomRightTextureCoordinate; 

void main() 
{ 
    gl_Position = position; 

    vec2 widthStep = vec2(texelWidth, 0.0); 
    vec2 heightStep = vec2(0.0, texelHeight); 
    vec2 widthHeightStep = vec2(texelWidth, texelHeight); 
    vec2 widthNegativeHeightStep = vec2(texelWidth, -texelHeight); 

    textureCoordinate = inputTextureCoordinate.xy; 
    leftTextureCoordinate = inputTextureCoordinate.xy - widthStep; 
    rightTextureCoordinate = inputTextureCoordinate.xy + widthStep; 

    topTextureCoordinate = inputTextureCoordinate.xy - heightStep; 
    topLeftTextureCoordinate = inputTextureCoordinate.xy - widthHeightStep; 
    topRightTextureCoordinate = inputTextureCoordinate.xy + widthNegativeHeightStep; 

    bottomTextureCoordinate = inputTextureCoordinate.xy + heightStep; 
    bottomLeftTextureCoordinate = inputTextureCoordinate.xy - widthNegativeHeightStep; 
    bottomRightTextureCoordinate = inputTextureCoordinate.xy + widthHeightStep; 
}

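This calculates the positions from which to sample the pixel colors used in the convolution. Because normalized coordinates are used, the X and Y spacing between pixels is 1.0/[image width] and 1.0/[image height], respectively.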
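As a rough CPU-side illustration of that coordinate arithmetic (not GPUImage code), the following Python sketch reproduces the nine sample positions the vertex shader emits. The function name `neighbor_coordinates` and the `OFFSETS` table are hypothetical names chosen for this example; the offset signs are copied from the shader above.

```python
# Offsets in whole texels, (dx, dy), mirroring the signs used in the vertex shader:
# "top" subtracts heightStep, "bottom" adds it, "left" subtracts widthStep.
OFFSETS = {
    "topLeft": (-1, -1), "top": (0, -1), "topRight": (1, -1),
    "left": (-1, 0), "center": (0, 0), "right": (1, 0),
    "bottomLeft": (-1, 1), "bottom": (0, 1), "bottomRight": (1, 1),
}


def neighbor_coordinates(u, v, image_width, image_height):
    """Return the nine normalized texture coordinates around (u, v)."""
    texel_width = 1.0 / image_width    # X spacing between adjacent pixels
    texel_height = 1.0 / image_height  # Y spacing between adjacent pixels
    return {
        name: (u + dx * texel_width, v + dy * texel_height)
        for name, (dx, dy) in OFFSETS.items()
    }
```

For a 640x480 image, `neighbor_coordinates(0.5, 0.5, 640, 480)["right"]` is one texel (1/640) to the right of the center in X, with Y unchanged.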

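There are two reasons for calculating the texture coordinates of the pixels to be sampled in the vertex shader. First, it is more efficient to do this calculation once per vertex (there are six vertices in the two triangles that make up the image rectangle) than once per fragment (pixel). Second, it avoids dependent texture reads where possible. A dependent texture read is one where the texture coordinate to read from is calculated in the fragment shader rather than simply passed in from the vertex shader, and such reads are much slower on iOS GPUs.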

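Once I have the texture locations calculated in the vertex shader, I pass them into the fragment shader as varyings and use the following code there: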

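// OpenGL ES fragment shaders have no default float precision, so declare one.
precision highp float;

uniform sampler2D inputImageTexture; 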

uniform mat3 convolutionMatrix; 

varying vec2 textureCoordinate; 
varying vec2 leftTextureCoordinate; 
varying vec2 rightTextureCoordinate; 

varying vec2 topTextureCoordinate; 
varying vec2 topLeftTextureCoordinate; 
varying vec2 topRightTextureCoordinate; 

varying vec2 bottomTextureCoordinate; 
varying vec2 bottomLeftTextureCoordinate; 
varying vec2 bottomRightTextureCoordinate; 

void main() 
{ 
    vec3 bottomColor = texture2D(inputImageTexture, bottomTextureCoordinate).rgb; 
    vec3 bottomLeftColor = texture2D(inputImageTexture, bottomLeftTextureCoordinate).rgb; 
    vec3 bottomRightColor = texture2D(inputImageTexture, bottomRightTextureCoordinate).rgb; 
    vec4 centerColor = texture2D(inputImageTexture, textureCoordinate); 
    vec3 leftColor = texture2D(inputImageTexture, leftTextureCoordinate).rgb; 
    vec3 rightColor = texture2D(inputImageTexture, rightTextureCoordinate).rgb; 
    vec3 topColor = texture2D(inputImageTexture, topTextureCoordinate).rgb; 
    vec3 topRightColor = texture2D(inputImageTexture, topRightTextureCoordinate).rgb; 
    vec3 topLeftColor = texture2D(inputImageTexture, topLeftTextureCoordinate).rgb; 

    vec3 resultColor = topLeftColor * convolutionMatrix[0][0] + topColor * convolutionMatrix[0][1] + topRightColor * convolutionMatrix[0][2]; 
    resultColor += leftColor * convolutionMatrix[1][0] + centerColor.rgb * convolutionMatrix[1][1] + rightColor * convolutionMatrix[1][2]; 
    resultColor += bottomLeftColor * convolutionMatrix[2][0] + bottomColor * convolutionMatrix[2][1] + bottomRightColor * convolutionMatrix[2][2]; 

    gl_FragColor = vec4(resultColor, centerColor.a);
}

This reads each of the nine colors and applies the weights from the 3x3 matrix that was supplied for the convolution.
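For checking shader output against a CPU reference, the same weighted sum can be sketched in plain Python. This is only an illustration under stated assumptions: `convolve3x3` is a hypothetical name, the image is a single channel stored as a list of rows, out-of-range samples are clamped to the edge (the shader's behavior there depends on the texture wrap mode), and `weights[row][col]` is indexed with row 0 as the top row, in the same order the shader applies its matrix entries.

```python
def convolve3x3(image, weights):
    """Apply a 3x3 weight matrix around every pixel of a single-channel image.

    image:   list of rows, each a list of floats
    weights: 3x3 nested list; weights[0] is the top row, weights[2] the bottom
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for row in range(3):        # 0 = top, 1 = center, 2 = bottom
                for col in range(3):    # 0 = left, 1 = center, 2 = right
                    # Clamp sample positions to the image edge (assumption).
                    sy = min(max(y + row - 1, 0), height - 1)
                    sx = min(max(x + col - 1, 0), width - 1)
                    total += image[sy][sx] * weights[row][col]
            out[y][x] = total
    return out
```

An identity matrix (1 in the center, 0 elsewhere) returns the image unchanged, which makes a convenient first sanity check.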

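That said, a 16x16 convolution is a fairly expensive operation. You're looking at 256 texture reads per pixel. On older devices (iPhone 4 or so), you got about 8 texture reads per pixel for free if they were non-dependent reads. Once you went over that, performance started to drop dramatically. Later GPUs sped this up significantly, though. The iPhone 5S, for example, does over 40 dependent texture reads per pixel almost for free. Even the heaviest shaders on 1080p video barely slow it down.

As sansuiso suggests, if you have a way of separating your kernel into horizontal and vertical passes (as can be done for a Gaussian blur kernel), you can get much better performance because of the dramatic reduction in texture reads. For your 16x16 kernel, you could drop from 256 reads to 32, and even those 32 would be faster, because they come from passes that each sample only 16 texels at a time.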
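To make the read-count arithmetic concrete, here is a small Python sketch of the two-pass idea. It assumes the 2-D kernel is exactly the outer product of a vertical and a horizontal tap vector, and that edges are clamped. All function names are hypothetical, and the code illustrates the math rather than a GPU implementation.

```python
def clamp(index, size):
    """Clamp an index into [0, size - 1]."""
    return min(max(index, 0), size - 1)


def convolve_rows(image, taps):
    """One 1-D pass along each row: len(taps) reads per pixel."""
    n, half = len(taps), len(taps) // 2
    width = len(image[0])
    return [[sum(row[clamp(x + k - half, width)] * taps[k] for k in range(n))
             for x in range(width)] for row in image]


def transpose(image):
    return [list(column) for column in zip(*image)]


def convolve_separable(image, vertical_taps, horizontal_taps):
    """Horizontal pass followed by a vertical pass."""
    horizontal = convolve_rows(image, horizontal_taps)
    return transpose(convolve_rows(transpose(horizontal), vertical_taps))


def convolve_full(image, kernel):
    """Single-pass 2-D version: len(kernel) ** 2 reads per pixel."""
    n, half = len(kernel), len(kernel) // 2
    height, width = len(image), len(image[0])
    return [[sum(image[clamp(y + r - half, height)][clamp(x + c - half, width)]
                 * kernel[r][c] for r in range(n) for c in range(n))
             for x in range(width)] for y in range(height)]


def reads_per_pixel(kernel_size):
    """Return (single-pass reads, two-pass reads) for an NxN kernel."""
    return kernel_size * kernel_size, 2 * kernel_size
```

`reads_per_pixel(16)` gives `(256, 32)`, matching the figures above. For a kernel built as `[[v * h for h in horizontal] for v in vertical]`, the two functions agree up to floating-point rounding.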

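The crossover point at which doing this with Accelerate on the CPU is faster than doing it in OpenGL ES will vary with the device you're running on. In general, the GPUs in iOS devices have outpaced the CPUs in performance growth over recent generations, so that bar has shifted farther toward the GPU side over the last several iOS models.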

Source

2013-10-14 15:13:28

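You can use Apple's Accelerate framework. It is available on both iOS and macOS, so you may be able to reuse your code later.

To get the best performance, you may want to consider the following options: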

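If your convolution kernel is separable, use a separable implementation. A kernel is separable when it can be written as the outer product of a column vector and a row vector, as is the case for a Gaussian kernel (symmetry alone is not sufficient). For a 16x16 kernel this cuts the work per pixel from 256 multiply-adds to 32, close to an order of magnitude in computation time;
If your image has power-of-two dimensions, consider using the FFT trick. Convolution in the spatial domain (on the order of N^2 operations when the kernel is as large as the N-sample image) is equivalent to an element-wise product in the Fourier domain (N operations). So you can 1) take the FFT of the image and of the kernel, zero-padded to the image size, 2) multiply the results element by element, and 3) take the inverse FFT of the product. Since FFT algorithms are fast (e.g. Apple's FFT in the Accelerate framework), this sequence of operations can yield a performance gain. Note that the result is a circular convolution, so pad the image if wrap-around at the borders matters.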
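The three FFT steps can be sketched in one dimension with nothing but the Python standard library. This is an illustration of the convolution theorem, not Accelerate code: the radix-2 `fft` below is a textbook recursive implementation written for this example, it requires a power-of-two length, and the function names are hypothetical. A 2-D image would apply the same idea along rows and then columns.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    The inverse transform is left unnormalized (the caller divides by n).
    """
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_convolve_fft(signal, kernel):
    """1) FFT both inputs, 2) multiply element-wise, 3) inverse FFT."""
    n = len(signal)
    padded = list(kernel) + [0.0] * (n - len(kernel))  # zero-pad the kernel
    product = [a * b for a, b in zip(fft(signal), fft(padded))]
    return [value.real / n for value in fft(product, inverse=True)]


def circular_convolve_direct(signal, kernel):
    """Reference implementation straight from the definition."""
    n = len(signal)
    return [sum(kernel[j] * signal[(i - j) % n] for j in range(len(kernel)))
            for i in range(n)]
```

Both functions should agree up to floating-point rounding, for instance on `signal = [1, 2, 3, 4, 0, 0, 0, 0]` and `kernel = [1, 1]`. The trailing zeros in the signal are what keep the circular wrap-around from contaminating the result.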

You can find more about optimizing image processing on iOS in this book, which I also reviewed here.

Source

2013-10-14 06:50:03 sansuiso
