Should I use SIMD or vector extensions or something else?

Problem Description

I'm currently developing an open source 3D application framework. My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignment, but more about that in a different question.

Some days ago I asked myself why I should write my own SSE code. The compiler is also able to generate highly optimized code when optimization is enabled. I can also use the "vector extensions" of GCC. But none of this is really portable.
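
For reference, a minimal sketch of what the GCC vector extensions look like (GCC/Clang-specific syntax, which is exactly the portability problem):

    // GCC/Clang only: a 16-byte vector of four floats
    typedef float v4sf __attribute__((vector_size(16)));

    v4sf v1 = {0.5f, 2, 4, 0.25f};
    v4sf v2 = {2, 0.5f, 0.25f, 4};
    v4sf res = v1 * v2;  // element-wise multiply; compiles to mulps on SSE targets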

I know that I have more control when I use my own SSE code, but often this control is unnecessary.

One big problem with SSE is its use of dynamic memory, which I keep as limited as possible with the help of memory pools and data-oriented design.

Now to my question:

  • Should I use naked SSE? Perhaps encapsulated.

    __m128 v1 = _mm_set_ps(0.5f, 2, 4, 0.25f);
    __m128 v2 = _mm_set_ps(2, 0.5f, 0.25f, 4);
    
    __m128 res = _mm_mul_ps(v1, v2);
    
  • Or should the compiler do the dirty work?

    float v1[4] = {0.5f, 2, 4, 0.25f};
    float v2[4] = {2, 0.5f, 0.25f, 4};
    
    float res[4];
    res[0] = v1[0]*v2[0];
    res[1] = v1[1]*v2[1];
    res[2] = v1[2]*v2[2];
    res[3] = v1[3]*v2[3];
    
  • Or should I use SIMD with additional code? Like a dynamic container class with SIMD operations, which needs additional load and store instructions.

    Pear3D::Vector4f* v1 = new Pear3D::Vector4f(0.5f, 2, 4, 0.25f);
    Pear3D::Vector4f* v2 = new Pear3D::Vector4f(2, 0.5f, 0.25f, 4);
    
    Pear3D::Vector4f res = Pear3D::Vector::multiplyElements(*v1, *v2);
    

    The above example uses an imaginary class that stores a float[4] internally and performs load and store in every method, such as multiplyElements(...). The methods use SSE internally; a minimal sketch of such a method follows below.
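
For illustration, here is a minimal sketch of what such a method might do internally (Pear3D::Vector4f and multiplyElements are the hypothetical names from above; the load/store pair in every call is exactly the per-operation overhead in question):

    #include <xmmintrin.h>

    namespace Pear3D
    {
        struct Vector4f
        {
            float data[4];
            Vector4f(float x, float y, float z, float w)
            {
                data[0] = x; data[1] = y; data[2] = z; data[3] = w;
            }
        };

        struct Vector
        {
            static Vector4f multiplyElements(const Vector4f &a, const Vector4f &b)
            {
                __m128 va = _mm_loadu_ps(a.data); // load: memory -> register
                __m128 vb = _mm_loadu_ps(b.data);
                __m128 vr = _mm_mul_ps(va, vb);   // the actual SIMD work
                Vector4f res(0, 0, 0, 0);
                _mm_storeu_ps(res.data, vr);      // store: register -> memory
                return res;
            }
        };
    }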

I don't want to use another library, because I want to learn more about SIMD and large-scale software design. But library examples are welcome.

PS: This is not a real problem, more of a design question.

Recommended Answer

Well, if you want to use SIMD extensions, a good approach is to use SSE intrinsics (and by all means stay away from inline assembly, but fortunately you didn't list it as an alternative anyway). But for cleanliness you should encapsulate them in a nice vector class with overloaded operators:

struct aligned_storage
{
    //overload new and delete for 16-byte alignment
};

class vec4 : public aligned_storage
{
public:
    vec4(float x, float y, float z, float w)
    {
         data_[0] = x; data_[1] = y; data_[2] = z; data_[3] = w; //don't use _mm_set_ps, it will do the same, followed by a _mm_load_ps, which is unnecessary
    }
    vec4(const float *data)
    {
         data_[0] = data[0]; data_[1] = data[1]; data_[2] = data[2]; data_[3] = data[3]; //don't use _mm_loadu_ps, unaligned just doesn't pay
    }
    vec4(const vec4 &rhs)
        : xmm_(rhs.xmm_)
    {
    }
    ...
    vec4& operator*=(const vec4 &v)
    {
         xmm_ = _mm_mul_ps(xmm_, v.xmm_);
         return *this;
    }
    ...

private:
    union
    {
        __m128 xmm_;
        float data_[4];
    };
};

Now the nice thing is that, due to the anonymous union (UB, I know, but show me a platform with SSE where this doesn't work), you can use the standard float array whenever necessary (like for operator[] or initialization, where you shouldn't use _mm_set_ps) and only use SSE when appropriate. With a modern inlining compiler the encapsulation probably comes at no cost (I was rather surprised how well VC10 optimized the SSE instructions for a bunch of computations with this vector class, with no fear of unnecessary moves into temporary memory variables, which VC8 seemed to like even without encapsulation).

The only disadvantage is that you need to take care of proper alignment, as unaligned vectors don't buy you anything and may even be slower than non-SSE code. But fortunately the alignment requirement of the __m128 will propagate into the vec4 (and any surrounding class), and you just need to take care of dynamic allocation, which C++ has good means for. You just need to make a base class whose operator new and operator delete functions (in all flavours, of course) are overloaded properly, and from which your vector class derives. To use your type with standard containers you of course also need to specialize std::allocator (and maybe std::get_temporary_buffer and std::return_temporary_buffer for the sake of completeness), as it will use the global operator new otherwise.
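
A minimal sketch of such a base class, assuming _mm_malloc/_mm_free for the aligned allocation (any aligned allocator, e.g. _aligned_malloc or posix_memalign, would do just as well):

#include <xmmintrin.h> // _mm_malloc / _mm_free
#include <cstddef>
#include <new>

struct aligned_storage
{
    static void* operator new(std::size_t size)
    {
        if (void *p = _mm_malloc(size, 16)) return p;
        throw std::bad_alloc();
    }
    static void* operator new[](std::size_t size)
    {
        if (void *p = _mm_malloc(size, 16)) return p;
        throw std::bad_alloc();
    }
    static void operator delete(void *p)   { _mm_free(p); }
    static void operator delete[](void *p) { _mm_free(p); }
};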

But the real disadvantage is that you also need to take care of the dynamic allocation of any class that has your SSE vector as a member, which may be tedious, but can again be automated a bit by also deriving those classes from aligned_storage and putting the whole std::allocator specialization mess into a handy macro.

JamesWynn has a point that those operations often come together in special heavy computation blocks (like texture filtering or vertex transformation), but on the other hand using those SSE vector encapsulations doesn't introduce any overhead over a standard float[4] implementation of a vector class. You need to get those values from memory into registers anyway (be it the x87 stack or a scalar SSE register) in order to do any computations, so why not take them all at once (which should IMHO not be any slower than moving a single value, if properly aligned) and compute in parallel. Thus you can freely switch out an SSE implementation for a non-SSE one without inducing any overhead (correct me if my reasoning is wrong).

But if ensuring alignment for all classes having vec4 as a member is too tedious for you (which is IMHO the only disadvantage of this approach), you can also define a specialized SSE vector type which you use for computations, and use a standard non-SSE vector for storage; a small sketch of that split follows.
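
For illustration, a minimal sketch of that split (the names are made up): the plain struct can live in containers and as a class member without any alignment fuss, while the SSE type exists only inside computation kernels:

#include <xmmintrin.h>

// storage type: no alignment requirement, safe as a member anywhere
struct vec4_storage
{
    float v[4];
};

// computation type: used only inside (and at the edges of) SSE kernels
struct vec4_sse
{
    __m128 xmm;
    explicit vec4_sse(const vec4_storage &s) : xmm(_mm_loadu_ps(s.v)) {}
    void store_to(vec4_storage &s) const { _mm_storeu_ps(s.v, xmm); }
};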


EDIT: Ok, to look at the overhead argument that goes around here (and looks quite reasonable at first), let's take a bunch of computations, which look very clean thanks to the overloaded operators:

#include "vec.h"
#include <iostream>

int main(int argc, char *argv[])
{
    math::vec<float,4> u, v, w = u + v;
    u = v + dot(v, w) * w;
    v = abs(u-w);
    u = 3.0f * w + v;
    w = -w * (u+v);
    v = min(u, w) + length(u) * w;
    std::cout << v << std::endl;
    return 0;
}

and see what VC10 thinks about it:

...
; 6   :     math::vec<float,4> u, v, w = u + v;

movaps  xmm4, XMMWORD PTR _v$[esp+32]

; 7   :     u = v + dot(v, w) * w;
; 8   :     v = abs(u-w);

movaps  xmm3, XMMWORD PTR __xmm@0
movaps  xmm1, xmm4
addps   xmm1, XMMWORD PTR _u$[esp+32]
movaps  xmm0, xmm4
mulps   xmm0, xmm1
haddps  xmm0, xmm0
haddps  xmm0, xmm0
shufps  xmm0, xmm0, 0
mulps   xmm0, xmm1
addps   xmm0, xmm4
subps   xmm0, xmm1
movaps  xmm2, xmm3

; 9   :     u = 3.0f * w + v;
; 10   :    w = -w * (u+v);

xorps   xmm3, xmm1
andnps  xmm2, xmm0
movaps  xmm0, XMMWORD PTR __xmm@1
mulps   xmm0, xmm1
addps   xmm0, xmm2

; 11   :    v = min(u, w) + length(u) * w;

movaps  xmm1, xmm0
mulps   xmm1, xmm0
haddps  xmm1, xmm1
haddps  xmm1, xmm1
sqrtss  xmm1, xmm1
addps   xmm2, xmm0
mulps   xmm3, xmm2
shufps  xmm1, xmm1, 0

; 12   :    std::cout << v << std::endl;

mov edi, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
mulps   xmm1, xmm3
minps   xmm0, xmm3
addps   xmm1, xmm0
movaps  XMMWORD PTR _v$[esp+32], xmm1
...

Even without thoroughly analyzing every single instruction and its use, I'm pretty confident to say that there aren't any unnecessary loads or stores, except the ones at the beginning (Ok, I left the variables uninitialized), which are necessary anyway to get them from memory into computing registers, and the one at the end, which is necessary because v is output in the following expression. It didn't even store anything back into u and w, since they are only temporary variables which I don't use any further. Everything is perfectly inlined and optimized out. It even managed to seamlessly shuffle the result of the dot product for the following multiplication, without it ever leaving the XMM register, although the dot function returns a float using an actual _mm_store_ss after the haddps.

So even I, usually being a bit over-suspicious of the compiler's abilities, have to say that handcrafting your own intrinsics into special functions doesn't really pay off compared to the clean and expressive code you gain by encapsulation. You may be able to create killer examples where handcrafting the intrinsics does indeed save you a few instructions, but then again you first have to outsmart the optimizer.


EDIT: Ok, Ben Voigt pointed out another problem with the union besides the (most probably not problematic) memory layout incompatibility: it violates strict aliasing rules, and the compiler may optimize instructions accessing different union members in a way that makes the code invalid. I hadn't thought about that yet. I don't know if it causes any problems in practice; it certainly needs investigation.

If it really is a problem, we unfortunately need to drop the data_[4] member and use the __m128 alone. For initialization we then have to resort to _mm_set_ps and _mm_loadu_ps again. The operator[] gets a bit more complicated and might need some combination of _mm_shuffle_ps and _mm_store_ss. For the non-const version you have to use some kind of proxy object delegating the assignment to the corresponding SSE instructions (a sketch follows below). It then has to be investigated how well the compiler can optimize this additional overhead in the specific situations.
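
A minimal sketch of such a proxy, assuming SSE-only storage and C++11; for simplicity it spills the register to an aligned temporary instead of using the _mm_shuffle_ps/_mm_store_ss combination mentioned above, which is simpler but possibly slower:

#include <xmmintrin.h>
#include <cstddef>

class vec4
{
    __m128 xmm_;

public:
    // const access: spill the register once and pick the requested lane
    float operator[](std::size_t i) const
    {
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, xmm_);
        return tmp[i];
    }

    // non-const access goes through a proxy that reloads the register
    class element_proxy
    {
        vec4 &v_;
        std::size_t i_;
    public:
        element_proxy(vec4 &v, std::size_t i) : v_(v), i_(i) {}
        element_proxy& operator=(float f)
        {
            alignas(16) float tmp[4];
            _mm_store_ps(tmp, v_.xmm_); // spill
            tmp[i_] = f;                // modify one lane
            v_.xmm_ = _mm_load_ps(tmp); // reload
            return *this;
        }
        operator float() const
        {
            alignas(16) float tmp[4];
            _mm_store_ps(tmp, v_.xmm_);
            return tmp[i_];
        }
    };

    element_proxy operator[](std::size_t i) { return element_proxy(*this, i); }
};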

Or you use the SSE vector only for computations and just make an interface for converting whole vectors to and from non-SSE vectors, which is then used at the periphery of the computations (as you often don't need to access individual components inside lengthy computations). This seems to be the way glm handles this issue. But I'm not sure how Eigen handles it.

But however you tackle it, there is still no need to handcraft SSE intrinsics without using the benefits of operator overloading.

Other Recommended Answer

I suggest that you learn about expression templates (custom operator implementations that use proxy objects). In this way, you can avoid doing performance-killing load/store around each individual operation, and do them only once for the entire computation.
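
A heavily simplified sketch of the idea (real expression-template libraries such as Eigen add far more machinery, and the unconstrained operator+ below would need SFINAE guards in real code): the operators build lightweight proxy objects instead of computing anything, and the whole expression tree is evaluated in a single pass when it is assigned to a vector.

#include <xmmintrin.h>

struct vec4
{
    __m128 xmm;
    __m128 eval() const { return xmm; } // leaf of the expression tree

    template<class Expr>
    vec4& operator=(const Expr &e) // one evaluation + one store for the whole tree
    {
        xmm = e.eval();
        return *this;
    }
};

// proxy representing "l + r"; nothing is computed until eval() is called
template<class L, class R>
struct add_expr
{
    const L &l;
    const R &r;
    __m128 eval() const { return _mm_add_ps(l.eval(), r.eval()); }
};

template<class L, class R>
add_expr<L, R> operator+(const L &l, const R &r)
{
    return add_expr<L, R>{l, r};
}

// usage: given vec4 a, b, c, d, the statement "d = a + b + c;" builds
// add_expr<add_expr<vec4, vec4>, vec4> and evaluates it with two addps
// instructions and a single store into d.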

Other Recommended Answer

I would suggest using the naked SIMD code in a tightly controlled function. Since you won't be using it for your primary vector multiplication because of the overhead, this function should probably take the list of Vector3 objects that need to be manipulated, as per DOD. Where there's one, there are many.
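
For illustration, a minimal sketch of such a batch function under a data-oriented, structure-of-arrays layout (the layout and names are assumptions; the point is that loads and stores are paid once per chunk of data rather than once per operation):

#include <xmmintrin.h>
#include <cstddef>

// scale n Vector3s stored as separate x/y/z arrays (each 16-byte aligned)
void scale_vectors(float *x, float *y, float *z, std::size_t n, float s)
{
    const __m128 vs = _mm_set1_ps(s);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) // four vectors per iteration
    {
        _mm_store_ps(x + i, _mm_mul_ps(_mm_load_ps(x + i), vs));
        _mm_store_ps(y + i, _mm_mul_ps(_mm_load_ps(y + i), vs));
        _mm_store_ps(z + i, _mm_mul_ps(_mm_load_ps(z + i), vs));
    }
    for (; i < n; ++i) // scalar tail
    {
        x[i] *= s;
        y[i] *= s;
        z[i] *= s;
    }
}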