Will this sine approximation be faster than a shader CG sine function?



Problem description

I have some functions that are not really sines, but they are a lot quicker than conventional processing; they are simple parabolic functions.

Will this be faster on a graphics processor than the built-in graphics sin function:

    float par(float xx) {                           // parabola-based sine approximation
        half xd = fmod(abs(xx), 2.4) - 1.2;         // offset within a half-period of 2.4
        if (fmod(abs(xx), 4.8) > 2.4) { xd = (-xd * xd) + 2.88; }  // second half: inverted parabola
        else { xd = xd * xd; }                      // first half: upright parabola
        xd = -xd * 0.694444444 + 1;                 // rescale (1/1.44) so the curve spans [-1, 1]
        if (xx < 0) { xd = -xd; }                   // odd symmetry, like sin
        return xd;
    }
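
A usage sketch, not part of the original question (the entry point, inputs and scaling below are illustrative): the function is meant to be called from a Cg pixel shader exactly where sin would be, with the caveat that par() repeats every 4.8 input units rather than every 2π, so the argument has to be rescaled if a drop-in sin replacement is wanted.

    // Hypothetical Cg pixel shader using par() (defined above) in place of sin().
    // The phase is scaled by 4.8 / (2*pi) so par() lines up with sin()'s period.
    float4 wavePS(float2 uv : TEXCOORD0, uniform float time) : COLOR {
        float s = par((uv.x * 8.0 + time) * (4.8 / 6.2831853));
        return float4(s * 0.5 + 0.5, 0.0, 0.0, 1.0);   // map [-1, 1] into [0, 1]
    }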

Recommended answer

MAIN ANSWER

There is absolutely no way your function will be faster than the built-in sin/cos functions on any graphics card.

The shader instructions sin, cos & tan are single-cycle instructions on just about EVERY graphics card ever manufactured. You certainly cannot purchase a graphics card today on which they are not single-cycle.

To put your question in perspective - on a graphics card, it takes the same time to multiply 2 numbers (the mul instruction) as it does to get the sine (the sin instruction) - a single GPU cycle.
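
A tiny illustration of that point (a sketch, not from the original answer): in Cg/HLSL source, each of the statements below maps to one arithmetic instruction, so on the hardware the answer describes each costs a single GPU cycle.

    // Sketch: the multiply and the sine each compile to one instruction
    // (mul and sin), so replacing sin() with extra arithmetic cannot win.
    float costCompare(float x, float y) {
        float product = x * y;    // mul - one instruction
        float sine    = sin(x);   // sin - one instruction
        return product + sine;    // add - one instruction
    }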

When writing your shaders, have a look at the command-line options for your compiler. There will be options to output the generated assembly code, and most compilers even provide totals for the shortest path (number of instructions and cycles) and the longest path. These totals are not guaranteed durations, because things like fetches can stall a pipeline, but they answer the type of question you are now asking.
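
For example (hypothetical file and entry-point names; check your toolchain's documentation for the exact switches), Microsoft's fxc compiler for HLSL writes an assembly listing with /Fc, and NVIDIA's cgc compiler for Cg prints its generated assembly to standard output:

    fxc /T ps_3_0 /E mainPS /Fc listing.asm myshader.hlsl
    cgc -profile arbfp1 -entry mainPS myshader.cg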

Shader instructions do vary from card to card, but I think the longest single instruction is 4 GPU cycles.

If you take a look at the shader compiler's assembly output for your function, you will see that you are calling lots of instructions and using lots of cycles, and then asking whether it could be executed more quickly than a single-cycle instruction.

The whole purpose of Graphics Chips is that they are very fast and very parallel at running their instruction sets (however complex those instructions may be on other processors). When programming shaders, focus your code on what the processor is designed to do. Shader programming is a different mindset from the programming you do elsewhere in software development, but once you start thinking about counting cycles and minimizing fetch stalls, you'll soon start to unlock the true power of shader processing.

Best of luck.

Other recommended answer

SUPPLEMENTAL CONCEPTUAL HELP

Before I begin, I should explain that I do not, and have never, worked for a GPU manufacturer. Some of what I say below may be factually wrong, but it is how I understand it as a programmer.

Below is an image of a modern GPU. It shows 8 general-purpose pipes, each containing 8 queues, so the chip can process 64 single-instruction operations per clock cycle.

Old GPUs had a fixed, non-programmable pipeline, and we are not really interested in those. Middle-generation GPUs had specific pipes to run vertex programs and different pipes for pixel shading. Modern GPUs have general-purpose pipes that can run any type of program (including tessellation, compute, etc.).

The arbitration and allocation probes decide which pipes should run which programs, and what inputs should be sent to them, so that as much of the processor as possible is being used each cycle. As programmers we have nothing to do with these, so this is a total black box to me.

We are writing the programs that control the pipes. So imagine the AA probe has decided to use pipe0 as a pixel shader (I assume your program is doing something with colour, since you are not worried about rounding, which would cause verts to jump about). It will then pick 8 pixels that require the same program (see texture) and load them into the process buffers. All 8 pixels are then run in parallel, one instruction at a time, until the program is completed, and the pipe is given back to the AA probe to be given a new job. If there are fewer than 8 pixels that need that program, the pipe is run with some of the process buffers empty and the chip is underutilized. There isn't much you can do about this, but it is why zooming out to single-pixel objects, all with different textures, across your screen kills the GPU.

So in one cycle one computational pipe can do 8 muls for 8 pixels, or 8 sins for 8 pixels, but it has to run every instruction for every pixel linearly; that is the reason if statements are so costly for shader programs. Pixels that pass the condition are processed, while pixels that fail still have to wait out those cycles while the passing pixels are processed.
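
To illustrate why that matters for the function in the question (a sketch, assuming the standard Cg step(), lerp() and sign() intrinsics are available in your profile), the same parabola selection can be written without if/else so that every pixel runs the identical instruction stream:

    // Branch-free variant of the questioner's par(): pick between the two
    // parabola halves with step()/lerp() instead of if/else, and restore the
    // sign with sign() instead of a second branch.
    float par_branchless(float xx) {
        float ax     = abs(xx);
        float xd     = fmod(ax, 2.4) - 1.2;
        float first  = xd * xd;                 // first half of the period
        float second = (-xd * xd) + 2.88;       // second half of the period
        xd = lerp(first, second, step(2.4, fmod(ax, 4.8)));
        xd = -xd * 0.694444444 + 1.0;
        return xd * sign(xx);                   // negative inputs mirror the curve, as in the original
    }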

Obviously, every place I have said pixel, it could be a vert, or a CU element.

The only other thing that I can think to mention here is precision. When you lower the precision, it allows a processing buffer to be packed more densely. So if you are using half precision everywhere, instead of the GPU processing 64 numbers per cycle it can do 128, and so on.
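
As a small sketch of what that looks like in source (a hypothetical helper, assuming a profile where half is a real 16-bit type rather than an alias for float), every value below is declared half so the processing buffers can be packed twice as densely:

    // Hypothetical half-precision lighting helper: all arithmetic stays in half.
    half4 halfPrecisionShade(half3 normal, half3 lightDir) : COLOR {
        half ndl = max(dot(normal, lightDir), 0.0h);   // half-precision dot product
        return half4(ndl, ndl, ndl, 1.0h);
    }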

That's roughly how a GPU works. I certainly found that understanding the architecture explained a lot about why shader programs are the way they are. (Image: Architecture of a modern Graphics Chip.)