How can I speed up my math operations in VHDL?

Question

I currently have some calculations running on the rising edge of a 75 MHz pixel clock to output 720p video on screen. Some of the math (such as a few modulo operations) takes too long (20+ ns, whereas the 75 MHz period is 13.3 ns), so my timing constraints are not met. I'm new to FPGAs, but I'm wondering whether there is a way to run the calculations faster than the current pixel clock so that they are completed by the next tick of the 75 MHz clock. I'm using VHDL, by the way.

Recommended answer

Here are some techniques:

  • Pipelining - split the logic up so that it operates over multiple clock cycles.
  • Multi-cycle path - if you don't need the answer every cycle, you can tell the tools that it's OK for the logic to take longer. Take care not to tell the tools the wrong thing, though!
  • Think again - for example, do you really need to do x mod 3 on a very wide x, or could you use a continuously updated modulo-3 counter?
  • Use better tools - I've had instances where a deep logic path met timing with an expensive synthesizer, while the same code failed timing with the vendor's synthesizer.

More extreme solutions involve changing the silicon, for a faster device, or a newer device, or a newer, faster device.

Other recommended answer

75 MHz is already quite slow by today's FPGA standards.

The problem is the modulo operation, which effectively involves division; and division is slow.

Think carefully about the operations you need, and whether there is any way to reorganise the computation. If you are clocking pixels, it's not as if you have 32-bit integers to deal with; restricted value ranges are easier to handle.

Martin hinted at one option: strength reduction. If you have 1280 pixels/line and need to operate on every third one, you don't need to compute 1280 mod 3! Count 0,1,2,0,... instead.
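As a minimal sketch of that counting approach, the following keeps a running column index modulo 3 instead of dividing. The entity name and the `line_start` and `pixel_en` ports are hypothetical; they stand in for whatever your video timing generator provides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity pixel_mod3 is
    port (
        clk        : in  std_logic;
        line_start : in  std_logic;            -- hypothetical: high for one clock before each line
        pixel_en   : in  std_logic;            -- hypothetical: high while pixels are being output
        x_mod3     : out integer range 0 to 2  -- column index modulo 3
    );
end entity;

architecture rtl of pixel_mod3 is
    signal count : integer range 0 to 2 := 0;
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if line_start = '1' then
                count <= 0;                    -- restart at the first pixel of a line
            elsif pixel_en = '1' then
                if count = 2 then
                    count <= 0;                -- wrap: 0, 1, 2, 0, ...
                else
                    count <= count + 1;
                end if;
            end if;
        end if;
    end process;

    x_mod3 <= count;
end architecture;
```

The only logic between registers is a 2-bit compare and increment, so this should be far from the critical path at 75 MHz.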

Another option, if you need modulo 3 of an 8-bit (or 12-bit) number, is to store all possible values in a lookup table, which will be fast enough.

Or sometimes you can multiply by 1/3 (X"5555") instead of dividing by 3, then multiply the quotient by 3 (which is a single addition) and subtract it from the input to get the modulo. This pipelines really well, but since X"5555" is only an approximation to 1/3, you need to verify in simulation that it delivers the correct output for every input. (For 16-bit inputs, this isn't a big simulation!) The extension to modulo 9 is easy.
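A sketch of the lookup-table idea for an 8-bit input might look like this. The entity and signal names are made up for illustration. The `mod` inside `init_lut` is evaluated once at elaboration to fill the constant table, so it should not turn into divider hardware.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mod3_lut is
    port (
        clk : in  std_logic;
        x   : in  unsigned(7 downto 0);
        r   : out unsigned(1 downto 0)   -- x mod 3, one clock after x
    );
end entity;

architecture rtl of mod3_lut is
    type lut_t is array (0 to 255) of unsigned(1 downto 0);

    -- Fill the table at elaboration time; no run-time division is involved.
    function init_lut return lut_t is
        variable t : lut_t;
    begin
        for i in lut_t'range loop
            t(i) := to_unsigned(i mod 3, 2);
        end loop;
        return t;
    end function;

    constant LUT : lut_t := init_lut;
begin
    process (clk)
    begin
        if rising_edge(clk) then
            r <= LUT(to_integer(x));     -- registered table read
        end if;
    end process;
end architecture;
```

For a 12-bit input the same pattern applies with a 4096-entry table, which a synthesizer would typically map to block RAM.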
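The reciprocal trick can be sketched as a three-stage pipeline for a 16-bit unsigned input. One substitution of mine: with X"5555" and a plain 16-bit shift, x = 3 gives 3 * 21845 = 65535, so the quotient comes out as 0 rather than 1. The sketch therefore uses the rounded-up constant X"AAAB" (43691, where 3 * 43691 = 2**17 + 1) with a 17-bit shift. The entity and signal names are hypothetical, and the advice to check every input in simulation still applies.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mod3_recip is
    port (
        clk : in  std_logic;
        x   : in  unsigned(15 downto 0);
        r   : out unsigned(1 downto 0)   -- x mod 3, valid three clocks after x
    );
end entity;

architecture rtl of mod3_recip is
    -- Assumption: scaled reciprocal (2**17 + 1) / 3, not the X"5555" from the text.
    constant RECIP : unsigned(15 downto 0) := X"AAAB";

    signal prod       : unsigned(31 downto 0) := (others => '0');
    signal q3         : unsigned(15 downto 0) := (others => '0');
    signal x_d1, x_d2 : unsigned(15 downto 0) := (others => '0');
begin
    process (clk)
        variable q : unsigned(15 downto 0);
    begin
        if rising_edge(clk) then
            -- cycle 1: multiply by the scaled reciprocal, keep a copy of x
            prod <= x * RECIP;
            x_d1 <= x;
            -- cycle 2: q = prod / 2**17, then 3*q = q + 2*q (one addition)
            q    := resize(prod(31 downto 17), 16);
            q3   <= q + shift_left(q, 1);
            x_d2 <= x_d1;
            -- cycle 3: remainder = x - 3*q, which fits in two bits
            r    <= resize(x_d2 - q3, 2);
        end if;
    end process;
end architecture;
```

The delayed copies `x_d1` and `x_d2` keep the original input aligned with the quotient as it moves through the pipeline, in the same way as the copy registers discussed later in this answer.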

EDIT:

Two points from your comments: another option you have is to create a 2x clock (150 MHz) using the Spartan's clock generators, which gives you 2 cycles per pixel. Well-pipelined code should meet 150 MHz without much trouble.

How not to pipeline!

PROCESS(Clk)
BEGIN
    if(rising_edge(Clk)) then
        for i in 0 to 2 loop
            case i is
                when 0 => temp1 <= a*data;
                when 1 => temp2 <= temp1*b;
                when 2 => result <= temp2*c;
                when others => null;
            end case;
        end loop;
    end if;
END PROCESS;

The first thing to realise is that the loop and case statement cancel each other out, so this simplifies to

PROCESS(Clk)
BEGIN
    if rising_edge(Clk) then
        temp1 <= a*data;
        temp2 <= temp1*b;
        result <= temp2*c;
    end if;
END PROCESS;

which is buggy! The testbench is also buggy, which hides the problem.

In cycle 1, data, a, b and c are presented, and temp1 = data*a is computed.
In cycle 2, temp1 is multiplied by a NEW value of b instead of the one that was presented along with data and a!
The same happens again with c in cycle 3!

Since the testbench sets the inputs once and leaves them constant, it won't catch the problem!

PROCESS(Clk)
BEGIN
    if rising_edge(Clk) then
        -- cycle 1
        temp1   <= a*data;
        b_copy  <= b;
        c_copy1 <= c;
        -- cycle 2
        temp2   <= temp1*b_copy;
        c_copy2 <= c_copy1;
        -- cycle 3
        result  <= temp2*c_copy2;
    end if;
END PROCESS;

I like to comment each cycle; every term I use in a cycle must come from the immediately preceding cycle, either by calculation or from a copy.
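To catch the bug described above, the stimulus has to change on every clock. The following is a minimal self-checking testbench sketch, not the original poster's testbench. It assumes integer signals, inlines the three-stage process with copy registers, and uses an arbitrary input pattern (a = i, b = i + 1, c = i + 2, data = i + 3) chosen only for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity pipe_tb is
end entity;

architecture sim of pipe_tb is
    signal clk  : std_logic := '0';
    signal done : boolean   := false;

    signal a, b, c, data            : integer := 0;
    signal temp1, temp2, result     : integer := 0;
    signal b_copy, c_copy1, c_copy2 : integer := 0;

    -- Expected product for sample i, given the stimulus pattern below.
    function expected(i : integer) return integer is
    begin
        return i * (i + 1) * (i + 2) * (i + 3);
    end function;
begin
    -- Arbitrary 100 MHz simulation clock; stops once the stimulus is done.
    clk <= not clk after 5 ns when not done else '0';

    -- Pipeline under test: the three-stage version with copy registers.
    process (clk)
    begin
        if rising_edge(clk) then
            temp1   <= a * data;
            b_copy  <= b;
            c_copy1 <= c;
            temp2   <= temp1 * b_copy;
            c_copy2 <= c_copy1;
            result  <= temp2 * c_copy2;
        end if;
    end process;

    stimulus : process
    begin
        for i in 1 to 11 loop
            -- New inputs on every clock, so stale operands show up as errors.
            a    <= i;
            b    <= i + 1;
            c    <= i + 2;
            data <= i + 3;
            wait until rising_edge(clk);
            -- At edge i, result still holds the value registered at edge i-1,
            -- which belongs to sample i-3.
            if i > 3 then
                assert result = expected(i - 3)
                    report "pipeline mismatch for sample " & integer'image(i - 3)
                    severity error;
            end if;
        end loop;
        done <= true;
        wait;
    end process;
end architecture;
```

Swapping the process under test for the version without copy registers should make the assertions fire, because b and c then come from later samples than a and data.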

At least this works, but it could be reduced to a depth of 2 cycles with fewer copy registers, because in this example the four inputs are independent (and I am assuming no measures are required to avoid overflow). So:

PROCESS(Clk)
BEGIN
    if rising_edge(Clk) then
        -- cycle 1
        temp1   <= a * data;
        temp2   <= b * c;
        -- cycle 2
        result  <= temp1 * temp2;
    end if;
END PROCESS;

Other recommended answer

Complex math operations in FPGAs are usually pipelined. Pipelining means you divide your operation into stages. Let's say you have a multiplier which takes too long for your clock speed. You divide the multiplier into 3 stages, so it consists of three separate clocked parts chained one after another. Each of these parts is smaller than the original multiplier, so each has a smaller delay, and you can therefore clock them faster.

The drawback is latency: a pipelined system delivers its output some clock cycles after the input. In the multiplier example above, you have to wait until the input has passed through all 3 stages before the correct output appears. But this delay is usually very small (depending on your design, of course) and can often be ignored.
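Where the latency cannot simply be ignored, for example when another signal has to stay aligned with the pipelined result, one common approach is to delay that signal by the same number of clocks. A minimal sketch follows; the entity name, the `DEPTH` generic and the `sync_in` / `sync_out` ports are all hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity sync_delay is
    generic (
        DEPTH : natural range 2 to 64 := 3   -- assumed pipeline latency in clocks
    );
    port (
        clk      : in  std_logic;
        sync_in  : in  std_logic;            -- signal to keep aligned with the data
        sync_out : out std_logic             -- sync_in delayed by DEPTH clocks
    );
end entity;

architecture rtl of sync_delay is
    signal sr : std_logic_vector(DEPTH - 1 downto 0) := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- shift register: the newest sample enters at bit 0
            sr <= sr(DEPTH - 2 downto 0) & sync_in;
        end if;
    end process;

    sync_out <= sr(DEPTH - 1);
end architecture;
```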

Here is a good (!) post about this: http://vhdlguru.blogspot.com/2011/01/what-is-pipelining-explanation-with.html EDIT: See Brian's post instead.

Also, vendors usually ship optimized and pipelined versions of math operations as IP cores in their design software. Look for them.