Time measuring in PyOpenCL

Question

I am running a kernel using PyOpenCL on an FPGA and on a GPU. To measure the time it takes to execute, I use:

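from time import time

# queue must be created with profiling enabled for event.profile to work
t1 = time()
event = mykernel(queue, (c_width, c_height), (block_size, block_size), d_c_buf, d_a_buf, d_b_buf, a_width, b_width)
event.wait()
t2 = time()

compute_time = t2 - t1  # host wall-clock time [s]
compute_time_e = (event.profile.end - event.profile.start) * 1e-9  # device time, ns -> s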

This gives me the execution time from the point of view of the host (compute_time) and of the device (compute_time_e). The problem is that these values are very different:

compute (host-timed) [s]: 0.0009386539459228516
compute (event-timed) [s]:  9.4528e-05

Does anyone know what the reason for this difference could be? And more importantly, which one is more accurate?

Thank you.

Recommended answer

Both of those numbers look right to me. If I am reading this correctly, the host measures about 10x the device time, which is not especially strange for a small kernel because the host measurement includes transfer latency. Your host time includes communication across the PCB to the device, whereas your device time measures only the on-chip operation.

I think your program timing breaks down like this:

  • Kernel Execution Time: 0.1ms // event-timed
  • Transfer Time: 0.8ms // (host-timed - event-timed)
  • Total Time: 0.9ms // host-timed
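
The breakdown above can be reproduced from the two measurements in the question. This is only a sketch of the arithmetic: the function name `split_timing` is made up for illustration, and the "overhead" it computes is simply everything the host clock saw that the event profile did not, without attributing it to any specific cause.

```python
host_timed_s = 0.0009386539459228516  # wall-clock time from the question
event_timed_s = 9.4528e-05            # event.profile.end - event.profile.start, in seconds


def split_timing(host_s, event_s):
    """Return (overhead in seconds, host/event ratio) for one kernel launch."""
    overhead_s = host_s - event_s  # time seen by the host but not by the device
    return overhead_s, host_s / event_s


overhead_s, ratio = split_timing(host_timed_s, event_timed_s)
print(f"kernel:   {event_timed_s * 1e3:.3f} ms")
print(f"overhead: {overhead_s * 1e3:.3f} ms")
print(f"total:    {host_timed_s * 1e3:.3f} ms ({ratio:.1f}x the kernel time)")
```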

If you are curious about the situation, try running a kernel that takes much longer on the device. You should start to see these numbers match up much more closely as the fixed transfer time becomes a smaller fraction of the overall time.

For example:

  • Kernel Execution Time: 900ms
  • Transfer Time: 0.8ms
  • Total Time: 900.8ms
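
To see how quickly the two measurements converge, one can model the overhead as a constant and vary the kernel duration. The 0.8 ms figure is the estimate from this answer, and treating it as fixed is an assumption; on real hardware it will vary from run to run and from device to device.

```python
FIXED_OVERHEAD_MS = 0.8  # assumed constant, taken from the estimate in this answer


def overhead_fraction(kernel_ms, overhead_ms=FIXED_OVERHEAD_MS):
    """Fraction of the host-timed total that is not kernel execution."""
    return overhead_ms / (kernel_ms + overhead_ms)


for kernel_ms in (0.1, 1.0, 10.0, 100.0, 900.0):
    total_ms = kernel_ms + FIXED_OVERHEAD_MS
    share = overhead_fraction(kernel_ms)
    print(f"kernel {kernel_ms:7.1f} ms  total {total_ms:7.1f} ms  overhead {share:7.2%}")
```

With a 900 ms kernel the overhead is well under 1% of the total, so the host-timed and event-timed values would be nearly identical.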

Other recommended answer

You can learn quite a lot from Intel's site on OpenCL. It states that event.profile only gives an indication of the pure hardware execution time of the kernel and leaves out the data transfer times (which are included in your first measurement). Therefore the host-side wall-clock time might return different results. However, it also states that if you target the CPU as the OpenCL device, the time difference should become lower (or even negligible).
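
Besides start and end, a PyOpenCL event profile also exposes queued and submit timestamps, which can show where some of the extra time goes before the kernel starts. The sketch below assumes the queue was created with profiling enabled; `profile_breakdown` is a hypothetical helper, and the timestamps in the demo are invented stand-ins for a real `event.profile` (only the execution span is chosen to match the question). Host-side costs such as the Python call itself and waking up from `event.wait()` are not visible in any of these values.

```python
from types import SimpleNamespace


def profile_breakdown(profile):
    """Split an event's lifetime into phases, in seconds.

    `profile` is any object with queued/submit/start/end attributes in
    nanoseconds, such as the `event.profile` of a PyOpenCL event.
    """
    ns = 1e-9
    return {
        "queued_to_submit": (profile.submit - profile.queued) * ns,
        "submit_to_start": (profile.start - profile.submit) * ns,
        "execution": (profile.end - profile.start) * ns,
    }


# Hypothetical timestamps (ns) standing in for a real event.profile.
fake_profile = SimpleNamespace(queued=1_000, submit=301_000, start=751_000, end=845_528)

for phase, seconds in profile_breakdown(fake_profile).items():
    print(f"{phase:>17}: {seconds * 1e6:8.1f} us")
```

With a real event, pass `event.profile` after `event.wait()` has returned.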