
Understanding Shared Memory Use For Improvement In Numba

I'm trying to learn more about the use of shared memory to improve performance in some CUDA kernels in Numba. For this, I was looking at the matrix multiplication example in the Numba documentation.

Solution 1:

I made a performance mistake in the code I put in that other answer. I've now fixed it. In a nutshell, this line:

tmp = 0.

caused Numba to create a 64-bit floating-point variable tmp. That in turn promoted other arithmetic in the kernel from 32-bit to 64-bit floating point, which is inconsistent with both the rest of the arithmetic and the intent of the demonstration in the other answer. This error affects both kernels.

When I change it in both kernels to

tmp = float32(0.)

both kernels get noticeably faster, and on my GTX 960 GPU, your test case shows the shared-memory code running about 2x faster than the non-shared code (but see below). Note that float32 must be imported from numba for the typed literal to work.
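For reference, here is a minimal sketch of the shared-memory kernel with the 32-bit accumulator. It follows the general shape of the matrix multiplication example in the Numba documentation; the tile width TPB, and the assumption that the matrix dimensions are exact multiples of TPB (so no bounds check is needed), are simplifications of mine, not the exact code from the other answer:

from numba import cuda, float32

TPB = 16  # tile width = threads per block in each dimension (an assumed value)

@cuda.jit
def fast_matmul(A, B, C):
    # Per-block tiles of A and B staged in shared memory.
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    # Typed 32-bit accumulator: a plain `tmp = 0.` would be float64
    # and would promote all the arithmetic below.
    tmp = float32(0.)
    for i in range(A.shape[1] // TPB):  # assumes dims are multiples of TPB
        # Each thread loads one element of each tile.
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]
        cuda.syncthreads()  # wait until the whole tile is loaded
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]
        cuda.syncthreads()  # wait before the tile is overwritten
    C[x, y] = tmp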

The non-shared kernel also has a performance issue related to its memory access patterns. Similar to the index swap in that other answer, and for this particular scenario only, we can fix this simply by reversing the assigned indices:

j, i = cuda.grid(2)

in the non-shared kernel. This allows that kernel to perform roughly as well as it can, and with that change the shared kernel still runs about 2x faster than the non-shared kernel. Without that additional change, the non-shared kernel's performance is much worse. A sketch follows.
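To illustrate, here is a sketch of the non-shared kernel with the reversed indices. The kernel body is my reconstruction of a typical naive matmul kernel, not the asker's exact code:

from numba import cuda, float32

@cuda.jit
def matmul(A, B, C):
    # Reversed: j comes from the x grid dimension, which varies fastest
    # across adjacent threads, so neighboring threads read adjacent
    # elements of B[k, j] and write adjacent elements of C[i, j]
    # (coalesced global memory access).
    j, i = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = float32(0.)
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

With the usual i, j = cuda.grid(2) assignment, the fastest-varying thread index walks down the columns of the row-major C array, producing strided, uncoalesced global memory access.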
