Benchmarking GPUs to Tune Dense Linear Algebra

  • Panos: Try to fit everything in registers. Compiler might spill to local memory, which is very slow. This should be avoided. It's a trial & error process to get it right.
  • Occupancy corresponds to the number of warps used out of the total number of available warps. Higher occupancy doesn't necessarily mean better performance. Shared memory access greatly affects performance.
  • CPU-GPU transfers also affect performance. As expected, larger problem sizes suffer less from transfer overheads and lead to a better speedup. (e.g. Fig. 8)
  • Panos: Latest nVidia GPUs have doubled the number of registers.