CS501 GDB Solution Feb 2015

GDB Topic:

“The memory subsystem is one of the most important components of a computer, and the overall speed of the computer can be improved by improving the performance of the memory subsystem. The performance of the memory subsystem depends on two characteristics: bandwidth and latency. However, there are often subtle tradeoffs between latency and bandwidth, as placing more emphasis on one characteristic can adversely affect the other.”

As a computer architect, you want to design a memory subsystem for a Graphics Processing Unit (GPU). Which of the above characteristics would you focus on more while designing it, and why?


Designing GPU-Accelerated Applications

GPUs hold tremendous performance potential and have already shown impressive speedups in many applications. Yet realizing that potential can be challenging. In this section we offer a few simple guidelines to help design GPU-accelerated applications.

Accelerate performance hotspots. Usually GPUs are used to run only a subset of application code. It is thus instructive to estimate the contribution π of the code intended for acceleration to the application execution time, to guarantee that making it faster will have a tangible effect on the overall performance. Following Amdahl’s law, the maximum application speedup is limited to 1/(1 − π), so porting to a GPU a routine consuming 99 percent of the total execution time (π = 0.99), for example, has 10x higher acceleration potential than porting a routine consuming only 90 percent (a bound of 100x versus 10x).
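The Amdahl bound 1/(1 − π) can be checked numerically. The following is a minimal sketch rather than part of the original text; the function names, and the finite-speedup variant with a kernel speedup factor s, are illustrative assumptions.

```python
import math


def amdahl_max_speedup(pi: float) -> float:
    """Upper bound on application speedup when a fraction `pi` of the
    run time is accelerated without limit (Amdahl's law)."""
    if not 0.0 <= pi < 1.0:
        raise ValueError("pi must be in [0, 1)")
    return 1.0 / (1.0 - pi)


def amdahl_speedup(pi: float, s: float) -> float:
    """Overall speedup when the accelerated fraction `pi` runs `s` times
    faster (`s` is an assumed kernel speedup, not a value from the text)."""
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - pi) + pi / s)


if __name__ == "__main__":
    for pi in (0.90, 0.99):
        bound = amdahl_max_speedup(pi)
        print(f"pi = {pi:.2f}: speedup bound = {bound:.0f}x")
```

With π = 0.90 the bound is 10x and with π = 0.99 it is 100x, which is the tenfold difference in potential mentioned above.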

Move more computations to GPUs. GPUs are not as efficient as CPUs when running control-flow-intensive workloads. However, longer kernels with some control flow ported to a GPU are often preferable to shorter kernels controlled from CPU code. In iterative algorithms, for example, offloading to the GPU only the computations performed in each iteration may be less efficient than moving the whole loop into the kernel. The latter design hides GPU kernel invocation overheads, avoids thread synchronization effects when the kernel terminates, and allows intermediate data to be kept in hardware caches.
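The invocation-overhead part of this argument can be illustrated with a simple cost model. This is a rough sketch under stated assumptions: the launch overhead and per-iteration work times below are hypothetical numbers chosen for illustration, and the model ignores the synchronization and cache effects mentioned above.

```python
def time_launch_per_iteration(iterations: int, work_s: float, launch_s: float) -> float:
    """One kernel launch per iteration: the launch overhead is paid every time."""
    return iterations * (work_s + launch_s)


def time_single_launch(iterations: int, work_s: float, launch_s: float) -> float:
    """Whole loop inside one kernel: the launch overhead is paid once."""
    return launch_s + iterations * work_s


if __name__ == "__main__":
    # Hypothetical values: 1000 iterations, 5 microseconds of GPU work
    # per iteration, 10 microseconds of overhead per kernel launch.
    n, work, launch = 1000, 5e-6, 10e-6
    per_iter = time_launch_per_iteration(n, work, launch)
    single = time_single_launch(n, work, launch)
    print(f"launch per iteration: {per_iter * 1e3:.2f} ms")
    print(f"single launch:        {single * 1e3:.2f} ms")
```

In this model the gap grows as the work per iteration shrinks relative to the launch overhead, which is when moving the loop into the kernel matters most.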

Stay inside GPU memory. The memory capacity of discrete GPUs reaches 12 GB in high-end systems. However, if the working set of a GPU kernel is too large, as is often the case for stream processing applications, performance becomes limited by the throughput of CPU-GPU data transfers. In such applications, a performance improvement on GPUs is possible for compute-intensive workloads with a high compute-to-memory access ratio.
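One way to see why a high compute-to-memory access ratio is needed here is to compare the rate at which the CPU-GPU link can feed the kernel against the GPU's arithmetic rate. The sketch below assumes transfers overlap fully with computation; the link bandwidth and GPU throughput used in the example are placeholder values, not figures from the text.

```python
def streaming_bound(ops_per_byte: float, link_gb_s: float, gpu_gflops: float):
    """Return (achievable GFLOP/s, limiting factor) for a kernel whose
    working set must be streamed over the CPU-GPU link.

    Assumes data transfers are fully overlapped with computation."""
    transfer_limited = ops_per_byte * link_gb_s  # GFLOP/s the link can sustain
    if transfer_limited < gpu_gflops:
        return transfer_limited, "transfer"
    return gpu_gflops, "compute"


if __name__ == "__main__":
    # Placeholder hardware numbers, for illustration only.
    link_gb_s, gpu_gflops = 10.0, 1000.0
    for ops_per_byte in (1.0, 50.0, 200.0):
        rate, limit = streaming_bound(ops_per_byte, link_gb_s, gpu_gflops)
        print(f"{ops_per_byte:6.1f} ops/byte -> {rate:7.1f} GFLOP/s ({limit}-bound)")
```

Under these assumptions the kernel stops being transfer-bound only once its ops-per-byte ratio reaches the GPU throughput divided by the link bandwidth.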

Conform to hardware parallelism hierarchy. Ample parallelism in the computing algorithm is an obvious requirement for successful acceleration on GPUs. However, having thousands of independent tasks alone might not be sufficient to achieve high performance. It is crucial to design a parallel algorithm that maps well onto the GPU's hierarchical hardware parallelism. For example, it is best to employ data-parallel computations in the threads of the same workgroup. Using these threads to run unrelated tasks is likely to result in much lower performance.

Consider power efficiency. A typical high-end discrete GPU consumes about 200W of power, which is about twice the consumption of a high-end CPU. Therefore, achieving power-efficient execution in GPU-accelerated systems implies that the acceleration speedup must exceed 2x if only a GPU is used. In practice, employing both a CPU and a GPU may achieve better power efficiency than running on either processor alone [7].
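The 2x threshold follows from energy = power × time. The sketch below makes that arithmetic explicit; the 200W and 100W defaults come from the rough figures in the paragraph above, and the model deliberately ignores the power drawn by the idle CPU while the GPU runs, which is a simplifying assumption.

```python
def energy_ratio(speedup: float, gpu_watts: float = 200.0, cpu_watts: float = 100.0) -> float:
    """Energy used by a GPU-only run divided by energy used by a CPU-only run.

    A value below 1.0 means the GPU run consumes less energy.
    Simplification: the host CPU's idle power during the GPU run is ignored."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    # CPU run takes 1 time unit; GPU run takes 1/speedup time units.
    return (gpu_watts / speedup) / cpu_watts


if __name__ == "__main__":
    for s in (1.0, 2.0, 4.0):
        print(f"speedup {s:.0f}x -> GPU/CPU energy ratio {energy_ratio(s):.2f}")
```

The ratio reaches 1.0 exactly at a 2x speedup, so only speedups above that make the GPU-only run more energy-efficient in this model.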

Understand memory performance. To feed their voracious appetites for data, high-end GPUs employ a massively parallel memory interface, which offers high bandwidth for local access by GPU code. The GPU memory subsystem often becomes the main bottleneck, however, and it is useful to estimate application performance limits imposed by the memory demands of the parallel algorithm itself.

Consider an algorithm that performs c operations for every m bytes accessed in memory, i.e., its compute-to-memory access ratio is A = c/m. Assuming that memory accesses dominate the execution time and are completely overlapped with arithmetic operations in GPUs, the algorithm performance cannot exceed A × B, where B is the GPU local memory bandwidth. For example, the vector sum kernel performs 1 addition per three 4-byte memory operations (2 reads and 1 write) for each output, hence A = 1/12 FLOP/byte. If we use an NVIDIA GeForce Titan GPU with B = 228 GB/s, the maximum performance is at most 19 GFLOP/s, which is 0.5% of the GPU’s raw capacity.
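The bound A × B is easy to reproduce for the vector sum example. This is a minimal sketch with illustrative function and parameter names; it uses the bandwidth figure quoted above and measures memory traffic in bytes, so that A is expressed in operations per byte.

```python
def memory_bound_gflops(ops: float, bytes_moved: float, bandwidth_gb_s: float) -> float:
    """Upper bound on performance (GFLOP/s) for a memory-bound kernel.

    ops / bytes_moved is the compute-to-memory access ratio A (FLOP/byte);
    bandwidth_gb_s is the GPU local memory bandwidth B (GB/s).
    Computed as ops * B / bytes_moved, which equals A * B."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return ops * bandwidth_gb_s / bytes_moved


if __name__ == "__main__":
    # Vector sum: 1 addition per output, 2 reads + 1 write of 4 bytes each.
    ops_per_output = 1
    bytes_per_output = 3 * 4
    bound = memory_bound_gflops(ops_per_output, bytes_per_output, 228.0)
    print(f"A = {ops_per_output}/{bytes_per_output} FLOP/byte, bound = {bound:.0f} GFLOP/s")
```

Doubling the work done per byte fetched, for example by reusing data held in on-chip memory, doubles the bound, which is why the ratio A is the quantity to optimize.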

Estimating the memory performance of an algorithm may help identify its inherent performance limits and guide the design toward cache-friendly parallel algorithms that optimize memory reuse and leverage fast on-core memory, as we showed in our work