Abstract

For efficient acceleration on FPGAs, external memory must match the throughput of the processing pipelines. However, the usable DRAM bandwidth decreases significantly if the access pattern causes frequent row conflicts. Memory controllers reorder DRAM commands to minimize row conflicts, but general-purpose controllers must also minimize latency, which limits the depth of the internal queues over which reordering can occur. For latency-insensitive applications with irregular access patterns, nonblocking caches that support thousands of in-flight misses (miss-optimized memory systems) improve bandwidth utilization by reusing the same memory response to serve as many incoming requests as possible. However, they do not make the access pattern sent to the memory any more regular, so row conflicts remain an issue. Sending out bursts instead of single memory requests makes the access pattern more sequential; however, realistic implementations trade high throughput for some unnecessary data in the bursts, leading to bandwidth wastage that cancels out part of the gains from regularization. In this paper, we present an alternative approach that extends the scope of DRAM row conflict minimization beyond the possibilities of general-purpose DRAM controllers. We use the thousands of future memory requests that spontaneously accumulate inside the miss-optimized memory system to implement an efficient large-scale reordering mechanism. By reordering single requests instead of sending bursts, we regularize the memory access pattern in a way that increases bandwidth utilization without incurring any data wastage. Our solution outperforms the baseline miss-optimized memory system by up to 81% and achieves better worst-case, average, and best-case performance than DynaBurst across 15 benchmarks and 30 architectures.
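
To illustrate why reordering a large pool of pending single requests can reduce row conflicts, the following is a minimal software sketch, not the hardware mechanism presented in the paper. It assumes a hypothetical address mapping (8 KiB rows interleaved over 8 banks) and an open-row policy; the names `decode`, `count_row_conflicts`, and `reorder_by_row` are illustrative only.

```python
from collections import defaultdict

# Hypothetical DRAM geometry, chosen only for illustration.
ROW_SHIFT = 13   # assumed 8 KiB per row
NUM_BANKS = 8    # assumed consecutive rows interleaved over 8 banks


def decode(addr):
    """Map a byte address to (bank, row) under the assumed mapping."""
    row_id = addr >> ROW_SHIFT
    return row_id % NUM_BANKS, row_id // NUM_BANKS


def count_row_conflicts(addrs):
    """Count accesses that hit a bank whose open row differs from the target."""
    open_row = {}
    conflicts = 0
    for addr in addrs:
        bank, row = decode(addr)
        if bank in open_row and open_row[bank] != row:
            conflicts += 1
        open_row[bank] = row
    return conflicts


def reorder_by_row(addrs):
    """Group pending requests by (bank, row), keeping first-seen group order."""
    groups = defaultdict(list)
    for addr in addrs:
        groups[decode(addr)].append(addr)
    reordered = []
    for group in groups.values():
        reordered.extend(group)
    return reordered


# Pending misses that alternate between two rows of the same bank.
pending = [0, 65536, 64, 65600, 128, 65664]
print(count_row_conflicts(pending))                  # conflicts in arrival order
print(count_row_conflicts(reorder_by_row(pending)))  # conflicts after grouping
```

In this toy model every request is still issued exactly once, so no extra data is fetched; only the order changes. The benefit grows with the number of pending requests available for grouping, which is the property of the miss-optimized memory system that the abstract relies on.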
