Large-Scale Graph Processing on FPGAs with Caches for Thousands of Simultaneous Misses
Efficient large-scale graph processing is crucial to many disciplines. Yet, while graph algorithms naturally expose massive parallelism opportunities, their performance is limited by the memory system because of irregular memory accesses. State-of-the-art FPGA graph processors, such as ForeGraph and FabGraph, address the memory issues by using scratchpads and regularly streaming edges from DRAM, but they then end up wasting bandwidth on unneeded data. However, where classic caches and scratchpads fail to deliver, FPGAs make powerful unorthodox solutions possible. In this paper, we resort to extreme nonblocking caches that handle tens of thousands of outstanding read misses. They significantly increase the ability of memory systems to coalesce multiple accelerator accesses into fewer DRAM memory requests; essentially, when latency is not the primary concern, they bring the advantages expected from a very large cache at a fraction of the cost. We prove our point with an adaptable graph accelerator running on Amazon AWS F1; our implementation takes into account all practical aspects of such a design, including the challenges involved when working with modern multidie FPGAs. Running classic algorithms (PageRank, SCC, and SSSP) on large graphs, we achieve a 3x geometric-mean speedup compared to state-of-the-art FPGA accelerators, 1.1-5.8x higher bandwidth efficiency and 3.0-15.3x better power efficiency than multicore CPUs, and we support much larger graphs than the state-of-the-art on GPUs.
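The abstract's key mechanism is miss coalescing: a nonblocking cache tracks outstanding misses (e.g., in miss status holding registers, MSHRs) so that many accelerator accesses to the same cache line trigger only one DRAM request. The following is a minimal behavioral sketch of that idea, not the paper's hardware design; the class and parameter names are hypothetical.

```python
# Hypothetical sketch (not the paper's RTL): a nonblocking cache that
# coalesces outstanding misses to the same line into one DRAM request
# by tracking them in MSHR-like entries.

class NonblockingCache:
    def __init__(self, line_size=64):
        self.line_size = line_size
        self.lines = set()        # tags of lines currently resident
        self.mshrs = {}           # tag -> accesses waiting on that line
        self.dram_requests = 0    # DRAM traffic actually issued

    def access(self, addr):
        tag = addr // self.line_size
        if tag in self.lines:
            return "hit"
        if tag in self.mshrs:
            # Secondary miss: same line already in flight, so just
            # record the waiter -- no new DRAM request is issued.
            self.mshrs[tag].append(addr)
            return "coalesced"
        # Primary miss: allocate an MSHR entry and go to DRAM.
        self.mshrs[tag] = [addr]
        self.dram_requests += 1
        return "miss"

    def fill(self, tag):
        # DRAM response arrives: line becomes resident and every
        # coalesced waiter can now be served.
        self.lines.add(tag)
        return self.mshrs.pop(tag, [])

cache = NonblockingCache()
# 10,000 irregular accesses spread over 100 distinct lines:
# only 100 DRAM requests are issued, the rest are coalesced.
for i in range(10_000):
    cache.access((i % 100) * 64 + (i % 8))
print(cache.dram_requests)  # 100
```

With thousands of such entries, latency-tolerant accelerators can keep issuing accesses while misses are in flight, which is the bandwidth-efficiency effect the abstract describes.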
WOS:000702275600045
2021-01-01
978-1-6654-3333-4
Los Alamitos
Conference Proceedings Annual International Symposium on Computer Architecture
609-622
REVIEWED
Event name | Event place | Event date
Annual International Symposium on Computer Architecture | ELECTR NETWORK | Jun 14-19, 2021