Bugnion, Edouard; Larus, James Richard; Emami, Seyedmahyar
2024-07-02
10.5075/epfl-thesis-9908
https://infoscience.epfl.ch/handle/20.500.14299/208960

Verification and testing of hardware rely heavily on cycle-accurate simulation of RTL. As single-processor performance is growing only slowly, conventional single-threaded RTL simulation is becoming impractical for increasingly complex chip designs and systems. A solution is parallel RTL simulation, where, ideally, simulators can run on hundreds or thousands of parallel cores. However, existing simulators can exploit only tens of cores due to the high cost of synchronization and communication. On a general-purpose machine, synchronization overhead grows as we exploit more parallelism. With enough cores, synchronization cost grows large enough to offset the gains from parallelism. To scale, an RTL simulation needs to run on parallel machines whose synchronization cost is nearly fixed. This dissertation presents two solutions to this challenge. First, we present Manticore, a parallel computer designed to accelerate RTL simulation by minimizing synchronization overhead through static scheduling. Manticore uses static bulk-synchronous parallel (BSP) execution to eliminate the fine-grained synchronization overhead that cripples parallel performance on general-purpose machines. Manticore relies on a compiler to statically schedule resources and communication, which is feasible because RTL code contains few divergent code paths. With static scheduling, communication and synchronization costs are minimal, making fine-grained parallelism practical. Moreover, static scheduling dramatically simplifies the processor implementation, significantly increasing the number of cores that fit on a chip. Our 225-core FPGA prototype of Manticore runs at 475 MHz and outperforms state-of-the-art RTL simulators running on desktop and server computers.
However, the widening gap between chip size and processor performance suggests we will soon need a parallel RTL simulator that can spread a simulation across thousands of cores. As future chips incorporate more logic and memory than existing ones, a truly scalable parallel simulator must be able to exploit parallelism beyond a small system to simulate cutting-edge designs. To this end, we study the challenges inherent in running parallel RTL simulation on a multi-thousand-core machine, the Graphcore IPU, constructed from multiple 1472-core packages. We analyze the IPU's synchronization and communication performance and build Parendi, an RTL simulator for the IPU. It runs RTL simulation across 5888 cores over 4 IPU sockets. Parendi cost-effectively runs large RTL designs up to 4x faster than powerful, state-of-the-art general-purpose machines.

en
RTL simulation; parallel compilers; FPGA; Graphcore IPU; hardware acceleration; full-cycle simulation; fine-grained parallelism; bulk-synchronous parallel; multicore and manycore architectures; multiple-instruction multiple-data
Highly Parallel RTL Simulation
thesis::doctoral thesis