Straight to the Queue: Fast Load-Store Queue Allocation in Dataflow Circuits

Elakhras, Ayatallah; Sawhney, Riya; Guerrieri, Andrea; Josipovic, Lana; Ienne, Paolo

doi:10.1145/3543622.3573050

conference paper

January 1, 2023

Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2023

31st ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA)

Dynamically scheduled high-level synthesis can exploit high levels of parallelism in poorly predictable, control-dominated applications. Yet, dataflow circuits are often generated by literal conversion of basic blocks into circuits interconnected so as to mimic the program's sequential execution. Although correct and quite effective in many cases, this adherence to control flow still significantly limits exploitable parallelism. Recent research introduced techniques to deliver data tokens directly from producers to consumers and achieved tangible benefits in both circuit complexity and execution time. Unfortunately, while this successfully addressed ordinary data dependencies, the problem of potential dependencies through memory remains open: when no technique can statically disambiguate accesses, circuits must be built with load-store queues (LSQs), which, to reorder accesses safely, need memory accesses to be allocated in the queues in program order. Such in-order allocation still demands control circuitry that emulates sequential execution, with its negative impact on parallelization. In this paper, we transform potential memory dependencies into virtual data dependencies and use the new direct token delivery strategy to allocate accesses sequentially into the LSQ. In other words, we exploit more parallelism by constructing control circuitry that emulates exclusively those parts of the control flow strictly necessary for in-order allocation. Our results show that we can achieve up to a 74% reduction in execution time compared to prior work and, in some cases, at no area cost.
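
To make the notion of a potential dependency through memory concrete, the following is a minimal software sketch, not the circuit construction of the paper. It assumes a histogram-style loop, hist[idx[i]] += 1, in which the addresses are known only at run time, and it lists the accesses in program order, the order in which an LSQ would need them allocated. The names Access, histogram_accesses, and conflicting_pairs are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Access:
    kind: str  # "load" or "store"
    addr: int  # address, known only at run time in this toy model


def histogram_accesses(idx):
    """Memory accesses of `for i: hist[idx[i]] += 1`, listed in program order."""
    accesses = []
    for a in idx:
        accesses.append(Access("load", a))   # read hist[idx[i]]
        accesses.append(Access("store", a))  # write hist[idx[i]]
    return accesses


def conflicting_pairs(program_order):
    """Return (older_store, younger_load) positions that touch the same address.

    Each such pair is a read-after-write dependency through memory: the load
    must not be served before the older store has been performed.
    """
    pairs = []
    for j, younger in enumerate(program_order):
        if younger.kind != "load":
            continue
        for i in range(j):
            older = program_order[i]
            if older.kind == "store" and older.addr == younger.addr:
                pairs.append((i, j))
    return pairs


# Same loop, different input data: the dependency exists only in the first case.
print(conflicting_pairs(histogram_accesses([3, 5, 3])))
print(conflicting_pairs(histogram_accesses([0, 1, 2])))
```

Whether any two iterations conflict depends entirely on the contents of idx, which is why such accesses cannot be disambiguated statically. The positions in the list stand in for the program order that the queue relies on to tell older accesses from younger ones.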