Title: Exploiting Compute Caches for Memory Bound Vector Operations
Authors: Vieira, Joao; Ienne, Paolo; Roma, Nuno; Falcao, Gabriel; Tomas, Pedro
Publication date: 2018-01-01
Record date: 2019-06-18
DOI: 10.1109/SBAC-PAD.2018.00041
Handle: https://infoscience.epfl.ch/handle/20.500.14299/157637
Web of Science ID: WOS:000462969700027
Type: Conference paper
Subjects: Computer Science, Hardware & Architecture; Computer Science, Theory & Methods; Computer Science
Keywords: compute caches; memory bound operations; vectorization

Abstract: To reduce the average memory access time, most current processors use a multilevel cache subsystem. However, despite the proven throughput benefits of such cache structures, conventional operations such as copies, simple maps, and reductions still require moving large amounts of data to the processing cores. This imposes significant energy and performance overheads, with most of the execution time spent moving data across the memory hierarchy. To mitigate this problem, a Cache Compute System (CCS) that targets memory-bound kernels such as map and reduce operations is proposed. The developed CCS takes advantage of long cache lines and data locality to avoid data transfers to the processor, and exploits the intrinsic parallelism of vector compute units to accelerate a set of 48 operations commonly used in map and reduce patterns. The CCS was validated by integrating it with an MB-Lite soft-core on a Xilinx Virtex-7 VC709 Development Board. Compared to the MB-Lite core alone, the proposed CCS achieves speedups ranging from 4x to 408x on the supported commands, and energy efficiency gains from 6x to 328x.
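
For context, the memory-bound map and reduce kernels the abstract refers to are typified by simple element-wise and accumulation loops that stream data through the cache hierarchy while doing very little arithmetic per byte. The sketch below is purely illustrative and is not taken from the paper: the function names and the scale-by-constant map are assumptions, not the CCS's set of 48 supported operations. It only shows why such loops are dominated by data movement and would benefit from being executed next to the cache instead of on the core.

#include <stddef.h>

/* Illustrative memory-bound kernels of the kind a compute cache targets
 * (assumed examples, not the paper's command set). Each element is read
 * once and lightly processed, so execution time is dominated by moving
 * cache lines between the memory hierarchy and the core, not by ALU work. */

/* Map pattern: element-wise scaling of a vector
 * (one load and one store per element, one multiply). */
static void map_scale(const int *src, int *dst, size_t n, int factor)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}

/* Reduce pattern: sum of a vector
 * (one load per element, a single accumulator, one add). */
static long reduce_sum(const int *src, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += src[i];
    return acc;
}

Executing such loops on vector units located beside the cache arrays, as the proposed CCS does, avoids shipping every operand up to the processor and back over the memory hierarchy, which is the source of the reported performance and energy efficiency gains.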