Abstract

The miniaturization of integrated circuits (ICs) and their higher performance and energy efficiency, combined with new machine learning algorithms and applications, have paved the way for intelligent, interconnected edge devices. In the medical domain, such devices could revolutionize how healthcare services are delivered by enabling continuous monitoring for better disease prevention and more personalized treatments. Publicly available devices that record various biosignals already exist (e.g., smartwatches, fitness trackers). These connected devices could form wireless body area networks (WBANs) that democratize personalized and preventive healthcare services. However, they usually require daily charging, which limits their monitoring capability. Additionally, they must approach the quality of the clinical devices used in hospitals. This is key to providing proper monitoring and diagnosis in daily situations, which in turn reduces healthcare costs. Higher efficiency is therefore required, which tasks embedded systems with two opposing goals: low-power operation and high performance. The current trend to reach these goals is toward heterogeneous platforms, including multi-core architectures with heterogeneous cores and hardware accelerators. The latter can be divided into application-specific integrated circuits (ASICs) and domain-specific instruction-set processors (DSIPs). ASICs are very efficient at implementing a particular functionality for a given set of constraints. However, they are inflexible, which limits their use cases. Conversely, programmable cores or DSIPs offer higher flexibility, but often with a penalty in area, performance, and energy consumption. This thesis explores the performance versus flexibility tradeoff at the architecture level to advance the Pareto front of current solutions.
This exploration has led to VWR2A, a heterogeneous DSIP architecture template targeting the biomedical domain that integrates high computational density and a low-energy memory hierarchy. Compared to two state-of-the-art programmable architectures targeting the biomedical domain, an ARM Cortex-M4-based SoC and a coarse-grained reconfigurable array (CGRA), a VWR2A instance displayed an energy-delay product (EDP) improvement of 104.8× and 19.8×, respectively. In addition, VWR2A enables the generation of designs that narrow the energy and performance gap at the kernel level compared to ASICs. One VWR2A instance has shown similar or better performance on FFT and FIR filter kernels compared to FFT and matrix processor ASICs. Regarding energy, at the kernel level, the VWR2A instance is still 4.9× less efficient than the FFT ASIC, but consumes 22.7% less energy than the matrix ASIC. However, the flexibility of VWR2A results in significant energy savings at the application level compared to ASIC-based designs, with an EDP improvement of 3.1×. Finally, as VWR2A remains fully programmable, it can also execute the control-intensive kernels that are present at the application level and usually executed by the CPU. One VWR2A instance optimized for such code has demonstrated higher performance and energy efficiency than general-purpose processors (GPPs). At the application level, the overhead of programmability (compared to ASICs) is largely compensated by higher code coverage. This results in an EDP improvement of 27.6× when both control-intensive and data-intensive kernels are executed by a VWR2A instance compared to an SoC using a GPP+ASICs combination.
