Low power (mW) and high performance (GOPS) are strong requirements for compute-intensive signal processing in E-health, Internet-of-Things, and wearable applications. This work presents a building block for programmable Ultra-Low Power accelerators, namely a tightly-coupled computing cluster that supports parallel and sequential execution at high energy efficiency over a wide range of workload requirements. The cluster, implemented in 28nm UTBB FD-SOI technology, achieves peak energy efficiency in the near-threshold (NVT) operating region: 193 MOPS/mW at 162 MOPS for parallel workloads, and 90 MOPS/mW at 68 MOPS for sequential workloads at 0.46V and 0.5V, respectively. The energy efficient operating range is wide (0.32V to 1.15V), also meeting the design goal of 1 GOPS within a 10 mW power envelope (at 0.66V).