With the omnipresence of embedded processing in all forms of electronics today, there is a strong trend towards wireless, battery-powered, portable embedded systems which have to operate under stringent energy constraints. Consequently, low power consumption and high energy efficiency have emerged as the two key criteria for embedded microprocessor design. In this thesis we present a range of microarchitectural low-power design techniques which enable the increase of performance for embedded microprocessors and/or the reduction of energy consumption, e.g., through voltage scaling. In the context of cryptographic applications, we explore the effectiveness of instruction set extensions (ISEs) for a range of different cryptographic hash functions (SHA-3 candidates) on a 16-bit microcontroller architecture (PIC24). Specifically, we demonstrate the effectiveness of light-weight ISEs based on lookup table integration and microcoded instructions using finite state machines for operand and address generation. On-node processing in autonomous wireless sensor node devices requires deeply embedded cores with extremely low power consumption. To address this need, we present TamaRISC, a custom-designed ISA with a corresponding ultra-low-power microarchitecture implementation. The TamaRISC architecture is employed in conjunction with an ISE and standard cell memories to design a sub-threshold capable processor system targeted at compressed sensing applications. We furthermore employ TamaRISC in a hybrid SIMD/MIMD multi-core architecture targeted at moderate to high processing requirements (> 1 MOPS). A range of different microarchitectural techniques for efficient memory organization are presented. Specifically, we introduce a configurable data memory mapping technique for private and shared access, as well as instruction broadcast together with synchronized code execution based on checkpointing. We then study an inherent suboptimality due to the worst-case design principle in synchronous circuits, and introduce the concept of dynamic timing margins. We show that dynamic timing margins exist in microprocessor circuits, and that these margins are to a large extent state-dependent and that they are correlated to the sequences of instruction types which are executed within the processor pipeline. To perform this analysis we propose a circuit/processor characterization flow and tool called dynamic timing analysis. Moreover, this flow is employed in order to devise a high-level instruction set simulation environment for impact-evaluation of timing errors on application performance. The presented approach improves the state of the art significantly in terms of simulation accuracy through the use of statistical fault injection. The dynamic timing margins in microprocessors are then systematically exploited for throughput improvements or energy reductions via our proposed instruction-based dynamic clock adjustment (DCA) technique. To this end, we introduce a 6-stage 32-bit microprocessor with cycle-by-cycle DCA. Besides a comprehensive design flow and simulation environment for evaluation of the DCA approach, we additionally present a silicon prototype of a DCA-enabled OpenRISC microarchitecture fabricated in 28 nm FD-SOI CMOS. The test chip includes a suitable clock generation unit which allows for cycle-by-cycle DCA over a wide range with fine granularity at frequencies exceeding 1 GHz. Measurement results of speedups and power reductions are provided.