The main aim of this thesis is to examine the advantages of 3D stacking applied to microprocessors and related integrated microprocessor systems at the architectural level. Over successive generations, microprocessor design has aimed at lower power consumption, higher performance, a smaller form factor and greater integration. 3D technology is an emerging technology that can provide improvements in all of these areas. Under conventional process scaling, the signal delay time (RC) is expected to increase with each technology node, mostly because of the increasing resistance of the wires. The situation is exacerbated by the steady increase in interconnect length and in the number of interconnect layers used. Thus, for microprocessor systems in particular, the most important use of 3D is to reduce wiring. 3D systems can be divided into two basic categories according to the type of layers stacked to form the 3D entity. The first involves stacking cache, main memory or devices with similar functions onto a high-performance logic device; this type of stacking is usually referred to as “logic + memory” stacking. The second category involves splitting a logic area between two or more layers and is usually referred to as “logic + logic” stacking. This thesis commences with an introduction to 3D ICs and continues by demonstrating ways to improve memory organization. It then proceeds with a unique form of “logic + memory” stacking that provides interesting opportunities for FPGA implementations. Such opportunities may best be exploited with the use of DSP blocks within FPGAs; in this context, a novel DSP block that enhances FPGA performance follows. The thesis continues with a novel type of link that is especially useful for 3D integration and concludes with a modular “logic + logic” 3D stacked multi-processor platform. More specifically, the first chapter consists of an introduction to 3D ICs.
The second chapter presents a systematic technique to reduce the silicon area required for AVS-enhanced ISEs without compromising I/O bandwidth. The technique combines a search for the lowest-cost memory system organization with a subsequent data layout phase (formulated as LICCA, a problem akin to graph coloring) and the use of input and output alignment layers placed between the memory system and the ISE logic. Optimizing the memory subsystem with this approach reduces the silicon area by around 36% while maintaining the same data bandwidth as a multi-port memory and without degrading the clock frequency. In the next chapter we propose a methodology for generating data accumulation architectures that achieve, to our knowledge, the most efficient use of the available memory bandwidth. Such architectures require the minimum number of cycles to complete a given number of computations while maintaining the same maximum rate of computation completion as state-of-the-art implementations. The following chapter proposes stacking DRAM on top of an FPGA using face-to-face bonding in order to cache future configurations, thereby reducing reconfiguration time. We establish the feasibility of the proposed system and determine that it can cache 289 configurations. The reconfiguration time is 60 ns, with a latency of 8.42 μs between reconfigurations. We also evaluate the performance and area costs and benefits of this system on three multimedia benchmarks. The next chapter discusses the FPCT, a radical departure from the traditional DSP blocks that currently enhance the arithmetic functionality of FPGAs. The motivation for the FPCT originates from a set of dataflow transformations that improve arithmetic circuits for ASIC synthesis by maximizing the use of carry-save addition. In the chapter after that, we propose an enhancement to the FPCT: a new DSP block for FPGAs that can perform multi-input addition as well as multiplication.
It combines two bypassable 9x9-bit PPGs with two 4-CSlice half-FPCTs and some fixed-function partial product compression logic. Experiments show that the DSP block remains competitive with Altera’s Stratix II DSP block in terms of critical path delay and area utilization while, unlike existing DSP blocks, retaining the FPCT’s ability to accelerate multi-input addition operations. In the next chapter, we exploit the large bandwidth offered by state-of-the-art through-silicon via (TSV) technology in the design of the inter-tier link. The proposed inter-tier quasi-serial link occupies five times less area than a traditional synchronous parallel link and can be considered a low-cost, efficient inter-tier communication solution for 3D NoC designs. The next chapter presents a modular 3D stacked multi-processor platform composed of identical dies interconnected by TSVs. Stacking identical, fully testable multi-processor dies, each with 4 processing elements and memory units, increases the yield of the final 3D system, which is built out of known good dies (KGD). Moreover, the homogeneous integration approach presented in this work can significantly reduce the non-recurring engineering (NRE) cost. Coherent design and testing strategies are proposed and demonstrated to ensure robust operation. A test vehicle consisting of two layers has been fabricated in a standard UMC 90 nm CMOS process. Single dies were tested and found functional, and were then processed for in-house TSV fabrication and stacking. The proposed 3D system can operate at 400 MHz, with a vertical bandwidth of 3.2 Gbps. The final chapter summarizes the contributions of the thesis and the conclusions drawn from this work.