The capability to process high-resolution videos in real time is becoming increasingly important in a wide variety of applications, such as autonomous vehicles, virtual reality, and intelligent surveillance systems. The highly accurate and complex video processing algorithms these applications require pose significant system-design challenges, owing to the sheer amount of computation that must be performed instantly. Furthermore, video processing algorithms operate on large amounts of data, and storing this data in dense off-chip memories makes it difficult to meet bandwidth requirements. Hence, embedded memories are usually required to temporarily store data on-chip, close to the processing units. However, on-chip embedded memories often dominate the silicon real estate and power budget of modern video processing systems-on-chip. Considering the current trend toward higher resolutions and faster frame rates, these challenges are expected to grow dramatically in the future. One of the most important kernels in modern video processing systems is depth perception, since depth information is needed by many advanced video processing algorithms. Depth maps can be created using stereo-matching, which denotes the problem of finding dense correspondences between pairs of images. However, computing high-quality depth maps in real time on high-resolution images at high frame rates is challenging due to the computational complexity of stereo-matching algorithms. Furthermore, their need for large memories and high bandwidth limits the performance of depth estimation units, increases their power consumption, and renders them challenging to integrate at the system level. In this thesis, we develop task-specific solutions, from the algorithmic level down to the circuit level, that accelerate the computation and data transfers of such depth estimation units and optimize their on-chip data storage.
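The core of stereo-matching can be illustrated with a minimal sum-of-absolute-differences (SAD) block-matching sketch: for each pixel of the left image, the best horizontal shift (disparity) of a small window in the right image is selected. This is only an illustrative simplification in Python/NumPy (the function name and parameters are our own, not from the thesis); the hardware-oriented algorithms developed in this work are considerably more elaborate.

```python
import numpy as np

def stereo_sad_disparity(left, right, max_disp=16, window=5):
    """Naive SAD block-matching on two rectified grayscale images.

    For each pixel in the left image, compare a (window x window) patch
    against horizontally shifted patches in the right image and keep the
    shift (disparity) with the lowest sum of absolute differences.
    """
    h, w = left.shape
    half = window // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.int32)
            best_cost, best_d = None, 0
            # Only test shifts for which the right-image window stays in bounds.
            for d in range(min(max_disp, x - half) + 1):
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1].astype(np.int32)
                cost = np.abs(patch - cand).sum()
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

On a synthetic pair where the right image is the left image shifted horizontally by a few pixels, this sketch recovers that shift as a constant disparity in the image interior. Its cost grows with resolution, disparity range, and window size, which hints at why real-time, high-resolution, 256-pixel-disparity operation demands dedicated hardware.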
First, we present hardware-oriented stereo-matching algorithms and their hardware implementations, tailored to increase parallelism while using only on-chip memory to produce high-quality, high-resolution depth maps. Building on these algorithms, we propose a multi-camera depth map estimation ASIC implemented in 28 nm, capable of computing, in real time, depth maps of up to 2K resolution at 32 fps with a disparity range of up to 256 pixels, using two or three cameras. Our design achieves the highest reported disparity-range capability at the lowest power consumption and highest frame rate, while computing high-quality depth maps. It also features a stream-in/stream-out interface for easy integration into existing vision systems. Despite our optimization of the stereo-matching complexity, a considerable share of the proposed ASIC's area and power budget is consumed by the on-chip memory. To address this issue, we next focus on how data can be stored effectively on-chip. An emerging on-chip alternative to conventional SRAM is the logic-compatible gain-cell embedded DRAM (GC-eDRAM), owing to its high density, low power consumption, and inherent two-ported operation. In this thesis, we propose a single-well mixed 3T gain-cell implementation in 28 nm FD-SOI. Based on this concept, a custom 24 kbit GC-eDRAM macro suitable for modern real-time video processing units was fabricated in 28 nm FD-SOI, resulting in the highest-density logic-compatible embedded memory reported in the literature, with improved data retention time compared to conventional 3T gain-cells and lower static power compared to conventional SRAM.