As hardware evolves, so do the needs of applications. To increase the performance of an application, there exist two well-known approaches. These are scaling up an application, using a larger multi-core platform, or scaling out, by distributing work to multiple machines. Both approaches are typically implemented using the shared-memory and message-passing programming models. In both programming models, and for a wide range of workloads, there is contention for access to data, and applications need synchronization. On modern multi-core platforms with tens of cores and network interconnects with sub-millisecond latencies, synchronization bottlenecks can significantly limit the scalability of an application. This dissertation studies the scalability of synchronization in shared-memory settings and focuses on the changes brought upon by modern hardware advances in processors and network interconnects. As we show, current approaches to understanding the scalability of applications, as well as prevalent synchronization primitives, are not designed for modern workloads and hardware platforms. We then propose methods and techniques that take advantage of hardware and workload knowledge to better serve application needs. We first look into better understanding the scalability of applications, using low-level performance metrics in both hardware and software. We introduce ESTIMA, an easy-to-use tool for extrapolating the scalability of in-memory applications. The key idea underlying ESTIMA is the use of stalled cycles in both hardware (e.g., cycles spent waiting for cache line fetches or busy locks), as well as software (e.g., cycles spent on synchronization, or on failed Software Transactional Memory transactions). ESTIMA measures stalled cycles on a few cores and extrapolates them to more cores, estimating the amount of waiting in the system. We then focus on locking, a common way of synchronizing data accesses in a shared-memory setting. Typically, contention for data limits the scalability of an application. A poor choice of locking algorithm can further impact scalability. We introduce GLS, a middleware that makes lock-based programming simple and effective. In contrast to classic lock libraries, GLS does not require any effort from the programmer for allocating and initializing locks, nor for selecting the appropriate locking strategy. GLS relies on GLK, a generic lock algorithm that dynamically adapts to the contention level on the lock object, delivering the best performance among simple spinlocks, scalable queue-based locks, and blocking locks. Finally, we study the performance of distributed systems that offer transactions, by exposing shared-memory semantics. Recently, such systems have seen renewed interest due to advancements in networking hardware for datacenters. We show how such systems require significant manual tuning to achieve good performance in a popular category of workloads (OLTP), and argue that doing so is error-prone, as well as time-consuming, and it needs to be repeated when the workload or the hardware change. We then introduce SPADE, a physical design tuner for OLTP workloads on modern RDMA clusters. SPADE automatically decides on data partitioning, index and storage parameters, and the right mix of direct remote data accesses and function shipping to maximize performance. To achieve this, SPADE uses low-level hardware and network performance characteristics gathered through micro-benchmarks.