Walking Four Machines By The Shore

Recent studies have shown that database workloads use the hardware less efficiently than scientific workloads, and have identified the processor and memory subsystem as the true performance bottlenecks when running decision-support workloads on various commercial DBMSs. Conceptually, all of today's processors follow the same sequence of logical operations when executing a program. Nevertheless, internal implementation details critically affect a processor's performance, and they vary both within and across computer vendors' products. To accurately identify the impact of variation in processor and memory subsystem design on DBMS performance, we need to examine how the individual microarchitectural parameters affect database management systems. This study compares the behavior of a prototype database system built on top of the Shore storage manager across three different processor design philosophies: the Sun UltraSparc (using UltraSparc-II and UltraSparc-IIi processors), the Intel P6 (using an Intel PII Xeon), and the Compaq/DEC Alpha (using a 21164A). These processors differ widely in processor and memory subsystem design. The prototype system is a pertinent choice because its hardware behavior was found to be similar to that of commercial database systems when executing decision-support workloads. To evaluate the design decisions and trade-offs in the execution engines and memory subsystems of these processors, we ran several range selections and decision-support queries on a memory-resident TPC-H dataset. The insights gained indicate that, provided there are no serious hardware implementation concerns, decision-support workloads would achieve higher performance with the following designs:

1. A processor design that employs (a) out-of-order execution to overlap stalls more aggressively, (b) a high-accuracy branch prediction mechanism, and (c) the ability to execute more than one load/store instruction per cycle.
2. A memory hierarchy with (a) non-inclusive caches (at least for instructions), (b) a large (> 2 MB) second-level cache, and (c) a large cache block size (64-128 bytes) without sub-blocking, to exploit spatial locality.
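The spatial-locality argument for larger cache blocks can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the function name, the record size, and the record count are hypothetical, and it assumes a cold cache, contiguous block-aligned records, a single sequential pass, and no prefetching. Under those assumptions, the number of blocks fetched during a scan is the data size divided by the block size, rounded up.

```python
def sequential_scan_block_fetches(num_records: int, record_bytes: int, block_bytes: int) -> int:
    """Blocks fetched by one cold-cache sequential pass over contiguous records.

    Hypothetical model: data starts on a block boundary and every byte is
    touched exactly once, so each block is fetched exactly once.
    """
    total_bytes = num_records * record_bytes
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bytes // block_bytes)


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 records of 100 bytes each.
    for block_bytes in (32, 64, 128):
        fetches = sequential_scan_block_fetches(1_000_000, 100, block_bytes)
        print(f"{block_bytes:>3}-byte blocks: {fetches} block fetches")
```

In this simplified model, doubling the block size halves the number of fetches for a sequential scan. Real behavior also depends on access patterns, sub-blocking, and bandwidth, which the model ignores.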

Presented at:
Fourth Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW’01), Monterrey, Mexico, January 2001

Record created 2015-08-21, last modified 2018-09-13
