Building Efficient Query Engines using High-Level Languages

We are currently witnessing a shift towards the use of high-level programming languages for systems development. These approaches collide with the traditional wisdom which calls for using low-level languages for building efficient software systems. This shift is necessary as billions of dollars are spent annually on the maintenance and debugging of performance-critical software. High-level languages promise faster development of higher-quality software; by offering advanced software features, they help to reduce the number of software errors of the systems and facilitate their verification. Despite these benefits, database systems development seems to be lagging behind as DBMSes are still written in low-level languages. The reason is that the increased productivity offered by high-level languages comes at the cost of a pronounced negative performance impact. In this thesis, we argue that it is now time for a radical rethinking of how database systems are designed. We show that, by using high-level languages, it is indeed possible to build databases that allow for both productivity and high performance. More concretely, in this thesis we follow this abstraction without regret vision and use high-level languages to address the following two problems of database development. First, the introduction of a new storage or memory technology typically requires the development of new versions of most out-of-core algorithms employed by the database system. Given the increasing popularity of hardware specialization, this leads to an arms race for the developers. To make things even worse, there exists no clear methodology for creating such algorithms and we must rely on significant creative effort to serve our need for out-of-core algorithms. To address this issue, we present the OCAS framework for the automatic synthesis of efficient out-of-core algorithms. The developer provides two independent inputs: 1) a memory-hierarchy-oblivious algorithm, expressed using a high-level specification language; and 2) a description of the target memory hierarchy. Using these specifications, our system is then able to automatically synthesize memory-hierarchy and storage-device-aware algorithms for tasks such as joins and sorting. The framework is extensible and quickly synthesizes custom out-of-core algorithms as new storage technologies become available. Second, from a software engineering point of view, years of performance-driven DBMS development have led to complicated, monolithic, low-level code bases, which are hard to maintain and extend. In particular, the introduction of new innovative approaches can be a very time-consuming task. To overcome such limitations, we present LegoBase, a query engine written in the high-level language, Scala. LegoBase realizes the abstraction without regret vision in the domain of analytical query processing. We show how by offering sufficiently powerful abstractions our system allows to easily implement a broad spectrum of optimizations which are difficult to achieve with existing approaches. Then, the key technique to regain efficiency is to apply generative programming and source-to-source compile the entire high-level Scala code to specialized, low-level C code. Our architecture significantly outperforms a commercial in-memory database system and an existing query compiler. LegoBase is the first step towards providing a full DBMS written in a high-level language.

Related material