Design and implementation of an efficient data stream processing system

In standard database scenarios, an end-user assumes that all data (e.g., sensor readings) is stored in a database. Therefore, one can simply submit any arbitrary complex processing in the form of SQL queries or stored procedures to a database server. Data stream oriented applications are typically dealing with huge volumes of data. Storing data and performing off-line processing on this huge dataset can be costly, time consuming and impractical. This work describes our research results while designing and implementing an efficient data management system for online and off-line processing of data streams in the field of environmental monitoring. Our target data sources are wireless sensor networks. Although our focus is on a specific application domain, the results of this thesis are designed in a generic way, so that they can be applied to wide variety of data stream oriented applications. This thesis starts by first presenting the state-of-the-art in data stream processing research specifically window processing concepts, continuous queries, stream filtering query languages and in-network data processing (particular focus on TinyOS-based approaches). We present key existing data stream processing engines, their internal architecture and how they are compared to our platform, namely Global Sensor Network (GSN) middleware. GSN middleware enables fast and flexible deployment and interconnection of sensor networks. It provides simple and uniform access to a comprehensive set of heterogeneous technologies. Additionally, GSN offers zero-programming deployment and data-oriented integration of sensor networks and supports dynamic re-configuration and adaptation at runtime. We present the virtual sensor concept, which offers a high-level view of arbitrary stream data sources, its powerful declarative specification and query tools. Furthermore, we describe design, conceptual, architectural and optimization decisions of GSN platform in detail. In order to achieve high efficiency while processing large volumes of streaming data using window-based continuous queries, we present a set of optimization algorithms and techniques to intelligently group and process different types of continuous queries. While adapting GSN to large scale sensor network deployments, we have encountered several performance bottlenecks. One of the challenges we faced was related to scalable delivery of streaming data for high data rate streams. We found out that we could dramatically improve the performance of a query processor by performing simple grouping of user queries hence sharing both the processing and memory costs among similar queries. Moreover, we encountered a similar performance issue while scheduling continuous queries. Problem of efficiently scheduling the execution of continuous queries with window and sliding parameters is not addressed in depth in literature. This problem becomes severe when one considers large volumes of high data rate streams. In these cases, an efficient query scheduler not only increases the performance at least by an order of magnitude but also, decreases the response time and memory requirements. Finally, we present how our GSN platform can get integrated with an external data sharing and visualization framework namely Microsoft's SenseWeb platform. Microsoft's SenseWeb platform, provides a sensor network data gathering and visualization infrastructure which is globally accessible to the end users. This integration (which is initiated by the Swiss Experiment project and demanded by GSN users) not only shows the scalability of GSN platform when combined with optimized algorithms, but also demonstrates its flexibility.

Related material