Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files

Adaszewski, Stanislaw

doi:10.1371/journal.pone.0103319


2014


Abstract

Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, the data format is often the first obstacle. The lack of standardized ways of exploring different data layouts means that the problem has to be solved from scratch each time. The ability to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL), would offer expressiveness and user-friendliness. Comma-separated values (CSV) is one of the most common data storage formats. Despite its simplicity, handling it becomes non-trivial as file size grows. Importing CSV files into existing databases is time-consuming and troublesome, or even impossible if their horizontal dimension reaches thousands of columns. Most databases are optimized for handling a large number of rows rather than columns; therefore, performance for datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It is characterized by: a "no copy" approach, in which data stay mostly in the CSV files; "zero configuration", with no need to specify a database schema; an implementation in C++ with boost [1], SQLite [2] and Qt [3] that requires no installation and has a very small size; efficient plan execution, ensured by query rewriting, dynamic creation of indices for appropriate columns and static data retrieval directly from CSV files; effortless support for millions of columns; easy use of mixed text/number data thanks to per-value typing; and a very simple network protocol that provides an efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware, along with educational videos, on its website [4]. It needs no prerequisites to run, as all of the libraries are included in the distribution package. I test it against existing database solutions using a battery of benchmarks and discuss the results.
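
The per-value typing mentioned in the abstract can be illustrated with SQLite itself, which the abstract names as one of the system's components. The following is a minimal sketch and not Mynodbcsv's implementation: the sample data, the table name `data`, and the helpers `parse_value` and `load` are assumptions made for illustration. Unlike the "no copy" approach described above, it copies the CSV content into an in-memory SQLite table, purely to show that a single column can hold integers, reals and text side by side when no column type is declared.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: the "value" column mixes numbers and text.
CSV_TEXT = "id,value\n1,42\n2,3.5\n3,n/a\n"


def parse_value(raw):
    """Return the field as int or float when possible, otherwise as text."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def load(conn, text):
    """Create an untyped table from the CSV header and insert every row."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    # Columns are declared without a type, so SQLite stores each value
    # with whatever type it was given at insertion time.
    cols = ", ".join(f'"{name}"' for name in header)
    conn.execute(f"CREATE TABLE data ({cols})")
    marks = ", ".join("?" for _ in header)
    conn.executemany(
        f"INSERT INTO data VALUES ({marks})",
        ([parse_value(field) for field in row] for row in rows),
    )


conn = sqlite3.connect(":memory:")
load(conn, CSV_TEXT)

# Storage type of each entry in the "value" column, row by row.
types = [t for (t,) in conn.execute(
    'SELECT typeof("value") FROM data ORDER BY "id"')]

# Sum only the numeric entries, skipping the text one.
total = conn.execute(
    'SELECT SUM("value") FROM data '
    "WHERE typeof(\"value\") IN ('integer', 'real')"
).fetchone()[0]

print(types, total)
```

Filtering on `typeof()` is one way to run numeric aggregates over a mixed column without a cleaning pass beforehand; whether Mynodbcsv exposes the same idiom is not stated in this record.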

Details

Title Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files

Author(s) Adaszewski, Stanislaw

Published in PLOS ONE

Pagination 8

Volume 9

Issue 7

Pages e103319

Date 2014

Publisher San Francisco, Public Library of Science

ISSN 1932-6203

DOI https://doi.org/10.1371/journal.pone.0103319


Laboratories NON-ACADEMIC
BBP-CORE

Record Appears in Scientific production and competences > Transdisciplinary Entities > BBP - Blue Brain Project > BBP-CORE - The Blue Brain Project
Scientific production and competences > Unattributed publications > NON-ACADEMIC - NON ACADEMIC - Unattributed publications
Scientific production and competences > Non-academic units > NON-ACADEMIC - Unattributed publications
Peer-reviewed publications
Work produced at EPFL
Journal Articles
Published

Record creation date 2014-10-23
