Approximate Probabilistic Tree Indexing for Partitionable Data
Today we experience a shift both in storage technologies and in workloads characteristics. Firstly, solid state disks (SSD) (using flash, PCM or other low-level technologies) emerge as viable competitors of hard disk drives (HDD). Secondly, the increasing number of time-based generated workloads change the characteristics of the stored and analyzed data. Capacity and access times of storage devices create a storage tradeoff where SSD are positioned diametrically opposite to HDD: slow random accesses in HDD have been replaced by efficient random accesses - particularly reads - in SSD, but the capacity per dollar offered by HDD is still one or more orders of magnitude higher than the one offered by SSD. As the aggregated SSD capacity increases exponentially, an increasing part of our data resides on SSD. Traditionally, indexing is designed assuming HDD as secondary storage, thus minimizing random accesses at the expense of capacity. Indexing data residing on SSD, however, requires treating capacity as a scarce resource. In this paper we demonstrate that by compromising on indexing accuracy up to some extent we can save substantially on capacity when indexing on ordered or naturally partitionable attributes. We introduce approximate tree indexing, which employs probabilistic data structures (Bloom filters) to trade accuracy for size and produce smaller, yet powerful, tree indexes, which we name Bloom filter trees (BF-Trees). Small size enables frequent reconstructions and allows storing within secondary storage. The resulting decreased accuracy can be counteracted by exploiting the capabilities of SSD (e.g. internal device parallelism) and minimizing unnecessary reads by using Bloom filters with a higher granularity. In a series of experiments with a synthetic workload we show that approximate indexing offers 1.75x-65x smaller disk footprint with competitive response times.