NoSQL databases focus on analytical processing of large-scale datasets, offering increased scalability over commodity hardware. They have recently gained a lot of attention from both the industrial and the academic communities, serving some of the most innovative and profitable data-intensive applications of our time.
Many of these engines support elastic scaling actions. Elasticity allows for fairly apportioned costs and high-quality performance, which fits the philosophy of a cloud-based platform directly. As such, studying the performance of these engines, with special attention to their elastic properties, has become increasingly important. I have put considerable effort into studying elasticity in NoSQL databases in my research ([1], [2], [3], [4]), especially with virtualized resources, a typical case in cloud-based deployments.
One important aspect that has to be identified and quantified in NoSQL cloud elasticity is the cost of adding or removing nodes in the cluster. Expressing cost in time units for this discussion (with indicative numbers taken from the work in [4]), we can categorize this cost as follows:
VM initialization: Launching new virtual machines takes up a significant part of the addition phase. Even when the image is cached in the VM container (so that launching the instance does not incur the extra time of transferring the image from the repository), the VM only becomes available after a few minutes (2-3 minutes on average), which covers OS boot time and DHCP negotiation. Node additions can be performed in parallel, both across multiple VM containers and when several VMs are launched on the same container. VM removal is much faster (less than 10 seconds).
Node reconfiguration: This phase involves creating various configuration files and propagating them to all nodes in the cluster. This is necessary because a number of settings have to be available on both new and existing nodes (e.g., domain name resolution options). Since configuration files are usually small, this phase completes relatively quickly (30 seconds on average), even for large clusters. It is required during node addition as well as node removal. Whether this script/parameter injection takes place on new nodes, existing nodes or just some specific nodes of the cluster is engine-specific.
Region re-allocation: This part of the elastic action covers the time needed for the added nodes to become active members of the cluster, each serving a part of the data. It consists of the time spent launching the services/daemons using the database's default scripts, the time for the new node to become active during addition, and the time for (some) data regions to be allocated to each new node during addition or to an old node during removal. The total time varies between engines, depending on their system design as well as on the number of added/removed nodes and the amount of data transferred.
Data re-balancing: Complete data re-balancing is expensive in terms of extra load on the servers and the network infrastructure. It also invalidates data blocks while read/write operations are performed on the cluster, jeopardizing consistency. Some engines (e.g., HBase and Cassandra) can add nodes without moving the relevant data. In other cases (e.g., Riak), data re-balancing cannot be avoided when bringing new nodes online. The cost of this operation depends heavily on the amount of data that has to be moved, the number of pre-existing nodes and the number of new nodes that have to fetch the existing data.
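To make the four cost components above concrete, here is a minimal Python sketch that adds them up into a rough time estimate for one elastic action. This is only an illustration under stated assumptions, not the model used in [4]: the names (ElasticActionCost, estimate_addition, estimate_removal) are hypothetical, the constants are the indicative figures quoted above (with 150 seconds assumed as the midpoint of the 2-3 minute boot time), the phases are assumed to run one after the other, and re-balancing time is approximated as data volume divided by an assumed transfer rate.

```python
from dataclasses import dataclass

# Indicative timings quoted in the text, in seconds (exact values are assumptions).
VM_BOOT_SECONDS = 150.0         # assumed midpoint of the 2-3 minute average
VM_REMOVAL_SECONDS = 10.0       # upper bound quoted for VM removal
RECONFIGURATION_SECONDS = 30.0  # average time to propagate configuration files


@dataclass(frozen=True)
class ElasticActionCost:
    """Time spent in each phase of one elastic action (hypothetical model)."""
    vm_seconds: float
    reconfiguration_seconds: float
    region_reallocation_seconds: float
    rebalancing_seconds: float

    @property
    def total_seconds(self) -> float:
        # Assumption: phases run sequentially, so the times simply add up.
        return (self.vm_seconds
                + self.reconfiguration_seconds
                + self.region_reallocation_seconds
                + self.rebalancing_seconds)


def _rebalancing_seconds(data_to_move_gb: float, transfer_rate_gb_per_s: float) -> float:
    # Simplification: moving data costs volume / rate; real engines differ.
    if transfer_rate_gb_per_s <= 0:
        raise ValueError("transfer_rate_gb_per_s must be positive")
    return data_to_move_gb / transfer_rate_gb_per_s


def estimate_addition(region_reallocation_seconds: float,
                      data_to_move_gb: float = 0.0,
                      transfer_rate_gb_per_s: float = 0.1) -> ElasticActionCost:
    """Estimate a node addition; VMs boot in parallel, so boot time is counted once."""
    return ElasticActionCost(
        vm_seconds=VM_BOOT_SECONDS,
        reconfiguration_seconds=RECONFIGURATION_SECONDS,
        region_reallocation_seconds=region_reallocation_seconds,
        rebalancing_seconds=_rebalancing_seconds(data_to_move_gb, transfer_rate_gb_per_s),
    )


def estimate_removal(region_reallocation_seconds: float,
                     data_to_move_gb: float = 0.0,
                     transfer_rate_gb_per_s: float = 0.1) -> ElasticActionCost:
    """Estimate a node removal; only the VM phase differs from addition."""
    return ElasticActionCost(
        vm_seconds=VM_REMOVAL_SECONDS,
        reconfiguration_seconds=RECONFIGURATION_SECONDS,
        region_reallocation_seconds=region_reallocation_seconds,
        rebalancing_seconds=_rebalancing_seconds(data_to_move_gb, transfer_rate_gb_per_s),
    )


if __name__ == "__main__":
    # An engine that adds nodes without moving data (data_to_move_gb stays 0).
    print(estimate_addition(region_reallocation_seconds=60.0).total_seconds)
    # An engine that must re-balance 10 GB at the assumed 0.1 GB/s.
    print(estimate_addition(60.0, data_to_move_gb=10.0).total_seconds)
```

Leaving data_to_move_gb at zero corresponds to engines that can add nodes without moving data, while a non-zero value models engines where re-balancing cannot be avoided. The region re-allocation time is left as an input because it is engine-specific and depends on the number of nodes involved.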
References:
[1] D. Tsoumakos, I. Konstantinou, C. Boumpouka, S. Sioutas and N. Koziris: Automated, Elastic Resource Provisioning for NoSQL Clusters Using TIRAMOLA. In proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Delft, The Netherlands, May 13-16, 2013.
[2] I. Konstantinou, E. Angelou, D. Tsoumakos, C. Boumpouka, N. Koziris and S. Sioutas: TIRAMOLA: Elastic NoSQL Provisioning through A Cloud Management Platform. In proceedings of the 2012 ACM SIGMOD/PODS International Conference on Management of Data (Demo Track), Scottsdale, Arizona, USA, May 20-24, 2012.
[3] E. Angelou, N. Papailiou, I. Konstantinou, D. Tsoumakos and N. Koziris: Automatic Scaling of Selective SPARQL Joins Using the TIRAMOLA System. In proceedings of the 4th International Workshop on Semantic Web Information Management, Scottsdale, Arizona, USA, May 20, 2012.
[4] I. Konstantinou, E. Angelou, C. Boumpouka, D. Tsoumakos and N. Koziris: On the Elasticity of NoSQL Databases over Cloud Management Platforms. In proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM 2011), Glasgow, Scotland, October 24-28, 2011.