HBase vs Cassandra

Ilya Dudkin

20/03/2019

12 min read

On the surface, HBase and Cassandra may look much the same. It is a bit like visiting a car dealership and seeing two cars that appear identical but turn out to have different engines and features. In fact, there are many differences between them. For example, HBase does not have its own query language, while Cassandra does (CQL). In this article, we compare Cassandra and HBase so you can choose the one that is right for you.

Architecture

In terms of architecture, Cassandra is masterless while HBase is master-based. In layman's terms, HBase depends on a master node, which makes it a potential single point of failure, whereas every Cassandra node can serve requests. If you are wondering what this means for you, think about how much downtime you can handle. A Cassandra cluster is designed to stay available even when individual nodes fail. However, since Cassandra replicates data across nodes and the replicas can temporarily disagree, it can lead to consistency issues down the road. Therefore, if you rely heavily on data consistency, HBase would be the much better choice. In Cassandra, all data replication is handled internally, while HBase delegates it to a separate technology called HDFS.

This should come as no surprise, since HBase relies on outside technology not just for data replication but also for things like status management and metadata.
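
To make the consistency trade-off above a little more concrete, here is a minimal sketch of the usual quorum rule for replicated stores such as Cassandra: a read is guaranteed to overlap the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N). The function names are hypothetical and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


n = 3
print(is_strongly_consistent(n, w=1, r=1))                  # False
print(is_strongly_consistent(n, w=quorum(n), r=quorum(n)))  # True
```

Reading and writing a single replica is fast but may return stale data, while quorum reads and writes trade some latency for consistency.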

HBase vs Cassandra Performance

The on-server write paths are pretty similar; the main difference is the names of the data structures. However, when we look closer, we see that HBase is at a disadvantage in terms of write speed, since it does not write to the log and the cache at the same time. This could be a significant obstacle in custom software development projects with heavy write loads. The disadvantages of HBase do not stop there and include the following:

The client has to jump through several hoops to write data to the proper place. It first asks ZooKeeper which server holds the meta table, and then asks that server which server hosts the table region it needs to write to. Only after these lookups can the write begin. If such reads and writes happen often, this location information is cached, but if the table region is moved to another server, the client has to start again from square one. Cassandra is much more user-friendly in this regard, since it uses hashing for data distribution.
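
As a rough illustration of hash-based data distribution, the sketch below places nodes on a ring and routes each key to the first node at or after the key's hash. It is a simplified model under stated assumptions: Cassandra's default partitioner uses Murmur3 and real clusters use virtual nodes, while this sketch uses MD5 from the standard library and one token per node. The names `token`, `HashRing` and `node_for` are hypothetical.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a ring position (MD5 is an assumption made for this sketch)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes):
        # Each node owns the range of tokens that ends at its own token.
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def node_for(self, partition_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around at the end.
        index = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("sensor-42"))  # the same key always maps to the same node
```

Because the owner is computed from the key itself, no lookup table has to be consulted before a write can be routed.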

If you need even more proof that Cassandra expedites the writing process, keep in mind that when the cached data is flushed to disk, HDFS needs time to physically store it. This is just another time-consuming and unnecessary hassle that can be avoided by using Cassandra.
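
The general shape of the write path discussed in this section can be sketched roughly as follows. This is a toy model, not either database's actual code: the log stands in for HBase's WAL or Cassandra's commit log, the in-memory dictionary for the MemStore or memtable, and the flushed lists for HFiles or SSTables. It does not model the timing difference between the two systems, and the class and attribute names are hypothetical.

```python
class WritePathSketch:
    """Toy model: append to a log, update an in-memory table, flush when full."""

    def __init__(self, flush_threshold: int = 3):
        self.log = []            # stands in for the WAL / commit log
        self.memory = {}         # stands in for the MemStore / memtable
        self.flushed_files = []  # stands in for HFiles / SSTables
        self.flush_threshold = flush_threshold

    def put(self, key: str, value: str) -> None:
        self.log.append((key, value))  # record the write for crash recovery
        self.memory[key] = value       # then update the in-memory structure
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Persist a sorted, immutable snapshot and start a fresh in-memory table.
        self.flushed_files.append(sorted(self.memory.items()))
        self.memory = {}
        self.log.clear()  # flushed entries no longer need replaying


store = WritePathSketch(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")  # reaches the threshold and triggers a flush
store.put("row3", "c")
print(len(store.flushed_files), store.memory)  # 1 {'row3': 'c'}
```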

As far as reads are concerned, if your business requires lots of fast and consistent reads, then HBase would be the better choice. It writes each piece of data through a single server, so a read does not have to compare data versions across nodes. Also, HBase servers have only a few data structures to go through before locating your data. We already mentioned that HBase uses HDFS to store information, so it is tempting to conclude that an HBase read is inefficient because it has to retrieve the data from HDFS every single time. In fact, HBase has a block cache that holds the most frequently used data, and as a bonus it has Bloom filters that indicate whether a file is likely to contain the requested data, which speeds things up by letting it skip files that cannot match. Since the index system in HBase and HDFS has many layers, it is more effective than the indexes Cassandra has.

You might have read that Cassandra's reads are very good, so it may come as a surprise that HBase's are better. However, we must remember that Cassandra's fast reads are targeted and most likely inconsistent. Therefore, even though Cassandra can perform many reads per second, that number declines once the reads have to be consistent.
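
To show how a cache and Bloom filters can cut down on disk reads, here is a minimal sketch of a read path. It is a simplification under stated assumptions: HBase caches blocks rather than individual rows, and its Bloom filters are stored with each HFile. The `BloomFilter`, `StoreFile` and `read` names are hypothetical, and SHA-256 is used only because it ships with Python.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Answers 'definitely absent' or 'possibly present' for a key."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key: str) -> bool:
        return all(self.bits[position] for position in self._positions(key))


class StoreFile:
    """Hypothetical stand-in for an on-disk file with its own Bloom filter."""

    def __init__(self, rows: dict):
        self.rows = rows
        self.bloom = BloomFilter()
        for key in rows:
            self.bloom.add(key)


def read(key, cache: OrderedDict, files, cache_size: int = 2):
    if key in cache:  # 1. cache hit: no file access needed
        cache.move_to_end(key)
        return cache[key]
    for store_file in files:  # 2. ask each Bloom filter before touching the file
        if store_file.bloom.might_contain(key) and key in store_file.rows:
            value = store_file.rows[key]
            cache[key] = value
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
            return value
    return None


files = [StoreFile({"row1": "a"}), StoreFile({"row2": "b"})]
cache = OrderedDict()
print(read("row2", cache, files))  # b
print("row2" in cache)             # True
```

A Bloom filter can return a false positive, which is why the sketch still checks the file itself, but it never returns a false negative, so skipping a file is always safe.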

Security

It is no secret that NoSQL databases have a lot of security gaps, so we should not be surprised that Cassandra and HBase have their fair share of security flaws as well. The biggest issue is that performance suffers when you try to secure the data. Still, both have some built-in security measures, such as authentication and authorization. Cassandra has a few extra security features: inter-node and client-to-node encryption. This does not mean that HBase is not secure to work with, but it does rely on third-party technology for its security, just as it does for some other features.

When we delve into security in more detail, we see that both databases offer some granularity when it comes to access control. Cassandra has row-level access, while HBase goes even deeper and offers cell-level access. With Cassandra, each user is assigned roles that determine which information is visible to that user. With HBase, administrators give every data set a visibility level, kind of like a label, and then tell users which labels they have access to.
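
The two access-control styles described above can be contrasted with a small sketch. It is illustrative only and does not use either database's API: the role names, resources and labels are hypothetical, and each cell carries a single label, whereas HBase visibility labels can be combined into expressions.

```python
from dataclasses import dataclass

# Role-based model: a role grants access to named resources.
ROLE_GRANTS = {
    "analyst": {"metrics"},
    "admin": {"metrics", "customers"},
}


def role_can_read(role: str, resource: str) -> bool:
    return resource in ROLE_GRANTS.get(role, set())


# Label-based model: every cell carries a visibility label.
@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    label: str


def visible_cells(cells, user_labels):
    """Return only the cells whose label the user is authorized for."""
    return [cell for cell in cells if cell.label in user_labels]


cells = [
    Cell("user1", "email", "a@example.com", "confidential"),
    Cell("user1", "country", "DE", "public"),
]
print(role_can_read("analyst", "customers"))                 # False
print([c.column for c in visible_cells(cells, {"public"})])  # ['country']
```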

Cassandra and HBase Use Cases

Both databases handle time-series data very well, which is useful for reading sensors in IoT devices, tracking website data and user behavior, and many other uses. Both are very good at storing and reading data. They are also scalable: Cassandra offers linear scalability, while HBase offers linear and modular scalability. If you need to scan large amounts of data to produce narrow results, then HBase is better because there is no data duplication. This is why, for example, HBase is used for text analysis tasks such as finding a single word in a large document.
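
For the time-series case, one common pattern (an assumption here, not something prescribed by either database) is to build keys from a sensor id plus a reversed timestamp, so that a sorted scan returns the newest readings first. The sketch below imitates this with plain Python dictionaries; `MAX_TS` and `row_key` are hypothetical names.

```python
MAX_TS = 10**13  # assumed upper bound for millisecond timestamps in this sketch


def row_key(sensor_id: str, timestamp_ms: int) -> str:
    """Sensor id plus a zero-padded reversed timestamp, so newer rows sort first."""
    return f"{sensor_id}#{MAX_TS - timestamp_ms:013d}"


readings = {
    row_key("sensor-42", 1_553_040_000_000): 21.5,
    row_key("sensor-42", 1_553_040_060_000): 21.7,
    row_key("sensor-07", 1_553_040_000_000): 19.2,
}

# A prefix scan over sorted keys returns one sensor's readings, newest first.
scan = [readings[k] for k in sorted(readings) if k.startswith("sensor-42#")]
print(scan)  # [21.7, 21.5]
```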

Cassandra is the better choice for ingesting large amounts of data because it is a very effective write-oriented database. You can use it to build a very dependable data store that is always available. Cassandra also allows you to create synced data centers in various countries, and if you combine it with Spark you can improve scan performance. The biggest difference is the following: if you need web or mobile apps that must always be on and require complex or real-time analytics, then you should go with Cassandra. However, if there is no hurry to analyze the results, then you should go with HBase.

Conclusion

To recap: Cassandra is very self-sufficient, while HBase relies on third-party technology in various aspects. Both have their strengths and weaknesses, and you just have to know what they are so you can choose the right one for your project. As all this comparing and contrasting shows, HBase and Cassandra are pretty different even though they are both very good databases, so you should analyze the task at hand to determine which one will be best for you. Afterward, you should work on fixing some of the security issues we talked about, especially if you will be handling customer data, since many countries have regulations that require you to handle information in a certain way. Therefore, be sure to pay just as much attention to these laws and regulations as you do to creating your database.