Today’s world is impatient, and at the same time it generates terabytes of data every hour of every day. The need of the hour is near-limitless scalability, high availability, and maximum I/O throughput. If you are into backend engineering, diving into system design, or simply curious about new technologies, you have probably heard of Cassandra and how popular it has become.
Let’s get to know Cassandra as a database, see what makes it so powerful, and understand where it can be a good fit. Before diving deep, here are some quick highlights about the database:
- It’s a hybrid of the best features of Google’s Bigtable & Amazon’s Dynamo
- It works on a peer-to-peer network where each node is equal. They all can perform both reads & writes.
- It allows tunable consistency for both reads & writes. You can make it a consistent DB based on configurations.
- It handles client calls via the CQL: Cassandra Query Language, similar to SQL
- It scales linearly and handles replication & partitioning automatically.
- It’s written in Java and is open source under the Apache Software Foundation!
Brief History
Cassandra is inspired by Google’s Bigtable (described in a 2006 paper) and Amazon’s Dynamo (described in a 2007 paper). Cassandra was developed at Facebook by Avinash Lakshman, one of the authors of Dynamo, and Prashant Malik to power the Inbox Search feature.
Facebook released it on Google Code in 2008. Shortly after that, Cassandra became an Apache Incubator project in 2009 and a top-level project of the Apache Software Foundation in 2010. Today, Cassandra is freely available under the Apache License 2.0.
Some say Facebook engineers chose the name because, in Greek mythology, Cassandra was a cursed oracle.
Eventually, Facebook replaced Cassandra with HBase, another NoSQL database, for their Inbox Search project, but they continue to use Cassandra in their Instagram division, which supports over 1 billion monthly active users.
Present Day
Cassandra has gone through more than 10 releases so far and is used by 400+ companies across 40 countries. It is one of the most widely used NoSQL databases, and it has given the NoSQL scene a significant boost in recent years. 90% of Fortune 100 companies use Cassandra even though a plethora of NoSQL options is available.
What is Cassandra?
The technical definition of Cassandra includes a few core engineering terms, each of which is a subject of research in itself. Here’s the definition of Cassandra as a database:
Cassandra is a distributed, decentralized, column-oriented, open-source NoSQL database that is highly available and highly scalable, with tunable consistency.
Now let’s break down the definition and understand the terminologies.
Distributed Decentralized System
A distributed system is made up of a collection of nodes. A node is a computing system that can communicate with other nodes and can store data. This way a distributed system can hold a huge volume of data and can be scaled horizontally by simply adding new nodes to the system.
Traditionally, we have built centralized systems in which a single node is responsible for data management & data ownership. It carries out the desired computation by commanding the other nodes in the system. Such a system depends on the master node and cannot be scaled horizontally. Even with vertical scaling, the master node can easily reach its limit. The severe drawback of a centralized system is that if the central node fails, the whole system goes down. The central/master node is therefore called the Single Point of Failure (SPoF).
A centralized system: the master node with authority commanding the other nodes
To avoid the limitations of a centralized system and SPoF, a distributed decentralized system was introduced. This system is a collection of nodes where the computation is spread across the nodes instead of a single system responsible for doing all the computation. It is basically a peer-to-peer network of nodes and no single node has complete control over the system. The different nodes communicate with each other to achieve a common goal.
A distributed decentralized system: collection of nodes communicating with one another without a central authority
Column Oriented
Most of us know the legendary relational databases such as MySQL and PostgreSQL, and many of us have used one at least once in a software project. Relational databases are time-proven and have come a long way since the 1980s. They store data in a tabular format made up of rows & columns, and a particular piece of data is stored in a cell.
Here’s an example of a sample product table:
| id | brand | color | publisher | author | model_id | title | pages |
|---|---|---|---|---|---|---|---|
| 101 | Sony | black | NULL | NULL | Bravia | NULL | NULL |
| 201 | NULL | NULL | Penguin | Daniel Kahneman | NULL | Thinking Fast and Slow | 450 |
As we can see, the schema of a relational table is not very flexible. If we want to store a variety of products whose columns do not match, we have to store NULL in the cells that have no value. Keeping NULL may seem like a way out, but it is problematic. Each empty cell takes some storage (say 1 byte), so if you have 10 billion empty cells in the above table, you are wasting roughly 10 GB of storage.
Column Value Pair at our rescue!
Now the above data can be simplified using column-value pair storage as follows:
| id | attribute_name | attribute_value |
|---|---|---|
| 101 | brand | Sony |
| 101 | color | black |
| 101 | model_id | Bravia |
| 201 | publisher | Penguin |
| 201 | author | Daniel Kahneman |
| 201 | title | Thinking Fast and Slow |
| 201 | pages | 450 |
Now we do not have any NULL or empty values in the above structure, and hence no wasted storage. We created a new row for every attribute of a given id. Let’s see what we have gained:
- Disk space is minimized by not saving empty cells
- The data is not tightly bound to a table schema, which leaves the flexibility to add columns in the future
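The conversion from the wide table to column-value pairs can be sketched in a few lines of Python. This is only an illustration of the idea, not how Cassandra stores data internally; the `products` list and the `to_column_value_pairs` helper are hypothetical names that mirror the sample tables above.

```python
# Hypothetical product rows mirroring the wide table; None marks an empty cell.
products = [
    {"id": 101, "brand": "Sony", "color": "black", "publisher": None,
     "author": None, "model_id": "Bravia", "title": None, "pages": None},
    {"id": 201, "brand": None, "color": None, "publisher": "Penguin",
     "author": "Daniel Kahneman", "model_id": None,
     "title": "Thinking Fast and Slow", "pages": 450},
]


def to_column_value_pairs(rows, key="id"):
    """Return one (id, attribute_name, attribute_value) tuple per non-empty cell."""
    pairs = []
    for row in rows:
        for name, value in row.items():
            if name != key and value is not None:
                pairs.append((row[key], name, value))
    return pairs


pairs = to_column_value_pairs(products)
# Count the cells that the wide layout has to fill with NULL.
empty_cells = sum(value is None for row in products for value in row.values())

for pair in pairs:
    print(pair)
print(f"{len(pairs)} pairs stored, {empty_cells} empty cells avoided")
```

Every tuple corresponds to one row of the column-value table, and the empty cells of the wide layout simply never get written.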
High Availability
In the modern world, we expect systems to be always available to serve us, but 24/7 availability is never fully guaranteed. In software engineering, providers offer Service Level Agreements (SLAs) that commit to a minimum availability level. One of the best-known targets, 99.999%, is called five nines of availability. The availability percentage is calculated as:
Percentage of availability = ((total elapsed time – the sum of downtime) / total elapsed time) × 100
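To make the formula concrete, here is a small Python sketch that applies it and also inverts it to get the downtime budget for a target. The function names are illustrative, and the 365-day year is an assumption made for the arithmetic.

```python
def availability_percent(total_seconds, downtime_seconds):
    """Availability as a percentage of the total elapsed time."""
    return (total_seconds - downtime_seconds) / total_seconds * 100


def allowed_downtime_seconds(target_percent, total_seconds):
    """Maximum downtime that still meets the target availability."""
    return total_seconds * (1 - target_percent / 100)


SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year

five_nines_budget = allowed_downtime_seconds(99.999, SECONDS_PER_YEAR)
print(f"Five nines allows about {five_nines_budget / 60:.2f} minutes of downtime per year")
print(f"One hour down in a year gives {availability_percent(SECONDS_PER_YEAR, 3600):.4f}% availability")
```

Under these assumptions, five nines leaves a budget of a little over five minutes of downtime per year, which is why it is considered a demanding target.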
Cassandra is designed as a peer-to-peer distributed system in which all nodes are equal and the data is distributed among the nodes in the cluster. Since there is no single point of failure, Cassandra provides high availability of data by implementing a fault-tolerant storage system equipped with failure detection. Cassandra achieves fault tolerance via replication.
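The replication idea can be illustrated with a toy ring in Python: each key hashes to a position on the ring, and copies are placed on the next few distinct nodes. This is a simplified sketch under stated assumptions. Cassandra uses its own partitioner, replication strategies, and virtual nodes, and the MD5 hash, node names, and `replicas_for` helper here are stand-ins chosen for illustration.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is only a stand-in hash)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def replicas_for(key, nodes, replication_factor):
    """Walk the ring clockwise from the key's token and pick distinct nodes."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
owners = replicas_for("user:42", nodes, replication_factor=3)

# Simulate the first replica going down: the other copies can still serve the key.
survivors = [node for node in owners if node != owners[0]]
print("replicas:", owners)
print("still available on:", survivors)
```

Because every key lives on several nodes, losing one of them does not make the data unreachable, which is the essence of fault tolerance via replication.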
Highly Scalable
As already discussed, Cassandra is a distributed system under the hood, so it can scale close to linearly just by increasing the number of nodes. It is very common in Cassandra to add a node to, or remove a node from, the cluster during regular work hours without waiting for the system load to drop at night.
Another key feature of Cassandra is that it works on commodity hardware and does not require any specialized hardware components.
Tunable Consistency
Now, this is a really interesting one. In a typical relational database, as soon as you write a row to a table, you are guaranteed to get the written data back if you fetch it immediately afterwards.
This is not the case with Cassandra. Cassandra is a distributed network of nodes: when you write data to one node, the system replicates it to the other replica nodes. If you then try to fetch the data you just wrote, the system does not guarantee that the node serving your read request already has the latest write.
You can tune the consistency level of Cassandra, but stronger consistency comes at the cost of availability. Because different reads/writes may have different needs in terms of consistency, you can specify the consistency level at read/write time.
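The trade-off can be reasoned about with a simple overlap rule: if the number of replicas that acknowledge a write (W) plus the number consulted on a read (R) exceeds the replication factor (RF), every read set overlaps every write set. The Python sketch below only does this arithmetic. It does not talk to a cluster, and the helper names are illustrative; ONE, QUORUM, and ALL are the names of Cassandra consistency levels.

```python
def quorum(replication_factor: int) -> int:
    """A majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_replicas, read_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


RF = 3
# (write level, read level) expressed as replica counts for RF = 3.
scenarios = {
    "write ONE / read ONE": (1, 1),
    "write QUORUM / read QUORUM": (quorum(RF), quorum(RF)),
    "write ALL / read ONE": (RF, 1),
}

for name, (w, r) in scenarios.items():
    strong = reads_see_latest_write(w, r, RF)
    tolerated = RF - max(w, r)  # replicas that may be down with both operations still succeeding
    print(f"{name}: overlapping={strong}, replicas that may be down={tolerated}")
```

The output makes the cost visible: the combinations that guarantee overlap tolerate fewer unavailable replicas, which is the consistency versus availability trade-off described above.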
Where Does Cassandra Fit?
Cassandra isn’t your go-to when you want strict consistency or complex joins. But when uptime, scale, and speed matter more, Cassandra shines. It’s a distributed NoSQL database built for high availability and write-heavy workloads.
Here’s where it fits well:
- Always-On Systems: Its peer-to-peer setup means no single point of failure. Nodes can drop and the system keeps running.
- Horizontal Scale: Need more capacity? Just add nodes. No downtime, no drama.
- High Write Throughput: Tailored for time-series data, event ingestion, and other heavy write operations.
- Eventual Consistency: By design, Cassandra follows the AP model of the CAP theorem. That means availability and partition tolerance take priority, and consistency catches up later.
Cassandra makes sense when you’re pulling in massive volumes of real-time data and can live with eventual consistency. A few solid fits:
- Product Catalogs: Especially when products vary by region or are updated frequently.
- Analytics Pipelines: Storing user activity, events, or session data at scale.
- Logging & Audit Trails: Centralizing logs from multiple services in real time.
- Supply Chain & Logistics: Tracking shipments, inventory, and movement across locations.
- Telemetry Systems: Think health data, weather sensors, or IoT metrics, where volume and timeliness matter more than instant accuracy.
Finally, Cassandra isn’t trying to be your next relational database. It’s purpose-built, and for the right use case it performs exceptionally. Use it when availability, scale, and speed are non-negotiable, and you’re okay with letting consistency catch its breath.
Cassandra Mental Model
Coming from a relational model and diving straight into Cassandra may not be the ideal path. Cassandra requires a completely new mindset and a clean-slate approach to data modelling. Here are the key highlights to keep in mind:
- There are rows & columns in relational DBs. Although Cassandra exposes rows and columns, under the hood it operates closer to a key-value store.
- While in relational DBs we can tune performance post-design, in Cassandra one should know the data access patterns up front to be able to design efficient data storage and retrieval schemas.
- Unlike relational databases, Cassandra does not support JOINs, complex indexes, or full-text search. While some indexing options exist, they come with significant cost.
- Choosing the right primary key and clustering columns is critical, as they define how your data is partitioned and sorted.
- Normalization is an anti-pattern in Cassandra. Data duplication is encouraged: denormalization improves read efficiency and aligns with how data is distributed across nodes!
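One way to internalize these points is to picture a table as a map from partition key to a list of rows kept sorted by the clustering column. The Python sketch below is only a mental model, not Cassandra's storage engine; the sensor data and the `insert` and `read_partition` helpers are hypothetical.

```python
from bisect import insort
from collections import defaultdict

# partition key -> rows kept sorted by the clustering column (reading_time)
table = defaultdict(list)


def insert(sensor_id, reading_time, value):
    """Place the row in its partition, preserving clustering order."""
    insort(table[sensor_id], (reading_time, value))


def read_partition(sensor_id, since=None):
    """Read rows of one partition, optionally from a clustering value onwards."""
    rows = table.get(sensor_id, [])
    return [row for row in rows if since is None or row[0] >= since]


# Writes arrive out of order and for different partitions.
insert("sensor-1", "2024-01-01T10:05", 21.7)
insert("sensor-1", "2024-01-01T10:00", 21.5)
insert("sensor-2", "2024-01-01T10:01", 19.0)

all_rows = read_partition("sensor-1")
recent_rows = read_partition("sensor-1", since="2024-01-01T10:03")
print(all_rows)
print(recent_rows)
```

A query that names the partition key and a range on the clustering column maps directly onto this structure. A query that needs rows from many partitions, or a join, has no such shortcut, which is why access patterns drive the design.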
Common Vocabulary
Cassandra has a lot of terminology that may sound foreign when getting started. Here’s a list of the basic terms. A few related to internals & architecture are left for future posts.
| Term | Description |
|---|---|
| Cluster | A collection of nodes that together store the entire data set. |
| Node | A single machine in the cluster that holds part of the data. |
| Datacenter | A logical grouping of nodes, often corresponding to a physical data center. Useful for replication and fault tolerance. |
| Keyspace | The outermost container for data in Cassandra, similar to a database in RDBMS. It defines replication strategy and factor. |
| Table | A collection of rows (similar to RDBMS tables), but designed with partitioning and denormalization in mind. |
| Row | A single record identified by a primary key. Rows in the same partition are stored together. |
| Column | The smallest unit of data. A row can have many columns, and they can vary per row. |
| Partition | The unit of data distribution. All rows with the same partition key are stored together. |
| Primary Key | Uniquely identifies a row in a table. Consists of a partition key and optional clustering columns. |
| Partition Key | Determines on which node the data will be stored. |
| Clustering Columns | Determine how data is ordered within a partition. |
| Composite Key | A primary key with both partition and clustering components. |
| Static Columns | Columns shared by all rows in a partition. |
| Wide Row | A partition that contains many rows; common in time-series data. |
You can find the full list of Cassandra terminology here: Cassandra Glossary.
Wrap Up
This brings us to the end of our introduction to the Cassandra database. In upcoming posts, we’ll dive into Cassandra’s architecture and explore its real-world use cases, including scenarios where Cassandra may not be the best fit. Understanding both its strengths and its limitations is key to mastering it.
Stay healthy, stay blessed!