Data Aggregation or Analytics? Think Columns

A Short Primer On Columnar Approach To Data

In RDBMS, data is represented in the form of tuples or rows, while a NoSQL form of database often stores data in the form of columns. Column-centric databases like Cassandra or HBase is the talk of the town today.

With traditional relational model, the data records or entries are represented as rows. It is best suited for typical online transactional systems or application that needs rich transaction ACID properties for its data. In column form of database, the unstructured or semi-structured data or information is aggregated into number of columns or family of columns for further analysis. Column based data store is more suited for data analytics and offers highest performance as there are no strict constraints of data normalization rules.

The concept of analytics is about applying intelligence to the dormant read-only data and therefore collecting such legacy data and persisting it into organized set of fewer columns makes it easier to query and process. These columns are stored as rows with each column having a certain key-value pair of attributes often nested in nature. This type of data organization helps in processing analytical queries faster. A good small example can be of e-commerce domain when dealing with certain category of merchandise, say footwear. Footwear category data can be aggregated and columnized with attributes like brand, type, size, color, style, weather-prone,etc. It becomes simple and fast to analyse category of information in a columnar way. The volume of data can itself reach in thousands or even millions.

Column based database also provides the concept of TTL or time to live for its column data. Products in e-commerce tend to get obsolete or outdated very fast. One can set the TTL on such columns and it will be automatically discarded after the specified time has elapsed. Another feature column based database provides is its ability to compress data that can increase the capacity of its storage. Compression can be configured to apply on closely related segments of data so that the reads and writes are fast. Modern NoSQL databases uses advanced compression techniques and algorithms that can greatly reduce disk IO.

One more important aspect of columns based database is that certain attributes of single family of columns (represented as rows) may vary or be omitted. For example, a footwear column type ‘formals’ may have more attributes than say type ‘flip-flops’. In short the schema of single family of columns may not be consistent. This kind of flexible schema or data model allows for complex data structures to be designed with ease. The design takes an application perspective than a schema perspective. And therefore, you do not have to break your head on applying different normalization rules.

A typical column based DB has the following characteristics.

  • It works on large volume of unstructured or semi-structured data
  • The design model has a more flexible and design is application driven than schema driven
  • Column values are typically stored as key-value attributes that can be nested to create a hierarchy
  • Data model can be or is usually multi-dimensional
  • Column is a unit of storage
  • It is best suited for aggregation and analytics use cases
  • Columns may have a time to live constraint
  • Columns relationships are represented using aggregation

So when you want to perform data aggregation or analytics, think a column based NoSQL database approach!