Monday, April 7, 2014

What is Big Data?

According to Wikipedia's definition, big data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.

There’s no clear-cut definition of ‘big data’; it is a very subjective term. Most people would consider a data set of terabytes or more to be ‘big data’. One reasonable definition is that it is data which can’t easily be processed on a single machine.

The challenges of big data include capture, curation, storage, search, sharing, transfer, analysis and visualization.

To understand big data more precisely, we should look at its characteristics, i.e. the 3 V's -

Big Data: 3 V's

1.  Volume:

As the name suggests, big data technology depends on massive amounts of data to yield better intelligence and more valuable information. The technology is only a few years old, and according to IBM, in 2012 the data gathered every day amounted to 2.5 exabytes (2.5 quintillion bytes). Such an enormous amount of data requires very advanced computational power as well as storage resources to be handled, stored, and analyzed in a reasonable amount of time. Moreover, the gathered information is rapidly increasing in detail, and thus in size.

According to the Harvard Business Review article “Big Data: The Management Revolution” by Andrew McAfee and Erik Brynjolfsson, the volume of data is expected to double every 40 months, driven by the high penetration rates of the wireless technology market.
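To get a feel for what a 40-month doubling period means, here is a minimal Python sketch that projects the daily data volume forward from the 2012 figure of 2.5 exabytes per day. The function name `projected_daily_exabytes` is made up for this illustration, and the assumption that growth is smoothly exponential is a simplification, not something stated by IBM or the HBR article.

```python
# Baseline quoted above: 2.5 exabytes gathered per day in 2012.
BASELINE_EXABYTES_PER_DAY = 2.5
# Doubling period quoted above: every 40 months.
DOUBLING_PERIOD_MONTHS = 40


def projected_daily_exabytes(months_after_2012: float) -> float:
    """Project the daily data volume, assuming smooth exponential growth.

    This is an illustrative assumption; real growth is unlikely to be
    this regular.
    """
    doublings = months_after_2012 / DOUBLING_PERIOD_MONTHS
    return BASELINE_EXABYTES_PER_DAY * 2 ** doublings


if __name__ == "__main__":
    # Print the projection after 0, 1, 2 and 3 doubling periods.
    for months in (0, 40, 80, 120):
        volume = projected_daily_exabytes(months)
        print(f"{months:3d} months after 2012: {volume:.1f} EB/day")
```

Under these assumptions the daily volume goes from 2.5 to 5, 10 and then 20 exabytes over three doubling periods, i.e. roughly an eightfold increase in ten years.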

2.  Velocity:

Big data technology requires very high computational and storage resources in order to handle large and complex sets of unstructured data. Data can be generated and stored in many ways, yet a company's ability to store, retrieve and process these data sets determines its agility.

A famous example was demonstrated by a group of researchers from the MIT Media Lab on Black Friday (the start of the Christmas shopping season in the United States). In an experiment, the MIT Media Lab group collected location data from location-based services on smartphones to detect how many cars entered Macy’s parking lots. Using this information, they were able to estimate the size of Macy’s sales before Macy’s itself was able to do so.

3.  Variety:

Unlike traditional analytics, big data can theoretically take an unlimited number of forms. Data are collected in a tremendous number of ways, and every single operation or action represents value to the business. No one can count the number of operations carried out over the web and electronic devices at every moment all over the globe. For instance, every post and interaction on Facebook, every tweet, shared image, text message, GPS signal and many other forms of electronic interaction count and add valuable information.

This variety of data in most cases produces large, unstructured data sets. The biggest issue that comes with such enormous, unstructured data is noise. Consequently, extracting the right information and real value from big data requires much more mining.

Digital data is growing like a tsunami. According to a forecast by IDC, it is projected to grow to about 40 zettabytes by 2020.

Courtesy: Hadoop Summit April 2-3, 2014, Amsterdam, Netherlands

The key Big Data technologies are as follows -
  • Hadoop - MapReduce framework, including Hadoop Distributed File System (HDFS)
  • NoSQL (Not Only SQL) data stores
  • MPP (Massively Parallel Processing) databases
  • In-memory database processing
Typical Big Data Problems -
  • Perform sentiment analysis on 12 terabytes of daily Tweets
  • Predict power consumption from 350 billion annual meter readings
  • Identify potential fraud in a business's 5 million daily transactions
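
To put these numbers in perspective, the rough Python sketch below converts each of them into an average per-second rate. It assumes the load is spread evenly over time, uses decimal units (1 terabyte = 10^12 bytes) and a 365-day year; real workloads are bursty, so peak rates would be higher. The function name `average_rates` and the dictionary keys are invented for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # leap years ignored for this rough estimate


def average_rates() -> dict:
    """Return average per-second rates for the three example problems.

    Assumes a perfectly even load, which is a simplification.
    """
    tweet_bytes_per_day = 12 * 10**12      # 12 terabytes of Tweets per day
    meter_readings_per_year = 350 * 10**9  # 350 billion readings per year
    transactions_per_day = 5 * 10**6       # 5 million transactions per day

    seconds_per_year = DAYS_PER_YEAR * SECONDS_PER_DAY
    return {
        "tweet_megabytes_per_second": tweet_bytes_per_day / SECONDS_PER_DAY / 10**6,
        "meter_readings_per_second": meter_readings_per_year / seconds_per_year,
        "transactions_per_second": transactions_per_day / SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    for name, rate in average_rates().items():
        print(f"{name}: {rate:,.1f}")
```

Even as averages, these work out to well over a hundred megabytes of Tweets and about eleven thousand meter readings every second, which is why such problems are usually spread across many machines.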

