Friday, May 23, 2014

Hadoop Ecosystem - a growing list

Hadoop Ecosystem:

As we know, there are many other projects built around the core components of Hadoop, often referred to as the "Hadoop Ecosystem". Below is a fairly comprehensive list, which continues to grow...
  • Distributed Filesystem
    • Hadoop Distributed File System (Apache Software Foundation)
    • HDFS is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. The HDFS High Availability feature addresses this by running two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby, using ZooKeeper for failure detection and automatic failover (a sample client-side HA configuration is sketched after this list).
    • Amazon S3 file system
    • Google File System (Google Inc.)
    • Ceph (Inktank, Red Hat)
    • GlusterFS (Red Hat)
    • Lustre (OpenSFS & Lustre)
  • Distributed Programming
    • MapReduce (Apache Software Foundation)
    • Apache Pig
    • JAQL
    • Apache Spark
    • Stratosphere
    • Netflix PigPen
    • AMPLab SIMR
    • Facebook Corona
    • Apache Twill
    • Damballa Parkour
    • Apache Hama
    • Datasalt Pangool
    • Apache Tez
    • Apache DataFu
    • Pydoop
  • NoSQL Databases
    • Column Data Model
      • Apache HBase
      • Apache Cassandra
      • Hypertable
      • Apache Accumulo
    • Document Data Model
      • MongoDB
      • RethinkDB
      • ArangoDB
    • Stream Data Model
      • EventStore
    • Key-value Data Model
      • Redis DataBase
      • LinkedIn Voldemort
      • RocksDB
      • OpenTSDB
    • Graph Data Model
      • ArangoDB
      • Neo4j
  • NewSQL Databases
    • TokuDB
    • HandlerSocket
    • Akiban Server
    • Drizzle
    • Haeinsa
    • SenseiDB
    • Sky
    • BayesDB
    • InfluxDB
  • SQL-on-Hadoop
    • Apache Hive
    • Apache HCatalog
    • AMPLab Shark
    • Apache Drill
    • Cloudera Impala
    • Facebook Presto
    • Datasalt Splout SQL
    • Apache Tajo
    • Apache Phoenix
  • Data Ingestion
    • Apache Flume
    • Apache Sqoop
    • Facebook Scribe
    • Apache Chukwa
    • Apache Storm
    • Apache Kafka
    • Netflix Suro
    • Apache Samza
    • Cloudera Morphline
    • HIHO
  • Service Programming
    • Apache Thrift
    • Apache Zookeeper
    • Apache Avro
    • Apache Curator
    • Apache Karaf
    • Twitter Elephant Bird
    • LinkedIn Norbert
  • Scheduling
    • Apache Oozie
    • LinkedIn Azkaban
    • Apache Falcon
  • Machine Learning
    • Apache Mahout
    • WEKA
    • Cloudera Oryx
    • MADlib
  • Benchmarking
    • Apache Hadoop Benchmarking
    • Yahoo Gridmix3
    • PUMA Benchmarking
    • Berkeley SWIM Benchmark
    • Intel HiBench
  • Security
    • Apache Sentry
    • Apache Knox Gateway
  • System Deployment
    • Apache Ambari
    • Apache Whirr
    • Cloudera HUE
    • Buildoop
    • Apache Bigtop
    • Apache Helix
    • Hortonworks HOYA
    • Brooklyn
    • Marathon
    • Apache Mesos
  • Applications
    • Revolution R
    • Apache Nutch
    • Sphinx Search Server
    • Apache OODT
    • HIPI Library
    • PivotalR
  • Development Frameworks
    • Spring XD
  • Miscellaneous
    • Talend
    • Apache Tika
    • Twitter Finagle
    • Apache Giraph
    • Concurrent Cascading
    • S4 Yahoo
    • Intel GraphBuilder
    • SpagoBI
    • Jedox Palo
    • Twitter Summingbird
    • Apache Kiji
    • Tableau
    • D3.JS
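
Following up on the HDFS High Availability note near the top of the list, below is a minimal sketch of how a Java client might address an HA-enabled HDFS cluster through a logical nameservice rather than a single NameNode host. It is a sketch only: the nameservice ID, host names, and ZooKeeper quorum are placeholder values, and in a real deployment these properties normally live in hdfs-site.xml and core-site.xml rather than being set in code.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaHdfsClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Address the cluster by a logical nameservice, not a single NameNode host
            // (all names below are placeholders for this example).
            conf.set("fs.defaultFS", "hdfs://mycluster");
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

            // Client-side proxy that fails over between the active and standby NameNodes.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                     "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            // Automatic failover via ZooKeeper (normally configured on the cluster side).
            conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
            conf.set("ha.zookeeper.quorum",
                     "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

            FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
            System.out.println("Connected, home directory: " + fs.getHomeDirectory());
            fs.close();
        }
    }

The last two properties (automatic failover and the ZooKeeper quorum) are really cluster-side settings used by the NameNodes and their ZKFC processes; they are included here only so the sketch shows all the pieces of an HA setup in one place.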

Wednesday, May 14, 2014

Hadoop at a glance

Apache Hadoop, at its core, consists of two components: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. HDFS is the primary storage system used by Hadoop applications. It creates multiple replicas of data blocks and distributes them across compute nodes throughout a cluster to enable reliable, extremely rapid computation. Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process huge amounts of data in parallel on large clusters of compute nodes. Other Hadoop-related projects (which together form the Hadoop ecosystem) include Hive, Pig, HBase, YARN, Mahout, Oozie, Sqoop, Avro, Cascading, ZooKeeper, Flume, Drill, etc.
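
To make the MapReduce programming model a little more concrete, here is the classic word-count example written against the org.apache.hadoop.mapreduce API, roughly as it appears in the Hadoop tutorials. It is a sketch only: the input and output paths are taken from the command line and are assumed to be HDFS directories.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // shrink map output locally
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The map function runs in parallel on the nodes holding the input blocks, the framework groups the intermediate (word, 1) pairs by key, and the reduce function sums them; the combiner simply reuses the reducer to reduce the amount of map output crossing the network.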

Technologies that compete with Hadoop include Google Dremel, HPCC Systems, and Apache Storm.

Google Dremel is a distributed system developed at Google for interactively querying large datasets and powers Google's BigQuery service. 

HPCC (High Performance Computing Cluster) is a massively parallel processing computing platform that solves Big Data problems. 

Apache Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple and can be used with any programming language.
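
As an illustration of the programming model, here is a rough sketch of a Storm word-count topology written against Storm's Java API from the 0.9.x era (hence the backtype.storm package names). The spout, component names, and parallelism hints are made up for this example; a real topology would typically read its stream from a queue such as Kafka.

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class WordCountTopology {

        // Toy spout that endlessly emits sentences (illustrative only).
        public static class SentenceSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final String[] sentences = {
                    "the cow jumped over the moon",
                    "an apple a day keeps the doctor away" };
            private int index = 0;

            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            public void nextTuple() {
                collector.emit(new Values(sentences[index]));
                index = (index + 1) % sentences.length;
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("sentence"));
            }
        }

        // Splits each sentence tuple into one tuple per word.
        public static class SplitSentence extends BaseBasicBolt {
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                for (String word : tuple.getString(0).split("\\s+")) {
                    collector.emit(new Values(word));
                }
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        // Keeps a running count per word (in memory, for illustration only).
        public static class WordCount extends BaseBasicBolt {
            private final Map<String, Integer> counts = new HashMap<String, Integer>();

            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String word = tuple.getStringByField("word");
                Integer count = counts.get(word);
                count = (count == null) ? 1 : count + 1;
                counts.put(word, count);
                collector.emit(new Values(word, count));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout(), 1);
            builder.setBolt("split", new SplitSentence(), 4).shuffleGrouping("sentences");
            // fieldsGrouping routes every tuple with the same word to the same counter task.
            builder.setBolt("count", new WordCount(), 4).fieldsGrouping("split", new Fields("word"));

            // Run in-process for testing; use StormSubmitter to deploy to a real cluster.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", new Config(), builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }

The fieldsGrouping on "word" is what makes the counts correct even with several counter bolts running in parallel.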


Hadoop distributions are provided by a growing number of companies. These products include Apache Hadoop or a derivative work thereof, commercial support, and/or tools and utilities related to Hadoop. Some major Hadoop distribution vendors are Cloudera, Hortonworks, MapR, Amazon Web Services, Intel, EMC, IBM, etc.

Wednesday, April 23, 2014

Challenges of Big Data; Is Hadoop meeting the Big Data Challenge?

Are we living in the era of "Big Data"? Yes, of course. In today's technology-fuelled world, computing power has increased significantly, electronic devices are more commonplace, access to the Internet has improved, and users can transmit and collect more data than ever before. Organizations are producing data at an astounding rate; it is reported that Facebook alone collects 250 terabytes a day.

According to Thomson Reuters News Analytics, digital data production has grown from almost 1 million petabytes (equal to about 1 billion terabytes) in 2009 to a projected 7.9 zettabytes (a zettabyte is equal to 1 million petabytes) in 2015, with an estimated 35-40 zettabytes by 2020. Other research organizations offer even higher estimates!

As organizations have begun to collect and produce massive amounts of data, they have recognized the advantages of data analysis. But they have also struggled to manage the massive amounts of information that they have. This has led to new challenges.


Businesses realize that tremendous benefits can be gained by analyzing Big Data related to business competition, situational awareness, productivity, science, and innovation. 


Apache Hadoop meets the challenges of Big Data by simplifying the implementation of data-intensive, highly parallel distributed applications. It allows analytical tasks to be divided into fragments of work and distributed over thousands of computers, providing fast turnaround for analysis and distributed storage for massive amounts of data. 

Hadoop provides a cost-effective way to store huge quantities of data. It provides a scalable and reliable mechanism for processing large amounts of data over a cluster of commodity hardware. And it provides new and improved analysis techniques that enable sophisticated analytical processing of multi-structured data.

Hadoop is different from previous distributed approaches in the following ways:
  • Data is distributed in advance.
  • Data is replicated throughout the cluster of computers for reliability and availability.
  • Data processing tries to occur where the data is stored, which eliminates bandwidth bottlenecks.

In addition, Hadoop provides a simple programming approach that abstracts the complexity evident in previous distributed implementations. As a result, Hadoop provides a powerful mechanism for data analytics, which consists of the following:
  • Vast amount of storage — Hadoop enables applications to work with thousands of computers and petabytes of data. Over the past decade, computer professionals have realized that low-cost "commodity" systems can be used together for high-performance computing applications that once could be handled only by supercomputers. Hundreds of "small" computers can be configured in a cluster to obtain aggregate computing power that far exceeds that of a single supercomputer, at a much lower price. Hadoop can leverage clusters of thousands of machines, providing huge storage and processing power at a price that an enterprise can afford.
  • Distributed processing with fast data access — Hadoop clusters provide the capability to efficiently store vast amounts of data while providing fast data access. Prior to Hadoop, parallel computation applications experienced difficulty distributing execution between machines that were available on the cluster. This was because the cluster execution model creates demand for shared data storage with very high I/O performance. Hadoop moves execution toward the data. Moving the applications to the data alleviates many of the high performance challenges. In addition, Hadoop applications are typically organized in a way that they process data sequentially. This avoids random data access (disk seek operations), further decreasing I/O load.
  • Reliability, failover, and scalability — In the past, implementers of parallel applications struggled to deal with the issue of reliability when moving to a cluster of machines. Although the reliability of an individual machine is fairly high, the probability of failure grows as the size of the cluster grows. It is not uncommon to have daily failures in a large cluster of thousands of machines. Because of the way Hadoop was designed and implemented, a failure (or set of failures) will not create inconsistent results. Hadoop detects failures and retries execution (by utilizing different nodes). Moreover, the scalability support built into Hadoop's implementation allows additional (or repaired) servers to be brought into a cluster seamlessly and leveraged for both data storage and execution. For most Hadoop users, the most important feature of Hadoop is the clean separation between business programming and infrastructure support. For users who want to concentrate on business logic, Hadoop hides infrastructure complexity and provides an easy-to-use platform for performing complex, distributed computations on difficult problems.
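
To ground the storage and reliability points above, here is a minimal sketch of a client using the HDFS FileSystem API in Java: it writes a small file, asks HDFS to keep three replicas of its blocks (the common default), and reads the file back. The path and replication factor are example values, and the Configuration object is assumed to pick up the cluster settings from core-site.xml and hdfs-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReplicationDemo {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS and other settings from the Hadoop config on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/replication-demo.txt");   // example path

            // Write a small file; HDFS splits files into blocks and replicates each block
            // across several DataNodes.
            FSDataOutputStream out = fs.create(file);
            out.writeUTF("Hadoop replicates each block across multiple DataNodes.");
            out.close();

            // Request three replicas per block for this file (the common default).
            fs.setReplication(file, (short) 3);

            // Read it back; the client is served by whichever DataNode holds a healthy replica.
            FSDataInputStream in = fs.open(file);
            System.out.println(in.readUTF());
            in.close();

            fs.close();
        }
    }

Because each block lives on several DataNodes, the loss of a single machine does not make the file unavailable, which is the replication-based reliability described in the bullets above.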