Big Data overview?

We hear about Big Data everywhere nowadays. But what exactly is Big Data and why it is important to us?

Before going into details of Big Data, let us first talk about data. About 90% of the data in the world today has been created in last two years only. According to Google, every 2 days we create as much information as we did in years up to 2003.There are around 200 millions tweets sent everyday. Facebook is getting around 6 billion messages per day. RDBMS can not handle this much data. That is where Big Data comes into picture.

Big Data is about handling 3V(Volume,Velocity,Variety). In 2012, Gartner updated its definition as: "Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

O’Reilly defines big data as: “Data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”


Volume - Big Data can handle data in Terabyte, Petabyte and more which RDBMS can't handle.

Velocity - Velocity describes the frequency at which data is generated, captured and shared.  Big Data is capable of handling data coming very fast through various sources such as online systems, sensors, social media, web click-stream capture, and other channels

Variety - Big data comprise of all types data – structured, semi-structured, and unstructured data(such as text, sensor data, audio, video, click streams, log files and more)

History

Big Data came into picture in 2004 when Google published a paper on a process called Map-Reduce.

Map - Queries are split and distributed across parallel nodes and processed in parallel.
Reduce - Results from Map are aggregated and presented in required format to user.

An implementation of Map-Reduce is Apache Hadoop.

Let us understand Map-Reduce through an example of Word Count.

Requirement - Let's say our requirement is to read text file and count occurrences of words. These words are separated by Tab in the text file.
Solution - Write Map Reduce jobs. Each Mapper takes a line as input and break it into words.Output will be key-value pair of word and 1. Reducer job will then aggregate count of words and put it in text file. You can see the example(Map/Reducer) job in detail at url http://hadoop.apache.org/docs/stable/mapred_tutorial.html
e.g. File1 has content like -  Hello World Bye World and File2 has content like Hello Hadoop Goodbye Hadoop, then output of Mapper job will be -

Hello 1
World 1
Bye 1
World 1
Hello 1
Hadoop 1
Goodbye 1
Hadoop 1

Output of Reducer job will be -
Hello 2
World 2
Bye 1
Hadoop 2
Goodbye 1

Big Data Technologies

Some of the Big Data technologies are -

  • Hadoop
  • Hive
  • Spark
  • Shark
  • Pig
  • Zookeeper
Hadoop - Hadoop is the open source java based framework which can be used for data intensive(> TB and PB) applications. It can be run on commodity hardwares. It was originally developed by Dough Cutting who named it after his son's favorite toy. It uses Google File System and Google's Map Reduce as it base. Some of the features of Hadoop are -

  1. Hadoop = HDFS(Hadoop Distributed File System) + Map Reduce
  2. Runs on commodity hardware
  3. Can handle Structured(e.g. data stored in RDBMS),Semi-Structured(logs data) and Unstructured data(Facebook,Twitter data, video,comments,blogs etc)
  4. Handles massive amount of data through parallelism(through Map-Reduce)
  5. Open source project written in java
  6. Mainly used for batch processing jobs and not for real time jobs as response time is not immediate
  7. Provides reliability through replication
Hadoop does the computation through dividing a task into subtasks and assigning subtasks to nodes in the cluster. Data is stored on the computing nodes through HDFS.

Use case for Hadoop

  • Used by media sites having millions of articles for last couple of hundred years
  • Social networking sites like Facebook,Twitter etc.
  • Telecom and weather industry etc.
  • Technology industry like yahoo which runs Hadoop on cluster of around 10,000 nodes on linux machine
Hadoop is not suitable when

  • When real time analysis is required
  • When processing require lot of files with small data
  • Low latency is required

Hadoop Architecture

Hadoop has concept of master nodes and worker nodes. NameNode and DataNode work at HDFS layer and Job Tracker and Task Tracker work at Map-Reduce layer.

Master Node -

  • Keeps Job Tracker and Name Node.
  • Job Tracker keeps track of which job is distributed to which node
  • NameNode keeps track of which part of data is stored at which node
  • High Availability can be ensured by maintaining secondary name node which can work as primary namenode in case of failover
  • Data awareness between Job Tracker and Task Tracker.e.g. if Node A is having X data and Node B is having Y data, then Job Tracker will assign a job related with X data to Node A and related with Y data to Node B thus avoiding network latency.

Workers Node - 

  • Keeps TaskTracker and DataNode
  • Keep track of the task going on in the worker node
  • Sends heartbeat to the Job Tracker to ensure that task is being worked upon
  • In case Job Tracker doesn't receive heartbeat, job is rescheduled and assigned to other worker node
  • Data is stored on the node where computation is going to happen thus avoiding network latency
  • Same data can be stored on multiple nodes through replication to avoid any data loss in case any worker node goes down
Hadoop can be installed on windows machine having JDK and Cygwin


Apache Hive

  • Data warehousing infrastructure based on Hadoop
  • Provides HQL(Hive Query Language) based on SQL
  • These HQL can be run on top on Hadoop thus limiting developers from writing Map Reduce jobs
  • Hive internally converts HQL into Map Reduce jobs
  • Though it is based on SQL, but it doesn't strictly follow SQL standards
  • Best used for batch jobs and not real time jobs
  • Lacks support of transactions and materialized views

Limitations

  • Response time is slow as comparative to Hadoop Map Reduce jobs as Hive is a wrapper on top of Map Reduce jobs and Hive queries are internally converted into Map Reduce jobs

Spark

  • Open source cluster computing system
  • Enables in-memory computing which makes query to run faster as comparatively to disk based computation
  • Spark claims to be 100 times faster as comparative to Hadoop Map-Reduce jobs due to its in-memory execution. We found it to be 5-6 times faster in one of our practical test. 
  • Built on top of Scala language

Shark

  • Shark = Spark + Hive
  • Runs 100 times faster as comparative to Hive

Limitation of Spark/Shark

  • One of the limitation of Spark and Shark, we found is that it requires nearly 3 times memory of the data to be processed.e.g. if we want to process 1 GB data, we might need nearly 3 GB of memory.

Apache Pig

  • Platform similar to Hive for writing Map Reduce jobs
  • Supports parallelism for handling large data sets
  • Language used to write jobs is Pig Latin
  • Originally developed at yahoo for adhoc way of creating and writing Map Reduce jobs. Now it is a part of Apache.
  • Pig is procedural as comparative to Sql which is declarative
  • Allows programmer to control flow of their execution by explicitly declaring the execution plan

Apache ZooKeeper

  • Apache project providing centralized service for configuration information, synchronization,naming registry etc.
  • Allows distributed services to coordinate with each other through a centralized service
  • Supports high availability through redundant services.

Comments


  1. Testing an application is become essential for any product to get an effective result. Your post helps you to gain more info on Testing domain
    Big Data Course in Chennai
    Big Data Training Chennai

    ReplyDelete

Post a Comment

Popular posts from this blog

Brief guide to ecommerce

New age career opportunities in Information Technology after graduation

CQ - How to set replication automatically from Author instance to publisher instance