I'm a software developer with a great interest in data analysis and statistics.
What Is Big Data Analysis?
The term "Big Data" itself indicates that the data is huge in size, on the order of gigabytes, terabytes, or even petabytes. The latest technologies and devices, together with advances in social media, mean that day-to-day activity on these systems generates enormous amounts of data. This data cannot be handled or processed with traditional relational databases. Big Data refers to datasets that cannot be stored, managed, and analyzed using traditional database software tools or a single conventional computer.
Sources of Big Data
- Power Grid
- Social media sites
- Stock exchanges
- Telecommunication industries
- IoT (sensors etc.)
Characteristics of Big Data
To handle Big Data, it is very important to understand its characteristics. The properties of Big Data are represented by the four V's:
Volume is related to the size of the data: how big is the data? Based on the volume of data, we can consider whether or not it is Big Data.
Velocity is the frequency at which the source data arrives and needs to be processed. Data can be sent daily, hourly, or in real time, like social media data.
Variety refers to the different types of data available: structured, semi-structured, or unstructured. Based on the format of the data, it is divided into three categories.
- Structured Data: All relational databases are examples of structured data, as they have a defined structure with defined data types for the fields in the tables.
- Semi-Structured Data: The XML and JSON formats come under semi-structured data, as they may have a defined hierarchy of elements but do not always have defined data types.
- Unstructured Data: Word or PDF documents, text files, and media/server logs are unstructured data.
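To make the three categories concrete, here is a minimal Python sketch (the record fields and values are invented for illustration) showing the same customer-feedback information in structured, semi-structured, and unstructured form:

```python
import json

# The same "customer feedback" record in three shapes (values are invented).

# Structured: a fixed schema with typed columns, like a relational table row.
structured_row = ("C001", "Alice", 5)  # (customer_id, name, rating)

# Semi-structured: JSON has a hierarchy, but fields and types can vary per record.
semi_structured = json.loads(
    '{"id": "C001", "name": "Alice", "tags": ["prompt", "helpful"]}'
)

# Unstructured: free text with no schema at all.
unstructured = "Alice said the delivery was prompt and the support was helpful."

print(semi_structured["tags"])  # ['prompt', 'helpful']
```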
Veracity is about the trustworthiness of the data. It is obvious that there will be some level of discrepancy in the data received.
Big Data Life Cycle
In general, the following processes are involved in analyzing Big Data.
- Data manipulation
- Data cleaning
- Data integration, such as annotating data with different data sources
- Computing and analyzing the data by applying various methods of data analysis
- Visualizing the results in the form of dashboards or graphs
Benefits of Analyzing Big Data
It is not just about how big the data is, but about how to use and analyze it to predict the future and make smart business decisions. Big Data analysis can help drive the business: analyzing product data can help launch a new product, and analyzing customer feedback data can help improve company services. It also supports managerial decisions and the planning of better business strategies.
- Cost Saving: Big Data technologies like cluster/cloud-based computing in Apache Hadoop or Apache Spark save us from buying highly configured machines for processing Big Data.
- Time Reduction: High-speed in-memory computation reduces the time needed to process data and enables us to make quick decisions.
- New Product Development: It helps in knowing customer needs and satisfaction levels for the next product to be developed.
- Understanding market trends: Knowing customer buying patterns or the most purchased items can help to understand market demand.
- Sentiment analysis: Mining the opinion of the customers from various social media sites can help any product or service industries to improve their offerings.
Big Data Technologies & Tools
Let us briefly talk about the technologies and tools available to solve different kinds of problems in Big Data analytics.
Apache Hadoop & Ecosystem
It is an open source framework to process/compute data in parallel and a standard platform for processing Big Data. It originated from Google's MapReduce and Google File System papers.
It is an ecosystem project containing many other projects, like Pig, Hive, Kafka, etc.
Other analytics tools include Apache Spark and Apache Storm.
Apache Spark is more advanced than Apache Hadoop and is a multi-purpose engine: a general-purpose data access engine for fast, large-scale data processing. It is designed for iterative, in-memory computations and interactive data mining. It provides multi-language support for Java, Scala, Python, and R. It has various built-in libraries that enable data workers to rapidly iterate over data for ETL, machine learning, SQL, and stream processing.
There are many other ways of processing Big Data; the above are two basic frameworks. Examples include Apache Hive for data warehousing, Apache Pig for querying Big Data, Apache Drill, Apache Impala, Apache Spark SQL, Presto, and many more.
Apache SystemML, Apache Mahout, and Apache Spark's MLlib are very useful for applying various machine learning algorithms to the data.
Apache Hive runs on top of Hadoop and supports HiveQL to query Big Data.
Apache Pig is for people who do not know how to program in languages like Java and Scala. They can use Pig to analyze the data easily, as it provides query access to the data.
Apache Sqoop helps to transfer structured datasets from relational databases to Hadoop.
Facebook developed an open source query engine called Presto, which can handle petabytes of data. Unlike Hive, it does not depend on the MapReduce paradigm and can fetch data in very little time.
Apache Storm is an open source platform for distributed real-time stream processing.
Apache Kafka is a very fast, durable, fault-tolerant, publish-subscribe messaging system.
Apache Ambari is a platform to provision, manage, and monitor Hadoop clusters. Ambari provides core services for operations and development, and extension points for both.
Apache Zeppelin is a web-based notebook for data engineers, data analysts, and data scientists. It brings interactive data ingestion, data exploration, visualization, sharing, and collaboration features to Hadoop and Spark.
Big Data Databases
If we want to store Big Data in a database, we should use a parallel database or a database with a shared-nothing architecture, like Vertica, Netezza, Aster, Greenplum, etc.
Google Bigtable, Amazon Dynamo, Apache HBase, Apache Cassandra, etc. are examples of NoSQL databases.
To be able to search this Big Data faster, there are many solutions/engines available, like Apache Solr or Elasticsearch. Apache Solr is a powerful search engine.
Hadoop is an open source framework for distributed processing of large datasets across clusters of machines. It provides distributed storage (a file system) as well as distributed computing on a cluster.
Hadoop has four basic components: MapReduce, HDFS, YARN, and Hadoop Common.
It is a programming paradigm to process Big Data in parallel on the clusters of commodity hardware with reliability and fault-tolerance.
The pattern is about breaking down a problem into small pieces of work. Map, Reduce and Shuffle are three basic operations of MapReduce.
- Map: This takes the input data and converts it into a set of data where each line of input is broken down into a key-value pair (tuple).
- Reduce: This task takes input from the Map phase's output and combines(aggregates) data tuples into smaller sets based on keys.
- Shuffle: This is the process of transferring the data from the mappers to the reducers.
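The three operations above can be illustrated with a small word-count sketch in plain Python. This simulates all three phases in a single process; it is not Hadoop's Java API, and the input lines are invented for illustration:

```python
from collections import defaultdict

# Input: each line is one record fed to the map phase.
lines = ["big data is big", "data needs processing"]

# Map: break every line into (word, 1) key-value pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values by key, as if transferring each key's
# tuples from the mappers to a single reducer.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each key's values into a smaller set (here, a count).
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}
```

In real Hadoop the mappers and reducers run on different machines and the shuffle moves data over the network, but the data flow is the same.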
The cluster has one master JobTracker, and each slave node of the cluster runs one TaskTracker.
- JobTracker is responsible for resource management and tracking resources’ availability. It schedules the job tasks on the slaves. It is a single point of failure, meaning if it goes down, all running jobs are halted.
- TaskTrackers execute the tasks as assigned by the master and periodically report task status back to it.
HDFS (Hadoop Distributed File System)
It is the file system provided by Hadoop. It is based on the Google File System (GFS) and runs on a cluster of machines in a reliable and fault-tolerant manner. It has a master/slave architecture.
- NameNode: It manages the metadata of the file system and stores the locations of the data.
- DataNode: The actual data is stored on the DataNode.
- Secondary NameNode: The NameNode also copies its metadata to the Secondary NameNode as a backup checkpoint; despite its name, the Secondary NameNode is not a live standby, but its checkpoint can be used to restore the metadata when the NameNode goes down.
A file in HDFS is split into chunks of data called blocks, and those blocks are then stored on the DataNodes. The NameNode keeps the mapping of blocks to DataNodes. HDFS provides a shell interface with a list of commands to interact with the file system.
YARN (Yet Another Resource Negotiator)
YARN, a feature of Hadoop 2, is a resource manager. Its benefits include:
- Multi-tenancy: Allows multiple engines to use Hadoop which can simultaneously access the same dataset.
- Cluster utilization: YARN schedules work so that the cluster's resources are utilized in an optimized way.
- Scalability: YARN focuses purely on scheduling, which lets it keep up as clusters expand.
- Compatibility: Existing MapReduce applications which are developed with Hadoop 1 can run on YARN without any disruption.
How Does the MapReduce Architecture Work
A user can submit a job by providing the following parameters.
- The location of input and output files.
- The JAR file containing the classes of the map and reduce implementations.
- The job configuration by setting different parameters for a specific job.
The Hadoop job client then submits the job and configuration to the JobTracker, which in turn distributes the code/configuration to the slaves, schedules the tasks, and monitors them.
TaskTrackers on different nodes execute the tasks as per the MapReduce implementation and write the output data to HDFS.
Hadoop - Environment Setup
Java is required. Check whether the system already has Java installed with the following command:
java -version
It will give you the version details if Java is already installed; if not, follow these steps to install Java on your machine.
- Download Java from http://www.oracle.com/technetwork/java/javase/downloads/index.html and extract it.
- Move it to the /usr/local/ or your preferred location for making it available to all the users.
- Set PATH and JAVA_HOME environment variable:
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
- Verify the Java installation:
java -version
Download and extract the latest Hadoop version from the Apache software foundation.
Following are the modes in which we can operate Hadoop.
- Standalone Mode: In this mode, Hadoop runs as a single Java process locally.
- Pseudo-Distributed Mode: In this mode, each Hadoop daemon (HDFS, YARN, MapReduce) runs as a separate Java process, simulating a distributed setup on a single machine. This mode is mostly preferred in development.
- Fully Distributed Mode: In this mode, Hadoop runs on a cluster of more than one machine.
Standalone Hadoop Installation
- We have already downloaded and extracted Hadoop; we can move it to a preferred location such as /usr/local/, and we need to set up environment variables, for example:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
- Verify the Hadoop version installed by the following command.
hadoop version
- If you get the version detail, it means standalone mode is working fine.
- You can now run the examples jar, we will see this later.
Pseudo-Distributed Hadoop Installation
- Replace the Java path in the hadoop-env.sh file by setting the JAVA_HOME value to the location of the Java installation directory on your machine.
- core-site.xml contains the configurations for the Hadoop instance: port number, memory limits, the location of data storage, and the sizes of read/write buffers. Edit the file and add the required configuration.
- The hdfs-site.xml file contains configurations for the replication factor and the paths of the NameNode and DataNodes on your local machine. Open the file and add the configurations as per your requirements.
- yarn-site.xml is used to configure YARN into Hadoop.
- mapred-site.xml is used to specify the configuration of the MapReduce framework we need to use.
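As a rough sketch of what such a configuration file looks like (the host, port, and values below are typical single-machine examples, not values prescribed here), a minimal core-site.xml might be:

```xml
<!-- core-site.xml: example values for a pseudo-distributed, single-machine setup -->
<configuration>
  <property>
    <!-- URI of the default file system; localhost:9000 is a common choice -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

hdfs-site.xml follows the same configuration/property pattern, typically setting dfs.replication (e.g. 1 on a single machine) and the dfs.namenode.name.dir and dfs.datanode.data.dir storage paths.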
Hadoop Installation Verification
- Set up the NameNode using the following command.
hdfs namenode -format
- Verify the Hadoop DFS by starting it with the following command.
start-dfs.sh
- Verify the YARN script by starting it with the following command.
start-yarn.sh
- Access Hadoop in a browser at http://localhost:50070/
- Verify all the applications running on the cluster at http://localhost:8088/
What is HDFS?
HDFS is a distributed file system based on Google’s File System(GFS). It runs on commodity hardware. It provides storage for the applications running on top of Hadoop.
HDFS follows Master/Slave architecture and consists of the following elements in its architecture.
How does it work?
It takes data in the form of files, splits them into chunks called blocks, and distributes them across various DataNodes in the cluster. It also replicates each piece of data to another server rack so that the data can be recovered in case of failure.
The file in HDFS is divided into segments called blocks. The default size of a block, the largest unit of data HDFS stores in one segment, is 64 MB, and it can be changed in the HDFS configuration.
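As a quick illustration of the arithmetic (the 200 MB file size is an invented example), the number of blocks a file occupies is its size divided by the block size, rounded up:

```python
import math

block_size_mb = 64   # default block size mentioned above
file_size_mb = 200   # hypothetical file size for illustration

# A file is stored as ceil(file_size / block_size) blocks; here the
# last block holds only the remainder (200 - 3*64 = 8 MB).
num_blocks = math.ceil(file_size_mb / block_size_mb)
print(num_blocks)  # 4
```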
The NameNode is commodity hardware that acts as the master. Following is the list of its tasks.
- It manages the file system namespace. It stores the metadata of the files stored on the slaves. It stores data in RAM and not on the disk.
- It regulates the access of data to clients.
- It also executes file operations like renaming a file, opening file, etc.
The file system image, which contains the metadata information, is kept in the main memory of the NameNode.
As new entries come in, they are captured in the edit log. The Secondary NameNode periodically takes a copy of the edit log and the file system image from the NameNode, merges them, creates a new copy, and uploads it back to the NameNode.
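The checkpointing just described can be sketched in Python. This is a toy model: dicts and lists stand in for the real fsimage and edit-log files, and the paths and operations are invented:

```python
# fsimage: snapshot of file-to-block metadata at the last checkpoint.
fsimage = {"/data/a.txt": ["block_1"], "/data/b.txt": ["block_2"]}

# Edit log: changes recorded since that checkpoint.
edit_log = [
    ("create", "/data/c.txt", ["block_3"]),
    ("delete", "/data/a.txt", None),
]

# Secondary NameNode: replay the edit log over a copy of the fsimage
# to build a fresh checkpoint, which is uploaded back to the NameNode.
checkpoint = dict(fsimage)
for op, path, blocks in edit_log:
    if op == "create":
        checkpoint[path] = blocks
    elif op == "delete":
        checkpoint.pop(path, None)

print(checkpoint)  # {'/data/b.txt': ['block_2'], '/data/c.txt': ['block_3']}
```

After the merge, the edit log can be truncated, which keeps NameNode restarts fast.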
All DataNodes run on commodity hardware and act as slaves. DataNode responsibilities:
- Performs read/write operations.
- They also perform operations like block creation, deletion, and replication according to the request from Namenode.
Features of HDFS
In HDFS, fault tolerance refers to the ability to handle unfavorable situations. When any machine in the cluster goes down due to some failure, a client can still access the data thanks to HDFS's replication facility: HDFS replicates data blocks to another rack, so the user can access the data from that rack when a machine goes down.
Just like fault tolerance, HDFS is a highly available file system: a user can access data from the nearest machines in the cluster even if some machines have failed. The NameNode, which contains the metadata of the files stored on the DataNodes, keeps copying that metadata to the Secondary NameNode for backup purposes, so when the NameNode fails its state can be recovered from the Secondary NameNode. This is called NameNode high availability.
HDFS can store data in the range of 100 PB. It is distributed, reliable data storage: it makes data reliable by creating replicas of the blocks, so no data loss happens in critical conditions.
This is the most important feature of HDFS.
Minimal Data Motion
Hadoop moves the code to the data on HDFS which reduces network I/O and saves bandwidth.
HDFS stores data on various machines, so when requirements increase we can scale the cluster.
- Vertical Scalability: It is about adding more resources like CPU, memory, and disk on the existing nodes of the cluster.
- Horizontal Scalability: It is about adding more machines to the clusters on the fly without any downtime.
When you have freshly installed Hadoop and configured HDFS, open the NameNode and execute the following command. It formats the HDFS.
hadoop namenode -format
The following command will start the distributed file system.
start-dfs.sh
Listing files in HDFS
hadoop fs -ls <args>
This will list the files in the given path.
Inserting Data in HDFS
- Use the following command to create an input directory.
hadoop fs -mkdir <input dir path>
- To insert data from the local file system into HDFS:
hadoop fs -put <local input file path that to be put on HDFS> <input dir path>
- Verify the file using the ls command.
hadoop fs -ls <input dir path>
Retrieving Data in HDFS
- View the data using cat commands.
hadoop fs -cat <path of the file to be viewed>
- Get the file from HDFS to the local file system.
hadoop fs -get <HDFS file path> <local file system path where the file is to be saved>
Shutting Down the HDFS
This command will stop the HDFS.
stop-dfs.sh
Following is a list of other useful HDFS shell commands.
- ls <path>: Lists the content of the given directory.
- lsr <path>: Recursively displays the content of subdirectories too.
- du <path>: Displays the disk usage.
- dus <path>: Prints a summary of the disk usage.
- mv <src> <dest>: Moves the file or directory to destination.
- cp <src> <dest>: Copies the file or directory to the destination.
- rm <path>: Removes the given file or empty directory.
- rmr <path>: Recursively removes the given file or directory, including non-empty directories.
- put <local src> <dest>: copies the file or directory from the local to the HDFS.
- copyFromLocal <local src> <dest>: same as -put
- moveFromLocal <local src> <dest>: Moves the file or directory from the local to HDFS.
- get <src> <local dest>: Copies file or directory from HDFS to local.
- getmerge <src> <local dest>: It retrieves the files and then merge them in a single file.
- cat <filename>: To view the content of the file.
- copyToLocal <src> <local dest> : same as -get
- moveToLocal <src> <local dest>: It works like -get but then deletes from the HDFS.
- mkdir <path>: It makes a directory in the HDFS.
- touchz <path>: To create an empty file at the given HDFS path.
- test -[ezd] <path>: Tests the given HDFS path: -e checks whether the path exists, -z whether the file is zero length, and -d whether the path is a directory; the command returns 0 when the test succeeds.
- stat [format] <path>: It prints the path information.
- tail [-f] <filename>: Displays the last kilobyte of the file.
- chmod: To change the file or directories permissions.
- chown: To set the owner of the file or directories.
- chgrp: To change the group of the file or directories.
- help <command-name> : To display the usage detail of any of the commands.
This content is accurate and true to the best of the author’s knowledge and is not meant to substitute for formal and individualized advice from a qualified professional.
© 2019 Sam Shepards