Hadoop Version 2 comes from the Hadoop 0.23 branch and introduces several notable changes to Hadoop’s architecture that address scalability, failover, and integration concerns. Major components have been rewritten to support additional features such as High Availability and MapReduce 2.0 (YARN), and to enable Hadoop to scale out past 4,000 machines per cluster.

MAJOR COMPONENTS OF HADOOP V2

HDFS Federation – A major advance in Hadoop V2, HDFS federation brings important measures of scalability and reliability to Hadoop.

HDFS is the Hadoop file system and comprises two major components: the namespace service and the block storage service. The namespace service manages operations on files and directories, such as creating and modifying them. The block storage service implements DataNode cluster management, block operations, and replication.

Hadoop 2 introduces federation within the Hadoop Distributed File System (HDFS), which allows HDFS to scale and enables multitenant operation, both of which are critical requirements in enterprise-class data warehousing.

How does the Federation work?

Federation in Hadoop V2 works by allowing cluster nodes to participate in file storage for multiple namespaces (that is, sets of directories, files, and blocks), and by allowing the namespace layer to scale horizontally with additional NameNodes. In this way, Federation helps Hadoop scale beyond 4,000 nodes and also acts as the primitive that lets enterprises share storage and processing nodes among a community of trusted users, partitioning the cluster and sharing its resources both at rest and at run time. A configuration sketch follows.
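
For illustration, a federated deployment is configured by listing multiple name services in hdfs-site.xml and giving each NameNode its own RPC address (a minimal sketch, assuming two namespaces ns1 and ns2 served by hosts namenode1 and namenode2; the names and hosts are placeholders for your environment):

<configuration>
  <!-- two independent namespaces sharing the same pool of DataNodes -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>namenode1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>namenode2:8020</value>
  </property>
</configuration>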

HIGHLIGHTS

  • Support for POSIX-style extended file system attributes;
  • Enables clients to browse an fsimage via the WebHDFS API, using the OfflineImageViewer (see the sketch after this list);
  • The NFS gateway includes a number of supportability improvements and bug fixes: the Hadoop portmapper is no longer required to run the gateway, and the gateway can now reject connections from unprivileged ports;
  • The SecondaryNameNode, JournalNode, and DataNode web UIs have been modernized with HTML5 and JavaScript.
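
To illustrate the fsimage-browsing feature noted above, the OfflineImageViewer can serve a copy of an fsimage over a read-only WebHDFS endpoint, which ordinary HDFS clients can then browse (a minimal sketch, assuming a release that includes the WebImageViewer processor; the fsimage path and port are placeholders):

hdfs oiv -i /path/to/fsimage_0000000000000000000 -p WebImageViewer -addr 127.0.0.1:5978
hdfs dfs -ls webhdfs://127.0.0.1:5978/   # run from another shell to browse the served image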

YARN: The other major advance in Hadoop 2, YARN brings significant performance improvements for some applications, supports additional processing models, and implements a more flexible execution engine. Further, the YARN-based architecture of HDP 2.0 enables a modern data architecture by providing one enterprise Hadoop platform that integrates with existing, and future, data center technologies.

Large amounts of data from multiple stores are kept in HDFS. To process and analyze that data, you can run MapReduce framework jobs with Pig and Hive. However, to process it with other frameworks, such as graph or streaming engines, the data previously had to be taken out of HDFS, for example into Cassandra or HBase. Hadoop 2.0 addresses this with the YARN (‘Yet Another Resource Negotiator’) APIs, which let other frameworks run on top of HDFS and so enable non-MapReduce big data applications on Hadoop. Spark, MPI, Giraph, and HAMA are a few of the applications written or ported to run within YARN; an example of submitting such an application is shown below.
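
To illustrate running a non-MapReduce application on YARN, the DistributedShell example that ships with Hadoop submits an arbitrary shell command to the cluster (a minimal sketch; the jar path assumes the 2.4.0 layout used later in this article and should be adjusted to your installation):

hadoop jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.4.0.jar \
  org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.4.0.jar \
  -shell_command date \
  -num_containers 2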

HIGHLIGHTS

  • Manages the data load across all the nodes in a Hadoop cluster;
  • Provides the ability to process terabytes and petabytes of data in HDFS using non-MapReduce applications such as MPI and Giraph;
  • YARN’s REST APIs now support write/modify operations, so users can submit and kill applications through them (see the sketch after this list);
  • Provides better resource management, resulting in improved cluster efficiency and application performance. This also improves MapReduce data processing and enables Hadoop usage in other data processing applications;
  • The Timeline Store in YARN, used for storing generic and application-specific information for applications, supports authentication through Kerberos;
  • The Fair Scheduler supports hierarchical user queues that are created dynamically at runtime under any specified parent queue.
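
As an illustration of the REST write support mentioned above, a running application can be killed through the ResourceManager web services (a minimal sketch; the application ID is a placeholder and 8088 is the default ResourceManager web port):

curl -X PUT -H "Content-Type: application/json" \
  -d '{"state":"KILLED"}' \
  http://<master_private_ip>:8088/ws/v1/cluster/apps/application_1400000000000_0001/state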

RESOURCE MANAGER: It’s the ultimate authority that arbitrates resources among all the applications in the system.

HIGHLIGHTS

  • Manages the global assignment of compute resources to applications, while the per-application ApplicationMaster manages each application’s scheduling and coordination; and
  • Splits the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons: a global ResourceManager and a per-application ApplicationMaster.
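
Once the cluster is up, the ResourceManager can be queried from the standard yarn command line to confirm which NodeManagers have registered and which applications are running (a simple check, assuming the client is configured to reach the ResourceManager):

yarn node -list         # NodeManagers registered with the ResourceManager
yarn application -list  # applications currently tracked by the ResourceManager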

CLUSTER SET-UP AND CONFIGURATION

To install and configure Hadoop on the master and slave nodes, the following steps need to be followed:

    1. Install the HotSpot JDK, version 1.6 or above.
    2. Download the Hadoop installation file to each of the master and slave nodes.
    3. Extract the installer to the appropriate directory on each cluster node and modify the hadoop-env.sh script so that JAVA_HOME points to the JDK location.
    4. Set the following Hadoop environment variables in the .bashrc file on all nodes. Adjust these values to suit your specific environment:

export JAVA_HOME=/home/ubuntu/installs/jdk1.6.0_45
export HADOOP_PREFIX=/home/ubuntu/installs/hadoop/hadoop-2.4.0
export PATH="/home/ubuntu/installs/hadoop/hadoop-2.4.0/etc/hadoop:/home/ubuntu/installs/hadoop/hadoop-2.4.0/bin:/home/ubuntu/installs/jdk1.6.0_45/bin:~/installs/hadoop/hadoop-2.4.0/sbin:$PATH"
export HADOOP_HOME=/home/ubuntu/installs/hadoop/hadoop-2.4.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

5) Perform some basic configuration of the Hadoop cluster. On the master and slave nodes, navigate to $HADOOP_HOME/etc/hadoop and modify the following files to configure YARN, set the DFS replication factor, and set up MapReduce.

  a) core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<master_private_ip>:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ubuntu/installs/hadoop/hadoop-2.4.0/tmp</value>
  </property>
</configuration>

  b) yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value><master_private_ip>:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value><master_private_ip>:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value><master_private_ip>:8040</value>
  </property>
</configuration>

  c) hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

  d) mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

6) On the master node only, modify the “slaves” file in $HADOOP_HOME/etc/hadoop to list the private IP addresses of the slave nodes, so that the NodeManager and DataNode daemons run only on the two slave nodes.

Slaves file:

10.75.31.5
10.89.4.229

7) The last step before testing the installation is to format the file system on the NameNode, which in this setup runs on the master node. This is done by executing the following command:

$HADOOP_HOME/bin/hdfs namenode -format
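
After formatting, the cluster can be started from the master node and the running daemons verified on each node (a sketch assuming the sbin scripts of the 2.4.0 layout configured above):

$HADOOP_HOME/sbin/start-dfs.sh   # starts the NameNode on the master and DataNodes on the slaves
$HADOOP_HOME/sbin/start-yarn.sh  # starts the ResourceManager on the master and NodeManagers on the slaves
jps                              # lists the Hadoop Java daemons running on the current node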

CONCLUSION

Rapidly maturing as a versatile and secure platform, Hadoop V2, with its more interactive and specialized processing models, will only further position the Hadoop ecosystem as the dominant big data analysis platform.