
Hadoop Fully Distributed Mode Installation on Ubuntu 14.04

Fully Distributed Mode (Multi Node Cluster)
This post describes how to install and configure Hadoop clusters ranging from a few nodes to extremely large clusters. To play with Hadoop, you may first want to install it on a single machine (see Single Node Setup).
On All machines - (HadoopMaster, HadoopSlave1, HadoopSlave2)
Step 1 - Update the package index. Open a terminal (CTRL + ALT + T) and type the following sudo command. It is advisable to run this before installing any package, and it is necessary for installing the latest updates, even if you have not added or removed any software sources.

$ sudo apt-get update
Step 2 - Install Java 7 (OpenJDK).

$ sudo apt-get install openjdk-7-jdk
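To confirm that the JDK was installed correctly, check the reported version:
$ java -version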
Step 3 - Install the OpenSSH server. SSH is a cryptographic network protocol for operating network services securely over an unsecured network; its best-known application is remote login to computer systems.

$ sudo apt-get install openssh-server
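Optionally, verify that the SSH daemon is running before continuing (on Ubuntu 14.04 the service is named 'ssh'):
$ sudo service ssh status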
Step 4 - Edit /etc/hosts file.
$ sudo gedit /etc/hosts
In the /etc/hosts file, add the IP address and hostname of every machine in the cluster. Save and close.
192.168.2.14    HadoopMaster
192.168.2.15    HadoopSlave1
192.168.2.16    HadoopSlave2
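As a quick sanity check (assuming the example IP addresses above), make sure each hostname resolves and is reachable from every machine:
$ ping -c 1 HadoopMaster
$ ping -c 1 HadoopSlave1
$ ping -c 1 HadoopSlave2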
Step 5 - Create a group and a user. We will create a group, add a new user to it, and then configure sudo permissions for that user. Here 'hadoop' is the group name and 'hduser' is a user in that group.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Step 6 - Configure the sudo permissions for 'hduser'.
$ sudo visudo
Since nano is the default text editor on Ubuntu, the sudoers file opens in nano. Add the following line:
hduser ALL=(ALL) ALL
Press CTRL + O and then Enter to save the file, and CTRL + X to exit.
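To confirm that the new entry took effect, you can list the sudo privileges granted to 'hduser':
$ sudo -l -U hduser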
Step 7 - Create the Hadoop installation directory.
$ sudo mkdir /usr/local/hadoop
Step 8 - Change the ownership and permissions of the directory /usr/local/hadoop. Here 'hduser' is an Ubuntu username.
$ sudo chown -R hduser /usr/local/hadoop
$ sudo chmod -R 755 /usr/local/hadoop
Step 9 - Create the /app/hadoop/tmp directory. The -p flag also creates the parent /app/hadoop directory if it does not already exist.
$ sudo mkdir -p /app/hadoop/tmp
Step 10 - Change the ownership and permissions of the directory /app/hadoop/tmp. Here 'hduser' is an Ubuntu username.
$ sudo chown -R hduser /app/hadoop/tmp
$ sudo chmod -R 755 /app/hadoop/tmp
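You can verify the ownership and permissions of both directories before continuing:
$ ls -ld /usr/local/hadoop /app/hadoop/tmp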
Step 11 - Switch to the 'hduser' account. The su command is used to execute commands with the privileges of another user account.
$ su hduser
Step 12 - Generating a new SSH public and private key pair on your local computer is the first step towards authenticating with a remote server without a password. Unless there is a good reason not to, you should always authenticate using SSH keys.
$ ssh-keygen -t rsa -P ""
Step 13 - Now you can add the public key to the authorized_keys file.
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Step 14 - Add the hostname to the list of known hosts. Connecting once ensures that the host is added to known_hosts, so that later script executions are not interrupted by a question about the computer's authenticity. Replace 'hostname' below with the machine's own hostname.
$ ssh hostname 
Only on HadoopMaster Machine

Step 15 - Switch to the 'hduser' account. The su command is used to execute commands with the privileges of another user account.
$ su hduser
Step 16 - ssh-copy-id is a small script which copies your SSH public key to a remote host, appending it to the remote authorized_keys file.
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@192.168.2.15
Step 17 - ssh is a program for logging into a remote machine and for executing commands on it. Check whether passwordless remote login works.
$ ssh 192.168.2.15
Step 18 - Exit from remote login.
$ exit 
Repeat steps 16, 17 and 18 for the other slave machine (HadoopSlave2).
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@192.168.2.16
$ ssh 192.168.2.16
$ exit
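As a quick check (using the example IP addresses from this guide), the following loop should print each slave's hostname without prompting for a password:
$ for ip in 192.168.2.15 192.168.2.16; do ssh hduser@$ip hostname; done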
Step 19 - Change the directory to /home/hduser/Desktop. In my case the downloaded hadoop-2.6.4.tar.gz file is in the /home/hduser/Desktop folder; in your case it might be in the Downloads folder, so check its location first.
$ cd /home/hduser/Desktop/
Step 20 - Untar the hadoop-2.6.4.tar.gz file.
$ tar xzf hadoop-2.6.4.tar.gz
Step 21 - Move the contents of the hadoop-2.6.4 folder to /usr/local/hadoop.
$ mv hadoop-2.6.4/* /usr/local/hadoop
Step 22 - Edit the $HOME/.bashrc file to add the Java and Hadoop paths.
$ sudo gedit $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file.
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native"

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Step 23 - Reload the changed $HOME/.bashrc settings.
$ source $HOME/.bashrc
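As a quick check that the new environment is in effect, the hadoop command should now be on the PATH and report version 2.6.4:
$ echo $HADOOP_HOME
$ hadoop version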
Step 24 - Change the directory to /usr/local/hadoop/etc/hadoop
$ cd $HADOOP_HOME/etc/hadoop
Step 25 - Edit hadoop-env.sh file.
$ sudo gedit hadoop-env.sh
Step 26 - Add the line below to the hadoop-env.sh file (or uncomment and change the existing JAVA_HOME entry if one is already present). Save and close.
# Set JAVA_HOME to the installed JDK
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Step 27 - Edit core-site.xml file.
$ sudo gedit core-site.xml
Step 28 - Add the following properties between the <configuration> and </configuration> tags of the core-site.xml file. Save and close.
<property>
<name>fs.default.name</name>
<value>hdfs://HadoopMaster:9000</value>
</property>

<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
Step 29 - Edit hdfs-site.xml file.
$ sudo gedit hdfs-site.xml
Step 30 - Add the following properties between the <configuration> and </configuration> tags of the hdfs-site.xml file. Save and close.
<property>
<name>dfs.name.dir</name>
<value>/app/hadoop/tmp/namenode</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/app/hadoop/tmp/datanode</value>
</property>

<property>
<name>dfs.replication</name>
<value>2</value>
</property>

<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>false</value>
</property>

<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>

<property>
<name>dfs.namenode.http-address</name>
<value>HadoopMaster:50070</value>
<description>Your NameNode hostname for http access.</description>
</property>

<property>
<name>dfs.namenode.secondary.http-address</name>
<value>HadoopMaster:50090</value>
<description>Your Secondary NameNode hostname for http access.</description>
</property>
Step 31 - Edit yarn-site.xml file.
$ sudo gedit yarn-site.xml
Step 32 - Add the following properties between the <configuration> and </configuration> tags of the yarn-site.xml file. Save and close.
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Long running service which executes on Node Manager(s) and provides MapReduce Sort and Shuffle functionality.</description>
</property>

<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>Enable log aggregation so application logs are moved onto hdfs and are viewable via web ui after the application completed. The default location on hdfs is '/log' and can be changed via yarn.nodemanager.remote-app-log-dir property</description>
</property>

<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>HadoopMaster:8030</value>
</property>

<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>HadoopMaster:8031</value>
</property>

<property>
<name>yarn.resourcemanager.address</name>
<value>HadoopMaster:8032</value>
</property>

<property>
<name>yarn.resourcemanager.admin.address</name>
<value>HadoopMaster:8033</value>
</property>

<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>HadoopMaster:8088</value>
</property>
Step 33 - Edit mapred-site.xml file.
$ sudo gedit mapred-site.xml
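If gedit opens an empty file, mapred-site.xml probably does not exist yet; Hadoop 2.6.4 ships only a mapred-site.xml.template in this directory. In that case, create the file from the template first and reopen it:
$ cp mapred-site.xml.template mapred-site.xml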
Step 34 - Add the following properties between the <configuration> and </configuration> tags of the mapred-site.xml file. Save and close.
<property>
<name>mapred.job.tracker</name>
<value>HadoopMaster:9001</value>
</property>

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Step 35 - Edit slaves file.
$ sudo gedit slaves
Step 36 - Replace the contents of the slaves file with the lines below (remove the default 'localhost' entry). Save and close.
192.168.2.15
192.168.2.16
Step 37 - Secure copy (SCP) is a means of securely transferring computer files between a local host and a remote host, or between two remote hosts. Here we transfer the configured Hadoop files from the master to the slave nodes.
$ scp -r /usr/local/hadoop/* hduser@192.168.2.15:/usr/local/hadoop
$ scp -r /usr/local/hadoop/* hduser@192.168.2.16:/usr/local/hadoop
Step 38 - Transfer the configured .bashrc file from the master to the slave nodes.
$ scp -r $HOME/.bashrc hduser@192.168.2.15:$HOME/.bashrc
$ scp -r $HOME/.bashrc hduser@192.168.2.16:$HOME/.bashrc
Step 39 - Change the directory to /usr/local/hadoop/sbin
$ cd /usr/local/hadoop/sbin
Step 40 - Format the NameNode. This initializes the HDFS metadata directory (/app/hadoop/tmp/namenode) and should be run only once, when the cluster is first set up.
$ hadoop namenode -format
Step 41 - Start NameNode daemon and DataNode daemon.
$ start-dfs.sh
Step 42 - Start YARN daemons.
$ start-yarn.sh
OR
Instead of steps 41 and 42 you can use the command below, although it is now deprecated.
$ start-all.sh
Step 43 - Run jps to check which daemons are running. The JPS (Java Virtual Machine Process Status) tool reports information only on JVMs for which it has access permissions.
$ jps
Only on slave machines - (HadoopSlave1 and HadoopSlave2)
hduser@HadoopSlave1:~$ jps
hduser@HadoopSlave2:~$ jps
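If everything started correctly, jps on HadoopMaster should typically list NameNode, SecondaryNameNode and ResourceManager, while jps on each slave should list DataNode and NodeManager (process IDs will differ). If a daemon is missing, check its log file under /usr/local/hadoop/logs.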
Only on HadoopMaster Machine

Once the Hadoop cluster is up and running, check the web UIs of the components as described below.
NameNode - Browse the web interface for the NameNode; by default it is available at:
http://HadoopMaster:50070/
ResourceManager - Browse the web interface for the ResourceManager; by default it is available at:
http://HadoopMaster:8088/
Step 44 - Make the HDFS directories required to execute MapReduce jobs.
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
Step 45 - Copy the input files into the distributed filesystem.
$ hdfs dfs -put /usr/local/hadoop/etc/hadoop /user/hduser/input
Step 46 - Run some of the examples provided.
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar grep /user/hduser/input /user/hduser/output 'dfs[a-z.]+'
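As another example (assuming the same input directory uploaded in step 45), the wordcount job from the same examples jar counts word occurrences and writes the result to a separate, arbitrarily named output directory:
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar wordcount /user/hduser/input /user/hduser/wordcount-output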
Step 47 - Examine the output files.
$ hdfs dfs -cat /user/hduser/output/*
Step 48 - Stop NameNode daemon and DataNode daemon.
$ stop-dfs.sh
Step 49 - Stop YARN daemons.
$ stop-yarn.sh
OR
Instead of steps 48 and 49 you can use the command below, although it is now deprecated.
$ stop-all.sh