
Hadoop Commissioning and Decommissioning DataNode

Commissioning new DataNode to existing Hadoop Cluster
Given below are the steps to be followed for adding new nodes to a Hadoop cluster.
On All machines - (HadoopMaster, HadoopSlave1, HadoopSlave2, HadoopSlave3)
Step 1 - Edit /etc/hosts file.
$ sudo gedit /etc/hosts
In the /etc/hosts file, add the IP address and hostname of every machine. Save and close.
192.168.2.14    HadoopMaster
192.168.2.15    HadoopSlave1
192.168.2.16    HadoopSlave2
192.168.2.17    HadoopSlave3
Only on new machine - (HadoopSlave3)
Step 2 - Update the package index. Open a terminal (CTRL + ALT + T) and run the following sudo command. It is advisable to run this before installing any package, and it is necessary for installing the latest updates, even if you have not added or removed any software sources.

$ sudo apt-get update
Step 3 - Install Java 7.

$ sudo apt-get install openjdk-7-jdk
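A quick way to confirm that Java installed correctly is to check the version; this is only a sanity check and is not required by the later steps.
$ java -version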
Step 4 - Install OpenSSH server. SSH is a cryptographic network protocol for operating network services securely over an unsecured network. Its best-known application is remote login to computer systems.

$ sudo apt-get install openssh-server
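Optionally, verify that the SSH daemon is running before continuing; on most Ubuntu releases of this era the service name is assumed to be 'ssh'.
$ sudo service ssh status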
Step 5 - Create a group. We will create a group, configure its sudo permissions, and then add a user to it. Here 'hadoop' is the group name and 'hduser' is a user in that group.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
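To confirm that the user and group were created, the id command lists the groups 'hduser' belongs to; 'hadoop' should appear in the output.
$ id hduser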
Step 6 - Configure the sudo permissions for 'hduser'.
$ sudo visudo
The default editor on Ubuntu is nano. Add the following permissions entry to the sudoers file.
hduser ALL=(ALL) ALL
Press CTRL + O followed by Enter to save the file, then CTRL + X to exit.
Step 7 - Creating hadoop directory.
$ sudo mkdir /usr/local/hadoop
Step 8 - Change the ownership and permissions of the directory /usr/local/hadoop. Here 'hduser' is an Ubuntu username.
$ sudo chown -R hduser /usr/local/hadoop
$ sudo chmod -R 755 /usr/local/hadoop
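You can verify the new ownership and permissions with ls; the directory should now show 'hduser' as the owner.
$ ls -ld /usr/local/hadoop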
Step 9 - Creating the /app/hadoop/tmp directory. The -p flag also creates the missing parent directories.
$ sudo mkdir -p /app/hadoop/tmp
Step 10 - Change the ownership and permissions of the directory /app/hadoop/tmp. Here 'hduser' is an Ubuntu username.
$ sudo chown -R hduser /app/hadoop/tmp
$ sudo chmod -R 755 /app/hadoop/tmp
Step 11 - Switch user. The su command lets you execute commands with the privileges of another user account, in this case 'hduser'.
$ su hduser
Step 12 - Generate an SSH key pair. Creating a new SSH public and private key pair on your local machine is the first step towards authenticating with a remote server without a password. Unless there is a good reason not to, you should always authenticate using SSH keys.
$ ssh-keygen -t rsa -P ""
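The key pair is written to the .ssh directory in the home folder of 'hduser'. Listing it is an optional check that id_rsa and id_rsa.pub were created.
$ ls $HOME/.ssh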
Step 13 - Now you can add the public key to the authorized_keys file.
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
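If passwordless login fails later, it is often due to loose permissions on the .ssh directory; tightening them as shown below is an optional precaution.
$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys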
Step 14 - Add the hostname to the list of known hosts. This is a quick way of making sure that 'localhost' is added to the list of known hosts, so that a script execution doesn't get interrupted by a question about trusting its authenticity.
$ ssh localhost
Only on HadoopMaster Machine

Step 15 - Switch user. The su command lets you execute commands with the privileges of another user account, in this case 'hduser'.
$ su hduser
Step 16 - ssh-copy-id is a small script which copies your SSH public key to a remote host, appending it to the remote authorized_keys file.
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@192.168.2.17
Step 17 - ssh is a program for logging into a remote machine and executing commands on it. Check whether passwordless remote login works.
$ ssh 192.168.2.17
Step 18 - Exit from remote login.
$ exit 
Step 19 - Change the directory to /usr/local/hadoop/etc/hadoop
$ cd $HADOOP_HOME/etc/hadoop
Step 20 - Edit slaves file.
$ sudo gedit slaves
Step 21 - Add the below lines to the slaves file. Save and close.
192.168.2.15
192.168.2.16
192.168.2.17
Step 22 - Secure copy (scp) is a means of securely transferring files between a local host and a remote host, or between two remote hosts. Here we are transferring the configured Hadoop files from the master to the slave nodes.
$ scp -r /usr/local/hadoop/* hduser@192.168.2.17:/usr/local/hadoop
$ scp -r $HADOOP_HOME/etc/hadoop/slaves hduser@192.168.2.15:/usr/local/hadoop/etc/hadoop
$ scp -r $HADOOP_HOME/etc/hadoop/slaves hduser@192.168.2.16:/usr/local/hadoop/etc/hadoop
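As an optional check, you can list the Hadoop directory on the new node over SSH to confirm the files were copied.
$ ssh hduser@192.168.2.17 ls /usr/local/hadoop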
Step 23 - Here we are transferring the configured .bashrc file from the master to the new slave node.
$ scp -r $HOME/.bashrc hduser@192.168.2.17:$HOME/.bashrc
Only on new machine - (HadoopSlave3)
Step 24 - Change the directory to /usr/local/hadoop
$ cd /usr/local/hadoop
Step 25 - Start the DataNode daemon.
$ sbin/hadoop-daemon.sh start datanode
Step 26 - Start the NodeManager daemon.
$ sbin/yarn-daemon.sh start nodemanager
Step 27 - The JPS (Java Virtual Machine Process Status Tool) tool is limited to reporting information on JVMs for which it has the access permissions.
$ jps
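If both daemons started correctly, the jps output on HadoopSlave3 should list DataNode and NodeManager along with Jps itself; the process IDs below are only illustrative.
2801 DataNode
2954 NodeManager
3120 Jps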
Decommissioning existing DataNode from Hadoop Cluster
We can remove a node from a cluster on the fly, while it is running, without any data loss. HDFS provides a decommissioning feature, which ensures that removing a node is performed safely. To use it, follow the steps as given below:
Only on HadoopMaster Machine

Step 1 - Change the directory to /usr/local/hadoop/etc/hadoop
$ cd $HADOOP_HOME/etc/hadoop
Step 2 - Edit hdfs-site.xml file.
$ sudo gedit hdfs-site.xml
Step 3 - Add the below lines to hdfs-site.xml file. Save and Close.
<property>
<name>dfs.hosts.exclude</name>
<value>/usr/local/hadoop/hdfs_exclude.txt</value>
<description>DFS exclude</description>
</property>
Step 4 - Change the directory to /usr/local/hadoop
$ cd $HADOOP_HOME
Step 5 - Create the hdfs_exclude.txt file and open it for editing.
$ gedit hdfs_exclude.txt
Step 6 - Add the following line to hdfs_exclude.txt file. Save and close.
192.168.2.17
Step 7 - Change the directory to /usr/local/hadoop/sbin
$ cd $HADOOP_HOME/sbin
Step 8 - Refresh the nodes. This makes the NameNode re-read the exclude file and begin decommissioning the listed node.
$ hadoop dfsadmin -refreshNodes
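Decommissioning takes time because the block replicas held by the node must be re-replicated to other DataNodes. You can watch the progress with the dfsadmin report; the grep filter below is just a convenience to surface the relevant lines, and the status moves from 'Decommission in progress' to 'Decommissioned' when it is safe to stop the node.
$ hadoop dfsadmin -report | grep "Decommission Status"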
Only on new machine - (HadoopSlave3)
Step 9 - Check with jps whether the NodeManager is still running. If it is, stop it.
$ jps
Step 10 - Change the directory to /usr/local/hadoop/sbin
$ cd $HADOOP_HOME/sbin
Step 11 - Stop NodeManager daemon.
$ yarn-daemon.sh stop nodemanager
After the decommission process has completed, the decommissioned hardware can be safely shut down for maintenance. Run the dfsadmin report command to check the status of the decommission. The following command describes the status of the decommissioned node and the nodes connected to the cluster.
Only on HadoopMaster Machine

$ hadoop dfsadmin -report
