
Spark 1.6.1 cluster mode installation on Ubuntu 14.04

Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
Prerequisites
1) A machine with Ubuntu 14.04 LTS operating system installed.
2) Apache Spark 1.6.1 Software (Download Here)
3) Scala 2.10.5 Software (Download Here)
NOTE
Spark is not a replacement for Hadoop; it is part of the Hadoop ecosystem. Spark can use Hadoop's distributed file system (HDFS) and can also submit jobs on YARN. In order to make use of Hadoop's components, you need to install Hadoop first and then Spark (How to install Hadoop on Ubuntu 14.04). The Spark build you download must be compatible with your Hadoop version, so check the Hadoop version before downloading Spark.
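If Hadoop is already installed, you can confirm its version before choosing a Spark package; the spark-1.6.1-bin-hadoop2.6 build used in this post matches Hadoop 2.6.x.
$ hadoop version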
Spark Cluster Mode Installation
This post describes how to install and configure Spark clusters ranging from a few nodes to extremely large clusters. To play with Spark, you may first want to install it on a single machine (see Standalone Mode Setup).
On All machines - (Sparkmaster, Sparkslave1, Sparkslave2, Sparkslave3)
Installation Steps
Step 1 - Update the package lists. Open a terminal (CTRL + ALT + T) and type the following sudo command. It is advisable to run this before installing any package, and necessary for picking up the latest updates, even if you have not added or removed any software sources.

$ sudo apt-get update
Step 2 - Install Java 7 (OpenJDK).

$ sudo apt-get install openjdk-7-jdk
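Once the installation finishes, you can check that the JDK is available; the exact build number in the output will vary.
$ java -version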
Step 3 - Install the OpenSSH server. SSH is a cryptographic network protocol for operating network services securely over an unsecured network. The best-known example application is remote login to computer systems by users.

$ sudo apt-get install openssh-server
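To confirm that the SSH daemon is running (on Ubuntu 14.04 it is managed as the ssh service):
$ sudo service ssh status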
Step 4 - Create a Group. We will create a group, configure the group sudo permissions and then add the user to the group. Here 'hadoop' is a group name and 'hduser' is a user of the group.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Step 5 - Configure the sudo permissions for 'hduser'.
$ sudo visudo
visudo opens the sudoers file in nano, Ubuntu's default terminal text editor.
Add the following line to grant 'hduser' sudo permissions.
hduser ALL=(ALL) ALL
Press CTRL + O followed by ENTER to save the file, then press CTRL + X to exit. (If you exit with unsaved changes, nano asks whether to save; enter Y.)
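As a quick sanity check, you can list the sudo rights just granted; run this from an account that already has sudo access:
$ sudo -l -U hduser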
Step 6 - Edit /etc/hosts file.
$ sudo gedit /etc/hosts
/etc/hosts file. Add every machine's IP address and hostname. Save and close.
10.0.0.1 sparkmaster
10.0.0.2 sparkslave1
10.0.0.3 sparkslave2
10.0.0.4 sparkslave3
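To confirm that the names resolve (the hostnames and IP addresses here are the example values used throughout this post):
$ ping -c 1 sparkmaster
$ ping -c 1 sparkslave1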
Step 7 - Create the /usr/local/spark directory, then change its ownership and permissions. Here 'hduser' is an Ubuntu username.
$ sudo mkdir /usr/local/spark
$ sudo chown -R hduser /usr/local/spark
$ sudo chmod -R 755 /usr/local/spark
Step 8 - Create the /usr/local/scala directory.
$ sudo mkdir /usr/local/scala
Step 9 - Change the ownership and permissions of the directory /usr/local/scala. Here 'hduser' is an Ubuntu username.
$ sudo chown -R hduser /usr/local/scala
$ sudo chmod -R 755 /usr/local/scala
Step 10 - Create the /app/spark/tmp directory. The -p flag creates the missing parent directories as well.
$ sudo mkdir -p /app/spark/tmp
Step 11 - Change the ownership and permissions of the directory /app/spark/tmp. Here 'hduser' is an Ubuntu username.
$ sudo chown -R hduser /app/spark/tmp
$ sudo chmod -R 755 /app/spark/tmp
Step 12 - Switch user. The su command is used to execute commands with the privileges of another user account. From this point on, work as 'hduser'.
$ su hduser
Step 13 - Generating a new SSH public and private key pair on your local computer is the first step towards authenticating with a remote server without a password. Unless there is a good reason not to, you should always authenticate using SSH keys.
$ ssh-keygen -t rsa -P ""
Step 14 - Now you can add the public key to the authorized_keys
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
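SSH can refuse keys when the permissions on the .ssh directory or the authorized_keys file are too open. Tightening them is a precaution, not always required:
$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys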
Step 15 - Add the machine's own hostname to the list of known hosts, so that a script execution doesn't get interrupted by a question about trusting the computer's authenticity.
$ ssh hostname 
Only on Sparkmaster Machine

Step 16 - Change the directory to /home/hduser/Desktop. In my case the downloaded spark-1.6.1-bin-hadoop2.6.tgz file is in the /home/hduser/Desktop folder. For you it might be in the Downloads folder; check it.
$ cd /home/hduser/Desktop/
Step 17 - Untar the spark-1.6.1-bin-hadoop2.6.tgz file.
$ tar xzf spark-1.6.1-bin-hadoop2.6.tgz
Step 18 - Move the contents of the spark-1.6.1-bin-hadoop2.6 folder to /usr/local/spark.
$ mv spark-1.6.1-bin-hadoop2.6/* /usr/local/spark
Step 19 - Untar the scala-2.10.5.tgz file. In my case the downloaded scala-2.10.5.tgz file is in the /home/hduser/Desktop folder. For you it might be in the Downloads folder; check it.
$ tar xzf scala-2.10.5.tgz
Step 20 - Move the contents of the scala-2.10.5 folder to /usr/local/scala.
$ mv scala-2.10.5/* /usr/local/scala
Step 21 - Edit $HOME/.bashrc file by adding the spark and scala path.
$ sudo gedit $HOME/.bashrc
$HOME/.bashrc file. Add the following lines. JAVA_HOME is exported here as well, since the PATH line below refers to it.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export SCALA_HOME=/usr/local/scala
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH
Step 22 - Reload your changed $HOME/.bashrc settings
$ source $HOME/.bashrc
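To verify that the new environment variables have taken effect (the version output is illustrative):
$ echo $SPARK_HOME
$ scala -version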
Step 23 - Change the directory to /usr/local/spark/conf
$ cd /usr/local/spark/conf
Step 24 - Copy the spark-env.sh.template to spark-env.sh
$ cp spark-env.sh.template spark-env.sh
Step 25 - Edit spark-env.sh file
$ sudo gedit spark-env.sh
Step 26 - Add the below lines to spark-env.sh file. Save and Close.
export SCALA_HOME=/usr/local/scala
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=2
export SPARK_MASTER_IP=10.0.0.1
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_DIR=/app/spark/tmp
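With these illustrative values, each slave launches SPARK_WORKER_INSTANCES = 2 worker daemons, each offering SPARK_WORKER_MEMORY = 1g to executors; across the three slaves that gives 2 x 3 = 6 workers and roughly 6 GB of memory for the cluster. Adjust both values to match the RAM actually available on your machines.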
Step 27 - Copy the spark-defaults.conf.template to spark-defaults.conf
$ cp spark-defaults.conf.template spark-defaults.conf
Step 28 - Edit spark-defaults.conf file
$ sudo gedit spark-defaults.conf
Step 29 - Add the below line to spark-defaults.conf file. Save and Close.
spark.master                     spark://10.0.0.1:7077
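Other cluster-wide defaults can be set in the same file. The lines below are optional, purely illustrative values and not required for this setup; the property names are standard Spark 1.6 settings.
spark.executor.memory            512m
spark.eventLog.enabled           false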
Step 30 - Copy the slaves.template to slaves
$ cp slaves.template slaves
Step 31 - Edit slaves file.
$ sudo gedit slaves
Step 32 - Add the below line to slaves file. Save and Close.
10.0.0.2
10.0.0.3
10.0.0.4
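Since the slave hostnames were mapped in /etc/hosts earlier, the slaves file could equivalently list hostnames instead of IP addresses:
sparkslave1
sparkslave2
sparkslave3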
Step 33 - ssh-copy-id is a small script which copies your SSH public key to a remote host, appending it to the remote authorized_keys file.
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@10.0.0.2
Step 34 - ssh is a program for logging into a remote machine and for executing commands on a remote machine. Check whether passwordless remote login works.
$ ssh 10.0.0.2
Step 35 - Exit from remote login.
$ exit 
Repeat steps 33, 34 and 35 for the other machines (Sparkslave2 and Sparkslave3).
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@10.0.0.3
$ ssh 10.0.0.3
$ exit 
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@10.0.0.4
$ ssh 10.0.0.4
$ exit 
Step 36 - Secure copy, or scp, is a means of securely transferring computer files between a local host and a remote host or between two remote hosts. Here we are transferring the configured Spark files from the master to the slave nodes.
$ scp -r /usr/local/spark/* hduser@10.0.0.2:/usr/local/spark
$ scp -r /usr/local/spark/* hduser@10.0.0.3:/usr/local/spark
$ scp -r /usr/local/spark/* hduser@10.0.0.4:/usr/local/spark
Step 37 - Here we are transferring scala files from master to slave nodes.
$ scp -r /usr/local/scala/* hduser@10.0.0.2:/usr/local/scala
$ scp -r /usr/local/scala/* hduser@10.0.0.3:/usr/local/scala
$ scp -r /usr/local/scala/* hduser@10.0.0.4:/usr/local/scala
Step 38 - Here we are transferring configured .bashrc file from master to slave nodes.
$ scp -r $HOME/.bashrc hduser@10.0.0.2:$HOME/.bashrc
$ scp -r $HOME/.bashrc hduser@10.0.0.3:$HOME/.bashrc
$ scp -r $HOME/.bashrc hduser@10.0.0.4:$HOME/.bashrc
Step 39 - Change the directory to /usr/local/spark/sbin
$ cd /usr/local/spark/sbin
Step 40 - Start Master and all Worker Daemons.
$ ./start-all.sh
Step 41 - The jps (Java Virtual Machine Process Status Tool) command reports information on the JVMs for which it has access permissions. Run it to confirm that the Spark daemons started.
$ jps
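Illustrative jps output on the master (process IDs will differ on your machine):
4321 Master
4789 Jps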
Once Spark is up and running, check the master web UI at the following URL.
http://10.0.0.1:8080/
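As an optional end-to-end test, you can submit the bundled SparkPi example to the cluster. The exact examples jar name depends on the downloaded package (spark-1.6.1-bin-hadoop2.6 ships it under the lib folder); adjust the path if yours differs.
$ /usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://10.0.0.1:7077 /usr/local/spark/lib/spark-examples-1.6.1-hadoop2.6.0.jar 10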
Only on slave machines - (Sparkslave1, Sparkslave2, and Sparkslave3)
Step 42 - Run jps on each slave to confirm that the Worker daemons started; jps reports information only on the JVMs for which it has access permissions.
$ jps
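Illustrative jps output on a slave; two Worker processes appear because SPARK_WORKER_INSTANCES is set to 2 (process IDs will differ):
5123 Worker
5189 Worker
5342 Jps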
Only on Sparkmaster Machine

Step 43 - Stop Master and all Worker Daemons.
$ ./stop-all.sh
