Apache Spark is an open source cluster computing framework.
Originally developed at the University of California, Berkeley's AMPLab,
the Spark codebase was later donated to the Apache Software Foundation,
which has maintained it since. Spark provides an interface for
programming entire clusters with implicit data parallelism and
fault-tolerance.
Prerequisites
1) A machine with Ubuntu 14.04 LTS operating system installed.
2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)
3) Apache Spark 1.6.1 Software (Download Here)
4) Scala 2.10.5 Software (Download Here)
NOTE
Spark is not a replacement for Hadoop; it is part of the Hadoop ecosystem. Spark can use Hadoop's distributed file system (HDFS) and can also submit jobs to YARN. To make use of Hadoop's components, install Hadoop first and then Spark (How to install Hadoop on Ubuntu 14.04). The Spark download must be compatible with your Hadoop version, so check your Hadoop version before downloading Spark.
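As a quick illustration of the compatibility note, the Hadoop line a Spark binary package was built against is encoded in its file name, and you can parse it out in the shell. The version strings below are examples only; compare the result against what `hadoop version` reports on your own machine:

```shell
# The package name encodes the Hadoop line it was built against
# (spark-1.6.1-bin-hadoop2.6.tgz -> "2.6").
spark_pkg="spark-1.6.1-bin-hadoop2.6.tgz"
built_for="${spark_pkg##*hadoop}"   # strip everything up to "hadoop" -> "2.6.tgz"
built_for="${built_for%.tgz}"       # strip the ".tgz" suffix      -> "2.6"
hadoop_line="2.6"                   # example value; take yours from `hadoop version`
if [ "$built_for" = "$hadoop_line" ]; then
  echo "compatible"
else
  echo "mismatch: package built for hadoop$built_for"
fi
```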
Spark With YARN Configuration
Spark provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or by using the provided launch scripts. It is also possible to run these daemons on a single machine for testing.
Installation Steps
Step 1 - Update the package index. Open a terminal (CTRL + ALT + T) and run the following command. It is advisable to run this before installing any package, and it is necessary for installing the latest updates, even if you have not added or removed any software sources.
$ sudo apt-get update
Step 2 - Install Java 7.
$ sudo apt-get install openjdk-7-jdk
Step 3 - Install the OpenSSH server. SSH is a cryptographic network protocol for operating network services securely over an unsecured network; its best-known application is remote login to computer systems.
$ sudo apt-get install openssh-server
Step 4 - Create a group. We will create a group, configure its sudo permissions, and then add a user to it. Here 'hadoop' is the group name and 'hduser' is a user in the group.
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Step 5 - Configure the sudo permissions for 'hduser'.
$ sudo visudo
Since Ubuntu's default terminal editor is nano, add the following line to the sudoers file, then press CTRL + O to save and CTRL + X to exit (answer Y if prompted to save).
hduser ALL=(ALL) ALL
Step 6 - Create the spark directory.
$ sudo mkdir /usr/local/spark
Step 7 - Change the ownership and permissions of the directory /usr/local/spark. Here 'hduser' is an Ubuntu username.
$ sudo chown -R hduser /usr/local/spark
$ sudo chmod -R 755 /usr/local/spark
Step 8 - Create the scala directory.
$ sudo mkdir /usr/local/scala
Step 9 - Change the ownership and permissions of the directory /usr/local/scala. Here 'hduser' is an Ubuntu username.
$ sudo chown -R hduser /usr/local/scala
$ sudo chmod -R 755 /usr/local/scala
Step 10 - Create the /app/spark/tmp directory (the -p flag creates the intermediate /app and /app/spark directories as well).
$ sudo mkdir -p /app/spark/tmp
Step 11 - Change the ownership and permissions of the directory /app/spark/tmp. Here 'hduser' is an Ubuntu username.
$ sudo chown -R hduser /app/spark/tmp
$ sudo chmod -R 755 /app/spark/tmp
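As a side note, the 755 mode used in these steps grants read/write/execute to the owner and read/execute to the group and everyone else. You can confirm what chmod 755 does on a throwaway directory (the path below is just for demonstration):

```shell
# 7 (rwx) for the owner, 5 (r-x) for the group, 5 (r-x) for others.
mkdir -p /tmp/spark_perm_demo
chmod 755 /tmp/spark_perm_demo
stat -c '%a' /tmp/spark_perm_demo   # prints: 755
```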
Step 12 - Switch user. The su command is used to execute commands with the privileges of another user account.
$ su hduser
Step 13 - Change to the directory containing the download. In my case the downloaded spark-1.6.1-bin-hadoop2.6.tgz file is in the /home/hduser/Desktop folder; yours might be in the Downloads folder, so check.
$ cd /home/hduser/Desktop/
Step 14 - Untar the spark-1.6.1-bin-hadoop2.6.tgz file.
$ tar xzf spark-1.6.1-bin-hadoop2.6.tgz
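For readers new to tar, the flags stand for x (extract), z (decompress with gzip), and f (operate on the named file). Here is a throwaway round trip demonstrating them on a small test archive rather than the Spark tarball itself:

```shell
# Create a small directory, pack it (c = create), then extract it elsewhere.
mkdir -p /tmp/tar_demo/src /tmp/tar_demo/out
echo hello > /tmp/tar_demo/src/file.txt
tar czf /tmp/tar_demo/demo.tgz -C /tmp/tar_demo src
tar xzf /tmp/tar_demo/demo.tgz -C /tmp/tar_demo/out
cat /tmp/tar_demo/out/src/file.txt   # prints: hello
```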
Step 15 - Move the contents of the spark-1.6.1-bin-hadoop2.6 folder to /usr/local/spark.
$ mv spark-1.6.1-bin-hadoop2.6/* /usr/local/spark
Step 16 - Untar the scala-2.10.5.tgz file. In my case the downloaded scala-2.10.5.tgz file is also in the /home/hduser/Desktop folder; yours might be in the Downloads folder, so check.
$ tar xzf scala-2.10.5.tgz
Step 17 - Move the contents of the scala-2.10.5 folder to /usr/local/scala.
$ mv scala-2.10.5/* /usr/local/scala
Step 18 - Edit the $HOME/.bashrc file, adding the Spark and Scala paths.
$ sudo gedit $HOME/.bashrc
Add the following lines:
export SCALA_HOME=/usr/local/scala
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH
Step 19 - Reload your changed $HOME/.bashrc settings.
$ source $HOME/.bashrc
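What the .bashrc additions accomplish is prepending Spark's bin directory to PATH, so that commands like spark-shell and spark-submit resolve from any directory. A minimal sketch, using the paths created earlier in this guide:

```shell
# Prepend Spark's and Scala's bin directories to PATH.
export SPARK_HOME=/usr/local/spark
export SCALA_HOME=/usr/local/scala
export PATH="$SPARK_HOME/bin:$SCALA_HOME/bin:$PATH"
echo "${PATH%%:*}"   # first PATH entry, prints: /usr/local/spark/bin
```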
Step 20 - Change the directory to /usr/local/spark/conf.
$ cd /usr/local/spark/conf
Step 21 - Copy spark-env.sh.template to spark-env.sh.
$ cp spark-env.sh.template spark-env.sh
Step 22 - Edit the spark-env.sh file.
$ sudo gedit spark-env.sh
Step 23 - Add the lines below to the spark-env.sh file. Save and close.
export SCALA_HOME=/usr/local/scala
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=2
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_DIR=/app/spark/tmp
# Options read in YARN client mode
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_EXECUTOR_INSTANCES=2
export SPARK_EXECUTOR_CORES=2
export SPARK_EXECUTOR_MEMORY=1G
export SPARK_DRIVER_MEMORY=1G
export SPARK_YARN_APP_NAME=Spark
Step 24 - Copy spark-defaults.conf.template to spark-defaults.conf.
$ cp spark-defaults.conf.template spark-defaults.conf
Step 25 - Edit the spark-defaults.conf file.
$ sudo gedit spark-defaults.conf
Step 26 - Add the line below to the spark-defaults.conf file. Save and close.
spark.master spark://127.0.0.1:7077
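Besides spark.master, spark-defaults.conf can carry other application defaults. The following are standard Spark 1.6 properties with illustrative values; this guide only requires the spark.master line above:

```
spark.driver.memory     1g
spark.executor.memory   1g
spark.eventLog.enabled  false
```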
Step 27 - Copy slaves.template to slaves.
$ cp slaves.template slaves
Step 28 - Edit the slaves file.
$ sudo gedit slaves
Step 29 - Add the line below to the slaves file. Save and close.
localhost
Step 30 - Change the directory to /usr/local/spark/sbin.
$ cd /usr/local/spark/sbin
Step 31 - Start the Master and all Worker daemons.
$ ./start-all.sh
Step 32 - Verify the daemons with jps. The JPS (Java Virtual Machine Process Status) tool reports information only on JVMs for which it has access permissions.
$ jps
Once Spark is up and running, check the web UI of the Master:
http://127.0.0.1:8080/
Step 33 - Stop the Master and all Worker daemons.
$ ./stop-all.sh
Deploying jobs to YARN
There are 2 types of deployment modes:
1) Client mode
2) Cluster mode
Client mode:
$ ./bin/spark-submit --class com.WordCount2 --master yarn --deploy-mode client --executor-cores 1 --num-executors 1 /home/hduser/Desktop/1.6\ SPARK/WordCount.jar
Cluster mode:
$ ./bin/spark-submit --class com.WordCount2 --master yarn --deploy-mode cluster --executor-cores 1 --num-executors 1 /home/hduser/Desktop/1.6\ SPARK/WordCount.jar
You can see submitted jobs in Hadoop's YARN ResourceManager web UI, by default at http://localhost:8088