Apache Spark is an open source cluster computing framework.
Originally developed at the University of California, Berkeley's AMPLab,
the Spark codebase was later donated to the Apache Software Foundation,
which has maintained it since. Spark provides an interface for
programming entire clusters with implicit data parallelism and
fault tolerance.
Prerequisites
1) A machine with the Ubuntu 14.04 LTS operating system
2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)
3) Apache Spark 1.6.1 pre-installed (How to install Spark on Ubuntu 14.04)
Spark Shell Usage
The Spark shell provides an easy and convenient way to prototype operations quickly, without having to develop a full program, package it, and deploy it.
Step 1 - Change the directory to /usr/local/hadoop/sbin.
Step 2 - Start all Hadoop daemons.
Step 3 - Change the directory to /usr/local/spark/sbin.
Step 4 - Start all Spark daemons.
Step 5 - Verify the running daemons with the jps command. The JPS (Java Virtual Machine Process Status Tool) is limited to reporting information on JVMs for which it has access permissions.
Step 6 - Copy an in.txt file from the local file system to HDFS. This can be any regular text file that contains paragraphs of text.
Step 7 - The following command is used to open the Spark shell.
Step 8 - Create an RDD. First, we have to read the input file using the Spark Scala API and create an RDD.
Step 9 - Execute the word count transformation. Our aim is to count the words in the file.
Step 10 - Current RDD. While working with an RDD, if you want to know about the current RDD, use the following command. It shows a description of the current RDD and its dependencies, which is useful for debugging.
Step 11 - Caching the transformations. Use the following command to store the intermediate transformations in memory. Note that cache() is lazy: the RDD is actually materialized in memory the first time an action computes it.
Step 12 - Applying the action. Applying an action, such as saveAsTextFile, writes the result of all the transformations to a text file. The String argument to the saveAsTextFile(" ") method is the absolute path of the output folder.
Step 13 - Un-persist the RDD. If you want to free the storage space used by a particular RDD, use the following command.
Step 14 - Web UI. If you want to see the storage space used by this application, open the following URL in your browser.
Step 15 - Exit from the Spark shell. It's time to say goodbye to spark-shell.
Step 1 - Change the directory to /usr/local/hadoop/sbin.
$ cd /usr/local/hadoop/sbin
Step 2 - Start all Hadoop daemons.
$ ./start-all.sh
Step 3 - Change the directory to /usr/local/spark/sbin.
$ cd /usr/local/spark/sbin
Step 4 - Start all Spark daemons.
$ ./start-all.sh
Step 5 - Verify the running daemons with the jps command. You should see the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager) together with the Spark Master and Worker.
$ jps
Step 6 - Copy an in.txt file from the local file system to HDFS. This can be any regular text file that contains paragraphs of text.
$ hdfs dfs -copyFromLocal /home/hduser/Desktop/in.txt /user/hduser/
Step 7 - The following command is used to open the Spark shell.
$ spark-shell
Step 8 - Create an RDD. First, we have to read the input file using the Spark Scala API and create an RDD.
scala> val inputfile = sc.textFile("/user/hduser/in.txt")
Step 9 - Execute the word count transformation. Our aim is to count the words in the file.
scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
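The transformation chain above can be sketched in plain Python (no Spark required) to show what each stage contributes: flatMap splits every line and flattens the results, map pairs each word with the count 1, and reduceByKey sums the counts per distinct word. The sample lines below are made up for illustration:

```python
# Plain-Python sketch of the word-count pipeline (illustration only, no Spark).
# The sample lines stand in for the contents of in.txt.
lines = ["to be or not to be", "to see or not to see"]

# flatMap: split every line into words and flatten into a single list
words = [word for line in lines for word in line.split(" ")]

# map: pair each word with the count 1
pairs = [(word, 1) for word in words]

# reduceByKey(_+_): sum the counts for each distinct word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```

In Spark the same steps run lazily and in parallel across partitions; nothing is computed until an action is applied.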
Step 10 - Current RDD. While working with an RDD, if you want to know about the current RDD, use the following command. It shows a description of the current RDD and its dependencies, which is useful for debugging.
scala> counts.toDebugString
Step 11 - Caching the transformations. Use the following command to store the intermediate transformations in memory.
scala> counts.cache()
Step 12 - Applying the action. The String argument to the saveAsTextFile(" ") method is the absolute path of the output folder.
scala> counts.saveAsTextFile("/user/hduser/scalaout")
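saveAsTextFile creates the output folder in HDFS and writes one part file per partition (part-00000, part-00001, and so on). Each line holds one pair in Scala's tuple text form, e.g. (spark,3); the exact format is an assumption worth checking against your own output. A small Python sketch for reading such a line back into a pair:

```python
# Parse one line of saveAsTextFile output back into a (word, count) pair.
# Assumes the Scala Tuple2 text form "(word,count)", e.g. "(spark,3)".
def parse_pair(line):
    word, count = line.strip()[1:-1].rsplit(",", 1)
    return word, int(count)

print(parse_pair("(spark,3)"))  # ('spark', 3)
```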
Step 13 - Un-persist the RDD. Use the following command to free the storage space used by the RDD.
scala> counts.unpersist()
Step 14 - Web UI. To see the storage space used by this application, open the following URL in your browser.
http://localhost:4040/jobs/
Step 15 - Exit from the Spark shell.
scala> exit