Apache Spark is an open-source cluster-computing framework.
Originally developed at the University of California, Berkeley's AMPLab,
the Spark codebase was later donated to the Apache Software Foundation,
which has maintained it since. Spark provides an interface for
programming entire clusters with implicit data parallelism and
fault tolerance.
Prerequisites
1) A machine with Ubuntu 14.04 LTS operating system
2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)
3) Apache Spark 1.6.1 pre-installed (How to install Spark on Ubuntu 14.04)
Spark WordCount Java Example
Step 1 - Add the following two Spark jar files to your Java project:
spark-assembly-1.6.1-hadoop2.6.0.jar
spark-core_2.10-0.9.0-incubating.jar
Step 2 - Create a WordCount.java file in your project with the following code.

WordCount.java

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCount {

    // Splits each input line into words on single spaces.
    private static final FlatMapFunction<String, String> WORDS_EXTRACTOR =
            new FlatMapFunction<String, String>() {
                private static final long serialVersionUID = 1L;

                public Iterable<String> call(String s) throws Exception {
                    System.out.println("text file: " + s);
                    return Arrays.asList(s.split(" "));
                }
            };

    // Maps each word to a (word, 1) pair.
    private static final PairFunction<String, String, Integer> WORDS_MAPPER =
            new PairFunction<String, String, Integer>() {
                private static final long serialVersionUID = 1L;

                public Tuple2<String, Integer> call(String s) throws Exception {
                    return new Tuple2<String, Integer>(s, 1);
                }
            };

    // Sums the counts for each word.
    private static final Function2<Integer, Integer, Integer> WORDS_REDUCER =
            new Function2<Integer, Integer, Integer>() {
                private static final long serialVersionUID = 1L;

                public Integer call(Integer a, Integer b) throws Exception {
                    return a + b;
                }
            };

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("com.WordCount")
                .setMaster("spark://127.0.0.1:7077")
                .set("spark.akka.heartbeat.interval", "100")
                .set("spark.local.ip", "127.0.0.1");
        JavaSparkContext context = new JavaSparkContext(conf);

        // Read the input from HDFS with a minimum of 2 partitions.
        JavaRDD<String> file =
                context.textFile("hdfs://localhost:9000/user/hduser/dcin", 2);

        JavaRDD<String> words = file.flatMap(WORDS_EXTRACTOR);
        JavaPairRDD<String, Integer> pairs = words.mapToPair(WORDS_MAPPER);
        JavaPairRDD<String, Integer> counter = pairs.reduceByKey(WORDS_REDUCER);

        counter.saveAsTextFile("hdfs://localhost:9000/user/hduser/sparkout1/");
        System.out.println("end");
        context.close();
    }
}
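The three Spark transformations above (flatMap, mapToPair, reduceByKey) compute an ordinary word count. As a minimal, Spark-free sketch of what the job does, the same pipeline can be written with java.util.stream over an in-memory list of lines. The class name WordCountSketch and the sample input are illustrative, not part of the tutorial; note also that split(" ") as used above yields empty tokens on repeated spaces, so this sketch splits on \s+ and filters empties.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountSketch {

    // Plain-Java equivalent of flatMap(WORDS_EXTRACTOR)
    // -> mapToPair(WORDS_MAPPER) -> reduceByKey(WORDS_REDUCER).
    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+"))) // flatMap: line -> words
                .filter(w -> !w.isEmpty())                          // drop empty tokens
                .collect(Collectors.toMap(
                        Function.identity(),  // mapToPair: word -> (word, 1)
                        w -> 1,
                        Integer::sum));       // reduceByKey: (a, b) -> a + b
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countWords(Arrays.asList("to be or", "not to be"));
        System.out.println(counts.get("to")); // 2
        System.out.println(counts.get("be")); // 2
        System.out.println(counts.get("or")); // 1
    }
}
```

The difference in Spark is only that the collection is a distributed RDD, so the reduce step runs per-partition and then merges across the cluster.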
Step 3 - Change the directory to /usr/local/hadoop/sbin.

$ cd /usr/local/hadoop/sbin
Step 4 - Start all Hadoop daemons.

$ ./start-all.sh
Step 5 - Change the directory to /usr/local/spark/sbin.
$ cd /usr/local/spark/sbin
Step 6 - Start all Spark daemons.

$ ./start-all.sh
Step 7 - Verify that the Hadoop and Spark daemons are running with the jps (Java Virtual Machine Process Status) tool. Note that jps can only report on JVMs for which it has access permissions.
$ jps
Step 8 - Run your WordCount program by submitting the project's jar file to Spark (building the jar file is left to you).

Run on cluster
$ ./bin/spark-submit --class com.WordCount --master spark://127.0.0.1:7077 /home/hduser/Desktop/1.6\ SPARK/WordCount.jar
Run locally
$ ./bin/spark-submit --class com.WordCount --master local[2] /home/hduser/Desktop/1.6\ SPARK/WordCount.jar
Output - The output files are written to hdfs://localhost:9000/user/hduser/sparkout1/. To browse them, use the NameNode web UI at http://localhost:50070 (port 9000 is HDFS's RPC endpoint, not a web page).

Executed application - You can see the completed application in the Spark master web UI at http://localhost:8080 (port 7077 is the master's RPC port, not the web UI).
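saveAsTextFile writes each pair using Tuple2's toString form, one pair per line such as (spark,3), across part-00000-style files. As a small stdlib-only sketch of reading that format back, assuming lines of exactly the shape (word,count) (the class name SparkOutputParser and the sample lines are illustrative, not actual job output):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SparkOutputParser {

    // Parse lines of the form "(word,count)" as produced by Tuple2.toString()
    // in the part-NNNNN files under sparkout1/.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // Use the last comma so a word containing commas still parses;
            // the count after it never contains one.
            int comma = line.lastIndexOf(',');
            String word = line.substring(1, comma);            // strip leading '('
            int count = Integer.parseInt(
                    line.substring(comma + 1, line.length() - 1)); // strip trailing ')'
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = parse(List.of("(spark,3)", "(hadoop,2)"));
        System.out.println(counts); // {spark=3, hadoop=2}
    }
}
```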
Step 9 - Don't forget to stop the Hadoop daemons.

$ cd /usr/local/hadoop/sbin
$ ./stop-all.sh

Step 10 - Don't forget to stop the Spark daemons.

$ cd /usr/local/spark/sbin
$ ./stop-all.sh