
Apache Spark WordCount Java Example

Apache Spark is an open-source cluster computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Prerequisites
1) A machine with Ubuntu 14.04 LTS operating system
2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)
3) Apache Spark 1.6.1 pre-installed (How to install Spark on Ubuntu 14.04)
Spark WordCount Java Example
Step 1 - Add the following two Spark jar files to your Java project's build path:

spark-assembly-1.6.1-hadoop2.6.0.jar
spark-core_2.10-0.9.0-incubating.jar
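If you compile from the command line instead of an IDE, the classpath might look like this (a sketch; it assumes both jars sit in the current directory):
$ javac -cp "spark-assembly-1.6.1-hadoop2.6.0.jar:spark-core_2.10-0.9.0-incubating.jar" WordCount.java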
Step 2 - Create the following WordCount.java class in your project.
WordCount.java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;
public class WordCount {

    // Splits each input line into individual words on single spaces.
    private static final FlatMapFunction<String, String> WORDS_EXTRACTOR = new FlatMapFunction<String, String>() {
        private static final long serialVersionUID = 1L;

        @Override
        public Iterable<String> call(String s) throws Exception {
            System.out.println("text file: " + s); // debug output; runs on the executors
            return Arrays.asList(s.split(" "));
        }
    };

    // Maps each word to a (word, 1) pair.
    private static final PairFunction<String, String, Integer> WORDS_MAPPER = new PairFunction<String, String, Integer>() {
        private static final long serialVersionUID = 1L;

        @Override
        public Tuple2<String, Integer> call(String s) throws Exception {
            return new Tuple2<String, Integer>(s, 1);
        }
    };

    // Sums the partial counts for each word.
    private static final Function2<Integer, Integer, Integer> WORDS_REDUCER = new Function2<Integer, Integer, Integer>() {
        private static final long serialVersionUID = 1L;

        @Override
        public Integer call(Integer a, Integer b) throws Exception {
            return a + b;
        }
    };

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("com.WordCount")
                .setMaster("spark://127.0.0.1:7077")
                .set("spark.akka.heartbeat.interval", "100")
                .set("spark.local.ip", "127.0.0.1");

        JavaSparkContext context = new JavaSparkContext(conf);

        // Read the input file from HDFS as an RDD of lines, split into 2 partitions.
        JavaRDD<String> file = context
                .textFile("hdfs://localhost:9000/user/hduser/dcin", 2);

        // Classic word count: lines -> words -> (word, 1) pairs -> summed counts.
        JavaRDD<String> words = file.flatMap(WORDS_EXTRACTOR);
        JavaPairRDD<String, Integer> pairs = words.mapToPair(WORDS_MAPPER);
        JavaPairRDD<String, Integer> counter = pairs.reduceByKey(WORDS_REDUCER);

        // Write the (word, count) pairs back to HDFS.
        counter.saveAsTextFile("hdfs://localhost:9000/user/hduser/sparkout1/");

        System.out.println("end");
        context.stop();
    }
}
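If your project targets Java 8, the same pipeline can also be written with lambdas instead of anonymous inner classes. A minimal sketch of the transformation chain, assuming the same SparkContext and HDFS paths as above:

JavaPairRDD<String, Integer> counts = context
        .textFile("hdfs://localhost:9000/user/hduser/dcin", 2)
        .flatMap(s -> Arrays.asList(s.split(" ")))           // line -> words
        .mapToPair(w -> new Tuple2<String, Integer>(w, 1))   // word -> (word, 1)
        .reduceByKey((a, b) -> a + b);                       // sum counts per word
counts.saveAsTextFile("hdfs://localhost:9000/user/hduser/sparkout1/");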
Step 3 - Change the directory to /usr/local/hadoop/sbin.
$ cd /usr/local/hadoop/sbin
Step 4 - Start all Hadoop daemons.
$ ./start-all.sh
Step 5 - Change the directory to /usr/local/spark/sbin.
$ cd /usr/local/spark/sbin
Step 6 - Start all Spark daemons.
$ ./start-all.sh
Step 7 - Verify that the daemons are running with jps (the Java Virtual Machine Process Status Tool). Note that jps only reports JVMs for which it has access permissions.
$ jps
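If everything started correctly, the listing should include the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager) as well as Spark's Master and Worker; the process ids will differ on your machine.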
Step 8 - Package your Java project as a jar file and run the WordCount program by submitting it to Spark with spark-submit. Building the jar file is left to you (see the sketch below).
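For example, packaging from the command line might look like this (a sketch; it assumes your compiled .class files are under bin/):
$ jar -cf WordCount.jar -C bin .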
Run on cluster
$ ./bin/spark-submit --class com.WordCount --master spark://127.0.0.1:7077 /home/hduser/Desktop/1.6\ SPARK/WordCount.jar
OR
Run locally (local[2] runs Spark in a single JVM with 2 worker threads)
$ ./bin/spark-submit --class com.WordCount --master local[2] /home/hduser/Desktop/1.6\ SPARK/WordCount.jar
Output - Now you can see the output files under /user/hduser/sparkout1/ by browsing Hadoop's HDFS web UI at http://localhost:50070 (port 9000, used in the code, is the NameNode's RPC port, not the web UI).
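You can also inspect the result directly from the command line (a sketch; the part-00000 file name assumes the default output partitioning of saveAsTextFile):
$ hdfs dfs -cat /user/hduser/sparkout1/part-00000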
Executed application - Now you can see the completed application in the Spark master web UI at http://127.0.0.1:8080 (port 7077 is the master's RPC endpoint that spark-submit connects to).
Step 9 - Don't forget to stop the Hadoop daemons (run from /usr/local/hadoop/sbin).
$ ./stop-all.sh
Step 10 - Don't forget to stop the Spark daemons (run from /usr/local/spark/sbin).
$ ./stop-all.sh
