Hive User Defined Aggregate Functions (UDAF) Java Example

Step 1 - Add the following jar files to the build path of your Java project.
$HIVE_HOME/lib/hive-exec*.jar
$HIVE_HOME/lib/*.jar
$HADOOP_HOME/share/hadoop/mapreduce/*.jar
$HADOOP_HOME/share/hadoop/common/*.jar
Max.java
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

@SuppressWarnings("deprecation")
public class Max extends UDAF {

    public static class MaxIntUDAFEvaluator implements UDAFEvaluator {

        private IntWritable output;

        // Resets the aggregation state; called once per evaluator instance.
        public void init() {
            output = null;
        }

        // Called once per input row; keeps the running maximum.
        public boolean iterate(IntWritable maxvalue) {
            if (maxvalue == null) {
                return true;
            }
            if (output == null) {
                output = new IntWritable(maxvalue.get());
            } else {
                output.set(Math.max(output.get(), maxvalue.get()));
            }
            return true;
        }

        // Returns the partial aggregation state at the end of a map task.
        public IntWritable terminatePartial() {
            return output;
        }

        // Folds a partial result from another task into this evaluator.
        public boolean merge(IntWritable other) {
            return iterate(other);
        }

        // Returns the final result of the aggregation.
        public IntWritable terminate() {
            return output;
        }
    }
}
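To see how Hive drives this evaluator, the lifecycle can be sketched as a standalone program: each "map task" calls init and then iterate per row, ships its terminatePartial state, and the reduce side folds those partials in with merge before terminate produces the final answer. This is a minimal sketch, not Hive itself: the class and its Evaluator are stand-ins, and plain Integer replaces IntWritable so it runs without the Hadoop jars.

```java
// Standalone sketch of the UDAF lifecycle Hive drives for MaxIntUDAFEvaluator.
// Integer stands in for IntWritable so the sketch runs without Hadoop jars.
public class MaxLifecycleSketch {

    static class Evaluator {
        private Integer output;

        // Reset aggregation state.
        void init() { output = null; }

        // One call per input row (map side); keep the running maximum.
        boolean iterate(Integer value) {
            if (value == null) return true;
            output = (output == null) ? value : Math.max(output, value);
            return true;
        }

        // Partial state shipped from the map side to the reduce side.
        Integer terminatePartial() { return output; }

        // Reduce side folds in each partial result.
        boolean merge(Integer other) { return iterate(other); }

        // Final aggregated result.
        Integer terminate() { return output; }
    }

    public static void main(String[] args) {
        // Split the tutorial's Numbers_List.txt values across two "map tasks".
        int[][] splits = { {10, 12, 23, 55, 66, 77}, {88, 99, 22, 13, 16} };

        Evaluator reducer = new Evaluator();
        reducer.init();
        for (int[] split : splits) {
            Evaluator mapper = new Evaluator();
            mapper.init();
            for (int v : split) mapper.iterate(v);
            reducer.merge(mapper.terminatePartial());
        }
        System.out.println(reducer.terminate()); // prints 99
    }
}
```

Note that merge simply reuses iterate: taking the maximum of partial maxima is the same operation as taking the maximum of rows, which is what makes MAX safe to compute in parallel.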
Step 2 - Compile Max.java and package the compiled class into a jar file (MaxUDAF.jar in this example).
Step 3 - Create a Numbers_List.txt file
Numbers_List.txt
Step 4 - Add the following numbers to the Numbers_List.txt file, one per line. Save and close.
10
12
23
55
66
77
88
99
22
13
16
Step 5 - Change the directory to /usr/local/hive/bin
$ cd $HIVE_HOME/bin
Step 6 - Enter the hive shell.
$ hive
Step 7 - Create a table Num_list, load the Numbers_List.txt data into it, and verify the rows.
hive> CREATE TABLE Num_list(Num int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n';
hive> LOAD DATA LOCAL INPATH '/home/hduser/Desktop/HIVE/Numbers_List.txt' OVERWRITE INTO TABLE Num_list;
hive> SELECT * FROM Num_list;
Step 8 - Add the jar file to the distributed cache, create a temporary function, and execute the UDAF. Note that the temporary function named max shadows Hive's built-in MAX for this session.
hive> ADD JAR /home/hduser/Desktop/HIVE/MaxUDAF.jar;
hive> CREATE TEMPORARY FUNCTION max AS 'Max';
hive> SELECT max(Num) FROM Num_list;
