
Hive User Defined Functions (UDF) Java Example

Hive ships with a number of built-in functions that can be used in Hive queries without writing any extra code. Sometimes, however, a requirement is not covered by the built-in functions; in that case you can write your own custom function, called a UDF (User Defined Function).
There are three types of UDFs
1) Regular UDFs
2) User Defined Aggregate Functions (UDAFs)
3) User Defined Table Generating Functions (UDTFs)
Here are the simple steps for writing a Hive UDF example in Java.
Step 1 - Add these jar files to your Java project's build path.
hive-exec*.jar
$HIVE_HOME/lib/*.jar
$HADOOP_HOME/share/hadoop/mapreduce/*.jar
$HADOOP_HOME/share/hadoop/common/*.jar
AutoIncrementUDF.java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

// stateful = true tells Hive that this UDF keeps state across calls,
// so evaluate() is invoked once per row and the counter advances.
@UDFType(stateful = true)
public class AutoIncrementUDF extends UDF {
 private int ctr;

 // Returns the next counter value, one per input row.
 // Note: the counter is local to each task JVM, so with multiple
 // mappers each task starts its own sequence.
 public int evaluate() {
  ctr++;
  return ctr;
 }
}
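The counter logic in evaluate() can be sanity-checked outside Hive with a plain Java sketch. The stand-alone class below (the name CounterCheck is made up for illustration) mirrors the UDF body without the Hive dependency:

```java
// Stand-alone sketch of AutoIncrementUDF's counter logic (no Hive dependency).
// Each call to evaluate() bumps and returns the counter, just like the UDF.
public class CounterCheck {
    private int ctr;

    public int evaluate() {
        ctr++;
        return ctr;
    }

    public static void main(String[] args) {
        CounterCheck udf = new CounterCheck();
        // Successive calls print 1, 2, 3 — one value per simulated row.
        System.out.println(udf.evaluate());
        System.out.println(udf.evaluate());
        System.out.println(udf.evaluate());
    }
}
```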
Step 2 - Compile your Java project and package the compiled class into a jar file.
Step 3 - You can add the jar file to Hive in one of the following three ways.
1) Using Hive Shell
Step 4 - Change the directory to /usr/local/hive/bin
$ cd $HIVE_HOME/bin
Step 5 - Enter the hive shell
$ hive
hive> ADD JAR /home/hduser/Desktop/HIVE/AutoIncrementUDF.jar;
OR
2) hive-site.xml
hive-site.xml
<property>
    <name>hive.aux.jars.path</name>
    <value>file:///home/hduser/Desktop/HIVE/AutoIncrementUDF.jar</value>
</property>
OR
3) hive-env.sh
hive-env.sh
export HIVE_AUX_JARS_PATH="/home/hduser/Desktop/HIVE/AutoIncrementUDF.jar"
Step 6 - Create a temporary function
hive> CREATE TEMPORARY FUNCTION incr AS 'AutoIncrementUDF';  
OR
Step 6 - Create a permanent function (Hive 0.13 and later; note that CREATE FUNCTION without the TEMPORARY keyword creates a permanent function)
hive> CREATE FUNCTION incr AS 'AutoIncrementUDF';
Step 7 - Create a data.csv file
data.csv
Step 8 - Add the following lines to the data.csv file, then save and close it.
row1,c1,c2
row2,c1,c2
row3,c1,c2
row4,c1,c2
row5,c1,c2
row6,c1,c2
row7,c1,c2
row8,c1,c2
row9,c1,c2
row10,c1,c2
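The ten lines above can also be generated from the shell (data.csv is written to the current directory here; adjust the path to taste):

```shell
# Write rows row1 .. row10 to data.csv, one "rowN,c1,c2" line each.
for i in $(seq 1 10); do
  echo "row$i,c1,c2"
done > data.csv

cat data.csv   # verify the contents
```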
Step 9 - Create a table t1, load data.csv data into the table and verify.
hive> CREATE TABLE t1 (id STRING, c1 STRING, c2 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/home/hduser/Desktop/HIVE/data.csv' OVERWRITE INTO TABLE t1;
hive> SELECT * FROM t1;
Step 10 - Create a table increment_table1, execute the UDF and verify. (Because the counter lives inside the UDF instance, the sequence is only gap-free when the query runs in a single task; with multiple mappers each task starts its own count.)
hive> CREATE TABLE increment_table1 (id INT, c1 STRING, c2 STRING, c3 STRING);
hive> INSERT OVERWRITE TABLE increment_table1 SELECT incr() AS inc, id, c1, c2 FROM t1;
hive> SELECT * FROM increment_table1;
Please share this blog post and follow me for the latest updates.
