
Hive User Defined Table Generating Functions (UDTF) Java Example

Step 1 - Add the following jar files to your Java project's build path:
$HIVE_HOME/lib/hive-exec*.jar
$HIVE_HOME/lib/*.jar
$HADOOP_HOME/share/hadoop/mapreduce/*.jar
$HADOOP_HOME/share/hadoop/common/*.jar
Myudtf.java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
public class Myudtf extends GenericUDTF {
 private PrimitiveObjectInspector stringOI = null;
 @Override
 public StructObjectInspector initialize(ObjectInspector[] args)
   throws UDFArgumentException {
  if (args.length != 1) {
   throw new UDFArgumentException(
     "Myudtf() takes exactly one argument");
  }
  // use || here: a non-primitive argument must fail before the cast below is attempted
  if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
    || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
   throw new UDFArgumentException(
     "Myudtf() takes a string as a parameter");
  }
  // input inspector
  stringOI = (PrimitiveObjectInspector) args[0];
  // output inspectors -- a struct with two fields
  List<String> fieldNames = new ArrayList<String>(2);
  List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
  fieldNames.add("id");
  fieldNames.add("phone_number");
  fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
  fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
  return ObjectInspectorFactory.getStandardStructObjectInspector(
    fieldNames, fieldOIs);
 }

 public ArrayList<Object[]> processInputRecord(String id) {
  ArrayList<Object[]> result = new ArrayList<Object[]>();
  // ignoring null or empty input
  if (id == null || id.isEmpty()) {
   return result;
  }
  String[] tokens = id.split(","); // rows in phn_num.txt are comma-delimited
  if (tokens.length == 2) {
   result.add(new Object[] { tokens[0], tokens[1] });
  }
  else if (tokens.length == 3) {
   result.add(new Object[] { tokens[0], tokens[1] });
   result.add(new Object[] { tokens[0], tokens[2] });
  }
  return result;
 }
 @Override
 public void process(Object[] record) throws HiveException {
  final String id = stringOI.getPrimitiveJavaObject(record[0]).toString();
  ArrayList<Object[]> results = processInputRecord(id);
  Iterator<Object[]> it = results.iterator();
  while (it.hasNext()) {
   Object[] r = it.next();
   forward(r);
  }
 }
 @Override
 public void close() throws HiveException {
  // nothing to clean up
 }
}
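The row-expansion logic in processInputRecord can be exercised outside Hive. The sketch below (the class name RowExpander is hypothetical) shows how one comma-delimited row fans out into (id, phone_number) pairs; unlike the UDTF above, it is generalized to any number of phone columns rather than exactly two or three tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the same row-expansion logic as Myudtf.processInputRecord,
// testable without a Hive installation. RowExpander is a hypothetical name.
public class RowExpander {
    public static List<String[]> expand(String row) {
        List<String[]> result = new ArrayList<String[]>();
        if (row == null || row.isEmpty()) {
            return result; // ignore null or empty input, as the UDTF does
        }
        String[] tokens = row.split(","); // rows in phn_num.txt are comma-delimited
        for (int i = 1; i < tokens.length; i++) {
            // one output row per phone number, each paired with the leading id
            result.add(new String[] { tokens[0], tokens[i] });
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] r : expand("123,phone1,phone2")) {
            System.out.println(r[0] + "\t" + r[1]);
        }
    }
}
```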

 
Step 2 - Compile your Java project and package it into a jar file (the commands below assume the name UDTF.jar).
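One way to compile and package from the command line (paths and the jar name are illustrative; adjust to your environment):

```shell
# Compile against the Hive and Hadoop jars listed in Step 1
$ javac -cp "$HIVE_HOME/lib/*:$HADOOP_HOME/share/hadoop/common/*" Myudtf.java

# Package the compiled class into UDTF.jar, the name used in Step 8
$ jar cf UDTF.jar Myudtf.class
```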
Step 3 - Create a file named phn_num.txt.
Step 4 - Add the following lines to phn_num.txt. Save and close.
123,phone1,phone2
123,phone3
124,phone1,phone2
125,phone1,phone2
125,phone3
126,phone1
126,phone2,phone3
Step 5 - Change the directory to /usr/local/hive/bin
$ cd $HIVE_HOME/bin
Step 6 - Start the Hive shell
$ hive
Step 7 - Create a table phone, load the phn_num.txt data into it, and verify. Each input line is stored whole in the single string column id (fields are terminated by '\n'), so the UDTF can split it on commas.
hive> CREATE TABLE phone(id String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n';
hive> LOAD DATA LOCAL INPATH '/home/hduser/Desktop/HIVE/phn_num.txt' OVERWRITE INTO TABLE phone;
hive> SELECT * FROM phone;
Step 8 - Add the jar file to the distributed cache, create a temporary function, and execute the UDTF.
hive> ADD JAR /home/hduser/Desktop/HIVE/UDTF.jar;
hive> CREATE TEMPORARY FUNCTION fun2 AS 'Myudtf';
hive> SELECT fun2(id) FROM phone;
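Calling a UDTF directly in the SELECT list, as above, does not allow other expressions alongside it. Two common variations (alias names here are illustrative): naming the generated columns explicitly, and Hive's LATERAL VIEW syntax, which joins the generated rows back to the source table.

```sql
-- Name the UDTF's output columns explicitly
SELECT fun2(id) AS (id, phone_number) FROM phone;

-- LATERAL VIEW lets you select source columns next to the generated ones
SELECT t.id, t.phone_number
FROM phone
LATERAL VIEW fun2(id) t AS id, phone_number;
```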
