Today I explored how to write your own custom UDF in Hive. There are two ways to create a UDF: a simple one and a complex one. The simple approach extends the UDF class and works with Hadoop writable types such as Text, IntWritable, LongWritable, DoubleWritable, etc. For richer data types like Map, List, and Struct you need the complex approach, extending the GenericUDF class.

Simple Hive UDF

File: HiveUDFSimpleSample.java

package com.deb.experiments.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(
  name = "SimpleUDFExample",
  value = "returns 'Hello x', where x is whatever you give it (STRING)",
  extended = "SELECT addHello('world') FROM foo LIMIT 1;"
)
public class HiveUDFSimpleSample extends UDF {
  public Text evaluate(Text input) {
    if (input == null) return null;
    return new Text("Hello " + input.toString());
  }
}

JUnit test for this simple UDF:

import org.apache.hadoop.io.Text;
import org.junit.Assert;
import org.junit.Test;

public class HiveUDFSimpleSampleTest {
  @Test
  public void testUDF() {
    HiveUDFSimpleSample example = new HiveUDFSimpleSample();
    Assert.assertEquals("Hello world", example.evaluate(new Text("world")).toString());
  }
}
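
Since evaluate() passes a null input straight through, it is worth covering that path as well. A minimal extra test (my addition, not part of the original example):

@Test
public void testUDFNullInput() {
  HiveUDFSimpleSample example = new HiveUDFSimpleSample();
  // the UDF should return null rather than throw on NULL input
  Assert.assertNull(example.evaluate(null));
}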

This HiveUDFSimpleSample class takes a Text input and returns a Text with "Hello " prepended to it. (Note: this is not a Java String, it is an org.apache.hadoop.io.Text object.)

To run it, we have to build a jar of the project and register it in Hive. Run the following commands:

$ cd path/to/project
$ mvn package assembly:single

I have used the Maven build automation tool to build the project. This will create the jar "project-1.0-SNAPSHOT-jar-with-dependencies.jar".

Copy this jar to your hive server and use it from there.
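
For example (the hostname and destination path below are placeholders, adjust them to your setup):

$ scp target/project-1.0-SNAPSHOT-jar-with-dependencies.jar user@hiveserver:/tmp/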

Open the Hive shell:

$ hive
hive> ADD JAR project-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> CREATE TEMPORARY FUNCTION addHello AS 'com.deb.experiments.udf.HiveUDFSimpleSample';
hive> SELECT addHello(name) FROM my_table;

Output:

Hello John 
Hello Harry
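
Note that CREATE TEMPORARY FUNCTION registers the function only for the current session. Newer Hive versions (0.13 and later, so not the CDH 4.3.1 build used here) can also register permanent functions, roughly like this (the HDFS path is a placeholder):

hive> CREATE FUNCTION addHello AS 'com.deb.experiments.udf.HiveUDFSimpleSample' USING JAR 'hdfs:///path/to/project-1.0-SNAPSHOT-jar-with-dependencies.jar';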

The package dependencies we need are as follows (include them in your pom.xml):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.0.0-mr1-cdh4.3.1</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>0.10.0-cdh4.3.1</version>
  <scope>provided</scope>
</dependency>

and the Cloudera repository:

<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>

and for building a fat jar you need to add these lines to your pom.xml:

<build>
  <pluginManagement>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.deb.experiments.RawMapreduce</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </pluginManagement>
</build>
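
Because the plugin is declared under pluginManagement, it is not bound to any lifecycle phase automatically; that is why the assembly goal is invoked explicitly with mvn package assembly:single above. The mainClass entry is not needed for a UDF-only jar and can safely be dropped.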

Complex Hive UDF: Extending the GenericUDF class

This API requires you to manually manage object inspectors for the function arguments, and to verify the number and types of the arguments you receive. An object inspector provides a consistent interface for underlying object types, so that different object implementations can all be accessed in a consistent way from within Hive.

File: HiveUDFSampleEqual.java (this function takes two strings as arguments and returns true if they are equal, false otherwise)

package com.deb.experiments.udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.Text;

public class HiveUDFSampleEqual extends GenericUDF {

  private StringObjectInspector elementOI1;
  private StringObjectInspector elementOI2;
  private transient ObjectInspectorConverters.Converter[] converters;

  @Override
  public String getDisplayString(String[] args) {
    // used by Hive when printing the expression, e.g. in EXPLAIN output
    return "stringEqual(" + args[0] + ", " + args[1] + ")";
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentLengthException("This example only takes 2 arguments: String, String");
    }

    // 1. Check we received the right object types.
    ObjectInspector a = arguments[0];
    ObjectInspector b = arguments[1];

    if (!(a instanceof StringObjectInspector) || !(b instanceof StringObjectInspector)) {
      throw new UDFArgumentException("The two arguments must be strings");
    }
    this.elementOI1 = (StringObjectInspector) a;
    this.elementOI2 = (StringObjectInspector) b;

    // 2. Create object inspector converters that normalize both arguments to
    //    writable Text, whatever their underlying representation.
    converters = new ObjectInspectorConverters.Converter[arguments.length];
    for (int i = 0; i < arguments.length; i++) {
      converters[i] = ObjectInspectorConverters.getConverter(arguments[i],
          PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    }

    // 3. The return type of our function is a boolean, so we provide the
    //    corresponding object inspector.
    return PrimitiveObjectInspectorFactory.writableBooleanObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // A null in either argument yields a NULL result.
    if (arguments[0].get() == null || arguments[1].get() == null) {
      return null;
    }
    // Convert both arguments to Text using the converters created in initialize().
    Text str1 = (Text) converters[0].convert(arguments[0].get());
    Text str2 = (Text) converters[1].convert(arguments[1].get());
    return new BooleanWritable(str1.equals(str2));
  }
}

Code analysis:
1. The UDF is instantiated using the default constructor.
2. initialize() checks for the right number of arguments and verifies their types.
3. The validated arguments are cast to the corresponding ObjectInspectors, and ObjectInspector converters are created for converting the incoming objects in evaluate().
4. initialize() returns an object inspector so Hive can read the result of the function.
5. evaluate() checks whether either argument is null, and returns NULL in that case.
6. Otherwise it converts both arguments to Text objects using the converters.
7. Finally it returns true if both arguments are equal, false otherwise.
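
To sanity-check this flow outside Hive, you can drive a GenericUDF directly from JUnit: call initialize() with string object inspectors, then pass the values to evaluate() wrapped in DeferredObjects. Below is a minimal sketch of such a test (my addition, not from the original example); newer Hive versions ship a GenericUDF.DeferredJavaObject helper for this, but implementing DeferredObject inline as done here also works on Hive 0.10:

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.BooleanWritable;
import org.junit.Assert;
import org.junit.Test;

public class HiveUDFSampleEqualTest {

  // tiny DeferredObject that just hands back a fixed value
  private static DeferredObject wrap(final Object value) {
    return new DeferredObject() {
      public Object get() throws HiveException { return value; }
      public void prepare(int version) throws HiveException { } // required by newer Hive versions; a no-op here
    };
  }

  @Test
  public void testStringEqual() throws HiveException {
    HiveUDFSampleEqual udf = new HiveUDFSampleEqual();
    ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    udf.initialize(new ObjectInspector[] { stringOI, stringOI });

    BooleanWritable equal = (BooleanWritable) udf.evaluate(
        new DeferredObject[] { wrap("John"), wrap("John") });
    Assert.assertTrue(equal.get());

    BooleanWritable notEqual = (BooleanWritable) udf.evaluate(
        new DeferredObject[] { wrap("John"), wrap("Harry") });
    Assert.assertFalse(notEqual.get());
  }
}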

Run:

$ hive
hive> ADD JAR project-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> CREATE TEMPORARY FUNCTION stringEqual AS 'com.deb.experiments.udf.HiveUDFSampleEqual';
hive> SELECT stringEqual(name, "John") FROM my_table LIMIT 2;
true
false

I hope this post gives you some idea of how to create your own Hive UDF.