
What is Hive

Apache Hive is a data warehouse system built on top of Hadoop and is used for analyzing structured and semi-structured data. Hive abstracts away the complexity of Hadoop MapReduce: it provides a mechanism to project structure onto the data and perform queries written in HQL (Hive Query Language), which are similar to SQL statements. Internally, these HQL queries get converted into MapReduce jobs by the Hive compiler. Therefore, you don't need to write complex MapReduce programs to process your data using Hadoop. Hive is targeted at users who are comfortable with SQL. Apache Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and User Defined Functions (UDF).
 SQL + Hadoop MapReduce = HiveQL
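To make the "HQL compiles down to MapReduce" idea concrete, here is a purely illustrative Python sketch, not Hive's actual compiler output: it shows how a simple `GROUP BY` aggregation decomposes into map, shuffle, and reduce phases. The table and column names are made up for the example.

```python
from collections import defaultdict

# Toy rows standing in for a Hive table emp: (dept, salary)
rows = [("sales", 100), ("eng", 200), ("sales", 50), ("eng", 300)]

# Conceptually, a query like
#   SELECT dept, SUM(salary) FROM emp GROUP BY dept;
# becomes a map phase (emit key/value pairs) ...
mapped = [(dept, salary) for dept, salary in rows]

# ... a shuffle phase (group values by key) ...
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# ... and a reduce phase (aggregate each group).
result = {key: sum(values) for key, values in grouped.items()}
print(result)  # {'sales': 150, 'eng': 500}
```

In a real Hive deployment, each phase runs distributed across the cluster on data stored in HDFS; the sketch above only captures the shape of the computation.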

Apache Hive Tutorial: Advantages of Hive

  • Useful for people who aren’t from a programming background, as it eliminates the need to write complex MapReduce programs. 
  • Extensible and scalable to cope with the growing volume and variety of data, without affecting the performance of the system.
  • It serves as an efficient ETL (Extract, Transform, Load) tool. 
  • Hive supports any client application written in Java, PHP, Python, C++ or Ruby by exposing its Thrift server. (You can use these client-side languages embedded with SQL for accessing a database such as DB2, etc.)
  • As the metadata information of Hive is stored in an RDBMS, it significantly reduces the time to perform semantic checks during query execution.

As shown in the above image, the Hive Architecture can be categorized into the following components:
  • Hive Clients: Hive supports applications written in many languages like Java, C++, Python etc. using JDBC, Thrift and ODBC drivers. Hence, one can always write a Hive client application in the language of their choice.
  • Hive Services: Apache Hive provides various services like CLI, Web Interface etc. to perform queries. We will explore each one of them shortly in this Hive tutorial blog.
  • Processing Framework and Resource Management: Internally, Hive uses the Hadoop MapReduce framework as its de facto engine to execute the queries. The Hadoop MapReduce framework is a separate topic in itself and therefore is not discussed here. 
  • Distributed Storage: As Hive is installed on top of Hadoop, it uses the underlying HDFS for the distributed storage. You can refer to the HDFS blog to learn more about it. 
Now, let us explore the first two major components in the Hive Architecture:

1. Hive Clients:

Apache Hive supports different types of client applications for performing queries on Hive. These clients can be categorized into three types:
  • Thrift Clients: As the Hive server is based on Apache Thrift, it can serve requests from any programming language that supports Thrift.
  • JDBC Clients: Hive allows Java applications to connect to it using the JDBC driver which is defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
  • ODBC Clients: The Hive ODBC Driver allows applications that support the ODBC protocol to connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.)
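All three client types follow the same request/response pattern against the Thrift-based Hive server. The toy Python sketch below only illustrates that pattern; it is not the Thrift protocol (real Thrift generates typed client/server stubs and uses a compact binary encoding), and `hive_server`, `hive_client` and the canned results are hypothetical names invented for the example.

```python
import json

def hive_server(request_bytes: bytes) -> bytes:
    """Stand-in for the server side: deserialize the request,
    'execute' the query, and serialize a response."""
    request = json.loads(request_bytes)
    # Canned results in place of real query execution.
    fake_results = {"SHOW TABLES": ["emp", "dept"]}
    rows = fake_results.get(request["query"], [])
    return json.dumps({"status": "OK", "rows": rows}).encode()

def hive_client(query: str) -> dict:
    """Stand-in for the client side: serialize a request and
    deserialize the server's response."""
    request_bytes = json.dumps({"query": query}).encode()
    return json.loads(hive_server(request_bytes))

response = hive_client("SHOW TABLES")
print(response)  # {'status': 'OK', 'rows': ['emp', 'dept']}
```

A real JDBC or ODBC client wraps this same exchange behind the standard database API, which is why both drivers can share the Thrift transport underneath.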

2. Hive Services:

Hive provides many services as shown in the image above. Let us have a look at each of them:
  • Hive CLI (Command Line Interface): This is the default shell provided by Hive, where you can execute your Hive queries and commands directly.
  • Apache Hive Web Interfaces: Apart from the command line interface, Hive also provides a web based GUI for executing Hive queries and commands.
  • Hive Server: The Hive server is built on Apache Thrift and is therefore also referred to as the Thrift Server; it allows different clients to submit requests to Hive and retrieve the final result.
  • Apache Hive Driver: It is responsible for receiving the queries submitted by a client through the CLI, the web UI, or the Thrift, ODBC or JDBC interfaces. The driver then passes the query to the compiler, where parsing, type checking and semantic analysis take place with the help of the schema present in the metastore. In the next step, an optimized logical plan is generated in the form of a DAG (Directed Acyclic Graph) of MapReduce tasks and HDFS tasks. Finally, the execution engine executes these tasks in the order of their dependencies, using Hadoop.
  • Metastore: You can think of the metastore as a central repository for storing all the Hive metadata. Hive metadata includes information such as the structure of tables and partitions, along with the columns, column types, and the serializers and deserializers required for read/write operations on the data present in HDFS. The metastore comprises two fundamental units:
    • A service that provides metastore access to other Hive services.
    • Disk storage for the metadata which is separate from HDFS storage.
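The driver's flow described above can be sketched in miniature: a semantic check against a toy metastore, followed by executing a DAG of tasks in dependency order. This is a hypothetical illustration, assuming invented names (`metastore`, `semantic_check`, `execution_order`) and a three-task DAG; Hive's real internals are far richer.

```python
# Toy metastore: table name -> {column: type}
metastore = {"emp": {"name": "string", "salary": "int"}}

def semantic_check(table: str, columns: list) -> None:
    """Reject queries that reference unknown tables or columns,
    using the schema stored in the (toy) metastore."""
    schema = metastore.get(table)
    if schema is None:
        raise ValueError(f"Table not found: {table}")
    for col in columns:
        if col not in schema:
            raise ValueError(f"Column not found: {col}")

# Toy DAG of tasks: task -> list of tasks it depends on.
dag = {"map": [], "reduce": ["map"], "move_to_hdfs": ["reduce"]}

def execution_order(dag: dict) -> list:
    """Simple topological sort: repeatedly schedule tasks whose
    dependencies have all been scheduled (assumes an acyclic DAG)."""
    order, done = [], set()
    while len(order) < len(dag):
        for task, deps in dag.items():
            if task not in done and all(d in done for d in deps):
                order.append(task)
                done.add(task)
    return order

semantic_check("emp", ["name", "salary"])  # passes silently
print(execution_order(dag))  # ['map', 'reduce', 'move_to_hdfs']
```

Because the schema lives in an RDBMS-backed metastore rather than in HDFS, checks like `semantic_check` can be answered quickly at compile time, which is the advantage noted earlier in this tutorial.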
