Hive HiveQL Examples
Hive is a data warehouse technology that lets you define databases and tables to analyze structured data. The idea behind structured data analysis is to store the data in tabular form and run queries against it. This post explains how to create a Hive database and run HiveQL queries. Hive ships with a default database named default.
Step 1 - Change the directory to /usr/local/hive/bin
$ cd $HIVE_HOME/bin
Step 2 - Enter into hive shell
$ hive
Database
CREATE DATABASE is a statement used to create a database in Hive. The keywords SCHEMA and DATABASE are interchangeable.
hive> CREATE DATABASE IF NOT EXISTS userdb;
OR
hive> CREATE SCHEMA userdb;
The following statement lists all databases:
hive> SHOW DATABASES;
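To make userdb the current database for the statements that follow, switch to it with USE:
hive> USE userdb;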
Table
Create Table is a statement used to create a table in Hive.
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
> salary String, destination String)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ' '
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;
Create sample.txt file
sample.txt
Add the following lines to the sample.txt file, then save and close. Because fields are delimited by a single space, multi-word values are written without internal spaces.
1201 Gopal 45000 TechnicalManager
1202 Manisha 45000 ProofReader
1203 Masthanvali 40000 TechnicalWriter
1204 Krian 40000 HrAdmin
1205 Kranthi 30000 OpAdmin
Load
We can insert data using the LOAD DATA statement. For bulk records, LOAD DATA is preferable to inserting rows one at a time. There are two ways to load data: from the local file system (with the LOCAL keyword) and from the Hadoop file system (HDFS).
hive> LOAD DATA LOCAL INPATH '/home/hduser/Desktop/HIVE/sample.txt' OVERWRITE INTO TABLE employee;
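To load from HDFS instead, omit the LOCAL keyword. Note that a non-LOCAL load moves the file from its HDFS location into the table's directory; the path below is a hypothetical example:
hive> LOAD DATA INPATH '/user/hduser/sample.txt' OVERWRITE INTO TABLE employee;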
Verify
hive> SELECT * FROM employee;
Alter
The ALTER TABLE statement changes the attributes of a table, such as renaming the table, renaming columns, adding columns, and deleting or replacing columns.
hive> ALTER TABLE employee RENAME TO emp;
Rename it back so the remaining examples, which refer to employee, still work:
hive> ALTER TABLE emp RENAME TO employee;
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
Add Columns
hive> ALTER TABLE employee ADD COLUMNS (
> dept STRING COMMENT 'Department name');
Replace Columns
REPLACE COLUMNS deletes all existing columns and replaces them with the new column list:
hive> ALTER TABLE employee REPLACE COLUMNS (eid INT, name STRING);
Partitioning
Hive organizes tables into partitions, a way of dividing a table into related parts based on the values of partition columns such as date, city, or department. Partitioning makes it easy to query just a portion of the data. Tables or partitions can be further sub-divided into buckets, which give the data extra structure that can support more efficient queries; bucketing assigns rows to buckets based on the hash of a column's value.
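The ALTER statements below assume employee was declared with a partition column (PARTITIONED BY (year STRING)); the CREATE TABLE shown earlier did not include one. As a minimal sketch of a partitioned, bucketed table (the name employee_part and the bucket count are illustrative):
hive> CREATE TABLE employee_part (eid int, name String, salary String, destination String)
> PARTITIONED BY (year STRING)
> CLUSTERED BY (eid) INTO 4 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
> STORED AS TEXTFILE;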
hive> ALTER TABLE employee
> ADD PARTITION (year='2013')
> LOCATION '/2013/part2013';

hive> ALTER TABLE employee
> ADD PARTITION (year='2012')
> LOCATION '/2012/part2012';

Renaming a partition changes its value; the partition column itself stays the same:
hive> ALTER TABLE employee PARTITION (year='2013')
> RENAME TO PARTITION (year='2014');

hive> ALTER TABLE employee DROP IF EXISTS
> PARTITION (year='2014');
Built-in operators of Hive
There are four types of operators in Hive:
1. Relational Operators
2. Arithmetic Operators
3. Logical Operators
4. Complex Operators
hive> SELECT * FROM employee WHERE Id=1205;    -- relational (=)
hive> SELECT * FROM employee WHERE Salary>=40000;    -- relational (>=)
hive> SELECT 20+30 ADD FROM temp;    -- arithmetic (+); assumes a one-row table named temp exists
hive> SELECT * FROM employee WHERE Salary>40000 && Dept='TP';    -- logical (&&)
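The examples above cover the relational, arithmetic, and logical operators; the complex operators (A[n] for array elements, M[key] for map values, S.x for struct fields) need columns of complex types. The contacts table here is a hypothetical illustration:
hive> CREATE TABLE contacts (name String, phones ARRAY<STRING>,
> props MAP<STRING,STRING>, addr STRUCT<city:STRING, zip:STRING>);
hive> SELECT phones[0], props['email'], addr.city FROM contacts;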
View
Views are created and managed according to user requirements; you can save any result set as a view. Views in Hive work much like views in SQL.
hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE salary>30000;
hive> SELECT * FROM emp_30000;
hive> DROP VIEW emp_30000;
Index
An index is a pointer on a particular column of a table; creating an index creates that pointer. (Indexing was removed in Hive 3.0, where materialized views are the recommended alternative.)
hive> CREATE INDEX index_salary ON TABLE employee(salary) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
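An index can be rebuilt after the base table changes, using the standard pre-3.0 ALTER INDEX syntax:
hive> ALTER INDEX index_salary ON employee REBUILD;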
hive> DROP INDEX index_salary ON employee;
HiveQL Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
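As a sketch of how these clauses combine, here is a hypothetical query against the employee table created in the next step:
hive> SELECT dept, count(*) AS cnt
> FROM employee
> WHERE salary > 35000
> GROUP BY dept
> HAVING count(*) > 1
> LIMIT 10;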
Select and Where Query
Create an employee.txt file
employee.txt
Add the following lines to the employee.txt file, then save and close.
1201 Gopal 45000 TechnicalManager TP
1202 Manisha 45000 ProofReader PR
1203 Masthanvali 40000 TechnicalWriter TP
1204 Krian 40000 HrAdmin HR
1205 Kranthi 30000 OpAdmin Admin
Create a table called employee. (If the employee table from the earlier examples still exists with its old schema, drop it first with DROP TABLE employee;, since IF NOT EXISTS would silently keep the old definition.)
hive> CREATE TABLE IF NOT EXISTS employee (id int, name String, salary String, destination String, dept String) COMMENT 'Employee details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Load employee.txt data into table employee.
hive> LOAD DATA LOCAL INPATH '/home/hduser/Desktop/HIVE/employee.txt' OVERWRITE INTO TABLE employee;
hive> SELECT * FROM employee WHERE salary>30000;
OrderBy
The ORDER BY clause sorts the result set by one or more columns, in ascending (the default) or descending order.
hive> SELECT Id, Name, Dept FROM employee ORDER BY Dept;
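Add DESC for descending order:
hive> SELECT Id, Name, Dept FROM employee ORDER BY Dept DESC;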
GroupBy
The GROUP BY clause groups the records in a result set by a particular column, typically so that aggregate functions such as count() can be applied per group.
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;
Drop a table
hive> DROP TABLE IF EXISTS employee;
View tables
hive> SHOW TABLES;
Drop a database
hive> DROP DATABASE IF EXISTS userdb;
Use CASCADE to drop the database even if it still contains tables:
hive> DROP DATABASE IF EXISTS userdb CASCADE;
OR
hive> DROP SCHEMA userdb;
View databases
hive> SHOW DATABASES;
Joins
JOIN is a clause used to combine specific fields from two tables using values common to each. It combines records from two or more tables in the database and works much like JOIN in SQL.
Create a table customers
hive> CREATE TABLE IF NOT EXISTS customers (id int, name String, age int, address String, salary String) COMMENT 'customer details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Create a customers.txt file
customers.txt
Add the following lines to the customers.txt file, then save and close.
1 Ramesh 32 Ahmedabad 2000.00
2 Khilan 25 Delhi 1500.00
3 kaushik 23 Kota 2000.00
4 Chaitali 25 Mumbai 6500.00
5 Hardik 27 Bhopal 8500.00
6 Komal 22 MP 4500.00
7 Muffy 24 Indore 10000.00
Load the customers.txt data into the customers table.
hive> LOAD DATA LOCAL INPATH '/home/hduser/Desktop/HIVE/customers.txt' OVERWRITE INTO TABLE customers;
Create a table orders
hive> CREATE TABLE IF NOT EXISTS orders (oid int, odate String, id int, amount String) COMMENT 'order details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Create an orders.txt file
orders.txt
Add the following lines to the orders.txt file, then save and close.
102 2009-10-08-00:00:00 3 3000
100 2009-10-08-00:00:00 3 1500
101 2009-11-20-00:00:00 2 1560
103 2008-05-20-00:00:00 4 2060
Load the orders.txt data into the orders table.
hive> LOAD DATA LOCAL INPATH '/home/hduser/Desktop/HIVE/orders.txt' OVERWRITE INTO TABLE orders;
JOIN
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.id = o.id);
LEFT OUTER JOIN
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.ODATE FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o ON (c.ID = o.ID);
RIGHT OUTER JOIN
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.ODATE FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o ON (c.ID = o.ID);
FULL OUTER JOIN
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.ODATE FROM CUSTOMERS c FULL OUTER JOIN ORDERS o ON (c.ID = o.ID);