
Hive WordCount HiveQL Execution

Hive WordCount HiveQL Example
Step 1 - Change the directory to /usr/local/hadoop/sbin
$ cd /usr/local/hadoop/sbin
Step 2 - Start all Hadoop daemons.
$ ./start-all.sh
Step 3 - Create employee.txt file.
employee.txt
Step 4 - Add the following lines to the employee.txt file. Save and close.
1201 Gopal 45000 TechnicalManager TP
1202 Manisha 45000 ProofReader PR
1203 Masthanvali 40000 TechnicalWriter TP
1204 Krian 40000 HrAdmin HR
1205 Kranthi 30000 OpAdmin Admin
Step 5 - Copy employee.txt from the local file system into HDFS. Note that it is stored in HDFS as employee123.txt, which is the path the Hive script loads in Step 8.
$ hdfs dfs -copyFromLocal /home/hduser/Desktop/employee.txt /user/hduser/employee123.txt
Step 6 - Change the directory to /usr/local/hive/bin
$ cd $HIVE_HOME/bin
Step 7 - Create the wordcount Hive query file. The file should have a .hql extension.
wordcount.hql
Step 8 - Add the following lines to wordcount.hql. Save and close.
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'hdfs://localhost:9000/user/hduser/employee123.txt' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
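The query above splits each line on whitespace (split with the '\\s' regex), turns each token into its own row (explode), then groups and counts. As a sanity check, here is a minimal local sketch in Python that mimics the same logic on the employee.txt lines from Step 4; it is an illustration of what Hive computes, not part of the Hive job itself.

```python
from collections import Counter

# Lines from employee.txt, as loaded into the docs table.
docs = [
    "1201 Gopal 45000 TechnicalManager TP",
    "1202 Manisha 45000 ProofReader PR",
    "1203 Masthanvali 40000 TechnicalWriter TP",
    "1204 Krian 40000 HrAdmin HR",
    "1205 Kranthi 30000 OpAdmin Admin",
]

# explode(split(line, '\\s')): one row per whitespace-separated token.
words = [w for line in docs for w in line.split()]

# GROUP BY word ... ORDER BY word: count each token, sorted by word.
word_counts = sorted(Counter(words).items())

for word, count in word_counts:
    print(word, count)
```

If the Hive query runs on the same data, word_counts in Hive should likewise contain 22 distinct words, with 40000, 45000, and TP each counted twice.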
Step 9 - Execute the wordcount.hql HiveQL script.
$ hive -f /home/hduser/Desktop/HIVE/wordcount.hql
Step 10 - Execute a SELECT HiveQL query to view the results.
$ hive -e 'select * from word_counts'
Optionally, set these Hive execution parameters in hive-site.xml.
  <property>
     <name>mapred.reduce.tasks</name>
     <value>-1</value>
     <description>The default number of reduce tasks per job.</description>
  </property>

  <property>
     <name>hive.exec.scratchdir</name>
     <value>/tmp/mydir</value>
     <description>Scratch space for Hive jobs</description>
  </property>

  <property>
     <name>hive.metastore.warehouse.dir</name>
     <value>/user/hive/warehouse</value>
     <description>location of default database for the warehouse</description>
  </property>

  <property>
     <name>hive.enforce.bucketing</name>
     <value>true</value>
     <description>Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced. </description>
  </property>
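Instead of editing hive-site.xml, these parameters can also be overridden for a single session with the SET command from the Hive CLI. A sketch, using the same properties listed above:

hive> SET mapred.reduce.tasks=-1;
hive> SET hive.enforce.bucketing=true;
hive> SET hive.exec.scratchdir;

Running SET with a property name and no value, as in the last line, prints the current value, which is a quick way to confirm what the session is actually using.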
