I. Background of Hive's creation
Apache Hive is data warehouse software that makes it easy to read, write, and manage large datasets residing in distributed storage using SQL. Structure can be projected onto data that is already in storage. A command-line tool and a JDBC driver are provided to connect users to Hive.
·Open-sourced by Facebook, originally built to compute statistics over massive volumes of structured log data
·MapReduce programming is inconvenient
·Files on HDFS lack a schema (field names, field types, etc.)
II. What is Hive?
·A data warehouse built on top of Hadoop
·Hive defines a SQL-like query language, HQL (similar to SQL but not identical)
·Usually used for offline data processing (MapReduce)
·The underlying layer supports multiple execution engines (Hive on MapReduce, Hive on Tez, Hive on Spark)
·Supports a variety of compression formats, storage formats, and custom functions (compression: GZIP, LZO, Snappy, BZIP2, etc.; storage: TextFile, SequenceFile, RCFile, ORC, Parquet; UDF: user-defined functions); see the example after this list
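For instance, combining a columnar storage format with a compression codec is just a table declaration in HQL. A minimal sketch (the table name logs_orc and its columns are made up for illustration):
create table logs_orc(id int, msg string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");  -- ORC files compressed with Snappy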
What exactly is Hive? Let's first take a look at how Hive's official Wiki introduces Hive (https://cwiki.apache.org/confluence/display/Hive/Home):
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax. It provides:
1. Tools for easy extract/transform/load (ETL), which can be understood as data cleansing, analysis, and presentation
2. A mechanism for imposing structure on large amounts of formatted data
3. The ability to analyze and process data stored directly in HDFS or in other storage systems such as HBase
4. Query execution via MapReduce
5. Procedural language support (stored procedures) via HPL-SQL
6. Sub-second query retrieval via Hive LLAP, Apache YARN, and Apache Slider
III. Hive installation
1. Hive stand-alone installation (using Derby for metadata storage)
·Installation package preparation
Upload the Hive installation package apache-hive-1.2.1-bin.tar.gz to the /bigdata/ directory on the VM
JDK installation package jdk-8u151-x64.gz
·Cluster preparation (linux1, linux2, linux3)
·Unpack and install Hive
Unzip the uploaded Hive archive into the /app directory on the VM
tar -zxvf /bigdata/apache-hive-1.2.1-bin.tar.gz -C /app
mv /app/apache-hive-1.2.1-bin/ /app/hive-1.2.1
·Configure Hive's configuration files
View the contents of the conf directory
Copy the template file hive-env.sh.template to hive-env.sh
cp /app/hive-1.2.1/conf/hive-env.sh.template /app/hive-1.2.1/conf/hive-env.sh
vim /app/hive-1.2.1/conf/hive-env.sh
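Inside hive-env.sh the main setting is the Hadoop installation path. A sketch, assuming Hadoop is installed under /app/hadoop-2.7.3 (adjust to your cluster's actual path):
# /app/hive-1.2.1/conf/hive-env.sh (Hadoop path below is an assumption)
HADOOP_HOME=/app/hadoop-2.7.3
export HIVE_CONF_DIR=/app/hive-1.2.1/conf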
·Configure hive environment variables
vim /etc/profile
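Append export lines along these lines to /etc/profile (HIVE_HOME matches the install path used above):
# appended to /etc/profile
export HIVE_HOME=/app/hive-1.2.1
export PATH=$PATH:$HIVE_HOME/bin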
source /etc/profile
which hive
·Start hadoop cluster
·Start Hive service
hive
·View databases
show databases;
·Create a database
create database myhive;
show databases;
·Create a table
create table student(id int, chinese string, math string, english string);
·Load data and query
load data local inpath '/root/student.txt' into table student;
select * from student;
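One caveat: with the default row format, Hive expects fields separated by the \001 (Ctrl-A) character, so a plain tab-separated student.txt would come back as NULL columns. A sketch, assuming a tab-separated file:
create table student(id int, chinese string, math string, english string)
row format delimited fields terminated by '\t';  -- match the file's actual delimiter
load data local inpath '/root/student.txt' into table student;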
2. Hive stand-alone installation (using MySQL for metadata storage)
Install the MySQL server and MySQL client, and start the MySQL service.
·Set up a MySQL account for Hive on linux1 and grant it sufficient privileges
create user 'hive' identified by '123456';
GRANT ALL PRIVILEGES ON *.* TO hive@'%' IDENTIFIED BY '123456' with grant option;
GRANT ALL PRIVILEGES ON *.* TO hive@'localhost' IDENTIFIED BY '123456' with grant option;
flush privileges;
Check that the new account works.
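A quick way to check is to log in with the new account from the shell:
mysql -u hive -p123456 -e "show databases;"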
·Continue configuring Hive (local metastore mode): hive-site.xml, hive-env.sh
Configure hive-env.sh
Configure hive-site.xml: copy the hive-default.xml.template file under /app/hive-1.2.1/conf to hive-site.xml
cp /app/hive-1.2.1/conf/hive-default.xml.template /app/hive-1.2.1/conf/hive-site.xml
vim /app/hive-1.2.1/conf/hive-site.xml
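The essential edits in hive-site.xml are the metastore JDBC settings. A sketch, assuming MySQL runs on linux1 and uses the hive account created above:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://linux1:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
</property>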
·Copy the MySQL JDBC driver jar (mysql-connector-java) into /app/hive-1.2.1/lib/. Without the driver jar, Hive will report an error when it tries to connect to the metastore.
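For example (the connector version here is an assumption; use the mysql-connector-java jar that matches your MySQL server):
cp /root/mysql-connector-java-5.1.39.jar /app/hive-1.2.1/lib/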
·Start the Hive service from the command line, view the databases, create a database named heihei, and then open the cluster web page
Looking at the cluster web page, you can see that the directory corresponding to the heihei database has been created on HDFS
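The same thing can be checked from the shell, assuming the default warehouse location /user/hive/warehouse (hive.metastore.warehouse.dir):
hdfs dfs -ls /user/hive/warehouse
# a heihei.db directory should be listed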
·Access hive using beeline
Exit the Hive service first. Then modify the Hadoop configuration file etc/hadoop/core-site.xml on linux1 and add the following properties, which allow the root user to proxy (impersonate) requests coming from any host and any user group; HiveServer2 needs this to submit queries on behalf of connecting users. Then restart the cluster.
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
Use the following command to start the HiveServer2 service in the background:
hive --service hiveserver2 &
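Once it is running, confirm that HiveServer2 is listening on its default Thrift port:
netstat -nltp | grep 10000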
Open another terminal window as the client and run the beeline script.
Connect to the server. This connection goes through the Thrift service; 10000 is the default port.
!connect jdbc:hive2://linux1:10000
Verify that this connection reaches the same Hive service we just used from the command line.
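A typical beeline session looks roughly like this (log in as root so the proxyuser settings above apply; beeline prompts for a username and password after the !connect command):
beeline
beeline> !connect jdbc:hive2://linux1:10000
0: jdbc:hive2://linux1:10000> show databases;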