1. Overview

1.1 What is Hive
Hive was open-sourced by Facebook to handle statistics over massive volumes of structured logs. It is a data-warehouse tool built on Hadoop that maps structured data files to tables and provides SQL-like query capabilities. In essence, it converts HQL/SQL into MapReduce or Spark jobs, executes them, and returns the results. Its essential characteristics are:
1) The data Hive processes is stored in HDFS; by default each database gets a directory under /user/hive/warehouse/<dbName>.db/, with one subdirectory per table.
2) Hive analyzes data by generating jobs for a distributed computing engine such as MapReduce or Spark.
3) Those jobs run on YARN.
In short, Hive only borrows a SQL-like language interface; in every other respect it behaves quite differently from databases such as MySQL.
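To make the "SQL in, MapReduce out" idea concrete, here is a minimal sketch using a hypothetical access_log table (the table name and columns are assumptions for illustration, not from this article). Hive parses the HQL, plans one or more MapReduce (or Spark) stages, runs them on YARN, and returns the rows:

-- Hypothetical log table; Hive keeps its files under
-- /user/hive/warehouse/<db>.db/access_log/ on HDFS.
CREATE TABLE IF NOT EXISTS access_log (
    ip  STRING,
    url STRING,
    ts  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- A SQL-like aggregation; Hive compiles this into a MapReduce/Spark job
-- (map: emit url, reduce: count per url) rather than executing it itself.
SELECT url, COUNT(*) AS pv
FROM access_log
GROUP BY url
ORDER BY pv DESC
LIMIT 10;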
1.2 Advantages and disadvantages of Hive
Advantages:
1) The operation interface uses SQL-like syntax, enabling rapid development (simple and easy to use).
2) It avoids hand-writing MapReduce, reducing the learning cost for developers.
3) Hive's execution latency is relatively high, so it is typically used for offline data analysis where real-time response is not required.
4) Hive's strength is big data; it has no advantage on small data sets, again because of its relatively high execution latency.
5) Hive supports user-defined functions, so users can implement functions tailored to their needs.
Disadvantages:
1) Hive's HQL has limited expressive power:
(1) Iterative algorithms cannot be expressed.
(2) It is not well suited to data mining.
2) Hive's efficiency is relatively low:
(1) The MapReduce jobs Hive generates automatically are usually not very intelligent.
(2) Hive tuning is difficult and coarse-grained.
1.3 Hive basic architecture
Figure 1.1 Hive architecture
As shown in the figure, Hive receives the user's SQL through one of several user-facing interfaces; its Driver, working with the MetaStore, translates the statements into MapReduce (or Spark) jobs, submits them to Hadoop for execution, and finally returns the results to the user interface.
1) user interface: Client
CLI (hive shell), JDBC/ODBC (java access hive), WEBUI (browser access hive)
2) metadata: Metastore
Metadata includes: table names, the database each table belongs to (default is default), the table owner, column/partition fields, table type (managed or external), the table's data directory, and so on.
By default metadata is stored in the embedded Derby database, whose performance is poor; storing the Metastore in MySQL is recommended.
The table data itself lives on HDFS; only the table metadata is kept in MySQL.
3) Hadoop
Use HDFS for storage and MapReduce for calculation.
4) Driver
(1) SQL Parser: converts the SQL string into an abstract syntax tree (AST); this step is usually done with a third-party library such as ANTLR. The AST is then analyzed, e.g. whether tables and fields exist and whether the SQL semantics are wrong.
(2) Compiler: compiles the AST into a logical execution plan.
(3) Query Optimizer: optimizes the logical execution plan.
(4) Execution: converts the logical execution plan into a runnable physical plan; for Hive that means MapReduce or Spark jobs.
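The stages the Driver builds can be inspected with EXPLAIN. A minimal sketch follows (emp is the sample employee table used later in this article; the exact plan output depends on the Hive version and execution engine):

-- Show the plan Hive's Driver builds for a query, including the
-- MapReduce (or Spark) stages it will submit.
EXPLAIN
SELECT deptno, AVG(sal) AS avg_sal
FROM emp
GROUP BY deptno;

-- EXPLAIN EXTENDED prints additional detail such as file paths and
-- fully resolved expressions (output format varies by version).
EXPLAIN EXTENDED
SELECT deptno, AVG(sal) AS avg_sal
FROM emp
GROUP BY deptno;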
2. Deploying Hive

2.1 Planning

Host: bigdata121 (192.168.50.121)
Software: hadoop 2.8.4 (pseudo-distributed, single datanode), hive 1.2.1, mysql 5.7
The deployment of MySQL and Hadoop is not repeated here; see the previous articles.
2.2 Deployment

(1) Hive installation and configuration
Extract the archive to /opt/modules and rename it:
tar -zxvf apache-hive-1.2.1-bin.tar.gz -C /opt/modules/
mv /opt/modules/apache-hive-1.2.1-bin /opt/modules/hive-1.2.1-bin
Modify the environment configuration file hive-env.sh
cd /opt/modules/hive-1.2.1-bin/conf
mv hive-env.sh.template hive-env.sh

Add the following configuration:
(a) Configure the HADOOP_HOME path (the Hadoop home directory):
export HADOOP_HOME=/opt/modules/hadoop-2.8.4
(b) Configure the HIVE_CONF_DIR path (the directory holding Hive's configuration):
export HIVE_CONF_DIR=/opt/modules/hive-1.2.1-bin/conf
Modify log output configuration hive-log4j.properties
cd /opt/modules/hive-1.2.1-bin/conf
mv hive-log4j.properties.template hive-log4j.properties

Change the log directory to:
hive.log.dir=/opt/modules/hive-1.2.1-bin/logs
Configure the hive environment variable:
vim /etc/profile.d/hive.sh

Add the following:
#!/bin/bash
export HIVE_HOME=/opt/modules/hive-1.2.1-bin
export PATH=$PATH:${HIVE_HOME}/bin

Enable the environment variables:
source /etc/profile.d/hive.sh
Once this is configured, you can start Hive with the hive command. By default, metadata is stored in the embedded Derby database.
(2) Store Hive metadata in MySQL
Modify the configuration file conf/hive-site.xml
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://bigdata121:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore. The database that stores the metadata is named metastore here and can be customized.</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore (the connection driver class).</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>Username to use against metastore database (the MySQL user).</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>wjt86912572</value>
        <description>Password to use against metastore database (the MySQL password).</description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>Storage path of the default database in HDFS.</description>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
        <description>Display column headers.</description>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
        <description>Display the name of the current database.</description>
    </property>
</configuration>
Then download mysql-connector-java-5.1.27.jar (choose your own version); this is just the JDBC driver jar. Put it under /opt/modules/hive-1.2.1-bin/lib.
Next, initialize the metadata database in MySQL and migrate the Derby data to MySQL.
The schematool command lives under /opt/modules/hive-1.2.1-bin/bin:
schematool -dbType mysql -initSchema
Output similar to the following indicates that initialization is complete:
schemaTool completed
If the following problems occur:
Metastore connection URL: jdbc:mysql://bigdata121:3306/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: root
Starting metastore schema initialization to 1.2.0
Initialization script hive-schema-1.2.0.mysql.sql
Error: Duplicate entry '1' for key 'PRIMARY' (state=23000,code=1062)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent!
*** schemaTool failed ***
Solution:
Manually drop the database you just created in MySQL (drop database metastore;), then run the initialization command above again.
After initialization is complete, you can start hive.
(3) Where Hive data is stored in HDFS
By default, data is stored under /user/hive/warehouse/. Each database has its own directory named <databaseName>.db; under the database directory there is one directory per table, named after the table, and the table's data files are stored inside it.
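A quick way to see this layout from inside Hive (a sketch; the database and table names here are hypothetical):

-- Create a database and a table, then inspect where Hive put them on HDFS.
CREATE DATABASE IF NOT EXISTS testdb;      -- -> /user/hive/warehouse/testdb.db/
USE testdb;
CREATE TABLE IF NOT EXISTS t1 (id INT);    -- -> /user/hive/warehouse/testdb.db/t1/
DESC DATABASE testdb;                      -- shows the database's HDFS location
DESC FORMATTED t1;                         -- the Location field shows the table directory
-- From the Hive CLI you can also list the directory directly:
dfs -ls /user/hive/warehouse/testdb.db/;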
(4) Configuring the Hive metastore server
By default, each Hive client connects to MySQL directly, using the metastore connection information configured in hive-site.xml, and only one Hive client can connect to the metastore database at a time (to avoid transaction problems). A metastore server is a metadata server: it connects to MySQL directly, while other Hive clients connect to MySQL indirectly through the metastore server to read metadata. Multiple Hive clients can then connect at the same time, and they no longer need the MySQL address, username and password. A Hive client could of course still be given its own MySQL credentials in its configuration file, but that generally means creating a separate MySQL user for each client, which is cumbersome. The metastore server approach is what is generally used inside a company.
In production, a metastore server is typically configured on one host and Hive clients on several other hosts. The host running the metastore server requires high privileges and should not be freely accessible, while the Hive client hosts can be used by lower-privileged users. To some extent this protects the confidentiality of the MySQL metadata database.
Host                            Role
bigdata121 (192.168.50.121)     hive metastore, mysql
bigdata122 (192.168.50.122)     hive client
Configuration:
1 > bigdata121:
vim conf/hive-site.xml

Keep the same MySQL connection (javax.jdo.option.ConnectionURL / ConnectionDriverName / ConnectionUserName / ConnectionPassword), hive.metastore.warehouse.dir, hive.cli.print.header and hive.cli.print.current.db properties as in (2) above, and add the following:

    <property>
        <name>hive.zookeeper.quorum</name>
        <value>bigdata121,bigdata122,bigdata123</value>
        <description>The list of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
    </property>
    <property>
        <name>hive.zookeeper.client.port</name>
        <value>2181</value>
        <description>The port of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://bigdata121:9083</value>
    </property>
When the configuration is complete, start metastore server:
hive --service metastore &
netstat -tnlp | grep 9083    # check whether the corresponding port is up
2 > bigdata122:
vim conf/hive-site.xml

The hive client configuration is simpler: there is no need to configure the MySQL connection information, only the Hive data directory in HDFS and the address of the hive metastore server:

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>Storage path of the default database in HDFS.</description>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
        <description>Display column headers.</description>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
        <description>Display the name of the current database.</description>
    </property>
    <property>
        <name>hive.zookeeper.quorum</name>
        <value>bigdata121,bigdata122,bigdata123</value>
        <description>The list of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
    </property>
    <property>
        <name>hive.zookeeper.client.port</name>
        <value>2181</value>
        <description>The port of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://bigdata121:9083</value>
    </property>
After the configuration is complete, simply run hive to start a client that connects to the metastore server, and then use Hive as usual.
There is no need to start a hiveserver process, because the hive command itself effectively embeds one.
(5) Configuring hiveserver (hiveserver2)
By default, Hive is accessed only through the hive client, and in that case no server process needs to be started: each hive client essentially starts a Hive "server" locally, connects to the metastore server, reads the data directory on HDFS, and the Hive service can be used.
If you need to connect to Hive through JDBC, you have to start Hive as a background service and give the JDBC connection its address and port. Note that this hiveserver has nothing to do with the metastore server discussed earlier; they are separate processes. Moreover, after hiveserver starts it contains a built-in metastore server for its own use, which cannot be used externally. The two serve different use cases, so keep them distinct.
Configure hive server on bigdata121 (192.168.50.121)
The hive-site.xml contains the same MySQL connection, warehouse, CLI display and ZooKeeper properties as before (hive.metastore.uris thrift://bigdata121:9083 is of no use here), plus the hiveserver2 settings:

    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>bigdata121</value>
    </property>
    <property>
        <name>hive.server2.long.polling.timeout</name>
        <value>5000</value>
    </property>
Start hiveserver:
hive --service hiveserver2 &
netstat -tnlp | grep 10000    # check whether the port is up
Then test JDBC from bigdata122; there is a beeline command under Hive's bin directory:
bin/beeline

[root@bigdata122 hive-1.2.1-bin]# bin/beeline
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://bigdata121:10000        <-- this tests the JDBC connection
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [...]
SLF4J: Found binding in [jar:file:/opt/modules/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar...]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Connecting to jdbc:hive2://bigdata121:10000
Enter username for jdbc:hive2://bigdata121:10000:
Enter password for jdbc:hive2://bigdata121:10000:
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://bigdata121:10000>                      <-- this prompt shows the connection succeeded
0: jdbc:hive2://bigdata121:10000> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
| test           |
+----------------+

3. Basic use of Hive

3.1 Hive command-line tool

Basic usage: hive [options]
Running hive with no options enters the Hive command line.
Common options:
--database databaseName : enter the specified database directly
-e SQL : do not enter the Hive shell; execute the SQL directly and print the result on the command line
-f xxx.sql : execute the SQL in the file and return the result directly
Common operation commands after entering hive:
exit / quit : exit Hive
dfs xxx : operate on HDFS from inside Hive
!shellCommand : execute a shell command directly from inside Hive
~/.hivehistory : the file that records Hive command history

3.2 Hive parameter configuration

(1) Parameters can be configured in hive-site.xml.
(2) Parameters passed on the command line with -hiveconf param=value are only valid for that session.
(3) Parameters set inside Hive with set param=value; are also only valid for that session; set param; shows a parameter's current value.
The priority of the three methods increases in that order: values from the configuration file are overridden by -hiveconf, which in turn is overridden by set.

3.3 Hive data types

(1) Basic data types include TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, etc.
(2) Collection data types:
STRUCT: similar to a struct in C; members are accessed with dot notation. Declared with struct().
MAP: a collection of key-value pairs; values are accessed by key. For example, if a map value is 'first'->'John' and 'last'->'Doe', the last element can be referenced as fieldName['last']. Declared with map().
ARRAY: a collection of variables of the same type and name; each array element has a number starting from zero. For example, if an array value is ['John', 'Doe'], the second element can be referenced as arrayName[1]. Declared with array().
Map and struct look somewhat similar, but a struct has no real concept of a key: what looks like a key is just the name of an attribute defined inside the struct itself, and when loading data you only supply the values for those predefined attributes. In other words, for a struct the "keys" already exist in the type definition. A map is easier to understand: key and value are both arbitrary data, and you must supply both when populating the data.
Examples of use:
Suppose a table has a row of data as follows (converted to json format):
{"name": "songsong", "friends": ["bingbing", "lili"], / / list Array, "children": {/ / key value Map, "xiao song": 18, "xiaoxiao song": 19} "address": {/ / structure Struct "street": "hui long guan", "city": "beijing"}}
The original data are as follows:
songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing

Fields are separated by ",", multiple values within the same field are separated by "_", and key and value are separated by ":".
Create a test table (I'll talk about usage later, take a look):
create table test(
    name string,
    friends array<string>,
    children map<string, int>,                      -- map defines the types of key and value
    address struct<street:string, city:string>      -- struct specifies which keys exist; you then only fill in the values
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';

Field explanation:
row format delimited fields terminated by ','   -- column delimiter
collection items terminated by '_'              -- delimiter for elements of MAP, STRUCT and ARRAY
map keys terminated by ':'                      -- delimiter between key and value in a MAP
lines terminated by '\n'                        -- row delimiter
Import data:
hive (default)> load data local inpath '/opt/module/datas/test.txt' into table test;
Query data:
hive (default)> select friends[1], children['xiao song'], address.city from test where name="songsong";
OK
_c0     _c1     city
lili    18      beijing
Time taken: 0.076 seconds, Fetched: 1 row(s)

(3) Type conversion
The atomic data type of Hive can be implicitly converted, similar to the type conversion of Java. For example, if an expression uses the INT type, TINYINT will automatically convert to the INT type, but Hive will not convert it in reverse. For example, if an expression uses the TINYINT type, INT will not automatically convert to the TINYINT type, it will return an error unless you use the CAST operation.
1) the rules for implicit type conversion are as follows.
(1) Any integer type can be implicitly converted to a wider type; e.g. TINYINT can be converted to INT, and INT to BIGINT.
(2) all integer types, FLOAT and STRING types can be implicitly converted to DOUBLE.
(3) TINYINT, SMALLINT and INT can all be converted into FLOAT.
(4) the BOOLEAN type cannot be converted to any other type.
2) You can use CAST to perform explicit data type conversion. For example, CAST('1' AS INT) converts the string '1' to the integer 1; if the cast fails, as in CAST('X' AS INT), the expression returns NULL.
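A short sketch of these rules in practice (the literals are arbitrary; exact result formatting depends on the Hive version):

-- Implicit widening: an INT combined with a DOUBLE is promoted to DOUBLE.
SELECT 1 + 2.0;                   -- 3.0
-- STRING is implicitly converted to DOUBLE in arithmetic:
SELECT '10' + 5;                  -- 15.0
-- Explicit conversion with CAST:
SELECT CAST('1' AS INT) + 1;      -- 2
SELECT CAST('X' AS INT);          -- NULL, the cast fails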
Let's start with the use of HQL
4. DDL: data definition

4.1 Create a database

create database [if not exists] DBNAME [location PATH];
    if not exists : do not create the database if it already exists, create it if it does not
    location PATH : specify the storage path of the database in HDFS

4.2 Modify a database

Users can use ALTER DATABASE to set key-value pairs in a database's DBPROPERTIES to describe the database. Other database metadata is immutable, including the database name and the directory where the database lives.

alter database DBNAME set dbproperties('param'='value');
Example: modify the creation time:
alter database db_hive set dbproperties('createtime'='20180830');

4.3 View a database

View basic information: show databases;
View details: desc database DBNAME;
View more detailed information: desc database extended DBNAME;

4.4 Delete a database

drop database [if exists] DBNAME [cascade];
    if exists : best used to check whether the database exists, otherwise an error is easily raised
    cascade : force-delete a non-empty database; by default only empty databases can be dropped

4.5 Create a table

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[LIKE existing_table_name]

Field explanation:
(1) EXTERNAL: create an external table; by default a managed (internal) table is created. The difference between the two is covered later.
(2) COMMENT: add comments to the table and to columns.
(3) PARTITIONED BY: create a partitioned table.
(4) CLUSTERED BY: create a bucketed table.
(5) SORTED BY (not commonly used): sorting inside MR.
(6) ROW FORMAT DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
    These define the format of the data: field separator, collection element separator, KV separator, and line separator.
(7) STORED AS: specify the storage file type; broadly divided into row storage and column storage (see later). The default is textfile, i.e. row-oriented text format.
(8) LOCATION: specify the location of the table on HDFS, i.e. the data load path. Put files under that path and the data is available without load. If you create partition directories by hand, you still need to load them manually.
(9) LIKE: copy an existing table structure without copying the data.

Example:
create table if not exists student2(id int, name string)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student2';

4.6 View table information

View basic table information: desc TABLENAME;
View details: desc formatted TABLENAME;

4.7 External and internal tables
The difference between internal and external tables:
External table: dropping it with drop only deletes the table structure (metadata), not the data.
Internal table (also called a managed table): dropping it with drop deletes both the table structure and the data.
Generally, externally collected raw data should not be casually deletable, so it is appropriate to store it in external tables. Intermediate result tables produced by later analysis of that data are better stored in internal tables, whose structure and data can be dropped at any time.
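A minimal sketch of the two kinds of tables (the table names and the /data/raw/logs path are illustrative assumptions):

-- External table: points at data that already lives in HDFS; DROP removes
-- only the metadata, the files under /data/raw/logs are kept.
create external table if not exists raw_logs(
    line string
)
row format delimited fields terminated by '\t'
location '/data/raw/logs';

-- Internal (managed) table: Hive owns the directory under the warehouse;
-- DROP removes both the metadata and the data files.
create table if not exists logs_summary(
    day string,
    pv  bigint
)
row format delimited fields terminated by '\t';

drop table raw_logs;       -- data files remain on HDFS
drop table logs_summary;   -- data files are deleted together with the table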
View the type of table
desc formatted TABLENAME;
...
Table Type: EXTERNAL_TABLE (external table) / MANAGED_TABLE (internal table)
...

4.8 Partitioned table operations
A partitioned table divides a table's data into several partitions; each partition gets an independent subdirectory under the table's storage directory, named in the form "partitionKey=value", e.g. "month=201902", and the data files for that partition are stored in that directory. When querying, you can restrict to specific partitions in the where clause, which to some extent reduces the amount of data read at once. For example, partitioning by day keeps each day's data stored separately, which is also easy to manage.
(1) create a partition table
create table dept_partition(
    deptno int, dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';

Here partitioned by (month string) defines the partition field's name and type. Note that this field is not a field of the data itself; it is defined artificially, and it is displayed when you view the table data with select.
(2) load data to the specified partition
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201809');
The partition must be specified as month=value, otherwise an error is reported. A directory named month=201809 is created under the table directory and the data file is saved into it.

load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201807');
If the partition already exists, the new data file is placed in the same partition directory.
(3) View partition table data
select * from dept_partition where month='201809';
_u3.deptno  _u3.dname    _u3.loc    _u3.month
10          ACCOUNTING   NEW YORK   201809
You can see that month is displayed as a field.
(4) Add partitions
Create a single partition:
hive (default)> alter table dept_partition add partition(month='201806');
Create multiple partitions at once (separate them with spaces):
hive (default)> alter table dept_partition add partition(month='201805') partition(month='201804');
(5) delete partition
Delete a single partition:
hive (default)> alter table dept_partition drop partition(month='201804');
Delete multiple partitions at once (separate them with commas):
hive (default)> alter table dept_partition drop partition(month='201805'), partition(month='201806');
(6) check how many partitions there are in the partition table
hive> show partitions dept_partition;
(7) View the partition table structure
hive> desc formatted dept_partition;    -- desc dept_partition shows only simple information
...
# Partition Information
# col_name    data_type    comment
month         string
...
(8) Secondary partition table
Create a secondary (two-level) partitioned table:
create table dept_partition2(
    deptno int, dname string, loc string
)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';
In effect two partition fields are defined.

Load data into the secondary partition:
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition2 partition(month='201809', day='13');

View data:
hive (default)> select * from dept_partition2 where month='201809' and day='13';
(9) Storage structure of partition table
A directory with the name month=xxx will be created under the table directory. The secondary partition is similar.
(10) how to manually create partition directories and upload data
If you manually create a partition directory and upload data into it, the data cannot be queried directly through select. Handle it in one of the following ways:
Method 1: repair the table with msck repair table tb_name.
Method 2: manually add the partition with alter table tb add partition(xxxx).
Method 3: manually load the data into the specified partition with load.

4.9 Modify a table

Rename a table:
ALTER TABLE table_name RENAME TO new_table_name;

Update a column (change a single field):
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name];
Example: alter table dept_partition change column deptdesc desc int;

Add columns:
alter table dept_partition add columns(deptdesc string);

Replace columns (note: this replaces all fields of the entire table):
alter table dept_partition replace columns(deptno string, dname string, loc string);

4.10 Drop a table

drop table TB_NAME;

5. DML: data operations

5.1 Data import

(1) load
hive> load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1, ...)];
local: load data from the local file system into the Hive table; otherwise data is loaded from HDFS. When loading from HDFS, the file is moved (cut) into the table's path, whether the table is external or internal.
overwrite: overwrite the data already in the table; otherwise the data is appended. Without overwrite, loading a file with the same name causes the new file to be renamed, e.g. with a _copy_1 suffix.
partition: load into the specified partition.

(2) insert
Create a partitioned table to make the examples easier to follow:
hive (default)> create table student2(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';
Basic insert:
insert into table student partition(month='201809') values(1,'wangwu');
Insert from a single-table query:
insert overwrite table student2 partition(month='201808') select id, name from student where month='201809';
Multi-insert from one query:
from student
insert overwrite table student2 partition(month='201807') select id, name where month='201809'
insert overwrite table student2 partition(month='201806') select id, name where month='201809';

(3) as select: create a new table from a query and import the data
create table if not exists student3 as select id, name from student;
Generally used for result tables, to save query results into another table.

(4) location: specify the data location when creating the table
create table if not exists student5(id int, name string) row format delimited fields terminated by '\t' location '/user/hive/warehouse/student5';
Then upload the data directly into the directory specified above and you can query it with select without a manual load. If location is not specified, the table directory defaults to /user/hive/warehouse/xxxx/ and a manual load of the data is required.

(5) import
Only data exported with export can be imported:
import table student2 partition(month='201809') from '/user/hive/warehouse/export/student';

5.2 Data export

(1) insert export
insert overwrite [local] directory '/opt/module/datas/export/student1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from student;
local: export to the local file system; otherwise to HDFS. Field delimiters, line delimiters, etc. can be specified.

(2) hadoop command export
Because Hive's data files are plain text, you can export them directly through HDFS:
hive (default)> dfs -get /user/hive/warehouse/student/month=201809/000000_0 /opt/module/datas/export/student3.txt;
(3) export
export table default.student to '/user/hive/warehouse/export/student';

5.3 Clear data (truncate)

truncate table TB_NAME;

6. Queries

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]
Several tables are involved in the following example:
Employee table emp: empno, ename, job, mgr, hiredate, sal, comm, deptno (employee number, employee name, position, manager, hire date, salary, bonus, department number)
Department table dept: deptno, dname, loc (department number, department name, location)
Location table location: loc, loc_name (location label, location name)

6.1 select ... from

select field_name from table_name
A field can be given an alias: field_name as alias

(1) Arithmetic operations on fields
Example: select sal + 1 from emp;

(2) Commonly used functions
1) Total row count (count): hive (default)> select count(*) cnt from emp;
   This can also be written as count(1), which is more efficient than count(*).
2) Maximum salary (max): hive (default)> select max(sal) max_sal from emp;
3) Minimum salary (min): hive (default)> select min(sal) min_sal from emp;
4) Sum of salaries (sum): hive (default)> select sum(sal) sum_sal from emp;
5) Average salary (avg): hive (default)> select avg(sal) avg_sal from emp;

(3) limit N
limit returns the first N rows.

6.2 where

Conditions usable in where:
(1) Comparison operators: the usual greater-than / less-than comparisons, between ... and, in, is null; usage is similar to MySQL.
(2) A like B: simple wildcard matching on strings; % stands for zero or more (arbitrary) characters, _ stands for a single character. A rlike B: matching with a standard regular expression.
(3) Logical operators: AND, OR, NOT.

6.3 Grouping

(1) group by
GROUP BY is usually used together with aggregate functions: rows are grouped by one or more columns and an aggregation is then performed on each group.
Case practice:
(1) Compute the average salary of each department in the emp table, grouped by department number:
hive (default)> select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
(2) Compute the maximum salary for each position in each department, grouped by department number and job:
hive (default)> select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;
Note: with group by, the non-aggregated fields in the select list must appear in the group by clause. For example, select id, name from a group by id, name; cannot be written as select id, name from a group by id;.

(2) having
having is used like where, with the following differences:
(1) where filters rows based on the table's columns; having filters based on the columns of the query result.
(2) Aggregate functions cannot be written after where, but they can be used after having.
(3) having is only used with group by grouping statistics.
Example: departments whose average salary is greater than 2000:
hive (default)> select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

6.4 join
Hive supports only equi-joins, not non-equi-joins. Also note that the on clause of a join does not support combining multiple conditions with or.
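A small illustration of the restriction (a sketch; the exact behavior depends on the Hive version, and newer releases have relaxed some of it):

-- Supported: a pure equi-join condition.
select e.ename, d.dname
from emp e join dept d
on e.deptno = d.deptno;

-- Not supported on Hive 1.x: OR between conditions in ON, e.g.
--   ... join dept d on e.deptno = d.deptno or e.ename = d.dname
-- One workaround is to rewrite the query as a UNION of two equi-joins
-- (deduplicating if necessary); non-equi conditions can sometimes be
-- moved into the WHERE clause instead, at the cost of a larger join.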
(1) Inner join
select xx from A alias1 join B alias2 on <equality condition>
Example: join the employee and department tables on department number:
select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;

(2) Left outer join:
All records in the table to the left of the JOIN operator that match the WHERE clause will be returned.
The left table is the base: rows of the right table that do not match any left-table row are dropped; if several right-table rows match the same left-table row, that row is shown multiple times; and if a left-table row has no match in the right table, the right table's columns are shown as NULL.
select xx from A alias1 left join B alias2 on <equality condition>
Example: select e.empno, e.ename, d.deptno from emp e left join dept d on e.deptno = d.deptno;

(3) Right outer join
Like the left outer join, but with the right-hand (latter) table as the base table.
select xx from A alias1 right join B alias2 on <equality condition>
Example: select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;

(4) Full outer join
All records of the two tables are displayed, and those that do not match are displayed as null
select xx from A alias1 full join B alias2 on <equality condition>
Example: select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;

(5) Multi-table join
Joining n tables requires at least n-1 join conditions. For example, joining three tables requires at least two join conditions.
Example:
SELECT e.ename, d.deptno, l.loc_name
FROM emp e
JOIN dept d ON d.deptno = e.deptno
JOIN location l ON d.loc = l.loc;

In most cases, Hive starts one MapReduce job for each pair of JOIN objects. In this example it first starts a MapReduce job to join table e and table d, then another MapReduce job to join the output of the first job with table l.
Note: why not join table d and table l first? Because Hive always executes from left to right.

6.5 Sorting

(1) Global sorting: order by
Order By performs a global sort; in essence a single MapReduce stage is started to sort everything.
1) ASC (ascend): ascending, the default; DESC (descend): descending.
2) The ORDER BY clause goes at the end of the SELECT statement.
3) Case practice:
(1) Query employee information in ascending order of salary: hive (default)> select * from emp order by sal;
(2) Query employee information in descending order of salary: hive (default)> select * from emp order by sal desc;

(2) Sorting by an alias
Sort by twice the employee salary:
select ename, sal*2 twosal from emp order by twosal;

(3) Sorting by multiple columns
Ascending by department and salary:
hive (default)> select ename, deptno, sal from emp order by deptno, sal;

(4) Local sorting within each reducer: sort by
sort by sorts within each partition of the MapReduce job; each reducer outputs its own sorted result, and no global sort over all reducers is performed.
1) Set the number of reducers: hive (default)> set mapreduce.job.reduces=3;
2) View the number of reducers: hive (default)> set mapreduce.job.reduces;
3) Sort employee information by employee number in descending order: hive (default)> select * from emp sort by empno desc;

(5) Partitioning field: distribute by
Distribute By is similar to partition in MR: it specifies which field the rows are partitioned by. This "partition" is not the same as the table partitions above: here it refers to partitions inside the MR program, whereas above it refers to storage partitions. It is used together with sort by.
Note: Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause. To test distribute by, be sure to allocate multiple reducers, otherwise you cannot see its effect.
(1) Partition by department number, then sort by employee number in descending order:
hive (default)> set mapreduce.job.reduces=3;
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;
The result is written to files here; because the data is partitioned, multiple files are produced (one per reducer), which makes it easy to inspect the partitioning result.

(6) cluster by: partition plus sort
When the distribute by and sort by fields are the same, cluster by can be used. cluster by has the functionality of both distribute by and sort by, but the sort can only be ascending; ASC or DESC cannot be specified. That is, it partitions by the specified field and sorts by that field.
1) The following two statements are equivalent:
hive (default)> select * from emp cluster by deptno;
hive (default)> select * from emp distribute by deptno sort by deptno;
Note: partitioning by department number does not mean one fixed value per partition; departments 20 and 30 may end up in the same partition.
6.6 Bucketing (bucketed tables)
Partitioning operates on the data's storage path; bucketing operates on the data files themselves. After bucketing, the original data file is split directly into multiple bucket files, one file per bucket. Partitioning merely places whole data files into different partition directories and does not split the files.
Partitioning provides a convenient way to isolate data and optimize queries. However, not every dataset can be partitioned sensibly, in particular because of the difficulty, mentioned earlier, of determining an appropriate partition size. The purpose of bucketing is to automatically split a large data file into several parts according to the bucketing field; that is what bucketing does.
(1) Prepare data

create table stu_buck(id int, name string)
clustered by (id)
into 4 buckets
row format delimited fields terminated by '\t';

Field explanation: clustered by (id) — the bucketing field; into 4 buckets — the number of buckets.

Original data (student.txt):
1001 ss1
1002 ss2
1003 ss3
1004 ss4
1005 ss5
1006 ss6
1007 ss7
1008 ss8
1009 ss9
1010 ss10
1011 ss11
1012 ss12
1013 ss13
1014 ss14
1015 ss15
1016 ss16

The corresponding parameters need to be enabled:
hive (default)> set hive.enforce.bucketing=true;
hive (default)> set mapreduce.job.reduces=-1;

Then create an identical table without buckets and import the data into it:
create table stu2(id int, name string) row format delimited fields terminated by '\t';

Import the data into the bucketed table through a subquery (a bucketed table can only be loaded this way):
insert overwrite table stu_buck select id, name from stu2;

Check the bucketed data:
hive (default)> select * from stu_buck;
OK
stu_buck.id stu_buck.name
1016 ss16
1012 ss12
1008 ss8
1004 ss4
1009 ss9
1005 ss5
1001 ss1
1013 ss13
1010 ss10
1002 ss2
1006 ss6
1014 ss14
1003 ss3
1011 ss11
1007 ss7
1015 ss15

Normally the rows would be displayed in id order, but clearly they are not: this is the effect of bucketing.
View the bucketed table's information: desc formatted stu_buck;

(2) Bucket sampling queries

For very large datasets, users sometimes need a representative subset of the results rather than all of them. Hive meets this need by sampling the table.
Sample the data in the stu_buck table:
hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
Note: tablesample is the sampling clause; syntax: TABLESAMPLE(BUCKET x OUT OF y).
y must be a multiple or a factor of the table's total number of buckets. Hive decides the sampling proportion based on y: for a table with 4 buckets, y=2 samples 4/2 = 2 buckets' worth of data, and y=8 samples 4/8 = 1/2 of one bucket's data.
(Personal understanding) y represents the bucket interval between draws; e.g. y=2 means one bucket is skipped between draws. If y is greater than the number of buckets, the remaining interval sampling continues within the first starting bucket; e.g. with 4 buckets and y=12, only the first chosen bucket is sampled and a fraction of its data is drawn.
x indicates which bucket to start drawing from; e.g. with 4 buckets, tablesample(bucket 4 out of 4) draws 1 bucket in total, namely the data of the fourth bucket.
Note: the value of x must be less than or equal to y, otherwise:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

7. Functions

7.1 Built-in functions

1) List the functions that come with the system: hive> show functions;
2) Show a function's usage: hive> desc function upper;
3) Show a function's usage in detail: hive> desc function extended upper;

7.2 Custom functions (UDF)
If the built-in functions cannot meet your requirements, you can write your own, called UDFs (user-defined functions).
User-defined functions fall into three categories:
(1) UDF (User-Defined Function): one row in, one row out.
(2) UDAF (User-Defined Aggregation Function): aggregate functions, similar to count/max/min.
(3) UDTF (User-Defined Table-Generating Function): one row in, many rows out, e.g. lateral view explode().
To customize a function:
(1) Inherit org.apache.hadoop.hive.ql.exec.UDF.
(2) Implement an evaluate method; evaluate supports overloading.
(3) Create the function in Hive's command-line window:
    a) Add the jar: add jar linux_jar_path;
    b) Create the function: create [temporary] function [dbname.]function_name AS class_name;
    The function can then be called as function_name.
(4) Delete the function in Hive's command-line window:
    drop [temporary] function [if exists] [dbname.]function_name;

8. Compression and storage
When Hive uses MapReduce underneath, using compression requires that Hadoop itself has the corresponding configuration enabled and the jar dependencies for the compression format available. This is not repeated here; see the earlier configuration in "hadoop - Compression". Below is the Hive-side configuration.
8.1 Hive compression configuration

(1) Enable map-output compression
Enabling compression of the map output stage reduces the amount of data transferred between map and reduce tasks in a job. The configuration is as follows:
1) Enable Hive intermediate data compression: hive (default)> set hive.exec.compress.intermediate=true;
2) Enable map output compression in MapReduce: hive (default)> set mapreduce.map.output.compress=true;
3) Set the compression codec for map output in MapReduce, here Snappy:
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4) Run a query: hive (default)> select count(ename) name from emp;

(2) Enable reduce-output compression
When Hive writes output into a table, that output can also be compressed. The property hive.exec.compress.output controls this. It defaults to false so that the default output is an uncompressed plain-text file; set it to true in a query or script to turn output compression on.
1) Enable Hive final output compression: hive (default)> set hive.exec.compress.output=true;
2) Enable MapReduce final output compression: hive (default)> set mapreduce.output.fileoutputformat.compress=true;
3) Set the MapReduce final output compression codec:
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4) Set MapReduce final output compression to block compression: hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
5) Test whether the output is a compressed file (if compression is enabled, the output files are in compressed format):
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;

8.2 Hive data file storage formats

(1) Storage formats
Hive storage formats fall into two main categories: row storage and column storage. The main formats are:
Row storage: TEXTFILE (text format), SEQUENCEFILE (binary format).
Column storage: ORC (Optimized Row Columnar; compression is enabled by default and uses zlib), PARQUET. The storage principles of these two formats can be studied on your own; they are not described in detail here.
Characteristics of row storage: a row of data is stored contiguously, so when querying a whole row that meets some condition, row storage only needs to locate one of the values and the rest are adjacent, which makes such queries faster; column storage would have to go to each column's block to find that row's value for every field.
Characteristics of column storage: because each field's data is stored together, queries that only need a few fields read much less data; and since all values in a column have the same type, better-targeted compression algorithms can be designed.
Figure 8.1 Column and row storage in Hive
This is the architecture diagram of row and column storage, and you can see the obvious difference.
(2) Create a table with a specified storage format
Use the stored as xxx clause to specify the storage format. Example:

create table log_text(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;

Here textfile is specified, which is also the default storage format. By default the columnar formats are compressed while row storage is not, so the general ranking by compression ratio (smallest files first) is: ORC > Parquet > textFile.

8.3 Combining storage format and compression
Create a table to specify the storage format and compression method:
create table log_orc_snappy(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="SNAPPY");

Here tblproperties specifies working parameters for ORC in key=value form; orc.compress selects the compression format ORC uses, in this case SNAPPY.

9. Hive in the production environment
In production, Hive is commonly used to store structured data. For example, data is typically read from the production database (such as MySQL) on a schedule and written into Hive, and other programs then read it from Hive for analysis and processing; Hive thus acts as the data warehouse.
9.1 A pitfall when importing data from MySQL into Hive
Both MySQL and Hive have an int type, but their representable ranges differ: MySQL's int-family types can represent a wider range of numbers than Hive's int. This causes a problem: when an int value in MySQL exceeds Hive's limit, it cannot be parsed correctly in Hive and the import usually fails.
To avoid this, it is recommended to declare all fields of tables imported into Hive as string, which sidesteps these problems. When a field needs its real type, convert it at the point of use.
The same applies in Spark SQL; after all, neither is fully compatible with standard SQL.
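A small sketch of this pattern; the table and column names here are hypothetical, not from the original article:

-- All columns imported from MySQL are declared as string.
create table if not exists ods_orders(
    order_id   string,
    user_id    string,
    amount     string,
    created_at string
)
row format delimited fields terminated by '\t';

-- Convert to the real types only at the point of use.
select
    cast(order_id   as bigint)        as order_id,
    cast(amount     as decimal(10,2)) as amount,
    cast(created_at as timestamp)     as created_at
from ods_orders
where cast(amount as decimal(10,2)) > 100;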