This article mainly introduces an example-based analysis of Hive, Sqoop, and Flume. It should serve as a useful reference; interested readers can follow along, and I hope you gain a lot from reading it. Let's take a look.
Part One: Hive Overview

HDFS => storage of massive data
MapReduce => analysis and processing of massive data
YARN => cluster resource management and job scheduling

Section 1: Background of Hive's Creation
If you directly use MapReduce to deal with big data, you will face the following problems:
- MapReduce development is difficult and the learning cost is high (WordCount is the "Hello World" of MapReduce, and even that takes a fair amount of code).
- HDFS files have no field names or data types, so it is not convenient to manage the data effectively.
- Developing directly with the MapReduce framework means long project cycles and high cost.
Hive is a data warehouse tool built on Hadoop. It can map structured data files to a table (similar to a table in an RDBMS) and provides SQL-like query capabilities. Hive was open-sourced by Facebook, originally to handle statistics over massive volumes of structured log data.
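As a rough sketch of what "mapping a structured data file to a table" looks like in practice, here is a minimal HiveQL example; the table name web_logs, the columns, and the HDFS path are hypothetical and are not taken from the original article.

-- Hypothetical example: expose tab-delimited log files already stored in HDFS as a queryable table.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
    ip         STRING,
    url        STRING,
    visit_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/web_logs';  -- HDFS directory that holds the raw files

Because the table is EXTERNAL, Hive only records the schema and location; the files themselves stay where they are on HDFS.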
* The essence of Hive is to convert SQL into MapReduce tasks for execution.
* Data storage is provided by HDFS at the bottom layer.
* Hive can be understood as a tool that turns SQL into MapReduce jobs (a small example follows below).
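To make the "SQL in, MapReduce out" idea concrete, the following query against the hypothetical web_logs table above is the kind of statement Hive compiles into MapReduce (or Tez/Spark) jobs behind the scenes; writing the equivalent job by hand would require a custom mapper and reducer.

-- Count visits per URL; Hive turns this single statement into one or more distributed jobs.
SELECT url, COUNT(*) AS visits
FROM web_logs
GROUP BY url
ORDER BY visits DESC
LIMIT 10;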
A data warehouse (Data Warehouse) is a subject-oriented, integrated, relatively stable collection of data that reflects historical change, used mainly to support management and decision-making. (Proposed by Bill Inmon, the father of the data warehouse, in 1991.)
* The purpose of a data warehouse: to build an integrated, analysis-oriented data set and provide decision support for the enterprise.
* The data warehouse itself does not produce data; its data comes from external sources.
* It stores a large amount of data, and analyzing and processing that data inevitably involves Hive.

Section 2: Comparing Hive and RDBMS
Because Hive uses HQL (Hive Query Language), a query language similar to SQL, it is easy to think of Hive as a database. In fact, from a structural point of view, Hive and a traditional relational database have little in common beyond the similar query language.
* Query language. HQL and SQL are highly similar. Because SQL is widely used in data warehousing, an SQL-like query language, HQL, was designed specifically for Hive; developers who are familiar with SQL can pick up Hive easily.
* Data scale. Hive stores massive amounts of data: because it is built on a cluster and can use MapReduce for parallel computation, it supports very large data sets. An RDBMS handles comparatively small data sets.
* Execution engine. Hive's engines are MR/Tez/Spark/Flink, and most Hive queries execute through the MapReduce framework provided by Hadoop. An RDBMS usually has its own execution engine.
* Data storage. Hive stores data on HDFS, while an RDBMS keeps its data on the local file system or on raw devices.
* Execution speed. Hive is relatively slow and an RDBMS relatively fast. Hive holds large volumes of data, its queries usually have no index and require a full table scan, and using MapReduce as the execution engine adds latency. An RDBMS typically accesses data through indexes, so its latency is low, but only while the data is small; once the data outgrows what the database can handle, Hive's parallel computation clearly shows its advantage.
* Scalability. Hive scales horizontally, and its scalability matches that of Hadoop (a Hadoop cluster can easily exceed 1000 nodes). An RDBMS usually scales vertically; because of the strict semantics of ACID, its horizontal scalability is very limited. Even the most advanced parallel database, Oracle, has a theoretical scalability of only about 100 nodes.
* Data update. Hive is not friendly to data updates, whereas an RDBMS supports frequent, fast updates. Hive is designed for data warehouse workloads, which are read-heavy and write-light; rewriting data is not recommended in Hive, and all data is determined at load time (see the small sketch after this list). Data in an RDBMS, by contrast, needs to be updated frequently and quickly.
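As a small illustration of the "data update" point above, data usually enters Hive as whole files or partitions rather than as row-level updates. The sketch below is hypothetical (the partitioned table web_logs_partitioned and the staging path are assumptions, not from the original article):

-- Load a day's worth of files into a partition; no row-by-row UPDATE is involved.
LOAD DATA INPATH '/staging/web_logs/2024-01-01'
INTO TABLE web_logs_partitioned
PARTITION (dt = '2024-01-01');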
Installing the python-devel development package
1. Overview. Sometimes an error like the following is reported when installing certain software:
Error: must have python development packages for 2.4, 2.5, 2.6 or 2.7. Could not find Python.h. Please install python2.4-devel, python2.5-devel, python2.6-devel or python2.7-devel
This is due to the lack of python development packages.
2. Solution
If you are using a CentOS system, or any system that supports yum, you can install it as follows:
yum search python | grep -i devel
Use the command above to find the devel package, then run the following command to install it:
yum install python-devel.x86_64

Thank you for reading this article carefully. I hope this example analysis of hive+Sqoop+Flume has been helpful to you.