How to analyze big data

2025-04-04 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Today I will introduce how to analyze big data. I hope the following walkthrough is helpful to you.

What is big data?

Big data is commonly defined this way: instead of taking a shortcut such as random sampling (a sample survey), you analyze and process the entire data set.

This definition conveys at least two points:

1. Big data involves a huge volume of data.

2. Big data allows no shortcuts and places higher demands on analysis and processing technology.

The processing flow of big data

The following figure shows the data processing flow:

1. The bottom layer consists of hundreds of billions of records from data sources, which can be SCM (supply chain data), 4PL (logistics data), CRM (customer data), website logs, and other data.

2. The second layer is the data processing layer, in which data engineers extract, clean, transform, and load the data according to standard statistical definitions and indicators (the whole process is referred to as ETL).

3. The third layer is the data warehouse. The processed data flows into the data warehouse for integration and storage, forming one data mart after another.

A data mart is a collection of data classified and stored by subject, that is, data organized according to the needs of different departments or users.

4. The fourth layer is BI (Business Intelligence), which analyzes, models, mines, and computes on the data according to business requirements, producing a unified data analysis platform.

5. The fifth layer is the data access layer, which exposes data to different consumers with different roles and permissions, driving the business.

The sheer scale of big data determines how difficult it is to process and apply, and specific technical tools are needed to handle it.


Big data processing technology

Take Hadoop, the most commonly used framework, as an example:

Hadoop is an open-source framework developed by Apache that allows big data to be stored and processed in a distributed environment, using a simple programming model across a cluster of computers.

A cluster means two or more servers acting as nodes to provide data services together. A single server cannot handle the huge volume of big data; the more servers there are, the more powerful the cluster.

Hadoop is more like a data ecosystem, with different modules each performing their own function. The following picture is the ecosystem map from Hadoop's official website.


Hadoop's logo is an elephant. There are different accounts of its origin online; some say the elephant symbolizes a behemoth, referring to big data, which Hadoop renders manageable. Officially, the name comes from founder Doug Cutting's son, who named his toy elephant "Hadoop".

As can be seen from the picture above, the core of Hadoop is HDFS, YARN, and MapReduce. Let's go through the meaning and function of the major modules.

1. HDFS (Hadoop Distributed File System)

Data is distributed across the nodes of the cluster in the form of blocks. When using HDFS, you don't need to care which node a block is stored on or read from; you manage and store data through the file system just as you would a local one.
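The block-based storage idea can be sketched as a toy model in Python. This is only an illustration of the concept; the block size, node names, and round-robin placement are simplifications, not real HDFS behavior (real HDFS uses 128 MB blocks by default and replicates each block, typically three times):

```python
# Toy model of HDFS-style block storage: split a "file" into
# fixed-size blocks and assign each block to a cluster node.

BLOCK_SIZE = 8  # bytes per block (illustrative; HDFS defaults to 128 MB)

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes):
    """Assign blocks to nodes round-robin; returns {node: [blocks]}."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

data = b"the quick brown fox jumps over the lazy dog"
blocks = split_into_blocks(data)
placement = place_blocks(blocks, ["node1", "node2", "node3"])

# Reading the file back means concatenating the blocks in order,
# regardless of which node holds each one - the caller never sees nodes.
restored = b"".join(blocks)
```

The point of the sketch is the last line: the user of the file system works with whole files, while block placement across nodes stays hidden.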

2. MapReduce (distributed computing framework)

The distributed computing framework distributes a computation over a large data set to different nodes, and each node periodically reports back the work it has completed and its latest status. You can understand the principle of MapReduce through the following figure:


Consider counting the words in some input text:

With a centralized approach, we would first count how many times one word, such as Deer, appears, then count the next word, and so on until every word had been tallied, wasting a great deal of time and resources.

With distributed computing, the job becomes efficient: we split the data across three nodes, each node counts the words in its own chunk, and the counts for the same word are then aggregated to produce the final result.
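The word-count flow described above can be sketched in plain Python. The three phases below are toy stand-ins for the map, shuffle, and reduce steps of MapReduce, not the Hadoop API, and the sample chunks are the classic ones from word-count diagrams:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this node's chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(mapped):
    """Shuffle: group all pairs with the same key (word) together."""
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# The input is split across three "nodes", each mapping its own chunk.
chunks = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(shuffle_phase(mapped))
# result == {"Deer": 2, "Bear": 2, "River": 2, "Car": 3}
```

In real Hadoop, each map task runs on the node holding its chunk, and the shuffle moves data over the network so that all counts for one word land on the same reducer.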

3. YARN (resource scheduler)

It is the cluster's resource manager and scheduler, much like a computer's task manager.

4. HBase (distributed database)

HBase is a non-relational (NoSQL) database. In some business scenarios, storing and querying data is more efficient in HBase.

The difference between relational and non-relational databases will be discussed in detail in future articles.

5. Hive (data warehouse)

Hive is a data warehouse tool built on top of Hadoop that lets you query and analyze data in HDFS using the SQL language by translating queries into MapReduce tasks. Its advantage is that users do not need to write MapReduce jobs; mastering SQL is enough to complete query and analysis work.
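To illustrate the convenience this offers, the word count from the MapReduce section can be expressed as a single SQL query. Here SQLite stands in for Hive purely to show the SQL side in a runnable form; Hive itself would compile an equivalent query into MapReduce jobs running over files in HDFS:

```python
import sqlite3

# A tiny "table" of words, standing in for a Hive table backed by HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
for line in ["Deer Bear River", "Car Car River", "Deer Car Bear"]:
    conn.executemany("INSERT INTO words VALUES (?)",
                     [(w,) for w in line.split()])

# One declarative query replaces a hand-written map/shuffle/reduce job.
rows = conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
).fetchall()
# rows == [("Bear", 2), ("Car", 3), ("Deer", 2), ("River", 2)]
```

The `GROUP BY` does exactly what the shuffle and reduce phases did by hand: it gathers identical words together and aggregates their counts.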

6. Spark (big data computing engine)

Spark is a fast, general-purpose computing engine designed for large-scale data processing.

7. Mahout (Machine Learning Mining Library)

Mahout is a scalable machine learning and data mining library.

8. Sqoop (data transfer tool)

Sqoop can import data from a relational database into Hadoop's HDFS, or export data from HDFS back into a relational database.

In addition to the above modules, Hadoop also has ZooKeeper, Chukwa, and others. Because it is open source, more and more efficient modules will keep appearing; if you are interested, you can learn about them online.

Through Hadoop's powerful ecosystem, the whole big data processing flow can be completed.

The above is the whole of how to analyze big data. For more on this topic, you can search previous articles or browse related ones. I hope you found it helpful!
