
What are the five general steps in the big data development process?

2025-04-06 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

What are the five general steps in the big data development process? This article addresses that question with a detailed analysis, in the hope of helping readers who face it find a simple, workable approach.

The big data development process is shown in figure 1-1.

Figure 1-1 General steps of big data development

The figure above shows only a simplified sequence of steps. In actual development, some steps may be unnecessary, others may need to be added, and some processes may be more complex, depending on the situation.

Let's take the Google search engine as an example to illustrate the above steps.


1. Big data collection

Google's data comes from web pages on the Internet, which are crawled by Google Spider (also called a spider, crawler, or robot). The crawling principle is simple: simulate human browsing behavior by visiting each web page and saving its content.
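The visit-and-save loop can be sketched as a breadth-first traversal. The sketch below runs over a tiny in-memory "web" (all URLs and page contents are hypothetical stand-ins; a real spider would perform an HTTP GET where noted):

```python
from collections import deque

# A tiny in-memory "web": URL -> (page content, outbound links).
# All URLs here are hypothetical stand-ins for real pages.
TINY_WEB = {
    "http://a.example": ("page A", ["http://b.example", "http://c.example"]),
    "http://b.example": ("page B", ["http://c.example"]),
    "http://c.example": ("page C", []),
}

def crawl(seed):
    """Breadth-first crawl: visit each page once and save its content."""
    saved = {}                        # URL -> stored page content
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        content, links = TINY_WEB[url]  # a real spider would do an HTTP GET here
        saved[url] = content            # "save the content of the web page"
        for link in links:
            if link not in seen:        # never fetch the same URL twice
                seen.add(link)
                queue.append(link)
    return saved

pages = crawl("http://a.example")
```

Starting from one seed page, the spider discovers the rest of the graph by following links, which is how a handful of seed URLs can grow into billions of crawled pages.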

Google Spider is a program that runs on Google's servers around the world. The spiders are very diligent, working day and night.

According to Google's 2008 figures, the spiders visited about 20 billion web pages a day and tracked about 30 billion individual URLs in total.

It is fair to say that as long as a website does not block Spider access in its robots.txt file, its pages will be crawled to Google's servers within a very short time.

The web pages of the entire world constitute a typical big data set, so what Google Spider does is a typical big data collection job.

2. Big data preprocessing

The web pages crawled by Google Spider are not uniform in either format or structure. To facilitate subsequent processing, some work is needed before storage, for example transcoding every page into one unified character encoding. This work is preprocessing.
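The transcoding step can be sketched as a small function that decodes a fetched page using its declared charset and re-encodes it as one unified encoding (UTF-8 is assumed here as the target; the GBK sample page is a hypothetical example):

```python
def transcode(raw: bytes, declared_charset: str) -> bytes:
    """Decode a fetched page with its declared charset and re-encode as UTF-8,
    so every page entering storage uses one unified encoding."""
    text = raw.decode(declared_charset, errors="replace")  # tolerate bad bytes
    return text.encode("utf-8")

# A page served in GBK (a common Chinese encoding) becomes UTF-8 before storage.
gbk_page = "大数据".encode("gbk")
utf8_page = transcode(gbk_page, "gbk")
```

With `errors="replace"`, malformed bytes become replacement characters instead of crashing the pipeline, which matters when crawling billions of pages of uneven quality.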

3. Big data storage

After the web pages have been preprocessed, they can be stored on Google's servers.

In 2008, Google had indexed 1 trillion web pages around the world, and by 2014, that number had grown to 30 trillion.

In order to reduce overhead and save space, Google merges multiple web page files into one large file, usually larger than 1 GB.
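The merging idea can be sketched as packing many small page files into one blob while keeping an index of each page's offset and length, so any page can be read back without splitting the file (the filenames and contents below are hypothetical):

```python
def pack(pages: dict) -> tuple:
    """Concatenate many small page files into one large blob, recording each
    page's (offset, length) so it can be located later without splitting."""
    blob = bytearray()
    index = {}
    for name, data in pages.items():
        index[name] = (len(blob), len(data))  # where this page starts, and its size
        blob.extend(data)
    return bytes(blob), index

def read_page(blob: bytes, index: dict, name: str) -> bytes:
    """Slice one page back out of the merged blob using the index."""
    offset, length = index[name]
    return blob[offset:offset + length]

blob, index = pack({"a.html": b"<html>A</html>", "b.html": b"<html>B</html>"})
```

Storing a few huge files instead of billions of tiny ones cuts per-file metadata overhead, which is one motivation behind GFS's large fixed-size chunks.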

That was the figure about 15 years ago, when a mainstream desktop hard drive held about 60 GB; a 1 GB file could fairly be called a large file at the time.

To store these large files efficiently, reliably, and at low cost, Google invented a distributed file system built on ordinary commodity machines: the Google File System, abbreviated GFS, used to store files (also known as unstructured data).

Once the web page files are stored, the pages can be processed, for example counting which words appear on each page and how often, counting each page's outbound links, and so on.

Each piece of statistical information becomes an attribute (column) in a database table, and each web page ultimately becomes one or more records in that table.
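Turning one page's statistics into a table record can be sketched as follows (the URL, page text, and outbound links are hypothetical; a record is modeled as a plain dictionary standing in for a table row):

```python
import re

def page_record(url: str, text: str, outlinks: list) -> dict:
    """Turn one page's statistics into a flat record: one row in a table,
    with the word counts and outbound-link count as attributes."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {"url": url, "word_counts": counts, "outlink_count": len(outlinks)}

rec = page_record("http://a.example",
                  "Football news and football scores",
                  ["http://b.example"])
```

Multiplied by tens of trillions of pages, these per-page rows are exactly the "super large table" that motivates a system like Bigtable.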

Because Google stores so many web pages, more than 30 trillion, this table is also enormous. Traditional databases such as Oracle simply cannot handle such a volume of data, so Google built Bigtable, a distributed system on top of GFS, to store massive amounts of structured data (database tables).

Neither of these two systems (GFS and Bigtable) is open source; Google described only their design ideas, in the form of published papers.

Fortunately, based on these Google designs, many open source distributed file systems for massive data have emerged, such as HDFS, along with many open source distributed storage systems for massive structured data, such as HBase and Cassandra, serving different types of big data storage.

In short, if the collected big data needs to be stored, first determine its data type, then choose the storage scheme accordingly.

If storage is not needed (for example, some stream data is processed directly without being stored), skip this step and go straight to processing.

4. Big data processing

After the web pages are stored, the stored data can be processed. For a search engine, there are three main processing steps:

1) Word counting: count the number of times each word appears on each web page.

2) Inverted index: for each word, record the URLs (Uniform Resource Locator, commonly known as the web address) of the pages on which it appears, and how many times.

3) Page ranking: compute each page's rank with a specific ranking algorithm, such as PageRank; the more important the page, the higher its rank, which determines the page's position in the search results.

For example, when a user enters the keyword "football" in the search box, the search engine looks up the inverted index table to find which pages (URLs) the keyword "football" appears on, then sorts them by rank, placing the highest-ranked pages first in the results returned when the user clicks "Search".
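The inverted-index lookup in this example can be sketched as follows. The pages and rank values below are hypothetical (the ranks stand in for PageRank scores computed elsewhere):

```python
from collections import defaultdict

def build_inverted_index(pages: dict) -> dict:
    """Inverted index: word -> {URL: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return dict(index)

def search(index: dict, ranks: dict, keyword: str) -> list:
    """Look up the keyword, then order the matching URLs by page rank."""
    matches = index.get(keyword.lower(), {})
    return sorted(matches, key=lambda url: ranks[url], reverse=True)

PAGES = {
    "http://news.example": "football scores and football news",
    "http://blog.example": "my football blog",
    "http://shop.example": "buy shoes online",
}
RANKS = {"http://news.example": 0.9,   # stand-ins for PageRank values
         "http://blog.example": 0.3,
         "http://shop.example": 0.5}

results = search(build_inverted_index(PAGES), RANKS, "football")
```

The key design point is that the index is organized by word rather than by page, so answering a query touches only the pages containing the keyword instead of scanning all 30 trillion of them.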

During processing, big data often has to be read from the storage system, and the results usually have to be written back to storage. Interaction with the storage system is therefore very frequent in the processing phase.

5. Big data visualization

Big data visualization presents data graphically. Compared with raw numbers, graphics are more intuitive and make patterns in the data easier to spot.

For example, Google Analytics is a website traffic analysis tool. It collects data on each user's visits to a website and derives the site's traffic information, including daily visit counts, the most visited pages, users' average time on site, the return rate, and so on. All of this data is displayed graphically, as shown in figure 1-2.
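In the spirit of figure 1-2, the idea of turning numbers into a picture can be sketched with a minimal text bar chart (the daily visit counts below are hypothetical):

```python
def bar_chart(series: dict, width: int = 20) -> str:
    """Render a series of counts as a simple text bar chart, scaling the
    longest bar to `width` characters."""
    peak = max(series.values())
    lines = []
    for label, value in series.items():
        bar = "#" * round(value / peak * width)
        lines.append(f"{label:>4} {bar} {value}")
    return "\n".join(lines)

# Hypothetical daily visit counts for a site.
visits = {"Mon": 120, "Tue": 180, "Wed": 90, "Thu": 200}
chart = bar_chart(visits)
```

Even this crude rendering makes the Thursday peak obvious at a glance, which is exactly the point of visualizing data rather than listing it.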

Figure 1-2 Analysis of visits to the Google website

That concludes the answer to the question of the five general steps in the big data development process. I hope the above content has been of some help; if you still have questions, you can follow the industry information channel for more related knowledge.
