Three Forms of Big Data Collection

[Guide] Data acquisition is both the prerequisite for and an essential stage of big data analysis, and it plays an important role throughout the whole process. This article introduces the three main forms of big data collection: system log collection, network data collection, and other collection methods.
(1) System log collection method

System logs record information about hardware, software, and system problems, as well as events monitored in the system; they can be used to diagnose the cause of an error or to look for traces left behind by an attacker. System logs include the system log proper, application logs, and security logs (Baidu Encyclopedia). Big data platforms such as the open-source Hadoop platform produce large volumes of high-value system log data, and how to collect it has become a research hotspot. Chukwa (built on the Hadoop platform), Cloudera's Flume, and Facebook's Scribe (Li Lianning, 2016) are all representative system log collection tools. This class of collection technology can currently transfer hundreds of MB of log data per second, which meets present throughput demands. For most of us, however, the more relevant approach is not this one but network data collection.
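To make the agent pattern behind tools like Flume and Scribe concrete, here is a minimal sketch that tails a local log file and forwards new lines to a central collector. The log path and collector address are hypothetical placeholders; real agents add batching, failover, and reliable-delivery guarantees on top of this basic loop.

```python
import socket
import time

LOG_PATH = "/var/log/app/app.log"      # hypothetical log file to tail
COLLECTOR = ("collector.local", 9999)  # hypothetical central collector

def tail_and_forward(path, addr):
    """Follow a log file and forward each new line to a collector,
    loosely mimicking what a Flume/Scribe agent does."""
    with socket.create_connection(addr) as sock, open(path, "r") as f:
        f.seek(0, 2)                   # start at end of file, like `tail -f`
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)        # no new data yet; poll again
                continue
            sock.sendall(line.encode("utf-8"))

if __name__ == "__main__":
    tail_and_forward(LOG_PATH, COLLECTOR)
```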
(2) Network data acquisition method
Students of natural language processing will be familiar with this. Besides the public datasets used in day-to-day algorithm research, real projects often require collecting, preprocessing, and storing data from live web pages. There are currently two main methods of network data collection: APIs and web crawlers.
1. API
An API (application programming interface) is a programmatic interface that a site's operators provide for its users. Such an interface hides the complexity of the site's internals and lets callers request data through simple, uniform calls. Mainstream social media platforms such as Sina Weibo, Baidu Tieba, and Facebook all offer API services, and demo code can be found on their official open platforms. However, APIs are ultimately controlled by the platform's developers: to reduce server load, platforms generally cap the number of calls an account may make per day, which is a serious constraint. For this reason, we often turn to the second approach, the web crawler.
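As a minimal sketch of how such an API is typically consumed, the snippet below issues an authenticated request and backs off when the quota is exhausted. The endpoint URL, token, and parameter names are hypothetical placeholders, not any particular platform's real API.

```python
import time
import requests

API_URL = "https://api.example.com/v1/posts"   # hypothetical endpoint
TOKEN = "your-access-token"                    # hypothetical credential

def fetch_page(page):
    """Request one page of results, retrying once if rate-limited."""
    params = {"page": page, "count": 100}
    headers = {"Authorization": f"Bearer {TOKEN}"}
    resp = requests.get(API_URL, params=params, headers=headers, timeout=10)
    if resp.status_code == 429:                # quota exhausted
        wait = int(resp.headers.get("Retry-After", 60))
        time.sleep(wait)                       # back off, then retry once
        resp = requests.get(API_URL, params=params, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    data = fetch_page(1)
    print(len(data.get("items", [])), "records fetched")
```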
2. Web crawler
A web crawler (also known as a web spider or web robot; in the FOAF community, often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm (Baidu Encyclopedia). The most familiar crawlers are those behind the search engines we use every day, such as Baidu and 360 Search. These are called general-purpose crawlers, and they collect all web pages unconditionally. The working principle of a general-purpose crawler is shown in Figure 1.
Figure 1: How a general-purpose crawler works [2]
Given an initial URL, the crawler extracts and saves the required resources from the page while also extracting links to other pages found there; for each new link it sends a request, receives the response, parses the page, extracts and saves the required resources again, and so on. The procedure is not complicated to implement, but pay special attention to rotating the IP address and forging the request headers during collection, so that the site's administrators do not detect and block your IP (it has happened to me), which means the failure of the whole collection task. To meet further needs, multi-threaded crawlers and focused (topic) crawlers have also emerged. A multi-threaded crawler runs collection tasks on several threads at once; roughly speaking, using several threads multiplies collection throughput by about that factor. A focused crawler is the opposite of a general-purpose crawler: using certain strategies, it filters out page content unrelated to the topic (the collection task) and keeps only the data required, which greatly reduces the data sparsity caused by irrelevant content. A single-threaded sketch of this loop follows below.
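The sketch below shows the crawl loop in miniature: a queue of URLs, a custom User-Agent header, and a simple keyword filter standing in for a focused crawler's topic strategy. The seed URL and keyword are hypothetical, and a real crawler would also respect robots.txt, throttle requests, and handle errors more carefully.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import requests

SEED = "https://example.com/"   # hypothetical starting page
TOPIC = "big data"              # hypothetical topic keyword

class LinkParser(HTMLParser):
    """Collect href targets of <a> tags while parsing a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, topic, max_pages=20):
    queue, seen, hits = deque([seed]), {seed}, []
    headers = {"User-Agent": "Mozilla/5.0 (research crawler)"}  # forged UA
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            page = requests.get(url, headers=headers, timeout=10).text
        except requests.RequestException:
            continue                     # skip unreachable pages
        if topic in page.lower():        # crude topic filter
            hits.append(url)
        parser = LinkParser()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return hits

if __name__ == "__main__":
    print(crawl(SEED, TOPIC))
```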
(3) Other collection methods

Other collection methods concern how scientific research institutes, enterprises, and government bodies can transmit confidential data safely. Dedicated system ports can be used for the transmission task, reducing the risk of data leakage.
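As one illustrative sketch of transmission over a dedicated port, the snippet below wraps a client socket in TLS using Python's standard ssl module. The host, port, and payload are hypothetical, and the article does not prescribe any particular protocol; this is only one way to protect data in transit.

```python
import socket
import ssl

HOST, PORT = "ingest.internal.example", 8443   # hypothetical dedicated endpoint

def send_securely(payload: bytes):
    """Send a payload over a TLS-wrapped TCP connection on a dedicated port."""
    context = ssl.create_default_context()      # verifies the server certificate
    with socket.create_connection((HOST, PORT)) as raw:
        with context.wrap_socket(raw, server_hostname=HOST) as tls:
            tls.sendall(payload)

if __name__ == "__main__":
    send_securely(b'{"record": "confidential sample"}')
```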
[Conclusion] Big data collection is where the big data pipeline begins, and a good beginning is half the battle, so choose the method carefully, especially when relying on crawler techniques. For most data collection tasks, a focused crawler is likely the better approach and is worth studying in depth.