This article explains in detail what crawling data means. I hope you come away with a solid understanding of the topic after reading it, and that it serves as a useful reference.
Crawling data means using a web crawler to obtain the content you need from a website, such as text, video, images, and other data. A web crawler (web spider) is a program or script that automatically fetches information from the World Wide Web according to certain rules.
What is the use of learning how to crawl data?
At the large end of the scale, think of the search engines people use every day (Google, Sogou).
When a user searches for a keyword on Google, Google analyzes the keyword and finds the pages that best match it among the pages it has "indexed". Obtaining those pages in the first place is the crawler's job; deciding which pages are most valuable to show the user also requires the appropriate ranking algorithms, which touches on data mining.
At a smaller scale, we might, for example, measure the workload of the testing team, which requires counting the number of change requests per week or month, the number of defects recorded in Jira, and their details.
Or take the recent World Cup: you might want to collect statistics on each player or country and store them for later use.
You can also analyze data out of personal interest (for example, counting the ratings of a book or movie), which requires crawling existing web page data and then doing specific analysis or statistics on what you obtained.
What basic knowledge does it take to write a simple crawler?
I divided the basics into two parts:
1. Front-end basic knowledge
HTML, CSS, JSON, Ajax
Reference:
http://www.w3school.com.cn/h.asp
http://www.w3school.com.cn/ajax/
http://www.w3school.com.cn/json/
https://www.php.cn/course/list/1.html
https://www.php.cn/course/list/2.html
https://www.html.cn/
2. Python programming knowledge
(1) Basic knowledge of Python
Basic syntax, dictionaries, lists, functions, regular expressions, JSON, and so on (a short sketch follows the reference links below).
Reference:
http://www.runoob.com/python3/python3-tutorial.html
https://www.py.cn/
https://www.php.cn/course/list/30.html
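To make that list concrete, here is a minimal, self-contained sketch that touches each of those basics (dictionaries, lists, functions, regular expressions, and JSON). The data in it is made up purely for illustration.

import json
import re

# A list of dictionaries, the shape crawled records often take.
movies = [
    {"title": "Movie A", "rating": 8.7},
    {"title": "Movie B", "rating": 9.1},
]

def average_rating(items):
    """A small function: compute the average 'rating' of a list of dicts."""
    return sum(item["rating"] for item in items) / len(items)

# A regular expression: pull a four-digit year out of a string.
match = re.search(r"\d{4}", "Released in 2018, remastered later")
year = match.group(0) if match else None

# JSON: serialize the records to text, then parse them back.
text = json.dumps(movies, ensure_ascii=False)
parsed = json.loads(text)

print(average_rating(parsed), year)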
(2) Common Python libraries (a combined usage sketch follows the reference note below):
Python's urllib library (I mostly use its urlretrieve function, mainly to save fetched resources such as documents, pictures, MP3 files, and videos)
Python's pymysql library (database connections and insert, delete, update, and query operations)
Python's bs4 module (this requires knowledge of CSS selectors, the HTML tree structure, the DOM tree, and so on; it locates the content we need via CSS selectors, HTML tags, and attributes)
Python's requests module (as the name implies, it sends GET/POST requests and returns a Response object)
Python's os module (it provides a wealth of methods for working with files and directories; I use os.path.join and os.path.exists most often)
Reference: see the API documentation of each module.
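To show how these modules fit together, here is a minimal sketch (not the author's original code): it fetches a page with requests, extracts links with bs4, and saves a resource with urlretrieve and os. The URL and CSS selector are placeholders invented for illustration, and the pymysql part is left commented out because it needs a running MySQL instance with a matching table.

import os
from urllib.request import urlretrieve

import requests
from bs4 import BeautifulSoup
# import pymysql  # only needed if you store results in MySQL

def fetch_page(url):
    """requests: send a GET request and return the page HTML."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

def extract_links(html):
    """bs4: locate content with a CSS selector (here, every <a> tag with an href)."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a.get("href")) for a in soup.select("a[href]")]

def save_image(image_url, folder="downloads"):
    """urllib + os: save a fetched resource to disk."""
    if not os.path.exists(folder):          # os.path.exists / os.makedirs
        os.makedirs(folder)
    filename = os.path.join(folder, image_url.split("/")[-1])  # os.path.join
    urlretrieve(image_url, filename)        # urllib's urlretrieve
    return filename

# def store_links(rows):
#     """pymysql: a hypothetical insert; assumes a 'links' table already exists."""
#     conn = pymysql.connect(host="localhost", user="root", password="secret",
#                            database="crawl", charset="utf8mb4")
#     with conn.cursor() as cur:
#         cur.executemany("INSERT INTO links (text, href) VALUES (%s, %s)", rows)
#     conn.commit()
#     conn.close()

if __name__ == "__main__":
    html = fetch_page("https://example.com/")      # placeholder URL
    for text, href in extract_links(html):
        print(text, href)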
Additional background:
A web crawler is a program that automatically fetches web pages. It downloads pages from the World Wide Web for a search engine and is an important component of one.
A traditional crawler starts from the URLs of one or more seed pages and collects the URLs found on them. As it crawls, it keeps extracting new URLs from the current page and putting them into a queue until some stopping condition of the system is met.
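As an illustration of that queue-driven process, here is a minimal sketch of such a crawl loop: seed URLs go into a queue, each page is fetched, newly found URLs are enqueued, and crawling stops when a page limit (the stopping condition chosen here) is reached. The seed URL and the limit are assumptions for the example.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=20):
    queue = deque(seed_urls)   # URLs waiting to be crawled
    seen = set(seed_urls)      # avoid fetching the same URL twice
    fetched = 0
    while queue and fetched < max_pages:   # stopping condition
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        fetched += 1
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.select("a[href]"):
            new_url = urljoin(url, a["href"])   # resolve relative links
            if new_url.startswith("http") and new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)           # put new URLs into the queue
    return seen

# crawl(["https://example.com/"])  # placeholder seed URL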
The workflow of a focused crawler is more complex: it must filter out links that are irrelevant to the topic according to some web page analysis algorithm, keep the useful links, and put them into the queue of URLs waiting to be crawled. It then selects the next URL from the queue according to a given search strategy and repeats the process until some system condition is reached.
In addition, all pages fetched by the crawler are stored, analyzed, filtered, and indexed by the system for later query and retrieval; for a focused crawler, the results of this analysis may also feed back into and guide subsequent crawling.
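One simple way to picture the "filter irrelevant links" step is a keyword-based relevance check. The keywords and scoring rule below are made up purely to illustrate the idea; real page analysis algorithms are considerably more sophisticated.

# A toy page-analysis filter for a focused crawler: keep only links
# whose anchor text looks relevant to the topic keywords.
TOPIC_KEYWORDS = {"world cup", "player", "goal", "match"}  # assumed topic

def relevance(anchor_text):
    """Score a link by how many topic keywords its anchor text contains."""
    text = anchor_text.lower()
    return sum(1 for kw in TOPIC_KEYWORDS if kw in text)

def filter_links(links, threshold=1):
    """Keep (text, url) pairs whose relevance meets the threshold."""
    return [(text, url) for text, url in links if relevance(text) >= threshold]

links = [("World Cup player stats", "https://example.com/stats"),
         ("Site login", "https://example.com/login")]
print(filter_links(links))  # only the topic-related link survives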
Compared with a general-purpose web crawler, a focused crawler has three main problems to solve:
(1) describing or defining the crawl target;
(2) analyzing and filtering web pages or data;
(3) a search strategy for the URL queue (a small priority-queue sketch follows this list).
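For problem (3), one common way to express a search strategy is a priority queue: the crawler always takes the highest-priority URL next. The scoring rule below (prefer shallower URLs) is just one assumed strategy for illustration, not the only one.

import heapq

class URLFrontier:
    """A minimal URL frontier: the search strategy lives in the score."""
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url, score):
        heapq.heappush(self._heap, (score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

def depth_score(url):
    """Assumed strategy: fewer path segments means higher priority (lower score)."""
    return url.rstrip("/").count("/") - 2  # ignore the two slashes in the scheme

frontier = URLFrontier()
for u in ["https://example.com/a/b/c", "https://example.com/", "https://example.com/a"]:
    frontier.push(u, depth_score(u))
print(frontier.pop())  # the site root comes out first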
That is all I will share about what crawling data means. I hope the content above is of some help to you and that you have learned something from it. If you found the article useful, feel free to share it so that more people can see it.