This article shows how data acquisition and parsing work in Python. The content is concise and easy to understand, and I hope you will get something out of this detailed introduction.
Previously we looked at the work involved in developing a crawler and some common problems. Here is a list of the technologies involved in crawler development, together with the standard and third-party libraries for each; we will cover them in turn.
Downloading data: urllib / requests / aiohttp.
Parsing data: re / lxml / beautifulsoup4 / pyquery.
Caching and persistence: pymysql / sqlalchemy / peewee / redis / pymongo.
Generating digital signatures: hashlib (see the sketch after this list).
Serialization and compression: pickle / json / zlib.
Scheduling: multiprocessing (processes) / threading (threads).
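As one example from the list above, here is a minimal sketch of generating a digital signature with hashlib; the page bytes are a placeholder. A crawler can use such fingerprints to detect duplicate pages before persisting them.

import hashlib

# Fingerprint the downloaded bytes so duplicates can be detected before storage
page_content = b'<html>...</html>'  # placeholder for a downloaded page
signature = hashlib.sha1(page_content).hexdigest()
print(signature)  # 40-character hex digest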
HTML page
<!DOCTYPE html>
<html>
<head>
    <title>Home</title>
    <style>/* cascading style sheet code is omitted here */</style>
</head>
<body>
    <h1>Yoko's Kitchen</h1>
    <nav><a href="#">Home</a> <a href="#">Classes</a> <a href="#">Catering</a> <a href="#">About</a> <a href="#">Contact</a></nav>
    <h2>Bok Choi</h2>
    <h3>Japanese Vegetarian | Five week course in London</h3>
    <p>A five week introduction to traditional Japanese vegetarian meals, teaching you a selection of rice and noodle dishes.</p>
    <h2>Teriyaki Sauce</h2>
    <h3>Sauces Masterclass | One day workshop</h3>
    <p>An intensive one-day course looking at how to create the most delicious sauces for use in a range of Japanese cookery.</p>
    <h2>Popular Recipes</h2>
    <ul><li>Yakitori (grilled chicken)</li><li>Tsukune (minced chicken patties)</li><li>Okonomiyaki (savory pancakes)</li><li>Mizutaki (chicken stew)</li></ul>
    <h2>Contact</h2>
    <address>Yoko's Kitchen<br>27 Redchurch Street<br>Shoreditch<br>London E2 7DP</address>
    <footer>&copy;2011 Yoko's Kitchen</footer>
    <script>// omit the JavaScript code here</script>
</body>
</html>
If you are familiar with the code above, you know that an HTML page is usually made up of three parts: the tags that carry the content, the CSS (cascading style sheets) responsible for rendering the page, and the JavaScript that controls interactive behavior. We can usually get a page's code and understand its structure by choosing "View page source" from the browser's right-click menu; the developer tools that browsers provide let us learn even more.
Using requests to fetch pages
GET and POST requests.
URL parameters and request headers.
Complex POST requests (file uploads).
Manipulating cookies.
Setting up a proxy server.
[Note]: the sketch below illustrates these operations; for more information about using requests, please refer to its official documentation.
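A minimal sketch of the operations listed above, assuming the httpbin.org test service, a placeholder file name, and a placeholder proxy address; it is an illustration, not a complete client.

import requests

# GET request with URL parameters and custom request headers
resp = requests.get('https://httpbin.org/get',
                    params={'q': 'python'},
                    headers={'User-Agent': 'Mozilla/5.0'})
print(resp.status_code, resp.json()['args'])

# Simple POST request with form data
requests.post('https://httpbin.org/post', data={'key': 'value'})

# Complex POST request: uploading a file (example.txt is a placeholder)
with open('example.txt', 'rb') as f:
    requests.post('https://httpbin.org/post', files={'file': f})

# Manipulating cookies
resp = requests.get('https://httpbin.org/cookies', cookies={'session': 'abc123'})
print(resp.json())

# Setting up a proxy server (the address is a placeholder)
requests.get('https://httpbin.org/ip',
             proxies={'http': 'http://127.0.0.1:8080',
                      'https': 'http://127.0.0.1:8080'})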
Page parsing
Comparison of several parsing methods
[Note]: the parsers BeautifulSoup can use include the Python standard library's html.parser, lxml's HTML parser, lxml's XML parser, and html5lib.
Parse a page using regular expressions
If you have no idea what regular expressions are, it is recommended that you first read a 30-minute introductory tutorial on regular expressions, and then read our earlier article on how to use regular expressions in Python.
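As a minimal sketch, the snippet below pulls question links out of a made-up HTML fragment with re; the markup and the pattern are purely illustrative.

import re

html = ('<a href="/question/1001">How does XPath work?</a>'
        '<a href="/question/1002">What is BeautifulSoup?</a>')

# Capture the href and the anchor text of each question link
pattern = re.compile(r'<a href="(/question/\d+)">(.+?)</a>')
for href, title in pattern.findall(html):
    print(href, title)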
XPath parsing and lxml
XPath is a syntax for finding information in XML documents: it uses path expressions to select nodes or node sets. Nodes in XPath include elements, attributes, text, namespaces, processing instructions, comments, the root node, and so on.
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
For the XML file above, we can select nodes with path expressions: /bookstore selects the root element bookstore, bookstore/book selects all book elements that are children of bookstore, //book selects all book elements anywhere in the document, and //@lang selects all attributes named lang. XPath also lets you qualify a selection with predicates, such as /bookstore/book[1] (the first book child of bookstore), /bookstore/book[last()] (the last one), //title[@lang='eng'] (title elements whose lang attribute has the value 'eng'), and /bookstore/book[price>35.00] (book elements whose price is greater than 35.00). Wildcards are supported as well, for example /bookstore/* (all child elements of bookstore) and //* (all elements in the document). Finally, several paths can be selected at once with the union operator |, as in //book/title | //book/price. The lxml sketch after the note below runs a few of these expressions.
[Note]: the examples above come from the XPath tutorial on the rookie tutorial website; interested readers can read the original text on their own.
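Here is a minimal sketch that runs some of these expressions with lxml against the bookstore document above.

from lxml import etree

xml_doc = b'''<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
    <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>'''

root = etree.fromstring(xml_doc)
print(root.xpath('//book/title/text()'))               # all book titles
print(root.xpath('/bookstore/book[1]/title/text()'))   # title of the first book
print(root.xpath('//title[@lang="eng"]/text()'))       # titles with lang="eng"
print(root.xpath('/bookstore/book[price>35.00]/title/text()'))  # books over 35.00
print(root.xpath('//book/title | //book/price'))       # union of two node sets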
Of course, if you do not understand or are not familiar with XPath syntax, you can also have a Chrome browser generate it for you: right-click the element in the developer tools' Elements panel and choose Copy → Copy XPath.
Using BeautifulSoup
BeautifulSoup is a Python library that extracts data from HTML or XML files. Working with your parser of choice, it provides idiomatic ways of navigating, searching, and modifying the document tree.
1. Traverse the document tree
Get tags
Get tag attributes
Get tag content
Get a child (grandchild) node
Get parent / ancestor node
Get sibling node
2. Search the document tree
find / find_all
select_one / select
[Note]: the sketch below walks through these operations; for more information, please refer to the official documentation of BeautifulSoup.
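A minimal sketch of the operations listed above, using a made-up HTML fragment; the 'lxml' parser is assumed to be installed (html.parser works too).

from bs4 import BeautifulSoup

html = ('<html><body>'
        '<p class="title"><b>Yoko\'s Kitchen</b></p>'
        '<p class="course"><a href="/classes/1" id="link1">Bok Choi</a></p>'
        '<p class="course"><a href="/classes/2" id="link2">Teriyaki Sauce</a></p>'
        '</body></html>')
soup = BeautifulSoup(html, 'lxml')

# 1. Traversing the document tree
print(soup.p)               # the first <p> tag
print(soup.p['class'])      # its attributes -> ['title']
print(soup.b.string)        # the text inside the <b> tag
print(soup.body.contents)   # direct children of <body>
print(soup.b.parent.name)   # parent tag name -> 'p'
print(soup.p.next_sibling)  # the sibling that follows the first <p>

# 2. Searching the document tree
print(soup.find('a', id='link2'))           # a single matching tag
print(soup.find_all('p', class_='course'))  # all matching tags
print(soup.select_one('p.title > b'))       # CSS selector, first match
print(soup.select('p.course > a'))          # CSS selector, all matches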
Using PyQuery
PyQuery is essentially a Python implementation of jQuery's API and can be used to parse HTML pages.
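A minimal sketch of jQuery-style selection with PyQuery; the fragment is made up for illustration.

from pyquery import PyQuery as pq

doc = pq('<div><p class="course"><a href="/classes/1">Bok Choi</a></p>'
         '<p class="course"><a href="/classes/2">Teriyaki Sauce</a></p></div>')
# Select with a CSS selector and iterate over the matches
for a in doc('p.course > a').items():
    print(a.attr('href'), a.text())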
Example: getting the links to questions on Zhihu Explore
from urllib.parse import urljoin
import re

import requests
from bs4 import BeautifulSoup


def main():
    # Masquerade as a well-known crawler and go out through a proxy
    headers = {'user-agent': 'Baiduspider'}
    proxies = {'http': 'http://122.114.31.177:808'}
    base_url = 'https://www.zhihu.com/'
    seed_url = urljoin(base_url, 'explore')
    resp = requests.get(seed_url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(resp.text, 'lxml')
    # Question pages are the links whose href starts with /question
    href_regex = re.compile(r'^/question')
    link_set = set()
    for a_tag in soup.find_all('a', {'href': href_regex}):
        if 'href' in a_tag.attrs:
            href = a_tag.attrs['href']
            full_url = urljoin(base_url, href)
            link_set.add(full_url)
    print('Total %d question pages found.' % len(link_set))


if __name__ == '__main__':
    main()

The above is how data acquisition and parsing work in Python. Have you learned any new knowledge or skills? If you want to learn more skills or enrich your knowledge reserve, you are welcome to follow the industry information channel.