
Data acquisition and parsing in Python

2025-03-16 Update From: SLTechnology News&Howtos


This article describes how data acquisition and parsing work in Python. The content is concise and easy to follow, and I hope you take something useful away from the detailed introduction.

In earlier instalments we looked at the work involved in developing a crawler and at some common problems. Here is a list of the technologies crawler development draws on, together with the standard-library and third-party packages for each; we will come back to them in detail later.

Downloading data: urllib / requests / aiohttp

Parsing data: re / lxml / beautifulsoup4 / pyquery

Caching and persistence: pymysql / sqlalchemy / peewee / redis / pymongo

Generating digital signatures: hashlib

Serialization and compression: pickle / json / zlib

Scheduling: multiprocessing (processes) / threading (threads)
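To illustrate the digital-signature step listed above, a common crawler trick is to hash each URL into a fixed-length fingerprint for de-duplication. A minimal sketch (the choice of MD5 and the helper name `url_fingerprint` are mine, for illustration only):

```python
import hashlib


def url_fingerprint(url: str) -> str:
    """Return a stable 32-character hex digest identifying a URL."""
    return hashlib.md5(url.encode('utf-8')).hexdigest()


# The same URL always hashes to the same fingerprint, so a set of
# fingerprints can be used to skip pages that were already crawled.
print(url_fingerprint('https://www.zhihu.com/explore'))
```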

HTML page

[The original article reproduced the markup of a sample page, "Yoko's Kitchen": a navigation bar (Home / Classes / Catering / About / Contact); two class listings (Bok Choi, "a five week introduction to traditional Japanese vegetarian meals, teaching you a selection of rice and noodle dishes", and Teriyaki Sauce, "an intensive one-day course looking at how to create the most delicious sauces for use in a range of Japanese cookery"); a list of popular recipes (Yakitori, Tsukune, Okonomiyaki, Mizutaki); and a footer with the address 27 Redchurch Street, Shoreditch, London E2 7DP. The CSS and JavaScript were omitted in the original listing.]

If you are familiar with code like the above, you will know that an HTML page is usually made up of three parts: tags that carry the content, CSS (cascading style sheets) that render the page, and JavaScript that controls interactive behavior. Usually we can obtain a page's code and understand its structure via "View Page Source" in the browser's right-click menu; the developer tools that browsers provide reveal even more.

Use requests to get the page

GET and POST requests.

URL parameters and request headers.

Complex POST requests (file uploads).

Manipulating cookies.

Setting up a proxy server.

[Note]: for more on using requests, refer to its official documentation.
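The features listed above can be sketched with requests as follows. The request is only prepared locally here, nothing is sent; httpbin.org and the proxy address are placeholders of my own, not endpoints from the original:

```python
import requests

# Build a GET request with URL parameters, a custom header and a cookie.
req = requests.Request(
    'GET',
    'https://httpbin.org/get',
    params={'q': 'python'},
    headers={'User-Agent': 'Mozilla/5.0'},
    cookies={'session': 'demo'},
)
prepared = req.prepare()           # encodes params into the URL, cookies into headers
print(prepared.url)                # https://httpbin.org/get?q=python
print(prepared.headers['Cookie'])  # session=demo

# To actually send it, optionally through a (placeholder) proxy server:
# requests.Session().send(prepared, proxies={'http': 'http://127.0.0.1:8080'}, timeout=10)

# A complex POST request with a file upload would look like:
# with open('photo.png', 'rb') as f:
#     requests.post('https://httpbin.org/post', files={'file': f})
```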

Page parsing

Comparison of several parsing methods

Roughly speaking: regular expressions (re) are fast but fragile; lxml is very fast and supports XPath, at the cost of a C-library dependency; BeautifulSoup and pyquery are the easiest to use, trading away some speed.

Note: the parsers BeautifulSoup can use include Python's standard-library html.parser, lxml's HTML parser, lxml's XML parser, and html5lib.

Parse a page using regular expressions

If you have no idea what regular expressions are, it is recommended to read a 30-minute introduction to regular expressions first; our earlier article covers how to use them in Python.
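As a small sketch of regex-based parsing (the HTML snippet is invented for illustration):

```python
import re

html = '<a href="/question/1">Q1</a><a href="/question/2">Q2</a><a href="/about">About</a>'

# Capture the href of every link that points at a question page.
pattern = re.compile(r'<a href="(/question/\d+)">')
print(pattern.findall(html))  # ['/question/1', '/question/2']
```

This is quick to write but brittle: any change in attribute order or quoting breaks the pattern, which is why the structured parsers below are usually preferred.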

XPath parsing and lxml

XPath is a syntax for finding information in XML documents. It uses path expressions to select nodes or node sets in XML documents. The XPath node here includes elements, attributes, text, namespaces, processing instructions, comments, root nodes, and so on.

<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>

For the XML file above, path expressions select nodes or node sets in the document: bookstore selects all elements named bookstore, /bookstore selects the root element, bookstore/book selects book elements that are children of bookstore, //book selects book elements anywhere in the document, and //@lang selects all attributes named lang.

When using XPath you can also attach predicates, written in square brackets, to pick out specific nodes: /bookstore/book[1] selects the first book, /bookstore/book[last()] the last one, //title[@lang='eng'] the titles whose lang attribute is 'eng', and /bookstore/book[price>35.00] the books whose price is greater than 35.00.

XPath also supports wildcards: /bookstore/* selects all child elements of bookstore, //* selects all elements in the document, and //title[@*] selects the titles that have at least one attribute.

If you want to select several node sets at once, combine path expressions with |, for example //book/title | //book/price.

[Note]: the examples above come from the XPath tutorial on the Runoob (菜鸟教程) website; interested readers can read the original.

Of course, if you don't understand or are not familiar with XPath syntax, you can copy an element's XPath from Chrome's developer tools: right-click the element in the Elements panel and choose Copy → Copy XPath.
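Putting the pieces together, the bookstore document above can be queried with lxml's XPath support. A minimal sketch:

```python
from lxml import etree

xml = '''<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>'''

root = etree.fromstring(xml)

# Select the text of every title whose lang attribute is "eng".
titles = root.xpath('//title[@lang="eng"]/text()')
print(titles)  # ['Harry Potter', 'Learning XML']

# Predicate: titles of books costing more than 35.00.
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')
print(expensive)  # ['Learning XML']
```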

The use of BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files. Paired with your preferred parser, it provides idiomatic ways of navigating, searching and modifying the parse tree.

1. Traversing the document tree

Getting tags

Getting tag attributes

Getting tag content

Getting child (and descendant) nodes

Getting parent / ancestor nodes

Getting sibling nodes

2. Searching tree nodes

find / find_all

select_one / select

[description]: for more information, please refer to the official documentation of BeautifulSoup.
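The traversal and search operations above can be sketched like this (the HTML fragment is invented for illustration):

```python
from bs4 import BeautifulSoup

html = '''<html><body>
<p class="intro">Hello <b>world</b></p>
<a href="/question/1">Q1</a>
<a href="/about">About</a>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')

print(soup.p.get_text())                  # tag content: Hello world
print(soup.p['class'])                    # tag attribute: ['intro']
print(soup.find_all('a')[0]['href'])      # find_all search: /question/1
print(soup.select_one('p.intro b').text)  # CSS-selector search: world
```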

The use of PyQuery

pyquery is essentially a Python implementation of jQuery's API and can be used to parse HTML pages.

Example: getting links to questions from Zhihu's Explore page

from urllib.parse import urljoin
import re

import requests
from bs4 import BeautifulSoup


def main():
    headers = {'user-agent': 'Baiduspider'}
    proxies = {'http': 'http://122.114.31.177:808'}
    base_url = 'https://www.zhihu.com/'
    seed_url = urljoin(base_url, 'explore')
    resp = requests.get(seed_url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(resp.text, 'lxml')
    href_regex = re.compile(r'^/question')
    link_set = set()
    for a_tag in soup.find_all('a', {'href': href_regex}):
        if 'href' in a_tag.attrs:
            href = a_tag.attrs['href']
            full_url = urljoin(base_url, href)
            link_set.add(full_url)
    print('Total %d question pages found.' % len(link_set))


if __name__ == '__main__':
    main()

That is an overview of data acquisition and parsing in Python. Have you picked up any new knowledge or skills? If you want to learn more skills or enrich your knowledge, you are welcome to follow the industry information channel.
