
How to Automatically Scrape Web Tables with a Pandas Function


This article explains how to automatically scrape web tables with a Pandas function. It should be a useful reference for anyone interested in the topic; I hope you learn a lot from reading it.

Let me introduce a very practical and almost magical function, read_html(), which saves you the trouble of writing a crawler and automatically extracts the tables from static web pages.

Simple usage: pandas.read_html(url)

Main parameters:

io: receives a URL, file, or string
header: specifies the row that contains the column names
encoding: the encoding used to decode the web page
attrs: pass a dictionary to filter for a specific table by its attributes

Just pass in a url and it grabs every table on the page, returning them as a list in which each table is a DataFrame.

As a simple demo, let's grab the fund net-asset-value table from Tiantian Fund (Eastmoney). Target url: http://fund.eastmoney.com/fund.html.

Inspecting the page's HTML, you can see the data lives in a table element, which is exactly what read_html() is suited to scrape.

import pandas as pd

url = "http://fund.eastmoney.com/fund.html"
data = pd.read_html(url, attrs={'id': 'oTable'})

# check how many tables were found
tablenum = len(data)
print(tablenum)

Output: 1

After filtering with attrs={'id': 'oTable'}, only one table remains, and it is exactly the fund net-asset-value table we want.

data[0]

However, this only grabs the data table on the first page. Every page of the fund NAV data uses the same url, so read_html() cannot reach the tables on the other pages; the site likely uses ajax dynamic loading, which also defeats simple crawlers.

Generally speaking, when a site does not show all of a dataset at once, it spreads it across multiple pages. There are two common cases:

1. The url of the next page differs from that of the previous page; that is, each page has its own url, usually distinguished by an incrementing page number. The way to handle this is to download every page's html and extract the data from each; a sketch follows this list. (Tiantian Fund is not this type.)
2. The url of the next page is the same as that of the previous page; that is, a single url serves all the data. Such pages usually have a "next page" button, or an input box plus a "confirm" button. The way to handle this is to trigger the click event of the "next page" button (or the input box and "confirm" button) in code to turn the pages and collect all the data. (Tiantian Fund is this type.)
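For the first case, a minimal sketch might look like this (the page-numbered url pattern below is hypothetical, purely for illustration; adjust the pattern and the page range to the actual site):

import pandas as pd

# hypothetical url pattern where each page has its own incrementing page number
base_url = "http://example.com/funds?page={}"

frames = []
for page in range(1, 6):  # pages 1-5, for illustration
    tables = pd.read_html(base_url.format(page))
    frames.append(tables[0])  # assume the target table is the first on each page

# stitch the pages together into one DataFrame
all_data = pd.concat(frames, ignore_index=True)
print(all_data.shape)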

So far we have only used read_html() in its simplest form to fetch a web table. It has more sophisticated uses, and for those you need to understand the meaning of its parameters.

Detailed usage

pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True)

Detailed parameters

"io:" str, path object or file-like objectURL,file-like object or the original string containing HTML. Note that lxml only accepts the http,ftp and file url protocols. If your URL is' https' 'you can try to delete' s'.

"match:" str or compiled regular expression, the optional parameter returns a table set that contains text that matches the regular expression or string. Unless HTML is very simple, you may need to pass a non-empty string here. The default is ". +" (matches any non-empty string). The default value returns all tables contained on the page. This value is converted to a regular expression so that there is consistent behavior between Beautiful Soup and lxml.

"flavor:" the parsing engine to be used by str or None.' Bs4' and 'html5lib' are synonymous with each other and they are both for backward compatibility. The default value, None, attempts to parse using lxml, and if it fails, it reappears bs4+html5lib.

"header:" int or list-like or None, optional parameter line (or MultiIndex) is used to create column headings.

"index_col:" int or list-like or None, optional parameters for the column (or column list) used to create the index.

"skiprows:" int or list-like or slice or None, optional parameter the number of rows to skip after parsing column integers. Start at 0. If an integer sequence or slice is given, the rows indexed by that sequence are skipped. Note that a single sequence of elements means "skip line n", while an integer means "skip line n".

"attrs:" dict or None, optional parameter this is a dictionary of attributes that you can pass to identify tables in HTML. They are not checked for validity before they are passed to lxml or Beautiful Soup. However, these properties must be valid HTML properties to work properly. For example, attrs = {'id':' table'} is a valid attribute dictionary because the 'id' HTML tag attribute is a valid HTML attribute of any HTML tag, this file. Attrs = {'asdf':' table'} is not a valid attribute dictionary because 'asdf' is not a valid HTML attribute even if it is a valid XML attribute. You can find valid HTML 4.01 table properties here. You can find the working draft of the HTML 5 specification here. It contains the latest information about modern Web table properties.

"parse_dates:" bool, optional parameters refer to read_csv () for more details.

"thousands:" str, an optional parameter to parse thousands of delimiters. The default is','.

"encoding:" str or None, optional parameters are used to decode the encoding of the web page. The default is NoneNone to retain the previous encoding behavior, depending on the underlying parser library (for example, the parser library will try to use the encoding provided by the document).

"decimal:" str, default is'.' Characters that can be recognized as decimal points (for example, for European data, use ",").

"converters:" dict, which defaults to the dictionary of functions that None uses to convert values in some columns. The key can be an integer or a column label, and the value is a function that takes an input parameter, the contents of the cell (not the column) and returns the converted content.

"na_values:" iterable, default to None custom na value.

"keep_default_na:" bool, defaults to True if na_values is specified and keep_default_na is False, the default nan value will be overridden, otherwise they will be appended.

"displayed_only:" bool, which defaults to whether True should resolve elements with "display:none".

Finally, read_html() only parses static pages. For dynamically loaded pages, you can fetch the response by some other means and then pass response.text into read_html() to get the table data.
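For example, here is a minimal sketch using the requests library (the url below is hypothetical; a heavily dynamic site may instead require a browser-automation tool such as Selenium to render the page first):

import requests
import pandas as pd

url = "http://example.com/table-page"  # hypothetical url, for illustration only

response = requests.get(url)
response.encoding = response.apparent_encoding  # guard against garbled text

# pass the fetched HTML text into read_html() to extract the tables
tables = pd.read_html(response.text)
print(len(tables))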

Thank you for reading this article carefully. I hope this article on how to automatically scrape web tables with a Pandas function has been helpful to you.
