Many inexperienced readers are unsure how to build a targeted crawler for stock data in Python, so this article lays out the problem and its solution step by step; after reading it, I hope you will be able to solve the problem yourself.
Brief introduction to the function
Objective: to obtain the names and trading information of all stocks on the Shanghai and Shenzhen stock exchanges.
Output: save to a file.
Technical route: requests-bs4-re
Language: Python 3.5
Description
Website selection principle: the stock information must be present statically in the html page, not generated by js code, and crawling must not be restricted by the Robots protocol.
Selection method: open the web page, view the source code, and search whether the stock price data shown on the page actually appears in the source.
For example, open the Sina stock page (http://finance.sina.com.cn/realstock/company/sz000877/nc.shtml), as shown in the following figure:
The left side of the figure is the rendered web page, which shows that the price of Tianshan shares is 13.06. The right side is the page source. Searching for 13.06 in the source turns up nothing, so we conclude that the data on this page is generated by js and the page is unsuitable for this project. We therefore switch to another site.
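This check can also be scripted before committing to a site. The snippet below is a minimal sketch of the idea, reusing the URL and the price value 13.06 from the example above: fetch the raw html and test whether the displayed value appears in it.
import requests

html = requests.get("http://finance.sina.com.cn/realstock/company/sz000877/nc.shtml").text
# if the price shown in the browser is absent from the raw html,
# the page is most likely rendered by js
print("13.06" in html)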
Then open the Baidu stock URL (https://gupiao.baidu.com/stock/sz300023.html), as shown in the following figure:
From the figure we can see that Baidu's stock data is embedded directly in the html code, which meets the requirements of our project, so this project uses the Baidu stock pages.
Since a Baidu stock page only covers a single stock, we also need a list of all stocks currently on the market. For this we use Oriental Fortune (Eastmoney), at http://quote.eastmoney.com/stocklist.html, whose interface is shown in the following figure:
Principle analysis
If you look at the URL of any Baidu stock page, e.g. https://gupiao.baidu.com/stock/sz300023.html, you will find that the number 300023 in the URL is exactly the stock's code, and the prefix sz denotes the Shenzhen Stock Exchange. So the program is structured as follows:
Step 1: get the list of stocks from Oriental Fortune.
Step 2: take the stock codes one by one, append each to the Baidu stock base URL, and visit the resulting links to collect each stock's information.
Step 3: save the results to a file.
Next, inspect the source of a Baidu single-stock page; each stock's information is stored in the html code as shown in the figure below:
Therefore, when storing the information of each stock, we can mirror how the html in the figure stores it: each field name corresponds to a field value, i.e. key-value pairs. In Python, key-value pairs map naturally onto the dictionary type, so this project uses one dictionary per stock, records all stocks' information in dictionaries, and then writes the dictionary data out to a file.
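As an illustration only (the field names and values below are hypothetical, not taken from the actual page), one stock's record as a dictionary might look like this:
infoDict = {'stock name': 'ExampleCo',      # hypothetical values for illustration
            "Today's open": '13.10',
            "Yesterday's close": '13.06'}
# one dictionary per stock; all dictionaries are later written to a file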
Code writing
First, the function that fetches html page data; it is standard, so it is not explained further here. The code is as follows:
# get the html text
def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
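In practice, a bare requests.get can hang on slow servers or be rejected without browser-like headers. A slightly hardened variant (a sketch, not part of the original program; the User-Agent string is illustrative) could look like this:
def getHTMLText(url):
    try:
        # a timeout avoids hanging forever; a User-Agent header makes the
        # request look like an ordinary browser visit
        r = requests.get(url, timeout=10,
                         headers={'User-Agent': 'Mozilla/5.0'})
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""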
Next is the html parser. The first page to parse is the Oriental Fortune list page (http://quote.eastmoney.com/stocklist.html). Open its source, as shown in the following figure:
As the figure shows, the URL in the href attribute of each a tag contains the corresponding stock's code, so we only need to parse the code out of the URL. The parsing steps are as follows:
The first step is to get the page:
html = getHTMLText(stockURL)
The second step is to parse the page and find all the a tags:
soup = BeautifulSoup(html, 'html.parser')
a = soup.find_all('a')
The third step is to process each a tag in turn. The procedure is as follows:
1. Find the href attribute of the a tag, examine the link it holds, and take out the stock code that follows; a regular expression is the right tool for the match. Since Shenzhen Stock Exchange codes start with sz and Shanghai Stock Exchange codes start with sh, and the code itself consists of six digits, the regular expression can be written as [s][hz]\d{6}. In other words, we construct a regular expression, find the substring of the link that satisfies it, and extract it (a quick demo of the pattern follows after this list). The code is as follows:
for i in a:
    href = i.attrs['href']
    lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
2. html contains many a tags, but some of them have no href attribute, so the program above raises an exception when it runs; we therefore wrap it in try...except to handle the exception. The code is as follows:
for i in a:
    try:
        href = i.attrs['href']
        lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
    except:
        continue
As the code shows, when an exception occurs we simply use the continue statement to skip that tag and carry on with the rest. With this program we can collect the codes of every stock listed on Oriental Fortune.
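As promised above, here is a quick demo of the regular expression on a few hypothetical href values (the URLs are made up for illustration):
import re

hrefs = ['http://quote.eastmoney.com/sz300023.html',   # Shenzhen code
         'http://quote.eastmoney.com/sh600000.html',   # Shanghai code
         'http://quote.eastmoney.com/help.html']       # no stock code
for href in hrefs:
    print(re.findall(r"[s][hz]\d{6}", href))
# prints: ['sz300023'], ['sh600000'], []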
Package the above code into a function, and the complete code for parsing the Oriental Fortune website page is as follows:
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
The next step is to get individual stock information from Baidu Stock (https://gupiao.baidu.com/stock/sz300023.html). Let's first look at the page source, as shown in the following figure:
The stock information is stored in the html code shown in the figure above, so we need to parse this html code. The process is as follows:
1. The base URL of Baidu Stock is https://gupiao.baidu.com/stock/
The URL of a single stock's page is, for example, https://gupiao.baidu.com/stock/sz300023.html
Therefore, each stock's page URL is simply the Baidu Stock base URL plus the stock's code, and the codes are exactly what getStockList already parsed out of Oriental Fortune. We can therefore traverse the list returned by getStockList as follows:
for stock in lst:
    url = stockURL + stock + ".html"
2. After building the URL, visit the page to fetch its html. The code is as follows:
html = getHTMLText(url)
3. With the html in hand, we need to parse it. As the figure above shows, the information of a single stock is stored in a div tag whose class attribute is stock-bets, so we parse that out:
soup = BeautifulSoup(html, 'html.parser')
stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
4. We also find that the stock name sits in the element with class bets-name; continue parsing and store it in the dictionary:
infoDict = {}
name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
infoDict.update({'stock name': name.text.split()[0]})
The split() call drops everything after the first whitespace in the stock name, which we do not need.
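For example, if the tag text were 'ExampleCo (sz300023)' (a hypothetical value), split()[0] keeps only the part before the first whitespace:
name_text = 'ExampleCo (sz300023)'   # hypothetical tag text
print(name_text.split()[0])          # prints: ExampleCo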
5. We can also observe from the html that the other stock fields are stored in dt and dd tags, where dt holds the field name (the key) and dd holds the field value. Get all the keys and values:
keyList = stockInfo.find_all('dt')
valueList = stockInfo.find_all('dd')
Then put each obtained key-value pair into the dictionary:
for i in range(len(keyList)):
    key = keyList[i].text
    val = valueList[i].text
    infoDict[key] = val
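An equivalent and arguably more idiomatic way to pair the two lists is zip; this is a stylistic alternative, assuming the dt and dd tags line up one-to-one as they do on the page:
for key_tag, val_tag in zip(keyList, valueList):
    infoDict[key_tag.text] = val_tag.text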
6. Save the data in the dictionary to an external file:
with open(fpath, 'a', encoding='utf-8') as f:
    f.write(str(infoDict) + '\n')
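Since each line of the output file is the str() of a dictionary, the records can be read back later with ast.literal_eval; a minimal sketch, assuming fpath is the same file written above:
import ast

with open(fpath, encoding='utf-8') as f:
    records = [ast.literal_eval(line) for line in f if line.strip()]
# records is now a list of dictionaries, one per stock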
Packaging the above process into a complete function, the code is as follows:
def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'stock name': name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except:
            continue
Here, try...except is used for exception handling.
Next, write the main function and call the above function:
def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)
Complete project program
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'stock name': name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
            count = count + 1
            print("\r current progress: {:.2f}%".format(count * 100 / len(lst)), end="")
        except:
            count = count + 1
            print("\r current progress: {:.2f}%".format(count * 100 / len(lst)), end="")
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
The print statement in the code above reports the crawl progress. After the code finishes running, a BaiduStockInfo.txt file appears on the D drive, holding the stock information.
After reading the above, have you mastered how to build a targeted crawler for stock data in Python? If you want to learn more skills or dig deeper, feel free to follow the industry information channel. Thank you for reading!