How does Python crawl Sohu Securities Stock data

2025-02-28 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to crawl Sohu Securities stock data with Python. The method introduced here is simple, fast, and practical, so let's dive in.

Crawling the data

Let's take the SSE 50 stocks as an example. First, we need a web page that lists the codes of these 50 stocks; here we use the list provided by Sohu Securities:

https://q.stock.sohu.com/cn/bk_4272.shtml

As you can see, this page lists all the tickers of the SSE 50. What we want to crawl is the table containing these tickers, specifically its first column.

To crawl the page we use the Beautiful Soup toolkit. Note that this approach generally works only for static web pages.

Simply put, Beautiful Soup is a Python library whose main function is to grab data from web pages.

As usual, we need to import the bs4 library before using it. We also need the requests library to fetch the page, so we import both:

import bs4 as bs
import requests

We define a function saveSS50Tickers() to fetch the SSE 50 stock codes from the Sohu Securities page, using the requests get() method to retrieve the static page:

def saveSS50Tickers():
    resp = requests.get('https://q.stock.sohu.com/cn/bk_4272.shtml')

Next, open this Sohu Securities page in a browser and right-click anywhere on it, then choose View Element, Inspect Element, or a similar option to view the page's source code.

We need to find out some basic information about the page and the characteristics of the data we need to crawl.

First, open the Elements panel and locate the page's header section. There we can find how the page's text is encoded: this page uses gb2312.

If we want to crawl and display this page correctly, we need to decode the content of the retrieved page first.

We handle this by setting the response's encoding attribute:

resp.encoding = 'gb2312'
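To see why the declared encoding matters, here is a minimal stdlib-only sketch (the sample string is an assumption for illustration, not taken from the Sohu page): the same gb2312 bytes come out garbled or fail under the wrong codec, but decode cleanly under the right one.

```python
# A Chinese string as it would arrive in a gb2312-encoded response body.
raw = "上证50".encode("gb2312")

# Decoding with the wrong codec fails on these bytes...
try:
    wrong = raw.decode("utf-8")
except UnicodeDecodeError:
    wrong = None

# ...while the declared encoding recovers the original text.
right = raw.decode("gb2312")
print(wrong, right)  # None 上证50
```

Setting resp.encoding = 'gb2312' tells requests to apply exactly this decoding when you later read resp.text.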

Next, use BeautifulSoup and lxml to parse the web page information:

soup = bs.BeautifulSoup(resp.text, 'lxml')

Here, resp.text converts the response body into text, which BeautifulSoup then parses into a soup object for easier processing later.

Next, we need to find the tag that holds the information we want in the page source. Here that is the ticker table: we can use the search function in the browser's source view to look for data we know appears in the table and thereby locate its markup.

Taking this page as an example: web pages are written in HTML, so to locate the data precisely we need to understand some HTML basics. In the source code for this page,
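The article breaks off here, but the remaining step can be sketched: parse the table and collect the first cell of each data row. The HTML fragment below is a made-up stand-in for the real bk_4272 table (its actual markup, headers, and class names may differ), and the built-in html.parser is used instead of lxml to keep the sketch dependency-free, so treat the details as illustrative.

```python
import bs4 as bs

# A made-up fragment standing in for the Sohu ticker table;
# the real page's layout is an assumption here.
html = """
<table>
  <tr><th>代码</th><th>名称</th></tr>
  <tr><td>600000</td><td>浦发银行</td></tr>
  <tr><td>600036</td><td>招商银行</td></tr>
</table>
"""

soup = bs.BeautifulSoup(html, "html.parser")
tickers = []
for row in soup.find("table").find_all("tr")[1:]:  # skip the header row
    first_cell = row.find("td")                    # first column holds the code
    tickers.append(first_cell.get_text(strip=True))

print(tickers)  # ['600000', '600036']
```

Applied to the real page, the same loop would run over the soup built from resp.text, with the find() arguments adjusted to whatever tag and class the actual table uses.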
