
How to run a website crawler on a schedule in Python


How do you run a website crawler on a schedule in Python? This article works through the question step by step, in the hope of helping readers who face the same problem find a simple, workable approach.

Write the crawler code

First, write a crawler that uses the requests and beautifulsoup4 packages to fetch and parse two Yahoo! Taiwan stock pages: the listed (TSE) transaction-price ranking and the OTC transaction-price ranking. Then use the pandas package to display the parsed data.

import datetime
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_price_ranks():
    current_dt = datetime.datetime.now().strftime("%Y-%m-%d %X")
    current_dts = [current_dt for _ in range(200)]
    stock_types = ["tse", "otc"]
    price_rank_urls = ["https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e={}&n=100".format(st) for st in stock_types]
    tickers = []
    stocks = []
    prices = []
    volumes = []
    mkt_values = []
    ttl_steps = 10 * 100  # 10 table cells per row, 100 ranked rows per page
    each_step = 10
    for pr_url in price_rank_urls:
        r = requests.get(pr_url)
        soup = BeautifulSoup(r.text, "html.parser")
        ticker = [i.text.split()[0] for i in soup.select(".name a")]
        tickers += ticker
        stock = [i.text.split()[1] for i in soup.select(".name a")]
        stocks += stock
        price = [float(soup.find_all("td")[2].find_all("td")[i].text)
                 for i in range(5, 5 + ttl_steps, each_step)]
        prices += price
        volume = [int(soup.find_all("td")[2].find_all("td")[i].text.replace(",", ""))
                  for i in range(11, 11 + ttl_steps, each_step)]
        volumes += volume
        mkt_value = [float(soup.find_all("td")[2].find_all("td")[i].text) * 100000000
                     for i in range(12, 12 + ttl_steps, each_step)]
        mkt_values += mkt_value
    types = ["listed" for _ in range(100)] + ["OTC" for _ in range(100)]
    ky_registered = [True if "KY" in st else False for st in stocks]
    df = pd.DataFrame()
    df["scrapingTime"] = current_dts
    df["type"] = types
    df["kyRegistered"] = ky_registered
    df["ticker"] = tickers
    df["stock"] = stocks
    df["price"] = prices
    df["volume"] = volumes
    df["mktValue"] = mkt_values
    return df

price_ranks = get_price_ranks()
print(price_ranks.shape)

Running this prints the shape of the resulting DataFrame:

## (200, 8)

That is 200 rows (100 listed plus 100 OTC stocks) across 8 columns.

Next, use pandas to inspect the first and last few rows:

price_ranks.head()

price_ranks.tail()

Then we deploy to a server.

Choosing a server and configuring its environment are outside the scope of this article; the focus here is on setting up the scheduled task.

First, modify the code so that the results are stored in SQLite.

import datetime
import requests
from bs4 import BeautifulSoup
import pandas as pd
import sqlite3

def get_price_ranks():
    current_dt = datetime.datetime.now().strftime("%Y-%m-%d %X")
    current_dts = [current_dt for _ in range(200)]
    stock_types = ["tse", "otc"]
    price_rank_urls = ["https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e={}&n=100".format(st) for st in stock_types]
    tickers = []
    stocks = []
    prices = []
    volumes = []
    mkt_values = []
    ttl_steps = 10 * 100  # 10 table cells per row, 100 ranked rows per page
    each_step = 10
    for pr_url in price_rank_urls:
        r = requests.get(pr_url)
        soup = BeautifulSoup(r.text, "html.parser")
        ticker = [i.text.split()[0] for i in soup.select(".name a")]
        tickers += ticker
        stock = [i.text.split()[1] for i in soup.select(".name a")]
        stocks += stock
        price = [float(soup.find_all("td")[2].find_all("td")[i].text)
                 for i in range(5, 5 + ttl_steps, each_step)]
        prices += price
        volume = [int(soup.find_all("td")[2].find_all("td")[i].text.replace(",", ""))
                  for i in range(11, 11 + ttl_steps, each_step)]
        volumes += volume
        mkt_value = [float(soup.find_all("td")[2].find_all("td")[i].text) * 100000000
                     for i in range(12, 12 + ttl_steps, each_step)]
        mkt_values += mkt_value
    types = ["listed" for _ in range(100)] + ["OTC" for _ in range(100)]
    ky_registered = [True if "KY" in st else False for st in stocks]
    df = pd.DataFrame()
    df["scrapingTime"] = current_dts
    df["type"] = types
    df["kyRegistered"] = ky_registered
    df["ticker"] = tickers
    df["stock"] = stocks
    df["price"] = prices
    df["volume"] = volumes
    df["mktValue"] = mkt_values
    return df

price_ranks = get_price_ranks()
conn = sqlite3.connect("/home/ubuntu/yahoo_stock.db")
price_ranks.to_sql("price_ranks", conn, if_exists="append", index=False)
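To confirm that each run appends its rows, you can read the table back from SQLite. This is a minimal sanity check, not part of the original script; it assumes the database path and table name used above.

import sqlite3
import pandas as pd

conn = sqlite3.connect("/home/ubuntu/yahoo_stock.db")
# Each scraping run should contribute 200 rows (100 listed + 100 OTC).
check = pd.read_sql_query(
    "SELECT scrapingTime, COUNT(*) AS n_rows FROM price_ranks GROUP BY scrapingTime",
    conn,
)
print(check)
conn.close()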

Next, to make the crawler run on a schedule, we use the Linux crontab.

Suppose we want it to execute once an hour between 9:30 and 16:30 every day.

First save the script as price_rank_scraper.py; a matching crontab entry is sketched below.
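A crontab entry along these lines does the trick. This is a sketch under a few assumptions: the script is saved at /home/ubuntu/price_rank_scraper.py, the Python interpreter lives at /usr/bin/python3, and the log path is arbitrary. Open the current user's table for editing with crontab -e, then add:

30 9-16 * * * /usr/bin/python3 /home/ubuntu/price_rank_scraper.py >> /home/ubuntu/price_rank.log 2>&1

The five fields are minute, hour, day of month, month, and day of week, so 30 9-16 * * * fires at half past each hour from 9:30 through 16:30 every day; the trailing redirection appends each run's output and errors to a log file for troubleshooting.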

That is the answer to how to run a website crawler on a schedule in Python. I hope the content above is of some help to you.
