What is the method of quickly and effectively retrieving web page data in web development

2025-01-16 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/01 Report --

This article explains how to retrieve web page data quickly and effectively in web development. The content is simple, clear, and easy to learn; please follow the editor's train of thought through the examples below.

Web scraping problem 1

The web crawler tries to find the current stock price of Facebook. The code is as follows:

import requests
from bs4 import BeautifulSoup

def parse_price():
    r = requests.get("https://finance.yahoo.com/quote/FB?p=FB")
    soup = BeautifulSoup(r.text, "lxml")
    price = soup.find("div", {"class": "My(6px) Pos(r) smartphone_Mt(6px)"}).find("span").text
    print(f"the current price: {price}")

The code output is as follows:

The current price: 216.08

This simple web-scraping solution works, but it isn't "lazy" enough. Let's look at the next one.

Web scraping problem 2

This scraper is trying to pull a stock's enterprise value and the number of shares sold short from the statistics page. The underlying problem is retrieving nested dictionary values that may or may not exist, but in solving it the author seems to have found a better way to get at the data.
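The core of that problem, reading a chain of nested keys where any link may be missing, can be sketched with a small helper. The sample dictionary below is made up, shaped like the key statistics data described in this article:

```python
# Hypothetical sample shaped like the QuoteSummaryStore statistics data.
key_stats = {
    "defaultKeyStatistics": {
        "enterpriseValue": {"raw": 13677747200, "fmt": "13.68B"},
        "sharesShort": {},  # empty: the value may simply not be there
    }
}

def deep_get(d, *keys, default="N/A"):
    """Walk a chain of keys, returning `default` as soon as one is missing."""
    for key in keys:
        if not isinstance(d, dict) or key not in d:
            return default
        d = d[key]
    return d

print(deep_get(key_stats, "defaultKeyStatistics", "enterpriseValue", "fmt"))  # 13.68B
print(deep_get(key_stats, "defaultKeyStatistics", "sharesShort", "longFmt"))  # N/A
```

This is the same idea as chaining `.get()` calls, but it never raises `KeyError` no matter how deep the missing key is.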

import requests, re, json

p = re.compile(r'root\.App\.main = (.*);')
tickers = ['AGL.AX']
results = {}
with requests.Session() as s:
    for ticker in tickers:
        r = s.get('https://finance.yahoo.com/quote/{}/key-statistics?p={}'.format(ticker, ticker))
        data = json.loads(p.findall(r.text)[0])
        key_stats = data['context']['dispatcher']['stores']['QuoteSummaryStore']
        print(key_stats)
        res = {
            'Enterprise Value': key_stats['defaultKeyStatistics']['enterpriseValue']['fmt'],
            'Shares Short': key_stats['defaultKeyStatistics']['sharesShort'].get('longFmt', 'N/A'),
        }
        results[ticker] = res
print(results)

Look at line 3: the scraper finds the data it needs inside a JavaScript variable:

root.App.main = { ... }

From there, the data can be retrieved simply by accessing the appropriate nested keys of the dictionary. But there are even "lazier" ways.
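As a self-contained sketch of that idea (the HTML snippet below is invented for illustration), the embedded variable can be captured with a regular expression and handed straight to `json.loads`:

```python
import json
import re

# Invented page source standing in for the real HTML.
html = '<script>root.App.main = {"context": {"price": 216.08}};</script>'

# Capture the JSON object assigned to root.App.main.
pattern = re.compile(r'root\.App\.main = (\{.*?\});')
data = json.loads(pattern.search(html).group(1))

print(data["context"]["price"])  # 216.08
```

Once the blob is parsed, everything else is ordinary dictionary indexing rather than HTML traversal.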

Lazy solution 1

import requests

r = requests.get("https://query2.finance.yahoo.com/v10/finance/quoteSummary/FB?modules=price")
data = r.json()
print(data)
print(f"the current price: {data['quoteSummary']['result'][0]['price']['regularMarketPrice']['raw']}")

Look at the URL on the third line, and the output is as follows:

{'quoteSummary': {'error': None, 'result': [{'price': {
    'averageDailyVolume10Day': {}, 'averageDailyVolume3Month': {}, 'circulatingSupply': {},
    'currency': 'USD', 'currencySymbol': '$', 'exchange': 'NMS', 'exchangeDataDelayedBy': 0,
    'exchangeName': 'NasdaqGS', 'fromCurrency': None, 'lastMarket': None, 'longName': 'Facebook, Inc.',
    'marketCap': {'fmt': '698.42B', 'longFmt': '698423836672.00', 'raw': 698423836672},
    'marketState': 'REGULAR', 'maxAge': 1, 'openInterest': {}, 'postMarketChange': {}, 'postMarketPrice': {},
    'preMarketChange': {'fmt': '-0.90', 'raw': -0.899994},
    'preMarketChangePercent': {'fmt': '-0.37%', 'raw': -0.00368096},
    'preMarketPrice': {'fmt': '243.60', 'raw': 243.6},
    'preMarketSource': 'FREE_REALTIME', 'preMarketTime': 1594387780,
    'priceHint': {'fmt': '2', 'longFmt': '2', 'raw': 2},
    'quoteSourceName': 'Nasdaq Real Time Price', 'quoteType': 'EQUITY',
    'regularMarketChange': {'fmt': '0.30', 'raw': 0.30160522},
    'regularMarketChangePercent': {'fmt': '0.12%', 'raw': 0.0012335592},
    'regularMarketDayHigh': {'fmt': '245.49', 'raw': 245.49},
    'regularMarketDayLow': {'fmt': '239.32', 'raw': 239.32},
    'regularMarketOpen': {'fmt': '243.68', 'raw': 243.685},
    'regularMarketPreviousClose': {'fmt': '244.50', 'raw': 244.5},
    'regularMarketPrice': {'fmt': '244.80', 'raw': 244.8016},
    'regularMarketSource': 'FREE_REALTIME', 'regularMarketTime': 1594410026,
    'regularMarketVolume': {'fmt': '19.46M', 'longFmt': '19456621.00', 'raw': 19456621},
    'shortName': 'Facebook, Inc.', 'strikePrice': {}, 'symbol': 'FB',
    'toCurrency': None, 'underlyingSymbol': None, 'volume24Hr': {}, 'volumeAllCurrencies': {}}}]}}
the current price: 241.63

Lazy solution 2

import requests

r = requests.get("https://query2.finance.yahoo.com/v10/finance/quoteSummary/AGL.AX?modules=defaultKeyStatistics")
data = r.json()
print(data)
print({'AGL.AX': {
    'Enterprise Value': data['quoteSummary']['result'][0]['defaultKeyStatistics']['enterpriseValue']['fmt'],
    'Shares Short': data['quoteSummary']['result'][0]['defaultKeyStatistics']['sharesShort'].get('longFmt', 'N/A'),
}})

Take a look at the URL on the third line again, and the output is as follows:

{'quoteSummary': {'result': [{'defaultKeyStatistics': {
    'maxAge': 1, 'priceHint': {'raw': 2, 'fmt': '2', 'longFmt': '2'},
    'enterpriseValue': {'raw': 13677747200, 'fmt': '13.68B', 'longFmt': '13677747200'},
    'forwardPE': {}, 'profitMargins': {'raw': 0.07095, 'fmt': '7.10%'},
    'floatShares': {'raw': 637754149, 'fmt': '637.75M', 'longFmt': '637754149'},
    'sharesOutstanding': {'raw': 639003008, 'fmt': '639M', 'longFmt': '639003008'},
    'sharesShort': {}, 'sharesShortPriorMonth': {}, 'sharesShortPreviousMonthDate': {},
    'dateShortInterest': {}, 'sharesPercentSharesOut': {},
    'heldPercentInsiders': {'raw': 0.0025499999, 'fmt': '0.25%'},
    'heldPercentInstitutions': {'raw': 0.31033, 'fmt': '31.03%'},
    'shortRatio': {}, 'shortPercentOfFloat': {}, 'beta': {'raw': 0.365116, 'fmt': '0.37'},
    'morningStarOverallRating': {}, 'morningStarRiskRating': {}, 'category': None,
    'bookValue': {'raw': 12.551, 'fmt': '12.55'}, 'priceToBook': {'raw': 1.3457094, 'fmt': '1.35'},
    'annualReportExpenseRatio': {}, 'ytdReturn': {}, 'beta3Year': {}, 'totalAssets': {}, 'yield': {},
    'fundFamily': None, 'fundInceptionDate': {}, 'legalType': None,
    'threeYearAverageReturn': {}, 'fiveYearAverageReturn': {}, 'priceToSalesTrailing12Months': {},
    'lastFiscalYearEnd': {'raw': 1561852800, 'fmt': '2019-06-30'},
    'nextFiscalYearEnd': {'raw': 1625011200, 'fmt': '2021-06-30'},
    'mostRecentQuarter': {'raw': 1577750400, 'fmt': '2019-12-31'},
    'earningsQuarterlyGrowth': {'raw': 0.114, 'fmt': '11.40%'}, 'revenueQuarterlyGrowth': {},
    'netIncomeToCommon': {'raw': 938000000, 'fmt': '938M', 'longFmt': '938000000'},
    'trailingEps': {'raw': 1.434, 'fmt': '1.43'}, 'forwardEps': {}, 'pegRatio': {},
    'lastSplitFactor': None, 'lastSplitDate': {},
    'enterpriseToRevenue': {'raw': 1.035, 'fmt': '1.03'},
    'enterpriseToEbitda': {'raw': 6.701, 'fmt': '6.70'},
    '52WeekChange': {'raw': -0.17621362, 'fmt': '-17.62%'},
    'SandP52WeekChange': {'raw': 0.045882702, 'fmt': '4.59%'},
    'lastDividendValue': {}, 'lastCapGain': {}, 'annualHoldingsTurnover': {}}}],
  'error': None}}
{'AGL.AX': {'Enterprise Value': '13.73B', 'Shares Short': 'N/A'}}

The "lazy" solutions simply change the request from the front-end URL to a kind of unofficial API endpoint that returns JSON data. They are simpler and expose more data, so what about speed? The benchmark code is as follows:

import timeit
import requests
from bs4 import BeautifulSoup
import json
import re

repeat = 5
number = 5

def web_scrape_1():
    r = requests.get("https://finance.yahoo.com/quote/FB?p=FB")
    soup = BeautifulSoup(r.text, "lxml")
    price = soup.find("div", {"class": "My(6px) Pos(r) smartphone_Mt(6px)"}).find("span").text
    return f"the current price: {price}"

def lazy_1():
    r = requests.get("https://query2.finance.yahoo.com/v10/finance/quoteSummary/FB?modules=price")
    data = r.json()
    return f"the current price: {data['quoteSummary']['result'][0]['price']['regularMarketPrice']['raw']}"

def web_scrape_2():
    p = re.compile(r'root\.App\.main = (.*);')
    ticker = 'AGL.AX'
    results = {}
    with requests.Session() as s:
        r = s.get('https://finance.yahoo.com/quote/{}/key-statistics?p={}'.format(ticker, ticker))
        data = json.loads(p.findall(r.text)[0])
        key_stats = data['context']['dispatcher']['stores']['QuoteSummaryStore']
        res = {'Enterprise Value': key_stats['defaultKeyStatistics']['enterpriseValue']['fmt'],
               'Shares Short': key_stats['defaultKeyStatistics']['sharesShort'].get('longFmt', 'N/A')}
        results[ticker] = res
    return results

def lazy_2():
    r = requests.get("https://query2.finance.yahoo.com/v10/finance/quoteSummary/AGL.AX?modules=defaultKeyStatistics")
    data = r.json()
    return {'AGL.AX': {'Enterprise Value': data['quoteSummary']['result'][0]['defaultKeyStatistics']['enterpriseValue']['fmt'],
                       'Shares Short': data['quoteSummary']['result'][0]['defaultKeyStatistics']['sharesShort'].get('longFmt', 'N/A')}}

web_scraping_1_times = timeit.repeat('web_scrape_1()', setup='import requests; from bs4 import BeautifulSoup',
                                     globals=globals(), repeat=repeat, number=number)
print(f"web scraping #1 min time is {min(web_scraping_1_times) / number}")
lazy_1_times = timeit.repeat('lazy_1()', setup='import requests', globals=globals(), repeat=repeat, number=number)
print(f"lazy #1 min time is {min(lazy_1_times) / number}")
web_scraping_2_times = timeit.repeat('web_scrape_2()', setup='import requests, re, json',
                                     globals=globals(), repeat=repeat, number=number)
print(f"web scraping #2 min time is {min(web_scraping_2_times) / number}")
lazy_2_times = timeit.repeat('lazy_2()', setup='import requests', globals=globals(), repeat=repeat, number=number)
print(f"lazy #2 min time is {min(lazy_2_times) / number}")

The output is as follows:

web scraping #1 min time is 0.5678426799999997
lazy #1 min time is 0.11238783999999953
web scraping #2 min time is 0.3731000199999997
lazy #2 min time is 0.0864451399999993

The "lazy" alternatives are 4 to 5 times faster than their web-scraping counterparts!
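As an offline illustration of the access-pattern difference behind these numbers, here is a toy comparison on invented sample data (no network involved): matching HTML with a regex versus indexing an already-structured JSON payload. Note that in practice the bulk of the real speedup comes from the smaller response and from skipping full-page HTML entirely, so this sketch only shows the parsing side:

```python
import json
import re
import timeit

# Invented stand-ins for a rendered page and a JSON API response.
html_doc = '<div class="price"><span>216.08</span></div>' * 100
json_doc = json.dumps({"price": {"regularMarketPrice": {"raw": 216.08}}})

def from_html():
    # Dig the value out of markup with a pattern match.
    return re.search(r'<span>([\d.]+)</span>', html_doc).group(1)

def from_json():
    # Parse structured data and index straight into it.
    return json.loads(json_doc)["price"]["regularMarketPrice"]["raw"]

t_html = min(timeit.repeat(from_html, repeat=3, number=1000))
t_json = min(timeit.repeat(from_json, repeat=3, number=1000))
print(f"regex over HTML: {t_html:.4f}s   json.loads: {t_json:.4f}s")
```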

The process of "laziness"

Consider the two problems above: the original solutions tried to retrieve data after it had been loaded into the page, while the "lazy" solutions go straight to the data source and ignore the front-end page entirely. That is an important difference, and a good approach whenever you try to extract data from a website.

Step 1: check the XHR request

The XHR (XMLHttpRequest) object is an API available to browser scripting languages such as JavaScript. It sends HTTP or HTTPS requests to a web server and loads the server's response data back into the script. In short, XHR lets a client retrieve data from a URL without refreshing the entire web page.
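As a small sketch of the kind of URL such a request carries, using the endpoint seen in the lazy solutions above (assembling and splitting the query string with the standard library here is purely illustrative, not how the browser itself does it):

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Compose a quoteSummary-style request URL.
base = "https://query2.finance.yahoo.com/v10/finance/quoteSummary/FB"
params = {"modules": "price"}
url = f"{base}?{urlencode(params)}"
print(url)  # https://query2.finance.yahoo.com/v10/finance/quoteSummary/FB?modules=price

# Any inspector (or the server) can split it back apart:
query = parse_qs(urlsplit(url).query)
print(query["modules"])  # ['price']
```

This is exactly the shape of URL you will see in the Network tab in the next steps.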

The author will use Chrome for the following demonstration, but other browsers have similar functionality.

Open Chrome's developer console. To do so, open the Chrome menu in the upper-right corner of the browser window and select More Tools > Developer Tools. You can also use the keyboard shortcut Option + Cmd + J (on macOS) or Shift + Ctrl + J (on Windows/Linux).

Select the Network tab.

Then filter the results through "XHR"

Note that although several requests contain "AAPL", their results are similar but not identical. To start investigating, click one of the links containing "AAPL" in the leftmost column.

When you select one of the links, an additional window appears with details of the selected request. The first tab, Headers, provides detailed information about the browser request and server response. You should immediately notice that the "Request URL" in the Headers tab looks very similar to the URLs used in the lazy solutions above.

If you select the Preview tab, you will see the data returned from the server.

That's great! Looks like we found the URL that fetched the Apple OHLC data!

Step 2: search

Now that we have found some XHR requests the browser makes, let's search the JavaScript files to see if we can find more information. A common feature of the XHR-related URLs is the "query1" and "query2" hosts. In the upper-right corner of the developer console, select the three vertical dots, then choose "Search" in the drop-down box.

Search for "query2" in the search bar:

Select the first option. An additional tab pops up with the location where "query2" is found. You should notice something similar here:

Web scraping solution 2 extracted the same data from this very variable. The console should offer an option to pretty-print the variable. You can use that option, or copy and paste the entire line (line 11 above) into https://beautifier.io/. Or, if you use VS Code with a beautifier extension installed, it will do the same thing.

Once the code is formatted correctly, paste it all into a text editor and search for "query2" again. The result should sit inside the "Service Plugin" section, which contains the URLs Yahoo Finance uses to populate its pages with data. Here is the relevant part of that section:

"tachyon.quoteSummary": {
    "path": "/v10/finance/quoteSummary/{symbol}",
    "timeout": 6000,
    "query": ["lang", "region", "corsDomain", "crumb", "modules", "formatted"],
    "responseField": "quoteSummary",
    "get": {"formatted": true}
},

Thank you for reading. That covers the methods for quickly and effectively retrieving web page data in web development. After studying this article you should have a deeper understanding of them; the specifics still need to be verified in practice.
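To see how such a service definition turns into a concrete request, here is a hedged sketch that expands the "path" template and filters parameters against the "query" list. The `spec` dictionary mirrors the "tachyon.quoteSummary" entry above, while the host and the `build_url` helper are assumptions inferred from the endpoints used earlier in this article:

```python
# Mirrors the service definition shown above (trimmed to the fields we use).
spec = {
    "path": "/v10/finance/quoteSummary/{symbol}",
    "query": ["lang", "region", "corsDomain", "crumb", "modules", "formatted"],
    "responseField": "quoteSummary",
}

def build_url(spec, symbol, **params):
    # Keep only parameters the service definition actually lists.
    allowed = {k: v for k, v in params.items() if k in spec["query"]}
    qs = "&".join(f"{k}={v}" for k, v in sorted(allowed.items()))
    return "https://query2.finance.yahoo.com" + spec["path"].format(symbol=symbol) + "?" + qs

print(build_url(spec, "FB", modules="price"))
# https://query2.finance.yahoo.com/v10/finance/quoteSummary/FB?modules=price
```

The result is exactly the URL used in lazy solution 1, which is why finding this section of the page source is so useful: it enumerates every data endpoint the site itself relies on.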
