How to use the pyppeteer library for Python crawlers

Updated 2025-01-18. From: SLTechnology News&Howtos (shulou), Development

Shulou (Shulou.com), 06/01

This article explains how to use the pyppeteer library for Python crawlers, working through practical examples. The approach is simple, fast, and practical, and should help you get started.

Pyppeteer

Before introducing Pyppeteer, a word about Puppeteer. Puppeteer is a tool developed by Google on top of Node.js. It drives the Chrome browser through its API from JavaScript code, and is used for tasks such as data crawling and automated testing of web applications.

Pyppeteer is an unofficial Python port of Puppeteer, a browser automation library, developed by a Japanese engineer.

Pyppeteer is built on asyncio, Python's asynchronous coroutine library, and can be integrated with Scrapy for distributed crawlers.
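Because every pyppeteer call is a coroutine, several pages can be processed concurrently on one event loop. The following is a minimal sketch of that idea using only the standard library. The names fake_fetch and crawl_all are hypothetical, and asyncio.sleep stands in for a real page load, so no browser is needed to run it.

```python
import asyncio
import time


async def fake_fetch(name: str, delay: float) -> str:
    """Stand-in for a page load; sleeps instead of opening a browser."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def crawl_all() -> list:
    # gather() runs the coroutines concurrently and returns their results
    # in the order the coroutines were passed in.
    return await asyncio.gather(
        fake_fetch("page-1", 0.2),
        fake_fetch("page-2", 0.2),
        fake_fetch("page-3", 0.2),
    )


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(crawl_all())
    elapsed = time.perf_counter() - start
    print(results)
    # The three waits overlap, so the total is close to one delay, not three.
    print(f"elapsed: {elapsed:.2f}s")
```

In a real crawler, each fake_fetch call would be replaced by a coroutine that opens a page with pyppeteer.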

On the name: a puppet is a marionette, and a puppeteer is the person who operates it.

The differences between pyppeteer and puppeteer

Pyppeteer supports dictionary and keyword parameters, while puppeteer only supports dictionary parameters.

# puppeteer supports only a dictionary (options object) argument
browser = await launch({"headless": True})

# pyppeteer supports both dictionary and keyword arguments
browser = await launch({"headless": True})
browser = await launch(headless=True)
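One way a Python function can accept both calling styles is to merge an optional dictionary with its keyword arguments. The sketch below illustrates the pattern only. merge_options is a hypothetical helper, not pyppeteer's own code, and letting keyword arguments win on conflict is an assumption of this sketch.

```python
from typing import Any, Dict, Optional


def merge_options(options: Optional[Dict[str, Any]] = None, **kwargs: Any) -> Dict[str, Any]:
    """Combine an options dict with keyword arguments into one dict.

    Hypothetical helper for illustration; keyword arguments win on conflict.
    """
    merged = dict(options or {})
    merged.update(kwargs)
    return merged


if __name__ == "__main__":
    # Both calling styles produce the same settings.
    print(merge_options({"headless": True}))
    print(merge_options(headless=True))
```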

The element selector method $ becomes querySelector

# puppeteer uses the $ symbol
page.$() / page.$$() / page.$x()

# pyppeteer uses Python-style method names
page.querySelector() / page.querySelectorAll() / page.xpath()

# abbreviations
page.J() / page.JJ() / page.Jx()

Parameters for page.evaluate() and page.querySelectorEval()

Puppeteer's evaluate() method accepts either a native JavaScript function or a JavaScript expression string. Pyppeteer's evaluate() method accepts only a JavaScript string, which can be a function or an expression; pyppeteer determines which automatically. The detection is sometimes wrong: if a string is misjudged as a function and an error is raised, add the parameter force_expr=True to force pyppeteer to treat it as an expression.

Get the content of the page:

content = await page.evaluate("document.body.textContent", force_expr=True)

Get the internal text of the element:

element = await page.querySelector("h1")
title = await page.evaluate("(element) => element.textContent", element)
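The two calls above can be wrapped into a small reusable coroutine that also handles the case where nothing matches. This is a sketch only: get_text is a hypothetical helper name, and page is assumed to be a pyppeteer Page (or any object with the same querySelector and evaluate coroutines).

```python
from typing import Any, Optional


async def get_text(page: Any, selector: str) -> Optional[str]:
    """Return the textContent of the first element matching selector, or None."""
    element = await page.querySelector(selector)
    if element is None:
        # querySelector found no match, so there is nothing to evaluate.
        return None
    # The string is a JavaScript arrow function; the element handle is
    # passed to it as its argument.
    return await page.evaluate("(element) => element.textContent", element)
```

Inside a crawler coroutine it would be called as title = await get_text(page, "h1").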

Installation

1. Install pyppeteer

pip install pyppeteer

2. Install chromium

pyppeteer-install

Basic usage

import asyncio
from pyppeteer import launch


async def main():
    url = "https://www.toutiao.com/"
    # With headless=False the browser runs in headed (visible) mode
    browser = await launch(headless=False, ignoreDefaultArgs=["--enable-automation"])
    page = await browser.newPage()

    # Set the page viewport size
    await page.setViewport(viewport={"width": 1600, "height": 900})
    # Enable JavaScript; with enabled=False the page gets no script-rendered content
    await page.setJavaScriptEnabled(enabled=True)
    # Navigation timeout of 1000 milliseconds
    res = await page.goto(url, options={"timeout": 1000})
    resp_headers = res.headers  # response headers
    resp_status = res.status  # response status

    # Waiting
    await asyncio.sleep(2)
    await page.waitFor(1000)
    # Second method: poll for an element in a while loop until it appears
    while not await page.querySelector(".t"):
        await asyncio.sleep(0.5)

    # Scroll to the bottom of the page
    await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
    await page.screenshot({"path": "test.png"})

    # Print the page cookies
    print(await page.cookies())
    # Get all of the HTML content
    print(await page.content())

    dimensions = await page.evaluate(pageFunction="""() => {
        return {
            width: document.documentElement.clientWidth,    // page width
            height: document.documentElement.clientHeight,  // page height
            deviceScaleFactor: window.devicePixelRatio,     // pixel ratio, e.g. 1.0000000149011612
        }
    }""", force_expr=False)  # force_expr=False: the string is executed as a function
    print(dimensions)

    # Get only the text; force_expr=True: the string is executed as an expression
    content = await page.evaluate(pageFunction="document.body.textContent", force_expr=True)
    print(content)

    # Print the title of the current page
    print(await page.title())

    # Grab the news content.
    # pyppeteer has three parsing methods:
    #   page.querySelector() / page.querySelectorAll() / page.xpath()
    # abbreviated as:
    #   page.J() / page.JJ() / page.Jx()
    element = await page.querySelector(".feed-infinite-wrapper > ul > li")
    print(element)

    elements = await page.querySelectorAll(".title-box a")
    for item in elements:
        print(await item.getProperty("textContent"))
        # Get the text content
        title_str = await (await item.getProperty("textContent")).jsonValue()
        # Get the link target
        title_link = await (await item.getProperty("href")).jsonValue()
        # Get an attribute value
        # title = await (await item.getProperty("class")).jsonValue()
        print(title_str, title_link)

    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
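A bare while loop around page.querySelector never ends if the element never appears. A bounded polling loop could look like the sketch below. wait_for_element is a hypothetical helper, and the timeout and interval defaults are arbitrary assumptions. pyppeteer also ships page.waitForSelector, which is usually the simpler choice.

```python
import asyncio
from typing import Any


async def wait_for_element(page: Any, selector: str,
                           timeout: float = 10.0, interval: float = 0.25) -> Any:
    """Poll page.querySelector(selector) until it returns an element.

    Raises asyncio.TimeoutError if nothing matches within `timeout` seconds.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        element = await page.querySelector(selector)
        if element is not None:
            return element
        if loop.time() >= deadline:
            raise asyncio.TimeoutError(
                f"no element matched {selector!r} within {timeout}s"
            )
        # Yield to the event loop between attempts instead of busy-waiting.
        await asyncio.sleep(interval)
```

In the example above, the while loop would become element = await wait_for_element(page, ".t").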

Simulate text input and click

# Simulate typing an account name and password.
# The parameter {"delay": rand_int()} sets the delay between keystrokes.
await page.type("#kw", "Baidu", delay=100)
await page.type("#TPL_username_1", "asdasd")
await page.waitFor(1000)
await page.click("#su")
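The comment above mentions rand_int() without defining it. One possible definition is sketched here: both rand_int and type_slowly are hypothetical helpers, and the 80 to 200 millisecond bounds are arbitrary assumptions. page is assumed to be a pyppeteer Page, whose type() method takes an options dictionary as its third argument.

```python
import random
from typing import Any


def rand_int(low: int = 80, high: int = 200) -> int:
    """Return a random keystroke delay in milliseconds (bounds are arbitrary)."""
    return random.randint(low, high)


async def type_slowly(page: Any, selector: str, text: str) -> None:
    """Type text into the element matching selector with a randomized delay.

    One delay value is drawn per call and applied between all keystrokes
    of that call.
    """
    await page.type(selector, text, {"delay": rand_int()})
```

Inside a crawler coroutine: await type_slowly(page, "#kw", "Baidu").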

Removing the "Chrome is being controlled by automated test software" banner

# Add the parameter ignoreDefaultArgs=["--enable-automation"]
browser = await launch(headless=False, ignoreDefaultArgs=["--enable-automation"])

Crawling the JD.com mall

from bs4 import BeautifulSoup  # unused in this example
from pyppeteer import launch
import asyncio


def screen_size():
    return 1600, 900


async def main(url):
    # Add "headless": False to the options to show the browser window
    browser = await launch({"args": ["--no-sandbox"]})
    page = await browser.newPage()
    width, height = screen_size()
    await page.setViewport(viewport={"width": width, "height": height})
    await page.setJavaScriptEnabled(enabled=True)
    await page.setUserAgent(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"
    )
    await page.goto(url)
    # Scroll to the bottom so that lazily loaded items are rendered
    await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
    await asyncio.sleep(1)
    # content = await page.content()

    li_list = await page.xpath('//*[@id="J_goodsList"]/ul/li')
    item_list = []
    for li in li_list:
        a = await li.xpath('.//div[@class="p-img"]/a')
        detail_url = await (await a[0].getProperty("href")).jsonValue()
        promo_words = await (await a[0].getProperty("title")).jsonValue()
        a = await li.xpath('.//div[@class="p-commit"]/strong/a')
        p_commit = await (await a[0].getProperty("textContent")).jsonValue()
        i = await li.xpath("./div/div[3]/strong/i")
        price = await (await i[0].getProperty("textContent")).jsonValue()
        em = await li.xpath("./div/div[4]/a/em")
        title = await (await em[0].getProperty("textContent")).jsonValue()
        item = {
            "title": title,
            "detail_url": detail_url,
            "promo_words": promo_words,
            "p_commit": p_commit,
            "price": price,
        }
        item_list.append(item)

    await page_close(browser)
    return item_list


async def page_close(browser):
    # Close every open page, then the browser itself
    for _page in await browser.pages():
        await _page.close()
    await browser.close()


url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&wq=" \
      "%E6%89%8B%E6%9C%BA&pvid=e07184578b8442c58ddd65b221020e99&page={}&s=56&click=0"

task_list = []
for i in range(1, 4):
    page = i * 2 - 1
    task_list.append(main(url.format(page)))

# Run the three page crawls concurrently
results = asyncio.get_event_loop().run_until_complete(asyncio.gather(*task_list))
for i in results:
    print(i, len(i))
print("*")
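The hand-encoded search URL above is easy to mistype. As an alternative, the query string can be built with the standard library, as in the sketch below. build_search_urls is a hypothetical helper, and it includes only the keyword and page parameters. Whether JD.com accepts requests without the other parameters (wq, pvid, s, click) is an untested assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.jd.com/Search"


def build_search_urls(keyword: str, pages: int) -> list:
    """Build search URLs for result pages 1..pages.

    Uses the same mapping as the example above: index i becomes page = i * 2 - 1.
    """
    urls = []
    for i in range(1, pages + 1):
        # urlencode percent-encodes non-ASCII keywords as UTF-8.
        query = urlencode({"keyword": keyword, "page": i * 2 - 1})
        urls.append(f"{BASE_URL}?{query}")
    return urls


if __name__ == "__main__":
    for search_url in build_search_urls("手机", 3):
        print(search_url)
```

Each resulting URL can then be passed to main(url) in place of url.format(page).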

That concludes this introduction to using the pyppeteer library for Python crawlers. Thank you for reading.
