How to crawl web page data with Selenium and Tesseract-OCR CAPTCHA recognition

2025-02-24 Update From: SLTechnology News&Howtos shulou

This article explains in detail how to use Selenium together with Tesseract-OCR to recognize an image CAPTCHA automatically and crawl web page data. It is shared here as a practical reference.

1. Project requirements description

The task is to fetch the detailed data for an order from a certain system by its order number. The query does not require logging in with an account and password, but it is protected by a dynamically generated image CAPTCHA. The retrieved data is saved to a database.

2. Overall approach

1. Open the browser in headless (windowless) mode with Selenium

2. Enter the order number in the input box

3. Save a screenshot of the image CAPTCHA locally

4. Use Tesseract-OCR to recognize the saved CAPTCHA image locally and convert it to text

5. Enter the recognized CAPTCHA text in the CAPTCHA input box

6. Click the query button to get the list data
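
Because OCR can misread a CAPTCHA, steps 3 to 6 are usually worth wrapping in a retry loop. The following is only a sketch of that control flow, not the article's own code: the browser and OCR work is hidden behind two injected functions so the loop can be run without Selenium or Tesseract. The names `CaptchaFlow`, `queryWithRetry`, `readCaptcha`, `submitQuery` and `maxAttempts` are assumptions made for illustration.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Supplier;

/** Control-flow sketch only: the browser and OCR steps are injected as functions. */
public class CaptchaFlow {

    /**
     * Tries up to maxAttempts times to read a CAPTCHA and submit the query.
     *
     * @param readCaptcha steps 3-4: screenshot plus OCR; returns null or "" when OCR fails
     * @param submitQuery steps 5-6: enters the code and queries; returns
     *                    Optional.empty() when the site rejects the code
     * @return the query result, or Optional.empty() if every attempt failed
     */
    public static <T> Optional<T> queryWithRetry(Supplier<String> readCaptcha,
                                                 Function<String, Optional<T>> submitQuery,
                                                 int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String code = readCaptcha.get();
            if (code == null || code.isEmpty()) {
                continue; // OCR produced nothing usable; try again with a fresh image
            }
            Optional<T> result = submitQuery.apply(code);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

In a real crawler, `readCaptcha` would call the screenshot and OCR methods shown later in the article, and `submitQuery` would fill in the form and read the result table.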

3. Implementation

1. Download and install Google Chrome and the matching ChromeDriver (chromedriver.exe), note the driver's installation path, and configure that path in the project

2. Use Selenium to drive the browser

    // Requires: java.util.List, org.openqa.selenium.{By, Dimension, WebDriver, WebElement},
    //           org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions}
    // CHROME_DRIVER_PATH and TARGET_URL are placeholders for the driver install path and the site URL.
    // code (the order number), list (the result list) and log are defined elsewhere in the class.
    System.setProperty("webdriver.chrome.driver", CHROME_DRIVER_PATH);
    ChromeOptions options = new ChromeOptions();
    options.addArguments("--headless");              // windowless mode
    options.addArguments("--disable-infobars");      // disable the info bar
    options.addArguments("--disable-extensions");    // disable extensions
    options.addArguments("--disable-gpu");           // disable GPU
    options.addArguments("--no-sandbox");            // disable sandbox mode
    options.addArguments("--disable-dev-shm-usage");
    options.addArguments("--hide-scrollbars");       // hide scroll bars
    WebDriver driver = new ChromeDriver(options);
    try {
        driver.get(TARGET_URL);
        driver.manage().window().setSize(new Dimension(450, 260)); // resize the window
        // After the page has opened, save the CAPTCHA image locally
        saveImgToLocal(driver);
        Thread.sleep(2000);
        // Recognize the CAPTCHA with OCR
        String codeByOCR = getCodeByOCR();
        if (codeByOCR != null) {
            try {
                WebElement input1 = driver.findElement(By.id(TEXTBOX1));
                input1.sendKeys(code);      // order number
                WebElement input2 = driver.findElement(By.id(TEXTBOX2));
                input2.sendKeys(codeByOCR); // recognized CAPTCHA
                // Click query and read the table data
                WebElement addButton = driver.findElement(By.id(SELECT_BUTTON));
                addButton.click();
                List<WebElement> tRCollection =
                        driver.findElement(By.id(TABLE_ID)).findElements(By.tagName("tr"));
                for (int t = 1; t < tRCollection.size(); t++) { // skip the header row
                    List<WebElement> tDCollection = tRCollection.get(t).findElements(By.tagName("td"));
                    VipLogisticsMinHangDetailVo minHangDetailVo = new VipLogisticsMinHangDetailVo();
                    minHangDetailVo.setLogistics_number(code);
                    for (int i = 0; i < tDCollection.size(); i++) {
                        String text = tDCollection.get(i).getText();
                        switch (i) {
                            case 0: minHangDetailVo.setTime(text); break;
                            case 1: minHangDetailVo.setOutlet(text); break;
                            case 2: minHangDetailVo.setOrganization(text); break;
                            case 3: minHangDetailVo.setEvent(text); break;
                            case 4: minHangDetailVo.setDetail(text); break;
                            default: break;
                        }
                    }
                    list.add(minHangDetailVo);
                }
                log.info("CAPTCHA recognition successful!");
            } catch (Exception e) {
                // The messages below are the site's alert texts as translated in this article;
                // they must match the text the site actually returns.
                if (e.toString().contains("error prompt: CAPTCHA error or expired!")) {
                    log.error("CAPTCHA recognition error! " + e.toString());
                } else if (e.toString().contains("error prompt: please enter CAPTCHA!")) {
                    log.error("No CAPTCHA!: " + e.toString());
                } else {
                    log.error("other exceptions: " + e.toString());
                }
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        driver.quit(); // always close the browser, even after an exception
    }
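
The inner loop above copies each table cell into a field by column index, and every `case` needs its own `break`, otherwise later cases overwrite earlier fields. As a hedged illustration, the mapping can be isolated in a small method that works on plain strings and is easy to test without a browser. `TrackingRow` is a hypothetical stand-in for `VipLogisticsMinHangDetailVo`, which the article does not show. The field names are taken from the setters used above, and the sample values in `main` are made up.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for VipLogisticsMinHangDetailVo (not shown in the article). */
public class TrackingRow {
    public String orderNumber;
    public String time;
    public String outlet;
    public String organization;
    public String event;
    public String detail;

    /** Copies up to five cell texts into the matching fields, in column order. */
    public static TrackingRow fromCells(String orderNumber, List<String> cells) {
        TrackingRow row = new TrackingRow();
        row.orderNumber = orderNumber;
        for (int i = 0; i < cells.size(); i++) {
            String text = cells.get(i).trim();
            switch (i) {
                case 0: row.time = text; break;
                case 1: row.outlet = text; break;
                case 2: row.organization = text; break;
                case 3: row.event = text; break;
                case 4: row.detail = text; break;
                default: break; // ignore any extra columns
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Illustrative values only
        TrackingRow row = fromCells("ORDER-1",
                Arrays.asList("2020-06-03 10:00", "Outlet A", "Org B", "Received", "Signed"));
        System.out.println(row.time + " | " + row.event);
    }
}
```

With Selenium, the `cells` list would be built by calling `getText()` on each `td` element of a row.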

3. Save a screenshot of the image CAPTCHA locally (screenshot method)

    // Requires: java.awt.image.BufferedImage, java.io.File, java.io.IOException, javax.imageio.ImageIO,
    //           org.apache.commons.io.FileUtils, org.openqa.selenium.{By, OutputType, Point, TakesScreenshot, WebDriver, WebElement}
    // IMG_ELEMENT_ID and IMG_DIR are placeholders for the CAPTCHA <img> element ID and the local save directory.
    private void saveImgToLocal(WebDriver driver) {
        WebElement element = driver.findElement(By.id(IMG_ELEMENT_ID));
        // Take a screenshot of the whole page
        File screen = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        try {
            BufferedImage image = ImageIO.read(screen);
            // Crop the screenshot to the element's position, width and height
            Point p = element.getLocation(); // element coordinates
            BufferedImage img = image.getSubimage(p.getX(), p.getY(),
                    element.getSize().getWidth(), element.getSize().getHeight());
            ImageIO.write(img, "png", screen);
            FileUtils.copyFile(screen, new File(IMG_DIR + "imgname.png"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
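
`BufferedImage.getSubimage` throws a `RasterFormatException` when the requested rectangle extends beyond the image, which can happen if the element sits partly outside the window or if the screenshot is scaled relative to page coordinates. The sketch below clips the rectangle to the screenshot bounds before cropping. The class name `CaptchaCrop` and the `scale` parameter are assumptions for illustration, and the correct scale (often 1.0 in headless mode) has to be checked in the actual environment.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

public class CaptchaCrop {

    /**
     * Crops an element's rectangle out of a full-page screenshot.
     *
     * @param scale ratio of screenshot pixels to page coordinates (assumed, e.g. 1.0)
     * @throws IllegalArgumentException if the element lies entirely outside the screenshot
     */
    public static BufferedImage crop(BufferedImage page, int x, int y,
                                     int width, int height, double scale) {
        Rectangle wanted = new Rectangle(
                (int) Math.round(x * scale), (int) Math.round(y * scale),
                (int) Math.round(width * scale), (int) Math.round(height * scale));
        // Keep only the part of the rectangle that is inside the screenshot
        Rectangle clipped = wanted.intersection(
                new Rectangle(0, 0, page.getWidth(), page.getHeight()));
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("Element lies outside the screenshot: " + wanted);
        }
        return page.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
    }
}
```

With Selenium, `x`, `y`, `width` and `height` would come from `element.getLocation()` and `element.getSize()`.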

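4. Save the image CAPTCHA locally (mouse method)

    // Requires: java.awt.Robot, java.awt.event.KeyEvent,
    //           org.openqa.selenium.{By, WebDriver}, org.openqa.selenium.interactions.Actions
    // IMG_ELEMENT_ID is a placeholder for the CAPTCHA <img> element ID.
    // SAVE_IMG_EXE is the external save program, defined elsewhere.
    private static void saveImgToLocal1(WebDriver driver) {
        Actions action = new Actions(driver);
        // Right-click the CAPTCHA image to open the context menu
        action.contextClick(driver.findElement(By.id(IMG_ELEMENT_ID))).build().perform();
        try {
            Robot robot = new Robot();
            Thread.sleep(1000);
            // Move down two menu entries; release each key so it does not stay pressed
            robot.keyPress(KeyEvent.VK_DOWN);
            robot.keyRelease(KeyEvent.VK_DOWN);
            Thread.sleep(1000);
            robot.keyPress(KeyEvent.VK_DOWN);
            robot.keyRelease(KeyEvent.VK_DOWN);
            Thread.sleep(1000);
            // Confirm the selected menu entry
            robot.keyPress(KeyEvent.VK_ENTER);
            robot.keyRelease(KeyEvent.VK_ENTER);
            Thread.sleep(1000);
            // Run the external save program
            Runtime.getRuntime().exec(SAVE_IMG_EXE);
            Thread.sleep(10000);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }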
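
The fixed `Thread.sleep(10000)` after launching the save program either wastes time or is too short on a slow machine. One possible alternative, sketched here under the assumption that the save program writes the image to a known path, is to poll until the file exists and its size has stopped changing. `FileWait` and `waitForFile` are hypothetical names, not part of the article's code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileWait {

    /**
     * Polls until the file exists with a non-zero size that is unchanged
     * between two consecutive polls, or until the timeout expires.
     *
     * @return true if the file appeared and looked complete, false on timeout
     */
    public static boolean waitForFile(Path file, long timeoutMillis, long pollMillis)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long lastSize = -1;
        while (System.currentTimeMillis() < deadline) {
            if (Files.exists(file)) {
                long size = Files.size(file);
                if (size > 0 && size == lastSize) {
                    return true; // size stable across two polls: assume the write finished
                }
                lastSize = size;
            }
            Thread.sleep(pollMillis);
        }
        return false;
    }
}
```

A stable size is only a heuristic for "the write has finished", so the poll interval should be longer than any pause the save program might make while writing.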

5. Recognize the local CAPTCHA image with OCR

    // Requires: java.io.File, net.sourceforge.tess4j.{ITesseract, Tesseract}
    // IMG_DIR and TESSDATA_PATH are placeholders for the local image directory and the tessdata directory.
    // systemFalg, replaceBlank and log are defined elsewhere in the class.
    private String getCodeByOCR() {
        String result = null;
        File file = new File(IMG_DIR);
        if (!file.exists()) {
            if (systemFalg != 1) {
                file.setWritable(true, false);
            }
            file.mkdirs();
        }
        File imageFile = new File(IMG_DIR + "imgname.png");
        if (imageFile.exists()) {
            ITesseract instance = new Tesseract();
            instance.setDatapath(TESSDATA_PATH);
            try {
                String doOCR = instance.doOCR(imageFile);
                result = replaceBlank(doOCR); // strip whitespace from the OCR output
                log.info("parsed CAPTCHA is: {}", result != null ? result : "empty!");
            } catch (Exception e) {
                e.printStackTrace();
                log.error("parsing CAPTCHA exception!");
            }
        } else {
            log.error("the file parsing the CAPTCHA does not exist!");
        }
        return result;
    }
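
The `getCodeByOCR` method passes the raw OCR text through `replaceBlank`, which the article does not show. The sketch below gives one plausible implementation, plus two optional helpers that often improve CAPTCHA recognition: a fixed-threshold black-and-white conversion to apply before OCR, and a format check to reject obviously wrong results before submitting them. All three bodies, the class name `CaptchaOcrHelpers`, the threshold and the expected CAPTCHA length are assumptions, not the author's code.

```java
import java.awt.image.BufferedImage;

public class CaptchaOcrHelpers {

    /** Removes all whitespace (spaces, tabs, line breaks) from raw OCR text. */
    public static String replaceBlank(String raw) {
        return raw == null ? null : raw.replaceAll("\\s+", "");
    }

    /** Returns true when text consists of exactly `length` ASCII letters or digits. */
    public static boolean looksLikeCaptcha(String text, int length) {
        return text != null && text.matches("[A-Za-z0-9]{" + length + "}");
    }

    /** Converts an image to pure black and white using a luminance threshold (0-255). */
    public static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Weighted luminance; pixels darker than the threshold become black
                int luminance = (r * 299 + g * 587 + b * 114) / 1000;
                out.setRGB(x, y, luminance < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

Tess4J's `doOCR` also accepts a `BufferedImage`, so a binarized image can be passed to it directly. A suitable threshold depends on the CAPTCHA's colors and has to be found by experiment.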
