
How to use regular expressions to implement web crawlers

Updated 2025-01-18. From: SLTechnology News&Howtos (shulou)


Shulou(Shulou.com)06/02 Report--

This article shows how to use regular expressions to implement a simple web crawler in Java. It walks through the approach step by step and then gives the complete code.

Approach:

1. To simulate a web crawler, first deploy a page named 1.html on a local Tomcat server. Deployment step: create a new file 1.html in the webapps/ROOT directory under the Tomcat installation directory, then edit it (for example with Notepad++) so that it contains some email addresses for the crawler to find.

2. Use URL to connect to the web page.

3. Get the input stream, which is used to read the content of the web page.

4. Define the matching rule. Because we are crawling the email addresses in the web page, we create a regular expression that matches an email address: String regex = "\\w+@\\w+(\\.\\w+)+";

5. Put the extracted data into the collection.
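
Steps 4 and 5 can be tried on their own before any network code is involved. The following is a minimal sketch, assuming the email pattern is `\w+@\w+(\.\w+)+`; the class name `MailRegexSketch`, the helper `extractMails`, and the sample addresses are made up for illustration and are not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MailRegexSketch {
    // Assumed email pattern: word characters, '@', word characters,
    // then one or more ".word" groups (for example ".com" or ".example.org").
    private static final Pattern MAIL = Pattern.compile("\\w+@\\w+(\\.\\w+)+");

    // Hypothetical helper: returns every substring of the line that matches the pattern.
    static List<String> extractMails(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = MAIL.matcher(line);
        // find() moves to the next match each time it is called.
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample line with two made-up addresses.
        String line = "<p>Contact: alice@example.com, bob@mail.example.org</p>";
        System.out.println(extractMails(line));
        // prints [alice@example.com, bob@mail.example.org]
    }
}
```

Note that `\w` in Java covers only letters, digits, and the underscore, so this simple pattern would capture only part of an address whose local part contains a dot or a hyphen (for `first.last@example.com` it would return `last@example.com`). That may be acceptable for a demonstration page but is worth keeping in mind for real data.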

Code:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Web crawler: a program that collects data matching specified rules from the Internet.
 */
public class RegexDemo {
    public static void main(String[] args) throws Exception {
        List<String> list = getMailByWeb();
        for (String str : list) {
            System.out.println(str);
        }
    }

    private static List<String> getMailByWeb() throws Exception {
        // 1. Connect to the web page using URL.
        //    A single slash after the port is enough; nothing in this URL needs escaping.
        String path = "http://localhost:8080/1.html";
        URL url = new URL(path);

        // Regular expression that matches email addresses,
        // compiled into a reusable Pattern object.
        String regex = "\\w+@\\w+(\\.\\w+)+";
        Pattern p = Pattern.compile(regex);

        // Collection that holds the extracted data.
        List<String> list = new ArrayList<>();

        // 2. Get the input stream and wrap it in a buffered reader.
        //    try-with-resources closes the stream when reading is done.
        InputStream is = url.openStream();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Matcher for the current line.
                Matcher m = p.matcher(line);
                while (m.find()) {
                    // 3. Store every match in the collection.
                    list.add(m.group());
                }
            }
        }
        return list;
    }
}

Note: start the Tomcat server before running the program.
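
If Tomcat is not running, the connection attempt fails with an exception. One way to make that failure quick and readable is to go through URLConnection and set timeouts, as in the sketch below. The class name `PageReaderSketch`, the helper `readLines`, the 3000 ms timeout, and the UTF-8 charset are assumptions for illustration, not part of the original program.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PageReaderSketch {
    // Hypothetical helper: reads all lines from a URL, using timeoutMillis
    // for both the connect timeout and the read timeout.
    static List<String> readLines(URL url, int timeoutMillis) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        List<String> lines = new ArrayList<>();
        // Assumes the page is encoded as UTF-8.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        try {
            URL url = URI.create("http://localhost:8080/1.html").toURL();
            readLines(url, 3000).forEach(System.out::println);
        } catch (IOException e) {
            // Reached when the server is down, the page is missing, or a timeout expires.
            System.err.println("Could not read the page (is Tomcat running?): " + e.getMessage());
        }
    }
}
```

The lines returned by `readLines` could then be passed to the same Pattern and Matcher loop used in RegexDemo.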

Running result: the program prints each email address found in 1.html, one per line.

That concludes this walkthrough of using regular expressions to implement a web crawler. If you run into a similar problem, the analysis above may serve as a reference.
