
How to use regular expressions to get the content of web pages in Java

Updated 2025-02-24. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/02

This article explains how to use regular expressions in Java to fetch a web page and extract parts of its content, such as the title, the links, and the script blocks. The editor shares it here as a reference and hopes it gives you a working understanding of the technique.

Crawling a web page and parsing parts of its HTML with regular expressions

package com.xiaofeng.picup;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Crawls the title and content of a page (test version). The URL is entered
 * manually; this could be extended to crawl all contents of a page automatically.
 */
public class WebContent {

    /** Reads the entire content of a web page. */
    public String getOneHtml(String htmlurl) throws IOException {
        URL url;
        String temp;
        StringBuffer sb = new StringBuffer();
        try {
            url = new URL(htmlurl);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "utf-8"));
            // Read the page line by line; line breaks are not kept
            while ((temp = in.readLine()) != null) {
                sb.append(temp);
            }
            in.close();
        } catch (MalformedURLException me) {
            System.out.println("The URL you entered is malformed: " + me.getMessage());
            throw me;
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        }
        return sb.toString();
    }

    /**
     * @param s the page HTML
     * @return the page title
     */
    public String getTitle(String s) {
        String regex;
        String title = "";
        List<String> list = new ArrayList<String>();
        regex = "<title>.*?</title>";
        Pattern pa = Pattern.compile(regex, Pattern.CANON_EQ);
        Matcher ma = pa.matcher(s);
        while (ma.find()) {
            list.add(ma.group());
        }
        for (int i = 0; i < list.size(); i++) {
            title = title + list.get(i);
        }
        return outTag(title);
    }

    /**
     * @param s the page HTML
     * @return the links
     */
    public List<String> getLink(String s) {
        String regex;
        List<String> list = new ArrayList<String>();
        // href value may be double-quoted, single-quoted, or unquoted
        regex = "<a[^>]*href=(\"([^\"]*)\"|'([^']*)'|([^\\s>]*))[^>]*>(.*?)</a>";
        Pattern pa = Pattern.compile(regex, Pattern.DOTALL);
        Matcher ma = pa.matcher(s);
        while (ma.find()) {
            list.add(ma.group());
        }
        return list;
    }

    /**
     * @param s the page HTML
     * @return the script code
     */
    public List<String> getScript(String s) {
        String regex;
        List<String> list = new ArrayList<String>();
        regex = "
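The title and link patterns from the listing above can be tried on an in-memory string, without any network access. The following is a minimal sketch, not the author's code: the class name RegexExtractDemo and the methods title, hrefs, and stripTags are hypothetical names chosen for illustration, and stripTags only stands in for the outTag helper, whose body is not shown in the listing. Unlike the listing, the sketch uses capture groups so that only the title text and the href values are returned rather than the whole matched tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names here are illustrative, not from the article.
final class RegexExtractDemo {

    // Lazy match between the title tags; DOTALL lets '.' span line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // href value in double quotes (group 1), single quotes (group 2),
    // or unquoted (group 3).
    private static final Pattern LINK = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Any run of characters enclosed in angle brackets.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Assumed stand-in for outTag: removes tags and trims whitespace.
    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll("").trim();
    }

    // Returns the text of the first title element, or "" if there is none.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? stripTags(m.group(1)) : "";
    }

    // Returns every href value found in an anchor tag, in document order.
    static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    out.add(m.group(g));
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Demo <b>page</b></title></head><body>"
                + "<a href=\"/a.html\">A</a> <a class='x' href='b.html'>B</a> "
                + "<a href=c.html>C</a></body></html>";
        System.out.println(title(html));
        System.out.println(hrefs(html));
    }
}
```

Regular expressions like these are adequate for quick extraction from pages with a known, simple structure. For nested or malformed markup, an HTML parser is usually the more reliable choice.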
