This article mainly introduces how to collect HTML data with Java. Many people have doubts about this in their daily work, so the editor has consulted various materials and put together a simple, easy-to-use method. I hope it helps answer your questions about collecting HTML data with Java. Now, please follow the editor and study!
The group() method in regular expressions
Before looking at how regular expressions help Java collect HTML pages, we need to briefly introduce the group() method:
public static void main(String[] args) {
    // Pattern compiles the regular expression; three capture groups are wrapped in parentheses ().
    // The first group matches the URL (the expression is not a strictly correct URL pattern, but it happens to match here),
    // the second group matches the title "SoFlash", and the third matches the date.
    // A single expression therefore matches the url, the title and the date all at once.
    Pattern p = Pattern.compile("='(\\w.+)'>(\\w.+[a-zA-Z])-(\\d{1,2}\\.\\d{1,2}\\.\\d{4})");
    // sample string: a link followed by a title and a date
    String s = "<a href='www.cnblogs.com/longwu'>SoFlash-12.22.2011</a>";
    Matcher m = p.matcher(s);
    while (m.find()) {
        // print out the url, title and date by passing the group index to group()
        System.out.println("Print out the url link: " + m.group(1));
        System.out.println("Print out the title: " + m.group(2));
        System.out.println("Printed date: " + m.group(3));
        System.out.println();
    }
    System.out.println("Number of data captured by group method: " + m.groupCount());
}
Let's take a look at the output:
Print out the url link: www.cnblogs.com/longwu
Print out the title: SoFlash
Printed date: 12.22.2011
Number of data captured by group method: 3
For more details, see "Java regular expressions (super detailed)". If you have not learned regular expressions before, start by looking at how the metacharacters in this expression match.
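As a quick refresher, the snippet below (an illustration added here, reusing the same date pattern as above) annotates the metacharacters this article relies on:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetacharacterDemo {
    public static void main(String[] args) {
        // \d matches a digit, \w matches a word character (letter, digit or underscore)
        // .  matches any single character, + means "one or more", * means "zero or more"
        // {1,2} means "one or two occurrences", ( ) creates a capture group
        // [^<>] is a character class matching any character except < and >
        Pattern date = Pattern.compile("(\\d{1,2})\\.(\\d{1,2})\\.(\\d{4})");
        Matcher m = date.matcher("12.22.2011");
        if (m.find()) {
            // for this sample string: group(1) = month, group(2) = day, group(3) = year
            System.out.println(m.group(1) + " / " + m.group(2) + " / " + m.group(3));
        }
    }
}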
All right, now that the group() method has been introduced, let's use it to collect data from a football website page.
Page link: http://www.footballresults.org/league.php?all=1&league=EngPrem
First, we read the entire HTML page and print it. The code is as follows:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public static void main(String[] args) {
    String strUrl = "http://www.footballresults.org/league.php?all=1&league=EngPrem";
    try {
        // create a URL object pointing to the site; the link is passed to the constructor
        // for more information see http://wenku.baidu.com/view/8186caf4f61fb7360b4c6547.html
        URL url = new URL(strUrl);
        // InputStreamReader is an input stream reader that converts the bytes read into characters
        // for more information see http://blog.sina.com.cn/s/blog_44a05959010004il.html
        // utf-8 encoding is used throughout
        InputStreamReader isr = new InputStreamReader(url.openStream(), "utf-8");
        // BufferedReader reads the characters converted by InputStreamReader
        BufferedReader br = new BufferedReader(isr);
        // while what BufferedReader reads is not empty, print it;
        // the printed result is the whole page
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
        // close the reader
        br.close();
    } catch (IOException e) {
        // print the stack trace if an error occurs
        e.printStackTrace();
    }
}
The printed result is the source code of the entire HTML page (partial screenshots appear in the original article).
At this point the data has been successfully collected. Of course, we do not want the entire HTML source; what we need is the match data on the page.
First, let's analyze the structure of the HTML source. Open the http://www.footballresults.org/league.php?all=1&league=EngPrem page.
Right-click the page and choose "View Source" to see the internal HTML structure, the data we need, and the corresponding page elements (screenshots in the original article).
This is where regular expressions come in handy: we need to write a few of them to match the data we want. Here we need three expressions, covering the date, the two teams (home and away) and the match result, as follows:
String regularDate = "(\\d{1,2}\\.\\d{1,2}\\.\\d{4})"; // date regex
// team regex; part of the original expression was lost in this copy, so the character class
// and closing tag below are a reconstruction that assumes team names sit inside <a>...</a> tags
String regularTwoTeam = ">[^<>]*</a>";
String regularResult = ">(\\d{1,2}-\\d{1,2})"; // match result regex
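Before running these against the live page, it can help to sanity-check them on a small made-up HTML fragment. The fragment below is only an assumption about how a results row might be shaped, not copied from the real site:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSanityCheck {
    public static void main(String[] args) {
        // a hypothetical table row, shaped roughly like a results row might be
        String row = "<tr><td>17.12.2011</td><td><a href='#'>Arsenal</a></td>"
                   + "<td><a href='#'>Chelsea</a></td><td>2-1</td></tr>";

        String regularDate = "(\\d{1,2}\\.\\d{1,2}\\.\\d{4})";
        String regularTwoTeam = ">[^<>]*</a>";
        String regularResult = ">(\\d{1,2}-\\d{1,2})";

        printMatches("date", regularDate, row);
        printMatches("team", regularTwoTeam, row);
        printMatches("result", regularResult, row);
    }

    // prints every match (the whole matched text) of the given pattern found in the input
    static void printMatches(String label, String pattern, String input) {
        Matcher m = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(input);
        while (m.find()) {
            System.out.println(label + ": " + m.group());
        }
    }
}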
Once the regular expressions are written, we can use them to extract the data we want. First, let's write a GroupMethod class to hold a regularGroup() method:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupMethod {
    // takes two string parameters: the pattern (the regex we use) and the html source to search
    public String regularGroup(String pattern, String matcher) {
        Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(matcher);
        if (m.find()) {
            // if a match is found, return the captured data
            return m.group();
        } else {
            // otherwise return an empty string
            return "";
        }
    }
}
Then write the code of the main function.
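The main-function listing itself is not reproduced above, so the following is a minimal sketch of how the pieces could fit together, assuming the GroupMethod class and the three regular expressions above; the class name FootballResultsMain and the tag-cleanup lines are illustrative additions, not taken from the original article:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class FootballResultsMain {
    public static void main(String[] args) {
        String strUrl = "http://www.footballresults.org/league.php?all=1&league=EngPrem";
        String regularDate = "(\\d{1,2}\\.\\d{1,2}\\.\\d{4})";
        String regularTwoTeam = ">[^<>]*</a>";
        String regularResult = ">(\\d{1,2}-\\d{1,2})";
        GroupMethod gm = new GroupMethod();
        try {
            URL url = new URL(strUrl);
            BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream(), "utf-8"));
            String line;
            while ((line = br.readLine()) != null) {
                // run each regex over the current line of html;
                // note: regularGroup returns only the first match on each line,
                // so this sketch works best when each value sits on its own line of the source
                String date = gm.regularGroup(regularDate, line);
                String team = gm.regularGroup(regularTwoTeam, line);
                String result = gm.regularGroup(regularResult, line);
                // strip the leading ">" and trailing "</a>" that the team and result regexes keep
                team = team.replaceAll("</a>|>", "").trim();
                result = result.replace(">", "").trim();
                if (!date.isEmpty() || !team.isEmpty() || !result.isEmpty()) {
                    System.out.println(date + " " + team + " " + result);
                }
            }
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}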
(Partial screenshots in the original article compare the program's output with the corresponding data on the HTML page, at both the beginning and the end of the results.)
All right, the HTML data collection is complete. :)
Of course, this only grabs the content of one page. If you are interested in grabbing more pages, you can analyze the league name at the end of the link: for example, league=EngPrem, so changing the league name fetches that league's match data. You could write an interface that holds all the league names, or, more cleverly, write a method that reads the league names from the http://www.footballresults.org/allleagues.php page and appends each one to the "http://www.footballresults.org/league.php?all=1&league=" link, then loop over those links to read each league's match page.
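As a rough sketch of that last idea, assuming the links on the allleagues.php page carry a league= parameter that a simple regex can capture (the pattern, class name and output line below are illustrative, not from the original article):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllLeaguesLoop {
    public static void main(String[] args) throws Exception {
        String base = "http://www.footballresults.org/league.php?all=1&league=";
        // collect distinct league names from the allleagues page (assumes its links carry a league= parameter)
        Set<String> leagues = new LinkedHashSet<String>();
        Pattern p = Pattern.compile("league=(\\w+)");
        BufferedReader br = new BufferedReader(new InputStreamReader(
                new URL("http://www.footballresults.org/allleagues.php").openStream(), "utf-8"));
        String line;
        while ((line = br.readLine()) != null) {
            Matcher m = p.matcher(line);
            while (m.find()) {
                leagues.add(m.group(1));
            }
        }
        br.close();
        // loop over the league links; each one can then be scraped the same way as EngPrem above
        for (String league : leagues) {
            System.out.println("would scrape: " + base + league);
        }
    }
}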
At this point, the study of "how to collect HTML data with Java" is over. I hope it has helped resolve your doubts; pairing theory with practice is the best way to learn, so go and try it out! If you want to continue learning more related knowledge, please keep following this site, where the editor will keep bringing you more practical articles.