This article introduces how to use the web scraping framework JSoup and walks through several runnable examples. It has some reference value for interested readers; I hope you learn a lot after reading it.
For reference and study only.
Web scraping frameworks
Like many modern technologies, extracting information from websites can be done with several different frameworks. The most popular ones are JSoup, HtmlUnit, and Selenium WebDriver. This article focuses on JSoup.
JSoup
JSoup is an open source project that provides a powerful API for data extraction. You can use it to parse the HTML in a given URL, file, or string. It can also manipulate HTML elements and attributes.
Parsing strings using JSoup
Parsing strings is the easiest way to use JSoup.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupExample {
    public static void main(String[] args) {
        // A small HTML document to parse (title plus two paragraphs)
        String html = "<html><head><title>Website title</title></head>"
                + "<body><p>Sample paragraph number 1</p>"
                + "<p>Sample paragraph number 2</p></body></html>";
        Document doc = Jsoup.parse(html);
        // Print the document title
        System.out.println(doc.title());
        // Collect every <p> element and print its text
        Elements paragraphs = doc.getElementsByTag("p");
        for (Element paragraph : paragraphs) {
            System.out.println(paragraph.text());
        }
    }
}
This code is very intuitive. The parse() method parses the input HTML and turns it into a Document object; you can then extract and manipulate data by calling methods on that object. In the example above, we first print the title of the page, then fetch all elements with the tag p, and finally print the text of each paragraph in turn. Running this code gives the following output:
Website title
Sample paragraph number 1
Sample paragraph number 2
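Besides getElementsByTag(), a Document also supports CSS-style selectors through its select() method, which appears again later in this article. A minimal sketch of the equivalent extraction, reusing the doc object from above:

    // CSS selector equivalent of getElementsByTag("p")
    Elements paragraphs = doc.select("p");
    for (Element paragraph : paragraphs) {
        System.out.println(paragraph.text());
    }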
Parsing a URL using JSoup
Parsing a URL differs slightly from parsing a string, but the basic principle is the same:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupExample {
    public static void main(String[] args) throws IOException {
        // Fetch and parse the Wikipedia home page
        Document doc = Jsoup.connect("https://www.wikipedia.org").get();
        // Grab every element with the class "other-project"
        Elements titles = doc.getElementsByClass("other-project");
        for (Element title : titles) {
            System.out.println(title.text());
        }
    }
}
To scrape data from a URL, call the connect() method with the URL as its argument, then call get() to fetch and parse the HTML from that connection. The output of this example is:
Commons Freely usable photos & more
Wikivoyage Free travel guide
Wiktionary Free dictionary
Wikibooks Free textbooks
Wikinews Free news source
Wikidata Free knowledge base
Wikiversity Free course materials
Wikiquote Free quote compendium
MediaWiki Free & open wiki application
Wikisource Free library
Wikispecies Free species directory
Meta-Wiki Community coordination & documentation
As you can see, this program grabs all the elements whose class is other-project.
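Jsoup.connect() actually returns a Connection object that can be configured before the page is fetched. Real sites often expect a user agent and benefit from an explicit timeout; here is a minimal sketch (the user-agent string is only an illustration):

    // userAgent() and timeout() configure the Connection before get() runs
    Document doc = Jsoup.connect("https://www.wikipedia.org")
            .userAgent("Mozilla/5.0 (compatible; JSoupExample/1.0)")
            .timeout(10000) // milliseconds
            .get();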
This approach is the most commonly used, so let's look at a few more examples of scraping via a URL.
Grab all links from a specified URL
public void allLinksInUrl() throws IOException {
    Document doc = Jsoup.connect("https://www.wikipedia.org").get();
    // Select every anchor element that has an href attribute
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        System.out.println("\nlink: " + link.attr("href"));
        System.out.println("text: " + link.text());
    }
}
The result is a long list:
link: //en.wikipedia.org/
text: English 5 678 000+ articles
link: //ja.wikipedia.org/
text: 日本語 1 112 000+ 記事
link: //es.wikipedia.org/
text: Español 1 430 000+ artículos
link: //de.wikipedia.org/
text: Deutsch 2 197 000+ Artikel
link: //ru.wikipedia.org/
text: Русский 1 482 000+ статей
link: //it.wikipedia.org/
text: Italiano 1 447 000+ voci
link: //fr.wikipedia.org/
text: Français 2 000 000+ articles
link: //zh.wikipedia.org/
text: 中文 1 013 000+ 條目
link: //www.wiktionary.org/
text: Wiktionary Free dictionary
link: //www.wikibooks.org/
text: Wikibooks Free textbooks
link: //www.wikinews.org/
text: Wikinews Free news source
link: //www.wikidata.org/
text: Wikidata Free knowledge base
link: //www.wikiversity.org/
text: Wikiversity Free course materials
link: //www.wikiquote.org/
text: Wikiquote Free quote compendium
link: //www.mediawiki.org/
text: MediaWiki Free & open wiki application
link: //www.wikisource.org/
text: Wikisource Free library
link: //species.wikimedia.org/
text: Wikispecies Free species directory
link: //meta.wikimedia.org/
text: Meta-Wiki Community coordination & documentation
link: https://creativecommons.org/licenses/by-sa/3.0/
text: Creative Commons Attribution-ShareAlike License
link: //meta.wikimedia.org/wiki/Terms_of_Use
text: Terms of Use
link: //meta.wikimedia.org/wiki/Privacy_policy
text: Privacy Policy
Similarly, you can retrieve the number of images, the meta information, form parameters, and just about anything else you can think of, so this approach is often used to gather statistics.
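As a short sketch of that idea, the snippet below counts the images on an already-fetched Document and prints its named meta tags; which elements a given page actually contains is, of course, page-specific:

    // Count images and list named meta tags on a fetched Document
    Elements images = doc.select("img[src]");
    System.out.println("Number of images: " + images.size());
    for (Element meta : doc.select("meta[name]")) {
        System.out.println(meta.attr("name") + " = " + meta.attr("content"));
    }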
Parsing files using JSoup
public void parseFile() throws URISyntaxException, IOException {
    // Load page.html from the classpath resources
    URL path = ClassLoader.getSystemResource("page.html");
    File inputFile = new File(path.toURI());
    // Parse the file, specifying its character encoding
    Document document = Jsoup.parse(inputFile, "UTF-8");
    System.out.println(document.title());
    // parse the document in any of the ways shown earlier
}
If you parse a file, you don't need to send a request to the website, so you don't have to worry about your program putting too much load on the server. Although this approach has many limitations, and the static data makes it unsuitable for many tasks, it provides a more legal and harmless way to analyze data.
The resulting document can be parsed in any of the ways mentioned earlier.
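For instance, continuing from parseFile() above, the same selection API applies to the file-backed document (what page.html contains depends on what you saved locally):

    // The file-backed Document supports the same extraction API
    Elements links = document.select("a[href]");
    System.out.println("Links in page.html: " + links.size());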
Set attribute values
In addition to reading data from strings, URLs, and files, we can also modify the data and input forms.
For example, when visiting Amazon, clicking the website logo in the upper-left corner returns you to the home page of the site.
If you want to change this behavior, you can do the following:
public void setAttributes() throws IOException {
    Document doc = Jsoup.connect("https://www.amazon.com").get();
    // Look up the logo element by its id
    Element element = doc.getElementById("nav-logo");
    System.out.println("Element: " + element.outerHtml());
    // Point both child links somewhere else
    element.children().attr("href", "notamazon.org");
    System.out.println("Element with set attribute: " + element.outerHtml());
}
After looking up the website logo by its id, we can view its HTML. We can also access its child elements and change their attributes.
Element: [the nav-logo element's outer HTML; its two child links, "Amazon" and "Try Prime", point to their original Amazon URLs]
Element with set attribute: [the same outer HTML, with the href attribute of both child links now set to "notamazon.org"]
By default, both child elements point to their respective links. After changing the attribute to a different value, you can see that the href attribute of each child element has been updated.
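Input forms were mentioned above as well: JSoup's Connection API can also submit form fields with a POST request via data() and post(). A minimal sketch, where the URL and field name are hypothetical placeholders rather than a real endpoint:

    // data() adds form fields; post() submits them and parses the response.
    // The URL and field name below are hypothetical, for illustration only.
    Document result = Jsoup.connect("https://example.com/search")
            .data("query", "jsoup")
            .post();
    System.out.println(result.title());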
Add or remove classes
In addition to setting attribute values, we can also extend the previous example to add classes to or remove classes from an element:
public void changePage() throws IOException {
    Document doc = Jsoup.connect("https://www.amazon.com").get();
    Element element = doc.getElementById("nav-logo");
    System.out.println("Original Element: " + element.outerHtml());
    // Change the href attribute of both child links
    element.children().attr("href", "notamazon.org");
    System.out.println("Element with set attribute: " + element.outerHtml());
    // Add a class to the element, then remove it again
    element.addClass("someClass");
    System.out.println("Element with added class: " + element.outerHtml());
    element.removeClass("someClass");
    System.out.println("Element with removed class: " + element.outerHtml());
}
Running the code gives us the following information:
Original Element: [the nav-logo element's outer HTML, with child links "Amazon" and "Try Prime"]
Element with set attribute: [the same element, with both child links' href set to "notamazon.org"]
Element with added class: [the same element, now carrying the class "someClass"]
Element with removed class: [the same element, with "someClass" removed again]
You can save the modified code locally as an .html file, or send it to a website in an HTTP request, but be aware that the latter may be illegal.
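Saving the modified document locally is straightforward; a minimal sketch using the standard library (requires java.nio.file.Files and java.nio.file.Path, Java 11+; the file name is arbitrary):

    // Serialize the modified Document back to HTML and write it to disk
    Files.writeString(Path.of("modified-page.html"), doc.outerHtml());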
Thank you for reading this article carefully. I hope this article on how to use the web scraping framework JSoup has been helpful to you. There is much more related knowledge waiting for you to learn!