How to use the Web scraping framework JSoup

2025-04-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article introduces how to use the Web scraping framework JSoup. It has some reference value, and interested readers can use it as a guide; I hope you learn a lot from it.

For reference and study only.

Web scraping frameworks

As with many modern technologies, there are several frameworks to choose from for extracting information from websites. The most popular are JSoup, HtmlUnit, and Selenium WebDriver. This article discusses JSoup.

JSoup

JSoup is an open source project that provides a powerful data-extraction API. You can use it to parse the HTML of a given URL, file, or string. It can also manipulate HTML elements and attributes.

Parsing strings using JSoup

Parsing strings is the easiest way to use JSoup.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupExample {

    public static void main(String[] args) {
        String html = "<html><head><title>Website title</title></head>"
                + "<body><p>Sample paragraph number 1</p>"
                + "<p>Sample paragraph number 2</p></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());
        Elements paragraphs = doc.getElementsByTag("p");
        for (Element paragraph : paragraphs) {
            System.out.println(paragraph.text());
        }
    }
}

This code is very intuitive. The parse() method parses the input HTML into a Document object; you can then manipulate the document and extract data by calling methods on that object.

In the example above, we first print the title of the page. Then we get all the elements with the tag "p". Finally, we print the text of each paragraph in turn.
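Besides getElementsByTag, JSoup also supports CSS-style selectors through the select() and selectFirst() methods. Here is a small sketch of that alternative (the class name "highlight" is just an illustrative addition to the sample HTML):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorExample {
    public static void main(String[] args) {
        String html = "<html><head><title>Website title</title></head>"
                + "<body><p>Sample paragraph number 1</p>"
                + "<p class='highlight'>Sample paragraph number 2</p></body></html>";
        Document doc = Jsoup.parse(html);

        // select all <p> elements, equivalent to getElementsByTag("p")
        Elements allParagraphs = doc.select("p");
        System.out.println(allParagraphs.size());  // 2

        // select only <p> elements carrying the class "highlight"
        Element highlighted = doc.selectFirst("p.highlight");
        System.out.println(highlighted.text());    // Sample paragraph number 2
    }
}
```

CSS queries are usually more concise than chains of getElementsBy* calls once the selection involves classes, attributes, or nesting.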

By running this code, we can get the following output:

Website title

Sample paragraph number 1

Sample paragraph number 2

Parsing a URL using JSoup

Parsing a URL is a little different from parsing a string, but the basic idea is the same:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupExample {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.wikipedia.org").get();
        Elements titles = doc.getElementsByClass("other-project");
        for (Element title : titles) {
            System.out.println(title.text());
        }
    }
}

To scrape data from a URL, call the connect() method with the URL as an argument, then call get() to fetch and parse the HTML over that connection. The output of this example is:

Commons Freely usable photos & more

Wikivoyage Free travel guide

Wiktionary Free dictionary

Wikibooks Free textbooks

Wikinews Free news source

Wikidata Free knowledge base

Wikiversity Free course materials

Wikiquote Free quote compendium

MediaWiki Free & open wiki application

Wikisource Free library

Wikispecies Free species directory

Meta-Wiki Community coordination & documentation

As you can see, this program grabs all the elements whose class is "other-project".

This method is the most commonly used, so let's look at some other examples of fetching through URL.

Grab all links on the page at a given URL

public void allLinksInUrl() throws IOException {
    Document doc = Jsoup.connect("https://www.wikipedia.org").get();
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        System.out.println("\nlink: " + link.attr("href"));
        System.out.println("text: " + link.text());
    }
}

The result is a long list:

link: //en.wikipedia.org/
text: English 5 678 000+ articles

link: //ja.wikipedia.org/
text: 日本語 1 112 000+ 記事

link: //es.wikipedia.org/
text: Español 1 430 000+ artículos

link: //de.wikipedia.org/
text: Deutsch 2 197 000+ Artikel

link: //ru.wikipedia.org/
text: Русский 1 482 000+ статей

link: //it.wikipedia.org/
text: Italiano 1 447 000+ voci

link: //fr.wikipedia.org/
text: Français 2 000 000+ articles

link: //zh.wikipedia.org/
text: 中文 1 013 000+ 條目

link: //www.wiktionary.org/
text: Wiktionary Free dictionary

link: //www.wikibooks.org/
text: Wikibooks Free textbooks

link: //www.wikinews.org/
text: Wikinews Free news source

link: //www.wikidata.org/
text: Wikidata Free knowledge base

link: //www.wikiversity.org/
text: Wikiversity Free course materials

link: //www.wikiquote.org/
text: Wikiquote Free quote compendium

link: //www.mediawiki.org/
text: MediaWiki Free & open wiki application

link: //www.wikisource.org/
text: Wikisource Free library

link: //species.wikimedia.org/
text: Wikispecies Free species directory

link: //meta.wikimedia.org/
text: Meta-Wiki Community coordination & documentation

link: https://creativecommons.org/licenses/by-sa/3.0/
text: Creative Commons Attribution-ShareAlike License

link: //meta.wikimedia.org/wiki/Terms_of_Use
text: Terms of Use

link: //meta.wikimedia.org/wiki/Privacy_policy
text: Privacy Policy

Similarly, you can get the number of images, the meta information, form parameters, and almost anything else you can think of, so JSoup is often used to gather statistics about a page.
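Counting such elements is just a matter of running a CSS query and taking the size of the result. The sketch below uses a small static HTML string instead of a live site so it runs offline; the page content is invented for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageStats {
    public static void main(String[] args) {
        // a small static page stands in for a fetched document
        String html = "<html><head><title>Stats demo</title>"
                + "<meta name='description' content='demo page'></head>"
                + "<body><img src='a.png'><img src='b.png'>"
                + "<a href='https://example.com'>link</a></body></html>";
        Document doc = Jsoup.parse(html);

        // one CSS query per statistic
        System.out.println("images: " + doc.select("img[src]").size());  // images: 2
        System.out.println("links: " + doc.select("a[href]").size());    // links: 1
        System.out.println("meta: " + doc.select("meta").size());        // meta: 1
    }
}
```

The same queries work unchanged on a document obtained from Jsoup.connect(url).get().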

Parsing files using JSoup

public void parseFile() throws URISyntaxException, IOException {
    URL path = ClassLoader.getSystemResource("page.html");
    File inputFile = new File(path.toURI());
    Document document = Jsoup.parse(inputFile, "UTF-8");
    System.out.println(document.title());
    // parse the document in any of the ways shown above
}

If you parse a local file, you don't need to send a request to the website, so you don't have to worry about putting load on the server when you run the program. Although this approach has many limitations (the data is static, so it is not suitable for many tasks), it is a more legitimate and harmless way to analyze data.

The resulting document can be parsed in any of the ways mentioned earlier.

Setting attribute values

In addition to reading data from strings, URLs, and files, we can also modify the data and the input forms.

For example, when visiting Amazon, click the website logo in the upper left corner to return to the home page of the site.

If you want to change this behavior, do this:

public void setAttributes() throws IOException {
    Document doc = Jsoup.connect("https://www.amazon.com").get();
    Element element = doc.getElementById("nav-logo");
    System.out.println("Element: " + element.outerHtml());
    element.children().attr("href", "notamazon.org");
    System.out.println("Element with set attribute: " + element.outerHtml());
}

After getting the element with the id of the website logo, we can print its HTML. We can also access its child elements and change their attributes.

Element: (the outer HTML of the nav-logo element, containing the "Amazon" logo link and the "Try Prime" link)

Element with set attribute: (the same HTML, with the href attribute of both child links changed to notamazon.org)

By default, both child elements point to their respective links. After changing the attribute to a different value, you can see that the href attribute of the child element has been updated.
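A call like children().attr(...) sets the attribute on every element in the collection at once. Since scraping Amazon live is unreliable, the sketch below reproduces the same behavior on a hand-written stand-in for the nav-logo element (the child URLs are invented for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BulkAttrExample {
    public static void main(String[] args) {
        // a stand-in for the nav-logo element, with two child links
        String html = "<div id='nav-logo'>"
                + "<a href='https://www.amazon.com'>Amazon</a>"
                + "<a href='https://www.amazon.com/prime'>Try Prime</a></div>";
        Document doc = Jsoup.parse(html);
        Element element = doc.getElementById("nav-logo");

        // attr() on an Elements collection updates every element in it
        element.children().attr("href", "notamazon.org");

        System.out.println(element.child(0).attr("href"));  // notamazon.org
        System.out.println(element.child(1).attr("href"));  // notamazon.org
    }
}
```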

Add or remove classes

In addition to setting property values, we can also modify the previous example to add or remove classes to the element:

public void changePage() throws IOException {
    Document doc = Jsoup.connect("https://www.amazon.com").get();
    Element element = doc.getElementById("nav-logo");
    System.out.println("Original Element: " + element.outerHtml());
    element.children().attr("href", "notamazon.org");
    System.out.println("Element with set attribute: " + element.outerHtml());
    element.addClass("someClass");
    System.out.println("Element with added class: " + element.outerHtml());
    element.removeClass("someClass");
    System.out.println("Element with removed class: " + element.outerHtml());
}

Running the code gives us the following information:

Original Element: (the nav-logo HTML with the "Amazon" and "Try Prime" links)

Element with set attribute: (the same HTML with both href attributes set to notamazon.org)

Element with added class: (the same HTML with "someClass" added to the element's class attribute)

Element with removed class: (the same HTML with "someClass" removed again)

You can save the modified document locally as an .html file, or serve it back out over HTTP, but be aware that republishing a site's modified content may be illegal.
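Saving the modified document locally is a one-liner with java.nio. The sketch below writes a document to a temporary file; the file name and the sample HTML are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SavePage {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.parse("<html><head><title>Saved page</title></head>"
                + "<body><p>Hello</p></body></html>");
        doc.selectFirst("p").addClass("someClass");

        // outerHtml() serializes the whole (modified) document
        Path out = Files.createTempFile("page", ".html");
        Files.writeString(out, doc.outerHtml());

        System.out.println(Files.readString(out).contains("someClass"));  // true
    }
}
```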

Thank you for reading this article carefully. I hope "How to use the Web scraping framework JSoup" has been helpful to you. More related knowledge is waiting for you to learn!
