
How to use Java jsoup

Updated 2025-01-16. Source: SLTechnology News&Howtos (shulou.com)


This article explains how to use jsoup in Java. Many readers are probably not yet familiar with the library, so the walkthrough below is offered for reference.

jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations.

Prerequisites

Add the jsoup dependency

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>

Parsing HTML with jsoup

The following code uses jsoup to parse HTML. The HTML here is only a small test string; in practice you can fetch a page with HttpClient and hand the response body to jsoup for parsing.

Here we use the two-argument form of parse. The second argument is the base URI of the HTML: when the HTML contains relative paths, jsoup uses the base URI to convert them to absolute paths.
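What the base URI makes possible can be illustrated with the JDK alone. The sketch below resolves a relative path against a base URL with java.net.URL; it is not jsoup's own code, and the class and method names are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

class BaseUrlResolution {

    // Resolve a possibly relative path against a base URL.
    // An already absolute URL is returned unchanged.
    static String resolve(String baseUrl, String path) {
        try {
            URL base = new URL(baseUrl);
            return new URL(base, path).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + path + " against " + baseUrl, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.baidu.com", "test.png"));
        System.out.println(resolve("http://www.baidu.com/a/b.html", "../img/logo.png"));
        System.out.println(resolve("http://www.baidu.com", "https://example.com/x.png"));
    }
}
```

Without a base URL, a path such as test.png cannot be turned into something a client can actually request, which is why the second argument matters.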

jsoup also provides other overloads, for example one that sets a timeout when parsing from a URL, and one that takes a custom parser whose settings can, for example, preserve the case of tag and attribute names. In most cases the two-argument form is enough.
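As a rough sketch of the custom-parser overload, the following compares the default behaviour with a parser configured to preserve case. The class name, method name and sample markup are illustrative; Parser.htmlParser(), ParseSettings.preserveCase and the three-argument Jsoup.parse are part of the jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

class ParserCaseDemo {

    // Parse the HTML and return the tag name of the first element in the body,
    // optionally keeping the case used in the source markup.
    static String firstBodyTag(String html, boolean preserveCase) {
        Parser parser = Parser.htmlParser();
        if (preserveCase) {
            parser.settings(ParseSettings.preserveCase);
        }
        Document document = Jsoup.parse(html, "", parser);
        return document.body().child(0).tagName();
    }

    public static void main(String[] args) {
        System.out.println(firstBodyTag("<Div>x</Div>", false)); // default: names are lower-cased
        System.out.println(firstBodyTag("<Div>x</Div>", true));  // case kept as written
    }
}
```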

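    // Assumed sample markup: a "title" text and one image with a relative src.
    StringBuilder htmlSB = new StringBuilder();
    htmlSB.append("<html>")
          .append("<body>")
          .append("title")
          .append("<img src=\"test.png\">")
          .append("</body></html>");
    Document document = Jsoup.parse(htmlSB.toString(), "http://www.baidu.com");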

Parsing process: jsoup walks the HTML character by character, turns it into Element objects, and appends them to the Document object. The whole HTML text can be thought of as a Document; each tag pair inside it is an Element, and each Element has at most one parent Element and any number of child Elements.
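The parent/child relationship can be seen by printing the element tree. This is a minimal sketch; the class name, helper names and sample markup are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementTreeDemo {

    // Return the element tree as indented tag names, one per line.
    static String outline(String html) {
        Document document = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();
        appendOutline(document.child(0), 0, out); // child(0) is the <html> element
        return out.toString();
    }

    private static void appendOutline(Element element, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(element.tagName()).append('\n');
        for (Element child : element.children()) {
            appendOutline(child, depth + 1, out);
        }
    }

    // Return the tag name of the parent of the first element matching the query.
    static String parentTagOf(String html, String cssQuery) {
        Element match = Jsoup.parse(html).select(cssQuery).first();
        return (match == null || match.parent() == null) ? null : match.parent().tagName();
    }

    public static void main(String[] args) {
        String html = "<div><p>hello</p><img src=\"test.png\"></div>";
        System.out.print(outline(html));
        System.out.println(parentTagOf(html, "p"));
    }
}
```

Note that jsoup adds the html, head and body elements itself when the input is only a fragment.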

For example, suppose we need the links of all images in the HTML. The CSS selector syntax (cssQuery) is used here to find the img Elements that have a src attribute, and then print each image path. (Real code should resolve the path properly; the base URI and the src value are simply concatenated here.)

    for (Element element : document.select("img[src]")) {
        System.out.println(element.baseUri() + element.attr("src"));
    }
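One way to handle the path properly is to let jsoup resolve it with Element.absUrl, which uses the base URI passed to parse. The sketch below is an alternative to the concatenation above, not the original author's code; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ImageLinks {

    // Collect the absolute URL of every <img> that has a src attribute.
    static List<String> imageUrls(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element img : document.select("img[src]")) {
            urls.add(img.absUrl("src")); // resolved against baseUri
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<img src=\"test.png\"><img alt=\"none\"><img src=\"/a/b.png\">";
        for (String url : imageUrls(html, "http://www.baidu.com")) {
            System.out.println(url);
        }
    }
}
```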

jsoup also provides some JavaScript-like DOM methods that make it easy to pick out the elements we need.
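A minimal sketch of these DOM-style lookups on a small, made-up document follows. The class name, method name and sample markup are illustrative; the four lookup methods are the ones jsoup provides.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DomLookupDemo {

    // Run each lookup once and summarise what it found.
    static String summary(String html) {
        Document document = Jsoup.parse(html);
        int paragraphs = document.getElementsByTag("p").size();
        int notes = document.getElementsByClass("note").size();
        Element title = document.getElementById("title"); // null when no element has that id
        int withHref = document.getElementsByAttribute("href").size();
        return "p=" + paragraphs
                + " note=" + notes
                + " title=" + (title == null ? "none" : title.text())
                + " href=" + withHref;
    }

    public static void main(String[] args) {
        String html = "<h1 id=\"title\">Hello</h1><p class=\"note\">a</p><p>b</p><a href=\"/x\">x</a>";
        System.out.println(summary(html));
    }
}
```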

    document.getElementsByTag("");
    document.getElementsByClass("");
    document.getElementById("");
    document.getElementsByAttribute("");

Sending HTTP requests with jsoup

GET request

Either of the following two approaches performs a GET request. The two main methods return different types: execute() returns a Connection.Response and get() returns a Document. Here the HTML returned by the request is printed.

    String url = "http://www.baidu.com";
    // Approach 1: execute() returns the raw response
    Connection.Response response = Jsoup.connect(url).execute();
    System.out.println(response.body());
    // Approach 2: get() returns the parsed Document
    Document document = Jsoup.connect(url).get();
    System.out.println(document.html());

GET request with parameters. A local endpoint is written here for testing.
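On a GET request, parameters such as the name and age used in this section travel in the URL query string. The JDK-only sketch below shows that encoding step under the usual form-encoding rules; it is not jsoup's own code, and the class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryStringDemo {

    // Append key/value pairs to a URL as a form-encoded query string.
    // The arguments after the URL are read as key1, value1, key2, value2, ...
    static String withQuery(String url, String... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("Expected key/value pairs");
        }
        StringBuilder sb = new StringBuilder(url);
        char separator = url.contains("?") ? '&' : '?';
        for (int i = 0; i < keyValues.length; i += 2) {
            sb.append(separator)
              .append(encode(keyValues[i]))
              .append('=')
              .append(encode(keyValues[i + 1]));
            separator = '&';
        }
        return sb.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withQuery("http://localhost:8080/user/myself",
                "name", "Zhang San", "age", "20"));
    }
}
```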

    @RequestMapping("/user")
    @RestController
    public class UserController {
        @GetMapping("/myself")
        public String myself(String name, int age) {
            return "name:" + name + ", age:" + age;
        }
    }

    String url = "http://localhost:8080/user/myself";
    Connection.Response response = Jsoup
        .connect(url)
        .data("name", "Zhang San")
        .data("age", "20")
        .execute();
    System.out.println(response.body());

POST request

POST request with parameters. A local endpoint is written here as well.

    @RequestMapping("/user")
    @RestController
    public class UserController {
        @PostMapping("/login")
        public ResponseVo login(@RequestBody UserLoginReq userLoginReq) {
            return ResponseVo.getSuccess(userLoginReq);
        }
    }

Note the ignoreContentType(true) call, which tells jsoup to ignore the content type of the response. Without it, an org.jsoup.UnsupportedMimeTypeException is thrown, because the endpoint returns JSON rather than HTML or XML.

    String url = "http://localhost:8080/user/login";
    Connection connection = Jsoup.connect(url);
    connection
        .ignoreContentType(true)
        .header("Content-Type", "application/json")
        .requestBody("{\"username\":\"123123\",\"password\":\"123123\"}");
    System.out.println(connection.post().text());

Additional request attributes

In addition to the request headers, request body, and parameters, jsoup also supports setting a proxy, a request timeout, cookies, and so on.

The following simulates a request to a Jianshu page that requires login: https://www.jianshu.com/my/paid_notes. (1) Log in to Jianshu and open the page. (2) Press F12 and refresh the page; the request headers contain a remember_user_token cookie. With this value, the site treats the request as coming from a logged-in user.

    String url = "https://www.jianshu.com/my/paid_notes";
    Connection.Response response = Jsoup
        .connect(url)
        .cookie("remember_user_token", "I omitted it here!!")
        .execute();
    System.out.println(response.body());

If we comment out the cookie, Jianshu responds with a login page asking us to log in.
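The proxy, timeout and cookie settings mentioned in this section can be combined on one Connection. The sketch below only configures the request and reads the settings back, so nothing is sent over the network. The proxy address 127.0.0.1:8888, the header, the token value and the class and method names are placeholders chosen for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class RequestOptionsDemo {

    // Build a connection with extra request attributes; no request is sent here.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .timeout(5000)                                      // milliseconds
                .header("Accept-Language", "en")
                .cookie("remember_user_token", "placeholder-token") // hypothetical value
                .proxy("127.0.0.1", 8888);                          // hypothetical local proxy
    }

    // Read the configured values back from the pending request.
    static String describe(String url) {
        Connection.Request request = configure(url).request();
        return "timeout=" + request.timeout()
                + " cookie=" + request.cookie("remember_user_token")
                + " header=" + request.header("Accept-Language")
                + " proxy=" + (request.proxy() != null);
    }

    public static void main(String[] args) {
        System.out.println(describe("https://www.jianshu.com/my/paid_notes"));
    }
}
```

Calling execute() or get() on the configured Connection would then send the request with these settings.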

Two other methods provided by Jsoup

clean: this method does two things. First, it converts relative paths in the HTML body to absolute paths. Second, it filters out HTML tags and attributes; the tags and attributes to keep are configured with a Whitelist.

    // Assumed sample markup: a "title" text and one image with a relative src.
    StringBuilder htmlSB = new StringBuilder();
    htmlSB.append("<html>")
          .append("<body>")
          .append("title")
          .append("<img src=\"test.png\">")
          .append("</body></html>");
    Whitelist whitelist = new Whitelist()
        .addTags("head", "body", "img")
        .addAttributes("img", "src")
        .addProtocols("img", "src", "http", "https");
    String html = Jsoup.clean(htmlSB.toString(), "http://www.baidu.com/", whitelist);
    System.out.println(html);

The printed result is shown below. Note that the enclosing document tags are absent, because clean processes only the contents of the body.

    title<img src="http://www.baidu.com/test.png">
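As a further sketch of the filtering side of clean, the example below uses the predefined Whitelist.basic() to strip a script element and an event-handler attribute. It assumes the jsoup 1.11.3 dependency used in this article; newer jsoup releases renamed Whitelist to Safelist. The class name, method name and sample markup are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class CleanDemo {

    // Keep only the simple text formatting and links allowed by Whitelist.basic().
    static String sanitize(String html) {
        return Jsoup.clean(html, Whitelist.basic());
    }

    public static void main(String[] args) {
        String html = "<p>Hi <script>alert(1)</script>"
                + "<a href=\"http://example.com/\" onclick=\"steal()\">link</a></p>";
        // The script element and the onclick attribute are dropped; the link is kept.
        System.out.println(sanitize(html));
    }
}
```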

isValid: this method checks whether the HTML body conforms to the whitelist, that is, whether it contains only permitted tags and attributes.

    public static boolean isValid(String bodyHtml, Whitelist whitelist) {
        return new Cleaner(whitelist).isValidBodyHtml(bodyHtml);
    }
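A short usage sketch of isValid follows, again assuming jsoup 1.11.3 and the predefined Whitelist.basic(); the class and method names are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class WhitelistCheck {

    // True when the body HTML contains only tags and attributes allowed by Whitelist.basic().
    static boolean isAllowed(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Whitelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("<p>ok</p>"));
        System.out.println(isAllowed("<p>ok</p><script>alert(1)</script>"));
    }
}
```

isValid only reports whether the input passes; to obtain sanitized output, use clean.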
