In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Network Security >
Share
Shulou(Shulou.com)05/31 Report--
This article introduces the knowledge of "how to use Java crawler to compare data on the east". In the operation of actual cases, many people will encounter such a dilemma, so let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!
How to make a crawler by Java
When you think of crawlers, you must want to say, crawlers, isn't this something that people who learn from Python can do? What can we do with Java? What A Fan wants to tell you is, yes, Java language for so many years, lasted for so long, how could there be no such content, so A Fan began to learn the crawler road of Java.
Jsoup
Before introducing this class, Ah Fan must first talk about what we usually see. Now, for example, we developers all know that at least when we access the data of certain things or treasures on the computer side, the data they give us feedback is displayed through HTML, for example:
In the development, we must know what all this means. Here, we will not introduce in detail what this HTML is. What Ah Fan needs to introduce is Jsoup, and then tell you how to use the class Jsoup to crawl JD.com 's data.
As the official documentation tells us, how to parse a piece of HTML code:
String html = "First parse" + "
Parsed HTML into a doc.
"; Document doc = Jsoup.parse (html)
And what is this Document? We can take a look at the output and take a look at the source code explanation. After all, developers are not qualified programmers if they don't look at what this class is for.
Output:
First parse
Parsed HTML into a doc.
In fact, we can see that Document actually outputs a new document for us, and it is sorted out, which is equivalent to a professional preparation for the later analysis of HTML.
When we look at the comments of the source code, it is not difficult to see that Jsoup can not only parse the string we give, but also can be a URL, can also be a file.
It converts the HTML string we gave him into an object, which is the Document we saw above, and then we can use the elements in the Document object smoothly.
Above is the parsing string, so let's take a look at the following parsing URL:
Public static void main (String [] args) {try {Document doc = Jsoup.connect ("https://www.jd.com/?cu=true&utm_source=baidu-pinzhuan&utm_medium=cpc&utm_campaign=t_288551095_baidupinzhuan&utm_term=0f3d30c8dba7459bb52f2eb5eba8ac7d_0_f38cf584e9fb4328a3e0d2bb515e1458").get(); String title = doc.title (); System.out.println (title);} catch (IOException e) {e.printStackTrace ()) }}
If you do the following, you will certainly be able to see what the title is, and the result looks like this:
It is not quite the same as when we search in Baidu, because this is the home page after entering.
Element
When we look at the source code, we can clearly see that Document is a class that inherits Element, so it is necessary to call the methods in Element, such as:
GetElementById (String id); / / does it look familiar, like the ID selector getElementsByTag (String tagName) in Js; / / Select getAllElements () through tags; / / get all the elements of Element
With regard to the method, A Fan will no longer describe it one by one. If you are interested, you can take a look at the official documentation, or take a look at this source code. Send the package name to package org.jsoup.nodes.
Someone must start to get bored. Don't introduce him if you say A Fan. Then you have said too much nonsense. Introduce climbing JD.com quickly. OK, here we go.
We must analyze JD.com 's URL before crawling, for example, I search the hard drive:
Here's a pile of data, and what we need to parse is the most useful part of the HTML species, such as:
879.00
We wrote down the price here, and then we went to find the name we wanted.
Look, p-name is the name we need, so we can write code.
/ / this is JD.com 's search URL. Let's extract the keyword keyword. Pay attention to English and Chinese. Chinese needs to deal with String url = "https://search.jd.com/Search?keyword=" + keyword; url = url +" & enc=utf-8 "; Document document = Jsoup.parse (new URL (url), 40000); / / We first find this List, and then traverse layer by layer Element element = document.getElementById (" J_goodsList "). Elements elements = element.getElementsByTag ("li"); for (Element el: elements) {String img = el.getElementsByTag ("img"). Eq (0). Attr ("source-data-lazy-img"); String price = el.getElementsByClass ("p-price"). Eq (0). Text (); String title = el.getElementsByClass ("p-name"). Eq (0). Text () String shop = el.getElementsByClass ("p-shop"). Eq (0). Text (); System.out.println ("="); System.out.println ("title:" + title); System.out.println ("Picture url:" + img); System.out.println ("Store:" + shop); System.out.println ("Price:" + price);}
Take a look at the effect picture of the implementation:
This is the end of the content of "how to use the Java crawler to compare the data on the east". Thank you for reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.