This article shows how to implement a simple web crawler in Java. It is quite practical, so it is shared here as a reference.
Why do we crawl data?
In the era of big data, getting more data means mining, analyzing, and filtering it. For example, when a project needs a large amount of real-world data, we have to crawl it from websites. Data crawled from some sites cannot be used directly once it is saved to the database; it has to be cleaned and filtered first. And, as we know, some data is very expensive.
Analyze the Douban Movie website
Open the Douban movie page in Chrome, for example:
https://movie.douban.com/explore#!type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0
In the Network tab of Chrome's developer tools, you can watch the requests this page makes.
You can see the parameters type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0 in the address bar.
Here type is the media type (movie), tag is the tag to filter by (%E7%83%AD%E9%97%A8 is the URL-encoded tag 热门, "popular"), sort=recommend sorts by popularity, page_limit is the number of results per page (20), and page_start is the offset at which the query starts.
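As a side note, here is a quick sketch of my own (not from the original article) showing that %E7%83%AD%E9%97%A8 is simply the UTF-8 URL encoding of the tag 热门:

import java.net.URLEncoder;

public class EncodeTag {
    public static void main(String[] args) throws Exception {
        // URL-encode the Chinese tag "热门" (popular) exactly as it appears in the address bar
        String encoded = URLEncoder.encode("热门", "UTF-8");
        System.out.println(encoded); // prints %E7%83%AD%E9%97%A8
    }
}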
But this is not quite what we want; the general entry point for Douban movie data is:
https://movie.douban.com/tag/#/
Visiting this page, we finally get the Douban movie data we are after.
Looking at the request header information of the JSON request the page makes, we can confirm that the crawl entry point is:
https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0
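As a quick sanity check, here is a small sketch that fetches one page of this endpoint and prints the raw response; the user-agent value is my own assumption, since Douban may reject requests without one. Incrementing start by 20 pages through the result set.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class QuickCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // hypothetical UA string; any common browser user-agent should do
        conn.setRequestProperty("user-agent", "Mozilla/5.0");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // expect a JSON object containing a "data" array
            }
        }
    }
}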
Create a Maven project to start crawling
Let's create a Maven project.
Since we are only crawling data here, there is no need for Spring; the persistence framework is MyBatis and the database is MySQL. The Maven dependencies are:
<dependencies>
    <dependency>
        <groupId>org.json</groupId>
        <artifactId>json</artifactId>
        <version>20160810</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.47</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.47</version>
    </dependency>
    <dependency>
        <groupId>org.mybatis</groupId>
        <artifactId>mybatis</artifactId>
        <version>3.5.1</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
</dependencies>
Once created, the project follows the standard Maven layout, with the Java code under src/main/java and the configuration files under src/main/resources.
First, we create an entity class in the model package whose fields match those of a Douban movie, that is, the fields of the JSON objects the Douban endpoint returns.
The Movie entity class:
public class Movie {

    private String id;
    private String directors;
    private String title;
    private String cover;
    private String rate;
    private String casts;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getDirectors() { return directors; }
    public void setDirectors(String directors) { this.directors = directors; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getCover() { return cover; }
    public void setCover(String cover) { this.cover = cover; }
    public String getRate() { return rate; }
    public void setRate(String rate) { this.rate = rate; }
    public String getCasts() { return casts; }
    public void setCasts(String casts) { this.casts = casts; }
}
Note that directors and casts usually hold several names each; in the JSON response they are arrays, but I do not process them further here and simply store them as strings.
Create a mapper interface
import java.util.List;

public interface MovieMapper {

    void insert(Movie movie);

    List<Movie> findAll();
}
Create a database connection configuration file jdbc.properties under resources:
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost:3306/huadi
username=root
password=root
Create the MyBatis configuration file mybatis-config.xml.
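A minimal sketch of this configuration, assuming the jdbc.properties keys above, a model package for type aliases, and a MovieMapper.xml at the root of resources (the latter two are my assumptions):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <!-- load driver/url/username/password from jdbc.properties -->
    <properties resource="jdbc.properties"/>
    <!-- assumption: the Movie entity lives in a package named model -->
    <typeAliases>
        <package name="model"/>
    </typeAliases>
    <environments default="development">
        <environment id="development">
            <transactionManager type="JDBC"/>
            <dataSource type="POOLED">
                <property name="driver" value="${driver}"/>
                <property name="url" value="${url}"/>
                <property name="username" value="${username}"/>
                <property name="password" value="${password}"/>
            </dataSource>
        </environment>
    </environments>
    <mappers>
        <!-- assumption: the mapper XML sits at the root of resources -->
        <mapper resource="MovieMapper.xml"/>
    </mappers>
</configuration>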
Create a mapper XML mapping file containing the two statements declared in the interface:
<insert id="insert" parameterType="Movie">
    INSERT INTO movie (id, title, cover, rate, casts, directors)
    VALUES (#{id}, #{title}, #{cover}, #{rate}, #{casts}, #{directors})
</insert>

<select id="findAll" resultType="Movie">
    SELECT * FROM movie
</select>
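To verify the wiring, a minimal JUnit sketch could look like the following; it assumes the configuration sketch above, with this mapper XML registered and a type alias for Movie:

import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;
import org.junit.Test;

import java.io.Reader;
import java.util.List;

public class MovieMapperTest {

    @Test
    public void testFindAll() throws Exception {
        // build the session factory from the MyBatis configuration on the classpath
        Reader reader = Resources.getResourceAsReader("mybatis-config.xml");
        SqlSessionFactory factory = new SqlSessionFactoryBuilder().build(reader);
        try (SqlSession session = factory.openSession()) {
            MovieMapper mapper = session.getMapper(MovieMapper.class);
            List<Movie> movies = mapper.findAll();
            System.out.println("movies in db: " + movies.size());
        }
    }
}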
Since no third-party crawler framework is used here, the crawling is done over HTTP with native Java (HttpURLConnection), so I wrote a utility class:
import org.json.JSONObject;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class GetJson {

    public JSONObject getHttpJson(String url, int comefrom) throws Exception {
        try {
            URL realUrl = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) realUrl.openConnection();
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("user-agent",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");
            // establish the actual connection
            connection.connect();
            // request successful
            if (connection.getResponseCode() == 200) {
                InputStream is = connection.getInputStream();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                // 10 MB read buffer
                byte[] buffer = new byte[10485760];
                int len;
                while ((len = is.read(buffer)) != -1) {
                    baos.write(buffer, 0, len);
                }
                String jsonString = baos.toString();
                baos.close();
                is.close();
                // convert the response to JSON:
                // comefrom = 1 means the body is plain JSON,
                // comefrom = 2 means the JSON is wrapped in parentheses
                return getJsonString(jsonString, comefrom);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return null;
    }

    public JSONObject getJsonString(String str, int comefrom) throws Exception {
        JSONObject jo = null;
        if (comefrom == 1) {
            return new JSONObject(str);
        } else if (comefrom == 2) {
            // character processing: find the opening parenthesis of the wrapper
            int indexStart = 0;
            for (int i = 0; i < str.length(); i++) {
                if (str.charAt(i) == '(') {
                    indexStart = i;
                    break;
                }
            }
            // keep only the JSON between the outermost parentheses
            String jsonStr = str.substring(indexStart + 1, str.lastIndexOf(')'));
            return new JSONObject(jsonStr);
        }
        return jo;
    }
}
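Putting the pieces together, a minimal sketch of a crawler entry point might look like this. It assumes the endpoint above returns plain JSON of the shape {"data": [...]} (so comefrom = 1), that each element carries id, title, cover, rate, casts, and directors fields, and that the MyBatis setup from the previous sections is in place:

import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;
import org.json.JSONArray;
import org.json.JSONObject;

import java.io.Reader;

public class DoubanCrawler {

    public static void main(String[] args) throws Exception {
        Reader reader = Resources.getResourceAsReader("mybatis-config.xml");
        SqlSessionFactory factory = new SqlSessionFactoryBuilder().build(reader);
        GetJson getJson = new GetJson();

        // open an auto-commit session so each insert is persisted immediately
        try (SqlSession session = factory.openSession(true)) {
            MovieMapper mapper = session.getMapper(MovieMapper.class);
            // crawl the first five pages; start advances in steps of 20
            for (int start = 0; start < 100; start += 20) {
                String url = "https://movie.douban.com/j/new_search_subjects"
                        + "?sort=U&range=0,10&tags=&start=" + start;
                JSONObject json = getJson.getHttpJson(url, 1);
                if (json == null) {
                    break; // request failed, stop crawling
                }
                JSONArray data = json.getJSONArray("data");
                for (int i = 0; i < data.length(); i++) {
                    JSONObject m = data.getJSONObject(i);
                    Movie movie = new Movie();
                    movie.setId(m.optString("id"));
                    movie.setTitle(m.optString("title"));
                    movie.setCover(m.optString("cover"));
                    movie.setRate(m.optString("rate"));
                    // directors and casts are JSON arrays; stored as strings here
                    movie.setDirectors(m.getJSONArray("directors").toString());
                    movie.setCasts(m.getJSONArray("casts").toString());
                    mapper.insert(movie);
                }
                Thread.sleep(1000); // be polite: pause between requests
            }
        }
    }
}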