This article shows how to implement a simple web crawler in Java. It is quite practical, so it is shared here as a reference.
Why do we crawl data?
In the era of big data, getting more data means mining, analyzing, and filtering it. For example, when a project needs a large amount of real-world data, we have to crawl it from websites. Data crawled from some sites cannot be used directly once it is saved to the database; it has to be cleaned and filtered first. And, as we know, some data is very expensive.
Analyze the Douban Movie website
Open the Douban movie page in Chrome, for example:
https://movie.douban.com/explore#!type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0
In the Network tab of Chrome's developer tools, you can watch the requests this page makes.
You can see the parameters type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0 in the address bar.
Here type is the media type (movie), tag is the tag to filter by (%E7%83%AD%E9%97%A8 is the URL-encoded tag 热门, "popular"), sort=recommend sorts by popularity, page_limit is the number of results per page (20), and page_start is the offset at which the query starts.
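As a side note, here is a quick sketch of my own (not from the original article) showing that %E7%83%AD%E9%97%A8 is simply the UTF-8 URL encoding of the tag 热门:

import java.net.URLEncoder;

public class EncodeTag {
    public static void main(String[] args) throws Exception {
        // URL-encode the Chinese tag "热门" (popular) exactly as it appears in the address bar
        String encoded = URLEncoder.encode("热门", "UTF-8");
        System.out.println(encoded); // prints %E7%83%AD%E9%97%A8
    }
}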
But this is not quite what we want; the general entry point for Douban movie data is:
https://movie.douban.com/tag/#/
Visiting this page, we finally get the Douban movie data we are after.
Looking at the request header information of the JSON request the page makes, we can confirm that the crawl entry point is:
https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0
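As a quick sanity check, here is a small sketch that fetches one page of this endpoint and prints the raw response; the user-agent value is my own assumption, since Douban may reject requests without one. Incrementing start by 20 pages through the result set.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class QuickCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // hypothetical UA string; any common browser user-agent should do
        conn.setRequestProperty("user-agent", "Mozilla/5.0");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // expect a JSON object containing a "data" array
            }
        }
    }
}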
Create a Maven project to start crawling
Let's create a Maven project.
Since we are only crawling data here, there is no need for Spring; the persistence framework is MyBatis and the database is MySQL. The Maven dependencies are:
<dependencies>
    <dependency>
        <groupId>org.json</groupId>
        <artifactId>json</artifactId>
        <version>20160810</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.47</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.47</version>
    </dependency>
    <dependency>
        <groupId>org.mybatis</groupId>
        <artifactId>mybatis</artifactId>
        <version>3.5.1</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
</dependencies>
Once created, the project follows the standard Maven layout, with the Java code under src/main/java and the configuration files under src/main/resources.
First, we create an entity class in the model package whose fields match those of a Douban movie, that is, the fields of the JSON objects the Douban endpoint returns.
The Movie entity class:
public class Movie {

    private String id;
    private String directors;
    private String title;
    private String cover;
    private String rate;
    private String casts;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getDirectors() { return directors; }
    public void setDirectors(String directors) { this.directors = directors; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getCover() { return cover; }
    public void setCover(String cover) { this.cover = cover; }
    public String getRate() { return rate; }
    public void setRate(String rate) { this.rate = rate; }
    public String getCasts() { return casts; }
    public void setCasts(String casts) { this.casts = casts; }
}
Note that directors and casts usually hold several names each; in the JSON response they are arrays, but I do not process them further here and simply store them as strings.
Create a mapper interface
import java.util.List;

public interface MovieMapper {

    void insert(Movie movie);

    List<Movie> findAll();
}
Create a database connection configuration file jdbc.properties under resources:
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost:3306/huadi
username=root
password=root
Create the MyBatis configuration file mybatis-config.xml.
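A minimal sketch of this configuration, assuming the jdbc.properties keys above, a model package for type aliases, and a MovieMapper.xml at the root of resources (the latter two are my assumptions):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <!-- load driver/url/username/password from jdbc.properties -->
    <properties resource="jdbc.properties"/>
    <!-- assumption: the Movie entity lives in a package named model -->
    <typeAliases>
        <package name="model"/>
    </typeAliases>
    <environments default="development">
        <environment id="development">
            <transactionManager type="JDBC"/>
            <dataSource type="POOLED">
                <property name="driver" value="${driver}"/>
                <property name="url" value="${url}"/>
                <property name="username" value="${username}"/>
                <property name="password" value="${password}"/>
            </dataSource>
        </environment>
    </environments>
    <mappers>
        <!-- assumption: the mapper XML sits at the root of resources -->
        <mapper resource="MovieMapper.xml"/>
    </mappers>
</configuration>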
Create a mapper XML mapping file containing the two statements declared in the interface:
<insert id="insert" parameterType="Movie">
    INSERT INTO movie (id, title, cover, rate, casts, directors)
    VALUES (#{id}, #{title}, #{cover}, #{rate}, #{casts}, #{directors})
</insert>

<select id="findAll" resultType="Movie">
    SELECT * FROM movie
</select>
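To verify the wiring, a minimal JUnit sketch could look like the following; it assumes the configuration sketch above, with this mapper XML registered and a type alias for Movie:

import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;
import org.junit.Test;

import java.io.Reader;
import java.util.List;

public class MovieMapperTest {

    @Test
    public void testFindAll() throws Exception {
        // build the session factory from the MyBatis configuration on the classpath
        Reader reader = Resources.getResourceAsReader("mybatis-config.xml");
        SqlSessionFactory factory = new SqlSessionFactoryBuilder().build(reader);
        try (SqlSession session = factory.openSession()) {
            MovieMapper mapper = session.getMapper(MovieMapper.class);
            List<Movie> movies = mapper.findAll();
            System.out.println("movies in db: " + movies.size());
        }
    }
}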
Since no third-party crawler framework is used here, the crawling is done over HTTP with native Java (HttpURLConnection), so I wrote a utility class:
import org.json.JSONObject;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class GetJson {

    public JSONObject getHttpJson(String url, int comefrom) throws Exception {
        try {
            URL realUrl = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) realUrl.openConnection();
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("user-agent",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");
            // establish the actual connection
            connection.connect();
            // request successful
            if (connection.getResponseCode() == 200) {
                InputStream is = connection.getInputStream();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                // 10 MB read buffer
                byte[] buffer = new byte[10485760];
                int len;
                while ((len = is.read(buffer)) != -1) {
                    baos.write(buffer, 0, len);
                }
                String jsonString = baos.toString();
                baos.close();
                is.close();
                // convert the response to JSON:
                // comefrom = 1 means the body is plain JSON,
                // comefrom = 2 means the JSON is wrapped in parentheses
                return getJsonString(jsonString, comefrom);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return null;
    }

    public JSONObject getJsonString(String str, int comefrom) throws Exception {
        JSONObject jo = null;
        if (comefrom == 1) {
            return new JSONObject(str);
        } else if (comefrom == 2) {
            // character processing: find the opening parenthesis of the wrapper
            int indexStart = 0;
            for (int i = 0; i < str.length(); i++) {
                if (str.charAt(i) == '(') {
                    indexStart = i;
                    break;
                }
            }
            // keep only the JSON between the outermost parentheses
            String jsonStr = str.substring(indexStart + 1, str.lastIndexOf(')'));
            return new JSONObject(jsonStr);
        }
        return jo;
    }
}
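Putting the pieces together, a minimal sketch of a crawler entry point might look like this. It assumes the endpoint above returns plain JSON of the shape {"data": [...]} (so comefrom = 1), that each element carries id, title, cover, rate, casts, and directors fields, and that the MyBatis setup from the previous sections is in place:

import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;
import org.json.JSONArray;
import org.json.JSONObject;

import java.io.Reader;

public class DoubanCrawler {

    public static void main(String[] args) throws Exception {
        Reader reader = Resources.getResourceAsReader("mybatis-config.xml");
        SqlSessionFactory factory = new SqlSessionFactoryBuilder().build(reader);
        GetJson getJson = new GetJson();

        // open an auto-commit session so each insert is persisted immediately
        try (SqlSession session = factory.openSession(true)) {
            MovieMapper mapper = session.getMapper(MovieMapper.class);
            // crawl the first five pages; start advances in steps of 20
            for (int start = 0; start < 100; start += 20) {
                String url = "https://movie.douban.com/j/new_search_subjects"
                        + "?sort=U&range=0,10&tags=&start=" + start;
                JSONObject json = getJson.getHttpJson(url, 1);
                if (json == null) {
                    break; // request failed, stop crawling
                }
                JSONArray data = json.getJSONArray("data");
                for (int i = 0; i < data.length(); i++) {
                    JSONObject m = data.getJSONObject(i);
                    Movie movie = new Movie();
                    movie.setId(m.optString("id"));
                    movie.setTitle(m.optString("title"));
                    movie.setCover(m.optString("cover"));
                    movie.setRate(m.optString("rate"));
                    // directors and casts are JSON arrays; stored as strings here
                    movie.setDirectors(m.getJSONArray("directors").toString());
                    movie.setCasts(m.getJSONArray("casts").toString());
                    mapper.insert(movie);
                }
                Thread.sleep(1000); // be polite: pause between requests
            }
        }
    }
}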