2025-04-06 Update From: SLTechnology News&Howtos
Shulou(Shulou.com) 06/01 Report--
Many inexperienced readers do not know how to crawl Jinri Toutiao's funny animated pictures with an integrated Java spring+mybatis project, so this article walks through the problem and its solution. I hope it helps you solve it.
Before diving in, let's look at the result of integrating java spring+mybatis to crawl Jinri Toutiao's funny animated pictures.
Captured dynamic graph:
Database:
1. About this crawler
Jinri Toutiao is itself a crawler: it scrapes pictures and text from the major websites, aggregates them, and pushes them to users. The animated pictures in it are especially entertaining. Most of the crawlers I found online were written in Python; since I studied javaweb and am not very comfortable with regular expressions, I wondered whether I could write one with tools I already knew.

This crawler integrates the spring+mybatis frameworks, stores the crawled data in a mysql database, and uses jsoup to work with HTML tag nodes (neatly avoiding regular expressions). It extracts the links to the animated pictures in each article, determines the picture format from the value of the "Content-Type" response header, and then saves the picture locally. You could also crawl the text inside, such as the funny jokes, with small modifications on this basis. This crawler only provides an entry-level idea; more interesting crawler projects are waiting for you to develop.
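The format-detection idea described above can be sketched as a small helper. This is a minimal illustration, not code from the project; the class and method names (`ImageTypeUtil`, `contentTypeToExtension`) are mine, but the mapping mirrors the one the crawler's save routine uses later:

```java
public class ImageTypeUtil {
    // Map the "Content-Type" response header to a file extension, as the
    // crawler does before saving an image; returns null for unknown types.
    public static String contentTypeToExtension(String contentType) {
        if (contentType == null) {
            return null;
        }
        switch (contentType) {
            case "image/gif":  return ".gif";
            case "image/png":  return ".png";
            case "image/jpeg": return ".jpg";
            default:           return null;
        }
    }
}
```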
2. Technology selection
Core language: java
Core framework: spring
Persistence layer framework: mybatis
Database connection pool: Alibaba Druid
Log management: Log4j
Jar package management: maven
3. Find the pattern and note the key points
Open the Toutiao front page, find and click the funny module, press F12, then scroll down to load the next page. You will find that the data is fetched through an ajax request to an api, as shown below:
This is the json data of the response; the parameters and values in it are mostly self-explanatory.
After some research on Baidu and Google, I found that the first three parameters of the ajax request never change. Changing the category parameter selects a different module; here it is funny, the funny module. The max_behot_time and max_behot_time_tmp parameters are timestamps: the first request uses 0, and each later request uses the value of next from the previous response's json data. The as and cp values are generated by a piece of js and are essentially an encrypted timestamp; the js code is posted later in the article.
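As a sketch of the parameter rules just described, here is a Java re-implementation of the as/cp derivation performed by the site's js (interleaving the uppercase MD5 hex digest of the epoch-second timestamp with its uppercase hex form), plus the resulting request URL. The class name `AsCpDemo` and the stand-alone structure are mine; this assumes a standard MD5 digest, not the project's actual ScriptEngine approach:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class AsCpDemo {
    // Derive as and cp from an epoch-second timestamp, mirroring getParam()
    // in toutiao.js: interleave the first/last 5 chars of the uppercase MD5
    // digest of the timestamp with chars of its uppercase hex representation.
    public static String[] getAsCp(long epochSeconds) {
        String e = Long.toHexString(epochSeconds).toUpperCase();
        String i;
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(
                    String.valueOf(epochSeconds).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02X", b));
            }
            i = hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("MD5 not available", ex);
        }
        if (e.length() != 8) {
            // Fallback constants used by the original js when the hex
            // timestamp is not 8 characters long
            return new String[] { "479BB4B7254C150", "7E0AC8874BB0985" };
        }
        String n = i.substring(0, 5);
        String o = i.substring(i.length() - 5);
        StringBuilder a = new StringBuilder();
        for (int s = 0; s < 5; s++) {
            a.append(n.charAt(s)).append(e.charAt(s));
        }
        StringBuilder r = new StringBuilder();
        for (int c = 0; c < 5; c++) {
            r.append(e.charAt(c + 3)).append(o.charAt(c));
        }
        String as = "A1" + a + e.substring(e.length() - 3);
        String cp = e.substring(0, 3) + r + "E1";
        return new String[] { as, cp };
    }

    public static void main(String[] args) {
        long t = System.currentTimeMillis() / 1000L;
        String[] p = getAsCp(t);
        // Assemble the feed URL with the fixed prefix and the derived params
        String url = "http://www.toutiao.com/api/pc/feed/?utm_source=toutiao&widen=1"
                + "&max_behot_time=0&max_behot_time_tmp=0"
                + "&as=" + p[0] + "&cp=" + p[1] + "&category=funny";
        System.out.println(url);
    }
}
```

For any 8-hex-digit timestamp the derived `as` is 15 characters starting with "A1" and `cp` is 15 characters ending with "E1", matching the fallback constants' shape.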
4. Build the framework and write the code
After the project is built, the file structure looks like the figure below; if you don't know how to set it up, Google it.
Without further ado, here is the core code:
public class TouTiaoCrawler {
    // API address of the funny section
    public static final String FUNNY = "http://www.toutiao.com/api/pc/feed/?utm_source=toutiao&widen=1";
    // Homepage address
    public static final String TOUTIAO = "http://www.toutiao.com";
    // Create the Spring context from the configuration file "spring-mybatis.xml"
    static ApplicationContext ac = new ClassPathXmlApplicationContext("spring-mybatis.xml");
    // Fetch the funnyMapper object from the Spring container by its bean id
    static FunnyMapper funnyMapper = (FunnyMapper) ac.getBean("funnyMapper");
    // Number of API visits
    private static int refreshCount = 0;
    // Timestamp
    private static long time = 0;

    public static void main(String[] args) {
        System.out.println("---------- start working! ----------");
        while (true) {
            crawler(time);
        }
    }

    public static void crawler(long hottime) {
        // Pass in the timestamp; the API returns the content for that timestamp
        refreshCount++;
        System.out.println("---------- refresh #" + refreshCount
                + ", request time = " + hottime + " ----------");
        String url = FUNNY + "&max_behot_time=" + hottime + "&max_behot_time_tmp=" + hottime;
        // Get the as and cp values produced by the js code
        JSONObject param = getUrlParam();
        /*
         * Modules served by the API:
         * __all__: recommended   news_hot: hot   funny: funny
         */
        String module = "funny";
        url += "&as=" + param.get("as") + "&cp=" + param.get("cp") + "&category=" + module;
        JSONObject json = null;
        try {
            json = getReturnJson(url); // fetch the json string
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (json != null) {
            time = json.getJSONObject("next").getLongValue("max_behot_time");
            JSONArray data = json.getJSONArray("data");
            for (int i = 0; i < data.size(); i++) {
                try {
                    JSONObject obj = (JSONObject) data.get(i);
                    // Skip articles that have already been crawled
                    if (funnyMapper.selectByGroupId((String) obj.get("group_id")) != null) {
                        System.out.println("---------- this article has already been crawled! ----------");
                        continue;
                    }
                    // Fetch the article page and get its Document object
                    String url1 = TOUTIAO + "/a" + obj.getString("group_id");
                    Document document = getArticleInfo(url1);
                    System.out.println("---------- visited article: " + url1 + " ----------");
                    // Store the document as well
                    obj.put("document", document.toString());
                    // Convert the json object into a java entity object
                    Funny funny = JSON.parseObject(obj.toString(), Funny.class);
                    // Save to the database
                    funny.setBehotTime(new Date());
                    funnyMapper.insertSelective(funny);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        } else {
            System.out.println("---------- the returned json list is empty ----------");
        }
    }

    // Call the API and return the response parsed as json
    public static JSONObject getReturnJson(String url) {
        try {
            URL httpUrl = new URL(url);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(httpUrl.openStream(), "UTF-8"));
            String line;
            String content = "";
            while ((line = in.readLine()) != null) {
                content += line;
            }
            in.close();
            return JSONObject.parseObject(content);
        } catch (Exception e) {
            System.err.println("request failed: " + url);
            e.printStackTrace();
        }
        return null;
    }

    // Fetch the article page and return its Document object
    public static Document getArticleInfo(String url) {
        try {
            Connection connect = Jsoup.connect(url);
            Document document = connect.get();
            Elements article = document.getElementsByClass("article-content");
            if (article.size() > 0) {
                Elements a = article.get(0).getElementsByTag("img");
                if (a.size() > 0) {
                    for (Element e : a) {
                        String url2 = e.attr("src");
                        // Download the image in the img tag to the local disk
                        saveToFile(url2);
                    }
                }
            }
            return document;
        } catch (IOException e) {
            System.err.println("failed to access the article page: " + url
                    + ", reason: " + e.getMessage());
            return null;
        }
    }

    // Execute the js to get the as and cp parameter values
    public static JSONObject getUrlParam() {
        JSONObject jsonObject = null;
        FileReader reader = null;
        try {
            ScriptEngineManager manager = new ScriptEngineManager();
            ScriptEngine engine = manager.getEngineByName("javascript");
            String jsFileName = "toutiao.js";
            // Read the js file
            reader = new FileReader(jsFileName);
            // Execute the script
            engine.eval(reader);
            if (engine instanceof Invocable) {
                Invocable invoke = (Invocable) engine;
                Object obj = invoke.invokeFunction("getParam");
                jsonObject = JSONObject.parseObject(obj != null ? obj.toString() : null);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (reader != null) {
                    reader.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return jsonObject;
    }

    // Download the picture at the given url and save it locally
    public static void saveToFile(String destUrl) {
        FileOutputStream fos = null;
        BufferedInputStream bis = null;
        HttpURLConnection httpUrl = null;
        URL url = null;
        String uuid = UUID.randomUUID().toString();
        String fileAddress = "d:\\img/" + uuid; // local file path
        int BUFFER_SIZE = 1024;
        byte[] buf = new byte[BUFFER_SIZE];
        int size = 0;
        try {
            url = new URL(destUrl);
            httpUrl = (HttpURLConnection) url.openConnection();
            httpUrl.connect();
            // Determine the picture format from the Content-Type response header
            String type = httpUrl.getHeaderField("Content-Type");
            if (type.equals("image/gif")) {
                fileAddress += ".gif";
            } else if (type.equals("image/png")) {
                fileAddress += ".png";
            } else if (type.equals("image/jpeg")) {
                fileAddress += ".jpg";
            } else {
                System.err.println("unknown picture format");
                return;
            }
            bis = new BufferedInputStream(httpUrl.getInputStream());
            fos = new FileOutputStream(fileAddress);
            while ((size = bis.read(buf)) != -1) {
                fos.write(buf, 0, size);
            }
            fos.flush();
            System.out.println("picture saved successfully! path: " + fileAddress);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ClassCastException e) {
            e.printStackTrace();
        } finally {
            try {
                fos.close();
                bis.close();
                httpUrl.disconnect();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (NullPointerException e) {
                e.printStackTrace();
            }
        }
    }
}
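The stream cleanup in saveToFile above needs a finally block and a NullPointerException catch. As a side note, the same copy loop can be written with try-with-resources so the streams are closed automatically even on exceptions. This is an alternative sketch, not the article's code; the class and method names (`StreamCopy`, `copyStream`) are mine:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamCopy {
    // A try-with-resources variant of the stream-copy loop in saveToFile:
    // both streams are closed automatically, even when an exception is thrown.
    public static void copyStream(InputStream in, File dest) throws IOException {
        try (BufferedInputStream bis = new BufferedInputStream(in);
             FileOutputStream fos = new FileOutputStream(dest)) {
            byte[] buf = new byte[1024];
            int size;
            while ((size = bis.read(buf)) != -1) {
                fos.write(buf, 0, size);
            }
            fos.flush();
        }
    }
}
```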
The js code (toutiao.js) that generates the as and cp parameters:
function getParam() {
    var asas;
    var cpcp;
    var t = Math.floor((new Date).getTime() / 1e3),
        e = t.toString(16).toUpperCase(),
        i = md5(t).toString().toUpperCase();
    if (8 != e.length) {
        asas = "479BB4B7254C150";
        cpcp = "7E0AC8874BB0985";
    } else {
        for (var n = i.slice(0, 5), o = i.slice(-5), a = "", s = 0; 5 > s; s++) {
            a += n[s] + e[s];
        }
        for (var r = "", c = 0; 5 > c; c++) {
            r += e[c + 3] + o[c];
        }
        asas = "A1" + a + e.slice(-3);
        cpcp = e.slice(0, 3) + r + "E1";
    }
    return '{"as":"' + asas + '","cp":"' + cpcp + '"}';
}

// The remainder of toutiao.js is a standard minified JavaScript MD5 library
// that defines the md5() function used above; any stock MD5 implementation
// can be substituted for it.

5. Wrapping up
I also found a minimalist version of Toutiao, which turns out to be much easier to crawl.
Its pages are addressed as p plus the page number. You can read the article links on each page directly and crawl them; there is no need to obtain article addresses from a json string or pass any restriction parameters. Only slight changes to this project are needed.
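The "p plus page number" scheme just described can be sketched as a tiny paging helper. The article does not give the minimalist site's actual address, so `BASE` below is a placeholder assumption, and the class name `MiniToutiaoPager` is mine:

```java
public class MiniToutiaoPager {
    // Placeholder base address: the article only says pages are addressed as
    // "p" + page number, so this host is an assumption for illustration.
    static final String BASE = "http://example.com/";

    // Build the address of the n-th list page ("p1", "p2", ...), which a
    // crawler could fetch with Jsoup.connect(pageUrl(n)).get() and then read
    // the article links from directly.
    public static String pageUrl(int page) {
        return BASE + "p" + page;
    }
}
```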
Having read the above, have you mastered how to crawl Jinri Toutiao's funny animated pictures with an integrated java spring+mybatis project? If you want to learn more skills or dig deeper, you are welcome to follow the industry information channel. Thank you for reading!