What is the method of JAVA crawler block chain newsletter

2025-04-11 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article mainly introduces the method of building a JAVA crawler for blockchain newsletters. Many people have doubts about this in day-to-day work, so the editor has consulted various materials and sorted out a simple, easy-to-use approach. I hope it helps resolve those doubts. Please follow along with the editor and study!

Requirement:

The newsletter content of several target sites needs to be crawled on a regular schedule, with duplicate content filtered out.

Technical Review:

At first we wanted to use Python for crawling the sites' content, but we could not settle on a practical approach in Python for the duplicate-content filtering part, so we did not go that route.

On the JAVA side we also looked at other open-source frameworks such as "xxl-crawler" and "WebCollector". These two frameworks do have some advantages for distributed crawling, but neither handles duplicate-text filtering very well.

A. Task scheduling:

Quartz is used for task scheduling; it is mature and flexible.
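In the real module, Quartz's Job and trigger machinery drive the crawls. As a dependency-free sketch of the same periodic-firing idea using only the JDK (the `runTimes` helper and its parameters are illustrative, not part of the original module):

```java
import java.util.concurrent.*;

public class CrawlScheduler {
    // Fire `task` every periodMs until it has run `times` times; returns true
    // if that happens within a 5-second timeout. In the article's module,
    // Quartz's scheduler and triggers play this role for each site's Job.
    public static boolean runTimes(Runnable task, int times, long periodMs)
            throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        CountDownLatch latch = new CountDownLatch(times);
        scheduler.scheduleAtFixedRate(
                () -> { task.run(); latch.countDown(); },
                0, periodMs, TimeUnit.MILLISECONDS);
        boolean done = latch.await(5, TimeUnit.SECONDS);
        scheduler.shutdownNow();
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        // each target site's crawl logic would go in the Runnable
        System.out.println(runTimes(() -> {}, 3, 10));
    }
}
```

Quartz adds cron expressions, persistence, and clustering on top of this basic fixed-rate idea, which is why the article prefers it for production scheduling.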

B. Text repetition filtering:

When there is a lot of data, the volume of duplicate matching becomes large. If you tokenize the content, index the terms, and then loop over all text items to match them, efficiency is far too low. Is there a way to quickly build a text feature, so that a relational database can conveniently do a global duplicate filter?

The answer is yes: build a fingerprint for each newsletter text, so that fully duplicated texts, or texts with large overlapping portions, can be detected.

Briefly, the principle: the Simhash algorithm computes a 32-bit binary string for a given text from its keywords. I split the generated binary string into four 8-bit segments, which makes it convenient to quickly compute the degree of match; when the data volume is large, an index can also be built on the fingerprint segments to improve efficiency.
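The article does not show its Simhash implementation, so the following is a minimal sketch under stated assumptions: tokens are split on non-word characters, `String.hashCode` stands in for the per-feature hash, and all tokens are weighted equally. A production version would use proper word segmentation and weighted features.

```java
public class Simhash {
    // 32-bit simhash: sum +1/-1 per bit position over all token hashes,
    // then take the sign of each position as the fingerprint bit.
    static int fingerprint(String text) {
        int[] v = new int[32];
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            int h = token.hashCode();             // assumed per-feature hash
            for (int i = 0; i < 32; i++) {
                if (((h >> i) & 1) == 1) v[i]++; else v[i]--;
            }
        }
        int fp = 0;
        for (int i = 0; i < 32; i++) if (v[i] > 0) fp |= (1 << i);
        return fp;
    }

    // Split the 32-bit fingerprint into four 8-bit segments, low byte first,
    // matching the article's four-segment layout.
    static int[] segments(int fp) {
        return new int[] {
            fp & 0xFF, (fp >>> 8) & 0xFF, (fp >>> 16) & 0xFF, (fp >>> 24) & 0xFF
        };
    }

    // Hamming distance: number of differing bits between two fingerprints.
    static int hamming(int a, int b) { return Integer.bitCount(a ^ b); }

    public static void main(String[] args) {
        int fp = fingerprint("bitcoin rises ten percent in one day");
        System.out.println(Integer.toBinaryString(fp));
    }
}
```

Two near-duplicate newsletters will differ in only a few fingerprint bits, so a small Hamming distance (typically ≤ 3) marks them as duplicates.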

[Figure: the fingerprints generated from the captured content]
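Given a 32-bit fingerprint split into four 8-bit segments, the global duplicate filter described above can be sketched with an in-memory index standing in for four indexed database columns (the class and method names here are illustrative, not from the original module):

```java
import java.util.*;

public class FingerprintIndex {
    // index: (segment position, segment value) -> fingerprints containing it.
    // Mirrors four indexed segment columns in a relational table: a duplicate
    // within 3 bits must share at least one of the four 8-bit segments exactly.
    private final Map<Long, Set<Integer>> index = new HashMap<>();

    private static long key(int pos, int seg) { return ((long) pos << 32) | seg; }

    private static int segment(int fp, int pos) { return (fp >>> (8 * pos)) & 0xFF; }

    public void add(int fp) {
        for (int pos = 0; pos < 4; pos++) {
            index.computeIfAbsent(key(pos, segment(fp, pos)), k -> new HashSet<>())
                 .add(fp);
        }
    }

    // Look up only candidates sharing a segment, then confirm with the
    // exact Hamming distance -- avoiding a scan over all stored texts.
    public boolean isDuplicate(int fp, int maxDist) {
        for (int pos = 0; pos < 4; pos++) {
            Set<Integer> candidates = index.get(key(pos, segment(fp, pos)));
            if (candidates == null) continue;
            for (int cand : candidates) {
                if (Integer.bitCount(cand ^ fp) <= maxDist) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FingerprintIndex idx = new FingerprintIndex();
        idx.add(0x12345678);
        System.out.println(idx.isDuplicate(0x12345679, 3)); // true: differs by one bit
    }
}
```

In the database version, each segment column gets its own index, and the candidate lookup becomes a `WHERE seg0 = ? OR seg1 = ? OR seg2 = ? OR seg3 = ?` query before the final bit comparison.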

Key implementations:

[Figure: the basic architecture of the whole module]

Crawler task threads: crawler tasks are scheduled by Quartz. Each target site is a separate Job, and the actual scheduling is handed over to Quartz. Each crawler Job extends a parent class that acts as a crawler template, so the amount of code in a single crawler is very small. The implementation class for one target site is attached:

```java
public class LianNewsJob extends NewsJob implements Job {

    protected static long MAX_ID = 0;
    Logger logger = LoggerFactory.getLogger(LianNewsJob.class);

    @Override
    public void execute(JobExecutionContext jobExecutionContext) {
        TAG = "WALIAN";
        try {
            saveNewsToQueue(getContent());
        } catch (Exception e) {
            logger.error(TAG + " collection exception", e);
        }
    }

    /**
     * Crawl the newsletter list.
     * @return newly published items
     */
    private List<LxbNewsSuorce> getContent() {
        LxbNewsSuorce newsItem;
        List<LxbNewsSuorce> newsList = new ArrayList<>();
        List<JSONObject> jsonList = new ArrayList<>();
        try {
            String responseJsonStr = getHttpContent(Constants.WALIAN_URL);
            JSONObject responseJO = JSON.parseObject(responseJsonStr);
            if (StringUtils.equals(responseJO.getString("code"), "000000")) {
                JSONArray contentList = responseJO.getJSONObject("data").getJSONArray("list");
                for (int i = 0; i < contentList.size(); i++) {
                    jsonList.add(contentList.getJSONObject(i));
                }
                sortContent(jsonList);
                for (JSONObject jObj : jsonList) {
                    if (jObj.getLong("id") > MAX_ID) {
                        newsItem = new LxbNewsSuorce();
                        newsItem.setTitle(jObj.getString("title"));
                        // the original applied a StringUtils.replace(...) here whose
                        // arguments were lost in extraction
                        newsItem.setContent(jObj.getString("content"));
                        newsList.add(newsItem);
                        MAX_ID = jObj.getLong("id");
                    }
                }
            }
        } catch (Exception e) {
            logger.error(TAG + " crawl exception", e);
        }
        return newsList;
    }

    boolean isTitleExist(String title) {
        Query<LxbNewsSuorce> userQuery = DBManager.getSqlManager().query(LxbNewsSuorce.class);
        List<LxbNewsSuorce> newsList = userQuery.andEq("title", title).select();
        return newsList.size() > 0;
    }
}
```

At this point, the study of "what is the method of JAVA crawler block chain newsletter" is over; hopefully it has resolved everyone's doubts. Pairing theory with practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow the site; the editor will keep working hard to bring you more practical articles!

© 2024 shulou.com SLNews company. All rights reserved.
