Shulou (Shulou.com) report, 06/01. Updated 2025-01-16, from SLTechnology News&Howtos.
This article explains how to build a website aggregation tool in Java. It walks through the process with a working example; the approach is simple, quick to implement, and practical.
Principle
You can think of the websites on the Internet as one huge graph in which sites are nodes and outbound links are edges. Different websites fall into different connected blocks (connected components), and traversing one block with a breadth-first search finds every website domain name in it. The traversal needs only two structures: a queue of sites still to visit and a set of domains already seen.
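The traversal can be sketched on a small in-memory graph. This is only an illustration under stated assumptions: the class name ConnectedBlockBfs, the method reachableDomains and the example.* domains are hypothetical, and the link graph is passed in as a map instead of being downloaded.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class ConnectedBlockBfs {
    /** Returns every domain reachable from root, in the order the search first sees it. */
    static Set<String> reachableDomains(String root, Map<String, List<String>> outLinks) {
        Set<String> seen = new LinkedHashSet<>();   // domains already discovered
        Queue<String> queue = new ArrayDeque<>();   // domains still to visit
        seen.add(root);
        queue.add(root);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String next : outLinks.getOrDefault(current, List.of())) {
                if (seen.add(next)) {               // add() returns false for a known domain
                    queue.add(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical link graph: a, b, c and d form one block, x and y another.
        Map<String, List<String>> outLinks = Map.of(
                "a.example", List.of("b.example", "c.example"),
                "b.example", List.of("a.example", "d.example"),
                "x.example", List.of("y.example"));
        System.out.println(reachableDomains("a.example", outLinks));
    }
}
```

Starting from a.example, the search never reaches x.example or y.example, because they sit in a different connected block.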
Next, we segment the content returned by each site into words, discard meaningless words and punctuation marks, and rank the remaining words on the site's home page by frequency. Words whose frequency falls within the range (10, 50) are taken as keywords. These keywords serve as the themes of the site, and the site's information is written into a markdown file named after each keyword.
In the same way, we segment the title of the returned page, because the title is the website developer's own condensed description of what the site does. These words are equally important as themes, so the site's information is also written into the markdown files named after them.
Finally, we only need to filter these files manually, or load the data into Elasticsearch to build a keyword search engine, so that a site can be found whenever it is needed.
However, until the traversal of a connected block has converged, very little data is collected, and some categories tend to contain only one or two websites.
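Filing a site under each of its keywords can be sketched as follows. The article does not show how the markdown files are written, so everything here is an assumption: the class KeywordIndex, its methods, the ".md" file naming and the bullet format are hypothetical, and the content is built in memory instead of being written to disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KeywordIndex {
    // keyword -> markdown bullet lines for the sites filed under that keyword
    private final Map<String, List<String>> entries = new TreeMap<>();

    /** Files one site under each of its keywords. */
    void add(List<String> keywords, String domain, String title) {
        for (String keyword : keywords) {
            entries.computeIfAbsent(keyword, k -> new ArrayList<>())
                   .add("- [" + title + "](http://" + domain + "/)");
        }
    }

    /** Name of the markdown file that would hold this keyword's sites (assumed convention). */
    static String fileNameFor(String keyword) {
        return keyword + ".md";
    }

    /** Markdown body for one keyword: a heading followed by one bullet per site. */
    String toMarkdown(String keyword) {
        StringBuilder sb = new StringBuilder("# " + keyword + "\n");
        for (String line : entries.getOrDefault(keyword, List.of())) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        KeywordIndex index = new KeywordIndex();
        index.add(List.of("news", "tech"), "a.example", "Site A");
        index.add(List.of("tech"), "b.example", "Site B");
        System.out.println(fileNameFor("tech"));
        System.out.print(index.toMarkdown("tech"));
    }
}
```

A keyword shared by several sites collects them all in one file, which is what makes the files usable as categories.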
Implementation
Page download
I use httpClient to download pages. Early on I considered using Playwright, but the performance gap between the two is too large and Playwright is too slow for this job. I therefore gave up some accuracy: for sites built with Web 2.0 techniques, httpClient cannot get the data. Strictly speaking, what I implement here is only the page download function of a classified search engine for Web 1.0 websites.
    public SendReq.ResBody doRequest(String url, String method, Map params) {
        String urlTrue = url;
        // Delegate the download to the SendReq wrapper, using the default headers.
        SendReq.ResBody resBody = SendReq.sendReq(urlTrue, method, params, defaultHeaders());
        return resBody;
    }
Here SendReq is an httpClient wrapper class that I wrote. It only implements page download, and you can replace it with RestTemplate or any other way of issuing HTTP(S) requests.
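As one possible replacement, the sketch below uses the java.net.http.HttpClient that ships with Java 11 and later. It is not the author's SendReq class: PageDownloader, Result, buildRequest and fetch are hypothetical names, the timeouts and User-Agent value are arbitrary, and reporting a failed request as code 0 is an assumption based on the traversal code's check for a code of 0.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class PageDownloader {
    /** Status code plus body; code 0 marks a request that failed outright (assumed convention). */
    static final class Result {
        final int code;
        final String body;
        Result(int code, String body) { this.code = code; this.body = body; }
    }

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Builds a GET request for the given URL; throws IllegalArgumentException for a malformed URL. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "site-aggregator-sketch")
                .GET()
                .build();
    }

    /** Downloads one page and never throws: any failure is reported as code 0 with an empty body. */
    static Result fetch(String url) {
        try {
            HttpResponse<String> response =
                    CLIENT.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
            return new Result(response.statusCode(), response.body());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt flag
            return new Result(0, "");
        } catch (Exception e) {
            return new Result(0, "");
        }
    }

    public static void main(String[] args) {
        Result result = fetch(args.length > 0 ? args[0] : "http://example.com/");
        System.out.println(result.code + " " + result.body.length() + " chars");
    }
}
```

Returning a result object instead of throwing keeps the crawl loop simple, since a dead site is an expected outcome rather than an error.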
Parse all links in the return value
Because this is a traversal of a connected block, the sites connected to a given site are defined as the sites whose domain names appear in the outbound links on its home page. We therefore need to extract the links, and we do so directly with a regular expression.
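The DomainUtils.getDomainWithCompleteDomain helper that appears in the listings is not shown in the article. A minimal sketch of such a helper, assuming it simply returns the host part of a URL, could look like this; DomainSketch and hostOf are hypothetical names and the real helper may behave differently.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class DomainSketch {
    /** Returns the lower-cased host of url, or "" if the URL cannot be parsed or has no host. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://www.Example.com/a?b=1"));
        System.out.println(hostOf("https://example.org:8080/x"));
    }
}
```

Reducing every link to its host is what lets the traversal treat many pages of one site as a single node.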
    public static List<String> getUrls(String htmlText) {
        // Match absolute http(s) links anywhere in the page text.
        Pattern pattern = Pattern.compile("(http|https):\\/\\/[A-Za-z0-9_\\+.:?&@=\\/%#,;]*");
        Matcher matcher = pattern.matcher(htmlText);
        // A set removes duplicates, since many links point to the same domain.
        Set<String> ans = new HashSet<>();
        while (matcher.find()) {
            ans.add(DomainUtils.getDomainWithCompleteDomain(matcher.group()));
        }
        return new ArrayList<>(ans);
    }
Parse the title in the returned value
The title is the website developer's condensed description of what the site does, so it needs to be parsed out for further processing.
    public static String getTitle(String htmlText) {
        Pattern pattern = Pattern.compile("(?<=<title>).*(?=</title>)");
        // The source listing is damaged from here up to the regular expressions below.
        // The rest of getTitle and the getContent signature are assumed.
        Matcher matcher = pattern.matcher(htmlText);
        return matcher.find() ? matcher.group() : "";
    }

    public static String getContent(String htmlStr) {
        String regEx_script = "<script[^>]*?>[\\s\\S]*?<\\/script>"; // regular expression for script
        String regEx_style = "<style[^>]*?>[\\s\\S]*?<\\/style>";    // regular expression for style
        String regEx_html = "<[^>]+>";                               // regular expression for HTML tags
        Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
        Matcher m_script = p_script.matcher(htmlStr);
        htmlStr = m_script.replaceAll(""); // filter script tags
        Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
        Matcher m_style = p_style.matcher(htmlStr);
        htmlStr = m_style.replaceAll(""); // filter style tags
        Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
        Matcher m_html = p_html.matcher(htmlStr);
        htmlStr = m_html.replaceAll(""); // filter html tags
        return htmlStr.trim();
    }
Word segmentation
The word segmentation algorithm can use HanLP, which was mentioned in the previous article introducing NLP.
    // Punctuation and digits to ignore. The character class is damaged in the source
    // (its full-width punctuation was lost), so only the ASCII part is reproduced here.
    private static Pattern ignoreWords = Pattern.compile("[,.0-9_\\-:;\\]\\[\\/!()*?\"+%~]+");

    public static Set<Word> separateWordAndReturnUnit(String text) {
        Segment segment = HanLP.newSegment().enableOffset(true);
        Set<Word> detectorUnits = new HashSet<>();
        Map<Integer, Word> detectorUnitMap = new HashMap<>();
        List<Term> terms = segment.seg(text);
        for (Term term : terms) {
            Matcher matcher = ignoreWords.matcher(term.word);
            // Keep only words longer than one character that contain no ignored symbols.
            if (!matcher.find() && term.word.length() > 1 && !term.word.contains("term.word.hashCode")) {
                Integer hashCode = term.word.hashCode();
                Word detectorUnit = detectorUnitMap.get(hashCode);
                if (Objects.nonNull(detectorUnit)) {
                    // Word seen before: increase its count.
                    detectorUnit.setCount(detectorUnit.getCount() + 1);
                    // ... (the source listing breaks off here)
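The counting step in the listing above can be illustrated without HanLP. The sketch below is not the author's method: it assumes whitespace-separated tokens instead of real word segmentation, and WordCountSketch and countWords are hypothetical names. It applies the same two filters visible in the listing, skipping tokens of one character and tokens that contain digits or punctuation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class WordCountSketch {
    // Tokens containing digits or ASCII punctuation (including "_") are skipped.
    private static final Pattern IGNORE = Pattern.compile("[\\p{Punct}0-9]");

    /** Counts how often each accepted token occurs, keeping first-seen order. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.length() > 1 && !IGNORE.matcher(token).find()) {
                counts.merge(token, 1, Integer::sum);   // insert 1 or add 1 to the existing count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("java site java a site2 tool, java"));
    }
}
```

Keying the map by the word itself, rather than by its hash code as the listing does, avoids merging two different words that happen to share a hash code.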
Here, to remove interference from words that occur too often, only the top ten words with a frequency below 50 are selected.
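    public static List<String> print2List(List<Word> tmp, int cnt) {
        PriorityQueue<Word> words = new PriorityQueue<>();
        List<String> ans = new ArrayList<>();
        for (Word word : tmp) {
            words.add(word);
        }
        int count = 0;
        while (!words.isEmpty()) {
            Word word = words.poll();
            // The source is damaged between "getCount()" and "= cnt". This condition and
            // the two lines after it are reconstructed from the surrounding text, and
            // getWord() is an assumed getter.
            if (word.getCount() < 50) {
                ans.add(word.getWord());
                count++;
                if (count >= cnt) {
                    break;
                }
            }
        }
        return ans;
    }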
The words are taken out of a priority queue one at a time. The queue is implemented as a heap and, with the ordering defined on Word, behaves as a max-heap on the count, so the words come out in descending order of frequency. Readers who want to know more about max-heaps can read my previous article.
Note that a class placed in a priority queue must be comparable, so Word here implements Comparable. The simplified code is as follows:
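    public class Word implements Comparable<Word> {
        private String word;
        private Integer count = 0;
        ...
        // Higher counts sort first, so PriorityQueue.poll() returns the most frequent word.
        // Integer.compare also returns 0 for equal counts, as the Comparable contract expects.
        @Override
        public int compareTo(Word o) {
            return Integer.compare(o.count, this.count);
        }
    }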
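The selection step can be tried in isolation with the following sketch. It is an illustration under assumptions, not the article's code: it works on map entries with an explicit Comparator instead of the Word class, TopWordsSketch and topWords are hypothetical names, and the alphabetical tie-break is added only to make the output deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopWordsSketch {
    /** Returns up to limit words with a count below maxCount, most frequent first. */
    static List<String> topWords(Map<String, Integer> counts, int limit, int maxCount) {
        // Highest count first; equal counts are ordered alphabetically.
        Comparator<Map.Entry<String, Integer>> byCountDesc = (a, b) -> {
            int c = Integer.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCountDesc);
        heap.addAll(counts.entrySet());

        List<String> result = new ArrayList<>();
        while (!heap.isEmpty() && result.size() < limit) {
            Map.Entry<String, Integer> entry = heap.poll();
            if (entry.getValue() < maxCount) {   // skip words that occur too often
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                Map.of("the", 120, "java", 30, "site", 30, "tool", 12, "web", 5);
        System.out.println(topWords(counts, 3, 50));
    }
}
```

With these sample counts, "the" is skipped because its count is not below 50, and the three words returned are "java", "site" and "tool".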
All right, now the preparatory work has been done. Let's start implementing the logic part of the program.
Traverse the connected blocks of the website
The site's connected block is traversed breadth-first. A previous article was devoted to writing breadth-first traversal with a queue, and that method is used here.
    public void doTask() {
        String root = "http://" + this.domain + "/";
        Queue<String> urls = new LinkedList<>();          // sites still to visit
        urls.add(root);
        Set<String> tmpDomains = new HashSet<>();         // domains already queued
        tmpDomains.add(DomainUtils.getDomainWithCompleteDomain(root));
        while (!urls.isEmpty()) {
            String url = urls.poll();
            SendReq.ResBody html = doRequest(url, "GET", new HashMap<>());
            System.out.println("current request is " + url + " queue size is " + urls.size()
                    + " result is " + html.getCode());
            // Code 0: record the domain in the ignore set and skip it.
            if (html.getCode().equals(0)) {
                ignoreSet.add(DomainUtils.getDomainWithCompleteDomain(url));
                try {
                    GenerateFile.createFile2("moneyframework/generate/ignore", "demo.txt", ignoreSet.toString());
                } catch (IOException e) {
                    e.printStackTrace();
                }
                continue;
            }
            OnePage onePage = new OnePage();
            onePage.setUrl(url);
            onePage.setDomain(DomainUtils.getDomainWithCompleteDomain(url));
            onePage.setCode(html.getCode());
            String title = HtmlUtil.getTitle(html.getResponce()).trim();
            // Skip pages with an empty, overlong, or unwanted title.
            if (!StringUtils.hasText(title) || title.length() > 100 || title.contains("neighbors")) continue;
            onePage.setTitle(title);
            String content = HtmlUtil.getContent(html.getResponce());
            Set<Word> words = Nlp.separateWordAndReturnUnit(content);
            List<String> wordStr = Nlp.print2List(new ArrayList<>(words), 10);
            handleWord(wordStr, DomainUtils.getDomainWithCompleteDomain(url), title);
            onePage.setContent(wordStr.toString());
            if (html.getCode().equals(200)) {
                List<String> domains = HtmlUtil.getUrls(html.getResponce());
                for (String domain : domains) {
                    // Skip domains that end with an entry in the ignore set.
                    int flag = 0;
                    for (String i : ignoreSet) {
                        if (domain.endsWith(i)) {
                            flag = 1;
                            break;
                        }
                    }
                    if (flag == 1) continue;
                    if (StringUtils.hasText(domain.trim())) {
                        // Queue each newly discovered domain exactly once.
                        if (!tmpDomains.contains(domain)) {
                            tmpDomains.add(domain);
                            urls.add("http://" + domain + "/");
                        }
                    }
                }
            }
        }
    }
Invoke test
    @Service
    public class Task {
        @PostConstruct
        public void init() {
            // Run the crawl on its own thread so that application startup is not blocked.
            new Thread(new Runnable() {
                @Override
                public void run() {
                    while (true) {
                        try {
                            HttpClientCrawl clientCrawl = new HttpClientCrawl("http://www.mengwa.store/");
                            clientCrawl.doTask();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                }
            }).start();
        }
    }
That's all for "how to implement the site aggregation tool with Java". Thank you for reading.