Today I will talk to you about how Java implements HtmlExtractor, a template-based component for extracting structured information from web pages. It may not be familiar to many people, so to help you understand it better, the editor has summarized the following content for you. I hope you can get something out of this article.
HtmlExtractor is a template-based web page structured information extraction component implemented in Java. It does not include a crawler itself; instead, it is designed to be called by a crawler or other program to extract structured information from web pages with high precision.
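Because HtmlExtractor deliberately ships without a crawler, a typical integration is a fetch loop that hands each discovered URL to the extractor. The following is a minimal illustrative sketch, not part of the project: the `frontier` list stands in for a real crawler queue, and the `extract(url, charset)` call and result types are the ones demonstrated in the usage examples below.

```java
import java.util.List;

// Illustrative glue between a crawler and HtmlExtractor: the crawler supplies
// URLs, HtmlExtractor turns each page into structured fields. HtmlExtractor and
// ExtractResult come from the html-extractor jar (package imports omitted here).
public class CrawlerGlue {
    public static void extractAll(HtmlExtractor htmlExtractor, List<String> frontier) {
        for (String url : frontier) {
            // The charset is page-specific; "gb2312" matches the NetEase example in this article.
            List<ExtractResult> results = htmlExtractor.extract(url, "gb2312");
            results.forEach(r -> System.out.println("extracted: " + r.getUrl()));
        }
    }
}
```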
HtmlExtractor is designed for large-scale distributed environments and uses a master-slave architecture: the master node maintains the extraction rules, and the slave nodes request the rules from the master. When the rules change, the master actively notifies the slaves, so rule changes take effect in real time.

How to use it? HtmlExtractor consists of two sub-projects, html-extractor and html-extractor-web. html-extractor implements the data extraction logic and is the slave node; html-extractor-web provides the web interface for maintaining the extraction rules and is the master node. html-extractor is a jar package that can be referenced via Maven:

```xml
<dependency>
    <groupId>org.apdplat</groupId>
    <artifactId>html-extractor</artifactId>
    <version>1.0</version>
</dependency>
```

html-extractor-web is a war package that needs to be deployed in a Servlet/JSP container.

Centralized stand-alone usage:

```java
// 1. Construct the extraction rules
List<UrlPattern> urlPatterns = new ArrayList<>();
// 1.1 Construct a URL pattern
UrlPattern urlPattern = new UrlPattern();
urlPattern.setUrlPattern("http://money.163.com/\\d{2}/\\d{4}/\\d{2}/[0-9A-Z]{16}.html");
// 1.2 Construct an HTML template
HtmlTemplate htmlTemplate = new HtmlTemplate();
htmlTemplate.setTemplateName("NetEase Financial Channel");
htmlTemplate.setTableName("finance");
// 1.3 Associate the URL pattern with the HTML template
urlPattern.addHtmlTemplate(htmlTemplate);
// 1.4 Construct a CSS path
CssPath cssPath = new CssPath();
cssPath.setCssPath("h2");
cssPath.setFieldName("title");
cssPath.setFieldDescription("title");
// 1.5 Associate the CSS path with the template
htmlTemplate.addCssPath(cssPath);
// 1.6 Construct another CSS path
cssPath = new CssPath();
cssPath.setCssPath("div#endText");
cssPath.setFieldName("content");
cssPath.setFieldDescription("body");
// 1.7 Associate the CSS path with the template
htmlTemplate.addCssPath(cssPath);
// Construct more URL patterns as above, then collect them
urlPatterns.add(urlPattern);
// 2. Obtain the extraction rule object
ExtractRegular extractRegular = ExtractRegular.getInstance(urlPatterns);
// Note: the extraction rules can be changed dynamically through these three methods:
// extractRegular.addUrlPatterns(urlPatterns);
// extractRegular.addUrlPattern(urlPattern);
// extractRegular.removeUrlPattern(urlPattern.getUrlPattern());
// 3. Obtain the HTML extraction tool
HtmlExtractor htmlExtractor = HtmlExtractor.getInstance(extractRegular);
// 4. Extract a web page
String url = "http://money.163.com/08/1219/16/4THR2TMP002533QK.html";
List<ExtractResult> extractResults = htmlExtractor.extract(url, "gb2312");
// 5. Output the results
int i = 1;
for (ExtractResult extractResult : extractResults) {
    System.out.println((i++) + ", web page " + extractResult.getUrl() + " extraction result:");
    for (ExtractResultItem extractResultItem : extractResult.getExtractResultItems()) {
        System.out.print("\t" + extractResultItem.getField() + " = " + extractResultItem.getValue());
    }
    System.out.println("\tdescription = " + extractResult.getDescription());
    System.out.println("\tkeywords = " + extractResult.getKeywords());
}
```
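Before moving on to the distributed setup, it can help to verify on its own that the URL pattern from step 1.1 really matches the article URL extracted in step 4. The following check uses only `java.util.regex` with the pattern and article URL given above; the non-matching URL is a made-up counterexample.

```java
import java.util.regex.Pattern;

public class UrlPatternCheck {
    public static void main(String[] args) {
        // The pattern from step 1.1: two digits / four digits / two digits /
        // sixteen characters from 0-9 and A-Z, then ".html". The dots are
        // unescaped, so they match any character, which is harmless here.
        Pattern p = Pattern.compile("http://money.163.com/\\d{2}/\\d{4}/\\d{2}/[0-9A-Z]{16}.html");
        // The article URL extracted in step 4 - it should match.
        System.out.println(p.matcher("http://money.163.com/08/1219/16/4THR2TMP002533QK.html").matches()); // true
        // A URL from a different channel should not match.
        System.out.println(p.matcher("http://tech.163.com/08/1219/16/4THR2TMP002533QK.html").matches()); // false
    }
}
```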
Multi-machine distributed usage:

1. Run the master node, which is responsible for maintaining the extraction rules: package the sub-project html-extractor-web as a war file and deploy it to Tomcat.

2. Obtain an HtmlExtractor instance (slave node). The sample code is as follows:

```java
String allExtractRegularUrl = "http://localhost:8080/HtmlExtractorServer/api/all_extract_regular.jsp";
String redisHost = "localhost";
int redisPort = 6379;
HtmlExtractor htmlExtractor = HtmlExtractor.getInstance(allExtractRegularUrl, redisHost, redisPort);
```

3. Extract information. The sample code is as follows:

```java
String url = "http://money.163.com/08/1219/16/4THR2TMP002533QK.html";
List<ExtractResult> extractResults = htmlExtractor.extract(url, "gb2312");
int i = 1;
for (ExtractResult extractResult : extractResults) {
    System.out.println((i++) + ", web page " + extractResult.getUrl() + " extraction result:");
    for (ExtractResultItem extractResultItem : extractResult.getExtractResultItems()) {
        System.out.print("\t" + extractResultItem.getField() + " = " + extractResultItem.getValue());
    }
    System.out.println("\tdescription = " + extractResult.getDescription());
    System.out.println("\tkeywords = " + extractResult.getKeywords());
}
```
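On a single node, the rules can also be changed at runtime through the three methods noted in the comments of step 2 of the stand-alone example; in the distributed setup, the master applies such changes through the html-extractor-web interface and pushes them to the slaves. Below is a minimal sketch of the single-node calls, reusing the `extractRegular`, `urlPattern`, and `urlPatterns` objects built earlier; note the source does not spell out the master-to-slave wire protocol, though the Redis parameters above suggest Redis is the transport.

```java
// Minimal sketch of dynamic rule maintenance on a single node, reusing the
// objects from the stand-alone example. These are the three calls listed in
// the comments of step 2.
extractRegular.addUrlPattern(urlPattern);                    // register one new pattern
extractRegular.addUrlPatterns(urlPatterns);                  // register a batch of patterns
extractRegular.removeUrlPattern(urlPattern.getUrlPattern()); // retire a pattern by its regex
```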
After reading the above, do you have a better understanding of how Java implements HtmlExtractor, the template-based component for precise structured information extraction from web pages? If you want to learn more, please follow the industry information channel. Thank you for your support.