Many newcomers are not sure how to combine SpringBoot, WebMagic and MyBatis to build a crawler and store the crawled data, so this article walks through a complete example in detail. If that is what you need, read on; hopefully you will gain something from it.
WebMagic is an open-source crawler framework. In this project, WebMagic runs inside a SpringBoot application to grab the data, and MyBatis is then used to store the data in MySQL.
Create the database:
In this example, the database is named article and the table is named cms_content; the table has three fields: contentId, title and date.

CREATE TABLE `cms_content` (
  `contentId` varchar(40) NOT NULL COMMENT 'content ID',
  `title` varchar(150) NOT NULL COMMENT 'title',
  `date` varchar(150) NOT NULL COMMENT 'release date',
  PRIMARY KEY (`contentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';

Create a SpringBoot project:
1. Configure the dependencies in pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.5.5</version>
    </parent>
    <groupId>com.example</groupId>
    <artifactId>Article</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>Article</name>
    <description>Article</description>

    <properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>
        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.5</webmagic.core.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
</project>

The build section additionally configures the maven-compiler-plugin (${maven.compiler.plugin.version}, with source/target ${java.version} and ${project.build.sourceEncoding} encoding), the maven-resources-plugin (${maven.resources.plugin.version}, ${project.build.sourceEncoding} encoding) and the spring-boot-maven-plugin with the repackage goal; the aliyun public Nexus (http://maven.aliyun.com/nexus/content/groups/public/) is declared both as a repository and as a pluginRepository.
2. Create CmsContentPO.java

The data entity, corresponding to the three fields in the table.

package site.exciter.article.model;

public class CmsContentPO {

    private String contentId;
    private String title;
    private String date;

    public String getContentId() {
        return contentId;
    }

    public void setContentId(String contentId) {
        this.contentId = contentId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }
}
3. Create CrawlerMapper.java

package site.exciter.article.dao;

import org.apache.ibatis.annotations.Mapper;

import site.exciter.article.model.CmsContentPO;

@Mapper
public interface CrawlerMapper {
    int addCmsContent(CmsContentPO record);
}
4. Configure the mapping file CrawlerMapper.xml

Create a new mapper folder under resources and add a CrawlerMapper.xml to it.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="site.exciter.article.dao.CrawlerMapper">
    <insert id="addCmsContent" parameterType="site.exciter.article.model.CmsContentPO">
        insert into cms_content (contentId, title, date)
        values (#{contentId,jdbcType=VARCHAR}, #{title,jdbcType=VARCHAR}, #{date,jdbcType=VARCHAR})
    </insert>
</mapper>
5. Configure application.properties

Configure the database connection and the MyBatis mapping.
# mysql
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://10.201.61.184:3306/article?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root

# druid
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000

# mybatis
mybatis.mapperLocations=classpath:mapper/CrawlerMapper.xml
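At this point the persistence layer (entity, mapper, mapping file and datasource) is complete, so it can be checked on its own before any crawler code exists. The test below is not part of the original article; it is a minimal sketch that assumes the classes above, a reachable MySQL instance containing the cms_content table, and the default ArticleApplication class generated with the project, and that it is run before the crawl is wired into the Application class in step 9. The test class and method names are made up for illustration.

package site.exciter.article;

import java.util.UUID;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.transaction.annotation.Transactional;

import site.exciter.article.dao.CrawlerMapper;
import site.exciter.article.model.CmsContentPO;

import static org.junit.jupiter.api.Assertions.assertEquals;

@SpringBootTest
@Transactional // roll the insert back after the test so no throwaway row is left behind
class CrawlerMapperTest {

    @Autowired
    private CrawlerMapper crawlerMapper;

    @Test
    void addCmsContentInsertsOneRow() {
        CmsContentPO record = new CmsContentPO();
        record.setContentId(UUID.randomUUID().toString());
        record.setTitle("mapper smoke test");
        record.setDate("2021-10-01");
        // addCmsContent returns the number of rows affected by the INSERT
        assertEquals(1, crawlerMapper.addCmsContent(record));
    }
}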
6. Create ArticlePageProcessor.java

The logic for parsing the HTML.
package site.exciter.article;

import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

@Component
public class ArticlePageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        String detail_urls_Xpath = "//*[@class='postTitle']/a[@class='postTitle2']/@href";
        String next_page_xpath = "//*[@id='nav_next_page']/a/@href";
        String next_page_css = "#homepage_top_pager > div:nth-child(1) > a:nth-child(7)";
        String title_xpath = "//h2[@class='postTitle']/a/span/text()";
        String date_xpath = "//span[@id='post-date']/text()";
        page.putField("title", page.getHtml().xpath(title_xpath).toString());
        if (page.getResultItems().get("title") == null) {
            // no title on this page, skip it so the pipeline is not called
            page.setSkip(true);
        }
        page.putField("date", page.getHtml().xpath(date_xpath).toString());
        if (page.getHtml().xpath(detail_urls_Xpath).match()) {
            Selectable detailUrls = page.getHtml().xpath(detail_urls_Xpath);
            page.addTargetRequests(detailUrls.all());
        }
        if (page.getHtml().xpath(next_page_xpath).match()) {
            Selectable nextPageUrl = page.getHtml().xpath(next_page_xpath);
            page.addTargetRequests(nextPageUrl.all());
        } else if (page.getHtml().css(next_page_css).match()) {
            Selectable nextPageUrl = page.getHtml().css(next_page_css).links();
            page.addTargetRequests(nextPageUrl.all());
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
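The XPath and CSS expressions above target the old cnblogs.com blog layout, so they may need adjusting for other sites. If a selector returns nothing, one way to sanity-check it offline is WebMagic's Html selector. The snippet below is only a sketch and not part of the original article; the XpathCheck class and the sample markup are made up for illustration.

package site.exciter.article;

import us.codecraft.webmagic.selector.Html;

public class XpathCheck {
    public static void main(String[] args) {
        // Minimal markup shaped like the structure the processor's XPath expressions expect.
        String sample = "<h2 class=\"postTitle\">"
                + "<a class=\"postTitle2\" href=\"https://example.com/post/1\">"
                + "<span>Sample title</span></a></h2>";
        Html html = new Html(sample);
        // Should print the detail link and the title text extracted by the same expressions.
        System.out.println(html.xpath("//*[@class='postTitle']/a[@class='postTitle2']/@href").all());
        System.out.println(html.xpath("//h2[@class='postTitle']/a/span/text()").get());
    }
}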
7. Create ArticlePipeline.java

Handles the persistence of the crawled data.
package site.exciter.article;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import site.exciter.article.model.CmsContentPO;
import site.exciter.article.dao.CrawlerMapper;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.UUID;

@Component
public class ArticlePipeline implements Pipeline {

    private static final Logger LOGGER = LoggerFactory.getLogger(ArticlePipeline.class);

    @Autowired
    private CrawlerMapper crawlerMapper;

    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String date = resultItems.get("date");
        CmsContentPO contentPO = new CmsContentPO();
        contentPO.setContentId(UUID.randomUUID().toString());
        contentPO.setTitle(title);
        contentPO.setDate(date);
        try {
            boolean success = crawlerMapper.addCmsContent(contentPO) > 0;
            LOGGER.info("Save successfully: {}", title);
        } catch (Exception ex) {
            LOGGER.error("Save failed", ex);
        }
    }
}
8. Create ArticleTask.java

Runs the crawl task on a schedule.
package site.exciter.article;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Spider;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

@Component
public class ArticleTask {

    private static final Logger LOGGER = LoggerFactory.getLogger(ArticleTask.class);

    @Autowired
    private ArticlePipeline articlePipeline;

    @Autowired
    private ArticlePageProcessor articlePageProcessor;

    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void crawl() {
        // scheduled task that crawls every 10 minutes
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("ArticleCrawlerThread");
            try {
                Spider.create(articlePageProcessor)
                        .addUrl("http://www.cnblogs.com/dick159/default.html?page=2")
                        // store the captured data in the database
                        .addPipeline(articlePipeline)
                        // crawl with 5 threads
                        .thread(5)
                        // start the crawler asynchronously
                        .start();
            } catch (Exception ex) {
                LOGGER.error("scheduled fetching data thread execution exception", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}
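One thing the class above never does is shut the single-thread scheduler down, so its non-daemon thread keeps running when the Spring context closes. This is not part of the original tutorial, but if you want a clean shutdown, one option is a small hook added inside ArticleTask (the method name is arbitrary; javax.annotation.PreDestroy is available on Java 8 with Spring Boot 2.5):

import javax.annotation.PreDestroy;

// Add to ArticleTask: stop scheduling new crawl rounds when the application context closes.
@PreDestroy
public void shutdown() {
    timer.shutdown();
}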
9. Modify the Application class

package site.exciter.article;

import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan(basePackages = "site.exciter.article.dao")
public class ArticleApplication implements CommandLineRunner {

    @Autowired
    private ArticleTask articleTask;

    public static void main(String[] args) {
        SpringApplication.run(ArticleApplication.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        articleTask.crawl();
    }
}

10. Run the application; it starts crawling the pages and storing the data.
That completes the SpringBoot + WebMagic + MyBatis crawler and storage example; hopefully the walkthrough above helps you build something similar.