Many newcomers are not sure how to combine SpringBoot, WebMagic and MyBatis to build a crawler and store the crawled data, so this article walks through a complete example in detail. If that is what you need, read on; hopefully you will gain something from it.
WebMagic is an open-source crawler framework. In this project, WebMagic runs inside a SpringBoot application to grab the data, and MyBatis is then used to store the data in MySQL.
Create the database:
In this example, the database is named article and the table is named cms_content; the table has three fields: contentId, title and date.

CREATE TABLE `cms_content` (
  `contentId` varchar(40) NOT NULL COMMENT 'content ID',
  `title` varchar(150) NOT NULL COMMENT 'title',
  `date` varchar(150) NOT NULL COMMENT 'release date',
  PRIMARY KEY (`contentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';

Create a SpringBoot project:
1. Configure the dependencies in pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.5.5</version>
    </parent>
    <groupId>com.example</groupId>
    <artifactId>Article</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>Article</name>
    <description>Article</description>

    <properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>
        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.5</webmagic.core.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
</project>

The build section additionally configures the maven-compiler-plugin (${maven.compiler.plugin.version}, with source/target ${java.version} and ${project.build.sourceEncoding} encoding), the maven-resources-plugin (${maven.resources.plugin.version}, ${project.build.sourceEncoding} encoding) and the spring-boot-maven-plugin with the repackage goal; the aliyun public Nexus (http://maven.aliyun.com/nexus/content/groups/public/) is declared both as a repository and as a pluginRepository.
2. Create CmsContentPO.java

The data entity, corresponding to the three fields in the table.

package site.exciter.article.model;

public class CmsContentPO {

    private String contentId;
    private String title;
    private String date;

    public String getContentId() {
        return contentId;
    }

    public void setContentId(String contentId) {
        this.contentId = contentId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }
}
3. Create CrawlerMapper.java

package site.exciter.article.dao;

import org.apache.ibatis.annotations.Mapper;

import site.exciter.article.model.CmsContentPO;

@Mapper
public interface CrawlerMapper {
    int addCmsContent(CmsContentPO record);
}
4. Configure the mapping file CrawlerMapper.xml

Create a new mapper folder under resources and add a CrawlerMapper.xml to it.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="site.exciter.article.dao.CrawlerMapper">
    <insert id="addCmsContent" parameterType="site.exciter.article.model.CmsContentPO">
        insert into cms_content (contentId, title, date)
        values (#{contentId,jdbcType=VARCHAR}, #{title,jdbcType=VARCHAR}, #{date,jdbcType=VARCHAR})
    </insert>
</mapper>
5. Configure application.properties

Configure the database connection and the MyBatis mapping.
# mysql
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://10.201.61.184:3306/article?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root

# druid
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000

# mybatis
mybatis.mapperLocations=classpath:mapper/CrawlerMapper.xml
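At this point the persistence layer (entity, mapper, mapping file and datasource) is complete, so it can be checked on its own before any crawler code exists. The test below is not part of the original article; it is a minimal sketch that assumes the classes above, a reachable MySQL instance containing the cms_content table, and the default ArticleApplication class generated with the project, and that it is run before the crawl is wired into the Application class in step 9. The test class and method names are made up for illustration.

package site.exciter.article;

import java.util.UUID;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.transaction.annotation.Transactional;

import site.exciter.article.dao.CrawlerMapper;
import site.exciter.article.model.CmsContentPO;

import static org.junit.jupiter.api.Assertions.assertEquals;

@SpringBootTest
@Transactional // roll the insert back after the test so no throwaway row is left behind
class CrawlerMapperTest {

    @Autowired
    private CrawlerMapper crawlerMapper;

    @Test
    void addCmsContentInsertsOneRow() {
        CmsContentPO record = new CmsContentPO();
        record.setContentId(UUID.randomUUID().toString());
        record.setTitle("mapper smoke test");
        record.setDate("2021-10-01");
        // addCmsContent returns the number of rows affected by the INSERT
        assertEquals(1, crawlerMapper.addCmsContent(record));
    }
}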
6. Create ArticlePageProcessor.java

The logic for parsing the HTML.
package site.exciter.article;

import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

@Component
public class ArticlePageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        String detail_urls_Xpath = "//*[@class='postTitle']/a[@class='postTitle2']/@href";
        String next_page_xpath = "//*[@id='nav_next_page']/a/@href";
        String next_page_css = "#homepage_top_pager > div:nth-child(1) > a:nth-child(7)";
        String title_xpath = "//h2[@class='postTitle']/a/span/text()";
        String date_xpath = "//span[@id='post-date']/text()";
        page.putField("title", page.getHtml().xpath(title_xpath).toString());
        if (page.getResultItems().get("title") == null) {
            // no title on this page, skip it so the pipeline is not called
            page.setSkip(true);
        }
        page.putField("date", page.getHtml().xpath(date_xpath).toString());
        if (page.getHtml().xpath(detail_urls_Xpath).match()) {
            Selectable detailUrls = page.getHtml().xpath(detail_urls_Xpath);
            page.addTargetRequests(detailUrls.all());
        }
        if (page.getHtml().xpath(next_page_xpath).match()) {
            Selectable nextPageUrl = page.getHtml().xpath(next_page_xpath);
            page.addTargetRequests(nextPageUrl.all());
        } else if (page.getHtml().css(next_page_css).match()) {
            Selectable nextPageUrl = page.getHtml().css(next_page_css).links();
            page.addTargetRequests(nextPageUrl.all());
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
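The XPath and CSS expressions above target the old cnblogs.com blog layout, so they may need adjusting for other sites. If a selector returns nothing, one way to sanity-check it offline is WebMagic's Html selector. The snippet below is only a sketch and not part of the original article; the XpathCheck class and the sample markup are made up for illustration.

package site.exciter.article;

import us.codecraft.webmagic.selector.Html;

public class XpathCheck {
    public static void main(String[] args) {
        // Minimal markup shaped like the structure the processor's XPath expressions expect.
        String sample = "<h2 class=\"postTitle\">"
                + "<a class=\"postTitle2\" href=\"https://example.com/post/1\">"
                + "<span>Sample title</span></a></h2>";
        Html html = new Html(sample);
        // Should print the detail link and the title text extracted by the same expressions.
        System.out.println(html.xpath("//*[@class='postTitle']/a[@class='postTitle2']/@href").all());
        System.out.println(html.xpath("//h2[@class='postTitle']/a/span/text()").get());
    }
}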
7. Create ArticlePipeline.java

Handles the persistence of the crawled data.
package site.exciter.article;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import site.exciter.article.model.CmsContentPO;
import site.exciter.article.dao.CrawlerMapper;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.UUID;

@Component
public class ArticlePipeline implements Pipeline {

    private static final Logger LOGGER = LoggerFactory.getLogger(ArticlePipeline.class);

    @Autowired
    private CrawlerMapper crawlerMapper;

    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String date = resultItems.get("date");
        CmsContentPO contentPO = new CmsContentPO();
        contentPO.setContentId(UUID.randomUUID().toString());
        contentPO.setTitle(title);
        contentPO.setDate(date);
        try {
            boolean success = crawlerMapper.addCmsContent(contentPO) > 0;
            LOGGER.info("Save successfully: {}", title);
        } catch (Exception ex) {
            LOGGER.error("Save failed", ex);
        }
    }
}
8. Create ArticleTask.java

Runs the crawl task on a schedule.
package site.exciter.article;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Spider;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

@Component
public class ArticleTask {

    private static final Logger LOGGER = LoggerFactory.getLogger(ArticleTask.class);

    @Autowired
    private ArticlePipeline articlePipeline;

    @Autowired
    private ArticlePageProcessor articlePageProcessor;

    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void crawl() {
        // scheduled task that crawls every 10 minutes
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("ArticleCrawlerThread");
            try {
                Spider.create(articlePageProcessor)
                        .addUrl("http://www.cnblogs.com/dick159/default.html?page=2")
                        // store the captured data in the database
                        .addPipeline(articlePipeline)
                        // crawl with 5 threads
                        .thread(5)
                        // start the crawler asynchronously
                        .start();
            } catch (Exception ex) {
                LOGGER.error("scheduled fetching data thread execution exception", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}
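One thing the class above never does is shut the single-thread scheduler down, so its non-daemon thread keeps running when the Spring context closes. This is not part of the original tutorial, but if you want a clean shutdown, one option is a small hook added inside ArticleTask (the method name is arbitrary; javax.annotation.PreDestroy is available on Java 8 with Spring Boot 2.5):

import javax.annotation.PreDestroy;

// Add to ArticleTask: stop scheduling new crawl rounds when the application context closes.
@PreDestroy
public void shutdown() {
    timer.shutdown();
}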
9. Modify the Application class

package site.exciter.article;

import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan(basePackages = "site.exciter.article.dao")
public class ArticleApplication implements CommandLineRunner {

    @Autowired
    private ArticleTask articleTask;

    public static void main(String[] args) {
        SpringApplication.run(ArticleApplication.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        articleTask.crawl();
    }
}

10. Run the application; it starts crawling the pages and storing the data.
That completes the SpringBoot + WebMagic + MyBatis crawler and storage example; hopefully the walkthrough above helps you build something similar.