How to realize the function of full-text retrieval of disk files by integrating ES in springboot

2025-04-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shows how to integrate Elasticsearch (ES) into a Spring Boot application to provide full-text search over files stored on disk. It covers the overall architecture, the ES deployment, the project setup, the indexing code, and the search service.

Overall architecture

Because the disk files are spread across different machines, the system is built around disk-scanning agents: a scanning service is deployed as an agent on each server that hosts a target disk, and it builds the index in a shared ES cluster as a scheduled task. ES itself is deployed in a distributed, highly available configuration. The search service is deployed together with the scanning agent, which simplifies the architecture while keeping it distributed.

[Figure: fast retrieval architecture of disk files]

Deploy ES

ES (Elasticsearch) is the only third-party software this project depends on. ES can be deployed with Docker; the steps are as follows.

docker pull docker.elastic.co/elasticsearch/elasticsearch:6.3.2
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9200:9200 -p 9300:9300 --name es01 docker.elastic.co/elasticsearch/elasticsearch:6.3.2

After deployment, open http://localhost:9200 in a browser. If the page loads and shows the following interface, ES has been deployed successfully.
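The same check can be done from code instead of a browser. The following is a minimal sketch, assuming ES listens on the address used above; the class name EsPingSketch and the method isReachable are hypothetical and are not part of the project.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of the original project.
class EsPingSketch {

    // Returns true only if an HTTP GET on baseUrl answers with status 200.
    static boolean isReachable(String baseUrl) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(baseUrl).openConnection();
            conn.setConnectTimeout(2000); // fail fast if nothing is listening
            conn.setReadTimeout(2000);
            conn.setRequestMethod("GET");
            return conn.getResponseCode() == 200;
        } catch (IOException e) {
            // Malformed URL, connection refused, timeout: treat all as "not reachable"
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // Address taken from the deployment step above
        System.out.println("ES reachable: " + isReachable("http://localhost:9200"));
    }
}
```

A check like this could run before the first scan so that the agent does not start indexing while ES is still starting up.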

[Figure: ES interface]

Engineering structure

Dependency package

In addition to the basic Spring Boot starters, the project needs the ES-related packages.

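<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
<dependency>
    <groupId>io.searchbox</groupId>
    <artifactId>jest</artifactId>
    <version>5.3.3</version>
</dependency>
<dependency>
    <groupId>net.sf.jmimemagic</groupId>
    <artifactId>jmimemagic</artifactId>
    <version>0.1.4</version>
</dependency>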

Configuration file

Configure the ES access address in application.yml. To keep the program simple, also configure the root directory of the disk to be scanned (index-root); the scanning task recursively traverses all indexable files under that directory.

server:
  port: @elasticsearch.port@
spring:
  application:
    name: @project.artifactId@
  profiles:
    active: dev
  elasticsearch:
    jest:
      uris: http://127.0.0.1:9200
index-root: /Users/crazyicelee/mywokerspace

Index structure data definition

The directory containing the file, the file name, and the file body must all be searchable, so they are defined as index fields. The id field is annotated with @JestId, as required by the Jest ES client.

package com.crazyice.lee.accumulation.search.data;

import io.searchbox.annotations.JestId;
import lombok.Data;

@Data
public class Article {
    @JestId
    private Integer id;
    private String author;
    private String title;
    private String path;
    private String content;
    private String fileFingerprint;
}

Scan the disk and create an index

Because every file under the specified directory must be scanned, the directory is traversed recursively, and files that have already been processed are identified and skipped to improve efficiency. The file type can be identified in one of two ways: an accurate check based on the file content (Magic), or a rough check based on the file extension. This part is the core component of the whole system.
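The extension-based check and the recursive walk can be illustrated with the JDK alone. This is only a sketch under stated assumptions: the class ScanSketch and its methods are hypothetical, the extension is lower-cased (the project's own code compares it as-is), and Files.walk stands in for the hand-written recursion used in the project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the original project.
class ScanSketch {

    // Extensions the article treats as indexable
    static final Set<String> INDEXABLE =
            new HashSet<>(Arrays.asList("java", "c", "cpp", "txt", "doc", "docx"));

    // Text after the last dot, lower-cased; empty string if there is none
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Rough, extension-based decision; skips Office temporary files ("~$...")
    static boolean isIndexable(String fileName) {
        return !fileName.startsWith("~$") && INDEXABLE.contains(extensionOf(fileName));
    }

    // Recursively collects every indexable regular file under root
    static List<Path> indexableFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> isIndexable(p.getFileName().toString()))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        indexableFiles(root).forEach(System.out::println);
    }
}
```

The extension check is cheap but can be fooled by a renamed file, which is why the content-based Magic check exists as the more accurate alternative.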

Here's a little trick.

The MD5 of the target file's content is computed and stored in an ES index field as the file's fingerprint. Each time the index is rebuilt, the scanner checks whether that MD5 already exists in the index; if it does, the file is skipped. This avoids indexing the same file twice and avoids reprocessing every file after the system restarts.
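The project's Md5CaculateUtil class is not listed in this article. As a rough idea of what such a fingerprint helper could look like, here is a sketch using only the JDK; the class name Md5FingerprintSketch and both method signatures are assumptions, not the author's implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the unlisted Md5CaculateUtil.
class Md5FingerprintSketch {

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a mandatory algorithm on every Java platform
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] digest) {
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    // Fingerprint of an in-memory byte array
    static String md5Hex(byte[] content) {
        return toHex(newMd5().digest(content));
    }

    // Fingerprint of a file, streamed in chunks so large files are not loaded whole
    static String md5Hex(Path file) throws IOException {
        MessageDigest md5 = newMd5();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
        }
        return toHex(md5.digest());
    }
}
```

Because the fingerprint depends only on the content, a file that is renamed or moved keeps the same MD5 and is not indexed again, while an edited file gets a new fingerprint and a new index entry.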

package com.crazyice.lee.accumulation.search.service;

import com.alibaba.fastjson.JSONObject;
import com.crazyice.lee.accumulation.search.data.Article;
import com.crazyice.lee.accumulation.search.utils.Md5CaculateUtil;
import io.searchbox.client.JestClient;
import io.searchbox.core.Index;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import lombok.extern.slf4j.Slf4j;
import net.sf.jmimemagic.*;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

@Component
@Slf4j
public class DirectoryRecurse {

    @Autowired
    private JestClient jestClient;

    // Read the file content into a string, depending on the file type
    private String readToString(File file, String fileType) {
        StringBuilder result = new StringBuilder();
        switch (fileType) {
            case "text/plain":
            case "java":
            case "c":
            case "cpp":
            case "txt":
                try {
                    // readAllBytes reads the whole file; a single InputStream.read() call may return early
                    result.append(new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8));
                } catch (IOException e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
            case "doc":
                // Use the WordExtractor class of the POI HWPF component to extract text from a Word document
                try (FileInputStream in = new FileInputStream(file)) {
                    WordExtractor extractor = new WordExtractor(in);
                    result.append(extractor.getText());
                } catch (Exception e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
            case "docx":
                try (FileInputStream in = new FileInputStream(file); XWPFDocument doc = new XWPFDocument(in)) {
                    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                    result.append(extractor.getText());
                } catch (Exception e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
        }
        return result.toString();
    }

    // Check whether the file has already been indexed
    private JSONObject isIndex(File file) {
        JSONObject result = new JSONObject();
        // Use the MD5 of the file as its fingerprint and search for a document that already carries it
        String fileFingerprint = Md5CaculateUtil.getMD5(file);
        result.put("fileFingerprint", fileFingerprint);
        // Treat the file as not indexed unless the search below finds the fingerprint
        result.put("isIndex", false);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        // The query builder call is garbled in the source listing; a term query on the field is assumed here
        searchSourceBuilder.query(QueryBuilders.termQuery("fileFingerprint", fileFingerprint));
        Search search = new Search.Builder(searchSourceBuilder.toString())
                .addIndex("diskfile")
                .addType("files")
                .build();
        try {
            // Execute the search; a failed search (for example, the index does not exist yet) has no total
            SearchResult searchResult = jestClient.execute(search);
            if (searchResult.isSucceeded() && searchResult.getTotal() != null && searchResult.getTotal() > 0) {
                result.put("isIndex", true);
            }
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return result;
    }

    // Create an index entry for the file's directory, name and content
    private void createIndex(File file, String method) {
        // Ignore temporary files, whose names start with ~$
        if (file.getName().startsWith("~$")) return;

        String fileType = null;
        switch (method) {
            case "magic":
                Magic parser = new Magic();
                try {
                    MagicMatch match = parser.getMagicMatch(file, false);
                    fileType = match.getMimeType();
                } catch (MagicParseException e) {
                    // type could not be determined; fileType stays null
                } catch (MagicMatchNotFoundException e) {
                    // type could not be determined; fileType stays null
                } catch (MagicException e) {
                    // type could not be determined; fileType stays null
                }
                break;
            case "ext":
                String filename = file.getName();
                String[] strArray = filename.split("\\.");
                int suffixIndex = strArray.length - 1;
                fileType = strArray[suffixIndex];
                break;
        }
        // switch on a null string throws a NullPointerException, so skip files of unknown type
        if (fileType == null) return;

        switch (fileType) {
            case "text/plain":
            case "java":
            case "c":
            case "cpp":
            case "txt":
            case "doc":
            case "docx":
                JSONObject isIndexResult = isIndex(file);
                log.info("File name: {}, file type: {}, MD5: {}, indexed: {}", file.getPath(), fileType,
                        isIndexResult.getString("fileFingerprint"), isIndexResult.getBooleanValue("isIndex"));
                if (isIndexResult.getBooleanValue("isIndex")) break;
                // 1. Build the document to index (save) in ES
                Article article = new Article();
                article.setTitle(file.getName());
                article.setAuthor(file.getParent());
                article.setPath(file.getPath());
                article.setContent(readToString(file, fileType));
                article.setFileFingerprint(isIndexResult.getString("fileFingerprint"));
                // 2. Build the index action
                Index index = new Index.Builder(article).index("diskfile").type("files").build();
                try {
                    // 3. Execute
                    String id = jestClient.execute(index).getId();
                    if (id != null && !id.isEmpty()) {
                        log.info("Index built successfully!");
                    }
                } catch (IOException e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
        }
    }

    public void find(String pathName) throws IOException {
        // Get the File object for pathName
        File dirFile = new File(pathName);
        // If the file or directory does not exist, log it and stop
        if (!dirFile.exists()) {
            log.info("{} does not exist", pathName);
            return;
        }
        // If it is not a directory, index it when it is a regular file
        if (!dirFile.isDirectory()) {
            if (dirFile.isFile()) {
                createIndex(dirFile, "ext");
            }
            return;
        }
        // Get all file and directory names in this directory; list() returns null if it cannot be read
        String[] fileList = dirFile.list();
        if (fileList == null) return;
        // Traverse the directory
        for (String name : fileList) {
            File file = new File(dirFile.getPath(), name);
            if (file.isDirectory()) {
                // Recurse into subdirectories
                find(file.getCanonicalPath());
            } else {
                createIndex(file, "ext");
            }
        }
    }
}

Scan task

A scheduled task scans the specified directory, so the index is built incrementally as files are added or changed.

package com.crazyice.lee.accumulation.search.service;

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.io.IOException;

// Scheduling must be switched on with @EnableScheduling on a configuration class
@Component
@Slf4j
public class CreateIndexTask {

    @Autowired
    private DirectoryRecurse directoryRecurse;

    @Value("${index-root}")
    private String indexRoot;

    // Run once every 5 minutes. The original expression "* 0/5 * * * ?" fires every second of each fifth minute.
    @Scheduled(cron = "0 0/5 * * * ?")
    public void addIndex() {
        try {
            directoryRecurse.find(indexRoot);
            // The original also calls directoryRecurse.writeIndexStatus() here;
            // that method is not part of the DirectoryRecurse listing above.
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
    }
}

Search service

The search service is exposed as a RESTful endpoint. Matched keywords are returned in highlighted form to the front-end UI, and the browser can render the returned JSON.

package com.crazyice.lee.accumulation.search.web;

import com.crazyice.lee.accumulation.search.data.Article;
import io.searchbox.client.JestClient;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import io.swagger.annotations.ApiImplicitParam;
import io.swagger.annotations.ApiImplicitParams;
import io.swagger.annotations.ApiOperation;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

import java.io.IOException;
import java.util.Collections;
import java.util.List;

@RestController
@Slf4j
public class Controller {

    @Autowired
    private JestClient jestClient;

    @RequestMapping(value = "/search/{keyword}", method = RequestMethod.GET)
    @ApiOperation(value = "Search all fields for a keyword", notes = "ES verification")
    @ApiImplicitParams(
            @ApiImplicitParam(name = "keyword", value = "Full-text search keyword", required = true, paramType = "path", dataType = "String")
    )
    public List<SearchResult.Hit<Article, Void>> search(@PathVariable String keyword) {
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.queryStringQuery(keyword));

        HighlightBuilder highlightBuilder = new HighlightBuilder();
        // Highlight the path field
        HighlightBuilder.Field highlightPath = new HighlightBuilder.Field("path");
        highlightPath.highlighterType("unified");
        highlightBuilder.field(highlightPath);
        // Highlight the title field
        HighlightBuilder.Field highlightTitle = new HighlightBuilder.Field("title");
        highlightTitle.highlighterType("unified");
        highlightBuilder.field(highlightTitle);
        // Highlight the content field
        HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("content");
        highlightContent.highlighterType("unified");
        highlightBuilder.field(highlightContent);
        // Apply the highlight configuration
        searchSourceBuilder.highlighter(highlightBuilder);
        log.info("Search condition: {}", searchSourceBuilder.toString());

        // Build the search against the index and type the scanner writes to
        // (the original listing used "gf"/"news" here, which does not match the indexing code)
        Search search = new Search.Builder(searchSourceBuilder.toString())
                .addIndex("diskfile")
                .addType("files")
                .build();
        try {
            // Execute
            SearchResult result = jestClient.execute(search);
            return result.getHits(Article.class);
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return Collections.emptyList();
    }
}

Testing the RESTful search results

The API is tested through Swagger. The keyword parameter is the term to search for in the full-text index.

Generating the search result UI with Thymeleaf

The Thymeleaf template engine is integrated to present the search results directly as web pages. The templates consist of a main search page and a search results page, implemented with the @Controller annotation and the Model object.
