01-Full-text search (concepts)
Use full-text search (e.g. the Lucene framework) when the data volume is large, the search requirements are high, and searching the database content puts too much pressure on the database server.
What data is searched?
Text (the important case)
Multimedia
How is the search performed?
Semantics are not processed;
it searches for articles containing the specified words.
Scope of application?
Web search, forum (Tieba) search, document search, etc.
What are the requirements for full-text search?
Search speed must be fast.
Results must be accurate.
When multiple results are found, the best-matching results come first (relevance ranking).
Case insensitive.
02-Description of Lucene's functionality
lucene.apache.org
Apache also provides Tomcat, Struts, BeanUtils, DbUtils, ...
The principle of full-text retrieval based on Lucene
With a large amount of data, how does Lucene achieve fast retrieval?
Baidu search: the user's request is sent to the Baidu server, and results are returned.
│
Crawlers grab pages from the Internet, and the data is organized into a specific, fast-to-search format - the index library - placed on the Baidu server.
│
Lucene manages the index library and provides the search function.
03-Introduction to Lucene's API and data structures
The index library stores a pile of binary data; it can be understood as a database.
How is the index library's catalog created?
A web page or file is represented in Java as an object - an ordinary JavaBean; a Map can carry all the information in a JavaBean, e.g. Map<String, String>. In Lucene the corresponding structure is the Document with its Fields.
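The notes don't show how the index library's directory is opened; a minimal sketch, assuming Lucene 3.0.x and an index folder at ./indexDir (these variables are reused by the later snippets):

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// open (or create) the index library folder on disk
Directory directory = FSDirectory.open(new File("./indexDir"));
// the analyzer that splits text into the keywords stored in the catalog area
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);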
04-Prepare the development environment for Lucene + HelloWorld (indexing)
Core package: lucene-core-3.0.1.jar
Feature-specific packages:
Analyzer (word splitter): lucene-analyzers-3.0.1.jar
Keyword highlighting: lucene-highlighter-3.0.1.jar
Dependency of the highlighter: lucene-memory-3.0.1.jar
Editor shortcut: Shift + Alt + A (block-selection mode in Eclipse)
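Section 04 names a HelloWorld indexing example, but the notes omit its code; a minimal sketch using the jars above and the directory/analyzer variables from the earlier snippet (field values are made up for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;

// build a Document and write it into the index library
Document doc = new Document();
doc.add(new Field("id", "1", Store.YES, Index.NOT_ANALYZED));
doc.add(new Field("title", "Preparing the development environment for Lucene", Store.YES, Index.ANALYZED));
doc.add(new Field("content", "Lucene is a framework for full-text search.", Store.YES, Index.ANALYZED));

IndexWriter indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.UNLIMITED);
indexWriter.addDocument(doc);
indexWriter.close(); // flush the in-memory buffer to disk and release the write lock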
06-Internal structure of the index library
The index library has two areas:
1. Catalog area - the analyzer splits out keywords and stores the keyword-to-document correspondence (which n articles contain each keyword)
2. Data area - stores the Documents
Doc.add (new Field ("id", idStr, Store.YES, Index.ANALYZED))
Doc.add (new Field ("title", article.getTitle (), Store.YES, Index.ANALYZED))
Doc.add (new Field ("content", article.getContent (), Store.NO, Index.ANALYZED))
The Store parameter
Specifies whether the original value of a Field is stored in the data area.
YES - stored; the value of this field can be read from the retrieved Document.
NO - not stored; the value of this field cannot be found in the retrieved Document.
The Index parameter
Specifies whether (and how) the text value of a Field is updated into the catalog area.
NO - not updated into the catalog area; the field cannot be searched.
ANALYZED - the field's text value is split into words first, and the result is updated into the catalog.
NOT_ANALYZED - no word splitting; the field's text value is updated into the catalog as a single term. Application scenarios: author, date, number, URL, file path.
Store | Index        | Application scenario
YES   | ANALYZED     | Can be searched and displayed.
YES   | NO           | Not searched by this field, but the field is shown when results are displayed. For example, articles are not searched by author, yet the author is shown with each article. Data stored in the index library can be returned directly, with no further database query: one search yields all the display data, which is efficient (cf. Baidu's page snapshots). If the data is very large, consider not storing the content in the index library.
NO    | ANALYZED     | Searchable by this field, but its content is not displayed. For example, e-books can be searched by content while the results page shows only the title and author, not the body text.
NO    | NO           | Not allowed: a field that is neither stored nor indexed is meaningless, and Lucene rejects it.
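As a sketch, the table rows (plus the NOT_ANALYZED case from the list above) expressed as Field declarations; getAuthor() and getUrl() are hypothetical accessors on the Article bean:

doc.add(new Field("title",   article.getTitle(),   Store.YES, Index.ANALYZED));     // row 1: search and display
doc.add(new Field("author",  article.getAuthor(),  Store.YES, Index.NO));           // row 2: display only
doc.add(new Field("content", article.getContent(), Store.NO,  Index.ANALYZED));     // row 3: search only
doc.add(new Field("url",     article.getUrl(),     Store.YES, Index.NOT_ANALYZED)); // whole value indexed as one term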
07-Analysis of the indexing and searching process: building the index (adding to, deleting from and modifying the index library)
1. Convert the Article to a Document
Document doc = new Document();
doc.add(new Field("id", idStr, Store.YES, Index.ANALYZED));
2. Add it to the index library
IndexWriter indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.UNLIMITED);
indexWriter.addDocument(doc);
This does two things: (1) saves the Document to the data area, where an internal number is assigned automatically;
(2) updates each field value (split into words or not) into the catalog area.
2. Searching
Search process:
1. Convert the query string into a Query object (here the default search field is "title")
QueryParser queryParser = new QueryParser(Version.LUCENE_30, "title", analyzer);
Query query = queryParser.parse(queryString);
2. Execute the query against the catalog area and get the intermediate results
IndexSearcher indexSearcher = new IndexSearcher(directory); // specify the index library to use
TopDocs topDocs = indexSearcher.search(query, 100); // return at most the first n results
int count = topDocs.totalHits; // total number of matching documents
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
3. Extract the real Document data from the data area according to each docId
When searching, the query string is also split into keywords, which are matched against the keywords in the catalog area.
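The notes give no code for step 3; a minimal sketch of fetching the stored fields by docId, continuing the variables above:

// for each hit, read the stored Document out of the data area by its internal number
for (ScoreDoc scoreDoc : scoreDocs) {
    int docId = scoreDoc.doc;                 // internal number assigned at indexing time
    Document hit = indexSearcher.doc(docId);  // fetch the Document from the data area
    System.out.println(hit.get("id") + ": " + hit.get("title")); // only Store.YES fields can be read back
}
indexSearcher.close(); // release the searcher when done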
Analyzer (word splitter)
Word-segmentation rules:
indexing and searching must use the same segmentation rules.
MaxFieldLength
LIMITED - 10,000 terms, the default
UNLIMITED - no limit (Integer.MAX_VALUE)
When building the catalog, only the first n words of each field are processed.
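A small sketch of the two options (Lucene 3.0.x constructor, reusing the directory and analyzer variables from above):

// UNLIMITED: index every term of every field
IndexWriter indexWriter = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
// or LIMITED: cap each field at the first 10,000 terms
// IndexWriter indexWriter = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.LIMITED);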
08-Scenarios for using Lucene in web applications
What problems arise when a web application adds to, deletes from and modifies both the database and the index library?
Both the database and the index library hold the article's information: 1. storage space may be wasted; 2. the two states must be kept in sync.
1. Is it a waste?
The principle for storing data in the index library: store the data that needs to be searched. For example, if the database has an author table and an article table, only the article data goes into the index library; you can search for articles by author, but you cannot look up the author's detailed information there. The duplicated data is only the data that has to be searchable.
One search returns all the display data, reducing the pressure on the database.
The index library is necessary for full-text search. To avoid storing duplicate information, could we drop the database and keep all the data in the index library?
No! Some database functions cannot be provided by the index library, e.g. transaction management.
2. Schemes that keep the index library consistent with the data source:
(1) Scheme 1: whenever the database is added to, deleted from or modified, do the same operation on the index library. What is the problem? If the program already exists and belongs to someone else, you may not be allowed to modify it, yet you still want to add a search function - for example, Baidu organizes other people's web pages into a searchable index.
(2) Scheme 2: when you cannot control the data source, grab data from it periodically - the crawler concept - and periodically rebuild the index library (or synchronize it with the data source). Sometimes synchronizing is no faster than rebuilding, unless the index library is small. For web pages, a crawler analyzes whether a page has newer data, e.g. by taking an MD5 digest of the page content (a digest is irreversible, and different content yields a different digest in practice); compare digests and update the index library only when they differ. Even rebuilding millions of documents is relatively fast.
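A minimal sketch of the MD5 change-detection idea (java.security.MessageDigest is the standard API; the method name and usage are illustrative):

import java.security.MessageDigest;

// digest the page content; if the hex string differs from the one stored for this
// URL last time, the page has changed and its entry in the index library is refreshed
public static String md5(String pageContent) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] digest = md.digest(pageContent.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
        hex.append(String.format("%02x", b)); // two lowercase hex digits per byte
    }
    return hex.toString();
}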
Vertical search: deeper, domain-specific analysis - e.g. Taobao product search can find products by attributes such as price.
Adopt different schemes for different data sources:
1. Data source is a database: scheme 1 or scheme 2.
2. Data source is web pages: scheme 2 (the popular "vertical search").
3. Data source is files: use JNI (calling C or C++ from Java) to get operating-system information promptly; use scheme 2.
E.g.: use Struts2 to add, delete, modify and query forum posts.

ArticleAction {
    add() {
        // form -> article
        dao.save(article);        // save to the database
        indexDao.save(article);   // save to the index library
    }
    delete() {
        id = getParam("id");
        dao.delete(id);           // delete from the database
        indexDao.delete(id);      // delete from the index library
    }
    modify() {
        article = getById(id);    // form -> article
        dao.update(article);      // update the database
        indexDao.update(article); // update the index library
    }
}
The IndexDao adds to, deletes from, modifies and queries the index library:

IndexDao {
    save(article)
    delete(id)
    update(article)
    search(str)
    ...
}
09-Implementation of IndexDao (1)
The article's id must also be stored in the index library, and it must be stored without word splitting: it is the unique identifier and pinpoints the article exactly - for example, it is used to find and delete the article in the index library by id.
For converting the int id to a String, use Lucene's own utility methods, which encode the int's bytes compactly. Calling toString() directly turns 4 bytes of data into a dozen or so bytes, wasting space and making sorting hard.
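A sketch of this id handling in Lucene 3.0.x (NumericUtils is Lucene's own utility; indexWriter, doc and the Article accessors are as in the earlier snippets):

import org.apache.lucene.index.Term;
import org.apache.lucene.util.NumericUtils;

// encode the int id compactly and sortably, and store it without word splitting
String idStr = NumericUtils.intToPrefixCoded(article.getId());
doc.add(new Field("id", idStr, Store.YES, Index.NOT_ANALYZED));

// later, the unique id term pinpoints the article, e.g. to delete it from the index library
indexWriter.deleteDocuments(new Term("id", NumericUtils.intToPrefixCoded(id)));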
Ctrl + T: view the inheritance hierarchy (Eclipse shortcut)
12-Implement IndexDao (4): manage a singleton IndexWriter
If an IndexWriter is initialized but never closed while in use, its in-memory cache may never be flushed to disk and its resources are not released. In a web application, create the IndexWriter in an application-scope listener and close it before the application exits; in a plain Java program, close the IndexWriter before the JVM exits, i.e. register a piece of code to run just before the JVM shuts down.
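A minimal sketch of such a singleton (the class name LuceneUtils matches its later uses in these notes; the shutdown hook is the standard JVM mechanism for "code to run before the JVM exits"; the index path is assumed):

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneUtils {
    private static IndexWriter indexWriter;

    static {
        try {
            Directory directory = FSDirectory.open(new File("./indexDir"));
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.UNLIMITED);
            // register code to run just before the JVM exits:
            // flush the buffers to disk and release the write lock
            Runtime.getRuntime().addShutdownHook(new Thread() {
                public void run() {
                    try {
                        indexWriter.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static IndexWriter getIndexWriter() {
        return indexWriter;
    }
}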
13-Optimizing the index library: merging files
The indexDir folder contains many files, some of which are merely marked as deleted. Why mark them instead of deleting them directly? Because the folder is large, and physically deleting on every frequent modification, addition or removal would be inefficient; the marked files are actually removed (merged away) when Lucene is not busy. How are the marked-deleted documents kept out of search results? All the data is fetched first, and the deleted entries are then filtered out. After optimization there are no .del files; optimization merges several small files into one large file, reducing IO operations.
Optimization means merging files: several small files are merged into one large file.
When are files merged?
In batch operations - e.g. rebuilding the index library every day after a crawl - many small files are merged into one large file and placed in the index library. When the number of files with the same extension reaches a threshold (default 10, minimum 2), they are merged automatically: automatic optimization.
@Test
public void test() throws Exception {
    // merge the index files explicitly
    LuceneUtils.getIndexWriter().optimize();
}

// automatic file merging
@Test
public void testAuto() throws Exception {
    // when the number of small files reaches this threshold, they are automatically
    // merged into one large file; the default is 10, the minimum is 2
    LuceneUtils.getIndexWriter().setMergeFactor(5);
    // build an index entry
    Article article = new Article();
    article.setId(1);
    article.setTitle("Preparing the development environment for Lucene");
    article.setContent("If the retrieval system only went out to the Internet to look for answers after a search request arrived, it could not return results within a limited time.");
    new ArticleIndexDao().save(article);
}
07-Implementation scheme of a web crawler (1)
Crawler: its function is to grab web pages.
How to download all the pages of a website?
Scheme:
0. Initial condition: the home page
1. Download a page and get its content
2. Extract all the hyperlinks in it
3. Remove already-downloaded links and off-site links
4. Loop back to step 1 for every remaining valid hyperlink; stop when no new links appear
Technique:
How to download a URL: URLConnection (built on Socket)
HTTP protocol: send a request, read the response's entity content
1. Download the web page

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

public static String downLoad(String urlString) throws Exception {
    InputStream in = new URL(urlString).openConnection().getInputStream(); // the data in this stream is the page content
    Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A"); // the \A delimiter reads the whole stream as one token
    String content = scanner.hasNext() ? scanner.next() : "";
    scanner.close(); // also closes the underlying stream
    return content; // assumes the page is UTF-8-encoded
}
2. Extract all the hyperlinks
DOM + XPath
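A minimal DOM + XPath sketch of link extraction (JDK javax.xml API only; it assumes the downloaded page is well-formed XHTML - real-world HTML usually needs a lenient HTML parser first):

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public static List<String> extractLinks(String pageContent) throws Exception {
    // parse the page into a DOM tree (the default factory is not namespace-aware,
    // so the plain element name "a" matches)
    org.w3c.dom.Document dom = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(pageContent)));
    // select the href attribute of every <a> element
    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", dom, XPathConstants.NODESET);
    List<String> links = new ArrayList<String>();
    for (int i = 0; i < hrefs.getLength(); i++) {
        links.add(hrefs.item(i).getNodeValue());
    }
    return links;
}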
3. Remove already-downloaded links and off-site links
Keep all downloaded links in a database or in a collection, and check whether each newly found link is already contained in it.
To do this better: when many threads access the network, a few requests always fail for network reasons, so a link may be retried up to 3 times.
Multiple threads keep taking tasks from a task queue.
The task queue (the work still to be done) is kept in the database, so a power outage does not matter; the threads query tasks from the database and delete them when done. The queue form solves the problems of recursion: memory overflow, loss of state on power outage, and the inability to use multithreading.
In Java a LinkedList can represent the queue:
addFirst() / removeLast()
addLast() / removeFirst()
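A tiny sketch of the queue idea (single-threaded here; a real multi-threaded crawler would need synchronization or a BlockingQueue, and the download/extract calls are placeholders):

import java.util.LinkedList;

LinkedList<String> taskQueue = new LinkedList<String>();
taskQueue.addLast("http://example.com/"); // 0. initial condition: the home page
while (!taskQueue.isEmpty()) {
    String url = taskQueue.removeFirst(); // take the next task from the head of the queue
    // 1. download(url); 2. extract its hyperlinks;
    // 3. addLast() every new, not-yet-downloaded, on-site link
}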
Using design patterns, MVC and inheritance:
there are more classes and the relationships become more complex, but
1. it suits more complex and demanding situations;
2. the code is more readable;
3. the structure is reasonable and easy to modify;
4. it is easy to extend and highly maintainable.