2025-02-28 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains the advantages and disadvantages of ElasticSearch and Solr.
Carefully drawn and provided by ReyCG
What is full-text search?
What is a full-text search engine? The definition in Baidu Encyclopedia:
A full-text search engine is the mainstream type of search engine in wide use today. It works by having an indexing program scan every word in an article and build an index entry for each word, recording how many times and where the word appears in the article. When a user issues a query, the retrieval program searches the pre-built index and returns the matching results to the user. The process is similar to looking up a word through the lookup table of a dictionary.
The definition gives a rough idea of full-text retrieval. To explain it in more detail, let's start with the data in everyday life.
The data in our lives are generally divided into two types:
Structured data: data with a fixed format or limited length, such as databases, metadata, etc.
Unstructured data: also known as full-text data, this is data with variable length or no fixed format, such as email, Word documents, etc.
Of course, there is also a third kind, semi-structured data, such as XML and HTML. It can be processed as structured data when needed, or its plain text can be extracted and processed as unstructured data.
Corresponding to the two kinds of data, search is also divided into two kinds: structured data search and unstructured data search.
For structured data, we can generally store and search it in the tables of a relational database (MySQL, Oracle, etc.), and we can also build indexes on it.
There are two main ways to search unstructured data, that is, full-text data:
Sequential scanning
Full-text retrieval
Sequential scanning: as the name suggests, this means searching for specific keywords by scanning the text in order.
For example, suppose you are given a newspaper and asked to find every place where the word "RNG" appears. You have to scan the newspaper from beginning to end and mark each place where the keyword appears.
This is undoubtedly the most time-consuming and inefficient method. If the newspaper is set in small type and has many pages, or there are many newspapers, your eyes will be worn out by the time you finish scanning.
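The newspaper example can be sketched in a few lines of Python. This is only an illustration: the function name `sequential_scan` and the sample pages are made up here, and each "page" is simply a string.

```python
def sequential_scan(pages, keyword):
    """Return (page_number, offset) for every occurrence of keyword.

    Every page is read from start to end, so the cost grows with the
    total amount of text, no matter how rare the keyword is.
    """
    hits = []
    for page_number, text in enumerate(pages, start=1):
        start = text.find(keyword)
        while start != -1:
            hits.append((page_number, start))
            start = text.find(keyword, start + 1)
    return hits


# Hypothetical sample data: three short "pages" of a newspaper.
newspaper = [
    "RNG wins the opening match.",
    "Weather: sunny all week.",
    "Fans of RNG celebrate; RNG advances.",
]
hits = sequential_scan(newspaper, "RNG")
print(hits)
```

Note that the second page is still read in full even though it contains no match, which is exactly the waste that an index is meant to avoid.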
Full-text search: sequential scanning of unstructured data is very slow. Can we optimize it? Why not find a way to give our unstructured data some structure?
We extract part of the information from the unstructured data and reorganize it so that it has a certain structure, and then search that structured data, which makes the search relatively fast.
This is the basic idea of full-text retrieval. The information that is extracted from unstructured data and then reorganized is called an index.
Take reading newspapers as an example again. Suppose we want to follow the news of the League of Legends S8 global finals and we are all RNG fans. How can we quickly find the newspapers and sections that carry RNG news?
The full-text retrieval approach is to extract keywords from every section of every newspaper, such as "EDG", "RNG", "FW", "team", "League of Legends" and so on.
We then index these keywords, and through the index we can look up the newspapers and sections where a keyword appears. Note that this is different from a directory-based search engine.
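As a rough sketch of the idea, the keyword index can be modeled as a dictionary from each word to the set of (newspaper, section) pairs in which it appears. The names `build_index` and `articles`, and the sample data, are assumptions made for illustration; real engines use far more careful tokenization.

```python
from collections import defaultdict


def build_index(articles):
    """Map each lowercase word to the set of (newspaper, section) it appears in."""
    index = defaultdict(set)
    for newspaper, section, text in articles:
        for word in text.lower().split():
            # Strip simple punctuation so "finals." and "finals" match.
            index[word.strip('.,!?";:')].add((newspaper, section))
    return index


# Hypothetical sample data: (newspaper, section, text).
articles = [
    ("Daily A", "Esports", "RNG beat EDG in the finals."),
    ("Daily A", "Weather", "Rain is expected tomorrow."),
    ("Daily B", "Sports", "The RNG team arrives in Seoul."),
]
index = build_index(articles)

# One dictionary lookup replaces a scan of every newspaper.
rng_locations = sorted(index["rng"])
print(rng_locations)
```

The indexing work is paid once up front; after that, finding where "RNG" appears does not require reading any article again.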
Why use a full-text search engine
Earlier, a colleague asked me why we use a search engine at all. All our data is in the database, and Oracle, SQL Server and other databases also provide query, retrieval and cluster analysis functions. Can't we just query the database directly?
Indeed, most of our query needs can be met by database queries. If queries are slow, we can improve efficiency by building database indexes, optimizing SQL and so on, and even speed up responses by introducing caches.
If the amount of data grows further, we can split it across databases and tables to share the query load. Then why do we need a full-text search engine? We mainly analyze the following reasons:
Data type
Full-text indexing supports searching unstructured data, and can quickly find any word or phrase in a large volume of unstructured text.
For example, site searches of the Google and Baidu type build an index from the keywords in each page. When we enter a keyword, they return all the pages that the index matches for that keyword. Log search in application projects is another common case.
Relational databases do not support searching this kind of unstructured text well.
Index maintenance
In general, full-text retrieval is very difficult in traditional databases, because people generally do not use a database to store text fields.
Full-text retrieval there requires scanning the entire table, and when there is a large amount of data, even optimizing the SQL has little effect.
Even if an index is built, it is cumbersome to maintain, because insert and update operations force the index to be rebuilt.
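The full-table-scan point can be observed on a small scale. The sketch below uses SQLite through Python's built-in `sqlite3` module as a stand-in for the databases mentioned above (an assumption; the article does not name SQLite), and asks the engine for its query plan. The table name `docs` and the helper `plan` are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE INDEX idx_body ON docs (body)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [("RNG wins",), ("EDG loses",), ("the RNG team",)],
)


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


# A keyword anywhere in the text: the B-tree index cannot narrow this down.
like_plan = plan("SELECT id FROM docs WHERE body LIKE '%RNG%'")
# An exact match on the whole value: the index is usable.
exact_plan = plan("SELECT id FROM docs WHERE body = 'RNG wins'")

print(like_plan)
print(exact_plan)
```

In SQLite the first plan is reported as a scan and the second as an indexed search, which mirrors the argument above: an ordinary column index helps exact or prefix lookups, not "find this word somewhere in the text".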
When to use a full-text search engine:
The data to be searched is a large amount of unstructured text.
The number of document records reaches hundreds of thousands, millions or more.
A large number of interactive text-based queries must be supported.
Very flexible full-text search queries are required.
Highly relevant search results are required, and no available relational database can provide them.
There is relatively little need for different record types, non-text data manipulation, or secure transaction processing.
Lucene, Solr, ElasticSearch?
The mainstream search engines today are probably Lucene, Solr and ElasticSearch.
Their indexes are all built on the inverted index. What is an inverted index?
Wikipedia: an inverted index, also known as a reverse index, postings file, or inverted file, is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.
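To make the "mapping from a word to its locations" concrete, here is a minimal sketch of a positional inverted index, with a phrase search built on top of it. It is a toy under stated assumptions: whitespace tokenization, everything in memory, and made-up names (`build_positional_index`, `phrase_search`). It is not how Lucene stores its index.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for whitespace-tokenized text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return sorted doc ids in which the terms of phrase occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    results = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The i-th term must sit exactly i positions after the first one.
            if all(start + i in index[term].get(doc_id, [])
                   for i, term in enumerate(terms)):
                results.append(doc_id)
                break
    return sorted(results)


# Hypothetical sample documents.
docs = {
    1: "full text search engine",
    2: "search the full archive of text",
    3: "a full text index",
}
index = build_positional_index(docs)
print(index["text"])
print(phrase_search(index, "full text"))
```

Storing positions, not just document ids, is what lets the index answer phrase and proximity queries: document 2 contains both words but not next to each other, so it is not a phrase match.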
Lucene
Lucene is a full-text search engine written entirely in Java. Lucene is not a complete application, but a code library and API that can easily be used to add search capabilities to an application. Lucene provides powerful features through a simple API:
Scalable, high-performance indexing:
Over 150GB / hour on modern hardware.
Small RAM requirements: only 1MB heap.
Incremental indexing as fast as batch indexing.
Index size roughly 20-30% the size of the text indexed.
Powerful, accurate and efficient search algorithm:
Ranked searching: the best results are returned first.
Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries, etc.
Fielded searching (e.g. title, author, contents).
Sorting by any field.
Multiple-index searching with merged results.
Simultaneous update and searching.
Flexible faceting, highlighting, joins and result grouping.
Fast, memory-efficient and typo-tolerant suggesters.
Pluggable ranking models, including the vector space model and Okapi BM25.
Configurable storage engine (codecs).
Cross-platform solutions:
Available as open source software under the Apache license, which lets you use Lucene in both commercial and open source programs.
100% pure Java.
Implementations in other programming languages are available that are index-compatible.
Apache Software Foundation:
Lucene is supported by the Apache community of open source software projects provided by the Apache Software Foundation.
But Lucene is just a framework. To make full use of its functionality, you need to use Java and integrate Lucene into your program.
It takes a lot of study to understand how it works, and using Lucene proficiently is genuinely complicated.
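To give a feel for what a ranking model such as Okapi BM25 (named in the feature list above) actually computes, here is a sketch of the textbook BM25 formula in Python. It is not Lucene's implementation, and Lucene's own scoring may differ in details; the function name `bm25_score`, the sample corpus, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are assumptions for illustration.

```python
import math


def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenized document for a list of query terms."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        if df == 0:
            continue
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc.count(term)  # occurrences of the term in this document
        # Term frequency saturates (k1) and is normalized by document length (b).
        norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Hypothetical sample corpus, already tokenized.
corpus = [
    "rng wins the final".split(),
    "the weather is sunny".split(),
    "rng rng rng fans cheer".split(),
]
scores = [bm25_score(["rng"], doc, corpus) for doc in corpus]
print(scores)
```

The document that mentions the term three times scores higher than the one that mentions it once, but by much less than a factor of three, and the document without the term scores zero. "Return the best results first" then simply means sorting by this score.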
Solr
Apache Solr is an open source search platform built on the Java library Lucene. It exposes Apache Lucene's search capabilities in a user-friendly way.
Having been an industry player for nearly a decade, it is a mature product with a strong and extensive user community.
It provides distributed indexing, replication, load-balanced querying, and automatic failover and recovery. If it is properly deployed and well managed, it can be a highly reliable, scalable and fault-tolerant search engine.
Many Internet giants, such as Netflix, eBay, Instagram and Amazon (CloudSearch), use Solr because it can index and search multiple sites.
The main feature list includes:
Full-text search
Highlighting
Faceted search
Real-time indexing
Dynamic clustering
Database integration
NoSQL features and rich document processing (such as Word and PDF files)
ElasticSearch
Elasticsearch is an open source (originally Apache 2 licensed) RESTful search engine built on the Apache Lucene library.
Elasticsearch was launched a few years after Solr. It provides a distributed, multi-tenant full-text search engine with an HTTP web interface (REST) and schema-free JSON documents.
Official Elasticsearch client libraries are available for Java, Groovy, PHP, Ruby, Perl, Python, .NET and JavaScript.
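Because the interface is plain HTTP and JSON, a search can be expressed without any client library at all. The sketch below only builds the request and does not send it. The host `http://localhost:9200`, the index name `articles`, the field name `content` and the helper `build_search_request` are all assumptions for illustration; the `_search` endpoint and the `match` query are part of Elasticsearch's search API.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build (but do not send) a POST request to the index's _search endpoint."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index and field names.
request = build_search_request("http://localhost:9200", "articles", "content", "RNG")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a cluster actually running at that address, passing `request` to `urllib.request.urlopen` would return the hits as a JSON document. In practice the official client libraries wrap this same HTTP exchange.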
In this distributed search engine, an index can be divided into shards, and each shard can have multiple replicas.
Each Elasticsearch node can hold one or more shards, and a node can also act as a coordinator, delegating operations to the correct shards.
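The coordinator's job of "delegating operations to the correct shards" can be sketched as hashing a document id to a shard number and then looking up which node holds that shard. This is a simplified model, not Elasticsearch's code: `zlib.crc32` stands in for whatever hash function the engine really uses, replicas are ignored, and the function and node names are hypothetical.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a shard deterministically from the document id (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def assign_shards_to_nodes(num_primary_shards, nodes):
    """Place shards on nodes round-robin, so one node may hold several shards."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_primary_shards)}


def route(doc_id, num_primary_shards, placement):
    """What a coordinating node does: find the shard, then the node that holds it."""
    shard = shard_for(doc_id, num_primary_shards)
    return shard, placement[shard]


# Hypothetical cluster: 3 primary shards spread over 2 nodes.
placement = assign_shards_to_nodes(3, ["node-a", "node-b"])
shard, node = route("article-42", 3, placement)
print(placement)
print(shard, node)
```

Because the shard is derived from the id modulo the number of primary shards, the same document always lands on the same shard. It also suggests why the number of primary shards is not something to change casually: a different modulus would send existing ids to different shards.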
Elasticsearch is scalable and provides near real-time search. One of its main features is multi-tenancy. The main feature list includes:
Distributed search
Multi-tenancy
Analytics search
Grouping and aggregation
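Grouping and aggregation, the last item in the list, means that a query returns not only the matching documents but also counts per group over those matches. The pure-Python sketch below imitates that idea on a handful of in-memory records. It is a toy with made-up names (`search_and_aggregate`, `logs`) and naive substring matching, not a model of how Elasticsearch executes aggregations.

```python
from collections import Counter


def search_and_aggregate(docs, keyword, group_field):
    """Filter docs whose message contains keyword, then count matches per group."""
    matches = [d for d in docs if keyword.lower() in d["message"].lower()]
    buckets = Counter(d[group_field] for d in matches)
    return matches, buckets


# Hypothetical log records.
logs = [
    {"message": "Timeout contacting db", "level": "ERROR", "service": "api"},
    {"message": "timeout on retry", "level": "WARN", "service": "api"},
    {"message": "Started", "level": "INFO", "service": "worker"},
    {"message": "Timeout contacting cache", "level": "ERROR", "service": "worker"},
]
matches, buckets = search_and_aggregate(logs, "timeout", "level")
print(len(matches), dict(buckets))
```

Changing `group_field` to `"service"` gives a different breakdown of the same matches, which is the kind of question log analysis asks all the time.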
Choosing between Elasticsearch and Solr
Because of Lucene's complexity, it is rarely the first choice for search, except for companies that need to develop their own search framework and depend on Lucene at the lowest level.
So here we focus on Solr and Elasticsearch: which is better? What's the difference between them? Which one should you use?
Historical comparison
Apache Solr is a mature project with a large and active developer and user community, as well as the Apache brand.
First released as open source in 2006, Solr long dominated the search engine field and was the first choice for anyone who needed search capabilities.
Its maturity translates into rich functionality beyond simple text indexing and search, such as faceting, grouping, powerful filtering, pluggable document processing, pluggable search chain components, language detection, etc.
Solr dominated the search field for many years. Then, around 2010, Elasticsearch appeared as another option on the market.
At that time, it was far less stable than Solr, and it lacked Solr's functional depth, mindshare, brand, and so on.
Although Elasticsearch is young, it has advantages of its own. Elasticsearch is built on more modern principles, targets more modern use cases, and is designed to make it easier to handle large indexes and high query rates.
In addition, because it was so young and had no community to coordinate with, it was free to move forward without needing consensus or cooperation with others (users or developers), backward compatibility, or any of the other things more mature software usually has to deal with.
As a result, it exposed some very popular features before Solr did (for example, near real-time search, or NRT search).
Technically, the NRT search capability actually comes from Lucene, the underlying search library used by both Solr and Elasticsearch.
Ironically, because Elasticsearch exposed NRT search first, people associate NRT search with Elasticsearch.
Yet since Solr and Lucene are part of the same Apache project, one would have expected Solr to get such a highly demanded feature first.
Feature difference comparison
Both are popular, advanced open source search engines. Both are built around the same core search library, Lucene, but they are different.
Like everything, each has its advantages and disadvantages, and depending on your needs and expectations, either may be the better or the worse fit.
Both Solr and Elasticsearch are evolving rapidly, so, without further ado, let's take a look at their list of differences:
[Picture: feature-by-feature comparison of Solr and Elasticsearch]
Learn more: http://solr-vs-elasticsearch.com/
Comprehensive comparison
In addition, let's analyze it from the following aspects:
① Popularity trend in recent years
Let's take a look at the Google search trends for both products. Google Trends shows that Elasticsearch attracts more interest than Solr, but that doesn't mean Apache Solr is dead.
Although some people may think otherwise, Solr is still one of the most popular search engines, with strong community and open source support.
[Picture: Google Trends comparison of Elasticsearch and Solr]
② Installation and configuration
Compared with Solr, Elasticsearch is easy to install and very lightweight. You can install and run Elasticsearch in a few minutes.
However, this ease of deployment and use can become a problem if Elasticsearch is not managed properly.
JSON-based configuration is simple, but JSON has no comment syntax, so if you want to annotate each setting in the file, it is not for you.
In general, if your application uses JSON, then Elasticsearch is the better choice.
Otherwise, use Solr, because its schema.xml and solrconfig.xml are well documented.
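The remark about comments in JSON-based configuration is easy to check with Python's standard `json` module. The setting name `number_of_shards` is used here only as a plausible example key, and `try_parse` is a hypothetical helper.

```python
import json


def try_parse(text):
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


plain = try_parse('{"number_of_shards": 3}')
# The same setting with an inline note: strict JSON has no comment syntax.
commented = try_parse('{"number_of_shards": 3 // three primary shards\n}')

print(plain)
print(commented)
```

The annotated version fails to parse, so any explanation of a setting has to live somewhere other than the JSON itself, whereas XML files such as schema.xml can carry comments inline.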
③ Community
Solr has a larger and more mature community of users, developers, and contributors. ES has a smaller but active user community and a growing community of contributors.
Solr is a true representative of the open source community. Anyone can contribute to Solr, and new Solr developers (known as committers) are selected on merit.
Elasticsearch is technically open source, but less so in spirit. Anyone can see the source, and anyone can change it and offer a contribution, but only employees of the company behind Elasticsearch can actually commit changes to Elasticsearch.
Solr contributors and committers come from many different organizations, while Elasticsearch committers come from a single company.
④ Maturity
Solr is more mature, but ES is growing rapidly, and I consider it stable.
⑤ Documentation
Solr scores high here. It is a very well documented product, with clear examples and API use case scenarios.
Elasticsearch's documentation is well organized, but it lacks good examples and clear configuration instructions.
Summary
So, should you choose Solr or Elasticsearch? Sometimes it's hard to find a clear answer. Either way, you first need to understand your use cases and future requirements, and weigh the attributes of each product against them.
Keep in mind the following points:
Elasticsearch is more popular among new developers because of its ease of use. However, if you are already used to working with Solr, keep using it, because there is no specific advantage in migrating to Elasticsearch.
If you need it to handle analytical queries in addition to text search, Elasticsearch is the better choice.
If you need distributed indexing, choose Elasticsearch. It is the better choice for cloud and distributed environments that require good scalability and performance.
Both have good commercial support (consulting, production support, integration, etc.).
Both have good operational tools, although Elasticsearch attracts more DevOps people because of its easy-to-use API, so a livelier tool ecosystem has grown up around it.
Elasticsearch dominates the open source log management use case. Many organizations index their logs in Elasticsearch to make them searchable. Solr can now be used for this purpose too, but it has missed the mindshare here.
Solr is still more text-search-oriented. Elasticsearch, on the other hand, is often used for filtering and grouping, that is, analytical query workloads, not necessarily text search.
Elasticsearch developers have put a lot of effort into making such queries more efficient (reducing memory footprint and CPU usage) at both the Lucene and Elasticsearch levels.
Therefore, Elasticsearch is the better choice for applications that require not only text search but also complex search-time aggregations.
Elasticsearch is easier to get started with: one download and one command get everything running. Solr has traditionally required more work and knowledge, but it has recently made great strides in eliminating this, and now it only needs to work on changing its reputation.
In terms of performance, they are roughly the same. I say "roughly" because no one has done a comprehensive and unbiased benchmark. For 95% of use cases, either option will be fine in terms of performance, and the remaining 5% need to test both solutions with their specific data and specific access patterns.
Operationally speaking, Elasticsearch is a bit simpler to work with, since it has only one process. Solr, in its Elasticsearch-like fully distributed deployment mode SolrCloud, depends on Apache ZooKeeper. ZooKeeper is super mature, super widely used, and so on, but it is still another moving part.
That said, if you are using Hadoop, HBase, Spark, Kafka or some other newer distributed software, you are probably already running ZooKeeper somewhere in your organization.
Although Elasticsearch has a built-in ZooKeeper-like component called Zen, ZooKeeper is better at preventing the dreaded split-brain problem that sometimes occurs in Elasticsearch clusters.
To be fair, Elasticsearch developers are aware of this problem and are working on improving this aspect of Elasticsearch.
If you love monitoring and metrics, then with Elasticsearch you will be in heaven. This thing has more metrics than the number of people you can squeeze into Times Square on New Year's Eve! Solr exposes the key metrics, but not nearly as many as Elasticsearch.
In short, both are feature-rich search engines, and as long as they are properly designed and implemented, they can provide more or less the same performance.
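The split-brain problem mentioned in the operational notes above comes down to a simple majority rule: a group of nodes should only elect a master if it can see more than half of the master-eligible nodes. The sketch below shows just that arithmetic. It is a general illustration with hypothetical function names, not the algorithm used by Zen or by ZooKeeper.

```python
def quorum(master_eligible_nodes):
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_nodes, master_eligible_nodes):
    """A partition may elect a master only if it holds a majority."""
    return visible_nodes >= quorum(master_eligible_nodes)


# A 5-node cluster is cut by a network fault into groups of 3 and 2.
majority_side = can_elect_master(3, 5)
minority_side = can_elect_master(2, 5)
print(quorum(5), majority_side, minority_side)
```

With the rule in place, only one side of a partition can ever have a master, so the two halves cannot accept conflicting writes. It also shows why a two-node cluster gains no safety from the rule: its quorum is 2, so losing either node stops elections.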