
What are the similarities and differences between Solr and ElasticSearch

2025-01-22 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains the similarities and differences between Solr and ElasticSearch: what they share, where their designs diverge, and what each one offers that the other does not.

Similarities

Both are based on Lucene

This may sound like stating the obvious, because anyone who knows these two products knows that both Solr and ElasticSearch are built on Lucene. But the statement actually carries two meanings.

The first is that both expose the features Lucene supports, with slight differences in concept and usage. Lucene is already such a powerful engine that it implements most of the functions we see in search engines. Tokenization at index time, faceting at query time, highlighting, spell checking, search suggestions, similar-document lookup and so on are all implemented in Lucene; Solr and ElasticSearch merely wrap them. Lucene has not only the inverted index but also TermVector, a per-document forward index, and DocValues, a columnar structure. Different data structures serve different search requirements. Take sorting by a column: with only an inverted index, the engine has to get the ID of each matching doc and then seek to retrieve each column value before it can sort. The DocValues column store can extract the column values of the matching documents quickly and with fewer disk seeks. Similarly, TermVector records the position of each term within a doc, which is the basis for keyword highlighting.
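The division of labor between the inverted index and a column store can be sketched in a few lines of Python. This is a toy model, not Lucene's actual data layout: the documents, the `price` field and the function name are made up for illustration.

```python
from collections import defaultdict

# Three toy documents; "price" is the column we want to sort by.
docs = {
    0: {"title": "solr guide", "price": 30},
    1: {"title": "elasticsearch guide", "price": 10},
    2: {"title": "lucene guide", "price": 20},
}

# Inverted index: term -> list of doc IDs that contain the term.
inverted = defaultdict(list)
for doc_id, doc in docs.items():
    for term in doc["title"].split():
        inverted[term].append(doc_id)

# Column store (DocValues-like): one dense array, indexed by doc ID.
price_column = [docs[doc_id]["price"] for doc_id in sorted(docs)]


def search_sorted_by_price(term):
    """Find docs containing `term`, then sort them using the column array."""
    hits = inverted.get(term, [])
    return sorted(hits, key=lambda doc_id: price_column[doc_id])


result = search_sorted_by_price("guide")
print(result)  # [1, 2, 0]
```

The inverted index answers "which docs match", while the column array answers "what is the value for this doc" with a single array lookup instead of loading the whole document.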

The second meaning is that Solr and ElasticSearch each have to find their own way to provide what Lucene cannot do. For example, Lucene does not support partial updates to a Document. Since both products are a mix of database and search engine, it would make no sense for them not to support partial updates, so both implement the logic of reading the original document first and then writing the merged result. Reading the original document requires the original data to be available, so partial updates in Solr require every field in the schema to be either stored or docValues. ElasticSearch is more radical: every Document has a _source field by default that stores the original document, so partial updates work without any configuration.

Another thing Lucene cannot do is make a document visible immediately. A Lucene Document must be committed into a Segment before it can be searched. For a search engine this is understandable, and it is quite normal for written data to become searchable only after a while. But many people use Solr and ElasticSearch as databases, and for a database it is unacceptable if written data is not immediately visible. So both Solr and ElasticSearch implement a "Realtime Get" feature: data that has just been written can be fetched immediately by id. The principle is simple. Solr and ES write a log in addition to the index, and a Get request can look in the log for data that is not yet visible in Lucene.

Another interesting example is paging. Before Lucene 3.5 there was no SearchAfter interface. To page through results, early versions of Solr could only fetch everything up to the requested page and then skip the first offset entries. You can imagine how poor performance became when a user paged deep. Once Lucene supported SearchAfter, that is, searching from a given document onward, Solr added its Cursor feature, which marks the last document of the previous page so that the next page starts right after it. This greatly improved paging.
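The difference between the two paging styles can be sketched as follows. This is a simplified model under stated assumptions: the hit list, `page_by_offset` and `page_after` are hypothetical names, and a binary search over an in-memory list stands in for "searching from a given document".

```python
import bisect

# Hits in sort order. Each entry is (sort_value, doc_id); doc_id breaks ties
# so that every hit has a unique position in the ordering.
HITS = sorted((i % 7, i) for i in range(100))


def page_by_offset(offset, size):
    """Offset paging: collect offset + size hits, then discard the first `offset`."""
    collected = HITS[: offset + size]  # work grows with the page depth
    return collected[offset:]


def page_after(cursor, size):
    """Cursor paging: resume right after the last hit of the previous page."""
    start = 0 if cursor is None else bisect.bisect_right(HITS, cursor)
    page = HITS[start : start + size]
    next_cursor = page[-1] if page else cursor
    return page, next_cursor


# Walk the whole result set with the cursor and compare with offset paging.
cursor, seen, offset = None, [], 0
while True:
    page, cursor = page_after(cursor, 10)
    if not page:
        break
    assert page == page_by_offset(offset, 10)
    seen.extend(page)
    offset += 10
```

Both functions return the same pages, but offset paging touches `offset + size` hits on every request, while cursor paging only needs the sort key of the last hit it returned.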

Not just search

Beyond the search capabilities that come from Lucene, Solr and ElasticSearch provide rich capabilities of their own. One example is ElasticSearch's Aggregations. ElasticSearch must be very confident in them, because in the official documentation Aggregations is the first chapter after the basic concepts, installation and deployment. Users can treat ElasticSearch as an analytics engine and compute avg, sum, count and many other aggregations. Solr offers the Analytics Component. I have not used the analytics features of either, so I cannot say which is more powerful. ElasticSearch also provides powerful scripting, which lets you write complex evaluation functions in the scripting languages ES supports. Solr has Streaming Expressions (streamExpression), which define a rich syntax and set of functions so that users can write expressions directly as queries. And of course the much-loved SQL is there too: both Solr and ElasticSearch support SQL and JDBC access. The SQL features each supports may differ, which I do not examine here.
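As a rough illustration of the avg, sum and count aggregations mentioned above, the sketch below builds an ElasticSearch search body in Python. The index name `orders` and the field `price` are assumptions made up for this example; only the request body is constructed, nothing is sent to a cluster.

```python
import json

# Hypothetical index name, used only for illustration.
INDEX = "orders"

request_body = {
    "size": 0,  # return only aggregation results, no individual hits
    "aggs": {
        "avg_price": {"avg": {"field": "price"}},
        "total_price": {"sum": {"field": "price"}},
        "order_count": {"value_count": {"field": "price"}},
    },
}

# This is what would be sent as the body of a search request on the index.
payload = json.dumps(request_body, indent=2)
print(f"GET /{INDEX}/_search")
print(payload)
```

Each entry under `aggs` is a name chosen by the caller, mapped to an aggregation type and the field it runs on.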

Distributed architecture and high availability

Solr (SolrCloud) and ElasticSearch are both distributed systems. They shard an index and distribute the shards across servers to scale out, and both let you configure multiple replicas per shard for high availability. The replicas of a shard follow a master-slave model, that is, one writer and many readers, and replication between master and slaves is synchronous. A write must therefore go to the primary replica, while a read can pick any replica and still see the latest data. Both also support cross-cluster replication, with slightly different implementations: Solr pushes, while ElasticSearch pulls. Solr supports two-way replication and can therefore run active-active, while ElasticSearch supports only one-way replication from a master cluster to a slave cluster. Common database operations features such as snapshots and backup & restore are supported by both.

Differences

Different distributed design philosophies

Solr started out as a standalone product, not a distributed one. The distributed architecture arrived with SolrCloud, and Solr chose Zookeeper as its coordinator. Every node in Solr is a peer, and Zookeeper is used mainly to store shard routing information and to run leader election, both among the replicas of a shard and for the overseer node. The only distinct role in Solr is the overseer, which is the equivalent of a master in a Solr cluster. All admin operations, such as creating a Collection, go through the overseer node. The overseer can also carry out preset actions when nodes are lost or added, according to the configuration of the Autoscaling framework. For example, when a node is lost, new replicas are automatically added for the replicas that lived on it, keeping the number of available replicas at a fixed value. The Autoscaling framework was added in Solr 7.0, and only from that point on does Solr have automated operations in any real sense. Before that, Solr's automation was relatively weak, and even balancing a cluster had to be done manually.

ElasticSearch has been designed as a distributed system from birth. Unlike Solr, it does not rely on another product for distributed coordination but developed its own protocol, Zen Discovery. ElasticSearch therefore does not depend on a Zookeeper cluster the way SolrCloud does. Zen Discovery handles node discovery, master election, broadcasting and so on. Compared with Solr, ElasticSearch has a richer set of node roles. In addition to master nodes, which are similar to Solr's overseer, you can configure ingest nodes that only run complex ingest pipelines (tokenization, transformation and so on) during index writes, and nodes that store no data and are responsible only for coordination (accepting user requests, forwarding them to the right replicas, and aggregating the results back to the client). Solr can also use Autoscaling to configure a node without any replicas and get the effect of a coordination node, but that configuration is much more complicated. ElasticSearch has strong automated operations: with simple configuration you get automatic cluster balancing and similar operations. When a node goes down, the master node promotes a slave replica to primary and adds a replica to keep the number of available replicas constant. These are default behaviors in ElasticSearch, whereas in Solr you have to configure them through the Autoscaling framework.

Solr can be deeply customized, while ElasticSearch emphasizes working out of the box

Recently I have used both the Solr and the ElasticSearch clients, and my impression is that ElasticSearch is easier to use. Many features are already built into ElasticSearch and do not need to be defined through configuration. For example, to define a field called name of type string in Solr, I need to configure two things in managed_schema (XML): a fieldType that maps the type name string to a Solr class, and a field that uses that type.

In other words, even Solr's default field types must declare which Solr class implements them. The fieldType entry defines that the type string uses the class solr.StrField, and only then can I give a field such as id or name the type string. Could I mislead people by defining string as solr.IntPointField (an int type)? Yes, of course. In ElasticSearch the types are predefined, and all we need is a JSON mapping that specifies each field's type (keyword is a string that is not tokenized).

PUT my_index
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" }
    }
  }
}

Solr's configuration feels more geeky because it usually configures Java classes directly. ElasticSearch is better packaged, and all the types are prebuilt. Solr's read and write paths can be deeply customized: you can add all kinds of Processors and Components to them to add different functions. You can even define the handler class that processes your request.

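<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>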

In the configuration above, requests to the "/terms" path are handled by the solr.SearchHandler class, with the terms component (solr.TermsComponent) added to the handler's list of components. If you like, you can give each Collection completely different behavior for the "/terms" path.

Similarly, an updateRequestProcessorChain in the configuration defines the Processors a document passes through when it is written (if you are familiar with HBase, you can think of them as HBase Coprocessors).

The entire read and write path of Solr is defined through this configuration file. Customization is so free that a bad configuration can break the normal read and write path. Solr therefore warns in its documentation that write requests will not actually be executed if the updateRequestProcessorChain you use does not include RunUpdateProcessorFactory.
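Why a chain without its final step silently drops writes can be shown with a toy model. This is not Solr's API: the classes `Processor`, `LowercaseName` and `RunUpdate` and the helper `build_chain` are hypothetical, with `RunUpdate` standing in for the role that RunUpdateProcessorFactory plays at the end of a chain.

```python
class Processor:
    """One step in a write chain; passes the doc on to the next step, if any."""

    def __init__(self, next_processor=None):
        self.next = next_processor

    def process(self, doc):
        if self.next is not None:
            self.next.process(doc)


class LowercaseName(Processor):
    """A custom step: normalizes the hypothetical 'name' field."""

    def process(self, doc):
        doc["name"] = doc["name"].lower()
        super().process(doc)


class RunUpdate(Processor):
    """The terminal step, the only one that actually writes to the index."""

    def __init__(self, index):
        super().__init__(None)
        self.index = index

    def process(self, doc):
        self.index[doc["id"]] = doc


def build_chain(factories):
    """Link the processors so that each one hands the doc to the next."""
    chain = None
    for factory in reversed(factories):
        chain = factory(chain)
    return chain


index_ok, index_broken = {}, {}

good = build_chain([LowercaseName, lambda nxt: RunUpdate(index_ok)])
bad = build_chain([LowercaseName])  # terminal write step forgotten

good.process({"id": "1", "name": "Solr"})
bad.process({"id": "1", "name": "Solr"})  # runs without error, writes nothing

print(index_ok)      # {'1': {'id': '1', 'name': 'solr'}}
print(index_broken)  # {}
```

The misconfigured chain raises no error, which is exactly why such a mistake is easy to miss.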

Many ElasticSearch features, by contrast, work out of the box and need no configuration by the user. Solr's configuration is so flexible that it gives users many opportunities to make mistakes, while ElasticSearch's design philosophy is to minimize the chance of user error. ES also imposes many restrictions on the runtime environment to avoid puzzling errors at run time, because many users are not experts in these areas and cannot find the cause of such errors. For example, at startup ElasticSearch checks memory settings, checks whether the system limits the number of file descriptors, and so on. It will even refuse to start if you are running a JVM version with a known bug. ElasticSearch also uses JarHell at startup to check for classes with the same name. I once embedded ElasticSearch in a test project to start a local ES cluster, and JarHell gave me a really hard time. In a large Java project it is very hard to avoid different dependencies bringing in classes of the same name. After spending a long time following JarHell's error messages and excluding dependencies, I finally gave up. The JarHell check cannot be turned off, so I could only fake an empty JarHell class to bypass it. All in all, ES lowers the barrier to entry while trying to prevent user mistakes, which makes it friendly to novices.

Solr supports HDFS storage while ElasticSearch cannot

Support for HDFS as storage is a major Solr feature, and running on HDFS brings the advantage of separating storage from compute. On ordinary storage, moving a replica from one node to another means copying a large amount of data. On HDFS, moving a replica does not move any data: every node can read the content on HDFS, so the other node only needs to open the data that is already there. With Solr on HDFS plus the Autoscaling framework, a single primary replica per shard is enough (unless you want to spread the read load), because when a node dies the shard comes back online quickly on another node without any data being copied. Balancing the whole cluster is also very fast, because only the logical shards move between nodes and the data on HDFS stays where it is. ElasticSearch cannot run on HDFS, so it can do none of this.

Solr supports Shard split while ElasticSearch cannot

Shards in Solr and ES are hash partitions (based on doc id or a user-defined field), which naturally spreads out hotspots. However, some docs may still be hot and need to be spread further. Or, when a new batch of machines is added to the cluster, more shards are needed so that the Collection can make use of every machine. Solr supports a split operation that divides one shard into several. ElasticSearch cannot split a shard in place. If an index needs more shards, the only option is to create a new index with more shards and migrate the data there. Why can't ElasticSearch do this? The reason is the routing strategy. A Solr shard owns a range of hash values. To split it, you cut the range in half and create two sub-shards, each responsible for one half. ElasticSearch, on the other hand, routes by hashing and then taking the result modulo the number of shards to decide where a doc lands. If a shard is added, the modulo results are all scrambled. Because ElasticSearch shards cannot be split, you need to be very careful at planning time in predicting how much data an index will hold and how many shards it will need. If the estimate is wrong, the later migration requires copying a large amount of data.
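The effect of the two routing strategies can be checked with a small experiment. This is a sketch under assumptions: CRC32 is used as a stand-in hash function (neither product is claimed to use it), and the shard names and helper functions are made up for illustration.

```python
import zlib

HASH_SPACE = 2 ** 32


def h(doc_id):
    """Stable 32-bit hash of a document ID (CRC32 as a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8"))


def route_by_mod(doc_id, num_shards):
    """Modulo routing: shard = hash % number of shards."""
    return h(doc_id) % num_shards


def route_by_range(doc_id, ranges):
    """Range routing: each shard owns a half-open [start, end) slice of the hash space."""
    value = h(doc_id)
    for shard, (start, end) in ranges.items():
        if start <= value < end:
            return shard
    raise ValueError("hash ranges do not cover the hash space")


doc_ids = [f"doc-{i}" for i in range(10_000)]

# Modulo routing: going from 2 to 3 shards changes the home of many docs.
moved_mod = sum(route_by_mod(d, 2) != route_by_mod(d, 3) for d in doc_ids)

# Range routing: split shard "s1" in half; "s0" keeps its range untouched.
half = HASH_SPACE // 2
mid = half + half // 2
before = {"s0": (0, half), "s1": (half, HASH_SPACE)}
after = {"s0": (0, half), "s1_0": (half, mid), "s1_1": (mid, HASH_SPACE)}

moved_from_s0 = sum(
    route_by_range(d, before) == "s0" and route_by_range(d, after) != "s0"
    for d in doc_ids
)
print(moved_mod, moved_from_s0)
```

With range routing, only documents of the shard being split get a new home, and they all stay within its two sub-shards. With modulo routing, changing the shard count reassigns documents across the whole index.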

ES has richer features and a richer ecosystem

The feature richness of ElasticSearch is truly impressive. After all, there is a very strong commercial company behind it, one that will build whatever attracts customers. The ELK stack is already a standard industry solution for log processing. Elastic also offers a wealth of features to different tiers of users through X-Pack. Solr has commercial companies behind it too, such as LucidWorks, but they are generally less well known than Elastic and offer a more limited set of solutions. Here are some of the better features of ElasticSearch:

JSON-friendly: supports nested fields and works naturally with JSON, while Solr only supports nested documents.

Index Sorting: I think this is a killer feature for sorting scenarios. If all user requests sort by a certain field, ElasticSearch can order the docs within a Segment by that field instead of by doc ID. During a query, scanning the first n items is then enough to obtain the result set, so the query can terminate early.

Index lifecycle management: for example, automatically deleting an index that is more than a given number of days old, adding replicas when an index is very hot, or taking regular snapshots of an index.

Downsampling of time series data: an important feature for ElasticSearch in the time series field. A Rollup background job can be configured to aggregate field data, for example rolling hourly data up into daily data and storing it in another field or index.

Triggers: when certain conditions are met, a series of actions can be fired, such as calling a web page with curl, which gives you something similar to database triggers.
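The early termination described in the Index Sorting item above can be sketched in Python. This is a toy model that assumes a query matching every document and a sort on a hypothetical timestamp field; the function names are made up for illustration.

```python
import heapq

# Toy segment contents: (timestamp, doc_id). Both segments hold the same docs.
docs = [(ts, f"doc-{ts}") for ts in (50, 10, 40, 20, 30, 60)]

unsorted_segment = list(docs)   # stored in arrival (doc ID) order
sorted_segment = sorted(docs)   # stored in timestamp order (index sorting)


def top_n_unsorted(segment, n):
    """Without index sorting, every doc must be visited to find the n smallest."""
    visited = len(segment)
    return heapq.nsmallest(n, segment), visited


def top_n_sorted(segment, n):
    """With index sorting, the first n docs already are the answer."""
    top = segment[:n]
    return top, len(top)


slow, slow_visited = top_n_unsorted(unsorted_segment, 2)
fast, fast_visited = top_n_sorted(sorted_segment, 2)
print(slow == fast, slow_visited, fast_visited)  # True 6 2
```

Both approaches return the same top 2 documents, but the sorted segment lets the scan stop after 2 entries instead of visiting all 6.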
