
How to use ElasticSearch big data distributed flexible search engine


This article shows how to use ElasticSearch as a big data distributed, flexible search engine. The content is concise and easy to follow, and I hope you will get something out of the detailed walkthrough.

1. Background

I first came into contact with elasticsearch two years ago, but I never studied it in depth; I only used it at work. The more I used it, the more I found es to be a genuinely good tool, so I spent some time studying it properly. Along the way I also ran into problems: most of the material online is scattered, much of it is experimental demo code, and many issues are never explained clearly or presented as a complete, systematic solution. So I patiently explored, summarized a few things, and am sharing them here.

After all, once you use es to the standards required in production, plenty of problems surface, and that raises the bar for what you need to learn.

For example, when you use elasticsearch servicewrapper for automatic startup, have you noticed that a small bug in its configuration makes it fail to load a class from the elasticsearch jar package?

There are also significant differences between es versions. For example, the distributed routing of 1.x was changed substantially in 2.0. Routing used to be configured on the mapping, but from 2.0 onward it dynamically follows the index. The purpose of this change is sound: different types within the same index get the chance to choose their own shard routing key, whereas routing fixed on the mapping is forced onto every type of the current index.

Es is a good tool, and more and more distributed systems need it to solve problems, from system-level tooling such as ELK to the design of the core trading systems of e-commerce platforms, all of which need real-time big data search and analysis. For example, a product center with tens of millions of SKUs needs real-time search, and so does real-time querying of massive volumes of online orders.

Some DevOps tools also need es to provide powerful real-time search. It is well worth spending some time studying it.

As an e-commerce architect, there is no reason not to learn it and use it to raise the overall service level of the system. This article summarizes my learning experience over this period and shares it with you.

2. Installation

First of all, you need several linux machines, or you can run virtual machines. You can complete the installation and configuration on one virtual machine and then clone it, modifying the IP, HWaddr and UUID, which saves you from repeating the installation and configuration.

1. I have three local CentOS 6.5 machines, with IPs 192.168.0.10, 192.168.0.20 and 192.168.0.30.

(We first perform the installation and configuration on 192.168.0.10; when everything is ready we clone this node, modify the configuration, set the cluster parameters, and finally end up with a working three-node cluster instance.)

2. Since ElasticSearch is developed in the java language, we need to install the java environment first. I am using JDK 8, which can be installed directly with yum; the yum repository has current packages.

First check whether your current machine has a java environment installed:

yum info installed | grep java

If a java environment already exists and it is not the one you want, you can uninstall it and reinstall the version you want (yum -y remove xxx). If the uninstall leaves files behind, use find to locate the related files and delete them directly; Linux systems are file-based, so anything you can find you can delete.

Let's take a look at which versions are available:

yum search java

java-1.8.0-openjdk.x86_64 : OpenJDK Runtime Environment (this is the package we want)

Then perform the installation:

yum -y install java-1.8.0-openjdk.x86_64

Check the java version information after installation:

java -version

We have done the preparatory work, and next we will perform the environment installation and configuration of ElasticSearch.

2.1. Find and download rpm packages, and perform rpm package installation

You can install it in several ways. Using a yum repository is the fastest and most convenient, but the version in it usually lags behind, so I went directly to the official website to download the rpm package.

Official download address of elasticsearch: https://www.elastic.co/downloads/elasticsearch

Find the file for your system type; of course, if you are on windows, just download the zip package and use it. Here I need the rpm file.

You can also register it as a local yum source and then install it with the yum command.

I used the wget tool to download the rpm file directly to the local machine. (If your package has dependencies, installing via yum is recommended.)

(If the wget command is not available, install it first: yum -y install wget.)

wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.4.0/elasticsearch-2.4.0.rpm

Then wait for the download to complete.

One reminder here: however much you want the latest version of elasticsearch, I recommend installing a slightly older one; the version I installed locally is 2.3.4. Why stress this? Because with a very new version there is a real question of whether the Chinese analyzer supports it yet. At the time elastic had just moved from 2.3.5 to 2.4.0; I installed 2.3.5 and then found a problem with the ik Chinese analyzer, and had to git clone it and compile it before I could produce the deployable files. So I recommend installing 2.3.4; the matching 2.3.4 Chinese analyzer can be downloaded and deployed on the linux server directly, which is very convenient.

Perform the installation:

rpm -iv elasticsearch-2.3.4.rpm

Then wait for the installation to complete.

Barring accidents, the installation should complete. Let's check the basic installation layout to see whether any files are missing, because some packages lack part of the config files; if anything is missing we have to fill it in ourselves.

To see which files the installation put where, you can search from the root directory with find.

cd /

find . -name elasticsearch

./var/lib/elasticsearch

./var/log/elasticsearch

./var/run/elasticsearch

./etc/rc.d/init.d/elasticsearch

./etc/sysconfig/elasticsearch

./etc/elasticsearch

./usr/share/elasticsearch

./usr/share/elasticsearch/bin/elasticsearch

Basically, you have to check whether config is missing; it was missing in my installation.

cd /usr/share/elasticsearch/

ll

drwxr-xr-x. 2 root root  4096 Sep  4 01:10 bin

drwxr-xr-x. 2 root root  4096 Sep  4 01:10 lib

-rw-r--r--. 1 root root 11358 Jun 30 19:22 LICENSE.txt

drwxr-xr-x. 5 root root  4096 Sep  4 01:10 modules

-rw-r--r--. 1 root root   150 Jun 30 19:22 NOTICE.txt

drwxr-xr-x. 2 elasticsearch elasticsearch 4096 Jun 30 19:32 plugins

-rw-r--r--. 1 root root  8700 Jun 30 19:22 README.textile

As you can probably see, the config folder is indeed missing. We need to create it, and it also needs an elasticsearch.yml configuration file; otherwise startup is bound to fail.

mkdir config

cd config

vim elasticsearch.yml

Find an elasticsearch.yml configuration and paste it in, or copy it over as a file. These settings are only the basics and will need adjusting later according to your situation; some settings are not in the configuration file at all and have to be looked up in the official documentation. So it does not matter much whether the configuration file is complete, and there is plenty of material about the configuration items online.
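As a minimal sketch, the kind of elasticsearch.yml I ended up with looks like the following; the values match the cluster built later in this article and the paths follow the rpm layout shown above, so adjust them to your own environment:

cluster.name: orderSearch_cluster
node.name: node-1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
transport.tcp.port: 9300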

Save the elasticsearch.yml file.

You also need a logging.yml log configuration file. Log files are a must for a service like es that runs in the background on a server; this log output will be collected by your logging platform and monitored for operational health checks. logging.yml is essentially a log4j configuration file, which should be familiar to everyone. As with elasticsearch.yml, either copy and paste it or transfer it as a file.

The log output goes to the logs directory, which is created automatically. I still prefer to create it up front rather than leave anything uncertain; perhaps it would not be created automatically.

mkdir logs

Finally, set execute permissions on the files we just added; otherwise the files will show as plain, non-executable entries and cannot be executed.

cd ..

chmod -R u+x config/

The installation is now basically complete. cd into the startup directory and try to start es to see whether it starts properly.

Not surprisingly, you will receive a "java.lang.RuntimeException: don't run elasticsearch as root" exception. This means the first step of the installation is done; in the next section we look at which account to start it with.

2.2. Configure elasticsearch exclusive accounts and groups

By default, es does not allow the root account to start it, for security reasons. Es embeds a groovy script engine by default, and there are many plugin script engines, which are not entirely secure. When es first came out there were groovy vulnerabilities, so it is recommended to turn this scripting feature off on production es instances. It is not turned on by default, but check your configuration to be safe.

So we need to configure a dedicated account and group for es. Before creating one, check whether a dedicated es account already exists on the system, because the rpm installation earlier created the elasticsearch group and user automatically. Check first, and only create them yourself if your installation did not bring them along; that way you will not confuse what the installer added with what you created.

Check the groups:

cat /etc/group

Check the users:

cat /etc/passwd

They are already set up: the elasticsearch group (GID 499) exists, and a matching elasticsearch account has been created in passwd.

If you do not automatically create a corresponding group and account in your system, you can create it yourself, as follows:

Create a group:

groupadd elasticsearch_group

Create a user:

useradd elasticsearch_user -g elasticsearch_group -s /sbin/nologin

Note: this account has no login permission; its shell is set to /sbin/nologin.

Since my machine now has two sets of dedicated elasticsearch accounts for demonstration purposes, I will delete the ones ending in "_group" and "_user" and use the account (elasticsearch) created automatically by the rpm install as the es startup account.

2.3. Set the elasticsearch file owner

The next thing to do is associate the es files with the elasticsearch account: make elasticsearch the owner of the es-related files, so that the elasticsearch user can use all es files without any permission restrictions.

Navigate to the elasticsearch parent directory:

cd /usr/share

ll

chown -R elasticsearch:elasticsearch elasticsearch/

At this point, the owner of your elasticsearch file is elasticsearch.

2.4. Switch to elasticsearch exclusive account to test whether it can be started successfully.

To test starting the es instance, we temporarily give the elasticsearch user /bin/bash as its shell, so that we can su to elasticsearch and start the es instance.
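If your elasticsearch account was created with /sbin/nologin as its shell (as with the manual useradd above), one way to make this test possible, an assumption of mine rather than a step from the original install, is to switch its shell temporarily and switch it back afterwards:

usermod -s /bin/bash elasticsearch

(run the test below, then restore)

usermod -s /sbin/nologin elasticsearch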

su elasticsearch

cd /usr/share/elasticsearch/bin

./elasticsearch

The startup completes and no exception should appear at this point. Check whether the service ports have come up.

netstat -tnl

Continue to check to see if the HTTP service starts properly.

curl -XGET http://192.168.0.103:9200/_cat

Since we have not installed any auxiliary management tools yet, such as plugin/head, the built-in _cat REST endpoint is a convenient way to look around.

curl -XGET http://192.168.0.103:9200/_cat/nodes

192.168.0.103 192.168.0.103 4 64 0.00 d * node-1

As you can see, only one node is currently working, 192.168.0.103, and it is a data node.

(Note: to save time, I will use the clean 192.168.0.103 environment for the installation and setup demonstration for now, and clone it and modify the IPs when building the cluster.)

2.5. Install the self-startup elasticsearch servicewrapper package

Es has an open-source wrapper package available for startup. If you do not use this wrapper you can also write your own shell script, but there are many parameters you need to understand very well, and some key parameters must be set. So it is more efficient to base your changes on the elasticsearch servicewrapper package, and reading its elasticsearch shell script also reveals some of es's deeper configuration and principles.

(Note: if you come from .NET, you can think of servicewrapper as an open-source counterpart of .NET's Topshelf. In essence it packages the program as a system service, which you can install, uninstall, start, stop, or simply run in the foreground.)

2.5.1. Download the elasticsearch servicewrapper package

The elasticsearch-servicewrapper github home page: https://github.com/elastic/elasticsearch-servicewrapper

Copy the git repository address to the clipboard, and then clone directly to the local machine.

git clone https://github.com/elastic/elasticsearch-servicewrapper.git

(You need the git client installed on the current linux machine: yum -y install git. I installed the default version, 1.7.)

Then wait for the clone to complete.

Check the cloned local repository files. Go into elasticsearch-servicewrapper and view the current git branch.

cd /root/elasticsearch-servicewrapper

git branch

* master

ll

Everything is normal, which shows the clone went fine and the branches are clear. The service directory contains the installation files we want to install.

We need to copy the service file to the elasticsearch/bin directory.

cp -R service/ /usr/share/elasticsearch/bin/

cd /usr/share/elasticsearch/bin/

The installation files in service need to work in the elasticsearch/bin directory.

cd service/

ll

./elasticsearch

Refer to the usage instructions for elasticsearch-servicewrapper on github. It has a lot of functionality; status and dump, for example, are good tools for checking and debugging.
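Running the script with no arguments prints its usage; in my copy the supported commands looked roughly like this (the exact list may differ between wrapper versions):

Usage: ./elasticsearch { console | start | stop | restart | status | dump | install | remove }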

Before installing, we run the es instance in the foreground temporarily so we can see whether anything abnormal appears in the logs. The parameters are clearly documented; here we use the console parameter to start the es instance with console output.

./elasticsearch console

2.5.2. A small configuration bug in the elasticsearch servicewrapper open source package

At this point you should receive an error:

WrapperSimpleApp Error: Unable to locate the class org.elasticsearch.bootstrap.ElasticsearchF: java.lang.ClassNotFoundException: org.elasticsearch.bootstrap.ElasticsearchF

I was a little confused the first time I saw this. What kind of class is ElasticsearchF? The name is a bit odd, but looking more closely at the exception information, it is actually a ClassNotFoundException, meaning the ElasticsearchF class cannot be found.

There are two possibilities. The first is that there is a problem with the elasticsearch java packages and this class really is missing, but that is very unlikely, because we already ran elasticsearch directly and it worked. I browsed es's jar with jd-gui, and indeed there is no such class.

The second is that the configuration here is wrong; it has to be a mistake, because there really is no class called ElasticsearchF.

Let's check whether the 'ElasticsearchF' string appears in the service/elasticsearch.conf configuration file. (The wrapper package uses elasticsearch.conf in the current directory as its configuration file.)

grep -i elasticsearchf elasticsearch.conf

The string is indeed there, so we edit the file and save it, removing the trailing 'F'.
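If you would rather not edit by hand, a one-line sed does the same thing; back up the file first, since the exact property carrying this class name may differ between wrapper versions:

cp elasticsearch.conf elasticsearch.conf.bak

sed -i 's/ElasticsearchF/Elasticsearch/' elasticsearch.conf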

Then we attempt to start again.

./elasticsearch console

I don't know whether you will, as in my case, get a complaint that the command arguments are invalid.

The startup path basically goes through three stages: first the service/elasticsearch shell startup script runs, parses the command, and then starts the java servicewrapper program under exec.

This java servicewrapper program is version 3.5.14. Following that idea, by reading the elasticsearch shell script you can see that after receiving the external command it starts the java servicewrapper program under exec. I wanted to edit the elasticsearch shell file and print some information, to see whether a wrong path or parameter was causing the error. (Don't be afraid of problems; at least follow them and see what is going on.)

vim ./elasticsearch

Esc

/console

Find where console is handled, then add some debug output and print it to the screen.

Run again and check whether anything is wrong with the command parameters.

After looking at it, there is nothing wrong with the output parameters, and for a moment I was stuck. Out of curiosity I wanted to look into the exec/elasticsearch-linux-x86-64.so file, but I could not make sense of it when I opened it. So I tried another route and looked at the windows version of the servicewrapper; it has no 32-bit wrapper, and when I ran it I hit basically the same error, except that the windows wrapper prints much more error information and indicates the cause.

I wanted to raise the log output level to see whether any useful information would appear. Edit service/elasticsearch.conf, the wrapper package's own configuration.

# Log Level for console output. (See docs for log levels)

wrapper.console.loglevel=TRACE

# Log Level for log file output. (See docs for log levels)

wrapper.logfile.loglevel=TRACE

We set the log output level to TRACE; there are two settings to change. Then look at the output.

It does print some useful information: you can see the path of the detailed log file.

WrapperManager Debug: Received a packet LOGFILE: /usr/share/elasticsearch/logs/service.log

But there is still only that one piece of error information.

That is as far as I got. The goal was to run with console and look through some of the run logs, but it does not matter that it would not run; we continue with the installation.

(If any reader knows where the problem lies, please share it. I do not think this is an occasional problem; others will hit it too, so I am raising the question here, at least for the benefit of future users. Thanks in advance.)

In fact, you could also download the java servicewrapper yourself and wrap elasticsearch with it instead of using elasticsearch-servicewrapper; that is also quite easy to do.

Back to the topic: since we cannot run console and cannot see what happens when wrapper console executes, we simply proceed with the installation.

2.5.3. servicewrapper installation (setting the user, open files and config path in the elasticsearch init.d startup file)

We perform the installation as instructed by the elasticsearch servicewrapper parameter list.

./elasticsearch install

Installing the Elasticsearch daemon..

The daemon installation is complete. Let's go to the system directory to confirm the installation succeeded (it pays for technicians to keep a rigorous mindset at all times). Check the /etc/init.d/ directory.

ll /etc/init.d/

-rwxrwxr--. 1 root root 4496 Oct  4 01:43 elasticsearch

I have already run chmod u+x ./elasticsearch here. Don't forget to set execute permission on the file, otherwise you will hit the same issue as in section 2.1, so I won't repeat it.

Let's start editing the elasticsearch startup file.

This is the key part: fill in the dedicated es account configured earlier (elasticsearch, from section 2.2) and the corresponding file paths. The two configuration items MAX_OPEN_FILES and MAX_MAP_COUNT are skipped here; they are explained later in the configuration part (section 3.2).
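As a rough illustration (the actual variable names depend on the wrapper version you installed, so treat this as a sketch rather than a literal diff), the lines I touched looked something like:

RUN_AS_USER=elasticsearch        # the dedicated account from section 2.2
ES_HOME=/usr/share/elasticsearch # where the es files were installed
# MAX_OPEN_FILES and MAX_MAP_COUNT also live in this file; see section 3.2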

2.5.4. chkconfig --add: add to the list of linux startup services

Add it to the system service so that it can be started automatically by the system.

chkconfig --add elasticsearch

chkconfig --list

It has been added to the list of system self-starting services.

service elasticsearch start

Start the es instance, wait for the ports to come up, and check the port situation in a moment.

netstat -tnl

Port 9300 comes up earlier than port 9200, because 9300 is the internal cluster management port while 9200 is the REST endpoint service port. The gap is not very long, though.

After all the ports are started successfully, let's check whether the es instance can be accessed normally.

curl -XGET http://192.168.0.103:9200/

{
  "name": "node-1",
  "cluster_name": "orderSearch_cluster",
  "version": {
    "number": "2.3.4",
    "build_hash": "e455fd0c13dceca8dbbdbb1665d068ae55dabe3f",
    "build_timestamp": "2016-06-30T11:24:31Z",
    "build_snapshot": false,
    "lucene_version": "5.5.0"
  },
  "tagline": "You Know, for Search"
}

Let's also check with the _cat REST endpoint.

curl -XGET http://192.168.0.103:9200/_cat/nodes

192.168.0.103 192.168.0.103 4 61 0.00 d * node-1

If you can access it locally but not from an external browser, it is probably the firewall settings. Adjust the firewall.

vim /etc/sysconfig/iptables
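For reference, on CentOS 6 the rules to open the two elasticsearch ports look roughly like this (place them above the final REJECT rule, and adjust the ports if you changed the defaults):

-A INPUT -m state --state NEW -m tcp -p tcp --dport 9200 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 9300 -j ACCEPT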

Restart the iptables service so the new firewall rules are loaded.

service iptables restart

Then check whether it is reachable from outside; if not, try telnet against the port to narrow the problem down.

Another possible reason for being unreachable is a setting in elasticsearch.yml; see section 3.1.1.

Restart the machine to see if the es instance starts automatically.

shutdown -r now

Wait a moment, then try to connect to the machine.

If nothing unexpected happens, everything should be normal and the ports start successfully. That means the es instance self-starts and is now managed automatically as a linux system service.

Once it is installed as a service, elasticsearch servicewrapper does not have much more to do with us; its job revolves entirely around those startup parameters, and we simply build on top of it.

2.6. Install the _plugin/head management plug-in (auxiliary management)

To manage the cluster well we need the right tools. head is popular, generic and free; of course there are many other useful tools, such as bigdesk and Marvel (commercial, paid). Installing any plugin works much the same way, so we use the general-purpose head tool here.

Let's take a look at the clear view of cluster node management that head brings to us.

This is an es cluster instance with three nodes, shown as a two-dimensional matrix: the top row is the indexes, the leftmost column is the nodes, and each intersection shows the shards of that index on that node and their distribution.

Installing the head plug-in is quite convenient; you can even just copy files into place. Under the elasticsearch home directory there is a plugins directory, which holds all plug-ins; every plug-in is discovered and loaded from this folder.

Let's take a look at how to install the head plug-in. There is a plugin executable in the elasticsearch/bin directory, which is a program specifically used to install plug-ins.

./plugin install mobz/elasticsearch-head

The plug-in search path includes several locations: the elasticsearch official site is one and github is another. Here it tries github first. Wait a moment for the installation to complete, then try to access the head plug-in's REST address, /_plugin/head.

The interface comes up, so the installation basically succeeded; node-1 is the master node by default.

2.7. Install elasticsearch client plug-ins in Chrome

There are many elasticsearch client plug-ins available for Chrome, which are convenient for development and maintenance, so I recommend using the Chrome plug-ins directly. Just search for the elasticsearch keyword and plenty will show up.

Two of the more common and easier-to-use ones are ElasticSearch Toolbox and Sense (a DSL editor with auto-completion). The Chrome plug-ins are nicely done and pleasant to look at.

ElasticSearch Toolbox makes it easy to query and export data.

Sense lets you edit the elasticsearch DSL with completion hints, so writing complex DSL is efficient and less error-prone. I have not used the other tools, but they look worth trying.

(Note: if you cannot access the Chrome Web Store, you will need to work around that yourself; I won't cover it here.)

2.8. Use the _cat tools that come with elasticsearch

In some situations you may not be able to use a plugin to manage or inspect the cluster. In that case you can use elasticsearch's built-in _cat REST endpoints directly. For example, you may find that some nodes have not appeared in _plugin/head and you are not sure what happened; use /_cat/nodes to check all the nodes. Sometimes nodes do start but go their own way (split brain), and you may need to make them re-elect or speed up the election process.

http://192.168.0.20:9200/_cat/nodes?v (check node status)

The _cat REST endpoints take a v parameter, which makes the output easier to read. The _search REST endpoint takes a pretty parameter, which formats query results for reading. Basically every endpoint has its own readability helpers.

http://192.168.0.20:9200/_cat/shards?v (check shard status)

http://192.168.0.20:9200/_cat/ (view everything _cat can show)

You can view index aliases, segments (to check the commit/version consistency of each shard's segments), the list of indices, and so on.

2.9. Clone the virtual machine (modify the IP, HWaddr and UUID, and finally adjust the system time)

With one machine fully installed, the next step is to build the distributed setup. A distributed system needs multiple machine nodes, and according to es distributed cluster best practices you need at least three. So we clone the installed machine twice more, giving three machines in total to form a working three-node distributed system.

First clone the currently installed machine, 192.168.0.103, start it, and modify a few settings. (Because it is a clone, configuration such as the NIC address and the IP address is duplicated.)

Edit the NIC configuration file:

vim /etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE=eth4

HWADDR=00:0C:29:CF:48:23

TYPE=Ethernet

UUID=b848e750-d491-4c9d-b2ca-c853f21bf40b

ONBOOT=yes

NM_CONTROLLED=yes

BOOTPROTO=static

BROADCAST=192.168.233.255

IPADDR=192.168.0.103

NETMASK=255.255.255.0

GATEWAY=192.168.0.1

DEVICE is the NIC label; change it to match your local NIC identifier, which you can check with ifconfig. The HWADDR NIC address can be changed freely, as long as it does not collide within your network segment. UUID is handled the same way as HWADDR.

Change the IP address to one you consider appropriate, ideally by reference to your physical machine's configuration. The GATEWAY address should follow your physical machine's gateway. If the virtual machine uses a bridged network connection, this must be set, otherwise the network will not come up.

Restart the network service:

service network restart

Wait a moment, reconnect over ssh, run ifconfig to confirm the network parameters are correct, and finally ping an external URL and the IP of your current physical machine to make sure the network is working.

Finally, we adjust the linux system time, to avoid the many small problems that inconsistent server clocks can cause, such as the timestamps used in es master elections and in log4j output. In a distributed system, clocks matter a great deal.

date -s '20161008 20:47:15'

You can also set the time zone if you need it, but you don't need it here for the time being.

Clone as many machines as you need. By default we agreed, roughly, that the three machines 192.168.0.10, 192.168.0.20 and 192.168.0.30 will form the es distributed cluster.

3. Configuration

We have prepared the cluster nodes; now we configure them so the three nodes can join together. The configuration involved is fairly simple and only covers the basic, common functionality of a cluster. If you have special needs, check the elasticsearch official site or search around; the material on this topic is already very rich.

Some of the configuration here has, in fact, already been simplified for us by elasticsearch servicewrapper.

From here on, we configure the three machines 192.168.0.10, 192.168.0.20 and 192.168.0.30.

3.1.elasticsearch.yml configuration

The configuration files are in elasticsearch's config directory; navigate there with cd /usr/share/elasticsearch/config.

3.1.1. IP access restrictions and the default port 9200

There are two things to note here: the first is the IP access restriction, the second is the es instance's default port 9200. IP access restrictions can limit which IPs may reach the server, which gives a certain amount of security filtering.

# Set the bind address to a specific IP (IPv4 or IPv6):

#

network.host: 0.0.0.0

If this is set to 0.0.0.0, no IP is restricted. In production, servers are usually limited to a few IPs, typically the ones used for administration.

The default port 9200 is also somewhat risky in general, and you can change it to something else. Another reason is to prevent developers from accidentally connecting to the cluster. Of course, if your company isolates its networks properly it matters less.

#

# Set a custom port for HTTP:

#

http.port: 9200

transport.tcp.port: 9300

Here 9300 is the port used for communication inside the cluster, and it can also be changed. Since there are two ways to connect to a cluster, and a client can also join the cluster as a cluster node, change the default ports for safety's sake.

(Note: remember to make the same change on all three nodes, otherwise connections between the nodes cannot be established and errors will be reported.)

3.1.2. Cluster discovery IP list; node and cluster names

Next, modify the cluster node IP addresses so the cluster works across the specified nodes. By default elasticsearch uses an automatic IP discovery mechanism: within the current network segment, any node that can be sensed is automatically added to the cluster. This has both advantages and disadvantages. The advantage is automation, which is very convenient when your es cluster needs to scale out in a cloud environment; but it also brings some instability, for example around master elections and data replication.

One of the triggers for a master election is a node joining the cluster. When data replication happens the cluster is also affected, because data must be rebalanced and kept redundant. One option is to separate out a dedicated master group and strip those master nodes of their data-node role.

With a fixed IP list, discovery can be configured in two ways: dependent (chained) discovery and full discovery. Each has its advantages; I use dependent discovery. An important criterion here is how fast your cluster will grow, because full discovery has a big problem at cluster initialization: electing the global master takes a long time, and nodes start at different speeds. So I use reliable dependent discovery.

On 192.168.0.20, configure elasticsearch as follows:

# --------------------------------- Discovery ----------------------------------

#

# Pass an initial list of hosts to perform discovery when new node is started:

# The default list of hosts is ["127.0.0.1", "[::1]"]

#

discovery.zen.ping.unicast.hosts: ["192.168.0.10:9300"]

Node 20 is told to find the 10 machine and join through it; the remaining 30 node is configured in the same chained way, as sketched below.
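For example, my assumption for the 192.168.0.30 node, following the same dependent-discovery pattern, would be to point it at both earlier nodes:

discovery.zen.ping.unicast.hosts: ["192.168.0.10:9300", "192.168.0.20:9300"]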

(Note: there are many discovery configurations online for different scenarios; this is only a starting point, and anyone interested will find plenty of material on the topic.)

Then configure the cluster name, the name of the cluster the current node belongs to, which helps you plan your clusters. Only nodes with the same cluster name can form one logical cluster.

# ---------------------------------- Cluster -----------------------------------

#

# Use a descriptive name for your cluster:

#

cluster.name: orderSearch_cluster

#

# ------------------------------------ Node ------------------------------------

#

# Use a descriptive name for the node:

#

node.name: node-2

Do the same for the other two nodes: cluster.name must stay identical, while node.name is set individually for each node.

3.1.3. Master node election and handover

A small piece of experience to share: because my cluster runs on virtualized machines, I often shut it down and restart it, and sometimes found problems with the master election. If the cluster is shut down the wrong way, it directly affects the logic of the next master election.

I looked into the election logic roughly: it uses the freshness of the data as an important election criterion (logs, data and time are all key indicators for electing the overall master of the cluster).

This is because of data consistency: naturally the node with the latest data becomes master, and the new data is then replicated and refreshed to the other nodes.

If you find that a node takes too long to join the cluster, you can try restarting that es service to force a fresh global master election.

3.2. Linux maximum open files setting (the system threshold used for indexing)

On linux, you have to ask the operating system if you want to use more system resources. Elasticsearch needs a large number of file handles when indexing, and the default linux limits may not be enough, so set this up before you start using it in earnest.

This setting is introduced as a key configuration in the book "ElasticSearch: Extensible Open Source Flexible Search Solution", so you can imagine how many people have already fallen into this pit.

elasticsearch servicewrapper already configures this for us.

vim /etc/init.d/elasticsearch

This setting is applied to the es instance when it starts.
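A quick way to double-check what the elasticsearch account actually gets (my own habit, not a step from the original procedure) is to query the limit in a shell running as that account; it should print the MAX_OPEN_FILES value, for example 65535:

su -s /bin/bash elasticsearch -c 'ulimit -n'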

Now try starting the es instances on the three machines and see whether _plugin/head shows the cluster status of all three. (Remember to visit the machine where the head plug-in is installed; I installed it on the 10 machine.)

The labels shown in red are the node.name values you set, and the nodes are working as one cluster.

3.3. Install the ik Chinese analyzer (watch the version compatibility)

At this point the cluster can work, but we still need to configure a Chinese analyzer. After all we work with Chinese, and elasticsearch's built-in word segmentation does not handle Chinese well.

I use the ik analyzer; its github address is https://github.com/medcl/elasticsearch-analysis-ik

Don't rush to clone it; first check which elasticsearch versions each ik release supports.

The elasticsearch version we use is 2.3.4, so we need to find the matching ik version; otherwise the ik plug-in built for the wrong version will simply fail to load at startup. Switch to the releases list, find the matching version, and download it.

You can download it directly to the linux machine, or download it to your host machine and copy it into the virtual machine. If your elasticsearch version is the very latest, you may need to download the ik source code and compile it before deploying.

Of course, you can also install it with git+maven; for detailed installation steps see https://github.com/medcl/elasticsearch-analysis-ik

That part is fairly straightforward, so I will not repeat it here. Restart the es instances after installing.

3.4. Elasticsearch cluster planning (masters should not double as data nodes; standalone masters act as commanders)

You can plan a cluster like this: run two master nodes, both acting purely as commanders that coordinate cluster-level affairs, with their data role disabled; then plan a three-node data cluster and disable the master role on those three nodes, so they can concentrate on storing and retrieving data. This is the smallest-granularity cluster structure, and it can be extended from here.

One advantage of this is a clear separation of responsibilities, which prevents, as far as possible, master duties and data duties landing on the same node and causing instability. For example, data replication, data rebalancing and routing on a data node directly affect the stability of a master, which in turn can lead to split-brain problems.
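In elasticsearch.yml terms, a sketch of this split looks like the following (node.master and node.data are standard 2.x settings; how many of each kind of node you run is up to you):

# on the two commander (master-only) nodes
node.master: true
node.data: false

# on the three data nodes
node.master: false
node.data: true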

4. Development

We come to the last part: everything is ready, so there is no reason not to drive this powerful search engine. Let's go.

4.1. Access cluster mode

Wherever there is a cluster, the usual questions follow: high availability, high concurrency, big data, horizontal scaling, and so on. So what is the principle behind an elasticsearch cluster?

First, to keep connections highly available, the client does not use VIP failover in the style of keepalived. Instead, elasticsearch clients are configured with multiple IPs and the client SDK load-balances across them. This is already common practice in distributed systems; centralized clusters such as DB and cache need the VIP approach because their usage characteristics (data consistency) demand it.

All nodes in elasticsearch can handle requests. The more nodes, the higher the concurrent QPS the cluster can take; the corresponding TPS does decline, but the degradation is not proportional to the number of nodes. (It uses a quorum algorithm to ensure availability.) So node replication does not behave the way we might take for granted.

There are two ways to connect to an es cluster. The higher-performance one is to have your client join the cluster directly as a cluster node, with its data role disabled. This is usually used for secondary development: you can clone the source from github, add your own scenarios, and join the cluster, where you can then influence elections, sharding or cluster balancing.

Elasticsearch uses a DSL of its own, carried over a RESTful interface and addressed via REST endpoints such as _search, _cat and _query. You POST the DSL to the elasticsearch server for processing.

Elasticsearch search dsl: https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html

Elasticsearch dsl api: http://elasticsearch-dsl.readthedocs.io/en/latest/

Example:

POST _search
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "query some test"
        }
      },
      "filter": {
        "term": { "user": "plen" }
      }
    }
  }
}

It is very readable, and writing it with the Sense plug-in in Chrome, with its assistance, is even more convenient.

In practice, though, this is not how you normally work; you usually connect to the cluster through an SDK. Using the DSL directly is mostly for examining data or debugging, for example checking whether the DSL produced by the SDK is correct, much like debugging SQL.

4.1.1. .NET NEST usage (connecting to the es cluster with a connection pool)

.NET has an open-source client package, NEST; you can search for it and install it directly from NuGet.

Official website address: https://www.elastic.co/guide/en/elasticsearch/client/net-api/1.x/nest-connecting.html

Connect to the cluster using a connection pool for high availability.

var node1 = new Uri("http://192.168.0.10:9200");

var node2 = new Uri("http://192.168.0.20:9200");

var node3 = new Uri("http://192.168.0.30:9200");

var connectionPool = new SniffingConnectionPool(new[] { node1, node2, node3 });

var settings = new ConnectionSettings(connectionPool);

var client = new ElasticClient(settings);

From here on, using the client object gives you soft load balancing: it balances connections across the three back-end nodes according to some policy (round-robin or weighted; I have not dug into it).

4.1.2. Java Jest usage

For java I use Jest. Create a maven project and add the corresponding maven dependencies for Jest and elasticsearch.

<dependency>
    <groupId>io.searchbox</groupId>
    <artifactId>jest</artifactId>
    <version>2.0.3</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>2.3.5</version>
</dependency>

JestClientFactory factory = new JestClientFactory();
List<String> nodes = new LinkedList<String>();
nodes.add("http://192.168.0.10:9200");
nodes.add("http://192.168.0.20:9200");
nodes.add("http://192.168.0.30:9200");
HttpClientConfig config = new HttpClientConfig.Builder(nodes).multiThreaded(true).build();
factory.setHttpClientConfig(config);
JestHttpClient client = (JestHttpClient) factory.getObject();
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(QueryBuilders.queryStringQuery("Wang Qingpei")); // the query term was garbled in the original; this one matches the hits shown below
searchSourceBuilder.field("name");
Search search = new Search.Builder(searchSourceBuilder.toString()).build();
JestResult rs = client.execute(search);
System.out.println(rs.getJsonString());

{
  "took": 71,
  "timed_out": false,
  "_shards": {
    "total": 45,
    "successful": 45,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0.6614378,
    "hits": [
      {
        "_index": "posts",
        "_type": "post",
        "_id": "1",
        "_score": 0.6614378,
        "fields": { "name": [ "Wang Qingpei" ] }
      },
      {
        "_index": "posts",
        "_type": "post",
        "_id": "5",
        "_score": 0.57875806,
        "fields": { "name": [ "Wang Qingpei" ] }
      },
      {
        "_index": "posts",
        "_type": "post",
        "_id": "2",
        "_score": 0.57875806,
        "fields": { "name": [ "Wang Qingpei" ] }
      },
      {
        "_index": "posts",
        "_type": "post",
        "_id": "AVaKENIckgl39nrAi9V5",
        "_score": 0.57875806,
        "fields": { "name": [ "Wang Qingpei" ] }
      },
      {
        "_index": "class",
        "_type": "student",
        "_id": "1",
        "_score": 0.17759356
      },
      {
        "_index": "posts",
        "_type": "post",
        "_id": "3",
        "_score": 0.17759356,
        "fields": { "name": [ "Wang Qingpei" ] }
      }
    ]
  }
}

The returned data spans multiple indexes. You can step through with the debugger to see whether the client switches the connection IP and whether the availability behavior actually kicks in.

4.2. Index development

The general steps of index development are fairly simple: first create the corresponding mapping, configuring the characteristics of each field in each type.

4.2.1. Mapping configuration

The mapping is the basis on which the es instance operates on each field during indexing: for example, whether the username field should be indexed, whether it is stored, its length, and so on. Although elasticsearch can handle all of this dynamically, for management and maintenance purposes it is recommended to define the index mapping explicitly and keep it in a file, so the index can be rebuilt from it later.

POST /demoindex
{
  "mappings": {
    "demotype": {
      "properties": {
        "contents": {
          "type": "string",
          "index": "analyzed"
        },
        "name": {
          "store": true,
          "type": "string",
          "index": "analyzed"
        },
        "id": {
          "store": true,
          "type": "long"
        },
        "userId": {
          "store": true,
          "type": "long"
        }
      }
    }
  }
}

This is the simplest mapping: it defines an index named demoindex with a type demotype. Each field is a json object that contains its type and whether it is indexed.

Edit this in Sense and then submit it directly with POST.

{

"acknowledged": true

}

Check the created index information to see if it is the mapping setting you submitted.

4.2.2. Mapping template configuration

Creating a similar mapping by hand every time is inefficient. Elasticsearch supports creating mapping templates and then automatically matching indexes against them to decide which mapping definition to use.

PUT _template/log_template
{
  "order": 10,
  "template": "log_*",
  "settings": {
    "index": {
      "number_of_replicas": "2",
      "number_of_shards": "5"
    }
  },
  "mappings": {
    "_default_": {
      "_source": {
        "enabled": false
      }
    }
  }
}

This creates an index mapping template for log-type indexes. We set two basic properties: "number_of_replicas": "2" (the number of replicas) and "number_of_shards": "5" (the number of shards). In mappings, the _source field is disabled by default.

This mapping template is matched automatically whenever we create an index whose name matches "log_*".

You can view the existing mapping templates through the _template REST endpoint, or through the templates menu in the info dropdown in the upper-right corner of the head plug-in.

{"mq_template": {"order": 10, "template": "mq*", "settings": {"index": {"number_of_shards": "5", "number_of_replicas": "2"}} "mappings": {"_ default_": {"_ source_": {"enable": false}}, "aliases": {}}, "log_template": {"order": 10, "template": "log_*" "settings": {"index": {"number_of_shards": "5", "number_of_replicas": "2"}, "mappings": {"_ default_": {"_ source_": {"enable": false} "aliases": {}}, "error_template": {"order": 10, "template": "error_*", "settings": {"index": {"number_of_shards": "5", "number_of_replicas": "2"}} "mappings": {"_ default_": {"_ source_": {"enable": false}}, "aliases": {}} this is usually used in storage that businesses do not want to shut down. For example, logs, messages, major error warnings, and so on, can be set, as long as these repeated mapping are regular. 4.2.3.index routing indexed routing configuration

When es shards data it hashes a key and takes the remainder, so if you pass a fixed key, that key becomes your fixed routing rule. In version 1.x this _routing parameter was set when creating the mappings, which meant that every document under the current type could only be routed by that one key. From es 2.0 onward, routing follows the index metadata instead, so you can control routing per indexing request: you pass the routing parameter separately when submitting a document, rather than setting it on the mappings.

Configuring the _routing parameter on mappings is no longer supported after 2.0.

https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_20_mapping_changes.html#migration-meta-fields

In 1.x, for example, you could use userid as the routing key, so that all of a given user's data lands on one shard, which speeds up queries against it.

{
  "mappings": {
    "post": {
      "_routing": {
        "required": true,
        "path": "userid"
      },
      "properties": {
        "contents": {
          "type": "string"
        },
        "name": {
          "store": true,
          "type": "string"
        },
        "id": {
          "store": true,
          "type": "long"
        },
        "userId": {
          "store": true,
          "type": "long"
        }
      }
    }
  }
}

This _routing is set on the mapping and applies to all documents of the type: userid is used as the sharding key. In 2.x, by contrast, the routing must be specified explicitly.

With the mappings in place, in 2.x you must pass the routing=xxx parameter when indexing a document. A big advantage of this is that you can freely adjust the sharding strategy for different business dimensions.
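As a sketch against the demoindex/demotype mapping created earlier (the routing value 5001 is just an illustrative user id), index a document routed by user id:

curl -XPOST 'http://192.168.0.10:9200/demoindex/demotype/1?routing=5001' -d '{"id": 1, "userId": 5001, "name": "test user", "contents": "hello routing"}'

and then query only the shard that holds that user's data:

curl -XPOST 'http://192.168.0.10:9200/demoindex/demotype/_search?routing=5001' -d '{"query": {"term": {"userId": 5001}}}'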

That is how to use ElasticSearch as a big data distributed, flexible search engine. I hope you have picked up some knowledge or skills from it; if you want to learn more, keep studying and practicing.
