This article introduces how search engines work: the history of search, the main categories of engines, the core implementation techniques such as crawling and inverted indexing, and several open-source solutions. I hope you find it useful.
The development of search engines
The ancestor of the modern search engine is Archie, created in 1990 by Alan Emtage, a student at McGill University in Montreal. Although the World Wide Web had not yet appeared, file transfer over the network was already frequent, and because large numbers of files were scattered across many FTP hosts, finding anything was very inconvenient. Emtage therefore had the idea of building a system that could find files by file name, and so Archie was born. Its working principle is very similar to that of today's search engines: it relied on scripts to automatically trawl the files on the network, then indexed the relevant information so that users could query it with simple expressions.
After the rise of the Web, a tool was needed to monitor its growth. The world's first "robot" program for measuring the size of the Internet was the World Wide Web Wanderer, developed by Matthew Gray. At first it was used only to count the number of servers on the Internet; later it was extended to retrieve the domain names of websites.
With the rapid development of the Internet, large numbers of new websites and pages appeared every day, and retrieving all of them became more and more difficult. Building on Matthew Gray's Wanderer, programmers therefore improved the working principle of the traditional "spider" program, and modern search engines developed from that basis.
Search engine classification
Full-text search engines
The mainstream today is the full-text search engine, with Google and Baidu as typical representatives. A full-text search engine extracts information from every website on the Internet (mainly web page text) and saves it in a database it builds itself. When a user issues a query, the system retrieves the records matching the user's query conditions and returns them in a ranked order. Judged by where their results come from, full-text engines fall into two types: one runs its own indexing program (Indexer), commonly called a "Spider" or "Robot", builds its own web page database, and serves results directly from its own data store; the other rents another engine's database and presents the results in its own format, the Lycos engine being an example.
Directory index search engines
Although a directory index offers a search function, strictly speaking it is not a true search engine; it is simply a list of website links organized into categories. Users find the information they need by browsing the classified catalogue rather than by querying with keywords (Keywords). The most representative directory indexes are the famous Yahoo directory and Sina's classified directory search.
Meta search engines
When a meta search engine receives a user's query, it searches several other engines at the same time and returns the combined results. Well-known meta search engines include InfoSpace, Dogpile and Vivisimo; a representative Chinese one is the Soso Star search engine. In how they arrange results, some list them directly by source engine, as Dogpile does, while others re-rank and merge them according to self-defined rules, as Vivisimo does.
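As a minimal sketch of this fan-out-and-merge idea, the Java fragment below queries two engines concurrently and merges the results; queryEngineA and queryEngineB are hypothetical stand-ins for real engine calls, and the merge rule is deliberately simple.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MetaSearch {
    // Hypothetical stand-ins for calls to two underlying search engines.
    static List<String> queryEngineA(String q) { return List.of("a1:" + q, "shared:" + q); }
    static List<String> queryEngineB(String q) { return List.of("b1:" + q, "shared:" + q); }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        String q = "inverted index";

        // Fan the query out to both engines at the same time.
        Future<List<String>> a = pool.submit(() -> queryEngineA(q));
        Future<List<String>> b = pool.submit(() -> queryEngineB(q));

        // Merge with a simple self-defined rule: keep arrival order, drop duplicates.
        Set<String> merged = new LinkedHashSet<>();
        merged.addAll(a.get());
        merged.addAll(b.get());
        pool.shutdown();
        System.out.println(new ArrayList<>(merged));
    }
}
```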
Implementation technologies
Although a search engine product usually presents only a single input box, many different business engines stand behind the service it provides; each business engine involves many different strategies, and each strategy is handled by many modules working together, which makes the whole system complicated.
The search engine itself draws on web crawling, page quality evaluation, anti-cheating, database construction, inverted indexing, index compression, online retrieval, ranking strategies, and so on.
Web crawler technology
Web crawler technology refers to crawling data from the network. Because the crawl follows links from one page to the next, roaming across the Internet the way a spider roams a web, we vividly call the technique web crawling. A web crawler is also known as a web robot or a web chaser.
A web crawler obtains web pages in exactly the same way we do when browsing: via the HTTP protocol. The process mainly includes the following steps:
1) Connect to a DNS server and resolve the domain name of the URL to be crawled (URL ---> IP).
2) Following the HTTP protocol, send an HTTP request and obtain the content of the web page.
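A minimal Java sketch of these two steps (the seed URL is only an example, and a real crawler would add timeouts, robots.txt checks and error handling):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SimpleFetcher {
    public static void main(String[] args) throws IOException {
        URL url = new URL("https://example.com/"); // example seed URL

        // Step 1: resolve the host name to an IP address (URL ---> IP).
        InetAddress addr = InetAddress.getByName(url.getHost());
        System.out.println(url.getHost() + " -> " + addr.getHostAddress());

        // Step 2: send an HTTP GET request and read the page content.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", "toy-crawler/0.1");
        String body = new String(conn.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        System.out.println(body.substring(0, Math.min(200, body.length())));
    }
}
```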
A complete web crawler architecture is shown in the following figure:
The whole architecture works through the following steps:
1) The demand side provides a list of seed URLs to crawl, and the crawler builds the URL queue to be crawled from that list and the corresponding priorities (first come, first served).
2) Pages are crawled in the order of the URL queue.
3) The fetched page content is downloaded into the local page repository, and the URL is recorded in a crawled-URL list (used for de-duplication and for judging the progress of the crawl).
4) New URLs extracted from the crawled pages are put into the URL queue to be crawled, and the loop continues; a sketch of this loop follows.
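A minimal Java sketch of that loop, where download and extractLinks are hypothetical helpers standing in for the fetch step sketched above and an HTML link parser:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class CrawlerLoop {
    // Hypothetical helpers: an HTTP fetch (as sketched earlier) and a link extractor.
    static String download(String url) { return ""; }
    static List<String> extractLinks(String html) { return List.of(); }

    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<>(List.of("https://example.com/")); // seed URLs
        Set<String> crawled = new HashSet<>(); // crawled-URL list, used for de-duplication

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!crawled.add(url)) continue;       // skip URLs already fetched
            String html = download(url);           // 2) fetch the page
            // 3) store `html` in the local page repository here
            for (String link : extractLinks(html)) // 4) feed newly found URLs back in
                if (!crawled.contains(link)) frontier.add(link);
        }
    }
}
```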
Indexing
From the user's point of view, searching means finding specific content among a set of resources by keywords. From the computer's point of view, there are two ways to implement this. One is to match every resource against the keywords one by one and return everything that matches; the other is to build a lookup table in advance, like a dictionary, mapping keywords to the resources that contain them, and to consult this table directly at search time. The second method is obviously far more efficient. Building this lookup table is exactly the process of constructing an inverted index.
Lucene
Lucene is a high-performance Java full-text retrieval toolkit that uses an inverted file index structure.
Full-text retrieval consists of two processes: index creation (Indexing) and index search (Search).
Index creation: the process of extracting information from structured and unstructured data in the real world and building an index from it.
Index search: the process of receiving the user's query, searching the created index, and returning the results.
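A minimal sketch of both processes using Lucene's own API, assuming Lucene 8.x or 9.x (core plus the queryparser module) on the classpath; the field name and the text are invented:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index, just for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index creation (Indexing): add one document to the inverted index.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Lucene is a full-text retrieval toolkit", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Index search (Search): parse a query and look it up in the index.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(
                new QueryParser("body", analyzer).parse("retrieval"), 10).scoreDocs;
            for (ScoreDoc hit : hits)
                System.out.println(searcher.doc(hit.doc).get("body"));
        }
    }
}
```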
What unstructured data stores is, for each file, the strings the file contains: given a file, obtaining its strings is easy. That is a mapping from files to strings. The information we want when searching is the opposite: given a string, which files contain it, that is, a mapping from strings to files. So if the index stores the string-to-file mapping in advance, search speed improves enormously.
Because the string-to-file mapping is the reverse of the file-to-string mapping, an index storing this information is called an inverted index.
The information an inverted index saves generally looks like this:
Suppose our collection contains 100 documents. For ease of presentation, we number them from 1 to 100 and obtain the following structure: each string points to a linked list of the documents (Document) that contain it, and this list is called a posting list (Posting List).
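A toy Java sketch of building such a table, with three invented documents in place of 100:

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        // Invented document collection, numbered for ease of presentation.
        Map<Integer, String> docs = Map.of(
            1, "search engines crawl the web",
            2, "an inverted index maps terms to documents",
            3, "web crawlers feed the search index");

        // term -> sorted set of IDs of documents containing the term (the posting list)
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        docs.forEach((id, text) -> {
            for (String term : text.toLowerCase().split("\\s+"))
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        });

        System.out.println(index.get("web"));    // -> [1, 3]
        System.out.println(index.get("search")); // -> [1, 3]
    }
}
```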
Elasticsearch
Elasticsearch is a real-time distributed search and analytics engine that can be used for full-text search, structured search and analytics, and of course for combinations of the three. Elasticsearch is built on the full-text search library Apache Lucene™, but Lucene is only a library: to make full use of it you must work in Java and integrate Lucene into your program. Elasticsearch uses Lucene as its internal engine, yet for full-text search you only need its uniform API, without knowing how the complex Lucene machinery behind it works.
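A minimal sketch of that uniform API spoken over plain HTTP from Java, assuming an Elasticsearch node listening on localhost:9200; the index name articles and the document body are invented, and error handling is omitted:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EsDemo {
    static String call(String method, String path, String json) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL("http://localhost:9200" + path).openConnection();
        conn.setRequestMethod(method);
        if (json != null) {
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            conn.getOutputStream().write(json.getBytes(StandardCharsets.UTF_8));
        }
        return new String(conn.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Index a document; refresh=true makes it searchable immediately for the demo.
        call("PUT", "/articles/_doc/1?refresh=true",
             "{\"title\":\"How search engines work\",\"body\":\"crawl, index, rank\"}");
        // Full-text search using the URI query syntax; Lucene does the work underneath.
        System.out.println(call("GET", "/articles/_search?q=body:crawl", null));
    }
}
```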
Solr
Solr is a Lucene-based search server. Solr provides faceted search and hit highlighting, and supports several output formats (including XML/XSLT and JSON). It is easy to install and configure and comes with an HTTP-based management interface. Solr is used on many large websites and is comparatively mature and stable. Solr wraps and extends Lucene, so it largely follows Lucene's terminology; more importantly, the indexes Solr creates are fully compatible with the Lucene search engine library. With appropriate configuration, and in some cases a little coding, Solr can read and use indexes built by other Lucene applications. In addition, many Lucene tools (such as Nutch and Luke) can also work with indexes created by Solr.
Hadoop
A series of technical papers published by Google led to the birth of Hadoop. Hadoop is a family of big-data processing tools that run on large-scale clusters. It has evolved into an ecosystem comprising many components, as shown in the figure.
Cloudera is a company that applies Hadoop technology to search: users can run full-text queries against data stored in HDFS (the Hadoop Distributed File System) and Apache HBase. Its search functionality builds on the open-source engine Apache Solr and uses Apache ZooKeeper for distributed coordination, index sharding, and high-performance retrieval.
PageRank
Google's PageRank algorithm is based on the random-surfer model. Its basic idea is mutual voting between websites, what we usually describe as sites pointing to one another: a website judged to be high quality should be referenced by many high-quality sites, or by a large number of high-quality, authoritative sites.
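A minimal Java sketch of PageRank by power iteration; the four-page link graph is invented, 0.85 is the conventional damping factor from the random-surfer model, and 50 iterations is an arbitrary cutoff (a real implementation would also handle pages with no outlinks):

```java
import java.util.Arrays;

public class PageRank {
    public static void main(String[] args) {
        // Invented 4-page web: links[i] lists the pages that page i points to.
        int[][] links = { {1, 2}, {2}, {0}, {0, 2} };
        int n = links.length;
        double d = 0.85;          // damping factor: probability of following a link
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n); // start from a uniform distribution

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n); // probability of jumping to a random page
            for (int i = 0; i < n; i++)
                for (int j : links[i])
                    next[j] += d * pr[i] / links[i].length; // page i "votes" for page j
            pr = next;
        }
        // Pages referenced by strong pages end up with higher scores.
        System.out.println(Arrays.toString(pr));
    }
}
```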
Internationalization
Frankly, although Google does an excellent job in both technology and product design, internationalization is genuinely hard to get right, and other search engines still have room to live in segmented markets. In South Korea, Naver is users' first choice; its search is based on Yahoo's Overture system, while its advertising system was developed in-house. In the Czech Republic, Seznam is used more often. In Sweden, users are more likely to choose Eniro, which started out as a Swedish yellow-pages company.
Internationalization, personalized search, anonymous search: these are needs that products like Google cannot fully cover. In fact, no single product can serve every need.
Implement your own search engine
If we want to implement a search engine ourselves, the two essential parts are the index module and the search module. The index module indexes resources on different machines and transfers the index files to one place (either a remote server or the local machine). The search module uses the data gathered by the index modules to answer users' search requests. The two modules are therefore relatively independent: they are connected not by code but by the index and its metadata, as shown in the following figure.
For index building we must pay attention to performance. When the number of resources to index is small, a full re-index at regular intervals does not take long. In large-scale applications, however, the volume of resources is huge, and a complete re-index every time would cost an astonishing amount of time. We can solve this by skipping content that is already indexed, deleting index entries for content that no longer exists, and indexing incrementally; this may involve file checksums, index deletion, and so on. Separately, the framework can provide a query cache to improve query efficiency: a first-level cache in memory, plus a second-level cache on disk implemented with a caching framework such as OSCache or EHCache. When the indexed content does not change often, a query cache markedly improves query speed and reduces resource consumption.
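A minimal sketch of the in-memory first level as an LRU map; the capacity and the idea of keying on the raw query string are illustrative assumptions, not the API of OSCache or EHCache:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// First-level (in-memory) query cache: query string -> cached result.
public class QueryCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public QueryCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU behaviour
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used query result
    }
}
```

A miss on this map would then fall through to the disk-backed second level (for example, one built with EHCache) before finally hitting the index itself.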
Search engine solutions
Sphinx
Sphinx is an open-source full-text search engine developed by a Russian company. A single index can hold up to 100 million records, and with 10 million records the query time is on the order of 0.x seconds. Sphinx also builds indexes very quickly: according to figures published online, indexing 1 million records takes only 3-4 minutes, indexing 10 million records can be completed in about 50 minutes, and an incremental index containing only the latest 100,000 records takes just a few seconds to rebuild.
OmniFind
OmniFind is an enterprise search solution from IBM. Based on UIMA (Unstructured Information Management Architecture), it provides powerful indexing and retrieval functions, supports large volumes of documents of all kinds (structured or unstructured), and is specially optimized for Lotus® Domino® and WebSphere® Portal.
Next-generation search engines
From both a technical and a product point of view, no search engine looks likely to shake Google's technical leadership and product position in the next few years, or even longer. But we can already observe some telling phenomena: when searching for holiday rentals, for example, people prefer Airbnb to Google, an anonymous/personalized search need that Google cannot fully cover, since the underlying data is not in Google's hands. Consider DuckDuckGo, a search engine unlike what the public usually imagines: it emphasizes the best answers rather than more results, and because it does not profile its users, different people searching the same keywords see the same results.
Another technological trend is the introduction of artificial intelligence into the search experience: large numbers of algorithms analyze what users search for and what they prefer to visit, optimize titles and summaries to a degree, and present results in a more understandable way. Google is ahead of other vendors in bringing AI to search; it formally began this revolution in 2016 with the retirement of Amit Singhal and the succession of John Giannandrea. Giannandrea is a top expert in deep neural networks, systems loosely modeled on the neurons of the human brain. By analyzing data at vast scale, these networks can learn tasks such as classifying pictures or recognizing smartphone voice commands, and they can be applied to search engines as well. The transition from Singhal to Giannandrea therefore also marks a transition from a search engine governed by hand-tuned, human-intervention rule sets to one driven by AI. With deep learning, through continuous model training the engine comes to understand content deeply and serve needs ever closer to what users actually want, which is at once its promise and, to some, its unsettling side.
The workflow of the Google search engine
[The original article illustrated Google's workflow with an overview diagram followed by a more detailed one; the images are not reproduced here.]