This article introduces the advantages of a web crawler and search engine built on HBase. Many people run into practical difficulties when working through real cases, so the walkthrough below shows how this architecture handles those situations.
The web crawler architecture is a typical distributed, offline, batch-processing architecture based on Nutch and Hadoop. It delivers excellent throughput and crawling performance and offers a large number of configuration and customization options. Because the crawler is only responsible for fetching network resources, a distributed search engine is needed to index and search the fetched resources in near real time.
The search engine architecture is based on ElasticSearch, a typical distributed, online, real-time interactive query architecture with no single point of failure, high scalability, and high availability. It can index and search large volumes of information in near real time, quickly searching billions of documents and petabytes of data. It also offers comprehensive options: almost every aspect of the engine can be customized. It exposes a RESTful API, so every function, including search, analysis, and monitoring, can be invoked with JSON over HTTP. In addition, native client libraries are provided for Java, PHP, Perl, Python, Ruby, and other languages.
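For example, because all functionality is exposed as JSON over HTTP, a search can be issued with nothing more than the JDK's built-in HTTP client (Java 11+). This is a minimal sketch, assuming the engine runs at localhost:9200; the index name "webpages" and the field "title" are illustrative assumptions, not part of the original architecture:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsSearchExample {
    public static void main(String[] args) throws Exception {
        // Full-text query against a hypothetical "webpages" index.
        String query = "{\"query\":{\"match\":{\"title\":\"hbase\"}},\"size\":10}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/webpages/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        // The response is a JSON document with the matching hits and scores.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```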
The web crawler structurally extracts the captured data and submits it to the search engine to be indexed for query analysis. Because the search engine is designed for near-real-time complex interactive queries, it does not store the original content of the indexed web pages, so a near-real-time distributed database is needed to store the original page content.
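As a sketch of this division of labor, the indexed document might carry only the extracted fields plus the row key under which the distributed database stores the full page. The field names, index name, and document id here are assumptions for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IndexParsedPage {
    public static void main(String[] args) throws Exception {
        // Only extracted fields are indexed; the raw HTML stays in HBase,
        // referenced by a shared key that doubles as the document id.
        String docId = "example-page-1";
        String doc = "{\"title\":\"Example\",\"text\":\"extracted body text\","
                + "\"rowKey\":\"" + docId + "\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/webpages/_doc/" + docId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(doc))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());  // 201 on first creation
    }
}
```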
The distributed database architecture is based on HBase and Hadoop, a typical distributed, online, real-time random read/write architecture. It scales out strongly, supporting billions of rows and millions of columns; it can write the data submitted by the web crawler in real time and, working with the search engine, fetch data in real time according to the search results.
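A minimal sketch of this write/read path with the standard HBase Java client. The table name "webpage", the column family "content", and the URL-derived row key are assumptions for illustration, not the article's actual schema:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WebpageStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webpage"))) {

            // Write the raw HTML captured by the crawler in real time.
            byte[] rowKey = Bytes.toBytes("com.example.www/index.html");
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("raw"),
                    Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Random read: fetch the original page for a search hit.
            Result result = table.get(new Get(rowKey));
            byte[] raw = result.getValue(Bytes.toBytes("content"),
                    Bytes.toBytes("raw"));
            System.out.println(Bytes.toString(raw));
        }
    }
}
```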
The web crawler, distributed database, and search engine all run on clusters of ordinary commodity hardware. The clusters use a distributed architecture that can scale to thousands of machines, with fault-tolerance mechanisms: the failure of some machine nodes causes neither data loss nor failed computing tasks. The system is not only highly available, failing over quickly when a node goes down, but also highly scalable: simply adding machines scales it out linearly, increasing both storage capacity and computing speed.
Relationships among web crawlers, distributed databases, and search engines:
1. After the web crawler parses a crawled HTML page, it adds the parsed data to a buffer queue, and two other threads process the data: one thread saves the data to the distributed database, and the other submits the data to the search engine for indexing (see the sketch after this list).
2. The search engine processes the user's search conditions and returns the search results. If the user views the snapshot of a web page, the original page content is fetched from the distributed database.
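A minimal sketch of the handoff in step 1, assuming one bounded queue per consumer so that both threads see every parsed page. ParsedPage and the two handler methods are hypothetical stand-ins for the real HBase and search-engine clients:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class IndexingPipeline {
    // Hypothetical parsed-page record; the real fields depend on the extractor.
    record ParsedPage(String rowKey, String title, String rawHtml) {}

    // One bounded queue per consumer, so both threads see every page and a
    // slow consumer applies backpressure without starving the other.
    private final BlockingQueue<ParsedPage> toDatabase = new LinkedBlockingQueue<>(10_000);
    private final BlockingQueue<ParsedPage> toIndexer = new LinkedBlockingQueue<>(10_000);

    // Called by the crawler thread after parsing an HTML page.
    public void submit(ParsedPage page) throws InterruptedException {
        toDatabase.put(page);  // full content for the distributed database
        toIndexer.put(page);   // extracted fields for the search engine
    }

    public void start() {
        new Thread(() -> drain(toDatabase, this::saveToDatabase), "db-writer").start();
        new Thread(() -> drain(toIndexer, this::submitForIndexing), "indexer").start();
    }

    private void drain(BlockingQueue<ParsedPage> queue, Consumer<ParsedPage> handler) {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                handler.accept(queue.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void saveToDatabase(ParsedPage page)    { /* HBase Put, as sketched earlier */ }
    private void submitForIndexing(ParsedPage page) { /* HTTP PUT to the search engine */ }
}
```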
[Figure: overall architecture of the web crawler, distributed database, and search engine]
In physical deployment, crawler cluster, distributed database cluster and search engine cluster can be deployed to the same hardware cluster or separately to form 1-3 hardware clusters.
The web crawler cluster has a dedicated configuration management system responsible for configuring and managing the crawlers. [Figure: web crawler configuration management system]
The search engine achieves high performance, high scalability, and high availability through shards and replicas. Sharding supports massively parallel indexing and search, greatly improving indexing and search performance as well as horizontal scalability; replicas provide data redundancy, so the failure of some machines does not affect normal use of the system, ensuring continuous high availability.
The structure of an index with 2 shards, each kept in 3 copies, is as follows:
A complete index is divided into two independent shards, 0 and 1, and each shard has 2 replica copies in addition to the primary (the gray parts of the figure).
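A sketch of creating an index laid out like this figure, again over the RESTful API. With number_of_shards set to 2 and number_of_replicas set to 2, each of the two shards exists in three copies cluster-wide (one primary plus two replicas); the index name "webpages" is an assumption carried over from the earlier sketches:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateIndex {
    public static void main(String[] args) throws Exception {
        // 2 primary shards, each with 2 replicas: 6 shard copies in total.
        String settings =
                "{\"settings\":{\"number_of_shards\":2,\"number_of_replicas\":2}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/webpages"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(settings))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // {"acknowledged":true,...}
    }
}
```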
In production, as the data grows it is enough to simply add hardware nodes: the search engine automatically rebalances the shards across the new machines, and when some nodes are retired it automatically rebalances the shards onto the remaining hardware. Likewise, the number of replicas can be changed at any time as hardware reliability and storage capacity change. All of this is dynamic and requires no cluster restart, which is an important guarantee of high availability.
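The replica count is indeed a dynamic index setting. A sketch of raising it on a live index, without any restart, via the _settings endpoint (index name as assumed above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UpdateReplicas {
    public static void main(String[] args) throws Exception {
        // Raise the replica count of the live "webpages" index from 2 to 3.
        String update = "{\"index\":{\"number_of_replicas\":3}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/webpages/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(update))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // {"acknowledged":true}
    }
}
```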