Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to deeply analyze the implementation of ip2region

2025-04-04 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/01 Report--

How to deeply analyze the implementation of ip2region? in view of this problem, this article introduces the corresponding analysis and solution in detail, hoping to help more partners who want to solve this problem to find a more simple and feasible method.

In the application of mobile Internet, it is often necessary to do some statistical analysis of user-side information according to the location information of users. In order to get the user's location information, there are generally two ways: the GPS location information and the user's IP address. Since every phone doesn't necessarily turn on GPS, and sometimes you don't need a precise location (to the city level), it's a good choice to analyze the user's location based on the IP address. To do this, you need a mapping library of IP and geolocation, and rely on this library to start an IP geolocation service. The following combines the ip2region to analyze the design of the mapping library and how IP can be quickly transformed into geographic locations.

Introduction

IP location services are very common, and many companies provide similar paid services, such as Ali, Gaud, Baidu, etc., of course, there are also open free services, such as GeoIP, pure IP and so on. These services are either parsed through HTML pages or requested through interfaces, but they can't do without a http request anyway, not to mention that most services have restrictions on QPS. The following table enumerates some common ways to obtain addresses through IP.

Open API service restrictions sample Taobao IP address library interface each user's QPS is less than 1curl-d "ip=218.97.9.25&accessKey=alibaba-inc" http://ip.taobao.com/outGetIpInfo Amap interface each user has 100000 access restrictions per day, and enterprise developers have 30 million access restrictions curl "https://restapi.amap.com/v3/ip?ip=218.97.9.25&key=f4cf14aca974dfbb0501c582ce3fce77"GeoIPHTML parsing"

Curl-d "ip=218.97.9.25&submit= submission" https://www.geoip.com Pure IPHTML Analysis

Curl http://www.cz88.net/ip/?ip=218.97.9.25

In daily work, it is usually necessary to convert IP in a large number of user request logs into user location information for subsequent analysis. The key to this is a large amount of data and fast processing. It is obvious that our daily needs cannot be met by requesting API public services every time.

Brute force generation IP library

For daily requirements, a simple and rude way is to obtain the location information of all public network IP through API in advance. According to the following TIPS, we can estimate that it will take 10 years to traverse 330 million of domestic IP addresses by accessing Taobao IP address library. If it is Gaode's corporate users, it will take about 11 days to traverse the domestic IP address. I feel that this 11 days is still acceptable.

TIPS: IPv4

The IP address currently refers to IPv4, which uses 32-bit (4-byte) addresses, so the address space is about 4.29 billion $2 ^ 32 = 4294967296 $, but some addresses are reserved for special purposes, such as private networks (about 18 million) and multicast addresses (about 270 million), which reduces the number of addresses that can be routed on the Internet. According to wiki statistics, the number of IPv4 in China has reached 330 million, compared with 1.54 billion in the United States.

Here we agree on the data format of location information: country | region | Province | City | ISP. If the field returned in the API does not have the corresponding information, the corresponding field is filled with 0. Then we can get the following file data by requesting the API service sequentially (the address is incremented in turn):

0.0.0.0 | 0 | 0 | 0 | Intranet IP | Intranet IP0.0.0.1 | 0 | 0 | 0 | Intranet IP | Intranet IP...1.0.15.255 | China | 0 | Guangdong Province | Guangzhou | Telecom. 255.255.255.255 | 0 | 0 | Intranet IP | Intranet IP

As long as you have this file, you can read it into memory and save it using a dictionary with the key IP address and the value location information. The program can return location information within O (1) time complexity, but we can roughly calculate the size of the program or file. Suppose we use utf-8 for storage, the shortest case for a record is 0.0.0.0 | 0 | 0 | 0 | 0, which takes up 17 bytes. The size of the IP library file is 1704294967296 = 73014444032 B = 71303MB = 71GB. This size is unacceptable to any program.

Spatial optimization IP library file optimization

From the above file data, it is found that a large number of adjacent IP have the same location information (customers will try to connect together when applying for an IP address), so we can synthesize such records into one record. The following file data (address fields are incremented in turn):

0.0.0.0 | 0.255.255.255 | 0 | 0 | 0 | Intranet IP | Intranet IP...1.0.8.0 | 1.0.15.255 | China | 0 | Guangdong Province | Guangzhou | Telecom. 224.0.0.0 | 255.255.255.255 | 0 | 0 | Intranet IP | Intranet IP

The latest ip.merge.txt in the ip2region library has 658207 records and a file size of 39m.

IP address optimization

From the file data above, it is found that a large number of IP addresses are stored as strings, while IPv4 uses 32-bit addresses. So converting it to an integer for storage can save a lot of space, such as the shortest string 0.0.0.0 occupies 7 bytes, the longest string 111.111.111.111 occupies 15 bytes, and if converted to an integer, they all occupy 4 bytes. 0.0.0.0 is int (0) and 111.111.111.111 is int (1869573999).

Location information optimization

From the above file data, it is found that the same location information will correspond to different IP segments (customers may apply for IP segments in different periods of time), so there is still a lot of location information in the IP library file. We can keep only one location information in memory and use the pointer or file offset + data length to obtain the corresponding location information.

Optimized IP library

Based on the optimization above, we can generate the final IP library: ip2region.db, which is only 8.1m long.

The structure of IP Library

The structure of IP library file ip2region.db is divided into four parts: super block, header index area, data area, index area. It is shown in the following figure:

Super block

To save the start and end addresses of the index block, the first index pointer points to the start position of the index block, that is, the first index block of the first index partition, and the last index pointer points to the end of the index block-12, that is, the header address of the last index block of the last index partition. In this way, the address range of the index block can be quickly obtained by directly reading 8 bytes of the super block when querying.

Header index area

The header index is a secondary index of index blocks, specifically for b+tree search. The total length of the index area divided by the index partition length 12 * (1024, 8, 12-1) is the actual number of indexes of the header index. The area is 2048-8 bytes in size and consists of 2048 8-bytes header index blocks. The first four bytes of the header index block store the starting IP value of the first index block of each index partition, and the last four bytes point to the address of the index block.

The header index area is defined as close to 16k because the entire header index area can be read through four disk reads, and then queried in memory. The result of the query can determine that the ip is in an index partition of the index area, and then read the 8k index partition to memory twice according to the address, and then query it in memory, thus reducing the number of disk reads.

Data area

The saved data is in the following format: China | South China | Guangdong Province | Shenzhen City | Dr. Peng, representing country, region, province, city and operator respectively.

Index area

The index area is made up of index blocks, each of which occupies 12 bytes, including the start IP, the end IP, and the data information. In the data information, the first three bytes hold the data address, and the last byte holds the data length. Each index block corresponds to a record in ip.merge.txt, representing the index of an IP segment.

In the retrieval, when the specified IP is between the start IP and the end IP of an index block, it means that the index is hit. Then through the data address and data length in the index block, the corresponding location information data can be read from the ip2region.db.

Generation of IP Library

The ip2region.db generation process is provided in ip2region's Github repository, which is written in JAVA, and the class diagram is shown below:

By being familiar with the source code for generating ip2region.db, a brief description of the generation process is as follows:

Use RandomAccessFile to reserve 8 bytes super blocks and 2048'8 bytes header index areas in the file

Scan the ip.merge.txt file and process each record as follows:

According to the starting IP, ending IP and data of each record, an index block is generated, the first four bytes store the starting IP, the middle four bytes store the end IP, and the last four bytes store the calculated data address (written by RandomAccessFile, where a dictionary of location information to the file location is maintained to ensure that the same location information is written only once. And temporarily stores the index block in the indexPool linked list This step determines all the location information for the data area.

After scanning all the records in ip.merge.txt, write all the index blocks in indexPool to the back of the data area. In this process, int (10248x12-1) = 681index blocks are formed into an index partition, and the starting IP and address information (header block) of the first index block of each index partition are recorded and temporarily stored in the headerPool linked list. In addition, the start and end positions of the index area are recorded.

Adjust the RandomAccessFile to the beginning of the file, and the start position of the write index area is stored in the first four bytes of the super block, and the end position is stored in the last four bytes of the super block.

Continue writing header blocks from headerPool to the header area.

Adjust the RandomAccessFile to the end of the file and write the timestamp and copyright information.

TIPS: global_region.csv data is also used in the ip2region warehouse. The file has five columns (line number, region, zip code), corresponding to the specific information of the region, which can be filled in each location information of the data area.

Quick search

Ip2region provides three query algorithms, and the worst queries are all ms-level. They are memory binary search, b+tree search, binary search. The time-consuming increases in turn. The search structure is as follows:

Binary search

The start and end positions of the index area can be obtained through the super block, and each index block is 12 bytes, in which the IP address is incremented, so binary search can be used to quickly obtain location information. The steps are as follows:

Convert the IP value to an integer by the ip2long method

Read the super block to get the start position and end position of the index area, which subtract + 1 to get the total number of index blocks.

Using dichotomy to solve the problem directly, comparing the size of the starting IP, the ending IP and the current IP in the index block, we can find the corresponding index block of the IP. According to the last four bytes of the index block, we can get the data address and data length, thus get the location information.

B+tree search

B+tree search uses the header index area, the first step is to use binary search in the header index area, locate an index partition, and then use binary search in the corresponding index partition. It is faster than binary search because the number of disk reads is much lower than that of binary search. The steps are as follows:

Convert IP values to integers through ip2long

The dichotomy is used to search the header index area, and the corresponding header index blocks and their corresponding index partitions are compared.

Read the corresponding index partition, and then locate the corresponding index block by dichotomy, so as to obtain the location information.

Binary search based on memory

This method is similar to the binary search method, except that the former reads all the ip2region.db into memory, while the latter reads the ip2region.db file continuously.

The ip2region library only solves a very common IP location problem, but makes the service small and fast (and, of course, provides a multilingual client), resulting in an 8.4k star on Github.

Small memory footprint

The location information of the adjacent IP is the same. The IP segment is used to solve the problem that the adjacent IP corresponds to the same location information, so as to avoid repeated storage of the location information.

IP is converted to INT, such as the string 111.111.111.111 is converted to int (1869573999), from 15Byte to 4Byte

Different IP segments also have the same location information, which points to specific location information through a pointer to ensure that the location information is saved only once (full scan is stored in the dictionary)

Search speed is fast

IP is ordered, using binary search to reduce the time complexity to O (logN)

The use of the secondary index header index area reduces the disk read and write frequency, first determines the index partition, then determines the index location from the index partition, and then determines the location information data.

Multilingual client support

Supports java, C#, php, c, python, nodejs, php extensions (php5 and php7), golang, rust, lua, lua_c, nginx.

This is the answer to the question about how to analyze the implementation of ip2region in depth. I hope the above content can be of some help to you. If you still have a lot of doubts to be solved, you can follow the industry information channel to learn more about it.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report