Hadoop operation and maintenance record series (22)

2025-07-02 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/03 Report--

I spent part of this afternoon writing code, then helped a colleague analyze an HBase-related fault. We located the root cause, and since the problem seems fairly representative, I am recording it here.

First, the background and how the problem appeared.

There were actually two separate faults: the first relatively simple, the second more complex.

A colleague wrote some HBase test code using the native HBase Java API, compiled and packaged it, and tested it on two test servers. It worked on one and not on the other.

The first fault involved these two test servers:

Server A is the one that connects normally. It has HBase installed and runs CentOS 6.

Server B is the one that cannot connect. It has no HBase installed and runs CentOS 7.

The symptom: server A runs the compiled jar normally, while server B hangs when it connects to ZooKeeper to fetch the table list.

As it turned out, this had nothing to do with whether HBase was installed on the server.

The first suspect was the firewall. Turning off firewalld on server B did not help. Next I looked at netstat: server A connected to HBase over tcp, while server B, running CentOS 7, connected over tcp6. The HBase cluster runs CentOS 6 without IPv6, so I disabled tcp6 on server B, but that did not help either.

Checking netstat on the HBase ZK server showed that a connection from server B had been established (ESTABLISHED), but no data was returned.
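To see which address family a client would pick for a given host, a small check like the one below can help. It is only a sketch using the standard `java.net` classes. The class name `AddressFamilyCheck`, the helper `familyOf`, and the default host are made up for illustration and are not part of the colleague's program.

```java
import java.net.Inet4Address;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not from the original test program.
class AddressFamilyCheck {

    // Resolves a hostname or IP literal and reports the family of the first result.
    static String familyOf(String host) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            if (addr instanceof Inet4Address) return "IPv4";
            if (addr instanceof Inet6Address) return "IPv6";
            return "unknown";
        } catch (UnknownHostException e) {
            return "unresolved";
        }
    }

    public static void main(String[] args) {
        // Placeholder default; pass the real ZK hostname as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " -> " + familyOf(host));
    }
}
```

On Linux, a JVM typically opens dual-stack sockets when IPv6 is enabled, and netstat lists them as tcp6 even when the peer is reached over IPv4, so tcp6 in the output does not by itself indicate a fault. Starting the JVM with `-Djava.net.preferIPv4Stack=true` forces plain IPv4 sockets if you want to rule this out.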

Eventually, after 35 timeouts, the error log appeared. The cause was that server B had no hostname-to-IP mapping for host A, and the colleague's code connected to host A by a name that server B could not resolve. The program was therefore stuck resolving the hostname. Adding host A to /etc/hosts solved the problem.
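A quick way to catch this kind of problem before waiting for the client to time out is to check name resolution directly on the machine that runs the program. The following is a minimal sketch with the standard library only. The class name `HostResolveCheck` is hypothetical, and the hostnames to check are whatever names your own HBase client configuration refers to.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not from the original test program.
class HostResolveCheck {

    // Returns the resolved IP as text, or null if the name cannot be resolved.
    static String resolve(String hostname) {
        try {
            return InetAddress.getByName(hostname).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Pass the HBase and ZooKeeper hostnames to check as arguments.
        for (String host : args) {
            long start = System.nanoTime();
            String ip = resolve(host);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(host + " -> " + (ip == null ? "UNRESOLVED" : ip)
                    + " (" + ms + " ms)");
        }
    }
}
```

Any name reported as UNRESOLVED, or one that takes seconds to resolve, is a candidate for an /etc/hosts entry or a DNS fix on that machine.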

The second fault was more interesting. It was the same program, but we have two clusters, in machine rooms in Wuxi and Hangzhou. The code needs to run in Hangzhou and fetch data from the HBase cluster in the Wuxi machine room. The symptom was exactly the same as in the first fault: the table list could not be obtained through ZooKeeper. This time, though, the operating system environment was certainly fine.

Although the symptom was the same, the cause was worlds apart. Note: the × × × network segment in Wuxi is a class A address range starting with 10, and Hangzhou uses a class B range starting with 172. The operations colleagues had already set up a link between the Hangzhou server that needs to run the job and the HBase cluster in Wuxi.

We then tried jstack, strace, and various other tools, and none of them showed the problem.

jstack showed the program stuck in the thread running listTable. strace showed it stuck in FUTEX_WAIT, which revealed nothing further, and the error reported after 35 timeouts did not point to the problem either. HBase's listTable method only connects to a single ZK server, so there is no need to map all the HBase hostnames in hosts.

Looking at netstat again, I found that the Hangzhou machine room had sent a request to port 2181 of the ZK server on the Wuxi 10 network segment. As I recall, the connection was stuck in SYN_SENT and did not reach ESTABLISHED for a long time. I then immediately checked netstat on the Wuxi HBase side and found that no connection had been established at all.
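The same check can be made from the client side without the HBase client in the way, by attempting a plain TCP connect with a short timeout. This is a rough sketch and not the tooling we actually used. `PortProbe` is a hypothetical name, the default host is a placeholder, and 2181 is simply the ZK port mentioned above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper, not from the original test program.
class PortProbe {

    // Tries a plain TCP connect and reports how the attempt ended.
    static String probe(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return "CONNECTED";
        } catch (SocketTimeoutException e) {
            // No reply to the SYN arrived in time; while waiting, netstat
            // shows the socket in SYN_SENT.
            return "TIMEOUT";
        } catch (IOException e) {
            // Connection refused, unreachable network, unresolved host, and so on.
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1"; // placeholder
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        System.out.println(host + ":" + port + " -> " + probe(host, port, 3000));
    }
}
```

A TIMEOUT result combined with no matching entry in netstat on the server side suggests that packets are being lost somewhere along the path, rather than a problem in the application.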

After analyzing this with colleagues, we suspected that the route between the two machine rooms only worked in one direction.

Because the operations colleague who set up the × × was not in, I called to ask, and it was indeed a one-way route. With that, the problem was as good as solved: we only need to wait for that colleague to come back and add the entry to the routing table.

Solving the two problems took a little over an hour. It is almost time to leave work, and I still have not finished writing the Kerberos principal/keytab management page. But that is a story about another cluster.

It shows that a 40-year-old programmer is still useful, can solve some problems quickly, and is still worth having around.
