It took a long time to solve a Spark-in-Docker problem, so I am recording it here.
It is a very strange issue: Googling turned up no answers, and in the end the solution was found more by accident than by analysis.
Let's start with the architecture and environment.
Machine Z is the Docker host, and each container runs a Zeppelin instance for data analysis; the setup exists because the customer requires every user's environment to be fully isolated.
Zdocker denotes one such container running Zeppelin on Z.
Zdocker uses Docker's default bridge network to reach the external cluster.
The Hadoop cluster is viewed as two groups by the order in which nodes joined: group A is the first 31 machines, group B is the 15 machines added later.
Spark jobs are submitted in client mode, with YARN handling resource management.
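For orientation, here is a minimal sketch of what the driver-side configuration for such a setup might look like in Scala, assuming Spark 1.x (the spark-assembly era mentioned later) and YARN client mode; the application name and resource values are placeholders, not taken from the real cluster, and Zeppelin itself would build the context from equivalent interpreter settings rather than this exact code.

import org.apache.spark.{SparkConf, SparkContext}

// Driver-side configuration: the driver itself runs inside the Zdocker
// container, while YARN allocates executors on the cluster nodes (A and B).
val conf = new SparkConf()
  .setAppName("zeppelin-analysis")       // placeholder application name
  .setMaster("yarn-client")              // client mode: driver stays in the container
  .set("spark.executor.memory", "4g")    // placeholder resource settings
  .set("spark.executor.instances", "10")

val sc = new SparkContext(conf)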
Symptoms:
Spark jobs submitted through Zeppelin in Zdocker ran normally until the 15 machines of group B joined the cluster.
The 15 machines in B are configured identically to the existing nodes, from operating system to network to JVM to Hadoop; there is no difference.
After B is added, no Spark job submitted from Zdocker can run: whenever an executor lands on one of the 15 B machines, it reports NoRouteToHost for gateway:7337.
After B is added, all Spark jobs submitted outside Zdocker run normally.
After B is added, no executor runs normally on the B machines.
After B is added, all MapReduce jobs, both inside and outside Zdocker, run normally.
With B's 15 machines removed from YARN, everything returns to normal.
Analysis:
Port 7337 is the YARN external shuffle service port for Spark.
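For context on where that port comes from: 7337 is the default value of spark.shuffle.service.port, the external shuffle service that runs as the spark_shuffle auxiliary service inside each NodeManager. A hedged sketch of the client-side properties involved; the values simply restate the defaults.

import org.apache.spark.SparkConf

// Client-side view of the external shuffle service settings. The service
// itself runs inside each NodeManager as the spark_shuffle auxiliary
// service (configured in yarn-site.xml); the driver and executors only
// read these properties.
val shuffleConf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true") // use the NodeManager-hosted shuffle service
  .set("spark.shuffle.service.port", "7337")    // 7337 is the default, the port in the error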
Phase I
Since the error is NoRouteToHost, the first reaction was a DNS problem. I checked every DNS and hosts file and found nothing wrong, then checked iptables and the routing tables, also nothing. The failure remained (a quick reachability probe of the kind used here is sketched below).
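A small self-contained probe of the kind that is useful here: resolve a host name and try to open a TCP connection to port 7337, catching the same NoRouteToHostException class the executors were reporting. The host name is a placeholder.

import java.net.{InetAddress, InetSocketAddress, Socket, UnknownHostException, NoRouteToHostException}

def probe(host: String, port: Int = 7337): Unit = {
  try {
    val addr = InetAddress.getByName(host)                 // DNS / hosts lookup
    println(s"$host resolves to ${addr.getHostAddress}")
    val sock = new Socket()
    sock.connect(new InetSocketAddress(addr, port), 3000)  // 3-second connect timeout
    println(s"$host:$port is reachable")
    sock.close()
  } catch {
    case e: UnknownHostException   => println(s"$host does not resolve: $e")
    case e: NoRouteToHostException => println(s"no route to $host:$port: $e")
    case e: Exception              => println(s"connect to $host:$port failed: $e")
  }
}

probe("somehost")   // "somehost" is a placeholder cluster node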
Phase II
Suspecting a Spark bug, I wanted to know where this "gateway" came from, so I read through the Scala and Java code along the Spark error path, but found nothing related to a "gateway".
Phase III
Since everything was normal before B joined and broke afterwards, I checked the DNS and hosts configuration inside the Docker container; all normal. Still no fix.
Phase IV
Suspecting differing environment variables, I compared the configuration of every A and B machine; the environment variables are exactly the same.
Because the failure involves the YARN shuffle service, I also suspected the spark-assembly jar, but checking it turned up no problem.
Phase V
I tried brute force: add a DNS entry forcing "gateway" to resolve to one machine in group B, which means every Spark external shuffle fetch points at port 7337 of that single machine. It ran fine for a day, but broke the next day: the machine "gateway" had been forced to reported that it could not find the job's shuffle index files locally. In hindsight that is expected, because each executor's shuffle output and its index files live on the node where the executor ran, so sending every fetch to one node fails as soon as blocks written elsewhere are requested.
Phase VI
I tried to find the problem inside Docker itself. Looking at the container's routing table, 172.17.42.1 is the Docker gateway. With nothing to lose, I pinged the hostname "gateway" from inside the container, and it went through. Suddenly it felt like there was a door to walk through.
The default gateway inside the container is 172.17.42.1, and it resolves under the hostname "gateway"; perhaps that was the key.
So I went to an A machine and pinged the hostname "gateway"; the ping hung, because neither DNS nor the hosts file on any Hadoop node resolves the name "gateway". Then I went to a B machine and pinged; it exited immediately with "ping: unknown host gateway".
So I began to suspect that the two groups of machines sat in different network environments; perhaps that was the problem.
The A machines, for security reasons, have no external network access at all, so when pinging "gateway" on an A machine the name cannot be resolved by anything and the command simply hangs. The B machines were newly added and operations forgot to shut off their external network access, so they can reach a public DNS server; the public DNS cannot resolve "gateway", so ping returns "unknown host" and exits immediately. The difference is illustrated below.
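The behaviour can be reproduced with a plain JVM name lookup. Roughly, under the situation described above: inside the container "gateway" resolves to the Docker bridge (172.17.42.1), on an A machine the lookup hangs because no DNS server is reachable at all, and on a B machine it fails fast with UnknownHostException because public DNS answers that the name does not exist.

import java.net.{InetAddress, UnknownHostException}

// Run this on each kind of host to compare name-resolution behaviour:
//   inside Zdocker   -> prints the Docker bridge address (172.17.42.1 here)
//   on an A machine  -> blocks for a long time (no DNS server reachable at all)
//   on a B machine   -> fails fast with UnknownHostException (public DNS: no such name)
try {
  val addr = InetAddress.getByName("gateway")
  println(s"gateway -> ${addr.getHostAddress}")
} catch {
  case e: UnknownHostException => println(s"gateway does not resolve: $e")
}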
We asked operations to shut off the B machines' public network access and then submitted jobs from Zdocker again; everything was back to normal.
Reason:
With the problem solved, let's go back and analyze it. Zeppelin submits Spark jobs from inside Docker, and Spark runs in client mode, so the driver runs inside the container. When the driver asks YARN, outside Docker, to allocate executors, it carries the hostname "gateway" along as part of the environment passed to the executor containers. On a node with no external network access the name cannot be resolved and the executor falls back to its local port 7337 as the YARN shuffle port; on a node with external access the executor queries public DNS for the gateway's IP, DNS returns an error, and the executor fails. This also explains why Spark jobs submitted outside Docker never report errors, whether or not the B machines are in the cluster.
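To see what the driver is actually advertising, one can dump the hostname and the Spark properties from inside the container. A minimal Zeppelin-paragraph-style sketch, assuming a live SparkContext named sc; it only inspects the driver's view, it does not by itself prove the exact propagation path described above.

import java.net.InetAddress

// Hostname the driver inside the container believes it has.
println("driver hostname: " + InetAddress.getLocalHost.getHostName)

// Spark properties the driver will ship to executors; look for anything
// carrying a host name or the shuffle port (sc is the Zeppelin SparkContext).
sc.getConf.getAll
  .filter { case (k, _) => k.contains("host") || k.contains("shuffle") }
  .foreach { case (k, v) => println(s"$k = $v") }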
So this is really a very low-level, maddening mistake, but nobody would have guessed that a Spark execution failure could hinge on whether a node can reach the public internet. It is like the slow MapReduce problem we solved last time: who would have thought the NIC would renegotiate itself down to 10 Mbps on its own?
Going forward, we need to keep studying Docker's networking model and the way Spark passes parameters between driver and executors in order to understand this problem completely.