
Experiencing the Intel CPU vulnerability head-on

2025-04-27 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

I am a programmer who has not written code for more than three years, so this article should be read less as a technical article and more as an account of what a ToB big data startup serving thousands of customers went through. Many readers may not know our products, so let me briefly introduce our technology and business scenarios. We have several SaaS products: a free game data analytics platform for game companies, an Ad Tracking system for monitoring advertising effectiveness, and TrackingIO, which connects mobile ad monitoring with multidimensional user behavior analysis. Of these, TrackingIO has the most complex architecture and the most customers, and it handles billions of data points every day. The Intel CPU vulnerabilities mentioned in the title affected several of our SaaS products; I mainly use TrackingIO as the case study here.

System architecture introduction:

TrackingIO's big data architecture is built on the typical Hadoop ecosystem. We deployed our services on a hybrid cloud in 2016 and migrated to AWS in 2017. For user log collection we initially used Scribe and Flume, but we found that data was being lost, so we later wrote our own distributed log collection system in Java. For real-time computing we use the typical Kafka + Storm streaming architecture. For persistent NoSQL storage we use Redis in part and AWS DynamoDB in part. Real-time user behavior analysis combines Parquet + Kudu + Presto, offline computing uses AWS EMR + Hive, and Spark runs as an independent system for offline data mining. The system architecture is as follows:

Data flow:

Overall architecture:

Main business scenarios:

1. Customers integrate our Android or iOS SDK, or call our REST API from their client or server, and send data to our Log Receiver cluster.

2. After receiving the data, the Receiver cluster applies some business logic; part of the data is written to local disk and part is sent to the Kafka cluster.

3. The Storm cluster reads data from Kafka and applies business logic, part of which reads and writes Redis and part of which reads and writes DynamoDB.

4. The real-time multidimensional user behavior analysis MR program reads data from Kafka, writes it to Kudu, and synchronizes it to Hive. Offline data is stored as Parquet for batch processing, and finally a Presto view merges the two.

5. Offline jobs are all handled by the ETL system, which is not covered here.

How we discovered the impact of the vulnerability:

Our system is deployed on AWS. Normally, even at the daily traffic peak, Storm consumes data from the Kafka cluster with no message lag. Starting in the evening, however, our monitoring system began to detect a buildup of data in Kafka, and the backlog soon grew to more than 200 million messages, as shown in the figure:

A Kafka backlog can have many causes. For example, we have previously seen concurrent reads and writes to DynamoDB slow down Storm's consumption. Besides the Kafka backlog, we also found that the Receiver Cluster was suffering from reduced overall throughput, timeouts, and other problems.
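Message lag here means the gap between the newest offset written to each Kafka partition and the offset the consumer has already processed. A minimal sketch of that calculation follows, assuming the offsets have already been fetched into plain dictionaries. The function name, the partition layout, and the numbers are hypothetical and are not taken from our monitoring system.

```python
from typing import Dict


def total_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum per-partition lag: newest offset minus last processed offset.

    Partitions with no committed offset are treated as fully unconsumed
    (an assumption; a real monitor might start from the earliest offset).
    """
    lag = 0
    for partition, end in end_offsets.items():
        done = committed.get(partition, 0)
        lag += max(end - done, 0)  # never report negative lag
    return lag


if __name__ == "__main__":
    # Hypothetical offsets for a three-partition topic.
    end_offsets = {0: 1_000, 1: 2_500, 2: 4_000}
    committed = {0: 1_000, 1: 2_000}
    print(total_lag(end_offsets, committed))  # 0 + 500 + 4000 = 4500
```

A monitor that samples this value periodically can alert when the total keeps rising between samples, which is the pattern we saw that evening.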

Data traffic had not increased abnormally and our programs had not been updated, so this behavior puzzled us, but fixing the problem was the top priority. OPS gradually increased the number of servers in the Storm cluster fourfold and in the Receiver Cluster by 30%. The Receiver problem was quickly resolved, but the Kafka backlog was not consumed noticeably faster; it kept growing. Consumption stayed below 5 million messages every 10 minutes. Redis monitoring showed that the number of connections had gone up (see figure), yet Storm's Spout program was hitting a large number of timeouts.

Even after servers had been added to every tier, Storm's consumption rate still had not improved noticeably by 3 or 4 a.m., when traffic across the cluster was at its lowest. Our OPS team began to wonder whether this was the effect of the Intel CPU vulnerability disclosed by Google on January 2nd, but in the early hours we could not confirm the technical details with AWS. We had to wait until business hours on January 5th, when AWS confirmed it: they had applied kernel patches to fix the Intel CPU vulnerabilities, which reduced CPU performance on all servers, and the r3.large instance type we used (third-generation CPU) was hit hardest.

Solution:

Once AWS had officially confirmed this, we upgraded the instance type of the entire Redis cluster to r4.xlarge and also enlarged the Redis cluster. Storm's consumption rate then began to recover, rising to more than 12 million messages per 10 minutes, and monitoring showed the backlog starting to shrink. Because the backlog was so large, the load on ZooKeeper also increased, and along the way we had to expand ZooKeeper's disk space once. All of the backlog was consumed by the early morning of January 6th, with no data lost, as shown below:

For now, the way to deal with the impact of this Intel CPU vulnerability is to add machines; services that are CPU-intensive or rely on the CPU cache need even more servers. (This case is based on services deployed on AWS.)

If you do not use cloud services, you can avoid the performance impact by delaying the patch and letting your servers run unpatched, but the risk of data theft cannot be ruled out.
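To see why the added capacity mattered, the consumption rates quoted in this section can be turned into a rough drain-time estimate. The sketch below is only back-of-the-envelope arithmetic: the backlog and the two consumption rates come from the text, while the inflow rate is not stated anywhere in this article and defaults to zero, so the results are lower bounds. The function name is hypothetical.

```python
import math


def minutes_to_drain(backlog: int, consumed_per_10min: int,
                     inflow_per_10min: int = 0) -> float:
    """Estimate minutes until a backlog is cleared.

    Returns math.inf when consumption does not exceed inflow,
    because the backlog then never shrinks.
    """
    net = consumed_per_10min - inflow_per_10min
    if net <= 0:
        return math.inf
    return backlog / net * 10


if __name__ == "__main__":
    backlog = 200_000_000  # "more than 200 million" from the text
    before = minutes_to_drain(backlog, 5_000_000)   # rate before the Redis upgrade
    after = minutes_to_drain(backlog, 12_000_000)   # rate after the Redis upgrade
    print(f"{before / 60:.1f} h before vs {after / 60:.1f} h after, assuming zero inflow")
```

With zero inflow this gives roughly 6.7 hours at the old rate and 2.8 hours at the new one. With real traffic still arriving, the old rate may never have caught up at all, which matches the backlog growing instead of shrinking.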
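For self-managed Linux servers, one way to see whether a machine is running patched or unpatched is the kernel's sysfs report under /sys/devices/system/cpu/vulnerabilities, available on kernels that expose this interface. The sketch below is an illustration added here, not part of the tooling described in this article; the function names are hypothetical, and on systems without the directory it simply returns an empty result.

```python
from pathlib import Path
from typing import Dict, List

# sysfs directory where Linux kernels that support it report CPU vulnerability status.
SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_vulnerability_status(base: Path = SYSFS_DIR) -> Dict[str, str]:
    """Return {vulnerability name: kernel-reported status}, or {} if unavailable."""
    if not base.is_dir():
        return {}
    status = {}
    for entry in sorted(base.iterdir()):
        if entry.is_file():
            try:
                status[entry.name] = entry.read_text().strip()
            except OSError:
                status[entry.name] = "unreadable"
    return status


def unpatched(status: Dict[str, str]) -> List[str]:
    """Names whose status string starts with 'Vulnerable'."""
    return [name for name, text in status.items() if text.startswith("Vulnerable")]


if __name__ == "__main__":
    report = read_vulnerability_status()
    for name, text in report.items():
        print(f"{name}: {text}")
    print("Unpatched:", unpatched(report) or "none reported")
```

The kernel typically reports strings such as "Not affected", "Vulnerable", or "Mitigation: ..." for entries like meltdown and spectre_v2, so a simple prefix check is enough for a first pass.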

The impact:

1. The thousands of customers we serve could not view real-time data, so none of our advertising customers could see their ad monitoring data in real time, which significantly affected their ad delivery. The duration of the disruption was unprecedented since we began providing the service.

2. Because overall CPU performance declined, so did our overall computing power, and we had to add more computing units to compensate. Our assessment is as follows:

2.1 Redis: overall computing power down by more than 50%.

2.2 IO-intensive services such as the Receiver Cluster: down by 30%.

2.3 Compiled and executed programs: down by about 20%.

2.4 Other servers: down by about 5%.
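These percentages translate into extra servers less directly than they might appear: if each server loses a fraction d of its throughput, keeping the same total capacity takes 1 / (1 - d) times as many servers. A minimal sketch of that arithmetic follows, assuming the workload scales linearly across servers (an assumption that real clusters only approximate). The function names and the example fleet size are hypothetical.

```python
import math


def scale_factor(performance_drop: float) -> float:
    """Servers needed relative to before, given a fractional per-server drop."""
    if not 0 <= performance_drop < 1:
        raise ValueError("performance_drop must be in [0, 1)")
    return 1 / (1 - performance_drop)


def servers_needed(current: int, performance_drop: float) -> int:
    """Round up to whole servers; round() guards against float noise before ceil."""
    return math.ceil(round(current * scale_factor(performance_drop), 9))


if __name__ == "__main__":
    # Drops taken from the assessment above.
    drops = {"Redis": 0.50, "IO-intensive": 0.30,
             "compiled programs": 0.20, "other": 0.05}
    for name, drop in drops.items():
        print(f"{name}: x{scale_factor(drop):.2f} servers")
    print(servers_needed(10, 0.50), "servers to replace a 10-node fleet at a 50% drop")
```

A 50% drop therefore means doubling the fleet, a 30% drop needs about 43% more servers rather than 30% more, and a 20% drop needs 25% more.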

Why Redis performance degraded so much:

One reason is that the third-generation Intel CPUs on AWS were the most seriously affected and lost the most performance. The other is that Redis's design relies heavily on the CPU's L2 and L3 caches for performance. As we understand it, after the patch for this Intel CPU vulnerability the CPU cache becomes less effective, which hurts Redis performance (this point still needs more rigorous technical research).

About Intel CPU vulnerabilities:

Original article: (need × ×)

https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

What do you think of the Intel CPU design flaw disclosed on January 2, 2018?

https://www.zhihu.com/question/265012502/answer/289320187?utm_medium=social&utm_source=wechat_session&from=timeline&isappinstalled=0

A detailed explanation of how the Intel vulnerabilities expose kernel data:

https://mp.weixin.qq.com/s/2OBig3rejp1yUupBH9O7FA

More specialized analyses of the Meltdown and Spectre vulnerabilities:

https://meltdownattack.com/meltdown.pdf

https://spectreattack.com/spectre.pdf

https://securitytracker.com/id/1040071

Conclusion:

Judging from the impact of this vulnerability, our system architecture still has a lot of room for improvement, and the effect of this CPU-level vulnerability on the global computer and Internet industries is alarming. We hope the IT departments of all companies will patch this bug as soon as possible to prevent potential losses.

Finally, we thank all our customers for their understanding and support. We will, as always, keep providing better big data products and services, and we believe our engineering team will keep getting better at handling unexpected problems.
