AlphaDev, a new member of the Alpha family, has attracted a lot of attention and discussion recently. A researcher who once worked at Google explained the latest research in detail.
A few days ago, DeepMind unveiled AlphaDev, which speeds up sorting algorithms by as much as 70 percent.
The new AI system is based on AlphaZero, the agent that mastered Go, chess, and shogi.
The study aroused the interest of Justine Tunney, a former Google researcher.
"As the author of a C library, I've been looking for opportunities to curate the best available work," she says.
Let's take a look at how Justine breaks down the DeepMind sorting algorithm.
The discovery of the DeepMind sorting algorithm

DeepMind's work has won well-deserved attention, but unfortunately they could have explained AlphaDev better.
Let's start with the assembly code DeepMind released, which sorts an array of three items; Justine translated it from pseudo-assembly into real assembly.
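In spirit, the routine computes something like the following C sketch (a reconstruction based on the behavior Justine describes below, not DeepMind's or Justine's actual code):

    /* Reconstruction for illustration: a branchless kernel that
       moves the maximum of three ints into a[2], while possibly
       leaving a[0] and a[1] out of order. */
    void move37(int *a) {
      int lo  = a[0] < a[1] ? a[0] : a[1];  /* min(a[0], a[1]) */
      int hi  = a[0] < a[1] ? a[1] : a[0];  /* max(a[0], a[1]) */
      int max = hi < a[2] ? a[2] : hi;      /* overall maximum */
      int mid = hi < a[2] ? hi : a[2];      /* loser of hi vs. a[2] */
      a[0] = mid;
      a[1] = lo;
      a[2] = max;
    }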
I named this function move37(), because DeepMind's blog post compares it to AlphaGo's stunning "move 37".
During the 2016 match in which AlphaGo beat the legendary Go player Lee Sedol, it played a counterintuitive move, a simple shoulder hit, that proved decisive.
So if you run the DeepMind code:
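Here, the C reconstruction above stands in for the real thing:

    /* Exercising the sketch above (move37() must be in scope). */
    #include <stdio.h>
    int main(void) {
      int a[3] = {3, 1, 2};
      move37(a);
      printf("{%d, %d, %d}\n", a[0], a[1], a[2]);  /* prints {2, 1, 3} */
    }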
At first glance, this looks like a mistake.
The array we gave it is {3, 1, 2}, but move37() "sorts" it into {2, 1, 3}.
DeepMind must be pulling our leg, because 2 does not come before 1. Let's take a look at their open source contribution to LLVM libcxx, which clears things up.
So move37() is not really a sorting function, but a sorting kernel designed to be used as a building block for the sort3() function.
It would have been nice if the paper and blog posts had mentioned this, because it left me confused for quite a while. Here's a better version of the code, with the missing swap operation included:
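Sketched in C, on top of the move37() reconstruction above:

    /* The kernel plus the compare-exchange it leaves out makes a
       complete three-element sort. */
    void sort3(int *a) {
      move37(a);              /* the maximum is now in a[2] */
      if (a[0] > a[1]) {      /* the missing swap */
        int t = a[0];
        a[0] = a[1];
        a[1] = t;
      }
    }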
To explain why their code matters, let's consider how this algorithm works at a high level. When I first tried to solve the sort3() problem myself, I came up with this:
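Something like the following (a reconstruction of the obvious first attempt, not Justine's verbatim listing):

    /* The naive approach: three branchy compare-and-swap steps. */
    void sort3_naive(int *a) {
      int t;
      if (a[0] > a[1]) { t = a[0]; a[0] = a[1]; a[1] = t; }
      if (a[1] > a[2]) { t = a[1]; a[1] = a[2]; a[2] = t; }
      if (a[0] > a[1]) { t = a[0]; a[0] = a[1]; a[1] = t; }
    }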
Then I looked at libcxx and found that they were doing the same thing. The problem with the above code is that the compiler is not good at optimizing it.
If you try to compile the above code, you will notice that your compiler inserts a large number of branch instructions. This is what DeepMind set out to improve with its LLVM contribution.
However, their techniques are not exactly easy to understand.
I actually like my naive code, because if we squint at it, we can see a pattern that shares the same basic idea as DeepMind's state-of-the-art assembly code.
The idea is that the problem essentially boils down to three compare-and-exchange operations:
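In C, the pattern looks roughly like this (a sketch; the lo and hi temporaries correspond to the mov instructions discussed next):

    /* Branchless compare-exchange: compilers typically lower the
       ternaries to cmov instructions instead of branches. */
    static void cmpswap(long *a, long *b) {
      long lo = *a < *b ? *a : *b;
      long hi = *a < *b ? *b : *a;
      *a = lo;
      *b = hi;
    }

    void sort3_network(long *x) {
      cmpswap(&x[0], &x[1]);
      cmpswap(&x[1], &x[2]);
      cmpswap(&x[0], &x[1]);
    }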
The code above, or rather the assembly it compiles to, was the prior state of the art for sorting networks. Now, here is where DeepMind's new discovery comes into play: they found that the mov instructions which shuffle values into temporary registers are sometimes unnecessary.
If you run the code with and without the deleted line, you will find that it is 100% correct either way.
The instruction looks like it is doing something, but it actually does nothing. So I'm not surprised that such things went unnoticed by computer science for decades.
This is also a good moment to say more about how AlphaDev works.
DeepMind essentially built an artificial intelligence that fiddles with assembly code, deleting things more or less at random to see if anything breaks.
I'm not saying this to dismiss AlphaDev's intelligence; I'd be lying if I claimed I've never done the same thing.
There are two more mov instructions in the code above that we might delete the same way. Doing this with the ARM64 instruction set yields even smaller code for problems like this, because here we don't need any instructions to create temporary variables:
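For instance (an illustrative sketch; the assumed ARM64 lowering is shown in the comment, and actual output depends on your compiler):

    /* On ARM64, csel writes its result to a third register, so min
       and max need no scratch mov. Expected lowering, roughly:
           cmp  w0, w1
           csel w2, w0, w1, lt    (w2 = min)
           csel w3, w0, w1, gt    (w3 = max)                        */
    static int min32(int a, int b) { return a < b ? a : b; }
    static int max32(int a, int b) { return a < b ? b : a; }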
Arm has been in the limelight lately, and I think the example above is evidence that the attention is deserved.
Arm is also one of the best companies in the open source field. For example, their MbedTLS library is one of the most undervalued gems I have ever seen.
When I started using it, I had plans to modify Arm's code to make it run better on x86 hardware.
I wrote a bunch of carefully tuned assembly optimizations to bring it up to the same performance as OpenSSL on x86.
MbedTLS is simple, portable, hackable C code, so it's good news for anyone who wants a cryptography library that isn't assembly generated by Perl scripts.
I told the people at Arm what I was doing, and they didn't seem to mind at all.
I hope I find the time one day to do what DeepMind did and get my changes upstreamed. Arm's optimized-routines library is also prolific, and its quality is unimpeachable.
It is of particular interest to C libraries, because for decades the open source community got by on the math functions Sun Microsystems wrote in the early 1990s.
Arm found a way to improve several of these functions, such as pow(x, y). Considering that this is one of the most fundamental operations in mathematics, that is a high-impact contribution.
For example, if you use Arm's pure-software implementation of pow() on an x86 machine, it runs five times faster than Intel's native x87 instructions.
Fortunately, DeepMind is in the game now too, so I took the liberty of translating their libcxx diff into readable C code:
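Here is a sketch in that spirit (a reconstruction of the technique, not her literal translation):

    /* Compute both outcomes up front, then pick with ternaries;
       compilers typically lower the selects to cmov, so there are
       no branches. */
    static void cond_swap(long *a, long *b) {
      long x = *a, y = *b;
      int r = y < x;
      *a = r ? y : x;
      *b = r ? x : y;
    }

    void Sort3(long *x, long *y, long *z) {
      cond_swap(y, z);
      cond_swap(x, z);
      cond_swap(x, y);
    }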
This is another thing I would have liked to see in the paper and blog posts, because this code shows the canonical trick experts use to get compilers to generate branchless MOVcc instructions.
When I saw the Sort5() function, I felt I understood the motivation for DeepMind's research better.
If you compile the Sort5() function for ARM64, the compiler generates a function that juggles 11 registers. If you were reasoning through a mathematical equation, could you hold 11 variables in your working memory at once?
Probably not, and that's why a good kernel function like PartialSort3 is so useful.
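For a sense of scale, here is a five-element network built from the same compare-exchange primitive (a classic nine-comparator network, not necessarily the exact ordering libcxx uses):

    static void cmpswap(long *a, long *b) {
      long lo = *a < *b ? *a : *b;
      long hi = *a < *b ? *b : *a;
      *a = lo;
      *b = hi;
    }

    void Sort5(long *x) {
      cmpswap(&x[0], &x[1]);
      cmpswap(&x[3], &x[4]);
      cmpswap(&x[2], &x[4]);
      cmpswap(&x[2], &x[3]);
      cmpswap(&x[0], &x[3]);
      cmpswap(&x[0], &x[2]);
      cmpswap(&x[1], &x[4]);
      cmpswap(&x[1], &x[3]);
      cmpswap(&x[1], &x[2]);
    }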
It's worth mentioning that Sort3() and Sort5() are themselves kernels, in the sense that they're intended to be building blocks for a conventional sorting function.
The blog posts cover this topic, but I think it's useful to share something that is actually portable and executable:
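Here is one such sketch (an illustration of the structure, with arbitrary thresholds, not the exact code from her blog):

    /* Introsort-style longsort(): quicksort that hands tiny slices
       to a sorting kernel and short slices to insertion sort. A
       production version (like libcxx) also falls back to heapsort
       to bound the worst case. */
    static void cx(long *a, long *b) {
      long lo = *a < *b ? *a : *b;
      long hi = *a < *b ? *b : *a;
      *a = lo;
      *b = hi;
    }

    static void sort3_kernel(long *a) {
      cx(&a[0], &a[1]);
      cx(&a[1], &a[2]);
      cx(&a[0], &a[1]);
    }

    static void insertion_sort(long *a, long n) {
      for (long i = 1; i < n; i++) {
        long v = a[i], j = i;
        while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
        a[j] = v;
      }
    }

    void longsort(long *a, long n) {
      while (n > 32) {                   /* illustrative cutoff */
        long p = a[n / 2];               /* middle-element pivot */
        long i = -1, j = n;
        for (;;) {                       /* Hoare partition */
          while (a[++i] < p) ;
          while (a[--j] > p) ;
          if (i >= j) break;
          long t = a[i]; a[i] = a[j]; a[j] = t;
        }
        longsort(a, j + 1);              /* recurse into the left slice */
        a += j + 1;                      /* iterate on the right slice */
        n -= j + 1;
      }
      if (n == 3) sort3_kernel(a);
      else if (n > 1) insertion_sort(a, n);
    }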
The above algorithm shows what the new and improved libcxx is doing. It's basically quicksort, except it switches to the sorting kernels and insertion sort when recursing into smaller slices. With libcxx I think they even took the added step of schlepping in heapsort, which is kind of slow, but prevents adversaries from smashing your stack.
The main thing you may be wondering at this point is: can I use this? Do these sorting network kernels actually make sorting go faster? I would say yes and no. When all you want is to sort ascending longs, the code above will go 2x faster than the standard qsort() function provided by your C library. Except you don't need the kernels to do that. What I've determined so far is that, on my personal computer (which has an Intel Core i9-12900KS), the above function sorts longs at 255 megabytes per second. However, if I comment out the sorting kernels:
Then my longsort() function runs at 275 megabytes per second, achieving a 7% performance improvement by simplifying the algorithm.
The nice thing about longs is that they're wide enough to store int key-value pairs, and being able to sort map entries quickly is a useful trick.
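For instance (an illustrative sketch of the trick, not code from the article):

    #include <stdint.h>

    /* Pack an int key and an int value into one 64-bit word so that
       sorting the words sorts the entries by key. */
    static int64_t pack(int32_t key, uint32_t val) {
      return (int64_t)(((uint64_t)(uint32_t)key << 32) | val);
    }
    static int32_t key_of(int64_t e) { return (int32_t)((uint64_t)e >> 32); }
    static uint32_t val_of(int64_t e) { return (uint32_t)e; }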
The longsort() function above compiles down to just 181 bytes of x86-64 machine code.
Since DeepMind's sort3() is only 42 bytes, I was hoping I could trade away some of that size for a performance advantage.
The next best algorithm I've found so far is to switch to radix sort, which does 400 MB/s but requires 763 bytes of binary footprint, plus a dependency on malloc(). So it would be nice to see these kernels do better.
This is not to say that DeepMind's idea is worthless.
It's worth noting that DeepMind was very generous last year in giving us their vectorized quicksort library (they were still called Google Brain at the time), and in doing so set a sorting record that will be hard to challenge.
vqsort sorts longs at 1155 MB/s on my computer.
It even edges out djbsort, one of the most beloved libraries in the open source community, although djbsort was never extended to data types beyond int.
Both implementations work by vectorizing sorting networks, and I think this is where sorting network technology really shines.
I think this is what AlphaDev would have done, if it weren't still a toddler as intelligent entities go.
Supporting just the baseline instruction set is hard enough when you're starting from first principles. If we wait, I think we can look forward to great things from AlphaDev in the future as it works its way up to stronger challenges.
I also like that DeepMind made their algorithm smaller, because that's something I don't see very often.
Size coding is one of my favorite hobbies. On this blog, I've posted a 383-byte lambda calculus virtual machine and a 436-byte Lisp machine with garbage collection.
I've also written about the size optimization tricks I used in the Cosmopolitan C library.
I'm also fond of DeepMind's parent company: Google gave me an open source peer bonus a few weeks ago, and I'm glad to see them sharing my enthusiasm for making software smaller.
It's nice to see them using it to improve vectorized quicksort.
Finally, I love the idea of AI companies building machines that code in machine language. Why wouldn't they? The nature of a machine is the machine.
As a builder, I find this much less menacing than the future OpenAI is creating.
They have built a huge paternalistic machine to compete with every builder on the planet in a zero-sum economy, and are tempting rent-seekers around the world to control that machine through government regulation.
I don't think OpenAI's promise to automate all my favorite tasks, such as coding, is a step forward. What I want is to be in control of a machine that can do things I can't do myself, such as discovering sorting kernels. That is real progress.
I think every line of assembly we can shave away is a positive step toward that dream.
Reference:
https://justine.lol/sorting/

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).