15-minute parallel artifact gnu parallel getting started Guide

2025-02-27 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/02 Report

GNU Parallel is a shell tool for performing computing tasks in parallel on one or more computers. This article briefly introduces the use of GNU Parallel.

CPUs these days are multicore.

[The original post had a series of joke images here showing how 2, 4, and 16 cores "work" together.]

Okay, I'll stop the jokes before Intel comes after me.

One bored weekend morning, I ended up spending half a day going through GNU Parallel's man page and tutorial. I have to say that half-day was worth it, because it feels like it will save me far more than half a day down the road.

This article does not attempt to translate GNU Parallel's man page or tutorial; translations already exist, which you can find here, or here.

At first I backed out after seeing the strange placeholders like {} and {.}; such ugly syntax was off-putting. Fortunately, I went straight to a few examples, started trying them out, and found that it is actually a killer tool.

The main purpose of this article is to tempt you into using this tool, and to tell you why and how to use it.

Why

There is only one reason to use GNU Parallel: to be fast!

Quick installation

(wget -O - pi.dk/3 || curl pi.dk/3/) | bash

The author says it installs in 10 seconds. In China the actual time may be longer, but not by much. It is really a single-file Perl script of more than 10,000 lines (yes, you read that right: all the modules are in this one file, which is a feature). Afterwards I wrote a fabric script to copy it directly to each node machine and chmod it executable.

Then there is fast execution, which takes advantage of the system's multiple cores:

[The original post had a screenshot here: timing grep over a 1 GB log, with and without parallel.]

The result is obvious: the difference is about 20x. That is far more effective than optimizing by switching to ack or ag.

Note: this was measured on a 48-core server.

How

The easiest way to understand it is by analogy to xargs. xargs has a -P parameter that can take advantage of multiple cores.

For example:

$ time echo {1..5} | xargs -n 1 sleep

real    0m15.005s
user    0m0.000s
sys     0m0.000s

This xargs invocation passes each number from the echo as an argument to sleep, so in total it sleeps 1+2+3+4+5 = 15 seconds.

If you use the -P parameter to run 5 jobs at once, sleep 1 through sleep 5 all run concurrently, so the whole thing finishes in the time of the longest one: 5 seconds.

$ time echo {1..5} | xargs -n 1 -P 5 sleep

real    0m5.003s
user    0m0.000s
sys     0m0.000s

With that groundwork done: parallel's first mode is essentially a replacement for xargs -P.

For example, to compress all the html files:

find . -name '*.html' | parallel gzip --best

Parameter transfer mode

The first mode is using parallel to pass arguments: whatever comes in front of the pipe is passed as arguments to the command that follows, which is then executed in parallel.

For example:

$ seq 5 | parallel echo pre_placeholder_{}
pre_placeholder_1
pre_placeholder_2
pre_placeholder_3
pre_placeholder_4
pre_placeholder_5

Here {} is a placeholder marking where the incoming argument is substituted.

In cloud computing operations there are often batch tasks, such as creating 10 cloud disks:

seq 10 | parallel cinder create 10 --display-name test_{}

Booting 50 CVMs:

seq 50 | parallel nova boot --image image_id --flavor 1 --availability-zone az_id --nic vnetwork=private --vnc-password 000000 vm-test_{}

Deleting CVMs in batch:

nova list | grep some_pattern | awk '{print $2}' | parallel nova delete

Rewrite for loop

As you can see, I have replaced many places where I would otherwise write loops with parallel, enjoying the speed of parallelism along the way.

The point is that a for loop is usually the easiest thing to parallelize, because the items in the loop are independent of each other.

In general, a shell loop of the form:

(for x in `cat list`; do do_something $x; done) | process_output

can be written directly as

cat list | parallel do_something | process_output

If the loop body is long:

(for x in `cat list`; do do_something $x; [... 100 lines that do something with $x ...]; done) | process_output

Then you'd better write a script.

doit() {
    do_something $1
    [... 100 lines that do something with $1 ...]
}
export -f doit
cat list | parallel doit

This also avoids a lot of troublesome escaping.

--pipe mode

The other mode is parallel --pipe.

Here, what comes before the pipe is not used as arguments; instead it is passed as standard input to the command that follows.

For example:

cat my_large_log | parallel --pipe grep pattern

Without --pipe, every line of my_large_log would be expanded into a command of the form grep pattern <line>. With --pipe, the result is no different from cat my_large_log | grep pattern, except that the input is split into chunks and handed to each core.

All right, that's it for the basic concepts! The rest is just specific parameters: how many cores to use, placeholder substitution, various tricks for passing arguments, running in parallel while keeping the output in order (-k), and magical cross-node parallel computation. Just take a look at the man page.

Bonus

Having a tool at hand that turns things parallel not only makes daily work a little faster, it is also handy for testing concurrency.

Many APIs have bugs under concurrent operation. For example, some limits are checked at the code level rather than with locks in the database; when concurrent requests arrive, each one passes the check on the server, and after they are all written the limit is exceeded. A plain for loop never triggers these problems because it runs serially. To really test concurrency you would normally write a script, or wrap something with Python's multiprocessing. But with parallel at hand, I just add the following two aliases to my bashrc:

alias p='parallel'
alias pp='parallel --pipe -k'

Creating concurrency is now so convenient: I just append a p to the pipe and can observe the response under load at any time.

For instance:

seq 50 | p -N0 -q curl 'example.com'

This sends requests concurrently, as many at a time as you have cores. -N0 means the seq output is not passed as an argument to the command that follows.

Gossip time: the Xianglin Sao of the GNU world (Lu Xun's character who repeats the same story to everyone)

As a lover of free-software gossip, whenever I find a novel piece of software I google its keywords with site:https://news.ycombinator.com and site:http://www.reddit.com/ to see what the reviews are like; the discussions often yield unexpected gems.

That is how I saw a complaint on Hacker News: every time you run parallel it pops up a notice saying that if you use the tool for academic work (a lot of life-science pipelines use it) you should cite the author's paper, or else pay him 10,000 euros. That is how I learned the word Nagware, which refers to software that nags you endlessly to pay up. Although I do think the paper should be cited when the tool is genuinely used in research, as one commenter said:

I agree it's a great tool, except for the nagware messages and their content. Imagine if the author of cd or ls had the same attitude...

What's more, the author is so fond of his software being cited that I even saw this in the NEWS file:

Principle time

A direct excerpt from the author's answer on Stack Overflow:

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time:

Conclusion

This article recommended a genuinely parallel tool, explained its two main modes, and along the way gossiped about the lesser-known side of the GNU world. I hope it is useful to you.

That is the whole content of this article; I hope it will be helpful to your study.

