How to deduplicate two files in shell

2025-04-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article explains how to find the duplicates between two files in shell. The methods introduced here are simple, fast and practical.

Preface

We all know that shell is strong at text processing, such as merging multiple files and removing duplicates. Recently, however, we ran into a hard problem: deduplicating two large data files. Let's look at the details.

Requirements

There are two text files, A.txt and B.txt.

A.txt contains keywords and their search volumes, separated by commas, about 900,000 lines.

B.txt contains keywords only, about 4 million lines.

The goal is to find the keywords in A that also appear in B.

I tried several approaches, but the results were not satisfactory. The strangest part is that some methods work on test files with a small amount of data, but fail once they are run on A and B, which is really puzzling.

Approach 1:

    awk -F, '{print $1}' A.txt > keywords.txt
    cat keywords.txt B.txt | sort | uniq -d

    # First extract the keywords from A.txt, then concatenate them with B.txt,
    # sort the result with sort, and use uniq -d to print the duplicated lines.
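A minimal sketch of the pipeline above on made-up sample data (the keywords and the temporary directory are assumptions for illustration, not the author's files). It also shows one caveat: uniq -d reports any line that occurs more than once in the combined stream, so a keyword repeated inside A.txt alone is reported even if B.txt does not contain it. De-duplicating each input with sort -u first avoids that.

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A.txt: keyword,search volume. "apple" appears twice, but only in A.txt.
printf 'apple,100\nbanana,50\napple,70\ncherry,10\n' > A.txt
# B.txt: keywords only.
printf 'banana\ncherry\ngrape\n' > B.txt

awk -F, '{print $1}' A.txt > keywords.txt

# Original pipeline: "apple" is reported although it is not in B.txt.
naive=$(cat keywords.txt B.txt | sort | uniq -d)

# De-duplicate each file first, then look for lines present in both.
sort -u keywords.txt > keywords.uniq
sort -u B.txt > B.uniq
fixed=$(cat keywords.uniq B.uniq | sort | uniq -d)

printf 'naive:\n%s\nfixed:\n%s\n' "$naive" "$fixed"
```

Whether this caveat explains the failures on the real files is not known; it is only one thing worth checking.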

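Approach 2:

    awk -F, '{print $1}' A.txt | sort > keywords.txt
    sort B.txt > B.sorted.txt
    comm -12 keywords.txt B.sorted.txt

    # As before, extract the keywords. comm expects both inputs to be sorted,
    # so sort them first, then use comm -12 to show the lines that exist in both files.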
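comm compares its two inputs line by line and assumes both are sorted in the same collation order; unsorted input is one possible reason a comm-based approach looks fine on a small test and then misses matches on large files. The sketch below uses made-up sample data and pins the locale to C so that sort and comm agree on the ordering (the file names keywords.sorted and B.sorted are assumptions).

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Deliberately unsorted inputs.
printf 'pear,5\nbanana,50\napple,100\n' > A.txt
printf 'grape\napple\npear\n' > B.txt

# Use one collation order for both sort and comm.
LC_ALL=C
export LC_ALL

awk -F, '{print $1}' A.txt | sort -u > keywords.sorted
sort -u B.txt > B.sorted

# -1 hides lines found only in the first file, -2 hides lines found only
# in the second, which leaves the lines common to both.
common=$(comm -12 keywords.sorted B.sorted)
printf '%s\n' "$common"
```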

Approach 3:

    awk -F, '{print $1}' A.txt > keywords.txt
    for i in `cat keywords.txt`
    do
        A=`grep -c "^$i$" B.txt`
        if [ "$A" != 0 ]
        then
            echo "$i"
        fi
    done

    # This approach is a little more complicated.
    # First extract the keywords, then use a for loop to match them one by one
    # against B.txt (note the anchored regex ^$i$). If the match count is not 0,
    # the keyword is a duplicate, so print it.
    # The advantage is that it is reliable. The disadvantage is that it is far too
    # slow: 900,000 keywords are each matched against 4 million lines, and shell
    # does not run the loop in parallel by default, so it takes too long.
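The loop above rescans B.txt once per keyword. One alternative, which is not from the original article, is to load the keywords of A.txt into an awk associative array and then stream B.txt a single time. The sketch below assumes the keywords fit in memory and uses made-up sample data; the array names seen and printed are arbitrary.

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple,100\nbanana,50\ncherry,10\n' > A.txt
printf 'banana\ngrape\ncherry\nbanana\n' > B.txt

# While awk reads the first file (NR == FNR), remember its first
# comma-separated field. For the second file, print a line the first
# time it turns out to be a remembered keyword.
dups=$(awk -F, '
    NR == FNR { seen[$1]; next }
    ($0 in seen) && !printed[$0]++
' A.txt B.txt)

printf '%s\n' "$dups"
```

Because lookups in the array are by exact string, no regular expression is involved, so keywords containing characters such as . or * are compared literally.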

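Approach 4:

    awk -F, '{print $1}' A.txt > keywords.txt
    cat keywords.txt B.txt | awk '!a[$1]++'

    # To be honest, I don't fully understand how this works; awk is powerful and
    # deep. But this method is simple and fast.
    # Note: a[$1]++ is 0 the first time a key is seen, so !a[$1]++ is true only
    # then. The command therefore prints each distinct keyword once, that is, the
    # merged list with duplicates removed rather than the list of duplicates.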
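To see concretely what the awk expression prints, the sketch below runs it on made-up sample data next to a variant, a[$1]++ == 1, that is true only on the second sighting of a key and so prints each repeated keyword once. The variant is an illustration, not part of the original article. Both use $1, the first whitespace-separated field; for keywords that contain spaces, $0 (the whole line) would be the safer key.

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\nbanana\ncherry\n' > keywords.txt
printf 'banana\ncherry\ngrape\n' > B.txt

# a[$1]++ yields the count before incrementing: 0 on the first sighting,
# so the negation is true and the line is printed exactly once.
merged=$(cat keywords.txt B.txt | awk '!a[$1]++')

# True only when the count before incrementing is 1, i.e. on the second
# sighting, so each repeated keyword is printed once.
repeated=$(cat keywords.txt B.txt | awk 'a[$1]++ == 1')

printf 'merged:\n%s\nrepeated:\n%s\n' "$merged" "$repeated"
```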

There is actually another method using grep -v and grep -f, but I haven't tried it, so I won't list it here.
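For reference, a minimal sketch of what a grep -f based approach might look like, on made-up sample data. It was not tested by the author on the real files. -f reads the patterns from a file, -F treats them as fixed strings rather than regular expressions, -x requires a whole-line match, and -v inverts the selection.

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple,100\nbanana,50\ncherry,10\n' > A.txt
printf 'banana\ncherries\ngrape\ncherry\n' > B.txt

awk -F, '{print $1}' A.txt > keywords.txt

# Lines of B.txt that exactly equal some keyword from A.txt.
in_both=$(grep -F -x -f keywords.txt B.txt)

# With -v: lines of B.txt that match none of the keywords.
only_in_b=$(grep -F -x -v -f keywords.txt B.txt)

printf 'in both:\n%s\nonly in B:\n%s\n' "$in_both" "$only_in_b"
```

Without -x, the keyword cherry would also match the line cherries, which is why the whole-line flag matters here.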

That covers the ways of deduplicating two files in shell. The best way to get a deeper understanding is to try them on your own data.
