2025-03-29 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains how to de-duplicate two files in shell, that is, how to find the entries the two files have in common. The methods introduced here are simple, fast and practical.
Preface
We all know that shell is very good at text processing, such as merging and de-duplicating multiple text files. Recently, however, I ran into a hard problem: de-duplicating two large data files. The details follow.
Requirements
There are two txt files, A.txt and B.txt.
A.txt contains keywords and their search volume, separated by commas, about 900,000 lines.
B.txt contains keywords only, about 4 million lines.
The task is to find the keywords in A.txt that also appear in B.txt.
I tried several approaches, but none of the results were satisfactory. Strangest of all, some methods work on small test files but fail once they are run on A.txt and B.txt, which is really puzzling.
Approach 1:
awk -F, '{print $1}' A.txt > keywords.txt
cat keywords.txt B.txt | sort | uniq -d
# first extract the keywords from A.txt, then concatenate them with B.txt, sort the result with sort, and use uniq -d to print the duplicated lines
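One caveat with this approach: uniq -d reports every line that occurs more than once in the combined stream, so a keyword repeated inside A.txt alone (or inside B.txt alone) is reported as well. The following is a minimal sketch that de-duplicates each input before combining them. The function name common_by_uniq and the sample data are illustrative assumptions, not part of the original article.

```shell
#!/bin/sh
# Hypothetical helper: print the keywords that occur in both files.
# $1 = "keyword,volume" file (like A.txt), $2 = keyword-only file (like B.txt)
common_by_uniq() {
    {
        awk -F, '{print $1}' "$1" | LC_ALL=C sort -u   # unique keywords from A
        LC_ALL=C sort -u "$2"                          # unique keywords from B
    } | LC_ALL=C sort | uniq -d                        # keep lines seen in both
}

# Tiny made-up sample data, for illustration only
tmp=$(mktemp -d)
printf 'apple,10\nbanana,20\napple,30\ncherry,5\n' > "$tmp/A.txt"
printf 'banana\ndurian\ndurian\ncherry\n' > "$tmp/B.txt"

common_by_uniq "$tmp/A.txt" "$tmp/B.txt"
```

With this sample data, apple (repeated only in A) and durian (repeated only in B) are not reported. Setting LC_ALL=C makes sort compare raw bytes, which keeps the ordering consistent between the two sort steps.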
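Approach 2:
awk -F, '{print $1}' A.txt > keywords.txt
# extract the keywords as before
comm -12 keywords.txt B.txt
# use the comm command to print only the lines that exist in both files
# note: comm expects both input files to be sorted already; on unsorted input it can miss common lines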
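Because comm compares the two files line by line in order, unsorted input is one plausible reason for a method working on a small test file and failing on the real data, although the article does not confirm this. A minimal sketch that sorts both inputs first is shown below; the function name common_by_comm, the temporary file names and the sample data are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: intersection via comm, sorting both inputs first.
# $1 = "keyword,volume" file (like A.txt), $2 = keyword-only file (like B.txt)
common_by_comm() {
    work=$(mktemp -d)
    awk -F, '{print $1}' "$1" | LC_ALL=C sort -u > "$work/a.sorted"
    LC_ALL=C sort -u "$2" > "$work/b.sorted"
    # -1 hides lines only in the first file, -2 hides lines only in the second
    LC_ALL=C comm -12 "$work/a.sorted" "$work/b.sorted"
    rm -rf "$work"
}

# Tiny made-up, deliberately unsorted sample data
tmp=$(mktemp -d)
printf 'zebra,1\napple,2\nmango,3\n' > "$tmp/A.txt"
printf 'mango\nzebra\nkiwi\n' > "$tmp/B.txt"

common_by_comm "$tmp/A.txt" "$tmp/B.txt"
```

The same locale (here LC_ALL=C) is used for sort and comm so that both agree on what "sorted" means.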
Approach 3:
awk -F, '{print $1}' A.txt > keywords.txt
for i in `cat keywords.txt`
do
    A=`grep -E -c "^$i$" B.txt`
    if [ "$A" != 0 ]
    then
        echo "$i"
    fi
done
# this approach is a little more complicated
# first extract the keywords, then use a for loop to match them one by one against B.txt (note the regular expression ^$i$). If the match count is not 0, the keyword is a duplicate, so it is printed
# the advantage is that it is safe; the disadvantage is that it is far too slow. 900,000 keywords are matched one by one against 4 million lines, and shell has no multithreading by default, so it takes too long.
Approach 4:
awk -F, '{print $1}' A.txt > keywords.txt
cat keywords.txt B.txt | awk '!a[$1]++'
# actually I don't quite understand the principle. The awk command is too powerful and profound, but this method is simple and fast.
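On how the awk one-liner works: a is an associative array keyed by the first field, and a[$1]++ yields the count from before the increment. The expression !a[$1]++ is therefore true only the first time a key is seen, and awk prints the line whenever the pattern is true. In other words, the command prints the merged list with repeats removed, not the list of keywords shared by the two files. To print only the shared keywords, a common awk idiom is to load one file into an array and test the other file against it. The sketch below does that; the function name common_by_awk and the sample data are assumptions, and it is not the author's method.

```shell
#!/bin/sh
# Hypothetical helper: print keywords of B that also occur in A.
# $1 = "keyword,volume" file (like A.txt), $2 = keyword-only file (like B.txt)
# The smaller file (A) is the one held in memory. Assumes A is not empty.
common_by_awk() {
    awk -F, '
        NR == FNR { seen[$1] = 1; next }   # first file (A): remember each keyword
        ($0 in seen) && !printed[$0]++     # second file (B): print each match once
    ' "$1" "$2"
}

# Tiny made-up sample data
tmp=$(mktemp -d)
printf 'apple,10\nbanana,20\ncherry,5\n' > "$tmp/A.txt"
printf 'cherry\ndurian\nbanana\ncherry\n' > "$tmp/B.txt"

common_by_awk "$tmp/A.txt" "$tmp/B.txt"
```

No sorting is needed, and the output follows the order of B.txt. The cost is memory: every keyword from the first file is kept in the array.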
There is actually another method using grep -v and grep -f, but I haven't tried it, so I won't list it here.
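For reference, the grep -f variant might look like the sketch below. It has not been run against the original 900,000-line and 4-million-line files, and the function name common_by_grep and the sample data are assumptions. The options used are standard: -f reads the patterns from a file, -F treats each pattern as a fixed string rather than a regular expression, and -x requires the pattern to match the whole line.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of B that exactly equal a keyword from A.
# $1 = "keyword,volume" file (like A.txt), $2 = keyword-only file (like B.txt)
common_by_grep() {
    kw=$(mktemp)
    awk -F, '{print $1}' "$1" > "$kw"   # keyword list extracted from A
    grep -Fxf "$kw" "$2"                # -F literal, -x whole line, -f pattern file
    rm -f "$kw"
}

# Tiny made-up sample data; "a.b" shows that the dot is matched literally
tmp=$(mktemp -d)
printf 'a.b,1\nfoo,2\n' > "$tmp/A.txt"
printf 'axb\na.b\nfoo\nfoobar\n' > "$tmp/B.txt"

common_by_grep "$tmp/A.txt" "$tmp/B.txt"
```

Adding -v to the same grep call inverts the match and lists the lines of B.txt that are not keywords in A.txt. Unlike the loop in approach 3, this scans B.txt once instead of once per keyword.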