How to use awk in Linux to delete duplicate lines in a file

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains in detail how to use awk in Linux to delete duplicate lines in a file. The editor finds it very practical and shares it here for your reference; I hope you get something out of it.

TL;DR

To keep the original order and delete the duplicate lines, use:

awk '!visited[$0]++' your_file > deduplicated_file

How it works

This script maintains an associative array whose keys are the unique lines of the file and whose values are the number of times each line has occurred. For each line of the file, if the line's occurrence count (so far) is 0, the line is printed; in either case, the count is then incremented by 1.
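A quick runnable check of the one-liner (the file name colors.txt is just for illustration):

```shell
# Create a small sample file containing duplicate lines.
printf 'red\ngreen\nred\nblue\ngreen\nred\n' > colors.txt

# Keep only the first occurrence of each line, preserving the original order.
awk '!visited[$0]++' colors.txt
# Prints:
# red
# green
# blue
```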

I was not familiar with awk before, and I wanted to figure out how such a short script achieves this. I did some research, and here are the findings:

This awk "script" !visited[$0]++ is executed on every line of the input file.

visited[] is an associative array (a.k.a. a map). awk initializes it the first time it is used, so we don't need to initialize it ourselves.

The value of the $0 variable is the content of the line currently being processed.

visited[$0] accesses the value stored in the map under the key $0 (the line being processed), i.e. the number of occurrences so far (which we set further below).

The ! operator negates the value that represents the number of occurrences:

If the value is empty, awk automatically converts it to 0 (a number), and negating 0 yields 1 (true).

Note: the plus-1 operation (the increment) is performed after the value of the variable has been read.

If the value of visited[$0] is a number greater than 0, the negation evaluates to false.

If the value of visited[$0] is 0 or an empty string, the negation evaluates to true.

In awk, the value of any non-zero number or any non-empty string is true.

The default initial value of a variable is an empty string, which is 0 when converted to a number.

In other words:

++ increments the value of the variable visited[$0] by 1.

In general, the whole expression means:

True: the number of occurrences is 0 or an empty string

False: the number of occurrences is greater than 0
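The post-increment behavior can be observed by printing the counter next to each line (a minimal sketch; the input letters are arbitrary):

```shell
# visited[$0]++ yields the value *before* the increment,
# so the first occurrence of each line prints 0.
printf 'a\nb\na\na\n' | awk '{ print $0, visited[$0]++ }'
# Prints:
# a 0
# b 0
# a 1
# a 2
```

The first time a line is seen, its counter is 0 (so !0 is true and the line would be kept); every later occurrence yields a positive number (false).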

An awk program consists of a pattern (or expression) and an action associated with it:

pattern { action }

If the pattern matches, the associated action is performed. If the action is omitted, awk prints the input line by default.

The omitted action is equivalent to {print $0}.

Our script consists of one awk expression statement with the action omitted. So writing:

awk '!visited[$0]++' your_file > deduplicated_file

is equivalent to writing:

awk '!visited[$0]++ {print $0}' your_file > deduplicated_file

For each line of the file, if the expression matches, the line is printed to the output. Otherwise, no action is performed and nothing is printed.
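This equivalence is easy to verify on a throwaway file (sample.txt is a hypothetical name):

```shell
printf 'x\ny\nx\n' > sample.txt

# Expression with the action omitted...
a=$(awk '!visited[$0]++' sample.txt)
# ...versus the same expression with an explicit {print $0} action.
b=$(awk '!visited[$0]++ { print $0 }' sample.txt)

[ "$a" = "$b" ] && echo "identical output"
# Prints: identical output
```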

Why not use the uniq command?

The uniq command can only de-duplicate adjacent lines. This is an example:

$ cat test.txt
A
A
A
B
B
B
A
A
C
C
C
B
B
A
$ uniq < test.txt
A
B
A
C
B
A

Other methods

Using the sort command

We can also use the following sort command to remove the duplicate lines, but the original line order is not preserved:

sort -u your_file > sorted_deduplicated_file

Using cat + sort + cut

The method above produces a deduplicated file whose lines are sorted by content. We can solve this problem by piping several commands together:

cat -n your_file | sort -uk2 | sort -nk1 | cut -f2-

How it works

Suppose we have the following file:

abc
ghi
abc
def
xyz
def
ghi
klm

cat -n test.txt prepends a sequence number to each line:

1  abc
2  ghi
3  abc
4  def
5  xyz
6  def
7  ghi
8  klm

sort -uk2 sorts based on the second column (the -k2 option) and keeps each distinct value of the second column only once (the -u option):

1  abc
4  def
2  ghi
8  klm
5  xyz

sort -nk1 sorts based on the first column (the -k1 option), treating the values of that column as numbers (the -n option):

1  abc
2  ghi
4  def
5  xyz
8  klm

Finally, cut -f2- prints each line from the second column through the end of the line (the -f2- option; note the trailing -, which means "and the rest of the line"), giving the final result:

abc
ghi
def
xyz
klm
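As an end-to-end check on the same sample data (reusing the test.txt name from the example), comparing plain sort -u with the pipeline:

```shell
printf 'abc\nghi\nabc\ndef\nxyz\ndef\nghi\nklm\n' > test.txt

# Plain sort -u: deduplicated, but alphabetically sorted.
sort -u test.txt
# Prints: abc def ghi klm xyz (one per line)

# The pipeline: deduplicated with the original order preserved.
cat -n test.txt | sort -uk2 | sort -nk1 | cut -f2-
# Prints: abc ghi def xyz klm (one per line)
```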

This is the end of the article on "how to use awk in Linux to delete duplicate lines in a file". I hope the above content has been of some help and that you have learned something new. If you think the article is good, please share it for more people to see.
