
How to remove duplicate data lines from files in Linux


This article looks at how to remove duplicate data lines from files in Linux. The methods introduced here are simple, fast, and practical, so if you are interested, read on and try them out.

1. Remove adjacent duplicate data lines

The code is as follows:

$ cat data1.txt | uniq

Output:

Beijing

Wuhan

Beijing

Wuhan
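As a small aside (an addition to the original article), uniq can also report on duplicate runs instead of just collapsing them. A minimal sketch on the same data1.txt:

$ cat data1.txt | uniq -c

$ cat data1.txt | uniq -d

The -c flag prefixes each output line with the number of adjacent occurrences, and -d prints only the lines that appear more than once in a row.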

2. Remove all duplicate data lines

The code is as follows:

$ cat data1.txt | sort | uniq

Note:

On its own, uniq only removes adjacent duplicate lines.

Sorting first makes all duplicate lines adjacent, so piping the sorted output through uniq removes every duplicate line.

Output:

Beijing

Wuhan
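For reference (not in the original text), GNU sort can do both steps at once: the -u flag deduplicates while sorting, so the following one-liner produces the same output as sort | uniq above:

$ sort -u data1.txt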

Attached: data1.txt

The code is as follows:

[root@syy ~]# cat data1.txt

Beijing

Beijing

Wuhan

Wuhan

Wuhan

Beijing

Beijing

Beijing

Wuhan

Wuhan

Note: this technique is handy for deduplicating IP addresses in log files.
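To illustrate (the file name and log layout here are assumptions, not from the original): if access.log records the client IP as the first field of each line, the distinct IPs and their hit counts can be extracted like this:

$ awk '{print $1}' access.log | sort | uniq -c | sort -rn

Each distinct IP is printed once, prefixed by how many times it appeared, busiest first.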

Deleting lines with duplicate fields in a large data file under Linux

I recently wrote a data acquisition program that generated a file of more than 10 million rows, each made up of four fields. The requirement was to delete every row whose second field duplicated an earlier one. I could not find a suitable ready-made tool on Linux at the time: stream tools such as sed and gawk work line by line, and I did not see a way to make them track duplicate fields across rows (though see the aside below). Rather than writing a Python program for it, it occurred to me that I could use MySQL, so I took a detour through the database.
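As an aside added to the original: gawk can in fact do this in a single pass, provided the set of distinct second-field values fits in memory. A minimal sketch, assuming whitespace-separated fields and keeping the first occurrence of each value (data.txt is a placeholder name):

$ awk '!seen[$2]++' data.txt > deduped.txt

seen[$2]++ evaluates to 0 (false) the first time a given second-field value appears, so that line is printed; every later occurrence is skipped. That aside, here is the MySQL route I actually took: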

1. Import the data into a table with mysqlimport --local dbname data.txt. mysqlimport derives the table name from the file name (minus the extension), so here the table must be named data.

2. Execute the following SQL statements (here the field to deduplicate on is named uniqfield):

The code is as follows:

USE dbname;

ALTER TABLE tablename ADD rowid INT AUTO_INCREMENT NOT NULL, ADD PRIMARY KEY (rowid); -- an AUTO_INCREMENT column must be a key

CREATE TABLE t SELECT MIN(rowid) AS rowid FROM tablename GROUP BY uniqfield;

CREATE TABLE t2 SELECT tablename.* FROM tablename, t WHERE tablename.rowid = t.rowid;

DROP TABLE tablename;

RENAME TABLE t2 TO tablename;
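As a quick sanity check (my addition, not part of the original steps), the row count should now equal the number of distinct key values; from the shell:

$ mysql -e "SELECT COUNT(*) AS total, COUNT(DISTINCT uniqfield) AS dk FROM tablename" dbname

After the deduplication, total and dk should be equal.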

At this point, you should have a better grasp of how to remove duplicate data lines from files in Linux. The best way to make it stick is to try it out yourself. For more related content, follow us and keep learning!
