
Awk command line or script that helps you sort text files (recommended)

2025-01-16 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

Awk is a powerful tool for performing tasks that may be accomplished by other common utilities, including sort.

Awk is a ubiquitous Unix command for scanning and processing text that contains predictable patterns. However, because it also features functions, it can reasonably be called a programming language.

Confusingly, there is more than one awk. (Or, if you believe there can be only one, then the others are clones.) There is awk, the original program written by Aho, Weinberger, and Kernighan, followed by nawk, mawk, and the GNU version, gawk. The GNU version of awk is a highly portable, free software version of the utility with several unique features, so this article is about GNU awk.

Although its official name is gawk, on many GNU+Linux systems it is aliased to awk and serves as the default version of that command. On systems that do not ship with GNU awk, you must install it first and call it as gawk rather than awk. This article uses the terms awk and gawk interchangeably.

Awk is both a command and a programming language, which makes it a powerful tool for tasks that might otherwise be left to sort, cut, uniq, and other common utilities. Fortunately, open source has plenty of room for redundancy, so if you are wondering whether or not to use awk for a task, the answer is probably "either works."

The beauty of awk's flexibility is that once you have committed to using awk for a task, you can probably stay in awk no matter what comes up along the way. That includes the eternal need to sort data in some order other than the one in which it was delivered to you.

Sample data set

Before exploring awk's sorting methods, generate a sample dataset to use. Keep it simple so that you do not get distracted by edge cases and unintended complexity. This is the sample set used in this article:

    Aptenodytes;forsteri;Miller,JF;1778;Emperor
    Pygoscelis;papua;Wagler;1832;Gentoo
    Eudyptula;minor;Bonaparte;1867;Little Blue
    Spheniscus;demersus;Brisson;1760;African
    Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
    Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
    Torvaldis;linux;Ewing,L;1996;Tux

This is a small dataset, but it offers a good variety of data types:

- A genus and a species name, which are related to each other but treated as separate fields
- A surname, sometimes followed by a comma and initials
- An integer representing a date
- An arbitrary term
- All fields separated by semicolons

Depending on your educational background, you may think of this as a two-dimensional array, a table, or just a line-delimited collection of data. How you think of it is up to you, because awk expects nothing more than text. It is up to you to tell awk how you want to parse it.

If you only want to sort

If you only want to sort a text dataset by a specific, definable field (think of a "cell" in a spreadsheet), the sort command is enough.
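As a minimal sketch of that approach, assuming the sample data is saved as penguins.list (the file name used in the awk examples later in this article), the following uses the POSIX sort options -t (field separator) and -k (sort key). The variable names by_genus and by_year are only illustrative, and LC_ALL=C is set so that the ordering does not depend on the local collation rules.

```shell
# Recreate the sample data (file name taken from the later awk examples).
cat > penguins.list <<'EOF'
Aptenodytes;forsteri;Miller,JF;1778;Emperor
Pygoscelis;papua;Wagler;1832;Gentoo
Eudyptula;minor;Bonaparte;1867;Little Blue
Spheniscus;demersus;Brisson;1760;African
Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
Torvaldis;linux;Ewing,L;1996;Tux
EOF

# -t sets the field separator; -k 1,1 restricts the sort key to field 1.
by_genus=$(LC_ALL=C sort -t ';' -k 1,1 penguins.list)

# Field 4 holds a year, so the n modifier compares it numerically.
by_year=$(LC_ALL=C sort -t ';' -k 4,4n penguins.list)

printf '%s\n' "$by_genus"
printf '\n%s\n' "$by_year"
```

The semicolon is quoted here for the same reason it must be quoted in an awk command: the shell would otherwise treat it as a command separator.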

Fields and records

Regardless of the format of your input, you must find patterns in it so that you can focus on the parts of the data that matter to you. In this example, the data is delimited by two factors: lines and fields. Each new line represents a new record, as you would likely see in a spreadsheet or database dump. Within each line, distinct fields (think of them as cells in a spreadsheet) are separated by semicolons (;).

Awk processes one record at a time, so while you are structuring the instructions you will give to awk, you can focus on just one line. Work out what you want to do with one line, then test it (either mentally or with awk) on the next line and on a few more. You will end up making assumptions about the data your awk script processes, and those assumptions are what let it deliver the data in the structure you want.

In this example, it is easy to see that each field is separated by a semicolon. For simplicity, suppose you want to sort the list by the first field of each row.

Before you can sort, you must be able to focus awk on just the first field of each line, so that is the first step. The syntax of an awk command in a terminal is awk, followed by the relevant options, followed by your awk program, and ending with the data file you want to process.

    $ awk --field-separator=";" '{print $1;}' penguins.list
    Aptenodytes
    Pygoscelis
    Eudyptula
    Spheniscus
    Megadyptes
    Eudyptes
    Torvaldis

Because the field separator here is a character that has special meaning to the Bash shell, you must enclose the semicolon in quotes or precede it with a backslash. This command only proves that you can focus on a specific field. You can try the same command with the number of another field to view the contents of another column of the data:

    $ awk --field-separator=";" '{print $3;}' penguins.list
    Miller,JF
    Wagler
    Bonaparte
    Brisson
    Milne-Edwards
    Viellot
    Ewing,L
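The backslash form mentioned above can be sketched as follows. This example uses -F, the short POSIX spelling of the field-separator option, and pipes in a single record so that it does not depend on a file; the variable names record, species, and year are only illustrative.

```shell
# One sample record, piped in so the example does not depend on a file.
record='Torvaldis;linux;Ewing,L;1996;Tux'

# The backslash keeps the shell from treating the semicolon
# as a command separator.
species=$(printf '%s\n' "$record" | awk -F\; '{print $2}')

# Quoting the semicolon has the same effect.
year=$(printf '%s\n' "$record" | awk -F';' '{print $4}')

printf '%s %s\n' "$species" "$year"
```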

We haven't done any sorting yet, but this is a good foundation.

Script programming

Awk is not just a command; it is a programming language with indices, arrays, and functions. That matters because it means you can grab a list of the fields you want to sort by, store the list in memory, process it, and then print the resulting data. For a complex series of operations like this, it is easier to work in a text file, so create a new file called sort.awk and enter this text:

    #!/bin/gawk -f

    BEGIN {
        FS=";";
    }

This establishes the file as an awk script that executes the lines it contains.

The BEGIN statement is a special setup block provided by awk for tasks that need to happen only once. Defining the built-in variable FS, which stands for field separator and holds the same value you set with --field-separator in the awk command, only needs to happen once, so it is included in the BEGIN statement.
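To illustrate that equivalence, here is a small sketch that sets the separator once in BEGIN and once on the command line, then prints the fifth field. The variable names data, with_begin, and with_option are only illustrative, and the two records come from the sample set.

```shell
data='Pygoscelis;papua;Wagler;1832;Gentoo
Eudyptula;minor;Bonaparte;1867;Little Blue'

# FS is set once in BEGIN, before the first record is read.
with_begin=$(printf '%s\n' "$data" | awk 'BEGIN { FS=";" } { print $5 }')

# The same separator, given as a command-line option instead.
with_option=$(printf '%s\n' "$data" | awk -F';' '{ print $5 }')

printf '%s\n' "$with_begin"
```

Both forms should split the records identically; the BEGIN form simply keeps the setting inside the script.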

Arrays in awk

You already know how to collect the values of a particular field by using the $symbol and field number, but in this case, you need to store it in an array instead of printing it to the terminal. This is done through the awk array. The important thing about the awk array is that it contains keys and values. Imagine the content of this article; it looks like this: author: "seth", title: "How to sort with awk", length:1200. Elements such as author, title, and length are keys, followed by values.
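The key and value idea can be sketched with the article metadata from the paragraph above. The array name article and the shell variable summary are only illustrative; the whole program lives in a BEGIN block, so awk does not wait for any input.

```shell
summary=$(awk 'BEGIN {
    # Keys are arbitrary strings; values can be strings or numbers.
    article["author"] = "seth"
    article["title"]  = "How to sort with awk"
    article["length"] = 1200

    # Look each value up by its key.
    printf "%s by %s, length %d\n", article["title"], article["author"], article["length"]
}')

printf '%s\n' "$summary"
```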

The advantage of this in the context of sorting is that you can assign any field as the key and any record as the value, and then use the built-in GNU awk function asorti() (sort by index) to sort by the key. For now, assume arbitrarily that you only want to sort by the second field.

Awk statements that are not preceded by the special keywords BEGIN or END are loops that run once for each record. This is the part of the script that scans the data for patterns and processes it accordingly. Each time awk turns its attention to a record, the statements in {} are executed (unless they are preceded by BEGIN or END).
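A quick way to see this per-record behavior is a one-line sketch that prints each record together with NR, the built-in awk variable holding the current record number. The input lines and the variable name numbered are only illustrative.

```shell
# The block in braces runs once for every input line.
numbered=$(printf 'Emperor\nGentoo\nTux\n' | awk '{ print NR ": " $0 }')

printf '%s\n' "$numbered"
```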

To add a key and a value to an array, create a variable containing the array (in this example script, I call it ARRAY, which is not terribly original, but very clear), then put the key in square brackets and assign it a value with an equals sign (=).

    {
        # dump each record into an array, keyed by its second field
        ARRAY[$2] = $0;
    }

In this statement, the contents of the second field ($2) are used as the key, and the current record ($0, which always refers to the whole record) is used as the value.

The asorti() function

In addition to arrays, awk has several basic functions that you can use as quick and easy solutions for common tasks. One of the functions introduced in GNU awk, asorti(), sorts an array by key (index); its companion function, asort(), sorts by value.

You can only sort the array after it has been populated, which means that this operation cannot be triggered for every new record, but only at the end of the script. To do this, awk provides a special END keyword. In contrast to BEGIN, the END statement is triggered only once after all records have been scanned.
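The difference in timing between the per-record block and END can be sketched with plain awk, before any sorting is involved. The array name SEEN, the counter n, and the three short records are only illustrative.

```shell
total=$(printf '%s\n' 'a;1' 'b;2' 'c;3' | awk -F';' '
    # Main rule: runs once per record and stores it under its first field.
    { SEEN[$1] = $0; n++ }
    # END rule: runs once, after the last record has been read.
    END { print n " records stored" }
')

printf '%s\n' "$total"
```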

Add these to your script:

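    END {
        asorti(ARRAY,SARRAY);
        # get length
        j = length(SARRAY);

        # assumed completion: the listing is cut off after "for (i = 1; i"
        for (i = 1; i <= j; i++) {
            printf("%s %s\n", SARRAY[i], ARRAY[SARRAY[i]])
        }
    }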
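Assembled, the fragments above might look like the following sketch. The END listing in the source is truncated, so the loop that prints each sorted key with its record is an assumption rather than the author's exact code. The script uses the return value of asorti(), which is the number of elements, instead of a separate length() call, and it is fed three records from the sample set in deliberately unsorted order. Because asorti() is a GNU awk extension, the run is skipped when gawk is not installed.

```shell
# Write the assembled script to sort.awk, the file name used in this article.
cat > sort.awk <<'EOF'
BEGIN { FS = ";" }

# Runs for every record: key = second field, value = whole record.
{ ARRAY[$2] = $0 }

END {
    # SARRAY receives the keys of ARRAY in sorted order, indexed 1..j.
    j = asorti(ARRAY, SARRAY)
    # Assumed loop body: print each key followed by its record.
    for (i = 1; i <= j; i++) {
        printf("%s %s\n", SARRAY[i], ARRAY[SARRAY[i]])
    }
}
EOF

if command -v gawk >/dev/null 2>&1; then
    sorted=$(printf '%s\n' \
        'Pygoscelis;papua;Wagler;1832;Gentoo' \
        'Aptenodytes;forsteri;Miller,JF;1778;Emperor' \
        'Eudyptula;minor;Bonaparte;1867;Little Blue' | gawk -f sort.awk)
    printf '%s\n' "$sorted"
else
    echo 'gawk is not installed; asorti() is a GNU awk extension' >&2
fi
```

With the species name as the key, the records should come out in the order forsteri, minor, papua.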
