This article introduces the Gawk pattern scanning language. It was published in two parts in the June and July 2003 issues of the "Linux For You" magazine.

Introducing Gawk: Part 1

Gawk is a language that is both easy to use and powerful. Learn the basics of gawk and use it for your common tasks.


Introduction

AWK is a pattern scanning and processing language created by Alfred Aho, Peter Weinberger and Brian Kernighan. GAWK is the GNU Project's implementation of the AWK programming language that is included in most GNU/Linux distributions.

AWK (hereafter referred to as gawk) is not just another cryptic language that comes free with Linux. It can be an indispensable tool for UNIX programmers and administrators. Being highly flexible and quite powerful, gawk is well suited to commonplace tasks on your system. Importantly, unlike heavier-duty languages such as Perl or C, gawk programs are neither too cryptic nor too lengthy for a given task.

As quoted from the info pages of gawk, you can use gawk to...

  • manage small, personal databases
  • generate reports
  • validate data
  • produce indexes, and perform other document preparation tasks
  • even experiment with algorithms that can be adapted later to other computer languages

Here we shall explore the basic facilities of gawk, using it for the first chore mentioned in the list above: manipulating a simple, small text based database. You'll soon realize that gawk is indeed refreshingly easy to use for such tasks.


My Database

Computer users often need to store small amounts of information for which maintaining a full-fledged database would in all probability be overkill. It is simpler (and more efficient) to write down the information in a plain text file. However, one of the fundamental uses of a database, querying data, may be tedious if done entirely by hand. This is one situation where gawk fits the bill perfectly.

Assume that you want a database in which to store details about your MP3 collection. Fire up vi (or the editor of your choice), and type in the details as shown below:

Contents of the file songs:

Bryan Adams:Summer of 69:Rock:3.5:1990
Steps:Five-Six-Seven-Eight:Dance:3.4:1998
Guns n Roses:November Rain:Rock:8.8:1995
Los Del Rio:Macarena:Dance:3.7:1997
Las Ketchup:The Ketchup Song:Dance:4.0:2002
Sting:Desert Rose:Pop:4.7:2001
Jon Bon Jovi:Its My Life:Rock:3.9:2001
Fatboy Slim:Rockafeller Skank:Techno:3.5:2000

In the file, each line contains a record. The fields are <artist>, <title>, <genre>, <duration in minutes>, <year> respectively. They are separated by colons.

Gawk Basics

In general, gawk works with textual data stored in files. As stated, each line represents a record. The fields in a record are separated by a special character called the "field separator", which by default is whitespace (any run of spaces or tabs). However, in our situation whitespace is not ideal, since fields such as title or artist may themselves contain spaces. Gawk allows us to override this default behaviour by explicitly setting the field separator. This can be done by setting the gawk variable FS or by using the -F flag (more on this later).

Most statements in gawk consist of two parts: a pattern and a corresponding action. Whenever a pattern is matched, the corresponding action is executed. Such pattern-action pairs are the key to understanding gawk. The concept is quite simple and we confront numerous such pairs in our everyday life. Consider the following pairs:

Feeling hungry: Have some food
Raise in salary: Throw a party
Down with fever: Call a doctor
Elevator out of order: Use the stairs

These are (frivolous) pattern-action pairs. For instance, when you are hungry (the pattern), you reach for some food (the action).

Gawk may be invoked with the program typed directly at the shell prompt, or with the program stored in a file. In the first case gawk is invoked as follows:

gawk '/pattern/{action}' data_file

In the latter case the gawk program is invoked as follows:

gawk -f program_file data_file

In this part, we shall use gawk in the former mode, thus familiarizing ourselves with the simple aspects of gawk.

Pattern matching in gawk

In gawk, the patterns to be matched are enclosed in a pair of forward slashes, and the actions are in a pair of braces as shown:

/patternA/{actionA}
/patternB/{actionB}
/patternC/{actionC}

This format clearly separates the patterns and corresponding actions. Most gawk scripts are sets of these pattern-action pairs, one after the other. Pattern matching is the simplest use of the gawk language. Either half of a pattern-action pair may be omitted, which gives two special cases:

1. Pattern does not exist:
In this case every line of input is treated as a match and the corresponding action is performed.

2. Action does not exist:
The default action is to print the entire record where the pattern matches.

Now let us do some pattern matching on the file "songs".

If we want to search for the string "Dance" in each record of our database we issue a command as shown:

$ gawk -F ":" '/Dance/' songs

Since our field separator is colon, we use gawk -F ":" to override the default space field separator. The default action being to print the entire record, the output of the command is:

Steps:Five-Six-Seven-Eight:Dance:3.4:1998
Los Del Rio:Macarena:Dance:3.7:1997
Las Ketchup:The Ketchup Song:Dance:4.0:2002

This shows all lines (records) with the string "Dance" in them. In this case, gawk acts similar to the "grep" utility. However, gawk is capable of much more, as we shall soon see.

Next, suppose we wish to list all titles and their corresponding durations. We type in the following command to achieve this:

$ gawk -F ":" '{print $2,$4}' songs

Gawk automatically splits each input line into fields named $1 (first field), $2 (second field) and so on; $0 signifies the entire record. The comma in the print statement separates the printed values with a single space. The command yields the following output:

Summer of 69 3.5
Five-Six-Seven-Eight 3.4
November Rain 8.8
Macarena 3.7
The Ketchup Song 4.0
Desert Rose 4.7
Its My Life 3.9
Rockafeller Skank 3.5

Now let us try to search for all rock songs in our database by a simple gawk command:

$ gawk -F ":" '/Rock/' songs

The result is:

Bryan Adams:Summer of 69:Rock:3.5:1990
Guns n Roses:November Rain:Rock:8.8:1995
Jon Bon Jovi:Its My Life:Rock:3.9:2001
Fatboy Slim:Rockafeller Skank:Techno:3.5:2000

Here, we run into a problem. Since we are simply searching for the pattern "Rock" in all records, the last record is displayed even though it does not belong to the "Rock" genre - its title begins with "Rock". To solve this, we need to check only the <genre> field in the file. Let's try again with this command:

$ gawk -F ":" '$3 == "Rock" {print $0}' songs

The output is just what we wanted:

Bryan Adams:Summer of 69:Rock:3.5:1990
Guns n Roses:November Rain:Rock:8.8:1995
Jon Bon Jovi:Its My Life:Rock:3.9:2001

Points to be noted regarding the previous command are:

1. There are no slashes because the pattern here is not a regular expression but a comparison. Gawk evaluates the comparison for every record and performs the action only on those records for which it is true.

2. The == sign means 'is equal to' (comparison); the command means if $3 is equal to "Rock" then print the record. Recall that $3 signifies the <genre> field in this case.

Suppose that of the previous results, we are interested in only those whose duration is less than five minutes. The following command achieves just that:

$ gawk -F ":" '($3 == "Rock") && ($4 < 5.0) {print $0}' songs

The output of the command:

Bryan Adams:Summer of 69:Rock:3.5:1990
Jon Bon Jovi:Its My Life:Rock:3.9:2001

This demonstrates the following important concepts:

1. Logical operators such as && (logical AND), || (logical OR) can be used in gawk to combine conditions.

2. Relational operators ( ==, !=, <, >, <=, >= ) can be incorporated in each condition to filter results.

3. In gawk, all fields are read as strings, but a field that looks like a number is treated as a number in comparisons and arithmetic. This is why the second condition ( $4 < 5.0 ) worked in this example.

Numerical computations using gawk

Gawk handles numeric data well and has several built-in mathematical functions. The basic arithmetic operators in gawk are +, -, *, /, ^ (exponentiation) and % (remainder). The built-in functions include the following:

Function      Description
sqrt(x)       square root of x
sin(x)        sine of x, with x in radians
cos(x)        cosine of x, with x in radians
atan2(y,x)    arctangent of y/x, in radians
log(x)        natural logarithm of x
exp(x)        e raised to the power x
int(x)        integral part of x
rand()        random number between 0 and 1
srand(x)      sets x as seed for the rand() function


Let's put this knowledge into practice by printing the duration of each track in a "minutes and seconds" format. For this, we first need to extract the fractional part of each track length and multiply it by 60 to obtain the "seconds" part. The integral part of the track length is the "minutes" part. Type the following command:

$ gawk -F ":" '{print $2,int($4),"min",($4-int($4))*60,"sec"}' songs

Here the built-in function int($4) extracts the integral part. The fractional part is therefore ($4 - int($4)). The output is:

Summer of 69 3 min 30 sec
Five-Six-Seven-Eight 3 min 24 sec
November Rain 8 min 48 sec
Macarena 3 min 42 sec
The Ketchup Song 4 min 0 sec
Desert Rose 4 min 42 sec
Its My Life 3 min 54 sec
Rockafeller Skank 3 min 30 sec

Metacharacters

Suppose that we do not know whether the artist Fatboy Slim has been spelt as "Fatboy Slim" or "Fatboy slim" in the database (note the difference in case). Conforming to Linux/UNIX style environments, gawk is case sensitive in its pattern matches. However we wish to search for both. For this, let us try another gawk command:

$ gawk -F ":" '$1 ~ /Fatboy [Ss]lim/' songs

The output is:

Fatboy Slim:Rockafeller Skank:Techno:3.5:2000

The tilde operator (~) signifies "is matched by" and inside the slashes the sequence [Ss] signifies "either S or s". The !~ operator signifies the opposite i.e. "is not matched by".

Special characters such as [] are known as metacharacters. Metacharacters are a powerful tool for filtering results and enhancing regular expressions. To prove it, assume that for some really strange reason, you wish to list records whose first field (artist) begins with 'L' and ends with an 'o'. Sounds impossible? Well, type the following (carefully):

$ gawk -F ":" '($1 ~ /^L/) && ($1 ~ /o$/)' songs

The output is:

Los Del Rio:Macarena:Dance:3.7:1997

Here's an explanation of the above cryptic command. The metacharacter '^' anchors the match to the beginning of the string being tested, which here is the first field. The part ($1 ~ /^L/) is therefore true if the first character of the first field is 'L'. The metacharacter '$' anchors the match to the end of the string. Thus the part ($1 ~ /o$/) is true if the first field ends with an 'o'. When the two are combined with the logical AND operator, &&, we get the desired result. There are several other metacharacters in gawk, details of which may be found in the man or info pages of gawk.

Conclusion

Phew, that was enough gawk for one session! By now, you should realize that gawk does achieve a lot in very few (or just one) lines of code.

In the next, concluding part we'll see how gawk programs are written. We'll familiarize ourselves with the programming constructs gawk has to offer for more powerful and sophisticated jobs. We'll also try the various formatting options that give gawk its powerful report-writing capabilities. Till then, happy experimenting!


Copyright (c) 2004 Rajorshi Biswas
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation. A copy of the license is included here.