This article introduces the Gawk pattern scanning language. It was published in two parts in the June and July 2003 issues of "Linux For You" magazine.

Introducing Gawk : Part 2

In this concluding part, learn how to write gawk programs and generate pretty reports with gawk.



Looking Back

In the last part, you were introduced to the GNU implementation of the awk pattern matching and processing language, known as gawk. I showed you the basics of gawk by using it to manage a simple text-based database. In this part we'll take a programming language view of gawk as we explore the various programming constructs it offers. Finally, we'll see how gawk can format its output so that what we get in the end is visually pleasing.

Gawk, as we have seen before, works on text files where each line is a record and is a list of fields. The fields are separated by a special character called the field separator. Here we will use the same file we used in the last part, which is reproduced here for convenience:

<Contents of file "songs">

Bryan Adams:Summer of 69:Rock:3.5:1990
Steps:Five-Six-Seven-Eight:Dance:3.4:1998
Guns n Roses:November Rain:Rock:8.8:1995
Los Del Rio:Macarena:Dance:3.7:1997
Las Ketchup:The Ketchup Song:Dance:4.0:2002
Sting:Desert Rose:Pop:4.7:2001
Jon Bon Jovi:Its My Life:Rock:3.9:2001
Fatboy Slim:Rockafeller Skank:Techno:3.5:2000

This file is a database containing details of songs. The fields are <artist>, <title>, <genre>, <duration in minutes> and <year> respectively. The field separator in this case is the colon character.

Writing Gawk Programs

Instead of typing gawk instructions on the command line each time they are needed, you can put them in a file. Such a gawk script is invoked with the -f option. A gawk script is essentially a full-bodied program, complete with if-else, while and do-while constructs (borrowed from C) as well as input (using functions such as getline) and output (using the C output workhorse printf). A gawk script is executed as follows:

$ gawk -f program_file data_file

A gawk program usually takes the following form:

#this is a comment
BEGIN {
    #statements here are executed once, before any input is read
}
#/pattern/ { action } pairs, applied to each record
#other statement(s)
END {
    #statements here are executed once, after all input has been read
}

As a simple example, let's search for the pattern "Pop" in our database and print the record. Type the following in an editor of your choice and save the file as pattern.awk.

Listing of pattern.awk:

#pattern.awk
BEGIN { FS = ":" }
/Pop/ { print $0 }

Now type the following command in the terminal:

$ gawk -f pattern.awk songs

The output is as expected:

Sting:Desert Rose:Pop:4.7:2001
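
The general form shown earlier also allows an END block next to the pattern/action pair. The following is only a sketch of how the three parts fit together: it recreates a few lines of the songs file in a scratch directory, and the script name count.awk and the variable count are made up for illustration. It calls the generic awk command so that it runs with any POSIX awk; gawk can be substituted.

```shell
# Recreate part of the article's "songs" file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/songs" <<'EOF'
Bryan Adams:Summer of 69:Rock:3.5:1990
Steps:Five-Six-Seven-Eight:Dance:3.4:1998
Guns n Roses:November Rain:Rock:8.8:1995
Sting:Desert Rose:Pop:4.7:2001
EOF

# count.awk is a hypothetical script name used only in this sketch.
cat > "$workdir/count.awk" <<'EOF'
#count.awk
BEGIN  { FS = ":"; print "Rock titles:" }        # once, before any input
/Rock/ { print "  " $2; count++ }                # for every matching record
END    { print count, "rock track(s) found" }    # once, after all input
EOF

skeleton_out=$(awk -f "$workdir/count.awk" "$workdir/songs")
printf '%s\n' "$skeleton_out"
rm -rf "$workdir"
```

With this four-line sample the script lists the two rock titles and reports a count of 2. On the full songs file, /Rock/ would also match the title "Rockafeller Skank", because a bare pattern is tested against the whole record; comparing $3 == "Rock" instead restricts the test to the genre field.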

As another example, let us print the titles of songs released on or after the year 2000.

Listing of year.awk:

#year.awk
BEGIN { FS = ":" }
$5 >= 2000 { print $2, "was released in the year", $5 }

The output is:

The Ketchup Song was released in the year 2002
Desert Rose was released in the year 2001
Its My Life was released in the year 2001
Rockafeller Skank was released in the year 2000

There are some points to be noted here:

1. The BEGIN block is executed at the very beginning. As such it is an ideal place to initialize variables. We have set the field separator here.

2. FS is a built-in gawk variable that holds the field separator. Setting FS = ":" in the BEGIN block is equivalent to passing -F ":" to gawk on the command line.

3. Comments in a gawk script begin with a hash (#) character, as in shell scripts.
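
The equivalence mentioned in point 2 can be checked with a small sketch. The two-line sample file and the shell variable names are assumptions made for this illustration, and the generic awk command is used so that the snippet runs with any POSIX awk (gawk can be substituted).

```shell
# Recreate two records of the article's "songs" file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/songs" <<'EOF'
Bryan Adams:Summer of 69:Rock:3.5:1990
Sting:Desert Rose:Pop:4.7:2001
EOF

# Field separator set inside the program...
with_begin=$(awk 'BEGIN { FS = ":" } { print $2 }' "$workdir/songs")
# ...and the same separator given on the command line.
with_flag=$(awk -F ":" '{ print $2 }' "$workdir/songs")

printf '%s\n' "$with_begin"
rm -rf "$workdir"
```

Both invocations print the same two titles. Setting FS in the BEGIN block keeps the separator with the script, which is convenient when the script is run with -f.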

Gawk allows users to define their own variables and functions if required. We'll encounter user-defined variables as we go along. Gawk has other built-in variables too; some of them are listed below:

Variable   Description
NR         The number of records read so far
FNR        The number of records read from the current file
FILENAME   The name of the current input file
FS         Field separator (default is whitespace)
RS         Record separator (default is newline)
OFMT       Output format for numbers
OFS        Output field separator
ORS        Output record separator
NF         The number of fields in the current record
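
A few of these variables can be watched in action with the following sketch. The two-line sample file is an assumption made for this illustration, and the file is deliberately named twice on the command line so that NR and FNR differ on the second pass. The generic awk command is used; gawk can be substituted.

```shell
# Recreate two records of the article's "songs" file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/songs" <<'EOF'
Bryan Adams:Summer of 69:Rock:3.5:1990
Sting:Desert Rose:Pop:4.7:2001
EOF

# The file is read twice: NR keeps counting, FNR restarts for each file.
vars_out=$(cd "$workdir" && awk '
BEGIN { FS = ":"; OFS = " | " }      # OFS is placed between print arguments
{ print FILENAME, NR, FNR, NF, $1 }
' songs songs)

printf '%s\n' "$vars_out"
rm -rf "$workdir"
```

The third output line reads "songs | 3 | 1 | 5 | Bryan Adams": it is the third record overall (NR), the first record of the second file (FNR), and it has five fields (NF).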


Programming constructs in gawk

Typical of a programming language, gawk provides all the necessary control structures. Some are demonstrated in this section.

The if-else condition

In the following program, we find the total length of all tracks belonging to the rock genre and that of all other genres combined. This demonstrates the use of the if-else condition in gawk programs.

Listing of if.awk:

BEGIN {
    FS = ":"
    len_rock = 0     #total length of rock tracks
    len_other = 0    #total length of all other tracks
}

{
    if ($3 == "Rock")
        len_rock += $4     #add length of this track
    else
        len_other += $4
}

END {
    print "Total length of rock tracks :", len_rock, "minutes"
    print "Total length of other tracks :", len_other, "minutes"
}

The output is:

Total length of rock tracks : 16.2 minutes
Total length of other tracks : 19.3 minutes

The program introduces some new features:

1. We have used user-defined variables, 'len_rock' and 'len_other'. Notice that most of the syntax is similar to that in C, for instance, the '+=' operator. If you are not familiar with C, remember that the expression

variable += number

increases the value of the 'variable' by 'number'. In our example, len_rock += $4 increases the value of the variable 'len_rock' by $4 i.e. the length of the current track.

2. The if construct is also similar to C's if-else syntax. However, variables need not be declared in gawk, and a statement need not end with a semicolon as long as it is on a line of its own.

3. The END block is useful for printing results of computations that require all records to be processed. Thus the total lengths have been printed in this block.
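
The same if-else construct can also live inside a user-defined function. The sketch below is an illustration only: the function name classify(), its 4-minute threshold and the three-line sample file are assumptions, not part of the original article. The generic awk command is used; gawk can be substituted.

```shell
# Recreate three records of the article's "songs" file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/songs" <<'EOF'
Bryan Adams:Summer of 69:Rock:3.5:1990
Guns n Roses:November Rain:Rock:8.8:1995
Sting:Desert Rose:Pop:4.7:2001
EOF

# classify() and the 4-minute threshold are hypothetical, chosen for this sketch.
func_out=$(awk '
function classify(minutes) {
    if (minutes + 0 >= 4)        # "+ 0" forces a numeric comparison
        return "long"
    else
        return "short"
}
BEGIN { FS = ":" }
{ print $2, "is a", classify($4), "track" }
' "$workdir/songs")

printf '%s\n' "$func_out"
rm -rf "$workdir"
```

Functions are defined at the top level of the program, outside any BEGIN, END or pattern/action block, and can then be called from any of them.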

The for loop

The 'for' loop in gawk looks and behaves very much like its counterpart in C. The next script prints each record separately, with every field on a line of its own next to a description.

Listing of for.awk:

BEGIN {
    FS = ":"

    arr[1] = "Artist"    #descriptions of fields
    arr[2] = "Title"
    arr[3] = "Genre"
    arr[4] = "Duration (mins)"
    arr[5] = "Year"
}

{
    print "\nRecord Number ", NR, " :\n"

    for (i = 1; i <= NF; i++)
    {
        printf("%-15s : %s\n", arr[i], $i)
    }
}

The output is:

Record Number  1  :

Artist          : Bryan Adams
Title           : Summer of 69
Genre           : Rock
Duration (mins) : 3.5
Year            : 1990

Record Number  2  :

Artist          : Steps
Title           : Five-Six-Seven-Eight
Genre           : Dance
Duration (mins) : 3.4
Year            : 1998

Record Number  3  :

Artist          : Guns n Roses
Title           : November Rain
Genre           : Rock
Duration (mins) : 8.8
Year            : 1995

(... and so on, for all 8 records)

There are some important features of this gawk program:

1. We have used the built-in gawk variables NR (the number of records read so far) and NF (the number of fields in the current record).

2. Gawk supports arrays, and accessing array elements is easy. Arrays need no declaration; an element is created simply by assigning to it, as in "arrayname[index] = value". We have used an array (arr[]) to remember the description of each field in this program.

3. You might have been surprised to see C's all-powerful output function printf in action. The availability of printf gives the capability to nicely format gawk's output (more on this later).
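
Array indices in gawk do not have to be numbers; strings work as well. As a sketch of that idea (not taken from the original article), the following totals the track lengths per genre by using the genre name as the index. The four-line sample file and the array name total are assumptions made for this illustration, and the generic awk command is used so that it runs with any POSIX awk.

```shell
# Recreate four records of the article's "songs" file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/songs" <<'EOF'
Bryan Adams:Summer of 69:Rock:3.5:1990
Steps:Five-Six-Seven-Eight:Dance:3.4:1998
Guns n Roses:November Rain:Rock:8.8:1995
Sting:Desert Rose:Pop:4.7:2001
EOF

# "total" is a hypothetical array name; sort gives the lines a stable order.
genre_out=$(awk '
BEGIN { FS = ":" }
{ total[$3] += $4 }                    # the genre name is the array index
END {
    for (genre in total)               # visiting order is not defined
        printf("%s %.1f\n", genre, total[genre])
}
' "$workdir/songs" | sort)

printf '%s\n' "$genre_out"
rm -rf "$workdir"
```

For this sample the result is "Dance 3.4", "Pop 4.7" and "Rock 12.3". Because the "for (genre in total)" loop visits the indices in no particular order, the output is piped through sort.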

The while loop

Finally, we will use the while looping construct. In the next program, we output all the records in reverse order. This is similar to the 'tac' utility (as opposed to the commonly used 'cat' utility); try "tac <textfile>" on any text file in a console window.

Listing of reverse.awk:

#reverse the lines in a file
BEGIN {
    print "The File Reverser...\n"
    print "The Original File...\n"
}

{
    line[NR] = $0    # remember each line
    print $0         # $0 signifies the entire record
}

END {
    print "\nThe Reversed File...\n"
    var = NR

    while (var > 0)
    {
        print line[var]
        var--
    }
    print "\nDone!\n"
}

In this example, the array line[] stores all the records as they are read by the main section of the program (between BEGIN and END). In the END block, NR contains the total number of records read. The variable var keeps track of which record to output and is decreased from NR down to 1. When var is equal to NR, line[var] contains the last record, and when var is equal to 1, line[var] contains the first record. Note that we have not set FS in this example, as it is simply not necessary since we are not working with individual fields.

The output is:

The File Reverser...
 
The Original File...
 
Bryan Adams:Summer of 69:Rock:3.5:1990
Steps:Five-Six-Seven-Eight:Dance:3.4:1998
Guns n Roses:November Rain:Rock:8.8:1995
Los Del Rio:Macarena:Dance:3.7:1997
Las Ketchup:The Ketchup Song:Dance:4.0:2002
Sting:Desert Rose:Pop:4.7:2001
Jon Bon Jovi:Its My Life:Rock:3.9:2001
Fatboy Slim:Rockafeller Skank:Techno:3.5:2000
 
The Reversed File...
 
Fatboy Slim:Rockafeller Skank:Techno:3.5:2000
Jon Bon Jovi:Its My Life:Rock:3.9:2001
Sting:Desert Rose:Pop:4.7:2001
Las Ketchup:The Ketchup Song:Dance:4.0:2002
Los Del Rio:Macarena:Dance:3.7:1997
Guns n Roses:November Rain:Rock:8.8:1995
Steps:Five-Six-Seven-Eight:Dance:3.4:1998
Bryan Adams:Summer of 69:Rock:3.5:1990
 
Done!
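
The countdown in reverse.awk can also be written with the for loop seen earlier. The following is a compact sketch of the same idea without the banner lines; the three-line sample file and the loop variable i are assumptions made for this illustration, and the generic awk command is used (gawk can be substituted).

```shell
# Recreate three records of the article's "songs" file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/songs" <<'EOF'
Bryan Adams:Summer of 69:Rock:3.5:1990
Steps:Five-Six-Seven-Eight:Dance:3.4:1998
Sting:Desert Rose:Pop:4.7:2001
EOF

reversed=$(awk '
{ line[NR] = $0 }                      # remember each record
END {
    for (i = NR; i > 0; i--)           # count down from the last record
        print line[i]
}
' "$workdir/songs")

printf '%s\n' "$reversed"
rm -rf "$workdir"
```

Either loop works; the for version simply keeps the initialisation, the test and the decrement on one line.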

Formatting in gawk

With the power of printf at its disposal, gawk can format the output very nicely. This gives gawk its report writing capability, as can be seen by trying out our final gawk program:

Listing of report.awk:

BEGIN {
    FS = ":"
    # Clear the screen...
    system("clear")
    # Print header...
    printf("\n%30sSongs Database Report\n", " ")
    printf("\n%-25s%-25s%-15s%-10s%8s",
           " Artist", " Title", "Genre", "Duration", "Year")
    printf("\n%-25s%-25s%-15s%-10s%8s\n",
           " ======", " =====", "=====", "========", "====")
}

{
    # 1 + 25 and 24 keep the artist and title under their 25-character header columns
    printf("\n%1s%-25s%-24s%-15s%-4s%4s%10s", " ", $1, $2, $3, $4, "mins", $5)
}

END { print "\n\nThat's all folks!" }

There are a few notable features about this program:

1. The system("clear") function executes the shell command 'clear' to clear the screen before the report is shown. Other than that the entire program is essentially formatting using 'printf'.

2. The printf function offers much more sophisticated printing capabilities than the print statement. It takes the general format

printf("format_string",argument_list)

In our program, for example, the string "%-15s" prints the corresponding field in a left-justified column 15 characters wide. The details of printf are beyond the scope of this article. In case you are unfamiliar with it, do read the man pages for in-depth information (type the following command at a terminal: man 3 printf).
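
To get a feel for the format strings, a few conversions can be tried on their own. This is only a sketch: the values and the square brackets (added to make the padding visible) are chosen for illustration, and the generic awk command is used so that it runs with any POSIX awk.

```shell
# Everything happens in BEGIN, so no input file is needed.
fmt_out=$(awk 'BEGIN {
    printf("[%-15s]\n", "Rock")    # left-justified in 15 columns
    printf("[%15s]\n", "Rock")     # right-justified in 15 columns
    printf("[%6.2f]\n", 3.5)       # 6 columns wide, 2 decimal places
    printf("[%5d]\n", 1990)        # integer, right-justified in 5 columns
}')

printf '%s\n' "$fmt_out"
```

The first two lines show "Rock" padded with eleven spaces, on the right and on the left respectively; the number 3.5 appears as "  3.50" and the year as " 1990".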

The output of report.awk looks like this (the columns are shown narrower than the program prints them, to fit the page):

                        Songs Database Report

Artist          Title                    Genre      Duration       Year
======          =====                    =====      ========       ====
 
Bryan Adams     Summer of 69             Rock       3.5 mins       1990
Steps           Five-Six-Seven-Eight     Dance      3.4 mins       1998
Guns n Roses    November Rain            Rock       8.8 mins       1995
Los Del Rio     Macarena                 Dance      3.7 mins       1997
Las Ketchup     The Ketchup Song         Dance      4.0 mins       2002
Sting           Desert Rose              Pop        4.7 mins       2001
Jon Bon Jovi    Its My Life              Rock       3.9 mins       2001
Fatboy Slim     Rockafeller Skank        Techno     3.5 mins       2000
 
That's all folks!

Pretty, isn't it? I'll assume that you have now gained enough familiarity with gawk to explore on your own. The man and info pages of gawk offer detailed information on all aspects of the language. So what are you waiting for? Open a terminal and hack away!



Copyright (c) 2004 Rajorshi Biswas
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation. A copy of the license is included here.