520—Spring 2004—18

520 - Principles of Programming Languages

18: Awk

Christian Collberg

[email protected]

Department of Computer Science

University of Arizona


Introduction

This lecture was prepared using information taken from the Gawk manual.

The Awk utility interprets a special-purpose programming language that makes it possible to handle simple data-reformatting jobs easily with just a few lines of code.

The GNU implementation of Awk is called gawk. There are other implementations for various platforms: mawk, nawk, ...

Awk is a text-manipulation and prototyping language: it allows you to experiment with algorithms that can be adapted later to other (more efficient) languages.

Introduction. . .

The name Awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan.

[Figure: an input file (Line 1, Line 2, ..., Line n) and an Awk program are fed to the Awk interpreter, which produces an output file (Line 1, Line 2, ..., Line m).]

The Structure of a Program

An Awk program. . .

functions as a filter: it reads a text file as input and produces a text file as output.

is stored in a normal text file. It doesn’t have to be compiled, but it does have to be made executable if it is to be run directly by name.

consists of function definitions and pattern-action definitions.

The Structure of a Program. . .

    #!/bin/gawk -f
    function FunName1 (Args) {
        Function Body 1
    }
    function FunName2 (Args) {
        Function Body 2
    }
    · · ·
    pattern1 { action1 }
    pattern2 { action2 }
    · · ·
    BEGIN { action }
    END { action }

The Structure of a Program. . .

An Awk program reads one line of input at a time and stores it in a variable called $0.

Awk then goes through every pattern {action} in the program. If $0 matches the pattern (a regular expression) then action is executed. A missing pattern (as below) matches any string.

An action can be any sequence of Awk statements. The print statement prints its arguments to standard output. length returns the length of a string.

Example Awk Program (len):

    #!/bin/gawk -f
    # This program prints the length of every input line:
    {print length($0)}

The Awk Interpreter

[Figure: flowchart of the Awk interpreter running the program

    BEGIN {actionB}
    pattern1 {action1}
    pattern2 {action2}
    END {actionE}

on an input file (Line 1, Line 2, ..., Line n). Perform actionB. Then, until end of input: read a line from the input file into $0; if $0 matches pattern1, perform action1; if $0 matches pattern2, perform action2. At end of input, perform actionE and exit.]

Running an Awk Program

An Awk program has to be made executable using chmod.

Run the program by giving its name, followed by the input file (or several input files).

Since len is a text file, we can run it on itself.

We can also give the program directly on the command line.

> chmod a+x len

> len len

14

33

22

18

> gawk '{print length($0)}' len

Pattern Matching

Patterns are regular expressions. A regexp is surrounded by slashes, like this: /[a-z]/.

Example 1: ‘BBS-list’:

    bites     555-1675   2400/1200/300
    camelot   555-0542   300
    core      555-2912   1200/300
    fooey     555-1234   2400/1200/300
    foot      555-6699   1200/300
    macfoo    555-6480   1200/300
    sdace     555-3430   2400/1200/300
    sabafoo   555-2127   1200/300

Pattern Matching. . .

Program 2: getfoo:

    #!/bin/gawk -f
    # Prints every line containing "foo".
    /foo/ {print $0}

    > getfoo BBS-list
    fooey     555-1234   2400/1200/300
    foot      555-6699   1200/300
    macfoo    555-6480   1200/300
    sabafoo   555-2127   1200/300

Pattern Matching. . .

This program prints out every line that contains 1200 or 2400. Both rules are tried against every line, so a line that contains both is printed twice.

Program 3: getfast:

    #!/bin/gawk -f
    /1200/ {print $0}
    /2400/ {print $0}
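The double printing in getfast can be seen directly from the shell. The following is only a sketch: the shell function names getfast_twice and getfast_once and the two sample lines are made up for illustration, and it calls awk (any POSIX awk, gawk included). The second function uses a single regexp with alternation, which is a substitute technique and not the program on the slide.

```shell
# Two rules: a line matching both patterns is printed once per matching rule.
getfast_twice() {
    awk '/1200/ {print $0}
         /2400/ {print $0}' "$@"
}

# One rule with alternation: each matching line is printed only once.
getfast_once() {
    awk '/1200|2400/ {print $0}' "$@"
}

printf 'bites 555-1675 2400/1200/300\ncamelot 555-0542 300\n' | getfast_twice
printf 'bites 555-1675 2400/1200/300\ncamelot 555-0542 300\n' | getfast_once
```

The first call prints the bites line twice, the second prints it once; the camelot line matches neither pattern.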

Defining Functions

Functions are defined like this:

    #!/bin/gawk -f
    function FunName (Args) {
        Fun Body
    }

Example Function:

    #!/bin/gawk -f
    function max (m,n) {
        if (m > n) return m
        else return n
    }

Defining Functions. . .

Calling a Function:

    BEGIN { print max(4,5) }

ACHTUNG I!

When calling a user-defined function, put no space between the function name and the opening parenthesis:

    Wrong ⇒ f () ⇐ Wrong.

Defining Functions. . .

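Awk functions can be called with a varying number of arguments: a call may pass fewer arguments than the function declares, and the missing parameters start out empty.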

function Message (Pos, Kind, Msg, Arg1, Arg2) {

print Pos ": <" Kind "> " Msg " " Arg1 " " Arg2

}

BEGIN {

Message("[1,2]", "WARNING", "Unused identifier")

Message("[2,3]", "ERROR", "Undeclared identifier", "K")

Message("[10,2]", "INFO", "Compile time:", 17, "secs")

}

Output:

[1,2]: <WARNING> Unused identifier

[2,3]: <ERROR> Undeclared identifier K

[10,2]: <INFO> Compile time: 17 secs


Defining Functions. . .

Local variables are defined by adding “extra” (unused) function arguments.

It’s a good idea to separate the locals from the arguments by putting the locals on a line by themselves.

    #!/bin/gawk -f
    function Expr_Arith (node, Oper,
                         LNode, RNode, LType, RType) {
        LNode = Child1[node]; RNode = Child2[node]
        Env[LNode] = Env[RNode] = Env[node]
        Expr(LNode); Expr(RNode);
        LType = Type[LNode]; RType = Type[RNode]
    }
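A small sketch of why the extra-parameter trick matters, run from the shell with any POSIX awk. The names locals_demo, swap_demo and tmp are invented for this illustration; the point is only that an extra parameter is private to the function, while any other variable is global.

```shell
locals_demo() {
    awk '
    # "tmp" is an extra (unused) parameter, so it is local to swap_demo.
    function swap_demo(a, b,    tmp) {
        tmp = a; a = b; b = tmp
        return a " " b
    }
    BEGIN {
        tmp = "global"           # a global variable with the same name
        print swap_demo(1, 2)    # prints: 2 1
        print tmp                # prints: global (untouched by the call)
    }'
}

locals_demo
```

Had tmp not been listed as a parameter, the assignment inside swap_demo would have overwritten the global tmp.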

Awk Operators

    x + y    Addition.
    x - y    Subtraction.
    - x      Negation.
    x * y    Multiplication.
    x / y    Division. Since all numbers are double-precision floating point,
             the result is not rounded to an integer: 3 / 4 = 0.75.
    x % y    Remainder.
    x ^ y    Exponentiation: x raised to the y power.

Awk Operators. . .

    x ** y   Exponentiation.
    x y      String concatenation. Putting two operands next to each other
             converts them both to strings and then concatenates them:
             X = 5 6; print X ⇒ 56.
    x < y    True if x is less than y. True is 1 and false is 0.
    x <= y   True if x is ≤ y.
    x > y    True if x is greater than y.
    x >= y   True if x is ≥ y.
    x == y   True if x is equal to y.
    x != y   True if x is not equal to y.

Awk Operators. . .

    x ~ y    True if the string x matches the regexp denoted by y.
    x !~ y   True if the string x does not match the regexp denoted by y.
    x in y   True if array y has an element with the subscript x.
    ||       Or (Disjunction).
    &&       And (Conjunction).
    !        Not.

Awk Operators. . .

ACHTUNG II!

The operands of a relational operator are compared as numbers if they are both numbers. Otherwise they are converted to, and compared as, strings.

Strings are compared character-by-character. Thus, "10" is less than "9".

If you want to be sure that you are making a numerical comparison, convert both operands to numbers: if ((0+x) < (0+y)) print x.

ACHTUNG III!

Beware (as in C) of the difference between = and ==: if (a==b) print x is quite different from if (a=b) print x!
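The string-versus-number trap can be checked with a few lines. This is a sketch using any POSIX awk from the shell; compare_demo, x and y are names chosen for the example. Both variables are assigned string constants, so the plain comparison is a string comparison.

```shell
compare_demo() {
    awk 'BEGIN {
        x = "10"; y = "9"                  # string constants, so x and y are strings
        if (x < y)         print "as strings: 10 < 9"
        if ((0+x) > (0+y)) print "as numbers: 10 > 9"
    }'
}

compare_demo
```

Both messages are printed: the first comparison looks at the characters "1" and "9", the second forces numeric conversion with 0+.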

Control Statements

    if (condition)
        then-body
    else
        else-body

    if (condition)
        then-body

The condition is considered false if its value is zero or the null string (""), true otherwise.

    if (x % 2 == 0)
        print "x is even"
    else
        print "x is odd"

ACHTUNG III! (rerun)

Beware (as in C) of the difference between = and ==!

Control Statements. . .

    while (cond)
        body

    do
        body
    while (cond)

    for (init; condition; incr)
        body

    for (i in array)
        do something with array[i]

Example:

    i = 1
    do {
        print $0
        i++
    } while (i <= 10)

Control Statements. . .

    break
    continue

break jumps out of the innermost for, while, or do-while loop that encloses it.

continue skips over the rest of the loop body, starting the next loop cycle.

Example

Find the smallest divisor of any integer, and identify prime numbers:

    function Divisor (num, d) {
        for (d = 2; ; d++) {
            if (num % d == 0) {
                printf "Smallest divisor of %d is %d\n", num, d
                break
            }
            # No divisor up to the square root: num is prime.
            if (d*d > num) {
                printf "%d is prime\n", num
                break
            }
        }
    }
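To try the divisor search on real input, the function can be wrapped in a rule that calls it for the first field of every line. This is a sketch under the assumption that the loop stops once d*d exceeds num; divisor_demo is an invented shell wrapper, and any POSIX awk should do.

```shell
divisor_demo() {
    awk '
    function Divisor(num,    d) {
        for (d = 2; ; d++) {
            if (num % d == 0) { printf "Smallest divisor of %d is %d\n", num, d; break }
            if (d*d > num)    { printf "%d is prime\n", num; break }
        }
    }
    { Divisor($1) }    # one number per input line
    '
}

printf '15\n13\n' | divisor_demo
```

For 15 the loop stops at d = 3; for 13 it gives up at d = 4, because 4*4 is already larger than 13.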

Control Statements. . .

    next
    exit
    exit expr

exit makes Awk stop executing the current rule and stop processing input; any remaining input is ignored. The END actions, if any, are still run.

exit e returns e as the exit status code for the Awk process. exit with no argument returns status zero (success).

next forces Awk to stop processing the current input line and go on to the next line.

    {
        if (length($0) < 10) {
            print "line skipped, too short"
            next
        }
    }
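The effect of next is easiest to see when a second rule follows the one that skips. A sketch, with an invented wrapper name skip_short and an added second rule that is not on the slide; it runs with any POSIX awk.

```shell
skip_short() {
    awk '{
        if (length($0) < 10) {
            print "line skipped, too short"
            next                 # no later rule sees this line
        }
    }
    { print "ok: " $0 }          # only reached by lines that were not skipped
    '
}

printf 'short\nthis line is long enough\n' | skip_short
```

The short line produces only the warning; the long line falls through to the second rule.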

Associative Arrays

Awk has one-dimensional associative arrays. “Associative” means that an array is a collection of pairs: an index, and its corresponding array element value.

Arrays don’t have to be declared and will grow in size as necessary.

Arrays are essentially string-to-string mappings: A["foo"] = "bar" inserts the element "bar" at index "foo".

There really isn’t any order between the elements of an array, the way there is in a Pascal or C array.

It is possible to simulate multi-dimensional arrays. See the manual.
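The properties listed above can be exercised in a few lines. This is a sketch for any POSIX awk; array_demo and the keys used are made up for the example. It avoids printing inside the for-in loop because the traversal order is unspecified.

```shell
array_demo() {
    awk 'BEGIN {
        A["foo"] = "bar"                   # no declaration needed
        A[10] = "ten"                      # the subscript 10 is stored as the string "10"
        if ("foo" in A) print "foo -> " A["foo"]
        if ("10" in A)  print "10 -> " A["10"]
        if (!("baz" in A)) print "baz is not an index"
        n = 0
        for (k in A) n++                   # visits every index, in no particular order
        print n " elements"
    }'
}

array_demo
```

Testing with in does not create the element, so the array still has two elements at the end.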

Associative Arrays. . .

The following program takes a list of lines, each beginning with a line number, and prints them out in order of line number.

The first rule keeps track of the largest line number seen so far; it also stores each line into the array arr, at an index that is the line’s number.

$1 is the first field (word) on an input line.

    {
        if ($1 > max) max = $1
        arr[$1] = $0
    }
    END {
        for (x = 1; x <= max; x++) print arr[x]
    }

Associative Arrays. . .

Input:

    5 I am the Five man
    2 Who are you? The new number two!
    4 . . . And four on the floor
    1 Who is number one?
    3 I three you.

Output:

    1 Who is number one?
    2 Who are you? The new number two!
    3 I three you.
    4 . . . And four on the floor
    5 I am the Five man
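The line-number sort can be run as a filter. A sketch assuming the first rule keeps the largest number seen ($1 > max); renumber is an invented wrapper name, the three-line input is made up, and any POSIX awk should behave the same.

```shell
renumber() {
    awk '{
        if ($1 > max) max = $1     # largest line number seen so far
        arr[$1] = $0               # store the line at its own number
    }
    END {
        for (x = 1; x <= max; x++) print arr[x]
    }'
}

printf '3 three\n1 one\n2 two\n' | renumber
```

A missing line number would simply print an empty line, since arr[x] is empty for an index that was never stored.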

Built-in Functions

    int(x)     The integer part of x, truncated toward 0.
    sqrt(x)    The positive square root of x.
    exp(x)     The exponential of x.
    log(x)     The natural logarithm of x.
    sin(x)     The sine of x, with x in radians.
    cos(x)     The cosine of x, with x in radians.
    rand()     A random number, uniformly distributed between 0 and 1.
    srand(x)   Set the seed for generating random numbers.

Built-in Functions. . .

int(3.14) ⇒ 3

    function randint(n) {
        return int(n * rand())
    }

    # Roll a simulated die.
    function roll(n) {
        return 1 + int(rand() * n)
    }

Built-in Functions. . .

length(str) The number of characters in str.

index(in, find) Search the string in for the first occurrence of the string find. Return the position where that occurrence begins. If find is not found, index returns 0.

match(str, regexp) Search the string str for the longest, leftmost substring matched by regexp. The variable RSTART is set to the position, and RLENGTH is set to the length of the matched substring. If no match is found, RSTART is set to 0. match returns RSTART.

    index("peanut", "an") ⇒ 3
    match("xx1233yy", /[0-9]+/) ⇒ RSTART=3, RLENGTH=4
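The two examples above, plus a typical use of RSTART and RLENGTH with substr, as a sketch for any POSIX awk. The wrapper name string_search_demo and the variables n and s are invented for the illustration.

```shell
string_search_demo() {
    awk 'BEGIN {
        print index("peanut", "an")          # 3
        print index("peanut", "xyz")         # 0: not found
        s = "xx1233yy"
        n = match(s, /[0-9]+/)               # sets RSTART and RLENGTH
        print n, RSTART, RLENGTH             # 3 3 4
        if (n) print substr(s, RSTART, RLENGTH)   # 1233
    }'
}

string_search_demo
```

Because match returns 0 when nothing matches, its result can be used directly as the condition of an if.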

Built-in Functions. . .

split(str, arr, sep) Divide str up into pieces separated by sep, and store the pieces in arr. split returns the number of pieces found.

tolower(str) Return a copy of the string str, with all the upper-case characters translated to their corresponding lower-case counterparts. Nonalphabetic characters are left unchanged.

toupper(str) Same as tolower, but converting to upper-case.

    split("auto-da-fe", a, "-") ⇒
        a[1]="auto"
        a[2]="da"
        a[3]="fe"

Built-in Functions. . .

sub(regexp, repl, str) Search str for the leftmost substring matched by the regular expression regexp. Replace the matched text with repl. The change is made in str itself, which must therefore be a variable; sub returns the number of substitutions made (0 or 1).

gsub(regexp, repl, str) Same as the sub function, except gsub replaces all of the matching substrings it can find.

substr(str, start, len) Return a len-character-long substring of str, starting at character number start.

    s = "Bart-Bart"; sub(/Bart/, "Fart", s)  ⇒ s is "Fart-Bart"
    s = "Bart-Bart"; gsub(/Bart/, "Fart", s) ⇒ s is "Fart-Fart"
    substr("Fart-Bart", 6, 4) ⇒ "Bart"
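A runnable version of the substitution examples, as a sketch for any POSIX awk. The names subst_demo, s, t and n are chosen for the illustration; the key point it shows is that sub and gsub change their third argument in place and return a count.

```shell
subst_demo() {
    awk 'BEGIN {
        s = "Bart-Bart"
        n = sub(/Bart/, "Fart", s)        # replaces the leftmost match only
        print n, s                        # 1 Fart-Bart
        t = "Bart-Bart"
        n = gsub(/Bart/, "Fart", t)       # replaces every match
        print n, t                        # 2 Fart-Fart
        print substr("Fart-Bart", 6, 4)   # Bart
    }'
}

subst_demo
```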

Example

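Here’s a recursive function that prints a string backwards. The second argument is the number of characters still to be printed, so the first call must pass the length of the string:

    function rev (str, len) {
        if (len == 0) {
            printf "\n"
            return
        }
        printf "%c", substr(str, len, 1)
        rev(str, len - 1)
    }

    rev("abcd", length("abcd")) ⇒ prints dcba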

Example

maxelt returns a value for the largest number among the elements of an array.

Arrays are passed by reference, scalars are passed by value.

    function maxelt (vec, i, ret) {
        for (i in vec) {
            if (ret == "" || vec[i] > ret)
                ret = vec[i]
        }
        return ret
    }

    A[1]=5; A[2]=10; A[3]=2; print maxelt(A) ⇒ 10
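The by-reference and by-value rules can be checked alongside maxelt. This is a sketch for any POSIX awk; clear_first and bump are hypothetical helper functions added only to show the difference, and maxelt_demo is an invented wrapper name.

```shell
maxelt_demo() {
    awk '
    function maxelt(vec,    i, ret) {
        for (i in vec)
            if (ret == "" || vec[i] > ret) ret = vec[i]
        return ret
    }
    # Arrays are passed by reference: the caller sees this change.
    function clear_first(vec) { vec[1] = 0 }
    # Scalars are passed by value: the caller does not see this change.
    function bump(n) { n = n + 1 }
    BEGIN {
        A[1] = 5; A[2] = 10; A[3] = 2
        print maxelt(A)                # 10
        clear_first(A); print A[1]     # 0
        k = 7; bump(k); print k        # 7
    }'
}

maxelt_demo
```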

Example

Print the number of occurrences of each word in its input.

Each input line is automatically split up into “fields” (strings separated by blanks). The fields are named $1, $2, ..., $NF.

Variables are automatically initialized to 0 or "".

    #!/bin/gawk -f
    { for (i = 1; i <= NF; i++) freq[$i]++ }

    END {
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }
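Since for (word in freq) visits the words in no particular order, the output of the word counter varies between runs and implementations. One way to get a stable listing, shown here only as a sketch, is to pipe the result through the standard sort utility; the wrapper name wordfreq and the sample input are invented, and any POSIX awk should work.

```shell
wordfreq() {
    awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
    END {
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }' | sort                      # sort makes the output order predictable
}

printf 'a b a\nc a b\n' | wordfreq
```

The counts come out as a 3, b 2 and c 1, each separated from its word by a tab.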

Calling Awk Programs

There is more than one way to call an Awk program. For quick-and-dirty programs you’d want to use #1 and #2, where you can enter the program on the command line. Option #2 reads the input directly from the keyboard, instead of having to go through a file.

1. gawk 'program' input-file1 input-file2 ...

2. gawk 'program'

3. gawk -f program-file input-file1 input-file2 ...

Useful One-Liners

# Print the total number of input lines (≡ 'wc -l'):

gawk 'END {print NR}'

# Print the total number of fields in all input lines:

gawk '{num_fields = num_fields + NF}; END {print num_fields}'

# Print 7 random numbers from 0 to 100:

gawk 'BEGIN {for (i=1; i<=7; i++) print int(101*rand())}'

# Print the total number of bytes used by files. The size is field 5 of
# ls -l output on most current systems; use $4 if your ls omits the group column.

ls -l files | gawk '{x += $5}; END { print "total: " x}'

Useful One-Liners. . .

# Print the last field of every line:

gawk '{print $NF}'

# Print every line after erasing the second field:

gawk '{$2 = ""; print $0}'

# Exchange the first two fields of every line:

gawk '{Temp=$1; $1=$2; $2=Temp; print $0}'

# Print the sum of the fields of every line:

gawk '{sum=0; for(i=1;i<=NF;i++) sum+=$i; print sum}'

Assembler/Interpreter

Write an assembler and interpreter for a hypothetical machine.

The computer has one register (“the accumulator”), ten instructions, and a 1000-word memory.

One machine word holds 5 decimal digits.

A binary instruction consists of the op-code (2 digits) and an address (3 digits).

An assembly language instruction has three fields: label, operation, operand. Any field may be empty; labels must begin in column 1.

Assembler/Interpreter. . .

The first pass of the assembler does lexical and syntactic analysis. Comments are discarded, labels are given a memory location and stored in a symbol table, and the result is written to a temporary file.

Pass 2 reads the temporary file, converts symbolic operands to the memory location computed during pass 1, encodes the operations and operands, and stores the machine-language program into the array mem.

The interpreter is a loop that fetches an instruction from mem, decodes it into an operator and operand, and then simulates the instruction. pc is the program counter.

Assembler/Interpreter. . .

ld zero # initialize sum to zero

st sum

loop get # read a number

jz done # no more input if number is zero

add sum # add in accumulated sum

st sum # store new value back in sum

j loop # go back and read another number

done ld sum # print sum

put

halt

zero const 0

sum const


Assembler/Interpreter. . .

    OPCODE  INSTR    MEANING
    01      get      read a number from the input into the accumulator
    02      put      write the contents of the accumulator to the output
    03      ld M     load accumulator with contents of memory location M
    04      st M     store contents of accumulator in location M
    05      add M    add contents of location M to accumulator
    06      sub M    subtract contents of location M from accumulator
    07      jpos M   jump to location M if accumulator is positive
    08      jz M     jump to location M if accumulator is zero
    09      j M      jump to location M
    10      halt     stop execution
            const C

Assembler/Interpreter. . .

    # asm - assembler and interpreter for a simple computer
    # usage: gawk -f asm program-file data-files

    function Init (n, i) {   # create table of op codes
        n = split("const get put ld st add sub jpos jz j halt", x)
        for (i = 1; i <= n; i++) op[x[i]] = i-1
    }

    BEGIN {
        srcfile = ARGV[1]; ARGV[1] = ""   # other files are data
        tempfile = "asm.temp"             # assumed name; the slide never sets tempfile
        Init()
        ASSEMBLER_PASS_1()
        ASSEMBLER_PASS_2()
        INTERPRETER()
    }

Assembler/Interpreter. . .

getline < file reads a line from the file file. getline returns 1 if a line was read, 0 at end of file, and -1 if the read failed.

print X > "file" writes X onto the file file. Files are opened automatically and closed with close.

Semicolons are unnecessary at the end-of-line. Only use them when separating multiple statements on the same line.

Assembler/Interpreter. . .

    function ASSEMBLER_PASS_1 (nextmem) {
        FS = "[ \t]+"
        while (getline <srcfile > 0) {
            sub(/#.*/, "")             # strip comments
            # remember label location
            symtab[$1] = nextmem
            # save op, addr if present
            if ($2 != "") {
                print $2 "\t" $3 >tempfile
                nextmem++
            }
        }
        close(tempfile)
    }

Assembler/Interpreter. . .

    function ASSEMBLER_PASS_2 (nextmem) {
        while (getline <tempfile > 0) {
            # if symbolic addr, replace by numeric value
            if ($2 !~ /^[0-9]*$/)
                $2 = symtab[$2]
            # pack into word
            mem[nextmem++] = 1000 * op[$1] + $2
        }
    }

Assembler/Interpreter. . .

    # mem is deliberately not listed as a local: the interpreter must see
    # the global array filled in by pass 2.
    function INTERPRETER (pc, addr, code) {
        for (pc = 0; pc >= 0; ) {
            addr = mem[pc] % 1000; code = int(mem[pc++] / 1000)
            if      (code == op["get"])  { getline acc }
            else if (code == op["put"])  { print acc }
            else if (code == op["st"])   { mem[addr] = acc }
            else if (code == op["ld"])   { acc = mem[addr] }
            else if (code == op["add"])  { acc += mem[addr] }
            else if (code == op["sub"])  { acc -= mem[addr] }
            else if (code == op["jpos"]) { if (acc > 0) pc = addr }
            else if (code == op["jz"])   { if (acc == 0) pc = addr }
            else if (code == op["j"])    { pc = addr }
            else if (code == op["halt"]) { pc = -1 }
            else                         { pc = -1 }
        }
    }
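The packing done in pass 2 and the unpacking done by the interpreter are inverse operations on one machine word. A small sketch for any POSIX awk, using op-code 3 (ld in the table above) and an arbitrarily chosen address 10; the wrapper name encode_decode and the variable names are invented for the example.

```shell
encode_decode() {
    awk 'BEGIN {
        opcode = 3; addr = 10                  # "ld" with its operand at location 10
        word = 1000 * opcode + addr            # pack: op-code digits, then 3 address digits
        print word                             # 3010
        print int(word / 1000), word % 1000    # unpack: 3 10
    }'
}

encode_decode
```

Multiplying by 1000 shifts the op-code past the three address digits, which is why the address must stay below 1000.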

Readings and References

There is a book: The Awk Programming Language, published by Addison-Wesley.

http://dmoz.org/Computers/Programming/Languages/Awk.

gawk, GNU’s awk: http://www.gnu.org/software/gawk.

Some of this material is taken from the gawk manual: http://www.gnu.org/manual/gawk/index.html.

Summary

Awk is an interpreted text-processing language with a C-like syntax. The interpreter reads one line of input at a time, looks at each pattern in the program (sequentially, from the top), and executes the actions for those patterns that match. An action can be any sequence of Awk statements.

The syntax of an Awk pattern-action is pattern { action }. Either pattern or action can be missing:

    1. pattern ⇔ pattern { print $0 }.
    2. { action } ⇔ 1 { action }.

prog VAR1=VAL1 VAR2=VAL2 gives the variable VAR1 (VAR2) the value VAL1 (VAL2) within the Awk program prog.

Summary. . .

As long as there is only one statement per line, semicolons can be left out.

There are two special patterns:

    1. BEGIN {run before input is read}.
    2. END {run after input is read}.

Long lines can be broken by appending a backslash (\) at the end.

Summary. . .

Awk has a number of built-in variables:

    NR           Number of records (lines) read.
    NF           Number of fields in current record.
    FILENAME     Name of current file.
    ARGC, ARGV   Number/Array of command line arguments.

Awk can communicate with the operating system:

    system(s)    Execute the command s.
    print | c    Send output to a pipe.