
Parsing

• Obtain text from somewhere (file, user input, web page, ...)

• Analyze text: split it into meaningful tokens

• Extract relevant information, disregard irrelevant information

• ‘Meaningful’ and ‘relevant’ depend on the application: what are we looking for?

– Search phone book for all people named “Ole Hansen”

– Search phone book for all phone numbers starting with 86

– Search phone book for all people living in Ny Munkegade
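The three phone book searches can be sketched in a few lines. This is only an illustration in Python 3: the entries, the "name; street; phone" line layout, and the helper names parse_entries and search are all made up for the example and are not part of the course material.

```python
import re

# Hypothetical phone book, one entry per line: "name; street; phone number"
PHONE_BOOK = """\
Ole Hansen; Ny Munkegade 116; 86123456
Anne Jensen; Nørregade 5; 70112233
Ole Hansen; Vestergade 12; 86998877
Karen Holm; Ny Munkegade 118; 45667788
"""

def parse_entries(text):
    """Split the text into tokens: one dict per non-empty line."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, street, phone = [field.strip() for field in line.split(";")]
        entries.append({"name": name, "street": street, "phone": phone})
    return entries

def search(entries, field, pattern):
    """Keep only the entries whose given field matches the pattern."""
    regex = re.compile(pattern)
    return [entry for entry in entries if regex.search(entry[field])]

entries = parse_entries(PHONE_BOOK)
named_ole = search(entries, "name", r"^Ole Hansen$")
starts_86 = search(entries, "phone", r"^86")
on_ny_munkegade = search(entries, "street", r"^Ny Munkegade\b")
```

The same parsed entries serve all three questions; only the field and the pattern change, which is the sense in which ‘relevant’ depends on the application.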


Example: Torleif game

Sort of like Master Mind with words and letters:

• Two players, each finds a 5-letter noun

• Take turns in guessing

• Score each guess by

– Number of correctly placed letters also present in the hidden word

– Number of incorrectly placed letters also present in the hidden word

Example with hidden word “sport”:

trofæ 1 correct, 2 incorrect

frygt 1 correct, 1 incorrect

..
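The scoring rule can be sketched as a small function. This is a Python 3 illustration with a hypothetical name (score_guess), not the game program from the slides; it assumes the example above scores the guesses trofæ and frygt against the hidden word sport.

```python
def score_guess(hidden, guess):
    """Return (correct, misplaced) for a guess against the hidden word.

    correct:   letters in the right position
    misplaced: letters present in the hidden word but in another position
    Each letter of the hidden word is counted at most once.
    """
    hidden_left = []   # hidden letters not matched in place
    guess_left = []    # guessed letters not matched in place
    correct = 0
    for h, g in zip(hidden, guess):
        if h == g:
            correct += 1
        else:
            hidden_left.append(h)
            guess_left.append(g)

    misplaced = 0
    for g in guess_left:
        if g in hidden_left:
            misplaced += 1
            hidden_left.remove(g)   # use up this hidden letter
    return correct, misplaced

print(score_guess("sport", "trofæ"))
print(score_guess("sport", "frygt"))
```

Exact matches are set aside first so that a letter already counted as correctly placed cannot also be counted as misplaced.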


Let’s write a computer player:

1. Pick random word (from homepage of Dansk Sprognævn).

2. Ask for a guess

3. Was the guess correct?

4. Otherwise score the guess

5. Go to 2.

Dansk Sprognævn, dictionary web page

We are looking for 5-letter strings in bold followed by the string “sb” in italics

Ask for all words starting with ..

Page displays at most 50 words at a time

Parsing the web page

The source code of the dynamically generated web page has 370 lines. Some of it looks like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<TITLE>Retskrivningsordbogen på nettet fra Dansk Sprognævn</TITLE>
<META name="Author" content="Erik Bo Krantz Simonsen, www.progresso.dk">
<META name="Description" content="The official Danish orthography dictionary on the web">
<META name="KeyWords" content="RO2001, Retskrivningsordbogen, ordbog, dictionary, orthography, Dansk Sprognævn">
<LINK rel="STYLESHEET" href="http://www.dsn.dk/ordbog.aux/ro2001ie.css" type="text/css">
<SCRIPT language="JavaScript" type="text/javascript"> <!--
self.focus(); // frame focus
if (document.searchForm && document.searchForm.P)
..
src="http://www.dsn.dk/ordbog.aux/lowerRight.gif"></td></tr></table></TD>
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0><TR><TD rowspan="2" valign="top">
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=7 WIDTH=390>
<TR BGCOLOR="#d0e0d0"><TD>spondæisk adj., itk. d.s.</TD></TR>
<<TR><TD>sporstof sb., -fet, -fer.</TD></TR>
<TR BGCOLOR="#d0e0d0"><TD>sport sb., -en, i sms. sports-, fx sportsstævne.</TD></TR>
</HTML>
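To see what "5 letters followed by sb" picks out of such a page, here is a small Python 3 sketch run on the three table rows from the excerpt. The variable names are made up for the illustration. It compares a pattern without a left boundary against one that requires the five letters to start a word; the second pattern is a suggested refinement, not something taken from the slides.

```python
import re

# The three table rows from the page excerpt, joined into one string.
sample_html = (
    '<TR BGCOLOR="#d0e0d0"><TD>spondæisk adj., itk. d.s.</TD></TR>'
    '<TR><TD>sporstof sb., -fet, -fer.</TD></TR>'
    '<TR BGCOLOR="#d0e0d0"><TD>sport sb., -en, i sms. sports-, fx sportsstævne.</TD></TR>'
)

LETTERS = "a-zæøå"

# Five letters followed by " sb": can also match the tail of a longer word.
naive = re.compile(r"([%s]{5}) sb" % LETTERS)

# Same, but the five letters must not be preceded by another letter.
anchored = re.compile(r"(?<![%s])([%s]{5}) sb\." % (LETTERS, LETTERS))

naive_hit = naive.search(sample_html).group(1)
anchored_hit = anchored.search(sample_html).group(1)
print(naive_hit)
print(anchored_hit)
```

On these rows the first pattern stops at the last five letters of sporstof, while the anchored one skips that 8-letter word and finds sport.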


Algorithm for picking a random word

• Pick a random initial letter x (weighted – count total number of words beginning with each letter)

• Pick random index (in the list of all words starting with x)

• Ask website for webpage with next 50 x-words starting at chosen index

• Parse webpage and look for first 5-letter noun

• If none is found, ask for next 50 (wrap-around)
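The first two steps (weighted letter, then index) can be sketched without touching the website. This Python 3 version uses cumulative sums and bisect instead of an explicit loop. The per-letter counts are the ones listed in the get_random_word.py module in this document; the function name and the rng parameter are assumptions made for the example.

```python
import random
from bisect import bisect_right
from itertools import accumulate

ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"

# Number of words per initial letter (a-z, æ, ø, å); q has weight 0.
WEIGHTS = (3194, 4540, 759, 2651, 1556, 5221, 2658, 3141, 1890, 526,
           4979, 2327, 3086, 1665, 2074, 3480, 0, 2455, 8460, 3845,
           2315, 2230, 78, 20, 102, 77, 262, 252, 175)

CUMULATIVE = list(accumulate(WEIGHTS))  # running totals, last one is the sum

def pick_letter_and_index(rng=random):
    """Pick a start letter with probability proportional to its word count,
    then a uniformly random index among the words starting with it."""
    r = rng.randrange(CUMULATIVE[-1])
    # First position whose running total exceeds r; zero-weight letters are skipped.
    pos = bisect_right(CUMULATIVE, r)
    index = rng.randrange(WEIGHTS[pos])
    return ALPHABET[pos], index, WEIGHTS[pos]

letter, index, count = pick_letter_and_index()
print(letter, index, count)
```

The returned triple corresponds to the three values the module puts into the query URL: the letter, the start index, and the number of words for that letter.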


get_random_word.py module

# -*- coding: iso-8859-1 -*-
# (assumes this file is saved as iso-8859-1, the charset declared by the web page)
import urllib
import sys
import re
import random

def getRandom5letterNoun():

    # hyppighed (frequency): number of words per initial letter a-z, æ, ø, å.
    # q has weight 0 since there are no 5-letter Danish nouns starting with q!
    hyppighed = (3194, 4540, 759, 2651, 1556, 5221, 2658, 3141, 1890, 526,
                 4979, 2327, 3086, 1665, 2074, 3480, 0, 2455, 8460, 3845,
                 2315, 2230, 78, 20, 102, 77, 262, 252, 175)  # sum: 64018

    r = random.randrange(0, 64018)
    total = hyppighed[0]
    startbogstav = 0
    while total <= r:  # pick random (weighted) starting letter
        startbogstav += 1
        total += hyppighed[startbogstav]

    bogstavhyppighed = hyppighed[startbogstav]
    startindex = random.randrange(0, bogstavhyppighed)  # pick random index

    # translate from chosen character code into actual letter
    if startbogstav == 26:
        startbogstav = 'æ'
    elif startbogstav == 27:
        startbogstav = 'ø'
    elif startbogstav == 28:
        startbogstav = 'å'
    else:
        startbogstav = chr(startbogstav+97)

    found_word = 0
    while not found_word:
        try:
            # get next 50 words, starting from chosen index, from website:
            myurl = "http://www.dsn.dk/cgi-bin/ordbog/ronet?M=1&P=%s&L=50&F=%d&T=%d" \
                    % (startbogstav, startindex, bogstavhyppighed)
            tempfile = urllib.urlopen(myurl)
            tekst = tempfile.read()
            tempfile.close()
        except IOError:
            print "Kan ikke få fat på Dansk Sprognævn"  # "Cannot reach Dansk Sprognævn"
            sys.exit(1)

        # replace special codes with corresponding letters
        tekst = tekst.replace("&aelig;", "æ")
        tekst = tekst.replace("&oslash;", "ø")
        tekst = tekst.replace("&aring;", "å")

        # look for 5-letter noun; the lookbehind keeps the match
        # from starting in the middle of a longer word
        wordRE = "(?<![a-zæøå])([a-zæøå]{5}) sb"
        compiled_word = re.compile( wordRE )
        resultat = compiled_word.search( tekst )
        if resultat:
            word = resultat.group(1)
            found_word = 1
        else:
            # get next 50 words from website
            startindex += 50
            if startindex >= bogstavhyppighed:
                startindex = 0

    return word

fatwa

pligt

areal

intet

synål

ceder

tvist


Game program

from get_random_word import getRandom5letterNoun

ord = getRandom5letterNoun()  # ord: "word" (note: shadows the built-in ord)

g = ""
svar = "\n"  # svar: "answer", the score list printed after every guess

while ord != g:

    g = ""
    while len(g) != 5:
        g = raw_input("Dit 5-bogstavers bud? ").strip()  # "Your 5-letter guess?"

    guess = g
    kopi = ord
    r = 0  # number of correctly placed matching letters
    f = 0  # number of incorrectly placed matching letters
    for b in range(5):
        if guess[b] == kopi[b]:
            r += 1
            kopi = kopi[0:b] + '*' + kopi[b+1:]
            guess = guess[0:b] + '@' + guess[b+1:]

    for b in range(5):
        index = kopi.find(guess[b])
        if index >= 0:
            f += 1
            kopi = kopi[0:index] + '*' + kopi[index+1:]
            guess = guess[0:b] + '@' + guess[b+1:]

    svar = svar + "%s %dr %df\n" % (g, r, f)
    print svar

Dit 5-bogstavers bud? sport

sport 1r 1f

Dit 5-bogstavers bud? stang

sport 1r 1f

stang 1r 2f

Dit 5-bogstavers bud? satin

sport 1r 1f

stang 1r 2f

satin 3r 0f

Dit 5-bogstavers bud? salon

sport 1r 1f

stang 1r 2f

satin 3r 0f

salon 5r 0f


Intermezzo 1 – find it on the web:

1. Copy the get-random-word module: /users/chili/CSS.E03/ExamplePrograms/get_random_word.py

2. Make a new program that imports this module and prints out 5 random words.

3. Make a new version of the get_random_word module so that it returns a random noun of between 5 and 10 letters. Import this module and print out 5 random words.

4. Make a new version of the get_random_word module so that it finds a random word which has an alternative spelling and returns a tuple of both versions. E.g. (sponsering, sponsorering). Import this module and print out 5 random such word pairs using e.g. print "%s or %s" %getWords()

(Hint: See this sample webpage generated by Dansk Sprognævn's website and find the word sponsorering which has an alternative spelling ("el." is short for "eller" which means "or"). Then look at the source code of this page, which you can find here: /users/chili/CSS.E03/ExamplePrograms/dsn_page.txt. Check how exactly the words sponsering and sponsorering appear in the html. Use that example to write a new regular expression.)

http://www.daimi.au.dk/~chili/CSS/Intermezzi/9.10.1.html

http://www.daimi.au.dk/~chili/CSS/ExamplePrograms/get_random_word.py







solution

# 5-10 letter nouns (the lookbehind keeps the match from starting inside a longer word):
wordRE = "(?<![a-zæøå])([a-zæøå]{5,10}) sb"

# words with alternative spelling (look at html first):
wordRE = "([a-zæøå]+) \(el\. ([a-zæøå]+)"
compiled_word = re.compile(wordRE)
resultat = compiled_word.search(tekst)
if resultat:
    word = resultat.group(1)
    word2 = resultat.group(2)
..
return (word, word2)

..

<TR><TD>

sponsere (el. sponsorere) vb., -ede.

</TD></TR>

..
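As a quick check of the alternative-spelling pattern, the sketch below runs it on the table row shown above, written on one line. It is a Python 3 illustration; the function name find_alternative_spelling is made up, and the pattern here also requires the closing parenthesis.

```python
import re

# The sample table row, written on a single line.
sample_row = "<TR><TD>sponsere (el. sponsorere) vb., -ede.</TD></TR>"

# word, then " (el. ", then the alternative word, then ")"
ALT_SPELLING = re.compile(r"([a-zæøå]+) \(el\. ([a-zæøå]+)\)")

def find_alternative_spelling(html_text):
    """Return (word, alternative) for the first match, or None if there is none."""
    match = ALT_SPELLING.search(html_text)
    if match is None:
        return None
    return match.group(1), match.group(2)

pair = find_alternative_spelling(sample_row)
print("%s or %s" % pair)
```

Returning None when nothing matches lets the caller decide whether to fetch the next 50 words, as the original module does for 5-letter nouns.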


Sequence formats

Say we get this sequence in fasta format from some database:

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Now we need to compare this sequence to all sequences in some other database. Unfortunately this database uses the phylip format, so we need to translate.

Phylip Format:

The first line of the input file contains the number of species, the number of sequences and their length (in characters) separated by blanks.

The next line contains the sequence name, followed by the sequence in blocks of 10 characters.


Sequence formats

So we copy and paste and translate the sequence:

fasta:

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

phylip:

1 1 338
FOSB_MOUSE MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
           PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
           GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA RPRRPREETL
           TPEEEEKRRV RRERNKLAAA KCRNRRRELT DRLQAETDQL EEEKAELESE
           IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSTSAKED
           GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
           TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL

and all is well.

Then our boss says “Do it for these 5000 sequences.”

We need an automatic filter!

• Need a program that reads any number of fasta sequences and converts them into phylip format (want to run sequences through a filter)

• Program structure:

1. Open fasta file

2. Parse file to extract needed information

3. Create and save phylip file

• We will use this definition for the fasta format (and assume only one sequence per file):

– The description line starts with a greater than symbol (">").

– The word immediately following the greater than symbol (">") is the "ID" (name) of the sequence, the rest of the line is the description.

– The "ID" and the description are optional.

– All lines of text should be shorter than 80 characters.

– The sequence ends if there is another greater than symbol (">") at the beginning of a line, and another sequence begins.


Pseudo-code fasta→phylip filter

1. Open fasta file

2. Find line starting with >

3. Parse this line and extract first word after the > (sequence name)

4. Read the sequence (count its length)

5. Open phylip file

6. Write “1 1” followed by seq. length

7. Write seq. name

8. Write sequence in blocks of 10

9. Close files
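The pseudo-code can be sketched directly in Python 3. To keep the example self-contained it works on strings instead of files (steps 1, 5 and 9 are left out). The function names, and the choice of a 10-character name column with five blocks per line, are assumptions modelled on the example output in this document.

```python
def parse_fasta(text):
    """Return (name, sequence) for the first record in fasta-formatted text."""
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                break                      # a second record starts: stop
            words = line[1:].split()
            name = words[0] if words else "unknown"
        elif name is not None and line:
            chunks.append(line)            # sequence may span several lines
    if name is None:
        raise ValueError("no line starting with '>' found")
    return name, "".join(chunks)

def to_phylip(name, sequence, block=10, blocks_per_line=5):
    """Format one sequence: '1 1 <length>' header, then blocks of 10."""
    blocks = [sequence[i:i + block] for i in range(0, len(sequence), block)]
    prefix = "%-10s " % name
    lines = ["1 1 %d" % len(sequence)]
    for start in range(0, len(blocks), blocks_per_line):
        lines.append(prefix + " ".join(blocks[start:start + blocks_per_line]))
        prefix = " " * len(prefix)         # continuation lines: spaces, no name
    return "\n".join(lines) + "\n"

name, sequence = parse_fasta(">SEQ1 demo sequence\nABCDEFGHIJKL\nMNOP\n")
print(to_phylip(name, sequence))
```

Building the blocks first and joining them per line avoids the bookkeeping with character counters, at the cost of holding the whole sequence in memory.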


The other way too: pseudo-code phylip→fasta filter

1. Open phylip file

2. Find first non-empty line, ignore!

3. Parse next line and extract first word (sequence name)

4. Read rest of line and following lines to get the sequence

5. Open fasta file

6. Write “>” followed by seq. name

7. Write sequence in lines of 80

8. Close files
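The reverse direction can be sketched the same way in Python 3, again on strings rather than files. The function names are hypothetical. The line width is a parameter with a default of 60, matching the fasta example in this document; the pseudo-code above says 80, so pass width=80 to follow it literally.

```python
def parse_phylip(text):
    """Return (name, sequence) from phylip-formatted text with one sequence."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and at least one sequence line")
    # lines[0] is the first non-empty line ("1 1 <length>"): ignored.
    first = lines[1].split()
    name = first[0]
    blocks = first[1:]                 # rest of the name line
    for line in lines[2:]:
        blocks.extend(line.split())    # continuation lines hold only blocks
    return name, "".join(blocks)

def to_fasta(name, sequence, width=60):
    """Format one sequence as '>name' followed by lines of the given width."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + name] + rows) + "\n"

phylip_text = "1 1 16\nSEQ1       ABCDEFGHIJ KLMNOP\n"
name, sequence = parse_phylip(phylip_text)
print(to_fasta(name, sequence))
```

Only the sequence name survives the round trip; the free-text description of the original fasta line is not stored in the phylip file.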


More formats?

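• Boss: “Great! What about EMBL and GDE formats?”

• Coding, coding, ..: 12 filters! (four formats, each translated into the three others)

[Figure: formats connected pairwise by filters, e.g. fasta-phylip and phylip-fasta]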


More formats?

• Boss: “Super. And GenBank and ClustalW..?”

• Coding, coding, coding, ..: 30 filters (six formats, each translated into the five others)

• Next new format: 12 new filters! I.e., this doesn’t scale.


Intermediate format

• Use our own internal format as intermediate step:

• Two formats: four filters

[Figure: fasta and phylip each connected to the internal format by two filters: fasta-internal, internal-fasta, phylip-internal, internal-phylip]


Intermediate format

• Six formats: 12 filters (not 30)

• New format: always two new filters only

[Figure: the formats arranged around the i-format, each connected to it by two filters]
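The filter counts on these slides follow from two small formulas: n(n-1) direct filters for n formats, versus 2n filters when everything goes through the internal format. A Python 3 sketch, with made-up function names:

```python
def direct_filters(n_formats):
    """One filter for every ordered pair of distinct formats."""
    return n_formats * (n_formats - 1)

def via_internal_filters(n_formats):
    """One filter into and one out of the internal format, per format."""
    return 2 * n_formats

for n in (2, 4, 6, 7):
    print(n, direct_filters(n), via_internal_filters(n))
```

For two formats the intermediate step costs more (four filters instead of two); the saving starts at four formats and grows from there.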


Let’s build a structured program!

• Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format

• Each internal2y filter module: save each i-format sequence in separate file in y format

• Example: Overall phylip-fasta filter:

– import phylip2i and i2fasta modules

– obtain filenames to load from and save to from command line

– call parse_file method of the phylip2i module

– call the save_to_files method of the i2fasta module


Our internal format revisited

class Isequence:
    """Definition of abstract data type representing a sequence in I-format
    - internal format"""

    def __init__(self, t = "unknown", n = "unknown", i = "unknown"):
        """Initialize fields to given values"""
        self.type = t
        self.name = n
        self.id = i
        self.sequence = ""  # represent the sequence itself as a string

Thus, the information we keep about a sequence is type, name, id; all other information is disregarded
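The filter modules in this document call extend_sequence, get_sequence, get_name and get_id on Isequence objects, but the slide only shows the constructor. A minimal Python 3 sketch of what the full class might look like follows. The method bodies are assumptions, not the course's actual Isequence.py, and the id "TREX1" is a made-up value.

```python
class Isequence:
    """A sequence in the internal (i-)format: type, name, id and the sequence."""

    def __init__(self, t="unknown", n="unknown", i="unknown"):
        self.type = t
        self.name = n
        self.id = i
        self.sequence = ""   # the sequence itself, as a string

    def extend_sequence(self, more):
        """Append another chunk (e.g. one line of a file) to the sequence."""
        self.sequence += more

    def get_sequence(self):
        return self.sequence

    def get_name(self):
        return self.name

    def get_id(self):
        return self.id

    def get_type(self):
        return self.type

seq = Isequence("dna", "Tyrannosaurus Rex", "TREX1")
seq.extend_sequence("cgacta")
seq.extend_sequence("agctta")
print(seq.get_id(), seq.get_name(), len(seq.get_sequence()))
```

Parsers only need the constructor and extend_sequence; savers only need the get_ methods, which keeps the two kinds of filter module independent of each other.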


Example: fasta/phylip filter

• Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format

from Isequence import Isequence

class Parser:
    # loads and parses fasta file into list of i-sequences
    def __init__(self):
        self.iseqlist = []  # initialize empty list
        self.iseq = None    # no sequence is being built yet

    def parse_file(self, loadfilename):
        <<load file, save content in variable lines>>

        for line in lines:
            if line.startswith('>'):  # new sequence starts
                items = line.split()

                # assume: dna, first word after > is the id, next two words are the name.
                self.iseq = Isequence("dna", " ".join(items[1:3]), items[0][1:])
                self.iseqlist.append(self.iseq)  # put new Isequence object in list

            elif self.iseq:
                # we are currently building an iseq object, extend its sequence
                self.iseq.extend_sequence(line.strip())  # skip trailing newline

        return self.iseqlist

• Each internal2y filter module: save each i-format sequence in separate file in y format

import sys

class SaveToFiles:
    # save i-sequences in phylip format

    def save_to_files(self, iseqlist, savefilename):
        try:
            for seq in iseqlist:
                <<create appropriate suffix for the savefilename (a unique file per sequence)>>
                savefile = open(savefilename + suffix, "w")

                seqstring = seq.get_sequence()
                print >> savefile, "1 1 %d" % len(seqstring)

                prefix = "%-10s " % seq.get_name()  # write name
                savefile.write(prefix)
                prefix = " " * len(prefix)  # on remaining lines write spaces instead of name
                counter = 1
                for char in seqstring:
                    savefile.write(char)
                    if counter % 10 == 0:
                        savefile.write(" ")
                    if counter % 50 == 0:
                        savefile.write("\n%s" % prefix)
                    counter += 1
                savefile.close()
        except IOError, message:
            sys.exit(message)


Command-line arguments

• Python stores command-line arguments in a list called sys.argv

• The first argument is the name of the program that the user is running from the command-line

# filename: command_line_arguments.py
import sys

print "first argument is program name:", sys.argv[0]
print "arguments for the program start at index 1:"
for arg in sys.argv[1:]:
    print arg

threonine:~...ExamplePrograms% python command_line_arguments.py 1 2 3 qq

first argument is program name: command_line_arguments.py

arguments for the program start at index 1:

1

2

3

qq

Overall fasta/phylip filter

1. import fasta2i and i2phylip modules

2. obtain filenames to load from and save to from command line

3. call parse_file method of the fasta2i module

4. call the save_to_files method of the i2phylip module

import Isequence
from i2phylip import SaveToFiles
from fasta2i import Parser
import sys

# Now SaveToFiles is a class that can save i-format sequences in phylip format,
# and Parser is a class that reads a fasta file and parses it into i-format.

# load a fasta file, save each sequence in its own file in phylip format
if len(sys.argv) != 3:
    sys.exit("""Program takes two arguments: file to load fasta sequence(s) from
and file (prefix) to save phylip sequences in.""")

# NB: nothing about phylip and fasta below this point..

loadfilename = sys.argv[1]
savefilename = sys.argv[2]

# parse file and store each sequence in Isequence object:
input_parser = Parser()
iseq_list = input_parser.parse_file(loadfilename)

# save each Isequence in required format in separate files:
save_object = SaveToFiles()
save_object.save_to_files(iseq_list, savefilename)


i2embl filter module..?


import sys

class SaveToFiles:  # same class name

    def save_to_files(self, iseqlist, savefilename):  # same method name
        try:
            for seq in iseqlist:
                <<create appropriate suffix for the savefilename (a unique file per sequence)>>
                savefile = open(savefilename + suffix, "w")

                <<convert i-sequence to embl format and write to file>>
                savefile.close()

        except IOError, message:
            sys.exit(message)

Fasta/embl filter..?

import Isequence
from i2embl import SaveToFiles  # import same class name from different module
from fasta2i import Parser
import sys

# Now SaveToFiles is a class that can save i-format sequences in embl format,
# and Parser is a class that reads a fasta file and parses it into i-format.

# load a fasta file, save each sequence in its own file in embl format
if len(sys.argv) != 3:
    sys.exit("""Program takes two arguments: file to load fasta sequence(s) from
and file (prefix) to save embl sequences in.""")

loadfilename = sys.argv[1]
savefilename = sys.argv[2]

# parse file and store each sequence in Isequence object:
input_parser = Parser()
iseq_list = input_parser.parse_file(loadfilename)

# save each Isequence in required format in separate files:
save_object = SaveToFiles()
save_object.save_to_files(iseq_list, savefilename)


Intermezzo 2, on the web:

Oh no, the phylip format has been changed by its designers!

• The first line of a file with a sequence in the new phylip format is a comment line and begins with "@@". In this comment line the name of the author of the file should appear, the year of creation, and the name of the author's favorite football player, separated by commas.

• In the next lines the sequence is written, starting with "##".

• In the final line (starting with "!!"), the sequence name is written.

Thus, a phylip format file might look like this:

@@Jakob Fredslund, 2003, Zinedine Zidane
##cgactaagcttagcacggatcgatcggaattctagagcgacgacgtctagcagcgcgtaacgtatagctcgcgaggaaagctctgtaggggactgcgagaagatgg
!!Tyrannosaurus Rex

Rewrite the fasta/phylip filter to incorporate the changed phylip format. Find all needed files here. I.e.:

1. Copy the needed files – remember Isequence.py

2. Run the overall fasta/phylip filter on the given example fasta file and check the resulting phylip files to see how it works.

3. Make the necessary changes in the right places.

http://www.daimi.au.dk/~chili/CSS/Intermezzi/9.10.2.html

http://www.daimi.au.dk/~chili/CSS/exampleprograms.html


solution

• We only need to modify the i2phylip module (neat! – another good reason to use an intermediate format).

# save each sequence in separate file:
try:
    for seq in iseqlist:
        suffix = ".phylip"
        if len(iseqlist) > 1:
            suffix = "_" + seq.get_id() + suffix
        savefile = open(savefilename + suffix, "w")
        seqstring = seq.get_sequence()
        print >> savefile, "@@Jakob Fredslund, 2003, Zidane"
        print >> savefile, "##%s" % seqstring
        print >> savefile, "!!%s" % seq.get_name()
        savefile.close()
except IOError, message:  # same error handling as in the original i2phylip module
    sys.exit(message)
