Post on 19-Dec-2015
Parsing
• Obtain text from somewhere (file, user input, web page, ..)
• Analyze text: split it into meaningful tokens
• Extract relevant information, disregard irrelevant information
• ‘Meaningful’ and ‘relevant’ depend on the application: what are we looking for?
– Search phone book for all people named “Ole Hansen”
– Search phone book for all phone numbers starting with 86
– Search phone book for all people living in Ny Munkegade
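The three phone-book queries can be made concrete with a minimal Python 3 sketch. The record layout (name; phone; street) and the sample entries are invented for illustration; a real phone book would first have to be parsed into some such structure.

```python
# Hypothetical phone book: each entry is "name; phone; street" on one line.
PHONE_BOOK = """\
Ole Hansen; 86123456; Ny Munkegade 116
Anne Jensen; 87654321; Nørregade 4
Ole Hansen; 21436587; Vestergade 12
Mette Olsen; 86998877; Ny Munkegade 3
"""

def parse_entries(text):
    """Split the raw text into (name, phone, street) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, phone, street = [field.strip() for field in line.split(";")]
        entries.append((name, phone, street))
    return entries

entries = parse_entries(PHONE_BOOK)

# The same parsed data answers different questions:
named_ole = [e for e in entries if e[0] == "Ole Hansen"]
starts_86 = [e for e in entries if e[1].startswith("86")]
on_ny_munkegade = [e for e in entries if e[2].startswith("Ny Munkegade")]

print(len(named_ole), len(starts_86), len(on_ny_munkegade))
```

Which field counts as 'relevant' changes from query to query, while the parsing step stays the same.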
2
Example: Torleif game
Sort of like Master Mind with words and letters:
• Two players, each finds a 5-letter noun
• Take turns guessing
• Score each guess by
– Number of letters in the guess that sit at the same position in the hidden word (correct)
– Number of letters in the guess that occur in the hidden word, but at a different position (incorrect)
Example, hidden word sport:
trofæ 1 correct, 2 incorrect
frygt 1 correct, 1 incorrect
..
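The scoring rule can be sketched as a small Python 3 function. This is an illustration only: the function name is made up, and it reads the example above as the hidden word sport being guessed with trofæ and frygt.

```python
def score_guess(hidden, guess):
    """Return (correct, misplaced) for a guess against the hidden word."""
    hidden_left = []   # hidden letters not matched in place
    guess_left = []    # guess letters not matched in place
    correct = 0
    for h, g in zip(hidden, guess):
        if h == g:
            correct += 1
        else:
            hidden_left.append(h)
            guess_left.append(g)

    misplaced = 0
    for g in guess_left:
        if g in hidden_left:
            misplaced += 1
            hidden_left.remove(g)  # each hidden letter can be used only once
    return correct, misplaced

print(score_guess("sport", "trofæ"))
print(score_guess("sport", "frygt"))
```

Removing a hidden letter once it has been matched keeps a repeated letter in the guess from being counted twice.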
3
Let’s write a computer player:
1. Pick random word (from homepage of Dansk Sprognævn).
2. Ask for a guess
3. Was the guess correct?
4. Otherwise score the guess
5. Go to 2.
4
Dansk Sprognævn, dictionary web page
We are looking for 5-letter strings in bold followed by the string “sb” in italics
Ask for all words starting with ..
Page displays at most 50 words at a time
5
Parsing the web page
The source code of the dynamically generated web page has 370 lines. Some of it looks like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<TITLE>Retskrivningsordbogen på nettet fra Dansk Sprognævn</TITLE>
<META name="Author" content="Erik Bo Krantz Simonsen, www.progresso.dk">
<META name="Description" content="The official Danish orthography dictionary on the web">
<META name="KeyWords" content="RO2001, Retskrivningsordbogen, ordbog, dictionary, orthography, Dansk Sprognævn">
<LINK rel="STYLESHEET" href="http://www.dsn.dk/ordbog.aux/ro2001ie.css" type="text/css">
<SCRIPT language="JavaScript" type="text/javascript">
<!--
self.focus(); // frame focus
if (document.searchForm && document.searchForm.P)
..
src="http://www.dsn.dk/ordbog.aux/lowerRight.gif"></td></tr></table></TD>
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0><TR><TD rowspan="2" valign="top">
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=7 WIDTH=390>
<TR BGCOLOR="#d0e0d0"><TD><B>spondæisk </B><I>adj., itk. d.s.</I></TD></TR>
<TR><TD><B>sporstof </B><I>sb., </I>-fet, -fer.</TD></TR>
<TR BGCOLOR="#d0e0d0"><TD><B>sport </B><I>sb., </I>-en, <I>i sms. </I>sports-, <I>fx </I>sportsstævne.</TD></TR>
</HTML>
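The three dictionary rows at the end of the page source above are enough to try out the pattern "5-letter string in bold followed by sb in italics". A minimal Python 3 sketch; the variable names are made up and only the three rows are copied from the page.

```python
import re

# Three table rows copied from the page source shown above.
html = (
    '<TR BGCOLOR="#d0e0d0"><TD><B>spondæisk </B><I>adj., itk. d.s.</I></TD></TR>'
    '<TR><TD><B>sporstof </B><I>sb., </I>-fet, -fer.</TD></TR>'
    '<TR BGCOLOR="#d0e0d0"><TD><B>sport </B><I>sb., </I>-en, <I>i sms. </I>sports-, '
    '<I>fx </I>sportsstævne.</TD></TR>'
)

# Exactly five lowercase letters in bold, a space, then italic text starting with "sb".
noun_re = re.compile(r"<B>([a-zæøå]{5}) </B><I>sb")

found = noun_re.findall(html)
print(found)
```

The adjective spondæisk fails on the word class and the noun sporstof fails on the length, so only sport is reported.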
6
Algorithm for picking a random word
• Pick a random initial letter x (weighted – count total number of words beginning with each letter)
• Pick random index (in the list of all words starting with x)
• Ask website for webpage with next 50 x-words starting at chosen index
• Parse webpage and look for first 5-letter noun
• If none is found, ask for next 50 (wrap-around)
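The weighted choice of the initial letter in the first step can be sketched in Python 3 as a walk over running totals. The function name and the toy weights are assumptions made for this illustration.

```python
import random

def pick_weighted_index(weights, rng=random):
    """Return index i with probability weights[i] / sum(weights)."""
    total = sum(weights)
    r = rng.randrange(total)      # 0 <= r < total
    running = 0
    for index, weight in enumerate(weights):
        running += weight
        if r < running:           # r falls inside this letter's slice
            return index
    raise ValueError("weights must contain at least one positive value")

# Toy weights for three letters; the middle letter can never be chosen.
weights = [3, 0, 1]
counts = [0, 0, 0]
rng = random.Random(0)            # fixed seed so the run is repeatable
for _ in range(1000):
    counts[pick_weighted_index(weights, rng)] += 1
print(counts)
```

In current Python versions the standard library's random.choices(population, weights=...) performs the same kind of weighted draw.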
7
get_random_word.py module
import urllib
import sys
import re
import random

def getRandom5letterNoun():
    # hyppighed = number of words beginning with each letter (a-z, æ, ø, å)
    # q has weight 0 since there are no 5-letter Danish nouns starting with q!
    hyppighed = (3194, 4540, 759, 2651, 1556, 5221, 2658, 3141, 1890, 526,
                 4979, 2327, 3086, 1665, 2074, 3480, 0, 2455, 8460, 3845,
                 2315, 2230, 78, 20, 102, 77, 262, 252, 175)  # sum: 64018

    r = random.randrange(0, 64018)
    sum = hyppighed[0]
    startbogstav = 0
    while sum <= r:  # pick random (weighted) starting letter; <= so that r = 0..3193 gives 'a'
        startbogstav += 1
        sum += hyppighed[startbogstav]

    bogstavhyppighed = hyppighed[startbogstav]
    startindex = random.randrange(0, bogstavhyppighed)  # pick random index

    # translate from chosen character code into actual letter
    if startbogstav == 26:
        startbogstav = 'æ'
    elif startbogstav == 27:
        startbogstav = 'ø'
    elif startbogstav == 28:
        startbogstav = 'å'
    else:
        startbogstav = chr(startbogstav + 97)

    found_word = 0
    while not found_word:
        try:
            # get next 50 words, starting from chosen index, from website:
            myurl = "http://www.dsn.dk/cgi-bin/ordbog/ronet?M=1&P=%s&L=50&F=%d&T=%d" \
                    % (startbogstav, startindex, bogstavhyppighed)
            tempfile = urllib.urlopen(myurl)
            tekst = tempfile.read()
            tempfile.close()
        except IOError:
            print "Kan ikke få fat på Dansk Sprognævn"  # "Cannot reach Dansk Sprognævn"
            sys.exit(1)

        # replace special codes (HTML entities) with corresponding letters
        tekst = tekst.replace("&aelig;", "æ")
        tekst = tekst.replace("&oslash;", "ø")
        tekst = tekst.replace("&aring;", "å")

        wordRE = "<B>([a-zæøå]{5}) </B><I>sb"  # look for 5-letter noun
        compiled_word = re.compile(wordRE)
        resultat = compiled_word.search(tekst)
        if resultat:
            word = resultat.group(1)
            found_word = 1
        else:
            # get next 50 words from website
            startindex += 50
            if startindex > bogstavhyppighed:
                startindex = 0
    return word
Sample words returned by getRandom5letterNoun():
fatwa
pligt
areal
intet
synål
ceder
tvist
9
Game program
from get_random_word import getRandom5letterNoun

ord = getRandom5letterNoun()
g = ""
svar = "\n"

while ord != g:
    g = ""
    while len(g) != 5:
        g = raw_input("Dit 5-bogstavers bud? ").strip()  # "Your 5-letter guess?"

    guess = g
    kopi = ord
    r = 0  # number of correctly placed matching letters
    f = 0  # number of incorrectly placed matching letters
    for b in range(5):
        if guess[b] == kopi[b]:
            r += 1
            # mask the matched letters so they are not counted again
            kopi = kopi[0:b] + '*' + kopi[b+1:]
            guess = guess[0:b] + '@' + guess[b+1:]

    for b in range(5):
        index = kopi.find(guess[b])
        if index >= 0:
            f += 1
            kopi = kopi[0:index] + '*' + kopi[index+1:]
            guess = guess[0:b] + '@' + guess[b+1:]

    svar = svar + "%s %dr %df\n" % (g, r, f)
    print svar
Dit 5-bogstavers bud? sport
sport 1r 1f
Dit 5-bogstavers bud? stang
sport 1r 1f
stang 1r 2f
Dit 5-bogstavers bud? satin
sport 1r 1f
stang 1r 2f
satin 3r 0f
Dit 5-bogstavers bud? salon
sport 1r 1f
stang 1r 2f
satin 3r 0f
salon 5r 0f
10
Intermezzo 1 – find it on the web:
1. Copy the get_random_word module: /users/chili/CSS.E03/ExamplePrograms/get_random_word.py
2. Make a new program that imports this module and prints out 5 random words.
3. Make a new version of the get_random_word module so that it returns a random noun of between 5 and 10 letters. Import this module and print out 5 random words.
4. Make a new version of the get_random_word module so that it finds a random word which has an alternative spelling and returns a tuple of both versions, e.g. (sponsering, sponsorering). Import this module and print out 5 random such word pairs using e.g. print "%s or %s" %getWords()
(Hint: See this sample webpage generated by Dansk Sprognævn's website and find the word sponsorering, which has an alternative spelling ("el." is short for "eller", which means "or"). Then look at the source code of this page, which you can find here: /users/chili/CSS.E03/ExamplePrograms/dsn_page.txt. Check how exactly the words sponsering and sponsorering appear in the html. Use that example to write a new regular expression.)
http://www.daimi.au.dk/~chili/CSS/Intermezzi/9.10.1.html
11
solution
# 5-10 letter nouns:
wordRE = "<B>([a-zæøå]{5,10}) </B><I>sb"

# words with alternative spelling (look at html first):
wordRE = "<B>([a-zæøå]+) </B><I>\(el. </I>([a-zæøå]+)"
compiled_word = re.compile(wordRE)
resultat = compiled_word.search(tekst)
if resultat:
    word = resultat.group(1)
    word2 = resultat.group(2)
..
return (word, word2)
..
<TR><TD>
<B>sponsere </B><I>(el. </I>sponsorere<I>) vb., </I>-ede.
</TD></TR>
..
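The alternative-spelling pattern can be checked against the html row shown above. A minimal Python 3 sketch; it escapes the dot after "el" (a small tightening of the pattern, not the course's exact expression), and the variable names are made up.

```python
import re

# The table row from the sample page source.
row = "<TR><TD><B>sponsere </B><I>(el. </I>sponsorere<I>) vb., </I>-ede.</TD></TR>"

# Bold headword, then "(el. " in italics, then the alternative spelling in plain text.
alt_re = re.compile(r"<B>([a-zæøå]+) </B><I>\(el\. </I>([a-zæøå]+)")

match = alt_re.search(row)
if match:
    pair = (match.group(1), match.group(2))
    print("%s or %s" % pair)
```

The two parenthesised groups correspond to the two members of the returned tuple.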
12
Sequence formats
Say we get this sequence in fasta format from some database:
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
Now we need to compare this sequence to all sequences in some other database. Unfortunately this database uses the phylip format, so we need to translate.
Phylip format:
• The first line of the input file contains the number of species, the number of sequences and their length (in characters), separated by blanks.
• The next line contains the sequence name, followed by the sequence in blocks of 10 characters.
13
Sequence formats
So we copy and paste and translate the sequence:
fasta:
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
phylip:
1 1 338
FOSB_MOUSE MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
           PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
           GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA RPRRPREETL
           TPEEEEKRRV RRERNKLAAA KCRNRRRELT DRLQAETDQL EEEKAELESE
           IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSTSAKED
           GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
           TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL
and all is well.
Then our boss says “Do it for these 5000 sequences.”
14
We need an automatic filter!
• Need a program that reads any number of fasta sequences and converts them into phylip format (we want to run sequences through a filter)
• Program structure:
1. Open fasta file
2. Parse file to extract needed information
3. Create and save phylip file
• We will use this definition of the fasta format (and assume only one sequence per file):
– The description line starts with a greater-than symbol (">").
– The word immediately following the greater-than symbol is the "ID" (name) of the sequence; the rest of the line is the description.
– The "ID" and the description are optional.
– All lines of text should be shorter than 80 characters.
– The sequence ends when another greater-than symbol (">") appears at the beginning of a line; there the next sequence begins.
15
Pseudo-code fasta→phylip filter
1. Open fasta file
2. Find line starting with >
3. Parse this line and extract first word after the > (sequence name)
4. Read the sequence (count its length)
5. Open phylip file
6. Write “1 1” followed by seq. length
7. Write seq. name
8. Write sequence in blocks of 10
9. Close files
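The pseudo-code above can be sketched in Python 3 as a function that works on strings rather than files, which keeps the example self-contained. The function name, the five-blocks-per-line layout (taken from the FOSB_MOUSE example) and the one-sequence-per-input assumption are choices made for this illustration.

```python
def fasta_to_phylip(fasta_text):
    """Convert a one-sequence fasta string to the phylip layout described above."""
    name = "unknown"
    sequence_parts = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if words:                      # ID and description are optional
                name = words[0]
        else:
            sequence_parts.append(line)
    sequence = "".join(sequence_parts)

    # Cut the sequence into blocks of 10, five blocks per output line.
    blocks = [sequence[i:i + 10] for i in range(0, len(sequence), 10)]
    rows = [" ".join(blocks[i:i + 5]) for i in range(0, len(blocks), 5)]

    indent = " " * 11                      # width of the "%-10s " name column
    lines = ["1 1 %d" % len(sequence)]
    for number, row in enumerate(rows):
        prefix = "%-10s " % name if number == 0 else indent
        lines.append(prefix + row)
    return "\n".join(lines) + "\n"

example = (">FOSB_MOUSE Protein fosB.\n"
           "MFQAFPGDYDSGSRCSSSPSAESQYLSSVD\n"
           "SFGSPPTAAASQECAGLGEMPGSFVPTVTA\n")
print(fasta_to_phylip(example), end="")
```

Reading from and writing to files (steps 1, 5 and 9) would wrap this function with open() calls.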
16
The other way too: pseudo-code phylip→fasta filter
1. Open phylip file
2. Find first non-empty line (the counts and the length) and ignore it!
3. Parse next line and extract first word (sequence name)
4. Read rest of line and following lines to get the sequence
5. Open fasta file
6. Write “>” followed by seq. name
7. Write sequence in lines of 80
8. Close files
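The reverse direction can be sketched the same way in Python 3. The function name is made up, and the output line width is a parameter: the default of 60 mirrors the FOSB_MOUSE fasta example and is an assumption, since the steps above ask for lines of 80.

```python
def phylip_to_fasta(phylip_text, width=60):
    """Convert a one-sequence phylip string (layout as above) to fasta."""
    lines = [line for line in phylip_text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and at least one sequence line")

    # lines[0] is the "1 1 <length>" header; it is skipped.
    first = lines[1].split()
    name = first[0]
    blocks = first[1:]
    for line in lines[2:]:
        blocks.extend(line.split())
    sequence = "".join(blocks)

    out = [">" + name]
    for i in range(0, len(sequence), width):
        out.append(sequence[i:i + width])
    return "\n".join(out) + "\n"

print(phylip_to_fasta("1 1 16\nSEQ1       ACGTACGTAC GTACGT\n"), end="")
```

Splitting on whitespace removes both the name column and the gaps between the blocks of 10 in one step.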
17
More formats?
• Boss: “Great! What about EMBL and GDE formats?”
• Coding, coding, ..: 12 filters!
[Figure: fasta and phylip connected by two filters, fasta-phylip and phylip-fasta]
18
More formats?
• Boss: “Super. And GenBank and ClustalW..?”
• Coding, coding, coding, ..: 30 filters
• Next new format: 12 new filters! I.e., this doesn’t scale.
19
Intermediate format
• Use our own internal format as intermediate step:
• Two formats: four filters
[Figure: fasta and phylip each linked to the internal format by two filters: fasta-internal, internal-fasta, phylip-internal, internal-phylip]
20
Intermediate format
• Six formats: 12 filters (not 30)
• New format: always two new filters only
[Figure: formats connected through the central i-format]
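The filter counts quoted on these slides follow from two simple formulas: n(n-1) direct filters for n formats, against 2n filters when everything passes through the internal format. A short Python 3 check; the function names are made up.

```python
def direct_filters(n_formats):
    """One filter for every ordered pair of distinct formats."""
    return n_formats * (n_formats - 1)

def hub_filters(n_formats):
    """One filter into and one out of the internal format, per format."""
    return 2 * n_formats

for n in (2, 4, 6, 7):
    print(n, direct_filters(n), hub_filters(n))
```

Going from six to seven formats adds 42 - 30 = 12 direct filters, but only two filters in the intermediate-format design.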
21
Let’s build a structured program!
• Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format
• Each internal2y filter module: save each i-format sequence in separate file in y format
• Example: Overall phylip-fasta filter:
– import phylip2i and i2fasta modules
– obtain filenames to load from and save to from command line
– call parse_file method of the phylip2i module
– call the save_to_files method of the i2fasta module
22
Our internal format revisited
class Isequence:
    """Definition of abstract data type representing a sequence in I-format
    - internal format"""

    def __init__(self, t="unknown", n="unknown", i="unknown"):
        """Initialize fields to given values"""
        self.type = t
        self.name = n
        self.id = i
        self.sequence = ""  # represent the sequence itself as a string
Thus, the information we keep about a sequence is type, name, id; all other information is disregarded
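The filter modules on the following slides call extend_sequence, get_sequence, get_name and get_id on Isequence objects, although the slide above shows only the constructor. A minimal Python 3 sketch of what those methods might look like; their bodies are inferred from how they are used, and the id "TREX1" is invented for the demonstration.

```python
class Isequence:
    """A sequence in the internal (I-)format: type, name, id and the residues."""

    def __init__(self, t="unknown", n="unknown", i="unknown"):
        self.type = t
        self.name = n
        self.id = i
        self.sequence = ""

    # The accessors below are inferred from how later slides call the class.
    def extend_sequence(self, more):
        """Append another chunk of residues to the stored sequence."""
        self.sequence += more

    def get_sequence(self):
        return self.sequence

    def get_name(self):
        return self.name

    def get_id(self):
        return self.id

seq = Isequence("dna", "Tyrannosaurus Rex", "TREX1")
seq.extend_sequence("cgactaag")
seq.extend_sequence("cttagcac")
print(seq.get_id(), seq.get_name(), seq.get_sequence())
```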
23
Example: fasta/phylip filter
• Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format
from Isequence import Isequence

class Parser:
    # loads and parses fasta file into list of i-sequences
    def __init__(self):
        self.iseqlist = []  # initialize empty list
        self.iseq = None    # no sequence is being built yet

    def parse_file(self, loadfilename):
        <<load file, save content in variable lines>>

        for line in lines:
            if line[0] == '>':  # new sequence starts
                items = line.split()
                # assume: dna, first word after > is the id, next two words are the name.
                self.iseq = Isequence("dna", " ".join(items[1:3]), items[0][1:])
                self.iseqlist.append(self.iseq)  # put new Isequence object in list
            elif self.iseq:
                # we are currently building an iseq object, extend its sequence
                self.iseq.extend_sequence(line.strip())  # skip trailing newline
        return self.iseqlist
24
• Each internal2y filter module: save each i-format sequence in separate file in y format

import sys
from Isequence import Isequence

class SaveToFiles:
    # save i-sequences in phylip format
    def save_to_files(self, iseqlist, savefilename):
        try:
            for seq in iseqlist:
                <<create appropriate suffix for the savefilename (a unique file per sequence)>>
                savefile = open(savefilename + suffix, "w")

                seqstring = seq.get_sequence()
                print >> savefile, "1 1 %d" % len(seqstring)

                prefix = "%-10s " % seq.get_name()  # write name
                savefile.write(prefix)
                prefix = " " * len(prefix)  # on remaining lines write spaces instead of name
                counter = 1
                for char in seqstring:
                    savefile.write(char)
                    if counter % 10 == 0:  # blank after each block of 10
                        savefile.write(" ")
                    if counter % 50 == 0:  # new line after five blocks
                        savefile.write("\n%s" % prefix)
                    counter += 1
                savefile.close()
        except IOError, message:
            sys.exit(message)
25
Command-line arguments
• Python stores command-line arguments in a list called sys.argv
• The first element, sys.argv[0], is the name of the program that the user is running from the command line; the program's own arguments start at index 1
# filename: command_line_arguments.py
import sys
print "first argument is program name:", sys.argv[0]
print "arguments for the program start at index 1:"
for arg in sys.argv[1:]:
    print arg
threonine:~...ExamplePrograms% python command_line_arguments.py 1 2 3 qq
first argument is program name: command_line_arguments.py
arguments for the program start at index 1:
1
2
3
qq
26
Overall fasta/phylip filter
1. import fasta2i and i2phylip modules
2. obtain filenames to load from and save to from command line
3. call parse_file method of the fasta2i module
4. call the save_to_files method of the i2phylip module

import Isequence
from i2phylip import SaveToFiles
from fasta2i import Parser
import sys

# Now SaveToFiles is a class that can save i-format sequences in phylip format,
# and Parser is a class that reads a fasta file and parses it into i-format.
# NB: nothing about phylip and fasta in the code below this point..

# load a fasta file, save each sequence in its own file in phylip format
if len(sys.argv) != 3:
    sys.exit("""Program takes two arguments: file to load fasta sequence(s) from
and file (prefix) to save phylip sequences in.""")

loadfilename = sys.argv[1]
savefilename = sys.argv[2]

# parse file and store each sequence in Isequence object:
input_parser = Parser()
iseq_list = input_parser.parse_file(loadfilename)

# save each Isequence in required format in separate files:
save_object = SaveToFiles()
save_object.save_to_files(iseq_list, savefilename)
27
i2embl filter module..?
import sys
from Isequence import Isequence

class SaveToFiles:  # same class name
    def save_to_files(self, iseqlist, savefilename):  # same method name
        try:
            for seq in iseqlist:
                <<create appropriate suffix for the savefilename (a unique file per sequence)>>
                savefile = open(savefilename + suffix, "w")
                <<convert i-sequence to embl format and write to file>>
                savefile.close()
        except IOError, message:
            sys.exit(message)
28
Fasta/embl filter..?
import Isequence
from i2embl import SaveToFiles  # import same class name from different module
from fasta2i import Parser
import sys

# Now SaveToFiles is a class that can save i-format sequences in embl format,
# and Parser is a class that reads a fasta file and parses it into i-format.

# load a fasta file, save each sequence in its own file in embl format
if len(sys.argv) != 3:
    sys.exit("""Program takes two arguments: file to load fasta sequence(s) from
and file (prefix) to save embl sequences in.""")

loadfilename = sys.argv[1]
savefilename = sys.argv[2]

# parse file and store each sequence in Isequence object:
input_parser = Parser()
iseq_list = input_parser.parse_file(loadfilename)

# save each Isequence in required format in separate files:
save_object = SaveToFiles()
save_object.save_to_files(iseq_list, savefilename)
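Because every reader produces the internal representation and every writer consumes it, the choice of formats can be reduced to a lookup. The Python 3 sketch below uses plain functions in dictionaries instead of the course's modules and classes; all names, the (name, sequence) tuples standing in for Isequence objects, and the simplified one-line phylip writer are assumptions made for illustration.

```python
def read_fasta(text):
    """Parse fasta text into a list of (name, sequence) pairs."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            records.append([words[0] if words else "unknown", ""])
        elif records and line.strip():
            records[-1][1] += line.strip()
    return [tuple(record) for record in records]

def write_fasta(records):
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records)

def write_phylip(records):
    # Simplified: one header per record and the whole sequence on one line.
    return "".join("1 1 %d\n%-10s %s\n" % (len(seq), name, seq)
                   for name, seq in records)

READERS = {"fasta": read_fasta}
WRITERS = {"fasta": write_fasta, "phylip": write_phylip}

def convert(text, source_format, target_format):
    """Go through the internal representation: reader first, then writer."""
    return WRITERS[target_format](READERS[source_format](text))

print(convert(">seq1 demo\nACGT\nTTGA\n", "fasta", "phylip"), end="")
```

Supporting a further format then means adding one entry to each dictionary, which mirrors the "two new filters" rule.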
29
Intermezzo 2, on the web:
Oh no, the phylip format has been changed by its designers!
• The first line of a file with a sequence in the new phylip format is a comment line and begins with "@@". In this comment line the name of the author of the file should appear, the year of creation, and the name of the author's favorite football player, separated by commas.
• In the next lines the sequence is written, starting with "##".
• In the final line (starting with "!!"), the sequence name is written.
Thus, a phylip format file might look like this:
@@Jakob Fredslund, 2003, Zinedine Zidane
##cgactaagcttagcacggatcgatcggaattctagagcgacgacgtctagcagcgcgtaacgtatagctcgcgaggaaagctctgtaggggactgcgagaagatgg
!!Tyrannosaurus Rex
Rewrite the fasta/phylip filter to incorporate the changed phylip format. Find all needed files here. I.e.:
1. Copy the needed files – remember Isequence.py
2. Run the overall fasta/phylip filter on the given example fasta file and check the resulting phylip files to see how it works.
3. Make the necessary changes in the right places.
http://www.daimi.au.dk/~chili/CSS/Intermezzi/9.10.2.html
30
solution
• We only need to modify the i2phylip module (neat! – another good reason to use an intermediate format).
# save each sequence in separate file:
try:
    for seq in iseqlist:
        suffix = ".phylip"
        if len(iseqlist) > 1:
            suffix = "_" + seq.get_id() + suffix
        savefile = open(savefilename + suffix, "w")
        seqstring = seq.get_sequence()
        print >> savefile, "@@Jakob Fredslund, 2003, Zidane"
        print >> savefile, "##%s" % seqstring
        print >> savefile, "!!%s" % seq.get_name()
        savefile.close()
except IOError, message:
    sys.exit(message)
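The exercise only asks for the writing side. For completeness, reading the new format back could be sketched as below in Python 3. This is not part of the course solution: the function name is made up, and treating unmarked lines between "##" and "!!" as a continuation of the sequence is an assumption about the format.

```python
def parse_new_phylip(text):
    """Return (comment, sequence, name) from the three marked sections."""
    comment = sequence = name = ""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("@@"):
            comment = line[2:]
        elif line.startswith("##"):
            sequence = line[2:]
        elif line.startswith("!!"):
            name = line[2:]
        elif sequence and not name:
            sequence += line   # assumed continuation of the sequence section
    return comment, sequence, name

sample = ("@@Jakob Fredslund, 2003, Zinedine Zidane\n"
          "##cgactaagcttagc\n"
          "!!Tyrannosaurus Rex\n")
print(parse_new_phylip(sample))
```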