Searching (and manipulating) your data
>A06662 Synthetic nucleotide sequence of the human GSH transferase pi gene. : Location:1..1000
UGGGACCAGUCAGCAGAGGCAGCGUGUGUGCGCGUGCGUGUGCGUGUGUGUGCGUGUGUGUGUGUACGCUUGCAUUUGUGUCGGGUGGGUAAGGAGAUAGAGAUGGGCGGGCAGUAGGCCCAGGUCCCGAAGGCCUUGAACCCACUGGUUUGGAGUCUCCUAAGGGCAAUGGGGGCCAUUGAGAAGUCUGAACAGGGCUGUGUCUGAAUGUGAGGUCUAGAAGGAUCCUCCAGAGAAGCCAGCUCUAAAGCUUUUGCAAUCAUCUGGUGAGAGAACCCAGCAAGGAUGGACAGGCAGAAUGGAAUAGAGAUGAGUUGGCAGCUGAAGUGGACAGGAUUUGGUACUAGCCUGGUUGUGGGGAGCAAGCAGAGGAGAAUCUGGGACUCUGGUGGUCUGGCCUGGGGCAGACGGGGGUGUCUCAGGGGCUGGGAGGGAUGAGAGUAGGAUGAUACAUGGUGGUGUCUGGCAGGAGGCGGGCAAGGAUGACUAUGUGAAGGCACUGCCCGGGCAACUGAAGCCUUUUGAGACCCUGCUGUCCCAGAACCAGGGAGGCAAGACCUUCAUUGUGGGAGACCAGGUGAGCAUCUGGCC
UGG : W
GAC : D
CAG : Q
UCA : S
GCA : A
GAG : E
GCA : A
GCG : A
UGU : C
>A06662_protein
WDQSAEAAC
codonAMINO = {
    'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
    'CGU':'R','CGC':'R','CGA':'R','CGG':'R','AGA':'R','AGG':'R',
    'UCU':'S','UCC':'S','UCA':'S','UCG':'S','AGU':'S','AGC':'S',
    'AUU':'I','AUC':'I','AUA':'I',
    'UUA':'L','UUG':'L','CUU':'L','CUC':'L','CUA':'L','CUG':'L',
    'GGU':'G','GGC':'G','GGA':'G','GGG':'G',
    'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
    'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
    'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
    'AAU':'N','AAC':'N',
    'GAU':'D','GAC':'D',
    'UGU':'C','UGC':'C',
    'CAA':'Q','CAG':'Q',
    'GAA':'E','GAG':'E',
    'CAU':'H','CAC':'H',
    'AAA':'K','AAG':'K',
    'UUU':'F','UUC':'F',
    'UAU':'Y','UAC':'Y',
    'AUG':'M',
    'UGG':'W',
    'UAG':'STOP','UGA':'STOP','UAA':'STOP'
}
codonAMINO = {
    'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
    'CGU':'R','CGC':'R','CGA':'R','CGG':'R','AGA':'R','AGG':'R',
    'UCU':'S','UCC':'S','UCA':'S','UCG':'S','AGU':'S','AGC':'S',
    'AUU':'I','AUC':'I','AUA':'I',
    'UUA':'L','UUG':'L','CUU':'L','CUC':'L','CUA':'L','CUG':'L',
    'GGU':'G','GGC':'G','GGA':'G','GGG':'G','AAU':'N','AAC':'N',
    'GUU':'V','GUC':'V','GUA':'V','GUG':'V','GAU':'D','GAC':'D',
    'ACU':'T','ACC':'T','ACA':'T','ACG':'T','UGU':'C','UGC':'C',
    'CCU':'P','CCC':'P','CCA':'P','CCG':'P','CAA':'Q','CAG':'Q',
    'GAA':'E','GAG':'E','CAU':'H','CAC':'H','AAA':'K','AAG':'K',
    'UUU':'F','UUC':'F','UAU':'Y','UAC':'Y','AUG':'M','UGG':'W',
    # 'AUG' appears twice: the later entry replaces the earlier one,
    # so codonAMINO['AUG'] returns 'START', not 'M'
    'AUG':'START','UAG':'STOP','UGA':'STOP','UAA':'STOP'
}
>>> codonAMINO['GCU']
'A'
>>> codonAMINO['AUG']
'START'
>>> for k in codonAMINO.keys():
...     print k, codonAMINO[k]
GUC V
AUA I
GUA V
GUG V
ACU T
AAC N
etc.
Dictionaries
Dictionaries are unordered collections of objects
Dictionaries are structures for mapping immutable objects (keys) to arbitrary objects (values)
d = {key1:value1, key2:value2,…,keyN:valueN}
Lists and dictionaries cannot be used as dictionary keys, because they are mutable
Keys must be unique, i.e. the same key cannot be associated with more than one value
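The two rules about keys can be checked directly. The following is a minimal sketch written for Python 3; the peptide strings and the tuple key (peptide name, position) are made up for illustration.

```python
# Keys must be immutable: strings, numbers and tuples are fine.
d = {}
d['pep1'] = 'MGSNK'            # string key
d[('pep1', 3)] = 'S'           # tuple key (hypothetical: peptide name, position)

# A list is mutable, so using it as a key raises TypeError.
try:
    d[['pep1', 3]] = 'S'
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

# Keys are unique: assigning to an existing key replaces the old value.
d['pep1'] = 'RSLEP'
n_entries = len(d)

print(list_key_allowed)        # False
print(d['pep1'])               # RSLEP
print(n_entries)               # 2
```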
>>> d = {'pep1':'MGSNKSKPKDASQRRRSLEPAENVHGAGG', \
...      'pep2':'RSLEPAENVHGAGGGAFPASQTPS'}
>>> len(d)
2
>>> d['pep1']
'MGSNKSKPKDASQRRRSLEPAENVHGAGG'
>>> d['pep3'] = 'ASADGHRGPSAAFAPAAA'
>>> d
{'pep1' : 'MGSNKSKPKDASQRRRSLEPAENVHGAGG', 'pep2' : 'RSLEPAENVHGAGGGAFPASQTPS', 'pep3' : 'ASADGHRGPSAAFAPAAA'}
>>> del d['pep2']
>>> d
{'pep1' : 'MGSNKSKPKDASQRRRSLEPAENVHGAGG', 'pep3' : 'ASADGHRGPSAAFAPAAA'}
>>> d.clear()
>>> d
{}
>>> d = {"a":1, "b":2, "c":3}
>>> d.keys()    # list of dictionary keys
['a', 'c', 'b']
>>> keys = d.keys()
>>> keys.sort()    # sort the keys in place
>>> keys
['a', 'b', 'c']
>>> d.values()    # list of dictionary values
[1, 3, 2]
>>> d.items()    # list of (key, value) tuples
[('a', 1), ('c', 3), ('b', 2)]
>>> d.has_key("a")    # True if d has key "a", else False
True
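The session above shows Python 2 behaviour: keys(), values() and items() return lists, and has_key() exists. In Python 3 these methods return views and has_key() has been removed, so a rough equivalent might look like the sketch below. The lookup of a missing key "z" is added only for illustration.

```python
d = {"a": 1, "b": 2, "c": 3}

keys = sorted(d.keys())        # sorted() builds a new sorted list from the view
values = sorted(d.values())
items = sorted(d.items())      # list of (key, value) tuples

has_a = "a" in d               # replaces d.has_key("a")
has_z = "z" in d
missing = d.get("z", 0)        # get() returns a default instead of raising KeyError

print(keys)                    # ['a', 'b', 'c']
print(items)                   # [('a', 1), ('b', 2), ('c', 3)]
print(has_a, has_z, missing)   # True False 0
```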
Exercise
Using the codonAMINO dictionary from tgac.py, translate the sequence in rna_seq.fasta. Start with a single reading frame. Then try all three reading frames.
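# Translate a single reading frame
from tgac import codonAMINO

F = open('rna_seq.fasta')
Out = open('protein_seq.fasta', 'w')

# Read the FASTA file: write the header, collect the sequence
seq = ''
for line in F:
    if line[0] == '>':
        header = line.split()
        geneID = header[0]
        Out.write(geneID + '_protein\n')
    else:
        seq = seq + line.strip()

# Translate codon by codon; unknown codons become '*'
prot = ''
for i in range(0, len(seq), 3):
    if seq[i:i+3] in codonAMINO:    # 'in' works where has_key() is unavailable
        prot = prot + codonAMINO[seq[i:i+3]]
    else:
        prot = prot + '*'

Out.write(prot + '\n')
F.close()
Out.close()

# Translate all three reading frames
from tgac import codonAMINO

F = open('rna_seq.fasta')
Out = open('protein_seq.fasta', 'w')

seq = ''
for line in F:
    if line[0] == '>':
        header = line.split()
        geneID = header[0]
        Out.write(geneID + '_protein\n')
    else:
        seq = seq + line.strip()

prot = ''
for j in range(3):
    Out.write(str(j) + "-frame\n")
    for i in range(j, len(seq), 3):
        if seq[i:i+3] in codonAMINO:
            prot = prot + codonAMINO[seq[i:i+3]]
        else:
            prot = prot + '*'
    Out.write(prot + '\n')
    prot = ''    # reset before the next frame

F.close()
Out.close()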
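The translation loop can also be wrapped in a function so that the same code serves every reading frame. This is only a sketch, written for Python 3: the function name translate and the small CODON_TABLE are assumptions made for illustration (the table holds only the codons of the first nine positions of A06662, not the full codonAMINO from tgac.py). Unlike the loops above, it skips a trailing incomplete codon instead of writing '*' for it.

```python
# Partial codon table, for illustration only: a complete table
# (such as codonAMINO in tgac.py) would be used in practice.
CODON_TABLE = {
    'UGG': 'W', 'GAC': 'D', 'CAG': 'Q', 'UCA': 'S',
    'GCA': 'A', 'GAG': 'E', 'GCG': 'A', 'UGU': 'C',
}

def translate(seq, table, frame=0):
    """Translate seq codon by codon, starting at offset frame (0, 1 or 2).

    Codons missing from the table are written as '*';
    a trailing incomplete codon is ignored.
    """
    prot = []
    for i in range(frame, len(seq) - 2, 3):
        prot.append(table.get(seq[i:i + 3], '*'))
    return ''.join(prot)

# First 27 nucleotides of the A06662 sequence
rna = 'UGGGACCAGUCAGCAGAGGCAGCGUGU'
frames = [translate(rna, CODON_TABLE, j) for j in range(3)]
print(frames[0])    # WDQSAEAAC
```

Using table.get(codon, '*') replaces the explicit membership test followed by a lookup with a single call.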
Remove redundancy
How many different objects?
How many unique objects?
Are the two groups identical?
What is the intersection of the two groups?
Q5XXA6 Q9Y5P2 Q14667 O75387 Q8WV07 Q8CH62 Q9GZY1 Q9NQQ7 Q8VCX2 Q7Z769 Q8CH62 Q14667 Q9NQQ7 Q14667 Q9Y5P2
Q7Z769 Q8CH62 Q9GZY1 Q9NQQ7 Q14667 Q5XXA6 Q9Y5P2 Q14667 O75387 Q9Y5P2 Q8WV07 Q8VCX2 Q8CH62 Q14667 Q9NQQ7
Sets are unordered collections of unique objects
•Sets do not support indexing and slicing
•in and not in operators can be used to test an element for membership in a set
•Sets are useful for removing duplicates
•Set operations: intersection, union, difference, symmetric difference
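The four questions asked about the two groups of identifiers can be answered with a few set operations. The sketch below is written for Python 3 and assumes that the two groups split into six-character identifiers; the variable names are chosen for illustration.

```python
group1 = ("Q5XXA6 Q9Y5P2 Q14667 O75387 Q8WV07 Q8CH62 Q9GZY1 Q9NQQ7 "
          "Q8VCX2 Q7Z769 Q8CH62 Q14667 Q9NQQ7 Q14667 Q9Y5P2").split()
group2 = ("Q7Z769 Q8CH62 Q9GZY1 Q9NQQ7 Q14667 Q5XXA6 Q9Y5P2 Q14667 "
          "O75387 Q9Y5P2 Q8WV07 Q8VCX2 Q8CH62 Q14667 Q9NQQ7").split()

# Converting a list to a set removes the duplicates
unique1 = set(group1)
unique2 = set(group2)

print(len(group1), len(unique1))     # 15 10
print(len(group2), len(unique2))     # 15 10
print(unique1 == unique2)            # True: same identifiers in both groups
print(len(unique1 & unique2))        # 10: the intersection is the whole set
```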
Sets
Sets are not sequence-like objects, and they cannot contain identical elements
To create a set, call the built-in set(x), where x is a sequence-like object (string, tuple, list)
Create a new set
add(x): adds the single element x to the set
update(x): adds every element of the sequence x to the set
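A short sketch of creating and growing a set, written for Python 3. The example string and list are made up for illustration.

```python
s = set('GATTACA')          # from a string: one element per distinct character
t = set(['a', 'b', 'a'])    # from a list: the duplicate 'a' is dropped

s.add('U')                  # add a single element
s.update(['N', 'G'])        # add every element of a list; 'G' is already present

print(sorted(s))            # ['A', 'C', 'G', 'N', 'T', 'U']
print(len(t))               # 2
print('N' in s, 'X' not in s)   # True True
```

Sets are unordered, so sorted() is used here to print the elements in a predictable order.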
S1.union(S2)
The union between 2 sets S1 and S2 creates a new set with the elements from both S1 and S2.
>>> S1 = set(['a','b','c'])
>>> S2 = set(['c','d','e'])
>>> S1.union(S2)
set(['a', 'c', 'b', 'e', 'd'])
>>> S1 | S2
set(['a', 'c', 'b', 'e', 'd'])
S1.intersection(S2)
The intersection of 2 sets S1 and S2 creates a new set with the elements common to S1 and S2
>>> S1 = set(['a','b','c'])
>>> S2 = set(['c','d','e'])
>>> S1.intersection(S2)
set(['c'])
>>> S1 & S2
set(['c'])
S1.symmetric_difference(S2) or S1 ^ S2
Symmetric difference of two sets S1 and S2 creates a new set with elements in either S1 or S2 but not both
>>> S1 = set(['a','b','c'])
>>> S2 = set(['c','d','e'])
>>> S1.symmetric_difference(S2)
set(['a', 'b', 'e', 'd'])
>>> S1 ^ S2
set(['a', 'b', 'e', 'd'])
S1.difference(S2) or S1 - S2
The difference of two sets S1 and S2 creates a new set with elements in S1 but not in S2
>>> S1 = set(['a','b','c'])
>>> S2 = set(['c','d','e'])
>>> S1.difference(S2)
set(['a', 'b'])
>>> S1 - S2
set(['a', 'b'])
>>> S2 - S1
set(['e', 'd'])