template design © 2007 haha ssaha kelvin gu, tiffany lin, nick altemose, kevin tao duke...

1
TEMPLATE DESIGN © 2007 www.PosterPresentations.com Haha SSAHA Kelvin Gu, Tiffany Lin, Nick Altemose, Kevin Tao Duke University, Trinity College of Arts and Sciences Introduction APT Sample Search Advantages and Conclusions Contact information In conclusion: -SSAHA is a search method with one goal in mind: speed. -SSAHA likes to keep things organized, like an old lady who puts everything in labeled Tupperware, which she then organizes into labeled drawers. -SSAHA takes a database of data, like the 3 Gigabase human genome reference sequence, and organizes it into a fast searchable index, known as a hash table. -The hash table is like a dictionary to the English language, or an index to an encyclopedia. -You only have to make a hash table once for a given database, kind of like you only have to tidy your room up once, then you can find things extremely easily thereafter. Do a pictorial example of a SSAHA search. One Advantage of SSAHA -SSAHA stores individual nucleotide base information as 2-bits of information per base: A is stored as 00 C is stored as 01 G is stored as 10 T is stored as 10 -This allows for fast and efficient storage and processing, using less memory and less processor power -However, the drawback is that only 4 characters are possible, and sometimes real DNA data has gaps and ambiguous characters -Other search engines treat these ambiguities as separate letters, like R or N (N means it could be any base) -SSAHA reverts all ambiguities to A, since AAAAA… is the most common ‘word’ in the genome, and the search can easily recognize it as uninformative How Does a Hash Table Work? Ssaha uses hash tables, which is what makes it fast. It’s like a lookup table: phonebook, or bookshelf, basically insert in a key (piece of info) and get info related to it. This article explains hash tables. Hash tables are 2 dimensional tables represented by keys and values. Hash tables take information, assigns a key to it, and puts it in a certain location. The location is determined by hash-code or hashing function or mapping function or randomization technique, or key-to-address transform. A major problem with hash tables is collisions, because there are often more identifiers than addresses. Two pieces of information share the same address in the hash table. Luckily, we have solutions to collision, which are: 1) make a list of all data that all have the same home address, and then when the address is looked up, look through the list to find all the synonyms. 2) if the first address is already occupied, generate a new address, and put the data there. 3) each address is a bucket, with a capacity to hold many items. 4) use a function that does not make repeats. Hash tables are inflexible, if you want to increase the size, all hash values must be recalculated. SSAHA Step by Step Make a flowchart here Talk about the steps… fill the rest of this space. Given a collection of strands, and a target string, re-arrange the strands in the collection so that they are in order based on how they compare to the target string. The measurement of similarity is the percentage of base pairs of the shorter string that match up with the longer string. The shorter string(pattern string) is shifted left and right to find the best match, and the similarity is measured there. The percentage is calculated based on the length of the shorter string. When two strings have the same percentage similarity, the longer string should come first. When two strings have the same percentage similarity, and length, alphabetical order should take precedence. Definition Class: StringMatcher Method: orderList Parameters: String, String[] Returns: String[] Method signature: String[] orderList(String target, String[] patterns) (be sure your method is public) Class public class StringMatcher { public String[] orderList(String target, String[] patterns) { // fill in code here } } Constraints A strand’s similarity should be based on the percentage of matching base pairs The pattern strands must be shorter or equal to the length of the target strand The pattern strand should be allowed to find its best match at any point in the target strand If two or more strands share the same percentage similarity, list in order of length If two or more strands share the same Contact info? T: 510.649.3001 F: 510.649.0331 TF: 1.866.649.3004 E: [email protected] OPTIONAL LOGO HERE OPTIONAL LOGO HERE A C T N N G C A 0001100000100100

Upload: lucinda-briggs

Post on 29-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: TEMPLATE DESIGN © 2007  Haha  SSAHA Kelvin Gu, Tiffany Lin, Nick Altemose, Kevin Tao Duke University, Trinity College of Arts

TEMPLATE DESIGN © 2007

www.PosterPresentations.com

Haha SSAHAKelvin Gu, Tiffany Lin, Nick Altemose, Kevin Tao

Duke University, Trinity College of Arts and Sciences

Introduction APT Sample Search

Advantages and Conclusions

Contact information

In conclusion:

-SSAHA is a search method with one goal in mind: speed.

-SSAHA likes to keep things organized, like an old lady who puts everything in labeled Tupperware, which she then organizes into labeled drawers.

-SSAHA takes a database of data, like the 3 Gigabase human genome reference sequence, and organizes it into a fast searchable index, known as a hash table.

-The hash table is like a dictionary to the English language, or an index to an encyclopedia.

-You only have to make a hash table once for a given database, kind of like you only have to tidy your room up once, then you can find things extremely easily thereafter.

Do a pictorial example of a SSAHA search.

One Advantage of SSAHA

-SSAHA stores individual nucleotide base information as 2-bits of information per base:

A is stored as 00C is stored as 01G is stored as 10T is stored as 10

-This allows for fast and efficient storage and processing, using less memory and less processor power

-However, the drawback is that only 4 characters are possible, and sometimes real DNA data has gaps and ambiguous characters

-Other search engines treat these ambiguities as separate letters, like R or N (N means it could be any base)

-SSAHA reverts all ambiguities to A, since AAAAA… is the most common ‘word’ in the genome, and the search can easily recognize it as uninformative

How Does a Hash Table Work?

Ssaha uses hash tables, which is what makes it fast. It’s like a lookup table: phonebook, or bookshelf, basically insert in a key (piece of info) and get info related to it. This article explains hash tables. Hash tables are 2 dimensional tables represented by keys and values. Hash tables take information, assigns a key to it, and puts it in a certain location. The location is determined by hash-code or hashing function or mapping function or randomization technique, or key-to-address transform. A major problem with hash tables is collisions, because there are often more identifiers than addresses. Two pieces of information share the same address in the hash table. Luckily, we have solutions to collision, which are: 1) make a list of all data that all have the same home address, and then when the address is looked up, look through the list to find all the synonyms. 2) if the first address is already occupied, generate a new address, and put the data there. 3) each address is a bucket, with a capacity to hold many items. 4) use a function that does not make repeats.Hash tables are inflexible, if you want to increase the size, all hash values must be recalculated.

SSAHA Step by Step

Make a flowchart here

Talk about the steps… fill the rest of this space.

Given a collection of strands, and a target string, re-arrange the strands in the collection so that they are in order based on how they compare to the target string.

The measurement of similarity is the percentage of base pairs of the shorter string that match up with the longer string. The shorter string(pattern string) is shifted left and right to find the best match, and the similarity is measured there. The percentage is calculated based on the length of the shorter string.

When two strings have the same percentage similarity, the longer string should come first.

When two strings have the same percentage similarity, and length, alphabetical order should take precedence.

Definition Class: StringMatcher Method: orderList Parameters: String, String[] Returns: String[] Method signature: String[] orderList(String target, String[] patterns) (be sure your method is public) Class public class StringMatcher { public String[] orderList(String target, String[] patterns) { // fill in code here } }

Constraints A strand’s similarity should be based on the percentage of matching base pairsThe pattern strands must be shorter or equal to the length of the target strandThe pattern strand should be allowed to find its best match at any point in the target strandIf two or more strands share the same percentage similarity, list in order of length If two or more strands share the same percentage similarity AND length, list in alphabetical order Examples TARGET STRING = “actgcataccata” PATTERN STRINGS = “actgca”, “act”, “gga”, “cgta” Should return: String[] containing “actgca”, “act”, “cgta”, “gga” TARGET STRING = “aaaaagggttcccatatatata” PATTERN STRINGS = “aata”, “gggttc”, “tta”, “cccat” Should return: String[] containing “gggttc”, “cccat”, “aata”, “tta”

Contact info?

T: 510.649.3001F: 510.649.0331TF: 1.866.649.3004E: [email protected]

OPTIONALLOGO HERE

OPTIONALLOGO HERE

A C T N N G C A0001100000100100