blast programming thomas madden ncbi [email protected] january 28, 2002
Post on 19-Dec-2015
243 views
TRANSCRIPT
BLAST Programming
Thomas Madden
NCBI
January 28, 2002
O'Reilly Bioinformatics Technology - BLAST Programming
2
The BLAST algorithm
O'Reilly Bioinformatics Technology - BLAST Programming
3
What is BLAST?
• Basic Local Alignment Search Tool
• Calculates similarity for biological sequences.
• Produces local alignments: only a portion of each sequence must be aligned.
• Uses statistical theory to determine if a match might have occurred by chance.
O'Reilly Bioinformatics Technology - BLAST Programming
4
BLAST is a heuristic.
• A lookup table is made of all the “words” (short subsequences) and “neighboring” words in the query sequence.
• The database is scanned for matching words (“hot spots”).
• Gapped and un-gapped extensions are initiated from these matches.
O'Reilly Bioinformatics Technology - BLAST Programming
5
O'Reilly Bioinformatics Technology - BLAST Programming
6
BLAST OUTPUT
O'Reilly Bioinformatics Technology - BLAST Programming
7
There are many different BLAST output formats.
• Pair-wise report
• Query-anchored report
• Hit-table
• Tax BLAST
• Abstract Syntax Notation 1
• XML
O'Reilly Bioinformatics Technology - BLAST Programming
8
BLAST reports at the NCBI Web page.
O'Reilly Bioinformatics Technology - BLAST Programming
9
Formatting Page
O'Reilly Bioinformatics Technology - BLAST Programming
10
Graphical Overview
O'Reilly Bioinformatics Technology - BLAST Programming
11
One-line descriptions
O'Reilly Bioinformatics Technology - BLAST Programming
12
Pair-wise alignments
O'Reilly Bioinformatics Technology - BLAST Programming
13
Query-anchored alignments
O'Reilly Bioinformatics Technology - BLAST Programming
14
Future improvements: LinkOut, taxonomic and structure links.
Link toLocus-link
Link toUniGene Link to taxonomy
O'Reilly Bioinformatics Technology - BLAST Programming
15
BLAST report designed for human readability.
• One-line descriptions provide overview designed for human “browsing”.
• Redundant information is presented in the report (e.g., one-line descriptions and alignments both contain expect values, scores, descriptions) so a user does not need to move back and forth between sections.
• HTML version has lots of links for a user to explore.• It can change as new features/information becomes
available.
O'Reilly Bioinformatics Technology - BLAST Programming
16
Hit-table
• Contains no sequence or definition lines, but does contain sequence identifiers, starts/stops (one-offset), percent identity of match as well as expect value etc.
• Simple format is ideal for automated tasks such as screening of sequence for contamination or sequence assembly.
O'Reilly Bioinformatics Technology - BLAST Programming
17
There are drawbacks to parsing the BLAST report and Hit-table.
• No way to automatically check for truncated output.
• No way to rigorously check for syntax changes in the output.
O'Reilly Bioinformatics Technology - BLAST Programming
18
Structured output allows automatic and rigorous checks for syntax errors and changes.
O'Reilly Bioinformatics Technology - BLAST Programming
19
Abstract Syntax Notation 1 (ASN.1)
• Is an International Standards Organization (ISO) standard for describing structured data and reliably encoding it.
• Used extensively in the telecommunications industry.
• Both a binary and a text format.• NCBI data model is written in ASN.1.• Asntool can produce C object loaders from an
ASN.1 specification.
O'Reilly Bioinformatics Technology - BLAST Programming
20
ASN.1 is used for the NCBI BLAST Web page.
server
ASN.1
Return formatted results
BLAST DB
Fetch ASN.1 Fetch sequence
Request results
O'Reilly Bioinformatics Technology - BLAST Programming
21
Different reports can be produced from the ASN.1 of one search.
O'Reilly Bioinformatics Technology - BLAST Programming
22
ASN.1
TaxBlastreport
Query-anchoredBLAST report
Pair-wiseBLAST report
HTML
text
Hit-table
XML
HTML
text
O'Reilly Bioinformatics Technology - BLAST Programming
23
The BLAST ASN.1 (“SeqAlign”) contains:
• Start, stop, and gap information (zero-offset).
• Score, bit-score, expect-value.
• Sequence identifiers.• Strand information.
O'Reilly Bioinformatics Technology - BLAST Programming
24
Three flavors of Seq-Align,Score-block(s) plus one of:
• Dense-diag: series of unconnected diagonals. No coordinate “stretching” (e.g., cannot be used for protein-nucl. alignments). Used for ungapped BLASTN/BLASTP.
• Dense-seg: describes an alignment containing many segments. No coordinate “stretching”. Used for gapped BLASTN/BLASTP.
• Std-seg: a collection of locations. No restriction on stretching of coordinates. Used for gapped/ungapped translating searches. Generic.
O'Reilly Bioinformatics Technology - BLAST Programming
25
Score Block
Score ::= SEQUENCE { id Object-id OPTIONAL , -- identifies Score type value CHOICE { -- actual value real REAL , -- floating point value int INTEGER } } -- integer
SEQUENCE is an ordered list of elements,each of which is an ASN.1 type.
Required unless DEFAULT or OPTIONAL
O'Reilly Bioinformatics Technology - BLAST Programming
26
Score Block example
2.45905555x10-9
38.1576692
O'Reilly Bioinformatics Technology - BLAST Programming
27
Dense-seg definition
Dense-seg ::= SEQUENCE { -- for (multiway) global or partial alignments
dim INTEGER DEFAULT 2 , -- dimensionality
numseg INTEGER , -- number of segments here ids SEQUENCE OF Seq-id , -- sequences in order starts SEQUENCE OF INTEGER , -- start OFFSETS in ids order within segs lens SEQUENCE OF INTEGER , -- lengths in ids order within segs strands SEQUENCE OF Na-strand OPTIONAL , scores SEQUENCE OF Score OPTIONAL } -- score for each seg
SEQUENCE OF is an ordered list of
the same type of element.
O'Reilly Bioinformatics Technology - BLAST Programming
28
Dense-seg example
O'Reilly Bioinformatics Technology - BLAST Programming
29
Std-seg definition
Std-seg ::= SEQUENCE { dim INTEGER DEFAULT 2 , -- dimensionality ids SEQUENCE OF Seq-id OPTIONAL , -- sequences identifiers loc SEQUENCE OF Seq-loc , -- locations in ids order scores SET OF Score OPTIONAL } -- score for each segment
SET is an unordered list of elements, each of which is an ASN.1 type. Required unless DEFAULT or OPTIONAL.
SET OF is an unordered list of the same type of element.
O'Reilly Bioinformatics Technology - BLAST Programming
30
Std-seg example
O'Reilly Bioinformatics Technology - BLAST Programming
31
Demo program (“blreplay”) to reproduce BLAST results from
ASN.1
• Start/stops and identifiers read in from ASN.1 (SeqAlign).
• Sequences and definition lines fetched from BLAST databases.
O'Reilly Bioinformatics Technology - BLAST Programming
32
Asntool can produce XML from ASN.1
• Really a transliteration, not a new specification
• A Document Type Definition (DTD) can also be produced.
O'Reilly Bioinformatics Technology - BLAST Programming
33
ASN.1 and XML validation differences.
• XML can be “well-formed” (does not break any XML syntax rules) or “validated” (checked against a DTD).
• ASN.1 must always be valid (checked against a specification).
O'Reilly Bioinformatics Technology - BLAST Programming
34
Special purpose XML
• NCBI specification does not fit the needs of some users (the sequence is not provided in the SeqAlign, when fetched the sequence is packed 2/4 bp’s per byte).
• Possible to produce XML with more/less information or in a different format.
• First done as an ASN.1 specification, which is then dumped as XML.
O'Reilly Bioinformatics Technology - BLAST Programming
35
BLAST XML designed to be self-contained.
• Query sequence, database sequence, etc.
• Sequence definition lines.
• Start, stop, etc. (one-offset).
• Scores, expect values, % identity etc.
• Produced by BLAST binaries and on NCBI Web page.
O'Reilly Bioinformatics Technology - BLAST Programming
36
Overview of the BLAST XML
<!ELEMENT BlastOutput ( BlastOutput_program , BLAST program, e.g., blastp, etc BlastOutput_version , version of BLAST engine (e.g., 2.1.2) BlastOutput_reference , Reference about algorithm BlastOutput_db , Database(s) searched BlastOutput_query-ID , query identifier BlastOutput_query-def , query definition BlastOutput_query-len , query length BlastOutput_query-seq? , query sequence BlastOutput_param , BLAST search parameters BlastOutput_iterations BLAST results for each iteration/run)>
O'Reilly Bioinformatics Technology - BLAST Programming
37
<!ELEMENT BlastOutput (
BlastOutput_program ,
BlastOutput_version ,
BlastOutput_reference ,
BlastOutput_db ,
BlastOutput_query-ID ,
BlastOutput_query-def ,
BlastOutput_query-len ,
BlastOutput_query-seq? ,
BlastOutput_param ,
BlastOutput_iterations )>
<!ELEMENT Iteration ( Iteration_iter-num , Iteration number (one for non PSI-BLAST) Iteration_hits? , Hits (one for each database sequence) Iteration_stat? , Search statistics Iteration_message? Error messages )>
<!ELEMENT BlastOutput_iterations ( Iteration+ )>
O'Reilly Bioinformatics Technology - BLAST Programming
38
<!ELEMENT Iteration ( Iteration_iter-num , Iteration_hits? , Iteration_stat? , Iteration_message? )>
<!ELEMENT Hit ( Hit_num , ordinal number of the hit, one-offset (e.g., "1, 2..."). Hit_id , ID of db sequence (e.g., "gi|7297267|gb|AAF52530.1|") Hit_def , definition of the db sequence Hit_accession , accession of the db sequence (e.g., "AAF57408") Hit_len , length of the database sequence Hit_hsps? describes individual alignments)>
<!ELEMENT Iteration_hits ( Hit* )>
O'Reilly Bioinformatics Technology - BLAST Programming
39
<!ELEMENT Hit ( Hit_num , Hit_id , Hit_def , Hit_accession , Hit_len , Hit_hsps? )>
<!ELEMENT Hsp ( Hsp_num , ordinal number of the HSP, one-offset Hsp_bit-score , score (in bits) of the HSP Hsp_score , raw score of the HSP Hsp_evalue , expect value of the HSP Hsp_query-from , query offset at alignment start (one-offset) Hsp_query-to , query offset at alignment end (one-offset) Hsp_hit-from , db offset at alignment start (one-offset) Hsp_hit-to , db offset at alignment end (one-offset) Hsp_pattern-from? , start of phi-blast pattern on query (one-offset) Hsp_pattern-to? , end of phi-blast pattern on query (one-offset) Hsp_query-frame? , query frame (if applicable) Hsp_hit-frame? , db frame (if applicable) Hsp_identity? , number of identities in the alignment Hsp_positive? , number of positives in the alignment Hsp_gaps? , number of gaps in the alignment Hsp_density? , score density Hsp_qseq , alignment string for the query Hsp_hseq , alignment string for the database Hsp_midline? )> middle line as normally seen in BLAST report
<!ELEMENT Hit_hsps ( Hsp* )>
O'Reilly Bioinformatics Technology - BLAST Programming
40
Parsing BLAST XML with Expat.
• Expat is a popular free-ware used for parsing XML.
• Non-validating.• Simple C (demo)
program to parse BLAST output.
O'Reilly Bioinformatics Technology - BLAST Programming
41
Output sizes for a BLASTP search of gi|178628 vs. nr.
• Hit-table: 16 kb
• Binary ASN.1 (SeqAlign): 35 kb
• Text ASN.1 (SeqAlign): 144 kb
• XML (SeqAlign): 392 kb
• XML: 288 kb
• BLAST report (text): 232 kb
• BLAST report (html): 272 kb
O'Reilly Bioinformatics Technology - BLAST Programming
42
Specification (i.e., “data model”) issues should not be confused
with the question about whether to use ASN.1 or XML.
O'Reilly Bioinformatics Technology - BLAST Programming
43
Structured output is not a panacea.
• Design issues must still be addressed.
• Semantic issues still exist, e.g. is a start/stop value zero-offset or one-offset.
• Data issues still exist, e.g., is the correct sequence shown, are the offsets correct, was the DNA translated with the correct genetic code?
O'Reilly Bioinformatics Technology - BLAST Programming
44
Overview of BLAST code.
O'Reilly Bioinformatics Technology - BLAST Programming
45
NCBI toolkit
• Has many low-level functions to make it platform independent; supported under LINUX, many flavors of UNIX, NT, and MacOS.
• Contains portable types such as Int2, Int4, FloatHi.• Developer should write a “Main” function that is
called by a toolkit “main”.• Contains the BLAST code in the “tools” library.• A C++ toolkit is now being developed.
O'Reilly Bioinformatics Technology - BLAST Programming
46
BLAST code has a modular design.
• API for retrieval from databases independent of the compute engine.
• Compute engine independent of formatter.
O'Reilly Bioinformatics Technology - BLAST Programming
47
Readdb API can be used to easily extract information from the
BLAST databases.
• Date produced.
• Title of database.
• Number of letters, number of sequences, longest sequence.
• Sequence and description of an entry.
• Function prototypes in readdb.h.
O'Reilly Bioinformatics Technology - BLAST Programming
48
Dump a BLAST record in FASTA format (db2fasta.c):
Get or display command-line arguments
“Main” is called by “main” in the toolkit.
Allocate an object for reading the database
Get the ordinal number (zero-offset) of therecord given a ‘FASTA’ identifier
(e.g., “gb|AAH06766.1|AAH0676”).
Fetch the Bioseq (contains sequence, description, and identifiers) for this record
Dump the sequence as FASTA.
O'Reilly Bioinformatics Technology - BLAST Programming
49
Only a few function calls are needed to perform a BLAST
search (doblast.c):
Perform a BLAST search of the BioseqPtr query_bsp. The BioseqPtr could have been obtained from the BLAST databases, Entrez or from FASTA using
the function call FastaToSeqEntry
Allocate a BLASTOptionsBlk with default values for the specified program (e.g., “blastp”),
the boolean argument specifies a gapped search
Set the expect value cutoff to a non-default value.
O'Reilly Bioinformatics Technology - BLAST Programming
50
BlastOptionNewBLAST_OptionsBlkPtr BLASTOptionNew (CharPtr progname, Boolean gapped)
CharPtr progname: name of program. Legal values are blastp, blastn, blastx, tblastn, and tblastx.
Boolean gapped: if TRUE gapped parameters are set, if FALSE ungapped.
Non-default values may be specified by changing elements of the allocated structure (typedef in blastdef.h). The most often changed elements (options) are:
Nlm_FloatHi expect_value Expect value cutoffInt2 wordsize Number of letters used in making words for lookup table.Int2 penalty Penalty for a mismatch (only BLASTN and MegaBLAST)Int2 reward Reward for a match (only BLASTN and MegaBLASTCharPtr matrix Matrix used for comparison (not BLASTN or MegaBLAST)Int4 gap_open Cost for gap existenceInt4 gap_extend Cost to extend a gap one more letter (including first).CharPtr filter_string Filtering options (e.g., “L”, “mL”)Int4 hitlist_size Number of database sequences to save hits for.Int2 number_of_cpus Number of CPU’s to use.
O'Reilly Bioinformatics Technology - BLAST Programming
51
BioseqBlastEngineSeqAlignPtr BioseqBlastEngine (BioseqPtr bsp, CharPtr progname, CharPtr database,
BLAST_OptionsBlkPtr options, ValNodePtr *other_returns, ValNodePtr *error_returnsint (LIBCALLBACK *callback) (Int4 done, Int4 positives))
BioseqPtr bsp: contains the query sequence, identifier, and definition line. CharPtr progname: name of program (one of blastp, blastn, blastx, tblastn, or tblastx).CharPtr database: name (and path) to BLAST database(s). Multiple databases to be searched should
be separated by a space (e.g., “nt est”).BLAST_OptionsBlkPtr options: BLAST option structure obtained from BLASTOptionNew.
If NULL default values will be used.ValNodePtr *other_returns: a linked list of ValNodePtr’s, each one containing information about things
like the database(s) searched, the Karlin-Altschul parameters, the region of query masked.See blastall.c to see how to use this information.May be set to NULL.
ValNodePtr *error_returns: a linked list of error messages, these may be printed with a call toBlastErrorPrint(error_returns). May be set to NULL.
int (LIBCALLBACK *callback) (Int4 done, Int4 positives): callback function to mark progress throughthe database. May be set to NULL.
O'Reilly Bioinformatics Technology - BLAST Programming
52
What can I do with the SeqAlignPtr?
SeqAlignId gets the (C-structure) identifier for the first (zeroth) sequence.
SeqIdWrite formats the information in “query_id” into a FASTA identifier (e.g., “gi|129295”) and
places it into query_id_buf.
SeqAlignStop returns the end values (zero-offset)for the first and second sequences.
SeqAlignStart returns the start value (zero-offset)for the first and second sequences.
O'Reilly Bioinformatics Technology - BLAST Programming
53
MySeqAlignPrint output for a search of gi|129295 vs. ecoli
O'Reilly Bioinformatics Technology - BLAST Programming
54
Notes on Traditional BLAST printing.
A call to the fetch function ReadDBBioseqFetchEnable ("blastall", blast_database, db_is_na, TRUE);
tells the formatter where to obtain sequences. Entrez or a network connection to theBLAST server could have been used.
The one-line descriptions are printed byPrintDefLinesFromSeqAlignEx2(seqalign, 80, outfp, print_options,
FIRST_PASS, NULL, number_of_descriptions, NULL, NULL);
The pair-wise alignments are printed byShowTextAlignFromAnnot(seqannot, 60, outfp, NULL, NULL, align_options,
txmatrix, mask_loc, FormatScoreFunc);
Look at blreplay.c and blastall.c to see details of how these are called.
O'Reilly Bioinformatics Technology - BLAST Programming
55
Resources• BLAST Home page:
http://www.ncbi.nlm.nih.gov/BLAST/
• NCBI Information Engineering Branch home page: http://www.ncbi.nlm.nih.gov/IEB/
• Demonstration programs (parsing XML with EXPAT, blreplay.c, doblast.c, db2fasta.c): ftp://ftp.ncbi.nih.gov/blast/demo
O'Reilly Bioinformatics Technology - BLAST Programming
56
ASN.1 RESOURCES
• The Open Book : A Practical Perspective on OSIby Marshall T. Rose (Prentice Hall).
• OSS Nokalva Web site: http://www.oss.com/asn1/overview.html
• NCBI toolkit documentation on ASN.1:
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/ASNLIB.HTML
O'Reilly Bioinformatics Technology - BLAST Programming
57
Email addresses
• General questions about running BLAST: [email protected]
• Questions about compiling the toolkit and requests for hard-copy of documentation: [email protected]