TRANSCRIPT
Woods Hole, Massachusetts
July 25, 2006, 7 to 10 PM
Marine Biological Laboratory, Workshop on Molecular Evolution
More data yields stronger analyses, if done carefully!
Mosaic ideas and evolutionary ‘importance.’
Multiple Sequence Alignment & Analysis thru GCG’s SeqLab
Steven M. Thompson
Florida State University School of Computational Science (SCS)
But first a prelude: My definitions
Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, organisms, and populations, to complete ecologies.
Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available online biological databases.
Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes.
Proteomics is a subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.
And a ‘way’ to think about it: The reverse biochemistry analogy
from a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round.
Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product, without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques.
The computer and molecular databases are an essential part of this process.
The exponential growth of molecular sequence databases & cpu power
Year   BasePairs      Sequences
1982   680338         606
1983   2274029        2427
1984   3368765        4175
1985   5204420        5700
1986   9615371        9978
1987   15514776       14584
1988   23800000       20579
1989   34762585       28791
1990   49179285       39533
1991   71947426       55627
1992   101008486      78608
1993   157152442      143492
1994   217102462      215273
1995   384939485      555694
1996   651972984      1021211
1997   1160300687     1765847
1998   2008761784     2837897
1999   3841163011     4864570
2000   11101066288    10106023
2001   15849921438    14976310
2002   28507990166    22318883
2003   36553368485    30968418
2004   44575745176    40604319
2005   56037734462    52016762
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Doubling time ~ 1 year!
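As a rough check on the doubling-time claim, the sketch below fits a pure exponential between two rows of the GenBank table above. The function name `doubling_time` is an illustrative choice, not part of any GenBank or GCG tool, and a two-point fit ignores year-to-year variation.

```python
import math

def doubling_time(year0, count0, year1, count1):
    """Years per doubling, assuming pure exponential growth between two points."""
    if count0 <= 0 or count1 <= count0:
        raise ValueError("counts must be positive and increasing")
    return (year1 - year0) * math.log(2) / math.log(count1 / count0)

# Endpoints copied from the table above (base pairs in GenBank).
bp_1982, bp_2005 = 680338, 56037734462
years = doubling_time(1982, bp_1982, 2005, bp_2005)
print(f"Base pairs doubled about every {years:.1f} years, 1982 to 2005")
```

Over the whole 1982 to 2005 span this comes out at somewhat more than one year per doubling; some individual stretches of the table (1999 to 2000, for example) grew faster.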
So what; why even bother?
Applications:
Probe/primer, and motif/profile design;
Graphical illustrations;
Comparative ‘homology’ inference;
Molecular evolutionary analysis.
OK, well, how do you do it?
Back to multiple sequence alignment: Applicability?
Dynamic programming’s complexity increases exponentially with the number of sequences being compared:
N-dimensional matrix . . . . complexity = [sequence length]^(number of sequences)
See MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only) on the multiple alignment page at the Baylor College of Medicine’s Search Launcher, http://searchlauncher.bcm.tmc.edu/, but note the severely limiting restrictions!
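To put numbers on the N-dimensional matrix, here is a small sketch that counts matrix cells for full dynamic programming. The function name and the sequence length of 300 are arbitrary illustrations, and real programs such as MSA prune this space, so treat the figures as an upper bound.

```python
def full_dp_cells(seq_len, n_seqs):
    """Cells in the N-dimensional matrix: [sequence length] ** (number of sequences)."""
    return seq_len ** n_seqs

# Illustrative only: equal-length sequences of 300 residues.
for n in range(2, 6):
    print(f"{n} sequences of length 300: {full_dp_cells(300, n):,} cells")
```

Each added sequence multiplies the work by the sequence length, which is why exhaustive alignment stops being practical after a handful of sequences.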
Multiple Sequence Dynamic Programming
‘Global’ heuristic solutions
Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time.
All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
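The first stage of the progressive approach, comparing all sequences pairwise and finding each one's most similar partner, can be sketched as follows. This is a minimal illustration, not PileUp's or ClustalW's actual code: the scoring values (match 1, mismatch -1, linear gap -2) and both function names are assumptions chosen for brevity.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]

def most_similar_partners(seqs):
    """Map each sequence name to the name of its best-scoring partner."""
    names = list(seqs)
    scores = {}
    for x, y in combinations(names, 2):
        scores[(x, y)] = scores[(y, x)] = nw_score(seqs[x], seqs[y])
    return {x: max((y for y in names if y != x), key=lambda y: scores[(x, y)])
            for x in names}

# Toy data, invented for the example.
toy = {"a": "ACGT", "b": "ACGA", "c": "TTTT"}
print(most_similar_partners(toy))
```

A real progressive aligner would go on to cluster these scores into a guide tree and align groups of sequences to each other; only the pairwise stage is shown here.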
Reliability and the Comparative Approach:
explicit homologous correspondence;
manual adjustments should be encouraged, based on knowledge, especially structural, regulatory, and functional sites.
Therefore, editors like SeqLab and the Ribosomal Database Project:
http://rdp.cme.msu.edu/index.jsp
Structural & Functional correspondence in the Wisconsin Package’s SeqLab
Work with proteins! If at all possible:
Twenty match symbols versus four, plus similarity! Way better signal to noise.
Also guarantees no indels are placed within codons. So translate, then align.
Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
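The "translate, then align" advice can be sketched as below: translate the DNA, align the proteins with whatever tool you prefer, then thread the original codons back onto the gapped protein so that every gap is a whole codon. Both function names are hypothetical, the standard genetic code is assumed, and the aligned protein string is typed in by hand rather than produced by an aligner.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate frame 1; unknown codons become X, a trailing partial codon is dropped."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

def thread_codons(aligned_protein, dna, gap_symbols="-.~"):
    """Rebuild a codon alignment: each protein gap becomes '---', each residue its codon."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    residues = sum(1 for ch in aligned_protein if ch not in gap_symbols)
    if residues != len(codons):
        raise ValueError("protein residues and codons do not match in number")
    remaining = iter(codons)
    return "".join("---" if ch in gap_symbols else next(remaining)
                   for ch in aligned_protein)

dna = "ATGGCCAAA"
print(translate(dna))               # protein to hand to the aligner
print(thread_codons("M-AK", dna))   # gapped protein mapped back onto codons
```

Because gaps are only ever inserted three bases at a time, the reading frame of the nucleotide alignment is preserved.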
Beware of aligning apples and oranges [and grapefruit]!
Paralogous versus orthologous;
genomic versus cDNA;
mature versus precursor.
Mask out uncertain areas.
Complications:
Order dependence. Not that big of a deal.
Substitution matrices and gap penalties. A very big deal!
Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity (GCG’s PileUp -InSitu option).
Complications cont.:
Format hassles!
Specialized format conversion tools such as GCG’s SeqConv+ program and PAUPSearch, and Don Gilbert’s public domain ReadSeq program.
Still more complications:
Indels and missing data symbols (i.e. gaps) designation discrepancy headaches:
., -, ~, ?, N, or X . . . . . Help!
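One way to tame the symbol discrepancy before moving an alignment between programs is to normalize it yourself. The sketch below is an assumption-laden illustration: the helper name is invented, and which symbols count as gaps versus missing data is program specific, so check each program's documentation. N and X are deliberately left alone because they are also legitimate ambiguity codes.

```python
def normalize_gaps(seq, gap_symbols=".~-", missing_symbols="?",
                   gap="-", missing="?"):
    """Rewrite every gap symbol as `gap` and every missing-data symbol as `missing`."""
    table = {ord(ch): gap for ch in gap_symbols}
    table.update({ord(ch): missing for ch in missing_symbols})
    return seq.translate(table)

print(normalize_gaps("AC..GT~~A?"))        # all gaps as '-'
print(normalize_gaps("AC?T", missing="N")) # missing data as 'N' for nucleotides
```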
Web resources for pairwise, progressive multiple alignment:
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
http://pbil.univ-lyon1.fr/alignment.html
http://www.ebi.ac.uk/clustalw/
http://searchlauncher.bcm.tmc.edu/
However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!
If large datasets become intractable for analysis on the Web, what other resources are available?
Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!
Therefore, UNIX server-based solutions
Public domain solutions also exist, but now a very cooperative systems manager needs to maintain everything for users, so, commercial products, e.g. the Accelrys GCG Wisconsin Package and the SeqLab Graphical User Interface, simplify matters for administrators and users. One format, one ‘look-and-feel.’
One license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!
Operating system: UNIX command line operation hassles; communications software (telnet, ssh, and terminal emulation); X graphics; file transfer (ftp, and scp/sftp); and editors (vi, emacs, pico, or desktop word processing followed by file transfer [save as "text only!"]). See my supplement pdf file.
The Genetics Computer Group —
The Accelrys Wisconsin Package for Sequence Analysis
GCG began in 1982 in Oliver Smithies’ Genetics Dept. lab at the University of Wisconsin, Madison; and then starting in 1990 it became a private company; which was acquired by the Oxford Molecular Group, U.K., in 1997; and then by Pharmacopeia Inc., U.S.A., in 2000; and then in 2004 Accelrys, San Diego, California, left Pharmacopeia to become an independent entity.
The suite contains around 150 programs designed to work in a “toolbox” fashion. Several simple programs used in succession can lead to very sophisticated results.
Also ‘internal compatibility,’ i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.
Used all over the world at over 950 institutions, so learning it will likely be useful at other research institutions as well.
To answer the always perplexing GCG question — “What sequence(s)? . . . .”
Specifying sequences, GCG style; in order of increasing power and complexity:
The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and SeqConv+ programs)
The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.
The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard, {*}.
Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, you can supply attribute information within list files to specify something special about the sequence such as begin and end constraints.
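The four kinds of specification can be told apart by their punctuation alone. The following sketch is not GCG code; `classify_spec` is a hypothetical helper that applies the conventions described above (leading “@,” trailing braces, a colon after a logical name) and assumes plain file names contain no colon. The file name `my.list` is made up for the example.

```python
import re

def classify_spec(spec):
    """Return (kind, target, selector) for a GCG-style sequence specification."""
    spec = spec.strip()
    if spec.startswith("@"):                      # list file
        return ("list", spec[1:], None)
    braces = re.fullmatch(r"(.+)\{(.*)\}", spec)  # MSF or RSF file plus {selector}
    if braces:
        return ("multiple", braces.group(1), braces.group(2))
    if ":" in spec:                               # database logical name, case insensitive
        logical_name, identifier = spec.split(":", 1)
        return ("database", logical_name.upper(), identifier)
    return ("file", spec, None)                   # single sequence file

for example in ["my-special.pep", "SwissProt:EfTu_Ecoli", "Ef1a-Tu.msf{*}", "@my.list"]:
    print(example, "->", classify_spec(example))
```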
!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.
example.seq Length: 77 July 21, 1999 09:30 Type: N Check: 4099 ..
1 ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
51 GATTTAATAG CATGCGATCC CATGGGA
‘Clean’ GCG format single sequence file after ‘reformat’ (or the SeqConv+ program)
SeqLab’s Editor mode can also “Import” native GenBank format and ABI or LI-COR trace files!
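For readers who need to pull a sequence out of such a file without GCG at hand, here is a minimal reader. It is a sketch, not GCG's own parser: the function names are invented, it assumes the first line ending in two periods is the checksum line, and it keeps only letters from the lines after it. The checksum routine is the position-weighted sum commonly used in reimplementations of the GCG checksum; for the example file above it gives the declared value of 4099.

```python
import re

def gcg_checksum(seq):
    """Position-weighted character sum (weights cycle 1..57), modulo 10000."""
    return sum(((i % 57) + 1) * ord(ch) for i, ch in enumerate(seq.upper())) % 10000

def read_gcg_single(text):
    """Return (sequence, declared_checksum) from GCG single sequence text."""
    lines = text.splitlines()
    for n, line in enumerate(lines):
        if line.rstrip().endswith(".."):   # divider between documentation and data
            break
    else:
        raise ValueError("no '..' divider line found")
    found = re.search(r"Check:\s*(\d+)", lines[n])
    declared = int(found.group(1)) if found else None
    # Drop position numbers and spacing; letters only (gap symbols would be lost).
    seq = "".join(ch for ch in "".join(lines[n + 1:]) if ch.isalpha())
    return seq, declared

EXAMPLE = """!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format.
example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""

sequence, declared = read_gcg_single(EXAMPLE)
print(len(sequence), declared, gcg_checksum(sequence))
```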
Logical terms for the Wisconsin Package

Sequence databases, nucleic acids:
GENBANKPLUS, GBP      all of GenBank plus EST, HTC & GSS subdivisions
GENBANK, GB           all of GenBank except EST, HTC & GSS subdivisions
BACTERIAL, BA         GenBank bacterial subdivision
EST                   GenBank EST (Expressed Sequence Tags) subdivision
GSS                   GenBank GSS (Genome Survey Sequences) subdivision
HTC                   GenBank High Throughput cDNA
HTG                   GenBank High Throughput Genomic
INVERTEBRATE, IN      GenBank invertebrate subdivision
OTHERMAMM, OM         GenBank other mammalian subdivision
OTHERVERT, OV         GenBank other vertebrate subdivision
PATENT, PAT           GenBank patent subdivision
PHAGE, PH             GenBank phage subdivision
PLANT, PL             GenBank plant subdivision
PRIMATE, PR           GenBank primate subdivision
RODENT, RO            GenBank rodent subdivision
STS                   GenBank STS (Sequence Tagged Sites) subdivision
SYNTHETIC, SY         GenBank synthetic subdivision
TAGS                  GenBank EST, HTC & GSS subdivisions
UNANNOTATED, UN       GenBank unannotated subdivision
VIRAL, VI             GenBank viral subdivision

Sequence databases, amino acids:
GENPEPT, GP                         GenBank CDS translations
UNIPROT, UNI, SWISSPROTPLUS, SWP    all of Swiss-Prot and all of SPTrEMBL
UNISPROT, SWISSPROT, SWISS, SW      all of Swiss-Prot (fully annotated)
UNITREMBL, SPTREMBL, SPT            Swiss-Prot preliminary EMBL translations
PIR, P                              all of PIR Protein
PIR1                                PIR fully annotated subdivision
PIR2                                PIR preliminary subdivision
PIR3                                PIR unverified subdivision
PIR4                                PIR unencoded subdivision
Note: not all GCG installations support the PIR database

General data files:
GENMOREDATA           path to GCG optional data files
GENRUNDATA            path to GCG default data files

These are easy: they make sense and you’ll have a vested interest.
GCG MSF & RSF format
The trick is to not forget the braces and ‘wild card,’ e.g. filename{*}, when specifying!

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments ////////////////////////////////////////////////////////////
This is SeqLab’s native format

!!AA_MULTIPLE_ALIGNMENT 1.0
small.pfs.msf MSF: 735 Type: P July 20, 2001 14:53 Check: 6619 ..
Name: a49171 Len: 425 Check: 537 Weight: 1.00
Name: e70827 Len: 577 Check: 21 Weight: 1.00
Name: g83052 Len: 718 Check: 9535 Weight: 1.00
Name: f70556 Len: 534 Check: 3494 Weight: 1.00
Name: t17237 Len: 229 Check: 9552 Weight: 1.00
Name: s65758 Len: 735 Check: 111 Weight: 1.00
Name: a46241 Len: 274 Check: 3514 Weight: 1.00
//
//////////////////////////////////////////////////
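The `Name:` lines of an MSF header are regular enough to read with one pattern. The sketch below is an informal reader, not a full MSF parser: the function name and the returned dictionary keys are assumptions, and it reads only the header fields shown in the example above.

```python
import re

NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)

def parse_msf_names(header_text):
    """Return one dict per 'Name:' entry found in MSF header text."""
    return [
        {"name": name, "len": int(length), "check": int(check), "weight": float(weight)}
        for name, length, check, weight in NAME_LINE.findall(header_text)
    ]

# Two entries copied from the example header above.
header = """Name: a49171 Len: 425 Check: 537 Weight: 1.00
Name: s65758 Len: 735 Check: 111 Weight: 1.00"""

for entry in parse_msf_names(header):
    print(entry)
```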
The List File Format
!!SEQUENCE_LIST 1.0
An example GCG list file of many elongation
1a and Tu factors follows. As with all GCG
data files, two periods separate
documentation from data. ..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@[email protected]
The ‘way’ SeqLab works!
remember the @ sign!
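A list file like the one above is simple to read programmatically: skip the documentation up to the line ending in two periods, then take one specification per line with optional `key:value` attributes after it. This is a sketch under those assumptions only; `parse_list_file` is a hypothetical helper, and any attributes beyond `begin` and `end` are passed through untouched.

```python
def parse_list_file(text):
    """Return [(specification, attributes)] for each entry after the '..' divider."""
    lines = text.splitlines()
    start = next((n + 1 for n, line in enumerate(lines)
                  if line.rstrip().endswith("..")), None)
    if start is None:
        raise ValueError("no '..' divider line found")
    entries = []
    for line in lines[start:]:
        tokens = line.split()
        if not tokens:
            continue
        attributes = {}
        for token in tokens[1:]:          # e.g. begin:24 end:134
            if ":" in token:
                key, value = token.split(":", 1)
                attributes[key] = int(value) if value.isdigit() else value
        entries.append((tokens[0], attributes))
    return entries

LIST_TEXT = """!!SEQUENCE_LIST 1.0
An example list file; two periods separate documentation from data. ..

my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
"""

for spec, attrs in parse_list_file(LIST_TEXT):
    print(spec, attrs)
```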
SeqLab — GCG’s X-based GUI!
SeqLab is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface:
GDE + WPI = SeqLab
Requires an X-Windowing environment, either native on UNIX computers (including LINUX, but not installed by default on Mac OS X [v.10+] systems; however, see Apple’s free X11 package or XDarwin), or emulated with X-Server Software on personal computers.
FOR MORE INFO...
Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html.
Contact me (stevet@[email protected]) for specific long-distance bioinformatics assistance and collaboration.
Conclusions:
Gunnar von Heijne in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”
He continues:
“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”
Many texts are now available in the field. To ‘honk-my-own-horn’ a bit, check out:
Current Protocols in Bioinformatics from John Wiley & Sons, Inc. (http://www.does.org/cp/bioinfo.html);
and Horizon Scientific Press’ Computational Genomics: Theory and Application (http://www.horizonpress.com/hsp/books/com.html).
AND FOR EVEN MORE INFO...
Humana Press’ Introduction to Bioinformatics: A Theoretical And Practical Approach (http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0);
They all asked me to contribute chapters on multiple sequence alignment and analysis using GCG software.
On to a demonstration of some of SeqLab’s multiple sequence dataset capabilities:
some of my prebuilt alignments, and . . .
Elongation Factor 1α/Tu, how to do it.