InterPro and InterProScan 5.0
DESCRIPTION
Event: Plant and Animal Genomes conference 2012
Speaker: Sandra Orchard
InterPro is an open-source protein resource used for the automatic annotation of proteins. It scales to the analysis of entire new genomes through a downloadable version of InterProScan, which can be incorporated into an existing local pipeline. InterPro integrates protein signatures from 11 major signature databases (CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY, and TIGRFAMs) into a single resource, taking advantage of the different areas of specialization of each to provide protein classification on multiple levels: protein families, structural superfamilies and functionally close subfamilies, as well as functional domains, repeats and important sites. The InterPro website has been improved following extensive community consultation, and a new version of InterProScan promises improved speed and ease of implementation, as well as additional functionality.
TRANSCRIPT
![Page 1: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/1.jpg)
EBI is an Outstation of the European Molecular Biology Laboratory.
InterPro and InterProScan 5.0
![Page 2: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/2.jpg)
http://www.ebi.ac.uk/interpro
InterPro
• A database that groups predictive protein signatures together
• 11 member databases
• Single searchable resource
• Provides functional analysis of proteins by classifying them into families and predicting domains and important sites
• Enables whole-genome analysis
![Page 3: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/3.jpg)
InterPro Consortium
Consortium of 11 major signature databases
![Page 4: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/4.jpg)
Protein signatures
• More sensitive homology searches
• Each member database creates signatures using different methods and methodologies:
– manually created sequence alignments
– automatic processes with some human input and correction
– entirely automatic processes
![Page 5: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/5.jpg)
Why do we need predictive annotation tools?
[Figure: number of sequences in UniProtKB and UniProtKB/Swiss-Prot plotted against date; y-axis: number of sequences, 0 to 14,000,000]
![Page 6: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/6.jpg)
What are protein signatures?
Multiple sequence alignment
Protein family/domain
Build model
Search
Mature model
ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
UniProt
Significant match
Protein analysis
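The alignment-to-model-to-search cycle on this slide can be illustrated with a deliberately simple sketch. It assumes a toy "signature" made of one character class per alignment column, loosely in the spirit of a pattern; the member databases use far richer models (HMMs, profiles, fingerprints), and the function name `build_pattern` is hypothetical.

```python
import re


def build_pattern(aligned):
    """Build a regex with one character class per alignment column.

    This is a toy stand-in for "build model": it only records which
    residues were seen in each column of an ungapped alignment.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    parts = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("[" + "".join(residues) + "]")
    return "".join(parts)


# The two aligned fragments shown on the slide.
alignment = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
]

pattern = build_pattern(alignment)
signature = re.compile(pattern)

# "Search": scan a made-up query sequence for a significant match.
query = "MM" + "ATWRGPVCGSDGVTYANECELL" + "QQ"
match = signature.search(query)
if match:
    print("match at", match.start() + 1, "-", match.end())
```

A new sequence that mixes residues already seen in each column still matches, which is the sense in which a signature generalises beyond the sequences it was built from.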
![Page 7: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/7.jpg)
Member databases
Methods: Hidden Markov Models, Fingerprints, Profiles, Patterns, Sequence Clusters
Structural Domains
Functional annotation of families/domains
Prediction of conserved domains
Protein features (active sites…)
![Page 8: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/8.jpg)
InterPro entry
![Page 9: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/9.jpg)
InterPro entry
![Page 10: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/10.jpg)
The InterPro entry: types
Family: Proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure
Domain: Distinct functional, structural or sequence units that may exist in a variety of biological contexts
Repeats: Short sequences typically repeated within a protein
Sites: PTM, Active Site, Binding Site, Conserved Site
![Page 11: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/11.jpg)
InterPro Entry
Adds extensive annotation
Links to other databases
Structural information and viewers
Groups similar signatures together
Quality control
Removes redundancy
![Page 12: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/12.jpg)
InterPro Entry
Hierarchical classification
![Page 13: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/13.jpg)
InterPro hierarchies: Families
FAMILIES can have parent/child relationships with other families
Parent/child relationships are based on:
• Comparison of protein hits
– child should be a subset of parent
– siblings should not have matches in common
• Existing hierarchies in member databases
• Biological knowledge of curators
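The two protein-hit rules above (child is a subset of parent, siblings share no matches) can be checked with plain set operations. The sketch below is only an illustration of those two rules, not InterPro's curation tooling; the function name, entry names and protein identifiers are all hypothetical.

```python
def check_hierarchy(parent_hits, children_hits):
    """Return a list of rule violations for one parent entry.

    parent_hits   -- set of protein identifiers matched by the parent
    children_hits -- dict mapping child entry name -> set of identifiers
    """
    problems = []

    # Rule 1: every child should be a subset of the parent.
    for name, hits in sorted(children_hits.items()):
        extra = hits - parent_hits
        if extra:
            problems.append(
                f"{name} matches proteins outside the parent: {sorted(extra)}"
            )

    # Rule 2: siblings should not have matches in common.
    names = sorted(children_hits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = children_hits[first] & children_hits[second]
            if shared:
                problems.append(
                    f"{first} and {second} share matches: {sorted(shared)}"
                )
    return problems


# Hypothetical hit sets: childA breaks both rules.
parent = {"P1", "P2", "P3"}
children = {"childA": {"P1", "P9"}, "childB": {"P1", "P2"}}
for problem in check_hierarchy(parent, children):
    print(problem)
```

In practice the slide lists two further inputs (existing member-database hierarchies and curator knowledge), so a check like this would flag candidates for review rather than decide relationships on its own.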
![Page 14: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/14.jpg)
InterPro hierarchies: Domains
DOMAINS can have parent/child relationships with other domains
![Page 15: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/15.jpg)
Domains and Families may be linked through Domain Organisation Hierarchy
![Page 16: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/16.jpg)
![Page 17: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/17.jpg)
InterPro Entry
The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics
![Page 18: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/18.jpg)
InterPro Entry
UniProt
KEGG ... Reactome ... IntAct ...
UniProt taxonomy
PANDIT ... MEROPS ... Pfam clans ...
PubMed
![Page 19: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/19.jpg)
InterPro Entry
PDB 3-D Structures
SCOP Structural domains
CATH Structural domain classification
![Page 20: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/20.jpg)
InterProScan
Protein Sequence
Predictive Models
Analysis algorithm
“Raw” Matches
Filtering algorithm
Reported Matches
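The flow on this slide, from raw matches through a filtering step to reported matches, can be sketched as follows. This is a minimal illustration that assumes filtering is a single E-value cutoff; the real post-processing differs per member database, and the `RawMatch` class, the `max_evalue` parameter and the `SIG_WEAK` signature are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawMatch:
    """One unfiltered hit of a predictive model against a protein."""
    signature: str
    start: int
    end: int
    evalue: float


def filter_matches(raw_matches, max_evalue=1e-5):
    """Keep raw matches at or below the cutoff, ordered by position."""
    kept = [m for m in raw_matches if m.evalue <= max_evalue]
    return sorted(kept, key=lambda m: (m.start, m.end))


raw = [
    RawMatch("SIG_WEAK", 60, 70, 0.5),        # hypothetical weak hit
    RawMatch("PF00085", 9, 112, 1.3e-28),     # values from the TSV example
]
reported = filter_matches(raw)
for m in reported:
    print(m.signature, m.start, m.end, m.evalue)
```

Keeping the analysis and filtering steps as separate functions mirrors the slide's two-stage layout and makes it easy to swap in a different filter per signature library.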
![Page 21: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/21.jpg)
InterProScan access
Interactive: http://www.ebi.ac.uk/Tools/pfa/iprscan/
Web service (SOAP and REST):
http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_rest
http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_soap
Downloadable: ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/
![Page 22: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/22.jpg)
Why redesign InterProScan?
• InterProScan 4
– complicated installation
– complicated update
– limited queuing system (only guaranteed with LSF)
– limited configurability
– reliability
![Page 23: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/23.jpg)
InterProScan 5.0 aims
• Easy install and configuration
• Modular
• Expandable
• Easily integrated into existing pipelines
• Incorporate new data model / XML exchange format
• Easy to port on to different architectures:
– Desktop machine
– Simple LAN
– LSF
– PBS
– Sun Grid Engine
– ...cloud? GRID?
• Reliability
![Page 24: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/24.jpg)
InterProScan 5 Technology
![Page 25: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/25.jpg)
Architecture
Data stores: Oracle, PostgreSQL, HSQLDB, file system
Data Model
Database Access, File I/O
Business Logic: performing analyses
Job Management: scheduling analyses
JMS: monitoring queues
XML
Cluster platform
Web services
Java API
InterPro website
One-way dependencies + replaceable layers = low coupling + maintainable
![Page 26: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/26.jpg)
Java Message Service (JMS)
“Master”: schedules tasks and sub-tasks, and places them on the queue
Broker: manages queues and topics
“Worker”: performs a task or sub-task and reports back to the Broker
Broker starts workers on demand
Workers take tasks off queues
Monitoring & Management Application: web or stand-alone app to monitor and manage InterProScan
• Simple and robust programming model
• Mature and stable standard – current JMS version released in 2002
• Guaranteed message delivery to a single worker
• Easy to monitor
• Flexible – easy to implement on multiple platforms
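The master, broker and worker roles on this slide can be mimicked in a few lines with Python's standard library. This is an analogy only: it uses an in-process `queue.Queue` in place of a JMS broker, threads in place of worker processes, and a placeholder "analysis" (sequence length); none of the names below come from InterProScan.

```python
import queue
import threading


def run_jobs(tasks, n_workers=3):
    """Master: put tasks on a queue, let workers process and report back."""
    task_queue = queue.Queue()     # stands in for the broker's task queue
    result_queue = queue.Queue()   # workers report results here

    def worker(worker_id):
        while True:
            task = task_queue.get()    # each task goes to exactly one worker
            if task is None:           # sentinel: no more work
                task_queue.task_done()
                break
            # Placeholder analysis: report the sequence length.
            result_queue.put((task, worker_id, len(task)))
            task_queue.task_done()

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(n_workers)
    ]
    for thread in threads:
        thread.start()

    for task in tasks:                 # master schedules the tasks
        task_queue.put(task)
    for _ in threads:                  # one stop signal per worker
        task_queue.put(None)

    task_queue.join()                  # wait until every item is handled
    for thread in threads:
        thread.join()

    results = []
    while not result_queue.empty():
        results.append(result_queue.get())
    return results


results = run_jobs(["MKV", "ACDEFG", "W"])
for task, worker_id, length in sorted(results):
    print(task, "handled by worker", worker_id, "length", length)
```

Because a queue hands each item to a single consumer, the sketch shows the same property the slide highlights for JMS queues: a task is delivered to one worker, and the master does not need to know which one.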
![Page 27: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/27.jpg)
Beta release functionality
![Page 28: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/28.jpg)
Installation
• Requirements
– Java 1.6
– Linux
– Perl
• Installation process
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/i5-dist.tar.gz
tar -xzf i5-dist.tar.gz
– ready to use
![Page 29: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/29.jpg)
Default tab-separated values output
./interproscan.sh -i test_proteins.fasta -o test_proteins.tsv --goterms
A2YIW7 f927b0d241297dcc9a1c5990b58bf3c4 122 Pfam PF00085 Thioredoxin 9 112 1.3E-28 T 08-07-2011 IPR013766 Thioredoxin domain Biological Process:cell redox homeostasis (GO:0045454)
A2YIW7 f927b0d241297dcc9a1c5990b58bf3c4 122 ProSitePatterns PS00194 Thioredoxin family active site. 32 50 - T 08-07-2011 IPR017937 Thioredoxin, conserved site Biological Process:cell redox homeostasis (GO:0045454)
A2YIW7 f927b0d241297dcc9a1c5990b58bf3c4 122 PIRSF PIRSF000077 null 4 113 1.50000307E-27 T 08-07-2011 IPR005746 Thioredoxin Molecular Function:protein disulfide oxidoreductase activity (GO:0015035), Biological Process:glycerol ether metabolic process (GO:0006662), Biological Process:cell redox homeostasis (GO:0045454), Molecular Function:electron carrier activity (GO:0009055)
A2YIW7 f927b0d241297dcc9a1c5990b58bf3c4 122 PRINTS PR00421 Thioredoxin family signature 39 48 - T 08-07-2011 IPR005746 Thioredoxin Molecular Function:protein disulfide oxidoreductase activity (GO:0015035), Biological Process:glycerol ether metabolic process (GO:0006662), Biological Process:cell redox homeostasis (GO:0045454), Molecular Function:electron carrier activity (GO:0009055)
A2YIW7 f927b0d241297dcc9a1c5990b58bf3c4 122 PRINTS PR00421 Thioredoxin family signature 78 89 - T 08-07-2011 IPR005746 Thioredoxin Molecular Function:protein disulfide oxidoreductase activity (GO:0015035), Biological Process:glycerol ether metabolic process (GO:0006662), Biological Process:cell redox homeostasis (GO:0045454), Molecular Function:electron carrier activity (GO:0009055)
A2YIW7 f927b0d241297dcc9a1c5990b58bf3c4 122 PRINTS PR00421 Thioredoxin family signature 31 39 - T 08-07-2011 IPR005746 Thioredoxin Molecular Function:protein disulfide oxidoreductase activity (GO:0015035), Biological Process:glycerol ether metabolic process (GO:0006662), Biological Process:cell redox homeostasis (GO:0045454), Molecular Function:electron carrier activity (GO:0009055)
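Output of this kind is straightforward to load into a local pipeline. The sketch below parses two rows modelled on the slide (the second with its GO list shortened) using Python's `csv` module. The column names are labels inferred from the example rows, not an official schema, and the 14-column layout is assumed from this beta-era output.

```python
import csv
import io

# Column labels inferred from the example output; treat them as assumptions.
COLUMNS = [
    "protein_accession", "sequence_md5", "sequence_length", "analysis",
    "signature_accession", "signature_description", "start", "stop",
    "score", "status", "date", "interpro_accession",
    "interpro_description", "go_terms",
]


def parse_tsv(text):
    """Parse InterProScan-style TSV text into a list of dicts."""
    reader = csv.reader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for fields in reader:
        if not fields:
            continue                      # skip blank lines
        row = dict(zip(COLUMNS, fields))
        row["start"] = int(row["start"])
        row["stop"] = int(row["stop"])
        rows.append(row)
    return rows


SAMPLE = "\n".join("\t".join(fields) for fields in [
    ["A2YIW7", "f927b0d241297dcc9a1c5990b58bf3c4", "122", "Pfam",
     "PF00085", "Thioredoxin", "9", "112", "1.3E-28", "T", "08-07-2011",
     "IPR013766", "Thioredoxin domain",
     "Biological Process:cell redox homeostasis (GO:0045454)"],
    ["A2YIW7", "f927b0d241297dcc9a1c5990b58bf3c4", "122", "PRINTS",
     "PR00421", "Thioredoxin family signature", "39", "48", "-", "T",
     "08-07-2011", "IPR005746", "Thioredoxin",
     "Molecular Function:protein disulfide oxidoreductase activity (GO:0015035)"],
])

rows = parse_tsv(SAMPLE)
for row in rows:
    print(row["analysis"], row["signature_accession"],
          row["start"], row["stop"], row["interpro_accession"])
```

The score column is left as text because some analyses report "-" rather than a number, as the PRINTS and ProSitePatterns rows show.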
![Page 30: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/30.jpg)
XML output
./interproscan.sh -i test_proteins.fasta -o test_proteins.xml --goterms -F xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<protein-matches xmlns="http://www.ebi.ac.uk/schema/interpro">
  <protein>
    <sequence md5="f927b0d241297dcc9a1c5990b58bf3c4">MAAEEGVVIACHNKDEFDAQMTKAKEAGKVVIIDFTASWCGPCRFIAPVFAEYAKKFPGAVFLKVDVDELKEVAEKYNVEAMPTFLFIKDGAEADKVVGARKDDLQNTIVKHVGATAASASA</sequence>
    <xref id="A2YIW7"/>
    <matches>
      <fingerprints-match graphscan="III" evalue="2.500000864E-7">
        <signature name="THIOREDOXIN" desc="Thioredoxin family signature" ac="PR00421">
          <models>
            <model name="THIOREDOXIN" desc="Thioredoxin family signature" ac="PR00421"/>
          </models>
          <signature-library-release version="41.1" library="PRINTS"/>
        </signature>
        <locations>
          <fingerprints-location score="0.0" pvalue="0.0" motifNumber="3" end="48" start="39"/>
          <fingerprints-location score="0.0" pvalue="0.0" motifNumber="2" end="89" start="78"/>
          <fingerprints-location score="0.0" pvalue="0.0" motifNumber="1" end="39" start="31"/>
        </locations>
      </fingerprints-match>
      <hmmer2-match score="100.5" evalue="-INF">
        <signature name="Thioredoxin" ac="PIRSF000077">
          <models>
            <model name="Thioredoxin" ac="PIRSF000077"/>
          </models>
          <signature-library-release version="2.74" library="PIRSF"/>
        </signature>
        <locations>
          <hmmer2-location hmm-length="0" hmm-end="108" hmm-start="1" evalue="1.50000307E-27" score="0.0" end="113" start="4"/>
        </locations>
      </hmmer2-match>
      ...etc
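The XML exchange format can be read with any namespace-aware parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened sample derived from the slide; the closing tags were added here to make the fragment well-formed, since the slide elides the rest of the file. The function name and the tuple layout are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute in the example output.
NS = {"ipr": "http://www.ebi.ac.uk/schema/interpro"}

# Shortened sample; the slide's output is truncated, so the closing tags
# below are an assumption made to keep the fragment well-formed.
SAMPLE = """
<protein-matches xmlns="http://www.ebi.ac.uk/schema/interpro">
  <protein>
    <xref id="A2YIW7"/>
    <matches>
      <fingerprints-match graphscan="III" evalue="2.500000864E-7">
        <signature name="THIOREDOXIN" ac="PR00421">
          <signature-library-release version="41.1" library="PRINTS"/>
        </signature>
        <locations>
          <fingerprints-location motifNumber="3" end="48" start="39"/>
          <fingerprints-location motifNumber="2" end="89" start="78"/>
        </locations>
      </fingerprints-match>
      <hmmer2-match score="100.5" evalue="-INF">
        <signature name="Thioredoxin" ac="PIRSF000077">
          <signature-library-release version="2.74" library="PIRSF"/>
        </signature>
        <locations>
          <hmmer2-location end="113" start="4"/>
        </locations>
      </hmmer2-match>
    </matches>
  </protein>
</protein-matches>
""".strip()


def extract_matches(xml_text):
    """Return (protein id, library, signature accession, start, end) tuples."""
    root = ET.fromstring(xml_text)
    found = []
    for protein in root.findall("ipr:protein", NS):
        protein_id = protein.find("ipr:xref", NS).get("id")
        # Match elements differ by analysis type, so iterate over all of them.
        for match in protein.find("ipr:matches", NS):
            signature = match.find("ipr:signature", NS)
            release = signature.find("ipr:signature-library-release", NS)
            for location in match.find("ipr:locations", NS):
                found.append((
                    protein_id,
                    release.get("library"),
                    signature.get("ac"),
                    int(location.get("start")),
                    int(location.get("end")),
                ))
    return found


matches = extract_matches(SAMPLE)
for entry in matches:
    print(*entry)
```

Iterating over the children of `matches` and `locations`, rather than naming each element type, keeps the reader working when a new analysis adds its own match element.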
![Page 31: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/31.jpg)
Pre-calculated match lookup
• BerkeleyDB-backed REST web service
• Includes matches for all of UniParc (27 million sequences)
• 250 million matches
• Fast response
• Integrated into i5
[Figure: response time (ms) per sequence]
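The idea behind a pre-calculated lookup is to key stored matches on a digest of the sequence and run the analyses only on a miss. The sketch below assumes an MD5 key, as suggested by the `md5` values in the example output, and uses an in-memory dict in place of the BerkeleyDB-backed service. The normalisation step (removing whitespace, upper-casing) and all function names are assumptions for illustration.

```python
import hashlib


def sequence_md5(sequence):
    """MD5 hex digest of a sequence after removing whitespace and upper-casing.

    The normalisation is an assumption; the service may hash differently.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def annotate(sequence, precalculated, run_analysis):
    """Return (matches, was_precalculated) for one sequence.

    precalculated -- dict mapping digest -> stored matches
    run_analysis  -- callable used only when the digest is not stored
    """
    key = sequence_md5(sequence)
    if key in precalculated:
        return precalculated[key], True
    matches = run_analysis(sequence)
    precalculated[key] = matches        # remember the result for next time
    return matches, False


calls = []


def fake_analysis(sequence):
    """Hypothetical stand-in for running the member-database analyses."""
    calls.append(sequence)
    return ["PF00085"]


store = {}
first = annotate("MAAEEG", store, fake_analysis)
second = annotate("maa eeg\n", store, fake_analysis)
print(first, second, len(calls))
```

Since identical sequences always produce the same digest, a lookup of this kind lets a pipeline skip the expensive analysis step for any protein already present in the stored set.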
![Page 32: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/32.jpg)
Other functionality
• Increased reliability
• Precalculated match lookup
• Configuration
– simple properties file
• Nucleotide sequence
– getOrf
– map matches to nucleotide coordinates
• Pathway mapping
– KEGG, Reactome, MetaCyc, UniPathway
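For the nucleotide-sequence item above, mapping a protein match back to nucleotide coordinates is a matter of multiplying by three and offsetting by the ORF position. The sketch below assumes 1-based inclusive coordinates, an ORF without introns, and ORF coordinates given on the forward strand of the nucleotide sequence; the function and parameter names are hypothetical and not InterProScan's own.

```python
def protein_to_nucleotide(aa_start, aa_end, orf_start, orf_end, strand="+"):
    """Map a protein match (1-based, inclusive) onto nucleotide coordinates.

    orf_start and orf_end are the 1-based inclusive positions of the ORF
    on the nucleotide sequence, with orf_start <= orf_end.
    """
    if aa_start < 1 or aa_end < aa_start:
        raise ValueError("invalid protein coordinates")
    if strand == "+":
        nt_start = orf_start + (aa_start - 1) * 3
        nt_end = orf_start + aa_end * 3 - 1
    elif strand == "-":
        # On the reverse strand the first residue sits at the ORF's end.
        nt_end = orf_end - (aa_start - 1) * 3
        nt_start = orf_end - aa_end * 3 + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    if nt_start < orf_start or nt_end > orf_end:
        raise ValueError("match extends beyond the ORF")
    return nt_start, nt_end


# A 122-residue ORF at positions 101..466, with a match at residues 9..112.
print(protein_to_nucleotide(9, 112, 101, 466, "+"))
print(protein_to_nucleotide(9, 112, 101, 466, "-"))
```

A match of 104 residues covers 312 nucleotides on either strand, which is a quick sanity check on the arithmetic.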
![Page 33: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/33.jpg)
Future functionality
• Web service
• Interact directly with architecture:
– LAN
– LSF
– PBS
– Sun Grid Engine
• Database persistence
– Oracle
– MySQL
– Postgres
– etc.
• Graphical output
• Other functionality
– ask!
![Page 34: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/34.jpg)
InterProScan 5 timeline
• Beta release
– August 2011
– InterProScan 4 still maintained
• Full release
– Early 2012
– InterProScan 4 deprecated
![Page 35: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/35.jpg)
Acknowledgements
Craig McAnulla
Anthony Quinn
Phil Jones
Matthew Fraser
Maxim Scheremetjew
Alex Mitchell
Siew-Yit Yong
Amaia Sangrador
Sebastien Pesseat
Sarah Hunter
Team leader Developers Bioinformaticians Curators
Any Questions → Stand 302
![Page 36: InterPro and InterProScan 5.0](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c69fd24a795911758b4599/html5/thumbnails/36.jpg)
Come and see us at booths 9 and 10!
• Job opportunities
• PhD and postdoc positions
• Training in person and online
• Services
• Industry programme