friedberg bosc2010 iprstats

Post on 30-May-2015


IPRStats: a Visualization Tool for InterProScan

Iddo Friedberg

Microbiology and Computer Science & Software Engineering

Miami University

http://github.com/devrkel/IPRStats.git

Microbes are Everywhere

● 10^30 prokaryotic cells on Earth (give or take a couple)
● Dominate the biosphere
● 90% of the cells in your body are prokaryotic (10^14)
● Found in the most hostile environments

Microbes do Everything

● Nutrient reservoir:
  ● 4x10^10 tons carbon (rivaling plants)
  ● 1x10^10 tons nitrogen
  ● 1x10^9 tons phosphorus

almost

Of course there is health...

● Communicable diseases
● Heart disease
● Gastric cancer
● Irritable Bowel Syndrome

...and Wellness

Microbial Genomics

Phage phi-X174 1978: 5.5Kbp

H. influenzae 1995: 1.7Mbp

Classic microbial genomics


Microbes live in Communities & only 1% can be cultured

What is Metagenomics?

• Culture-independent approach to study microbial communities
  – < 1% of microbes can be cultured
  – DNA directly isolated from environmental sample and sequenced
• Examining genomic content of organisms in a community/environment to better understand:
  – Diversity of organisms
  – Their roles and interactions in the ecosystem

Metagenomics is the Application of Genomics to Communities

Some things we can learn using Metagenomics

● Taxonomic content: Taxon diversity in a habitat (using taxonomic markers)

• Functional content: biological functions, qualitative and quantitative profiles

• Coping with the environment: differences in functional content between habitats

• Decompose the biotic / abiotic elements in a habitat: metadata analysis

A Metagenomic project

● Sequencing
● Assembly
● Diversity analysis
● Annotation
  ● Gene finding
  ● Function prediction
● Diversity analysis
● Comparative analysis


A Metagenomic project

● Sequencing
● Assembly
● Annotation
  ● Gene finding
  ● Function prediction
● Diversity analysis
● Comparative analysis

Population analysis tools

InterProScan

● Signature search against an integrated resource of domains and functional sites

● Easy to install, cluster-enabled (pleasantly parallel)

● Maintained by EBI

● Can annotate whole genomes

● PIR, Pfam, TIGRFam, Panther, Prodom, PRINTS,...

● Needs a visualization tool for population / metagenomic annotation

IPRStats

[Screenshot of the IPRStats window. Menus: File, Help. Databases listed: PFAM, PIR, GENE3D, HAMAP, PANTHER, PRINTS, PRODOM, PROFILE, PROSITE, SMART, SUPERFAMILY, TIGRFAMs]

[Data-flow diagram. Labels: Open XML file; Python SAX Parser; Full Databases; Aggregate Queries; Resulting Tables; Charting]

● GUI: wxPython
● Excel export: xlwt
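The stages named on this slide (a Python SAX parser filling database tables, then aggregate queries producing the tables behind the charts) can be sketched with the standard library alone. This is a minimal illustration, not the IPRStats code: the XML layout (`protein` and `match` elements with `id`, `name` and `dbname` attributes), the `matches` table, and the helper names `MatchHandler`, `load_matches` and `signature_counts` are all assumptions made for the example, and real InterProScan output is richer than this.

```python
import sqlite3
import xml.sax

# Hypothetical, simplified stand-in for InterProScan XML output.
SAMPLE_XML = b"""<results>
  <protein id="P1">
    <match id="PF00001" name="7tm_1" dbname="PFAM"/>
    <match id="PF00069" name="Pkinase" dbname="PFAM"/>
  </protein>
  <protein id="P2">
    <match id="PF00069" name="Pkinase" dbname="PFAM"/>
    <match id="SM00220" name="S_TKc" dbname="SMART"/>
  </protein>
</results>"""


class MatchHandler(xml.sax.ContentHandler):
    """Streams <match> elements into SQLite without building a DOM."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        self.protein = None  # id of the <protein> currently being read

    def startElement(self, name, attrs):
        if name == "protein":
            self.protein = attrs.get("id")
        elif name == "match":
            self.conn.execute(
                "INSERT INTO matches VALUES (?, ?, ?, ?)",
                (self.protein, attrs.get("dbname"),
                 attrs.get("id"), attrs.get("name")),
            )


def load_matches(xml_bytes):
    """Parse the XML and return an in-memory database of all matches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE matches (protein TEXT, db TEXT, signature TEXT, name TEXT)"
    )
    xml.sax.parseString(xml_bytes, MatchHandler(conn))
    conn.commit()
    return conn


def signature_counts(conn, db):
    """Aggregate query: hits per signature for one member database."""
    return conn.execute(
        "SELECT signature, name, COUNT(*) AS n FROM matches "
        "WHERE db = ? GROUP BY signature, name ORDER BY n DESC, signature",
        (db,),
    ).fetchall()


conn = load_matches(SAMPLE_XML)
pfam_counts = signature_counts(conn, "PFAM")
```

Rows such as those in `pfam_counts` are the kind of per-database summary a pie chart or bar graph could be drawn from. SAX is a natural fit here because annotation files for a whole population can be large, and a streaming parser never holds the full document in memory.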

IPRStats Architecture

[Architecture diagram. Components: IPRStats (wx.Frame); Menu (wx.MenuBar); PropertiesDlg (wx.Dialog); Table (wx.PyGridTableBase); Chart (wx.StaticBitmap); Results (sqlite or pytables); StatsData; Settings. Exporter labels: standalone, HTML, XLS (using xlwt), IPS. Importer labels: XML, IPS]

What is PyTables?

- Package for creating data structures that can handle large amounts of data
- Uses NumPy (for in-memory) and HDF5 (for disk storage) structures
- Uses Numexpr (JIT compiler) for evaluating expressions (like queries)
- In the context of IPRScan, it provides a way of accessing a huge table of data without requiring that all the data be in memory

Pros
- HDF5 provides very fast, compact and efficient indexing
- NumPy provides efficient in-memory storage
- Minimizes disk and memory usage
- Very fast read times compared to SQLite and MySQL

Cons
- Large memory overhead (particularly in comparison to smaller datasets)
- Many large, complex dependencies including HDF5, NumPy, Numexpr and Cython
- Slow write times (particularly important since IPRStats bottlenecks with writing)
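To make the out-of-memory access point concrete, here is a small sketch of an on-disk PyTables table queried with a Numexpr-style condition. It assumes a current PyTables release with snake_case names (`open_file`, `create_table`); releases from the time of this talk used camelCase equivalents. The `Match` row layout, the column widths, and the helpers `write_matches` and `count_for_db` are hypothetical and not taken from IPRStats.

```python
import os
import tempfile

import tables  # PyTables; requires HDF5, NumPy and Numexpr


class Match(tables.IsDescription):
    """Hypothetical row layout for one signature match."""
    protein = tables.StringCol(32)
    db = tables.StringCol(16)
    signature = tables.StringCol(16)


def write_matches(path, rows):
    """Append (protein, db, signature) tuples to an HDF5 table on disk."""
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "matches", Match, "signature matches")
        row = table.row
        for protein, db, signature in rows:
            row["protein"] = protein
            row["db"] = db
            row["signature"] = signature
            row.append()
        table.flush()  # push buffered rows to disk


def count_for_db(path, db):
    """Count matches for one database; rows are streamed, not loaded whole."""
    with tables.open_file(path, mode="r") as h5:
        table = h5.root.matches
        hits = table.where("db == target", condvars={"target": db.encode()})
        return sum(1 for _ in hits)


path = os.path.join(tempfile.mkdtemp(), "matches.h5")
write_matches(path, [
    ("P1", "PFAM", "PF00001"),
    ("P2", "PFAM", "PF00069"),
    ("P2", "SMART", "SM00220"),
])
pfam_hits = count_for_db(path, "PFAM")
```

The query string is evaluated by Numexpr against chunks read from the file, so memory use stays bounded even for very large tables. The row-by-row append loop is also where the slow write times listed above would show up.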

Multiple graph formats

Pie charts

Bar graphs

Conclusions & Future

● A lightweight, machine-independent visualization tool for InterProScan annotations

● License: AFL
● Todo:
  ● Comparative population analysis
  ● Large dataset handling
  ● More graphic options
  ● Anything else you like...

– http://github.com/devrkel/IPRStats.git

Thanks

● David Ream
● Han Wang
● Ian Fleming
● David Vincent
● Ryan Kelly
● EBI
● Miami University startup funding
● Miami University Undergraduate Summer Scholars Program

The Friedberg Lab is Recruiting

● Graduate students
● Postdocs
● Catch me later, email me, or look at iddo-friedberg.net to learn more
