Unknown Genes, Community Profiling, & Biotorrents.net

Morgan Langille UC Davis


Post on 17-Jul-2015


Page 1: Unknown Genes, Community Profiling, & Biotorrents.net

Morgan Langille

UC Davis


Questions

If we wanted to start studying a gene of unknown function, which one(s) should we study first?

How many un-annotated genes could be annotated?

What proportion of unknown genes (hypothetical proteins) are probably not real proteins (e.g. pseudogenes, mis-predicted ORFs, etc.)?

What proportion of unknown gene families are probably phage-related?

Can some of these families (hopefully the top ranking ones) be characterized using non-similarity based bioinformatic approaches?


Outline of project

Genomic Data

Pfam Search

Filter for unknown genes

Build HMMs for unknown genes

Rank Families

•Universality

•Evenness

•Pathogen / Non-pathogen

•Etc.

Create unknown families for metagenomics data

Identify unknown families that now merge with known families

Quantify families that are likely phage

Use several non-similarity based methods to predict family function

•Community Profiling**

•3D structure similarity

•Neighboring genes
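
The "Rank Families" step above could be sketched as follows. The slides do not define universality or evenness, so this sketch assumes universality is the fraction of genomes containing the family and evenness is Pielou's evenness of the per-genome copy numbers. The family names and counts are hypothetical.

```python
import math

def universality(counts):
    """Fraction of genomes that contain at least one member of the family."""
    return sum(1 for c in counts if c > 0) / len(counts)

def evenness(counts):
    """Pielou's evenness: Shannon entropy of the counts divided by its maximum."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    present = [c for c in counts if c > 0]
    entropy = -sum((c / total) * math.log(c / total) for c in present)
    return entropy / math.log(len(counts))

def rank_families(family_counts):
    """Sort family names by universality, then evenness, highest first."""
    return sorted(
        family_counts,
        key=lambda name: (universality(family_counts[name]),
                          evenness(family_counts[name])),
        reverse=True,
    )

# Hypothetical copy numbers of three unknown families across four genomes.
example = {
    "famA": [1, 1, 1, 1],
    "famB": [4, 0, 0, 0],
    "famC": [2, 1, 0, 1],
}
ranking = rank_families(example)
```

Other criteria from the list (for example pathogen versus non-pathogen) could be added as further elements of the sort key.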


Phylogenetic profiling

For C. hydrogenoformans, identified presence or absence of homologs in all other completely sequenced genomes

Identified many hypothetical proteins that had the same profile as known sporulation proteins

Wu, et al., PLOS Genetics, 2005
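
A minimal sketch of the idea, assuming each gene is represented by a binary presence/absence vector over the same ordered set of genomes. The gene names, annotations, and profiles are hypothetical and are not taken from Wu et al.

```python
from collections import defaultdict

def group_by_profile(profiles):
    """Group gene names that share an identical presence/absence profile."""
    groups = defaultdict(list)
    for gene, profile in profiles.items():
        groups[tuple(profile)].append(gene)
    return dict(groups)

def candidates_matching(profiles, annotated):
    """Map each unannotated gene to the annotations of genes with the same profile."""
    hits = {}
    for members in group_by_profile(profiles).values():
        known = [g for g in members if g in annotated]
        unknown = [g for g in members if g not in annotated]
        if known and unknown:
            labels = sorted({annotated[k] for k in known})
            for gene in unknown:
                hits[gene] = labels
    return hits

# Hypothetical data: 1 = homolog present in that genome, 0 = absent.
profiles = {
    "spo_gene": [1, 0, 1, 1, 0],
    "hyp1":     [1, 0, 1, 1, 0],
    "hyp2":     [1, 1, 1, 1, 1],
    "core_gene": [1, 1, 1, 1, 1],
    "hyp3":     [0, 1, 0, 0, 0],
}
annotated = {"spo_gene": "sporulation", "core_gene": "transcription"}
predictions = candidates_matching(profiles, annotated)
```

Exact matching is the strictest choice. A distance threshold on the profiles would be a more tolerant alternative.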


Community Profiling

[Figure panels: KEGG, COG]

DeLong, et al., Science, 2006


Community Profiling

Look across multiple metagenomic samples

Gene families that have similar profiles may have similar function

Similar to using co-expression to identify similar functioning genes
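
As a rough illustration, profile similarity can be measured with a correlation coefficient over the per-sample abundances. This sketch assumes Pearson correlation, which the slides do not specify at this point. The family names and counts are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def most_similar(target, profiles):
    """Rank all other families by correlation to the target family."""
    scores = [
        (pearson(profiles[target], profile), name)
        for name, profile in profiles.items()
        if name != target
    ]
    return sorted(scores, reverse=True)

# Hypothetical hit counts of three families across four metagenomic samples.
profiles = {
    "unknown_fam": [10, 2, 8, 1],
    "known_fam_1": [20, 4, 16, 2],
    "known_fam_2": [1, 9, 2, 10],
}
ranked = most_similar("unknown_fam", profiles)
```

Here `known_fam_1` rises and falls with `unknown_fam` across samples, so it ranks first and would be the family whose function is tentatively transferred.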


So what have I done?

"all metagenomics peptides" from CAMERA

43M sequences (mostly GOS)

Searched against 11,000 Pfams using HMMER 3

Used “cluster” to group genes and samples
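
One way to turn the search output into a Pfam-by-sample count table is sketched below. It assumes the search was run with HMMER 3's `hmmscan` and `--tblout`, so each row holds the Pfam model name in the first column, the sequence name in the third, and the full-sequence E-value in the fifth. The E-value cutoff, the `sample_of` lookup, and all names in the example are hypothetical.

```python
from collections import defaultdict

def count_hits(tblout_lines, sample_of, max_evalue=1e-3):
    """Count Pfam hits per sample from hmmscan --tblout style rows."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 5:
            continue  # malformed row
        pfam, seq_id, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= max_evalue and seq_id in sample_of:
            counts[pfam][sample_of[seq_id]] += 1
    return {pfam: dict(per_sample) for pfam, per_sample in counts.items()}

# Hypothetical rows, truncated to the first seven columns for brevity.
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "FamA  PF00000.1  seq1  -  1.2e-20  70.1  0.1",
    "FamA  PF00000.1  seq2  -  3.0e-08  30.5  0.0",
    "FamB  PF00000.2  seq2  -  0.5  5.0  0.0",
    "FamB  PF00000.2  seq3  -  2.0e-12  45.0  0.2",
]
sample_of = {"seq1": "sampleX", "seq2": "sampleX", "seq3": "sampleY"}
matrix = count_hits(example, sample_of)
```

The resulting table is the kind of input that a clustering tool would then group by both rows and columns.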


Results

Red = above avg. number of pfams

Green = below avg. number of pfams

Have not normalized for number of sequences per sample or for number of pfams

[Heatmap axes: Metagenomic Samples, Pfams]
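
The missing normalization matters because a large sample can look "above average" purely because it contains more sequences. A minimal sketch, assuming normalization means dividing each count by the number of sequences in that sample (the numbers are hypothetical):

```python
def normalize(counts, sequences_per_sample):
    """Convert raw Pfam hit counts to hits per sequence in each sample."""
    return [c / n for c, n in zip(counts, sequences_per_sample)]

def center(values):
    """Express each value relative to the row average (positive = above average)."""
    avg = sum(values) / len(values)
    return [v - avg for v in values]

# Hypothetical counts of one Pfam in three samples of different sizes.
raw = [50, 100, 10]
seqs = [1000, 4000, 1000]

centered_raw = center(raw)                     # what the unnormalized heatmap shows
centered_norm = center(normalize(raw, seqs))   # after per-sample normalization
```

In this toy case the second sample appears above average in the raw counts but falls below average once its larger size is taken into account, and the first sample flips the other way.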


Example of phage Pfams clustering together


Measuring functional relatedness

Need to measure community profiling performance

The hierarchical clusters were broken into 575 groups using a correlation cutoff of 0.90 or above.

PFams were mapped to GO terms using pfam2GO

1893 PFams had no associated GO term

○ 695 of these were Domains of Unknown Function (DUFs)

3377 PFams had one or more associated GO terms and could be used for further analysis

Only 67 (of 575) clusters contained 4 or more PFams with at least one GO term
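
The mapping and filtering steps above might look like the following sketch. It assumes the mapping file uses the usual pfam2go line layout (`Pfam:<accession> <name> > GO:<term> ; GO:<id>`, with `!` comment lines). The accessions, GO IDs, and clusters are hypothetical.

```python
from collections import defaultdict

def parse_pfam2go(lines):
    """Build a Pfam accession -> set of GO IDs mapping from pfam2go-style lines."""
    mapping = defaultdict(set)
    for line in lines:
        if line.startswith("!") or ">" not in line:
            continue  # skip comments and lines without a mapping
        left, right = line.split(">", 1)
        if ";" not in right:
            continue  # no GO ID on this line
        pfam_id = left.split()[0].replace("Pfam:", "")
        go_id = right.rsplit(";", 1)[1].strip()
        mapping[pfam_id].add(go_id)
    return dict(mapping)

def usable_clusters(clusters, pfam_to_go, min_annotated=4):
    """Keep clusters with at least min_annotated Pfams that have a GO term."""
    kept = []
    for cluster in clusters:
        annotated = [p for p in cluster if p in pfam_to_go]
        if len(annotated) >= min_annotated:
            kept.append(annotated)
    return kept

# Hypothetical mapping lines and clusters.
lines = [
    "! comment line",
    "Pfam:PF90001 FamA > GO:term one ; GO:9000001",
    "Pfam:PF90002 FamB > GO:term two ; GO:9000002",
    "Pfam:PF90003 FamC > GO:term one ; GO:9000001",
    "Pfam:PF90004 FamD > GO:term three ; GO:9000003",
]
pfam_to_go = parse_pfam2go(lines)
clusters = [
    ["PF90001", "PF90002", "PF90003", "PF90004", "PF90009"],
    ["PF90001", "PF90009"],
]
kept = usable_clusters(clusters, pfam_to_go)
```

Only the first cluster survives, because the second has a single annotated Pfam.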


Measuring GO similarity

G-SESAME

Measures the semantic similarity of any two GO terms

Not downloadable so queries had to be made to their web server (not fun)

Pair-wise similarity was measured for each pair of GO terms in each cluster

○ Had to check if terms were in the same namespace
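
The per-cluster scoring could be sketched as below. The `similarity` argument is a placeholder for whatever returns a score for two GO terms. G-SESAME itself is not reproduced here, so a toy lookup table stands in for it, and the GO labels and namespaces are hypothetical.

```python
from itertools import combinations

def cluster_score(go_terms, namespace_of, similarity):
    """Average pairwise similarity over term pairs that share a namespace."""
    scores = [
        similarity(a, b)
        for a, b in combinations(sorted(go_terms), 2)
        if namespace_of[a] == namespace_of[b]
    ]
    return sum(scores) / len(scores) if scores else None

# Toy stand-ins: hypothetical term labels, namespaces, and similarity values.
namespace_of = {
    "GO:A": "biological_process",
    "GO:B": "biological_process",
    "GO:C": "molecular_function",
    "GO:D": "biological_process",
}
toy_table = {
    frozenset(["GO:A", "GO:B"]): 0.5,
    frozenset(["GO:A", "GO:D"]): 0.25,
    frozenset(["GO:B", "GO:D"]): 0.75,
}

def toy_similarity(a, b):
    """Look up a precomputed score; a real run would query a similarity service."""
    return toy_table[frozenset([a, b])]

score = cluster_score(["GO:A", "GO:B", "GO:C", "GO:D"], namespace_of, toy_similarity)
```

Pairs involving `GO:C` are skipped because it is the only term in its namespace, which is the check the last bullet refers to.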


Results

Average G-SESAME scores for each cluster

The average of all cluster averages was 0.484

10 clusters had a score of 0.60 or greater

The data was then randomized by using the same GO terms but in different random clusters, giving an average score of 0.412-0.420 over 4 iterations

Each of the 4 iterations had only 1 or 0 clusters with a score of 0.60 or greater
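
The randomization control can be sketched as a shuffle that keeps every cluster's size but reassigns its members. The scoring function below is a toy stand-in (fraction of pairs sharing a label), not G-SESAME, and the items are hypothetical.

```python
import random
from itertools import combinations

def shuffle_clusters(clusters, rng):
    """Reassign the same items to random clusters, preserving every cluster size."""
    pool = [item for cluster in clusters for item in cluster]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for cluster in clusters:
        shuffled.append(pool[start:start + len(cluster)])
        start += len(cluster)
    return shuffled

def mean_of_cluster_means(clusters, score):
    """Average the per-cluster scores, as in 'average of all cluster averages'."""
    means = [score(cluster) for cluster in clusters]
    return sum(means) / len(means)

def same_label_fraction(cluster):
    """Toy score: fraction of item pairs whose first character matches."""
    pairs = list(combinations(cluster, 2))
    return sum(a[0] == b[0] for a, b in pairs) / len(pairs)

# Hypothetical clusters in which every member shares a label.
observed = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
observed_mean = mean_of_cluster_means(observed, same_label_fraction)

rng = random.Random(0)  # fixed seed so the control is repeatable
random_means = [
    mean_of_cluster_means(shuffle_clusters(observed, rng), same_label_fraction)
    for _ in range(4)  # four iterations, as on the slide
]
```

Comparing `observed_mean` with the values in `random_means` plays the same role as comparing 0.484 with 0.412-0.420 above.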


Community Profiling Results

[Scatter plot: G-SESAME Score (y-axis, 0 to 1) vs. Cluster Correlation Coefficient (x-axis, 0.89 to 0.96)]

• Average of all clusters = 0.49

• 10 clusters are > 0.60


Random Results

[Scatter plot: G-SESAME Score (y-axis, 0 to 1) vs. Cluster Correlation Coefficient (x-axis, 0.89 to 0.96)]

• Average of all clusters (4 iterations) = 0.41 - 0.42

• 1 or 0 clusters are > 0.60


BitTorrent

A peer-to-peer file sharing protocol

~ 27-55% of all Internet traffic

Mostly illegal file sharing

Files are shared in small pieces between several users
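
The piece mechanism can be illustrated with a short sketch. In the original BitTorrent protocol a file is cut into fixed-size pieces and the torrent metadata stores a SHA-1 hash per piece, so a piece received from any peer can be checked independently. The data and the tiny piece length here are hypothetical; real torrents use much larger pieces.

```python
import hashlib

def piece_hashes(data, piece_length):
    """Split data into fixed-size pieces and return the SHA-1 hex digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).hexdigest()
        for i in range(0, len(data), piece_length)
    ]

def verify_piece(piece, index, hashes):
    """Check a piece received from a peer against the expected hash."""
    return hashlib.sha1(piece).hexdigest() == hashes[index]

# Hypothetical 4000-byte payload standing in for a dataset.
data = b"ACGT" * 1000
hashes = piece_hashes(data, piece_length=1024)

good = verify_piece(data[1024:2048], 1, hashes)
bad = verify_piece(b"corrupted piece", 1, hashes)
```

Because each piece is verified on its own, different pieces can be fetched from different users at the same time.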


Torrents for Biology

Why use torrent technology?

1. Download large datasets much faster

2. Searchable central listing

3. Decentralization of data


What is BioTorrents?

A legal file sharing website for scientists

Users can upload their own research results, data, software

Users can browse or search through all datasets

Data is not hosted on BioTorrents


www.biotorrents.net


Browse & Search


Details


Sign Up


Upload


Other Features

Forum

RSS Feed

Top 10

FAQ

Links


Who will upload data?

Everyone!

Realistically,

Large organizations (e.g. NCBI, CAMERA, etc.)

○ May need some convincing to host their data via torrents in addition to FTP, HTTP, etc.

Scientists who really support open science

○ Sharing data before it is formally complete and published


Technical Challenges

Many institutions frown on BitTorrent technology

A port must be opened/forwarded

Client program and computer must be left running

Ensuring data is legal, virus free, etc.

○ Users who upload many legitimate torrents will provide more confidence to people downloading

Making downloading and uploading easy