pan-genome graphs biodata14
DESCRIPTION
Pan-genome graphs for bacteria and the web.
TRANSCRIPT
Background
• “Pan Genome”: a way to think about, compute on, and visualize the differences and similarities of many genomes at once
• Reference free structure
• Many, many genomes
de Bruijn Graph Construction
• Dk = (V, E)
• V = All length-k subfragments
• E = Directed edges between consecutive subfragments
• Nodes overlap by k-1 words
• Locally constructed graph reveals the global sequence structure
• Overlaps between sequences implicitly computed
Slide: http://cbcb.umd.edu/confcour/CMSC828H-materials/Lecture12-MSchatz-DeBruijnAssembly.pptx
Original fragment: It was the best of
Directed edge: It was the best → was the best of
de Bruijn, 1946
Idury and Waterman, 1995
Pevzner, Tang, Waterman, 2001
Strategy: find all k-mers, build graph
• Every k-mer becomes a node
• Two nodes are linked with an edge if they share a k-1 mer
Sequence: GACTGGGACTCC
6-mers (nodes, in order): GACTGG, ACTGGG, CTGGGA, TGGGAC, GGGACT, GGACTC, GACTCC
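The strategy above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides: the function names `kmers` and `de_bruijn` are made up for the example, and edges are drawn only between k-mers that are consecutive in an input sequence (so they overlap by k-1 characters).

```python
def kmers(seq, k):
    """Return the k-mers of seq in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(sequences, k):
    """Build a node-centric de Bruijn graph as {k-mer: set of successor k-mers}."""
    graph = {}
    for seq in sequences:
        mers = kmers(seq, k)
        # Every k-mer becomes a node, even if it has no outgoing edge.
        for mer in mers:
            graph.setdefault(mer, set())
        # Consecutive k-mers share a (k-1)-mer, so link them with a directed edge.
        for left, right in zip(mers, mers[1:]):
            graph[left].add(right)
    return graph


graph = de_bruijn(["GACTGGGACTCC"], k=6)
for node, successors in graph.items():
    print(node, "->", sorted(successors))
```

For the 12-base example sequence with k=6 this gives seven nodes joined in a single path, from GACTGG to GACTCC.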
Strategy: k-mers from feature families, build graph
• Every k-mer becomes a node
– If it is present in m genomes
• Two nodes are linked with an edge if they share a k-1 mer
• d# = a feature family
Feature family sequence: d1 d2 d3 d4 d5 d6 d7 d8 d9
6-mers (nodes, in order): d1d2d3d4d5d6, d2d3d4d5d6d7, d3d4d5d6d7d8, d4d5d6d7d8d9
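The feature-family variant can be sketched the same way, with each genome represented as an ordered list of family identifiers instead of a string of bases. The sketch below makes several assumptions that are not stated on the slides: "present in m genomes" is read as "present in at least m genomes", strand and orientation are ignored, and the names `family_kmers`, `pan_genome_graph`, `genome_a`, `genome_b` and `dX` are hypothetical.

```python
from collections import defaultdict


def family_kmers(families, k):
    """Return the k-mers (as tuples) of an ordered list of feature families."""
    return [tuple(families[i:i + k]) for i in range(len(families) - k + 1)]


def pan_genome_graph(genomes, k, m):
    """Keep k-mers seen in at least m genomes (assumed reading) and link neighbours."""
    support = defaultdict(set)
    for name, families in genomes.items():
        for mer in family_kmers(families, k):
            support[mer].add(name)
    nodes = {mer for mer, names in support.items() if len(names) >= m}

    edges = set()
    for families in genomes.values():
        mers = family_kmers(families, k)
        for left, right in zip(mers, mers[1:]):
            # Only link consecutive k-mers when both passed the m-genome filter.
            if left in nodes and right in nodes:
                edges.add((left, right))
    return nodes, edges


genomes = {
    "genome_a": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"],
    "genome_b": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "dX"],
}
nodes, edges = pan_genome_graph(genomes, k=6, m=2)
print(len(nodes), "nodes,", len(edges), "edges")
```

With k=6 and m=2, only the two k-mers shared by both toy genomes survive the filter, and they are joined by one edge.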
rf-graph de Bruijn “like”
Create pg-graph
Similarities and Differences
10 groups of 10
Organism Sum Pairwise Distances (Phylogenetic)
E. coli 0.07
Coxiella 10.42
Mycobacterium 2.70
Brucella 0.08
Rickettsia 8.62
Burkholderia 7.21
Clostridium 9.05
Bacillus 4.48
Staph. 2.08
Strep. 4.79
Similarities and Differences
Node Increase = (Nodes – Max(Families)) / Nodes
Diversity Score = Sum of maximum pairwise distances in order-level tree
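As a worked example of the Node Increase formula, the sketch below assumes that Max(Families) means the largest number of feature families found in any single genome of the group; the slides do not spell this out. The function name and all input numbers are hypothetical and are not results from the talk.

```python
def node_increase(node_count, families_per_genome):
    """Node Increase = (Nodes - Max(Families)) / Nodes."""
    largest = max(families_per_genome)  # assumed meaning of Max(Families)
    return (node_count - largest) / node_count


# Hypothetical group: a 5000-node graph built from three genomes.
score = node_increase(5000, [4000, 4200, 4100])
print(round(score, 2))
```

Under this reading, a score near 0 means the graph is barely larger than the biggest single genome, while a score near 1 means most nodes come from variation between genomes.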
[Figure: Node Increase vs. MUMi. x-axis: MUMi (0.850 to 1.000); y-axis: Node Increase (0 to 0.9)]
[Figure: Node Increase vs. Diversity Score. x-axis: Diversity Score (0.00 to 12.00); y-axis: Node Increase (0 to 0.9)]
MUMi = Maximum of all pairwise MUMi in a group
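The group-level MUMi defined above is a maximum over all genome pairs. A minimal sketch, assuming the pairwise MUMi values have already been computed elsewhere and are stored in a dictionary keyed by unordered pairs; the function name, genome labels and values are hypothetical.

```python
from itertools import combinations


def group_mumi(genomes, pairwise_mumi):
    """Return the maximum pairwise MUMi over every pair of genomes in the group."""
    return max(pairwise_mumi[frozenset(pair)] for pair in combinations(genomes, 2))


# Hypothetical pairwise values for a three-genome group.
pairwise = {
    frozenset(("g1", "g2")): 0.12,
    frozenset(("g1", "g3")): 0.30,
    frozenset(("g2", "g3")): 0.25,
}
print(group_mumi(["g1", "g2", "g3"], pairwise))
```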
Layout
• Gephi Toolkit
• Yifan Hu’s Multilevel
• Force Atlas 2
Colors and Lines
Dealing with many Genomes
N=2, K=5, M=2, B. abortus
N=40, K=5, M=2, B. suis
N=20, K=5, M=2, Brucella
N=400, K=5, M=2, All Brucella
N=1000, K=10, M=100, E. coli
Information Compounded
For the Web
• GEXF
– NetworkX, Gephi, Cytoscape, Gexf-JS, D3-Gexf
• BGZF GFF
– Backing store
– Byte range loading
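To show what a GEXF file for such a graph looks like, here is a minimal sketch that writes the basic GEXF 1.2 structure (nodes and directed edges) with Python's standard library only. The function name `to_gexf` is hypothetical and this is not the export code used for the talk; in practice a library such as NetworkX can write GEXF directly with `write_gexf`.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def to_gexf(nodes, edges):
    """Serialize node labels and directed (source, target) pairs to a GEXF string."""
    ids = {label: str(i) for i, label in enumerate(nodes)}
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        root, "graph", {"mode": "static", "defaultedgetype": "directed"}
    )
    nodes_el = ET.SubElement(graph, "nodes")
    for label, node_id in ids.items():
        ET.SubElement(nodes_el, "node", {"id": node_id, "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for i, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(i), "source": ids[source], "target": ids[target]},
        )
    return ET.tostring(root, encoding="unicode")


xml_text = to_gexf(["GACTGG", "ACTGGG"], [("GACTGG", "ACTGGG")])
print(xml_text)
```

The resulting text can be saved with a .gexf extension and opened in tools that read the format, such as those listed above.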
Other Uses
• “Rearrangement” detection
Other Uses
• “Scaffolding”
– e.g. 86 contigs
• Closing
– Predicted primers
Other Uses
• Rearrangements
– Insertions/Deletions
– Islands
– Inversions
Other Uses
• Synthetic BAM
Takeaways
• A new way to leverage protein family databases
• “Reference free” structure for many bacterial genomes using feature families
• Quickly investigate whole genome relationships and speed up potentially expensive calculations
Acknowledgements
• Eric Nordberg
• Lenny Heath
• CID at VBI (PATRIC)
• RAST – Argonne (PATRIC)
https://github.com/aswarren
https://twitter.com/aswarren