TRANSCRIPT
Uncovering the dynamics of genome function at the organismal scale using machine vision and computation
Tessa Durham Brooks, Ph.D. Doane College
Department of Biology
Anticipation at the dawn of the Genomic Era
“Within the next few years, technologies developed for the Human Genome Project and similar sequencing efforts will revolutionize medicine, agriculture, crimefighting, and other fields.” – Gwynne and Page, Science, 2000
Genetronic Replicator – Star Trek: TNG, 1992
The genomic powerhouses
~20,000 genes; 97 million base pairs; sequence finished 1998
~25,000 genes; 100 million base pairs; sequence finished 2000
~22,000 genes; 137 million base pairs; sequence finished 2000
For reference: humans have about 25,000 genes and 3.2 billion base pairs.
The problem: at best, functional roles have been assigned to 15% of the predicted genes in a genome
Why has determining gene function in multicellular organisms been difficult?
Our goal: Describe genome function at the organismal scale.
Requirements:
• Observations should be made at sufficiently high spatial and temporal resolution
• Methods should be relatively high-throughput to allow genomic survey
• Observations should be able to be made over time and in many environmental contexts
~25,000 genes
Root gravitropism: a model for image analysis approaches in functional genomics
[Figure: Tip Angle (deg) versus Time (h) for glr3.3-1, glr3.3-2, and SalkCol. Panels: First Order (swing rate) and Second Order (acceleration), each plotted as Scale versus Time (h), comparing glr3.3-1 vs wt and glr3.3-2 vs wt.]
Miller, Durham Brooks, and Spalding 2010, Genetics
Developing tools to detect genome function at the organismal scale.
Requirements:
Observations should be made at sufficiently high spatial and temporal resolution
• Methods should be relatively high-throughput to allow genomic survey
• Observations should be able to be made over time and in many environmental contexts
~25,000 genes
Automation and High Throughput ver. 1
Automation and High Throughput ver. 2 Doane Phytomorph
High-throughput genetic stocks
• Recombinant inbred lines (RILs)
• Ecotypes (e.g. 1001 Genomes Project)
Matthieu Reymond, Max Planck Institute
QTL Analysis
Developing tools to detect genome function at the organismal scale.
Requirements:
Observations should be made at sufficiently high spatial and temporal resolution
Method should be relatively high-throughput to allow genomic survey
• Observations should be able to be made over time and in many environmental contexts
~25,000 genes
Phenotypes are plastic
Durham Brooks, Miller, and Spalding 2010, Plant Physiology
One ecotype
[Figure: LOD by Position (cM) and Time (minutes)]
Genomics analysis in a multi-dimensional condition space
[Figure: condition grid of Seed Size (Small, Large) by Seedling Age (2d, 3d, 4d); 164 lines × 15 individuals]
Moore, et al., unpublished result
Developing tools to detect genome function at the organismal scale.
Requirements:
Observations should be made at sufficiently high resolution
Method should be relatively high-throughput to allow genomic survey
Observations should be able to be made over time and in many environmental contexts
• Cyberinfrastructure must facilitate the above
~25,000 genes
Workflow
Data Capture
Data Storage (30 TB)
Data Compression and Analysis
Feature Extraction
Data Storage (X TB)
Schorr Center and OSG
0.5 TB/day
Data Compression
QTL analysis
Data Compression
Scan Root Response
• Time series of 220 uncompressed TIFFs
• ~225 MB each
Image Compression (Condor on OSG)
• Time series of losslessly compressed PNGs
• ~195 MB each
Auto Crop & Time Stamp (Condor on OSG)
• Equalize dimensions
• Insert the time stamp into the least significant bits of the first 14 pixels of each image
Video Compression (Local Grid)
• Compress to video using the FFV1 codec
• Uses a lossless intraframe codec
• ~160 MB/frame
Ship out to UW
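The time-stamp step above can be illustrated with a minimal sketch. The slides do not give the bit layout, so this assumes one bit per pixel (a 14-bit integer stamp, most significant bit first) and a flat list of integer pixel values; the function names `embed_stamp` and `read_stamp` are hypothetical and are not taken from the actual pipeline.

```python
N_PIXELS = 14  # from the slide: the stamp occupies the first 14 pixels


def embed_stamp(pixels, stamp):
    """Return a copy of `pixels` with `stamp` written into the least
    significant bit of each of the first 14 values (assumed layout)."""
    if not 0 <= stamp < 2 ** N_PIXELS:
        raise ValueError("stamp does not fit in 14 bits")
    if len(pixels) < N_PIXELS:
        raise ValueError("image has fewer than 14 pixels")
    out = list(pixels)
    for i in range(N_PIXELS):
        # Take the bits of the stamp from most to least significant.
        bit = (stamp >> (N_PIXELS - 1 - i)) & 1
        # Clear the pixel's lowest bit, then set it to the stamp bit.
        out[i] = (out[i] & ~1) | bit
    return out


def read_stamp(pixels):
    """Recover the 14-bit stamp from the first 14 pixel values."""
    stamp = 0
    for value in pixels[:N_PIXELS]:
        stamp = (stamp << 1) | (value & 1)
    return stamp
```

Under this assumed layout each stamped pixel changes by at most one intensity level, and the stamp survives only as long as every later compression step is lossless, which is consistent with the lossless PNG and FFV1 stages listed above.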
Feature Extraction
Compressed Data
• Decompress data
• Currently using the FFV1 codec
Root Tip Identification
• Isolate and track the root tip
• Currently image curvature features are used
Machine Learning
• Isolate the plant's aerial and ground tissue from the image background
• Currently Bayesian and SVM methods work well
Tip Angle Measurement
• Linear regression on the root's meristematic tissue
Ship out to Doane
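As a rough illustration of the tip angle measurement step, the sketch below fits an ordinary least-squares line through a set of (x, y) coordinates and reports the angle of that line. It assumes the coordinates are points already extracted from the root's meristematic region and that the angle is measured from the image's x axis; the function name `tip_angle_degrees` and both conventions are assumptions, not details from the talk.

```python
import math


def tip_angle_degrees(points):
    """Angle (degrees from the x axis) of the least-squares line
    fitted through `points`, a sequence of (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        # A perfectly vertical segment has no finite slope in y-on-x
        # regression; a real pipeline would need to handle this case.
        raise ValueError("points are vertically aligned")
    slope = sxy / sxx
    return math.degrees(math.atan(slope))
```

Regressing y on x becomes unstable as the tip approaches vertical, so a production version would likely choose the regression axis from the root's orientation or use an orthogonal fit instead.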
QTL Analysis - Association of a phenotypic value (e.g. root tip angle) with a genetic element
Compile Tip Angle Data
• Find mean tip angle at each time point for each genetic line
QTL Detection (Local Grid)
• Choose detection method
• Optimize maximum likelihoods of trait data
• Determine significance (LOD)
Determine Significance Threshold (Condor on OSG)
• Permutation testing
• Haley-Knott or multiple imputation (0.24 vs 63 CPU years per time point)
• Determine max likelihood and LOD score of randomized data
Additional analysis
• Interactions, additive effects
• Plasticity QTL
• Conditional QTL (GxE effects)
Ship out to Doane
• Launch analysis locally
• Bleed out to OSG
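The permutation-testing idea can be sketched for a single time point as follows. This is a simplified stand-in, not the Haley-Knott or multiple-imputation methods named on the slide: it uses plain single-marker regression with two genotype classes, where LOD = (n/2) · log10(RSS0 / RSS1). The function names, the data layout (one genotype list per marker, one mean tip angle per line), and the default parameters are all assumptions for illustration.

```python
import math
import random


def lod_score(genotypes, phenotypes):
    """Single-marker LOD: compare a one-mean model (RSS0) with a
    per-genotype-mean model (RSS1)."""
    n = len(phenotypes)
    mean_all = sum(phenotypes) / n
    rss0 = sum((y - mean_all) ** 2 for y in phenotypes)
    if rss0 == 0:
        return 0.0  # no phenotypic variation, so nothing to explain
    rss1 = 0.0
    for g in set(genotypes):
        group = [y for gi, y in zip(genotypes, phenotypes) if gi == g]
        mean_g = sum(group) / len(group)
        rss1 += sum((y - mean_g) ** 2 for y in group)
    if rss1 == 0:
        return float("inf")  # genotype explains the trait perfectly
    return (n / 2) * math.log10(rss0 / rss1)


def permutation_threshold(markers, phenotypes, n_perm=1000, alpha=0.05, seed=0):
    """Genome-wide LOD threshold: the (1 - alpha) quantile of the
    maximum LOD over all markers across shuffled phenotypes."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)  # copy so the caller's data is untouched
    max_lods = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype link
        max_lods.append(max(lod_score(m, shuffled) for m in markers))
    max_lods.sort()
    index = min(n_perm - 1, max(0, math.ceil((1 - alpha) * n_perm) - 1))
    return max_lods[index]
```

Each permutation is independent of the others, which is why this step lends itself to being split into many small jobs on a grid such as Condor on OSG.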
Progress and Future Directions
• One small college has collected over 14,500 individual root gravitropic responses in six conditions (32 TB) from a RIL population in 6 months.
• We will finish collection from NILs (near-isogenic lines): an additional 8,700 individuals, 19 TB, in 1-2 months.
• Begin image analysis and QTL analysis – dataset opens new doors in visualizing genomes
• Interdisciplinary undergraduate training opportunities
Acknowledgments
Doane College
Mike Carpenter (CIO), David Andersen, Dan Schneider
Chris Wentworth (Physics) and Alec Engebretson (IST)
Students: Amy Craig and Brad Higgins (Physics), Autumn Longo and Grant Dewey (Biochemistry), Tracy Guy, Miles Mayer, Halie Smith, Anthony Bieck, Sarah Merithew, Devon Niewohner, Muijj Ghani, Julie Wurdeman (Biology)
University of Wisconsin
Edgar Spalding’s Laboratory: Nathan Miller, Candace Moore, Logan Johnson
Miron Livny (CHTC)
UNL – Schorr Center and HCC
David Swanson, Brian Bockelman, Carl Lundstedt
University of Florida
Mark Settles
Computing Resources
Funding
PGRP - 1031416 EPSCoR - URE NE-INBRE
Questions?