synopsis it has been estimated that at least 40% of the total human genome sequence contains the...

Post on 22-Dec-2015

TRANSCRIPT

Page 1: Synopsis It has been estimated that at least 40% of the total human genome sequence contains the integrated fragments of genomic parasites Retroviruses,
Synopsis

• It has been estimated that at least 40% of the total human genome sequence contains the integrated fragments of genomic parasites

• Retroviruses, Retrotransposons, DNA transposons, and parvoviruses can efficiently insert new sequence into the human genome

• These integrating elements can be powerful tools for discovering . . .

What genomic features affect integration?

• Each element shows a different pattern of favorable integration sites

• Favored specific nucleotide sequences can be detected in the target DNA at the point of integration for most of these elements

• Post-integration genomic DNA is harvested, and the DNA flanking the integrated element is cloned and sequenced

Intention

“Present a comprehensive statistical comparison of the factors influencing integration frequency by annotating each base pair in the human genome for its likelihood of hosting integration events”

Framework

7 types of integrating elements

17 different integration complexes (datasets)

200+ variables (genomic features)

10,000+ integration sites

Previous research provided extensive insertion site data

• HIV favors integration in active transcription units (TUs)

• MLV favors integration near gene 5' ends

• ASLV integration is mostly random, but TUs seem to be favored slightly

TUs are defined as regions of transcribed DNA

Previous research had provided extensive insertion site data

• SFV integration is mostly random, but is favored slightly near CpG islands

• SB favors integration in transcription units

• AAV-based vectors show a modest preference for regions near transcription start sites

• Experiments concerning whether LINEs prefer to integrate within TUs have been inconclusive

Some Variables (Genomic Features)

• Genes and Exons: Indicator variables for whether the site falls into a gene or an exon

• Gene or Expression Density: The number of genes or expressed genes per base pair in the region surrounding the integration site

• DNase I Site Density: The number or density of DNase I sites in regions surrounding the integration site

Some Variables (Genomic Features)

• GC Content: The GC percent in the 5 kb region containing the site

• CpG Islands: Whether the site is in a CpG island

• CpG Island Density: The number or density of CpG islands in the region surrounding the site

• Transcription Start/Stop Features: The relation of the site to transcription start/stop position
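
As a rough illustration of one feature from this list, the GC percent in a window around a site could be computed as sketched below. The function name `gc_percent`, the toy sequence, and the choices to center the window on the site and clip it at the sequence ends are assumptions made for illustration; the slides only say that a 5 kb region containing the site is used.

```python
def gc_percent(sequence: str, site: int, window: int = 5000) -> float:
    """Percent of G or C bases in a window of `window` bases around `site`.

    Centering the window on the site and clipping it at the sequence ends
    are assumptions; the source does not specify either choice.
    """
    half = window // 2
    start = max(0, site - half)
    end = min(len(sequence), site + half)
    region = sequence[start:end].upper()
    if not region:
        raise ValueError("empty window")
    gc = sum(1 for base in region if base in "GC")
    return 100.0 * gc / len(region)


# Toy example with a made-up 16-base sequence and an 8-base window.
toy = "ATGCGCGCATATATAT"
print(gc_percent(toy, site=4, window=8))
```

With a real genome, the sequence would come from a reference assembly and the window would be the 5 kb default.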

Some Variables (Genomic Features)

• Positional Weight in Flanking Sequence: The log-likelihood for integration versus control sites at each position in twenty bases of flanking sequence (10 upstream and 10 downstream), and the sum of these values

• The log-likelihood is defined as the log ratio of the frequency of each of the four bases at each position in the integration sites to the frequency in the controls
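
The log-ratio definition above can be sketched in a few lines. This is a minimal sketch, not the authors' code: the function names, the pseudocount (added to avoid taking the log of zero), the zero contribution for unknown bases, and the two-base toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_log_ratios(integration_seqs, control_seqs, pseudocount=1.0):
    """Per-position table of log(f_integration / f_control) for each base.

    All sequences must have the same length (20 in the slides: 10 bases
    upstream and 10 downstream). The pseudocount is an assumption.
    """
    length = len(integration_seqs[0])
    n_int = len(integration_seqs) + 4 * pseudocount
    n_ctl = len(control_seqs) + 4 * pseudocount
    table = []
    for pos in range(length):
        int_counts = Counter(seq[pos] for seq in integration_seqs)
        ctl_counts = Counter(seq[pos] for seq in control_seqs)
        row = {}
        for base in BASES:
            f_int = (int_counts[base] + pseudocount) / n_int
            f_ctl = (ctl_counts[base] + pseudocount) / n_ctl
            row[base] = math.log(f_int / f_ctl)
        table.append(row)
    return table


def flank_score(seq, table):
    """Sum of per-position log ratios; unknown bases contribute 0 (assumption)."""
    return sum(table[pos].get(base, 0.0) for pos, base in enumerate(seq))


# Made-up two-base flanks, purely to show the mechanics.
table = position_log_ratios(["AA", "AA", "AT"], ["CC", "GG", "TT"])
print(flank_score("AA", table), flank_score("CC", table))
```

A positive sum means the flank looks more like the integration sites than the controls; a negative sum means the opposite.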

Integration Complexes (Datasets)

Control Site Generation

Each dataset has one of two types of control:

• Matched (preferred): the integration sites were created using a restriction enzyme. The control site matches the distance from the nearest restriction site in the direction of transcription

• Random: The control site is merely a random sequence from the genome
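
The two control types could be sketched as follows. This is a simplified illustration under stated assumptions: a single linear coordinate system, one strand, restriction sites given as a sorted list of positions, and "matched" taken to mean "same distance from the closest restriction site at or before the position". The function names and toy coordinates are hypothetical and not from the paper.

```python
import bisect
import random


def offset_from_site(position, restriction_sites):
    """Distance from `position` back to the closest restriction site at or before it.

    `restriction_sites` must be sorted in ascending order.
    """
    idx = bisect.bisect_right(restriction_sites, position) - 1
    if idx < 0:
        raise ValueError("no restriction site at or before this position")
    return position - restriction_sites[idx]


def matched_control(integration_pos, restriction_sites, genome_length, rng):
    """Draw a control position with the same offset as the integration site."""
    offset = offset_from_site(integration_pos, restriction_sites)
    candidates = [
        site + offset
        for site in restriction_sites
        if site + offset < genome_length
        and site + offset != integration_pos
        # Keep only positions whose own nearest site is still `offset` away.
        and offset_from_site(site + offset, restriction_sites) == offset
    ]
    if not candidates:
        raise ValueError("no matched position available")
    return rng.choice(candidates)


def random_control(genome_length, rng):
    """Draw a control position uniformly from the genome."""
    return rng.randrange(genome_length)


sites = [100, 500, 900]  # made-up restriction site coordinates
rng = random.Random(0)
control = matched_control(530, sites, genome_length=1200, rng=rng)
print(control, offset_from_site(control, sites))
```

Matching the offset is meant to give controls the same recovery bias as the experimental sites, which is why the slides mark this type as preferred.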

The ROC Curve

• Used to analyze the effects of genomic features on integration

• Provides a measurement of a predictor variable’s ability to discriminate between two classes of events

• This measure can be interpreted as the probability that a randomly drawn integration site will have a value for its genomic feature that exceeds that of a control
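
The probability interpretation above can be computed directly by comparing every integration site with every control. This is a minimal sketch with made-up feature values; counting ties as one half is a common convention assumed here, not something the slides state.

```python
def auc_by_pairs(integration_values, control_values):
    """Fraction of (integration, control) pairs in which the integration
    site has the higher feature value; ties count as one half."""
    wins = 0.0
    for x in integration_values:
        for y in control_values:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(integration_values) * len(control_values))


# Made-up feature values for three integration sites and three controls.
integration = [0.9, 0.8, 0.4]
controls = [0.5, 0.3, 0.2]
print(auc_by_pairs(integration, controls))  # 8 of 9 pairs favor integration
```

The pairwise loop is quadratic in the number of sites, so it is only meant to make the definition concrete.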

The ROC Curve

The area under the ROC curve is taken as a measure of the association between a genomic feature and the likelihood of an integration event

The ROC Curve

The area under the curve is 1.0 when all integration events have higher values for the feature than any control event, and 0.0 for the opposite case.

The ROC Curve

Values very near 1.0 occur when higher values of the feature predict integration, and values very near 0.0 occur when lower values of the feature predict integration

The ROC Curve

When the area is 0.50, an integration site and a control are equally likely to have the higher value

Values near 0.50 are consistent with having no predictive value

ROC Curve Construction

1) Values for the integration sites are tallied to create a histogram and its upper tail areas; the upper tail areas show the fraction of integration sites (vertical axis) whose feature values exceed a given value (horizontal axis)

ROC Curve Construction

2) Repeat this same procedure using data from the control sites

3) Rotate this histogram and upper tail areas graph 90° clockwise

4) The ROC curve is constructed from the collection of true and false positive rates

ROC Curve Construction

5) For every possible cutpoint, plot the True Positive Rate on the y-axis and the False Positive Rate on the x-axis

A cutpoint is defined as any value of a predictor
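
The construction steps can be sketched as follows: sweep every cutpoint, record the fraction of integration sites (true positive rate) and of controls (false positive rate) at or above it, and integrate the resulting curve. This is a minimal sketch with made-up values; treating "at or above the cutpoint" as positive and using the trapezoid rule for the area are assumptions.

```python
def roc_points(integration_values, control_values):
    """(FPR, TPR) pairs for every cutpoint, from the strictest to the loosest.

    A site is called positive when its feature value is >= the cutpoint.
    """
    cutpoints = sorted(set(integration_values) | set(control_values), reverse=True)
    points = [(0.0, 0.0)]
    for cut in cutpoints:
        tpr = sum(v >= cut for v in integration_values) / len(integration_values)
        fpr = sum(v >= cut for v in control_values) / len(control_values)
        points.append((fpr, tpr))
    return points


def area_under(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up feature values for three integration sites and three controls.
curve = roc_points([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
print(curve)
print(area_under(curve))
```

Because lowering the cutpoint can only add positives, the points come out already ordered from (0, 0) to (1, 1).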

A Compact Representation of these Associations

• The absolute difference between the area and 0.50 is plotted

• Values around 0.0 indicate no useful predictive information in the feature

• Values near 0.50 indicate that the feature is nearly perfect in separating integration sites from the controls

Color-coded “Heat Maps”

• Color-coded heat maps are matrices displaying associations for each type of genomic feature using rows of the matrix for features and columns for data sets

Color-coded “Heat Maps”

• Bright green represents ROC curve areas near 0.0

• Black represents ROC curve areas of 0.50

• Bright red represents ROC curve areas near 1.0
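
One way to picture this color scheme is as a function from a ROC area to an RGB color. This is a minimal sketch: the linear ramp between the three anchor colors and the function name are assumptions, since the slides only give the colors at 0.0, 0.50, and 1.0.

```python
def roc_area_to_rgb(area: float) -> tuple:
    """Map a ROC area in [0, 1] to an (R, G, B) triple with channels in 0-255.

    0.0 -> bright green, 0.5 -> black, 1.0 -> bright red. The linear ramp
    in between is an assumption.
    """
    if not 0.0 <= area <= 1.0:
        raise ValueError("ROC area must lie in [0, 1]")
    # |area - 0.5| rescaled to [0, 1]: 0 = no association, 1 = perfect separation.
    strength = abs(area - 0.5) / 0.5
    level = round(255 * strength)
    return (level, 0, 0) if area > 0.5 else (0, level, 0)


for area in (0.0, 0.5, 1.0):
    print(area, roc_area_to_rgb(area))
```

Each cell of the heat map (one feature, one dataset) would get the color for its ROC area.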

Effects of Nucleotide Sequence of the 20 Base Pairs Surrounding the Point of Integration

1) To determine how important different features are in directing integration towards a region, each base in the interval is treated as the edge of an integration site

Effects of Nucleotide Sequence of the 20 Base Pairs Surrounding the Point of Integration

2) Each region is then scored for the expected number of integration events over the interval, and these interval scores are summed

Effects of Nucleotide Sequence of the 20 Base Pairs Surrounding the Point of Integration

3) The summed values are then tested for their ability to sort experimental integration sites from controls
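
Steps 1 to 3 could look roughly like the sketch below: every base in an interval is scored from its flanking sequence as if it were an integration edge, and the per-base values are summed into one interval score. The scorer `toy_log_odds` is a hypothetical stand-in for a fitted model, and converting a log-odds value to a weight with `exp` is an assumption rather than the paper's stated formula.

```python
import math


def interval_score(sequence, start, end, site_log_odds, flank=10):
    """Sum a per-base integration weight over the interval [start, end).

    Each position is treated as a potential integration edge and scored from
    its `flank` bases on either side. exp(log-odds) is used as the weight,
    which is an assumption, not the paper's stated formula.
    """
    first = max(start, flank)
    last = min(end, len(sequence) - flank)
    total = 0.0
    for pos in range(first, last):
        window = sequence[pos - flank:pos + flank]
        total += math.exp(site_log_odds(window))
    return total


def toy_log_odds(window):
    """Hypothetical scorer that favors A-rich flanks; stands in for a fitted model."""
    return math.log(window.count("A") + 1)


# Made-up 30-base sequences: an A-rich interval scores higher than a C-rich one.
print(interval_score("A" * 30, 0, 30, toy_log_odds))
print(interval_score("C" * 30, 0, 30, toy_log_odds))
```

The summed values for integration sites and controls could then be compared with a ROC area, as in step 3.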

Effects of Nucleotide Sequence of the 20 Base Pairs Surrounding the Point of Integration

Results are presented as areas under the ROC curve for this variable

(Figure axes: integrating elements and interval size)

Integration in Transcription Units and the Effect of Gene Activity

• Analysis of DNA integration within TUs and exons

• HIV: Positively correlated with TUs (red)

• Others varied from slightly negative (green) to indistinguishable from controls (black)

• This figure summarizes the effects of gene density in genomic intervals of different sizes (100 kb to 4 Mb)

– Affymetrix arrays were used for transcriptional profiling

– Expression scores for all genes in an interval are summed and divided by the interval width

• All datasets showed a weakly positive association with insertion in at least one interval, and…

– "There was no clear pattern of interval size, type of gene call, or expression level."

• Suggests that gene density features were the most significant

– Strong effects seen in HIV and MLV datasets

– Weakest response from non-dividing cells or macrophages
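
The expression-density feature described above (expression scores in an interval divided by the interval width) could be computed as in the sketch below. It assumes the scores are summed, that genes are given as (start position, expression score) pairs, and that the interval is centered on the site; the gene coordinates and scores are made up.

```python
def expression_density(genes, site, width):
    """Sum of expression scores for genes starting inside the interval
    centered on `site`, divided by the interval width in bases.

    `genes` is a list of (start_position, expression_score) pairs. Assigning
    a gene to the interval by its start position is an assumption.
    """
    half = width / 2
    lo, hi = site - half, site + half
    total = sum(score for start, score in genes if lo <= start < hi)
    return total / width


# Made-up genes: two fall within 100 kb centered on position 20,000.
genes = [(1_000, 5.0), (40_000, 2.0), (300_000, 9.0)]
print(expression_density(genes, site=20_000, width=100_000))
```

Repeating the calculation for widths from 100 kb to 4 Mb would give one feature per interval size.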

How Do G/C Content and Proximity to CpG Islands Affect Integration?

On average, high G/C content implies …

1. Gene rich

2. Short introns

3. High frequencies of Alu repeats

4. Low frequencies of LINEs

5. High frequencies of CpGs

• 2 MLV datasets where integration was positively correlated with G/C content

• 3 HIV datasets that were negatively correlated, indicating an A/T preference

• Other datasets showed weaker and less consistent responses

Whoa!? I Thought HIV Integrated in Gene-Enriched Regions?

Fig. 3 A

Fig. 4 A

A/T preference of HIV integrase-binding protein

• CpG island density – interval lengths increasing from 1 kb to 32 Mb

• Correlates with gene density

• Within short regions, proximity to CpG islands correlates with proximity to regulatory regions

• Long intervals span many genes

DNase I Cleavage Sites

• DNase I cleaves sites in chromatin that are associated with transcription factor binding, CpG islands, and gene control regions.

Integration Near Transcription Factor Binding Motifs

• Summarizes how integration is affected by its proximity to transcription factor binding sites

• TRANSFAC PWMs: each integration site or control is scored for how well it matches a PWM, and this score generates a ROC curve describing the effects of that PWM

• Lack of strength when analyzed with other factors

Proximity to Transcription Start and Stop Features

• Compares integration frequency relative to transcription start and stop positions for experimental sites and matched random controls, expressed as ROC areas (Fig. 4C)

• Boundary.dx: Distance from the 5' or 3' end

• Start.dx: Distance to the nearest gene start site

– Closer to the start (green)

• Signed.dx: High probability at the start sites (red)

• General.width: Length of introns

Improved Models Incorporating Score.20 Together with Other Genomic Features

• Score.20 was the most effective variable for differentiating site selection among the different vehicles

• Addition of other variables to accentuate our results

– Non-redundant

– Lack of correlation

Increase in ROC Area by the Addition of a Genomic Feature

• Histogram: Found little correlation of Score.20 with other features

• Predictors of integration targeting can be constructed based on Score.20 and another feature

• The fitting process leads to values that rank integration sites higher than matched random controls

Fig. 5 D

A Single Model!

• Regression models would be too complex

• Want to analyze various features

• Bayesian Model Averaging (BMA)

– Reinforces that Score.20 and other features are independent

• Models with high posterior probability were collected and used to evaluate the importance of various features

• Random sites are scored for the logarithmic odds of integration with BMA models
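
To make the model-averaging idea concrete, the sketch below weights candidate models by an approximate posterior probability and sums those weights per feature. The slides do not say how posterior probabilities were obtained; the BIC-based approximation (weight proportional to exp(-BIC/2), with equal prior probability for each model) is a commonly used stand-in and is assumed here. The BIC values and the feature label `gene_density` are made up.

```python
import math


def bma_weights(bic_values):
    """Approximate posterior model probabilities from BIC values.

    Uses w_i proportional to exp(-BIC_i / 2), assuming equal prior
    probability for every model.
    """
    best = min(bic_values)  # subtract the minimum for numerical stability
    raw = [math.exp(-(b - best) / 2.0) for b in bic_values]
    total = sum(raw)
    return [r / total for r in raw]


def inclusion_probability(feature, models, weights):
    """Total posterior weight of the models that contain `feature`."""
    return sum(w for feats, w in zip(models, weights) if feature in feats)


# Three hypothetical models, each described by the features it uses.
models = [{"Score.20"}, {"Score.20", "gene_density"}, {"gene_density"}]
weights = bma_weights([100.0, 100.0, 110.0])
print(weights)
print(inclusion_probability("Score.20", models, weights))
```

A feature that appears in most high-probability models gets an inclusion probability near 1, which is one way to rank the importance of features across models.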

Hierarchical clustering

• Major grouping of retrovirus HIV

• Among our 17 datasets, the different element types were resolved into separate branches

• Verifies that integration site selection is dominated by element-encoded recombination enzymes

What we’ve learned about each integrating element:

• HIV: Weakly favors integration near DNase I cleavage sites over long intervals, probably because both HIV insertion sites and DNase I cut sites correlate with gene-dense regions. Also shows a strong preference for A/T-rich sequences, contrary to previous presumptions correlating insertion with G/C-dense areas.

• MLV: Associations of integration with CpG islands and DNase I hypersensitive sites are amplified when a larger scale of interest is used. The influence of the local nucleotide sequence also increased with a larger interval. Strong correlation with integration near areas of gene expression.

• ASLV: Integration near DNase I sites is favored over long genomic intervals.

What genomic features influence integration of new DNA?

• HIV favors integration in active transcription units (TUs)

• MLV favors integration near gene 5' ends

• ASLV integration is mostly random, but TUs seem to be favored slightly

What we’ve learned about each integrating element:

• SFV: Cell-specific integration influences. Integration near CpG islands and proximity to DNase I cut sites are more evident in stem cells than in fibroblasts.

• SB: Contradictory results regarding proximity to CpG islands and gene density, possibly because of cell-type-specific integration influences.

• AAV: Of all vectors, integration into TUs was found to be least favored, contrary to previous mouse liver studies.

• L1: Supports previous studies suggesting strong relationships between integration sites and nucleotide sequence.

What genomic features influence integration of new DNA?

• SFV integration is mostly random, but is favored slightly near CpG islands

• SB favors integration in transcription units.

• AAV-based vectors show a modest preference for regions near transcription start sites

• Experiments concerning whether LINEs prefer to integrate within TUs have been inconclusive. Specific sequence known to have effect on integration.

For example, suppose you use a vector that you think integrates near the sequence GATTACA.

When you focus on a 20 bp segment, it can be very easy to predict where the vector will integrate.

Conversely, if that same vector is integrated into a 1 kb segment, or a 20 kb segment, or a 3-billion-base-pair segment, the integration site is going to be harder to predict.

This is especially true if other, less understood influences act in concert, as seen in our case.

Other factors were seen to increase their influence with increasing interval size, as seen in MLV and ASLV.

When asking this question, the scale of interest is very important because it can influence the results.

What genomic features influence integration of new DNA?

Future Studies

With this catalog of vector-feature interactions, we can better understand novel insertion influences as they’re identified. They can be studied and compared alongside the current comprehensive predictive models, which incorporate all currently known genomic features. In doing so, we will gain better insertion prediction abilities with each new independent genomic feature discovered.

One such new feature could be the relative locations of nucleosomes, or other epigenetic factors, such as DNA methylation or histone acetylation.

http://en.wikipedia.org/wiki/Nucleosome

Future Studies

This paper mentioned many potential future studies surrounding each individual insertion vector, for example, SB cell-specific integration and the likelihood of AAV insertion into TUs.

Many other areas of research could build upon the findings presented in this article. Stronger mathematical modeling systems could be of great value.

http://www.bioscience.heacademy.ac.uk/network/sigs/numeracy/

Future Studies

A different approach, using advances in proteomics to isolate and identify some of the functional proteins used by these potential insertion vectors, could also expand our understanding of the mechanisms involved.

A bioinformatics database could then be used to see whether any DNA-binding proteins, chromatin-related proteins, DNases, DNA ligases, etc. were found.

http://www.dartmouth.edu/~toxmetal/TXQAas.shtml

Future Studies

A second novel use of the vector-feature interaction library is as a reference for the feature in question.

If you were working with CpG islands, you could look up what kind of insertion vectors have a probability of inserting near your CpG island of interest.

http://www.pb.ethz.ch/research/chromatin_technics/TDI.jpg/image

Big Future Studies

The purpose of this research was to better understand the factors influencing various vector insertions.

This supports the hope of creating a reliable, predictable vehicle for integrating DNA elements into humans. This innovation could turn gene therapy into a plausible reality.

We need to be able to insert desired segments with pinpoint accuracy, as illustrated at the beginning of this paper. A previous study ‘successfully’ treated human X-SCID while also indirectly causing leukemia in three of the patients.

Unlike in mice, it has to work on the first try, every try.

Gene Therapy

Typically, gene therapy is most successful when used to treat a single-gene, or monogenic, genetic disorder

•Cystic Fibrosis

•Sickle Cell Anemia

•Marfan Syndrome

•Huntington’s Disease

•Hereditary Hemochromatosis

•Ornithine Transcarbamylase Deficiency (OTCD)

•X-linked Severe Combined Immunodeficiency Disease (X-SCID) "bubble baby syndrome."

For more information about gene therapy visit

http://www.ornl.gov/sci/techresources/Human_Genome/medicine/assist.shtml

http://www.annasslant.com/doctor-shot.jpg