Analysis of ChIP-Seq Data with Partek® Genomics Suite 6.6™ 1
Chapter 1 Analysis of ChIP-Seq Data with Partek
Genomics Suite™ 6.6
Overview
ChIP-Sequencing technology (ChIP-Seq) uses high-throughput DNA sequencing to map
protein-DNA interactions across the entire genome. Partek® Genomics Suite™ (PGS)
offers convenient visualization and analysis of the high volumes of data generated by ChIP-
Seq.
In this tutorial, you will go through the PGS ChIP-Seq workflow and will analyze aligned
data from a ChIP sample versus a control sample in .bam format.
This tutorial will illustrate how to:
Import ChIP-Seq data
Perform QA/QC of the samples
Detect and visualize peaks and enriched regions in the genome
Discover binding site motifs
Annotate enriched regions with overlapping genes
Visualize mapped sequence reads on the genome
Note: the workflow described is specific for PGS version 6.6. To upgrade to this version,
go to the Main menu and access Help > Check for Updates. The screenshots shown below
may vary slightly across hardware platforms and across different versions of PGS.
Description of the Data Set
The data for this tutorial are from Johnson et al. (2007), who mapped the genomic binding sites
of the NRSF (neuron-restrictive silencer factor) transcription factor across the entire
genome. It includes two samples: an NRSF-enriched ChIP sample (chip.bam) and a control
sample without immuno-enrichment (mock.bam). The chip.bam file contains almost 1.7
million mapped reads, and the mock.bam file contains approximately 2.3 million mapped
reads. These bam files contain the aligned genomic locations and sequences of the
mappable reads. This dataset contains reads from a single-end (SE) library; the differences
in processing paired-end (PE) reads will also be discussed when applicable.
Data and associated files for this tutorial can be downloaded from the Next Generation
Sequencing tab on Help > On-line Tutorials from the PGS main menu.
Import Instructions
The steps below will briefly describe how to import the mapped reads of ChIP-Seq data
into PGS.
Step 1 – Download the Data
Download and unzip the tutorial data, then save the bam files on your computer. Due
to the large file sizes associated with NGS data, it is recommended that bam files
be accessed locally (not across the network). The first time a bam file is read in by
PGS, the file will be sorted to allow faster access; therefore, you must have write
permission on the bam files and in the bam file folder.
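The write-permission requirement can be checked before starting the import. The following is a minimal sketch in Python, not part of PGS; the function name can_sort_in_place is hypothetical, and the demonstration uses an empty stand-in file rather than a real bam file.

```python
import os
import tempfile


def can_sort_in_place(bam_path):
    """Return True if both the bam file and its folder are writable."""
    folder = os.path.dirname(os.path.abspath(bam_path))
    return os.access(bam_path, os.W_OK) and os.access(folder, os.W_OK)


# Demonstration with a temporary stand-in file (not a real bam file).
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "chip.bam")
    with open(demo, "wb") as handle:
        handle.write(b"")
    print(can_sort_in_place(demo))
```

If the check fails for chip.bam or mock.bam, copy the files to a local folder that you own before importing them.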
Step 2 - Import Mapped Reads into PGS
Open the ChIP-Seq workflow within PGS by selecting it from the Workflows
drop-down in the upper right corner of the menu
Under Import from the ChIP-Seq workflow, select Import and manage samples
to invoke the Sequence Import wizard
Using the file browser on the left, navigate to the ChIP-Seq_Data folder
containing the bam files. For this tutorial, select chip.bam and mock.bam (Figure
1). Select OK
Figure 1: Selecting ChIP-Seq files. Date modified may be different than what is shown
In the Sequence Import dialog, specify the Output file, Species, and Genome
build. For this tutorial, set Species to Homo sapiens and Genome build to hg18.
The Output file will be the name of the parent spreadsheet. Select OK
The Bam Sample Manager (Figure 2) can be used to add new samples or files to the project
(Add samples), to remove samples (Remove selected samples), to associate (multiple) files
with particular samples (Manage samples), and to map the chromosome names from the
input files to the annotation files (Manage sequence names). Since none of these operations
are needed, select Close. If the bam file has not been sorted previously by PGS, you may
see the Sort bam files dialog; select OK to sort the files if this dialog box appears. While
the files are being sorted, you will see a message in the status bar at the bottom of the
window.
Figure 2: Bam Sample Manager used to add or remove additional bam files to the
experiment
The resulting spreadsheet is shown in Figure 3. Each sample will be on one row.
The number of aligned reads per sample is shown in column 2. The import
process is now finished
Figure 3: Viewing the spreadsheet after import. Each row contains a sample
Quality Control of Samples
In addition to any quality control that may have been performed when the data was
sequenced, it is a good idea to check the quality of the samples using PGS before analyzing
the data.
Examining the Distribution of Reads
BAM files contain both aligned and unaligned reads. The top-level spreadsheet in Figure 3
shows the number of reads that were aligned to the reference genome. A large number of
unaligned reads may be the result of poor quality sequence data or alignment problems
(wrong genome, alignment settings, etc.). You might also be interested in knowing how
many reads map to more than one location in the genome (if the aligner options supported
multiple-mapped reads).
In the QA/QC section of the ChIP-Seq workflow, select Alignments per read
A new spreadsheet called Alignment_Counts is generated (Figure 4). The titles of
columns 2 and 3 indicate that this is single-ended data. Column 2 shows the
number of unaligned reads (0 alignments per read), and column 3 shows the
number of reads that align exactly once to the genome (1 alignment per read). If
the BAM files had contained reads that mapped to more than one location in the
genome, these would be shown after column 3
Figure 4: Alignment_Counts spreadsheet. The unaligned reads had been removed from
these BAM files and the alignment options did not permit more than one mapping location
per read
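The idea behind the Alignment_Counts spreadsheet can be sketched in a few lines of Python. This is only an illustration of the tally, not the PGS implementation; the input (a mapping from read name to number of reported alignments) and the toy values are assumptions made for the example.

```python
from collections import Counter


def alignments_per_read(alignment_counts):
    """Tally how many reads have 0, 1, 2, ... alignments.

    alignment_counts maps a read name to its number of reported alignments.
    """
    histogram = Counter(alignment_counts.values())
    return dict(sorted(histogram.items()))


# Toy input: one unaligned read, three uniquely aligned reads,
# and one read that aligns to two locations.
toy_reads = {"read1": 1, "read2": 1, "read3": 0, "read4": 2, "read5": 1}
print(alignments_per_read(toy_reads))  # {0: 1, 1: 3, 2: 1}
```

For the tutorial data, only the "1 alignment per read" entry would be non-zero, because unaligned reads were removed and multiple mappings were not permitted.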
Strand Cross-Correlation
In short-read ChIP-Seq data, peaks are found upstream of the actual DNA-binding site
(upstream on both strands). In a good quality ChIP-Seq sample, the peaks on the forward
strand and the reverse strand are offset (phase-shifted) by the size of the “effective
fragment length.” The effective fragment length tends to be shorter than the length of the
fragmented DNA, the length of the size selection, and the pull-down length. Strand Cross-
Correlation calculates the correlation of the strand-specific read densities; the maximum
correlation should occur at the average size of the peak shift across all chromosomes.
For single-end reads, PGS will calculate the phase shift between the reads on the forward
strand and reads on the reverse strand using the method (Pearson cross-correlation)
described by Kharchenko et al. (2008). Note: the estimation of effective fragment length
for single-end reads can only be done on IP samples and not on mock controls since non-
enriched samples do not contain a phase shift. For paired-end reads, Strand Cross-
Correlation is calculated from the distribution of fragment lengths between the paired-ends
of the two reads.
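The single-end calculation can be illustrated with a short Python sketch. It assumes per-base read-start densities for each strand are already available as two equally long lists, and it correlates forward[i] with reverse[i + shift] for each candidate shift. It is a simplified illustration of a Pearson cross-correlation and not the PGS or Kharchenko et al. implementation; the toy densities are invented for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def strand_cross_correlation(forward, reverse, max_shift):
    """Correlate forward[i] with reverse[i + shift] for shift = 0..max_shift."""
    profile = {}
    for shift in range(max_shift + 1):
        n = len(forward) - shift
        profile[shift] = pearson(forward[:n], reverse[shift:shift + n])
    return profile


# Toy densities: each forward-strand pile-up has a matching
# reverse-strand pile-up 7 positions downstream.
length = 60
forward = [0] * length
reverse = [0] * length
for pos in (10, 30):
    forward[pos] += 5
    reverse[pos + 7] += 5

profile = strand_cross_correlation(forward, reverse, max_shift=20)
best_shift = max(profile, key=profile.get)
print(best_shift)  # 7 for this toy profile
```

The shift with the highest correlation plays the role of the effective fragment length estimate described above.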
Under QA/QC from the ChIP-Seq Workflow, select Strand Cross-Correlation. If
you have not run this step previously, you will be asked if you would like to
create a new QA/QC child spreadsheet. If prompted, select Yes
After running Strand Cross-Correlation from the QA/QC workflow, the Strand
Separation of Samples viewer will appear (Figure 5)
Figure 5: Viewing the Strand-Cross Correlation plot to estimate effective fragment length
In Figure 5, the x-axis represents the phase-shift, and the y-axis represents the Pearson
correlation of the strand densities of the forward and reverse strands. Notice in the IP
sample, the peak occurs at 111 bp, corresponding to an average effective fragment length
of 111 base pairs. The peak location can be determined by examining the values in the
strand_correlation spreadsheet, by mousing over the peak in the graph, or by sorting the
data in the spreadsheet.
The control sample (blue) does not have a similar peak because it does not have the phase-
shift property of IP samples. The control sample does have a small peak at 26 bp which
corresponds to the sequencing read length. This is probably due to the fact that some
regions in the genome of the control sample contain many reads stacked up on each other
which will create a correlation peak when the forward and reverse strands are shifted by the
length of the reads. At the sequencing read length, the IP-sample will show a strand cross-
correlation near 0.
The location and magnitude of the peaks in the cross-correlation plot can be used as a
measure of the quality of the enriched sample. Figure 5 shows a highly enriched sample
because the peak at 111 bp dominates the peak at the read length. If the dominant peak in
the IP-enriched sample occurs at the read length instead, the sample is likely poorly enriched or
contains very few binding sites. The plot in Figure 6 shows two IP samples with medium-level
enrichment. Multiple dominant peaks in the IP sample may indicate that there are several
populations of DNA fragment lengths, which will complicate peak calling (Kundaje 2010).
Figure 6: Example of medium-level enriched samples
Detecting Peaks and Enriched Regions
Regions that contain a binding site for the DNA-binding protein of interest will have many
sequence reads mapped to them. Since single-end reads only cover one end of a sequence
fragment, enriched regions will generally show two adjacent peaks. PGS will directionally
extend each SE read in the 3’ direction by the fragment length (extended reads) to facilitate
merging adjacent peaks into a single peak. For PE reads, the fragment length is defined
from the start of the 5’ end of the first read through the 3’ end of its paired read. For peak
detection, PGS divides the genome into windows (bins) of a user-defined size and counts
the number of (midpoints of) the reads that fall within each bin. PGS fits a zero-truncated
negative binomial distribution to the bin counts and reports all regions that are significant at a
user-defined false discovery rate (FDR). See the ChIP-Seq white paper for more information on the
peak-finding algorithm and tips for setting the Fragment extension and window sizes.
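The read extension and binning steps described above can be sketched as follows. This Python example is an illustration only: it assumes half-open [start, end) coordinates on a single chromosome, a strand given as "+" or "-", and hypothetical function names. The negative binomial fit and the FDR cut-off are not reproduced here.

```python
from collections import Counter


def extended_interval(start, end, strand, fragment_length):
    """Extend a single-end read in its 3' direction to the fragment length.

    Coordinates are half-open [start, end) on the forward reference axis.
    """
    if strand == "+":
        return start, max(end, start + fragment_length)
    return max(0, min(start, end - fragment_length)), end


def bin_counts(reads, fragment_length, window_size):
    """Count extended-read midpoints per window (bin index -> count)."""
    counts = Counter()
    for start, end, strand in reads:
        ext_start, ext_end = extended_interval(start, end, strand, fragment_length)
        midpoint = (ext_start + ext_end) // 2
        counts[midpoint // window_size] += 1
    return dict(counts)


# Toy reads of 26 bp: a forward/reverse pair flanking one site,
# plus an isolated forward read elsewhere.
toy_reads = [(1000, 1026, "+"), (1084, 1110, "-"), (5000, 5026, "+")]
print(bin_counts(toy_reads, fragment_length=110, window_size=110))  # {9: 2, 45: 1}
```

After extension, the forward and reverse reads around the same site cover the same interval, so their midpoints fall in the same window. This is how two adjacent strand-specific peaks merge into one.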
Under Peak Analysis of the ChIP-Seq workflow, select Detect peaks. The Detect
peaks dialog will appear (Figure 7)
Specify the Fragment Extensions by setting the Maximum average fragment size
to 110. Maximum average fragment size is based on your experimental design: the
size of the fragment pulled-down in the immunoprecipitation step, the size used
during DNA fragmentation, the fragment length used for size selection, or the
effective fragment length. If you have used an antibody that binds the DNA as the
control antibody (rather than no-enrichment as the control), you could use
different fragment lengths for both samples with the Individual maximum
fragment sizes radio button. For experiments using a mock control (no
enrichment), use Maximum average fragment size
As this example uses the mock sample as the reference, select mock in the drop-
down list under Reference sample
The peak detection algorithm will divide the genome into windows and find
windows that are enriched with reads based on the FDR value. Set the Window
Size (base pairs) to 110
Peak Cut-off FDR determines the cut-off for the significance of peaks in the chip
sample. Lower cut-off values imply greater differences between the chip and
mock peaks; higher cut-offs lessen the difference in peak heights between the chip
and mock samples. Set the Peak Cut-off FDR to 1 false positive in 1000 (0.001)
Leave the remaining parameters with the default values and select OK
Note: As transcription factor binding sites tend to have localized and sharp clusters of
reads, the window size used during the analysis of a transcription factor study can be
left relatively small (approximately the same as the average fragment length), and the
option to allow for gaps between enriched windows need not be used. Subsequently,
in the Results reporting section, the Region in the window with most reads could also
be selected. Histone modification peaks, on the other hand, tend to be subtle, diffuse,
and spread-out. For that type of analysis, larger windows might be more suitable, and
neighboring windows may be combined (Within a gap distance of option) into larger
windows (under Window size and Results reporting, respectively). The exact settings
depend on the data and the experiment design, so fine tuning is recommended.
The More info link at the top of the dialog box displays a figure which demonstrates
the relationship between window and gap size. Try changing the How should windows
be merged or the Which regions should be reported? options; the blue bar underneath
each figure will reflect how regions are detected and reported with these settings.
Figure 7: Configuring the Peak Detection dialog
Figure 8: Viewing the detected peaks in the samples
The resulting spreadsheet (Figure 8) will appear. The spreadsheet is sorted by chromosome
number and genomic location. Each row represents one genomic region of peak enrichment
whereas the columns are:
1. Chromosome: Chromosome of region
2. Start: Start of region (inclusive)
3. Stop: End of region (exclusive)
4. Sample ID: The sample containing the enriched region
5. Interval Length: length of region, i.e., Stop – Start, in base pairs
6. Maximum Extended Reads in Window: The greatest number of (extended) reads in one of
the windows of a ChIP-Seq region
7. Reads per Million (RPM): Column 6 divided by the total number of aligned reads in the
sample (in millions). This column will help you compare peaks across samples, especially
when there is a large difference in the number of aligned reads between samples
8. Mann-Whitney p-value: Identifies separation between forward and reverse peaks for
single-end reads using the Mann-Whitney U-test. Lower p-values indicate better
separation. This p-value can be used when there was no control sample or to eliminate
reads due to PCR bias
9.-10. Total reads in region: Total number of (non-extended) reads for each sample (chip
and mock, respectively) in the given genomic region
11. p-value(Sample ID vs. mock): Compares each sample to the reference (mock in this
example) using a one-tailed binomial test. A low p-value means there are significantly
more reads in the sample specified in column 4 (that is, for each region) than in the mock
sample. This column is only included if a reference sample is specified in the Peak
Detection dialog (Figure 7)
12. scaled fold change (Sample ID vs. mock): Compares the signal intensity of each
sample (specified in column 4) with that of the reference sample (mock in this example). The
fold change is scaled by the ratio of the number of reads for each sample (IP vs. control) on a
per-chromosome basis. Scaled fold changes > 1 indicate more enrichment in the IP sample than
in the control sample. This column is only included if a reference sample is specified in the
Peak Detection dialog
13.-14. <Sample> overlap percent: Fraction of called region that overlaps a region from
the given sample where <Sample> is the name of the Sample ID in column 4. For example,
the values of 100% in column 13 and 0% in column 14 point to regions detected in the chip
sample, but not in the mock sample. Similarly, regions with the value of 100% in column
14 were detected in the mock sample (and thus might be excluded from downstream
analyses)
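Two of the columns above are simple arithmetic and can be reproduced by hand. The Python sketch below shows the Interval Length (column 5) and Reads per Million (column 7) calculations as they are described in the list. The function names, the peak coordinates, and the peak height of 85 reads are made up for illustration; the sample totals are the approximate read counts quoted for chip.bam and mock.bam.

```python
def interval_length(start, stop):
    """Length of a region with inclusive Start and exclusive Stop."""
    return stop - start


def reads_per_million(max_reads_in_window, total_aligned_reads):
    """Column 6 divided by the total number of aligned reads, in millions."""
    return max_reads_in_window / (total_aligned_reads / 1_000_000)


# Hypothetical peak with 85 extended reads in its tallest window.
print(interval_length(15_000, 15_330))              # 330
print(round(reads_per_million(85, 1_700_000), 2))   # about 50 RPM in a 1.7 M read sample
print(round(reads_per_million(85, 2_300_000), 2))   # about 37 RPM in a 2.3 M read sample
```

The same raw peak height corresponds to a lower RPM in the deeper sample, which is why RPM is the more appropriate value for comparing peaks across samples.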
Create a list of enriched regions
You have just created a list of peaks found in both samples. In this section, you will create
a list that filters out peaks detected in the chip sample that also occur in the control (mock)
sample. This list will be used to search for motif binding sites.
Under Peak Analysis of the ChIP-Seq workflow, select Create a list of enriched
regions. The regions found in the IP sample that do not have many reads in the
control sample are of most interest. Use the List Creator functions to filter out
regions that have a high number of reads in the control (mock) sample by using
the p-value against the control
Select Specify New Criteria. Give the new criteria a name such as p-value
filtered, select the 1/regions (peaks) Spreadsheet, and choose Column 11.
p-value(Sample ID vs. mock). Include p-values such that the comparison of the number
of reads in the sample against the control is significant at the 0.05 level by
choosing significant with FDR of 0.05. The dialog should look like Figure 9.
Select OK
Figure 9: Configure criteria dialog to filter out peaks that occur in the control sample
Before closing the List Creator dialog, Save the list you just created. The
spreadsheet should have 2473 rows. The resulting regions are those that have
significantly more reads in the chip sample than in the mock sample. Select Close
to exit the dialog
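To see what an FDR-based filter on a p-value column does, consider the following Python sketch. It assumes the Benjamini-Hochberg step-up procedure; the tutorial does not state which FDR method the List Creator applies, so treat this as an illustration of the concept rather than a description of PGS. The p-values are invented.

```python
def benjamini_hochberg_mask(p_values, fdr=0.05):
    """Mark p-values that pass the Benjamini-Hochberg step-up procedure.

    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # The largest rank whose p-value is under its threshold sets the cut-off.
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    keep = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[index] = True
    return keep


# Hypothetical p-values (sample vs. mock) for five regions.
toy_p_values = [0.001, 0.045, 0.025, 0.2, 0.008]
print(benjamini_hochberg_mask(toy_p_values))  # [True, False, True, False, True]
```

Note that the region with p = 0.045 is rejected even though it is below 0.05: an FDR filter is stricter than a plain p-value cut-off when many regions are tested.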
Other List Creator operations (Figure 10) like the Venn Diagram and Union (Or) or
Intersection (And) of the lists could also be performed to create a list of “true” enriched
peaks. For instance, you could filter on the intersection between FDR and Peaks not in
mock or you may choose to filter by scaled fold change or apply a minimum number of
reads per million (RPM). The choice of how to create a list of “true” peaks is up to you and
may be different for different kinds of experimental designs.
Figure 10: List Creator commands
de novo Motif Discovery and Motif Search
Now that you have a list of enriched regions, you will learn how to find recurring patterns
or motifs in these regions. A transcription factor can bind to many sites throughout the
genome. These sites usually share a certain pattern in their sequences (consensus
sequence). By searching for these binding site motifs, you can determine the binding site
pattern and the locations of binding in the genome. PGS detects de novo motifs using the
Gibbs motif sampler (Neuwald et al., 1995).
A database of known transcription factor motifs such as JASPAR (http://jaspar.cgb.ki.se/) can be
searched, or de novo motifs may be identified using only the sequences from the enriched
regions.
Step 1 – de novo Motif Discovery
Under Peak Analysis, select Motif discovery. The two options for motif
discovery, Discover de novo motifs and Search for known motifs, will be
discussed separately
Select Discover de novo motifs and OK
Choose 1/p-value_filtered as the Spreadsheet with genomic regions. Use the
default settings: Number of Motifs 1, Discover motifs of length between 6 and 16
base pairs, and Result file: Motifs. Select OK. If the reference genome has not
been previously downloaded onto this computer, you may be asked if you would
like to download the .2bit reference genome. If prompted, select Automatically
download a .2bit file and OK if PGS is able to connect to the Internet properly. If
you do not have an Internet connection, choose one of the other two options:
(Manually specify a .2bit file or Create a .2bit file from reference fasta files). The
reference genome is required for determining which genes overlap the enriched
peak regions and for displaying the aligned sequences
A motif visualization plot (Figure 11) and two spreadsheets will be generated.
One spreadsheet, motifs (Motifs), contains information about the motif, and the
other, instances (Motifs_instances.txt), lists the genomic locations of the motif. If
your motif does not look exactly like Figure 11, select the Reverse button, which
will give you the reverse complement of the motif
Figure 11: Viewing the binding site motif for NRSF. Use the yellow arrows in the upper
right to cycle through views of all the motifs found (if more than one was found)
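The effect of the Reverse button can be reproduced with a short Python sketch. The function below is a generic reverse complement that leaves the dots of a consensus string untouched; it is an illustration, not PGS code. The input is the NRSF consensus sequence found in this tutorial.

```python
# Translation table for the four bases; any other character (such as ".")
# is left unchanged by str.translate.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("CAG.ACC..GGA.AG"))  # CT.TCC..GGT.CTG
```

Both strings describe the same binding site read from opposite strands, so either orientation of the logo is a valid result of the motif search.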
Description of Motif Output
Sequence Logo window
The Sequence Logo window (Figure 11) graphically displays the best motif found in the
peak regions of the data. In this case, the motif finder discovered a motif in the NRSF-
enriched regions that is 15 base pairs in length. The height of each position is the relative
entropy (in bits) and indicates the importance of a base at a particular location in the
binding site. The title CAG.ACC..GGA.AG is the consensus sequence for the sequence
logo. Dots represent positions that contain more than one base across all reads in the motif.
The dots can be replaced with letters by checking the Show nucleotide codes checkbox;
doing so will give characters representing the possible bases at that position. For a
description of the IUPAC nucleotide codes, please visit:
http://www.bioinformatics.org/sms/iupac.html.
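The height of a logo position and the dotted consensus string can both be derived from per-position base counts. The Python sketch below assumes a uniform background (0.25 per base) unless another background is supplied, and it uses invented counts; the exact background and any small-sample correction used by PGS are not stated in this tutorial.

```python
import math

BASES = "ACGT"


def relative_entropy_bits(counts, background=None):
    """Relative entropy (in bits) of one motif position versus a background."""
    background = background or {base: 0.25 for base in BASES}
    total = sum(counts.values())
    bits = 0.0
    for base in BASES:
        p = counts.get(base, 0) / total
        if p > 0:
            bits += p * math.log2(p / background[base])
    return bits


def consensus(count_columns):
    """One letter per position; a dot where more than one base was observed."""
    letters = []
    for counts in count_columns:
        present = [base for base in BASES if counts.get(base, 0) > 0]
        letters.append(present[0] if len(present) == 1 else ".")
    return "".join(letters)


# Toy motif of three positions: two invariant, one split between G and T.
toy_columns = [{"C": 10}, {"A": 10}, {"G": 5, "T": 5}]
print(consensus(toy_columns))                                        # CA.
print([round(relative_entropy_bits(c), 3) for c in toy_columns])     # [2.0, 2.0, 1.0]
```

An invariant position reaches the maximum of 2 bits under a uniform background, while a position split evenly between two bases carries 1 bit and appears as a dot in the consensus.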
Motifs spreadsheet
The motif information spreadsheet (Figure 12), entitled Motifs, lists the information about
the motif that was visualized using the sequence logo. This includes the Counts of bases in
each position of the pattern (column 1), the Consensus Sequence (column 2), the Motif ID
(column 3), the Log Likelihood Ratio of the motif (column 4), and the Background
frequency of each of the bases in all of the sequences of that motif. The Log Likelihood
Ratio scores the relative likelihood that the found pattern did not occur by chance.
Figure 12: Viewing the motif spreadsheet
You can (re)display the Sequence Logo of the motif by right clicking on a row header and
selecting Logo View. If more than one motif was found (in the de novo motif dialog, you
only requested one motif to be found), then the yellow arrows shown in Figure 11 may be
used to cycle through the motifs.
Motif_instances spreadsheet
The Motif_instances spreadsheet (Figure 13), a child of the Motifs spreadsheet and entitled
instances, details all of the locations of the motif(s) in the enriched regions. Each row lists
a putative binding site for a motif. The genomic location is given (chromosome, start, end,
and strand), along with the Motif ID, the sequence found at that location, and a score of
how likely that site is part of the motif. The list is sorted in order of descending score. The
larger the score, the more likely the site is a true instance of the motif.
Figure 13: Viewing the motif instances spreadsheet
Step 2 – Search JASPAR for Known Motifs
Repeat the Motif discovery step; however, select the Search for known motifs
radio button and OK. This will search the JASPAR database for motifs that are
over-represented (more than by chance) in the list of sequences in the significant
regions list. The JASPAR database will download automatically if needed during
the Search for known motifs step. Downloading the JASPAR database will create
a spreadsheet in your experiment named JASPAR.txt that contains all of the
species-specific motifs in the database. Visualization of the motifs is done by
right-clicking on a row in the JASPAR.txt spreadsheet and selecting Logo View.
The yellow arrows in the upper right corner may be used to cycle
through visualization of the motifs in the JASPAR database
The motif search should be performed on the p-value_filtered list. You may
search for a particular element in the database or all of the elements in the
database. For this tutorial, use the defaults and search for all of the motifs listed in
JASPAR database (Figure 14). Select OK
Alternatively, you can also search the list of sequences for a single motif specified by a
valid nucleotide sequence (Search for motif) or, if you want to look for several motifs, you can
import them as a tab-delimited list (Import motifs from text file).
This feature may also be used to import motifs from other databases to which you have
access (TRANSFAC®, custom database, etc.). Use the help button for the specification of
the format of the text file. Sequence Quality value is a number between 0 and 1 and
indicates how closely a sequence must match the pattern for it to be called an instance of
the pattern. The higher the value, the closer it must match the pattern to be called.
Figure 14: Search for JASPAR Motifs in Sequences dialog
Two resulting spreadsheets, similar to the spreadsheets in the de novo motif
discovery step, will be generated, the motif_summary (MotifSearch) spreadsheet
(Figure 15) and the motif_instances (MotifSearch.instance) spreadsheet
Sort the motif_summary spreadsheet by p-value by right-clicking on the p-value
column and selecting Sort Ascending
Figure 15: Motif_summary spreadsheet. Each motif from the JASPAR database (or other
input database used) will be shown. Probability of Occurrence (column 2) is the
probability of detecting a false positive for this motif in a random DNA sequence. Expected
Number of Occurrences (column 3) is the Probability of Occurrence times the total length
of the reads. Actual Number of Occurrences (column 4) is the count of sequences that
match the known motif in the reads. P-value (column 5) is the uncorrected p-value
(binomial test)
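The relationship between columns 2 to 5 can be sketched in Python. The example computes the expected number of occurrences as described in the caption and an upper-tail binomial probability. It assumes that the number of trials equals the total sequence length and that the p-value is the probability of seeing at least the observed count; the tutorial names the test but not these details, and the numbers are invented.

```python
import math


def expected_occurrences(probability, total_length):
    """Probability of Occurrence times the total length searched."""
    return probability * total_length


def binomial_upper_tail(observed, trials, probability):
    """P(X >= observed) for X ~ Binomial(trials, probability), 0 < probability < 1."""
    if observed <= 0:
        return 1.0
    lower = 0.0
    for k in range(observed):
        # Log-space binomial probability mass to avoid overflow for large trials.
        log_pmf = (math.lgamma(trials + 1) - math.lgamma(k + 1)
                   - math.lgamma(trials - k + 1)
                   + k * math.log(probability)
                   + (trials - k) * math.log1p(-probability))
        lower += math.exp(log_pmf)
    return max(0.0, 1.0 - lower)


# Hypothetical motif: chance match probability 1e-4 per position,
# 10,000 positions searched, 5 matches observed.
print(expected_occurrences(1e-4, 10_000))        # about 1 expected by chance
print(binomial_upper_tail(5, 10_000, 1e-4))      # a small tail probability, well below 0.01
```

When the observed count is far above the expected count, this subtraction underflows to 0 in floating point, so a reported p-value of 0 should be read as "smaller than can be represented" rather than exactly zero.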
As you can see in Figure 15, REST (another name for NRSF) is at the top of the list. The
spreadsheet indicates that the expected number of by-chance occurrences of the
NRSF/REST motif is less than 1, but in fact, 1071 occurrences of the motif were observed,
resulting in a very low p-value (0). This motif agrees with the motif found in the de novo
motif detection step. Interestingly, other motifs appear a significant number of times in the
ChIP-Seq peaks and may represent possible co-factors.
The motif_instances spreadsheet contains all instances of the motifs (with actual counts >0)
from the motif_summary spreadsheet.
Generating a list of regions containing the REST motif (Optional)
Because the motif_instances spreadsheet contains every instance of every motif identified,
you may wish to create a spreadsheet of just the REST instances that contains the locations
of each of the 1071 instances of the REST motif.
Select the motif_instances (MotifSearch.instance) spreadsheet in the
spreadsheet navigator
Select the Motif Name column header (column 5) in the spreadsheet
Right-click and select Find / Replace / Select as shown in Figure 16
Figure 16: Finding all REST peaks (step 1)
In the next dialog, at Find What, type in REST and choose Select All at the
bottom of the screen. This finds and selects the 1071 instances of the REST motif
as shown in Figure 17
Figure 17: Selecting all REST instances in motif_instances spreadsheet (step 2)
Close the dialog. You will notice that in the original spreadsheet, the focus has
shifted so that row 12848 is now highlighted and visible in the view.
Right-click on row 12848 and select Filter Include (Figure 18)
Figure 18: Including all REST instances that were identified by Find / Replace / Select
Notice now that the motif_instances spreadsheet has 1071 rows and that a filter
has been applied (Figure 19)
Figure 19: Filtered motif_instances spreadsheet contains 1071 REST instances. The black
and yellow bar at the far right shows that a filter has been applied to this spreadsheet
Filters are very powerful but will slow down spreadsheet operations on the original list.
Furthermore, the filter operation does not create a brand new spreadsheet. In order to create
a spreadsheet that only contains the REST instances, it is necessary to clone the original
spreadsheet with the filter applied, save the clone with a new name, and clear the filter
from the original spreadsheet.
Right click on motif_instances in the spreadsheet navigator and then select
Clone…
In the Clone Spreadsheet dialog, type REST for Name of resulting copy and
select 1/p-value_filtered/motif_summary (MotifSearch) from the pull-down
menu of Create as a child of spreadsheet
This creates a new spreadsheet in the spreadsheet navigator that has not been
saved (there is an * after the spreadsheet name). Save the spreadsheet by right-
clicking on the spreadsheet and selecting Save As… and type in REST as the File
name
To remove the filter from the original spreadsheet, right-click on motif_instances
(MotifSearch) in the spreadsheet navigator. Notice the yellow/black bar on the
right (also shown in Figure 19). Right-click anywhere in the yellow/black bar and
select Clear Filter. Now both the original spreadsheet and the REST spreadsheet
exist without filters
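The filter, clone, and clear-filter sequence above amounts to making an independent, filtered copy while leaving the original table intact. If the motif instances are exported to a text file, the same result can be obtained outside PGS. The Python sketch below works on a list of dictionaries; the column names and toy rows are hypothetical, and exact matching of the motif name is an assumption.

```python
import copy


def clone_filtered(rows, motif_name):
    """Return an independent copy of the rows whose motif name matches."""
    return [copy.deepcopy(row) for row in rows if row["motif_name"] == motif_name]


# Toy stand-in for the motif_instances spreadsheet.
motif_instances = [
    {"chromosome": "chr1", "start": 1200, "stop": 1221, "motif_name": "REST"},
    {"chromosome": "chr1", "start": 5400, "stop": 5410, "motif_name": "SP1"},
    {"chromosome": "chr2", "start": 880, "stop": 901, "motif_name": "REST"},
]

rest_instances = clone_filtered(motif_instances, "REST")
print(len(rest_instances), len(motif_instances))  # 2 3
```

Because the rows are copied, later edits to the REST list do not alter the original table, which mirrors the behavior of a saved clone in the spreadsheet navigator.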
Finding Nearest Genomic Features
In this section, you will learn how to find genomic features (genes) that are near the IP-
enriched regions of the data. You will also learn how to classify the peak locations by gene
section (5’ UTR, 3’ UTR, Promoter, CDS).
Step 1 – Specify the Database
Make sure the spreadsheet that you want to overlap with genes is active. In this
case, you want to detect overlaps on the p-value_filtered spreadsheet, so select the
p-value_filtered spreadsheet
Under Peak Analysis, select Find nearest genomic feature. A dialog, similar to
Figure 20, will appear. Select RefSeq Transcripts. A download of the database
will be started if this information has not previously been downloaded onto your
computer. Leave the promoter region boundaries as default and select OK
Figure 20: Configuring the dialog for finding genes that overlap enriched regions of the
data
Step 2 – View the List of Nearest Genomic Features
The resulting spreadsheet (gene-list) (Figure 21) is a child of the p-value_filtered
spreadsheet. Each row represents a transcript with Transcript ID (column 5), Gene Symbol
(column 6), and genomic location of the transcript (columns 1-3). Distance to TSS (column
7) gives distance of each enriched region to the transcription start site (in base pairs;
positive means downstream and negative means upstream). Overlap with gene and region
are given in columns 8 and 9, respectively. Columns 10 and greater were already discussed
in the Detecting Peaks and Enriched Regions section.
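The sign convention of Distance to TSS can be illustrated with a small Python sketch. It assumes that the distance is measured from the midpoint of the enriched region and that the TSS is the transcript start on the forward strand and the transcript end on the reverse strand. The tutorial does not specify which point of the region PGS uses, so these choices and the coordinates are assumptions.

```python
def tss_position(tx_start, tx_stop, strand):
    """Transcription start site: left end on '+', right end on '-'."""
    return tx_start if strand == "+" else tx_stop


def distance_to_tss(region_start, region_stop, tss, strand):
    """Signed distance in bp: positive is downstream, negative is upstream."""
    region_mid = (region_start + region_stop) // 2
    offset = region_mid - tss
    return offset if strand == "+" else -offset


# Region at 1000-1100 (midpoint 1050).
plus_tss = tss_position(1500, 4000, "+")
minus_tss = tss_position(200, 900, "-")
print(distance_to_tss(1000, 1100, plus_tss, "+"))   # -450 (upstream of a forward-strand gene)
print(distance_to_tss(1000, 1100, minus_tss, "-"))  # -150 (upstream of a reverse-strand gene)
```

On the reverse strand the sign is flipped because transcription proceeds toward lower coordinates, so a region at a higher coordinate than the TSS lies upstream.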
Note: Percent overlap with gene is more likely to be high (close to 1) in cases where one
region covers several genes (for example, histone studies). Percent overlap with region is
likely to be high (close to 1) if a region is relatively small and is found completely within a
gene (for example, transcription factor binding studies). If both columns are close to 1, then
the gene and the region have nearly the same start and stop locations. If both columns are
small (close to 0), then the region does not overlap the gene directly and
likely covers only the promoter region.
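The two overlap columns can be reproduced with a few lines of Python. The sketch below assumes half-open (start, stop) coordinates on the same chromosome and returns both values as fractions between 0 and 1; the function name and the toy coordinates are hypothetical.

```python
def overlap_fractions(region, gene):
    """Return (overlap with gene, overlap with region) as fractions.

    Both arguments are (start, stop) pairs with an exclusive stop.
    """
    region_start, region_stop = region
    gene_start, gene_stop = gene
    shared = max(0, min(region_stop, gene_stop) - max(region_start, gene_start))
    return shared / (gene_stop - gene_start), shared / (region_stop - region_start)


# Small region inside a gene (typical of a transcription factor study).
print(overlap_fractions((1000, 1100), (500, 2500)))   # (0.05, 1.0)
# Broad region covering a whole gene (typical of a histone study).
print(overlap_fractions((0, 10000), (2000, 3000)))    # (1.0, 0.1)
```

The two toy cases correspond to the first two situations described in the note: a high overlap with region for a small peak inside a gene, and a high overlap with gene for a broad region that spans it.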
Another way to interpret the percent overlap with region and percent overlap with gene is
to use Peak Analysis > Classify regions by gene section. This step is left for you to try on
your own (the input should be a region list or filtered region list).
Figure 21: Identifying closest genomic features to regions spreadsheet
Visualize Reads and Enriched Regions
You have gone through the steps for importing data and detecting TF-enriched regions and
have identified potential binding sites within these enriched regions. You might explore the
functions of the genes these binding sites regulate by using other Biological Interpretation
tools like GO Enrichment or Pathway Analysis which are discussed in other tutorials.
This section explains how to view the ChIP-Seq data using the PGS Genome Viewer.
For more information about the viewer, see NGS Chromosome Viewer.
Step 1 – Load the Data into the Viewer
Select the parent spreadsheet (WoldChipSeqBamFiles) containing the list of
samples in two rows: one for the chip sample and one for the mock sample
Under Visualization on the ChIP-Seq Workflow, select Plot chromosome view.
The left-hand side contains a list of tracks that can be visualized. The tracks that
are shown by default are (from the top) the transcript tracks, the sequence read
visualization tracks, and the cytoband track (Figure 22)
Figure 22: Viewing the sample reads on chromosome 1
To add additional tracks, select the New Track button on the left-hand side of the
viewer. Choose Add tracks from a list of spreadsheets and select Next
Add the p-value_filtered.txt track and add the Motifs_instances.txt track by
selecting the appropriate checkboxes. Uncheck Aligned Reads as these tracks are
already being displayed. Select Create
This will display the enriched regions found in the samples and the locations of the motif
instances from the de novo motif discovery (additionally, you could display the regions that
were found by searching the JASPAR database). If you have not gone through the steps for
peak detection and motif discovery, these tracks will not be available. The viewer in Figure
23 will appear.
Figure 23: Adding p-value and motif binding site tracks
The two resulting tracks display the detected regions at each location on the chromosome
for the NRSF-enriched sample (chip) and align them to the de novo discovered motif
binding sites. Switching between chromosomes is possible by selecting a chromosome
from the drop-down menu at the top of the window. The positions of the tracks can be
changed by dragging the names of the tracks on the left-hand side of the viewer to the
appropriate locations.
Step 2 – Explore the Data
Change the Genomic Scale via Zoom using the Mouse
Select the magnifying glass icon to zoom in on the data. You can zoom in one of
several ways: (1) clicking and drawing a box on the plot with the left mouse
button, (2) using the mouse scroll wheel, (3) using the magnifying glass icons at the bottom
of the screen, or (4) sliding the bar between the icons. Figure 24
shows a zoomed-in view of one of the enriched regions.
Selecting the home icon at the bottom of the screen will reset the view to show the whole
chromosome. Selecting the selection icon allows you to select a track and change the
properties of that track.
Select the selection icon and then select the chip track (or select the Bam Profile (chip)
track from the list of tracks in the left pane)
Under the Style tab, select Histogram, Alignments and select Color by Strands. Select
Apply and the viewer in Figure 24 will appear
The chip region indicates the location of the ChIP-Seq peak. Since only the ends of the
resulting fragments from the ChIP assay are read, enriched regions will generally contain
two peaks, one for the forward reads (shown in green) and one for the reverse reads (shown
in red). The control (mock.bam) does not contain an enriched region at this site. The two
motif binding site regions indicate that there are two potential binding sites in this region.
Figure 24: Viewing the zoomed-in view of an enriched region showing two possible
binding sites at this location
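The two-peak pattern described above can be illustrated with a few lines of code that locate the forward-strand and reverse-strand read pile-ups separately and measure the gap between them. This is a simplified sketch and not how PGS detects peaks: the function name, the `(position, strand)` input format, the fixed bin size, and the read positions in the example are all assumptions chosen for illustration.

```python
from collections import Counter


def strand_peak_shift(reads, bin_size=10):
    """Find the densest bin of 5' read ends on each strand.

    `reads` is an iterable of (five_prime_position, strand) pairs with
    strand "+" or "-". Returns (forward_peak, reverse_peak, shift) in bp,
    or None if either strand has no reads.
    """
    forward = Counter()
    reverse = Counter()
    for position, strand in reads:
        bins = forward if strand == "+" else reverse
        bins[position // bin_size] += 1
    if not forward or not reverse:
        return None
    forward_peak = max(forward, key=forward.get) * bin_size
    reverse_peak = max(reverse, key=reverse.get) * bin_size
    return forward_peak, reverse_peak, reverse_peak - forward_peak


# Made-up reads: forward ends cluster near 100, reverse ends near 250.
example_reads = [
    (100, "+"), (105, "+"), (108, "+"), (150, "+"),
    (250, "-"), (255, "-"), (259, "-"), (300, "-"),
]
result = strand_peak_shift(example_reads)
```

In this toy example the reverse-strand peak sits 150 bp downstream of the forward-strand peak, which mirrors the green and red peaks flanking the binding site in the viewer.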
Shortcut to Showing an Enriched Region
To go to an enriched region from the p-value spreadsheet, right-click on the row
header of the region of interest in the p-value_filtered spreadsheet and select
Browse to Location; this action will automatically go to the coordinates of the
region
You can also type the name of a gene in the text box at the top of the viewer next
to the magnifying glass, and the viewer will display the location of that gene. For
example, typing NEUROD1 goes immediately to the NEUROD1 gene (Figure
25)
NEUROD1 contains a binding site for the NRSF motif. Notice that the enriched region for
the NRSF transcription factor is within the NEUROD1 gene. As discussed in the Johnson
et al. paper, NRSF is implicated in the repression of NEUROD1, but it was unknown
exactly where the NRSF binding occurred. This data indicates that the binding site is within
the NEUROD1 gene itself, as shown by the orange box in the Regions track.
Figure 25: Viewing the zoomed-in view of NEUROD1 gene
You may also save the reads shown in the visible genome browser window while in selection
mode by right-clicking in the peak area and selecting Dump Displayed Reads to
Spreadsheet.
Additional Analysis
In addition to the items covered in this tutorial, detecting SNPs in the ChIP-Seq sample is
possible. You may look for differences in nucleotides across the samples or against a
reference genome. This analysis is the same for all of the next generation sequencing
workflows (ChIP-Seq, RNA-Seq, and DNA-Seq) and so is not covered in this tutorial.
Also, the ChIP-Seq results can be merged with gene expression data using the Genomic
Integration step in the ChIP-Seq workflow.
End of Tutorial
For additional assistance, contact our technical support staff at +1-314-878-2329 or email
support@partek.com.
References
Johnson, D. S., Mortazavi, A., Myers, R. M., & Wold, B. (2007). Genome-wide mapping
of in vivo protein-DNA interactions. Science, 316.
Kharchenko, P. V., Tolstorukov, M. Y., & Park, P. J. (2008). Design and analysis of ChIP-Seq
experiments for DNA-binding proteins. Nature Biotechnology, 26.
Kundaje, A. (2010). The phantom-peak coefficient as measure of *-seq data quality. Retrieved
from
ftp://encodeftp.cse.ucsc.edu/users/akundaje/phantomPeakQuality/ThePhantomPeakCoefficient.pdf
Neuwald, A. F., Liu, J. S., & Lawrence, C. E. (1995). Gibbs motif sampling: detection of
bacterial outer membrane protein repeats. Protein Science, 4.
Tutorial last revised: Feb. 2012
Copyright 2012 by Partek Incorporated. All Rights Reserved. Reproduction of this material without express written consent
from Partek Incorporated is strictly prohibited.