Geuvadis WP4: RNA sequencing
Progress, Aims and Data
Tuuli Lappalainen
University of Geneva
Geuvadis Analysis Group Meeting, April 16, 2012, Geneva
Genomics, meet transcriptomics
RNA sequencing of ~500 individuals from the 1000 Genomes
Populations: FIN, GBR, TSI, CEU, YRI
Population  Geuvadis  In 1000G Phase 1
TSI         93        92
GBR         96        86
FIN         95        89
CEU         92        79
YRI         89        77
TOTAL       465       423
Integrated haplotypes of SNPs, indels and structural variants (~13M variants in total) + mRNAseq + miRNAseq
Why are we doing this?
Q: I have all these variants from my sequencing study but I don't know what's functional.
A: Here's a pretty good catalogue of regulatory variants. We can also start to predict functional consequences of novel variants based on their properties.
Q: We might want to do RNAseq on a big scale. What do we get out of it? How should we do it?
A: At least we did lots of cool science. This is how we created the data and analyzed it.
Q: I want to use 1000g data in my research, but is there any functional data available?
A: Yes, this is the largest genome+transcriptome reference dataset thus far. You can use it in your own research (after our paper is out).
Samples
1. Transformed lymphoblastoid cell lines from Coriell & UNIGE
2. Cell culture at ECACC: Cell pellets for RNA isolation + cell banks for all the partners
3. RNA extracted at UNIGE
4. Sequencing in 7 partner labs
Randomization of the sample processing
[Figure: samples per sequencing lab. Labs: UU, ICMB, MPIMG, HMGU, UNIGE, CRG/CNAG/USC, LUMC. Counts shown: 48, 72, 48, 48, 72, 96, 116+168]
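The randomization step can be sketched in a few lines. This is a hypothetical illustration, not the procedure the consortium actually used: the function name `assign_to_labs`, the seed, and the sample IDs and lab capacities in the usage line are all assumptions made up for the example.

```python
import random

def assign_to_labs(sample_ids, capacities, seed=0):
    """Randomly assign samples to labs, respecting each lab's capacity.

    capacities: dict mapping lab name -> number of samples it sequences.
    Returns a dict mapping sample ID -> lab name.
    """
    if sum(capacities.values()) != len(sample_ids):
        raise ValueError("capacities must sum to the number of samples")
    rng = random.Random(seed)      # fixed seed keeps the design reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    assignment = {}
    start = 0
    for lab, n in capacities.items():
        # Each lab takes the next n samples from the shuffled order.
        for sample in shuffled[start:start + n]:
            assignment[sample] = lab
        start += n
    return assignment

# Hypothetical usage: 6 made-up sample IDs split across 2 made-up labs.
design = assign_to_labs([f"S{i}" for i in range(6)], {"labA": 4, "labB": 2})
```

Shuffling once and slicing by capacity keeps population and lab from being confounded on average, which is the point of randomizing the processing.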
Sequencing
mRNAseq: 2 x 75bp, minimum of 20M mapping reads per sample; total ~15 billion mapping reads
miRNAseq: 1 x 36bp, minimum of 3M total reads per sample; total ~1 billion mapping reads
All sequencing on HiSeq with the latest TruSeq kits
Standardization of the methods as much as possible
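The per-sample minimums above lend themselves to a simple automated check. The sketch below is an assumption about how such a check might look; the helper name `below_threshold` and the read counts are invented for illustration and are not project data.

```python
MIN_MRNA_MAPPED = 20_000_000   # mRNAseq: minimum mapping reads per sample
MIN_MIRNA_TOTAL = 3_000_000    # miRNAseq: minimum total reads per sample

def below_threshold(read_counts, minimum):
    """Return the sorted sample IDs whose read count is below the minimum."""
    return sorted(s for s, n in read_counts.items() if n < minimum)

# Made-up counts, for illustration only.
example_counts = {
    "sample1": 25_300_000,
    "sample2": 18_900_000,
    "sample3": 20_000_000,   # exactly at the minimum, so it passes
}
flagged = below_threshold(example_counts, MIN_MRNA_MAPPED)
```

Samples in `flagged` would be candidates for additional sequencing.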
Progress and timeline
2010: Pilot of 5 samples, 7 labs
2011 to 2012 (timeline chart): study design, sample selection, cell line shipments and growing, RNA extraction pilot, RNA extraction, sequencing, mapping, QC
Thomas / Tuuli
10/12 12 paper submission
Documentation: wiki
http://www.geuvadis.org/group/geuvadis/wikis
Tech support from Gabrielle
Contents of the WP4 Wiki
Analysis: Analysis results, methods, etc
Data storage: Locations and descriptions of data files found in EBI (ENA/Arrayexpress or FTP site). The Wiki is only for sharing small result files, not actual data
Partners and contact info: WP4 participants
Protocols: Protocols, from cells to fastq files
Samples: Information of the samples included in the project, including sample lists for sequencing
Teleconference minutes
Presentations: Presentation slides and abstracts
Documentation of any analysis that is used by the consortium is obligatory
Data storage: ftp
ftp:ftp-private.ebi.ac.uk/upload/geuvadis/wp4_rnaseq/main_project/
Tech support from Natalja
Status of the data: mRNA
Fastqs
  All filtered, uploaded to ftp, sample information sheets sorted out, checksums OK
  464 samples in total (1 failed sequencing QC)
Mapping
  bwa (Tuuli/Ismael): all done and uploaded to the ftp site
  GEM (Micha/Thasso/Paolo): GEM files are done. Bam conversion coming
Quantifications
  Exon quantifications
    bwa: all done and uploaded to the ftp site
    GEM: deconvoluted from flux: ready to upload? Read counts: once the bams are done
  Transcript quantifications from flux: ready to upload?
QC and normalization
  No sample swaps. 5 samples that show signs of cross-contamination. Expression outliers: soon
  QTL analysis needs normalization to remove technical variation
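To make the normalization point concrete, here is a minimal sketch of one of the simplest ways to remove a technical batch effect: centering each gene's values within each batch. The slides do not say which normalization the project uses, so this is a swapped-in illustration; the function name `center_by_batch` and all values are hypothetical.

```python
from collections import defaultdict

def center_by_batch(values, batches):
    """Subtract each batch's mean from the values of its samples.

    values:  dict sample ID -> expression value for one gene (e.g. log scale)
    batches: dict sample ID -> batch label (e.g. sequencing lab)
    """
    grouped = defaultdict(list)
    for sample, value in values.items():
        grouped[batches[sample]].append(value)
    means = {batch: sum(v) / len(v) for batch, v in grouped.items()}
    return {s: v - means[batches[s]] for s, v in values.items()}

# Made-up values: labA reads systematically higher than labB.
expr = {"s1": 5.0, "s2": 7.0, "s3": 1.0, "s4": 3.0}
lab = {"s1": "labA", "s2": "labA", "s3": "labB", "s4": "labB"}
centered = center_by_batch(expr, lab)
```

Batch centering is only safe when the batches have comparable biological composition, which is what randomizing the sample processing is meant to ensure.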
mRNA quality statistics
mRNA quality statistics: replicates
             Estimate   Std.Error  z value   p
(Intercept)   4.25749   0.01379    308.706   <10^-16
HG00355       0.27618   0.01271     21.729   <10^-16
NA06986      -0.65074   0.01041    -62.518   <10^-16
NA19095      -0.22808   0.01125    -20.265   <10^-16
NA20527       0.22239   0.01253     17.752   <10^-16
lab1_2       -0.19326   0.01556    -12.417   <10^-16
lab2         -0.22091   0.01547    -14.279   <10^-16
lab3         -1.17157   0.01329    -88.144   <10^-16
lab4         -0.34313   0.01509    -22.745   <10^-16
lab5          0.02166   0.01635      1.325   0.185
lab6         -0.09454   0.01591     -5.942   10^-9
lab7          0.26147   0.0174      15.027   <10^-16
reference: HG00117, lab1_batch1
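The z value and p columns can be reproduced from the first two columns. The sketch below assumes Wald tests against a standard normal distribution, which is what a "z value" column usually indicates; the slide does not state the model family, and the function name `wald_z_and_p` is made up here.

```python
import math

def wald_z_and_p(estimate, std_error):
    """Return the z value and the two-sided normal p-value."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

z_lab5, p_lab5 = wald_z_and_p(0.02166, 0.01635)    # lab5 row of the table
z_lab3, p_lab3 = wald_z_and_p(-1.17157, 0.01329)   # lab3 row of the table
```

For the lab5 row this gives z of about 1.325 and p of about 0.185, matching the table. Small differences in other rows are expected because the displayed estimates and standard errors are rounded.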
mRNA quality statistics: all full-coverage samples
Status of the data: miRNA
Fastqs: all except 48 from Kiel uploaded to ftp, sample information sheets sorted out, checksums OK
Processing of the data ongoing (Marc F): trimming, mapping, QC
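The "checksums OK" step can be automated after each upload or download. The slides do not say which checksum algorithm is used, so MD5 is an assumption in this sketch, and the function names `md5_of_file` and `verify` are hypothetical.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte fastq files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_md5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file whose `verify` call returns False should be transferred again before any processing starts.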
Status of the data: genotypes
422 individuals from 1000g Phase 1 are OK: genotypes in the final format uploaded to the ftp site
Imputation of the Phase 2 individuals: issues either with the input haplotypes from 1000g or filtering of the reference panel…
Annotation of the variants: most of the information from 1000g Functional Interpretation Group + additional info by Tuuli and Manny will be included in the vcf files, format customized from VAT and documented in the wiki
Example annotation (value: field):
VA=1: AlleleNumber
C1orf159: GeneName
ENSG00000131591.12: GeneID
-: Strand
nonsynonymous: Type
2/8: FractionOfTranscriptsAffected
C1orf159-201: TranscriptName
ENST00000294576.5: TranscriptID
23468_23597: ExonStartPosGenomic_ExonEndPosGenomic
3/7: ExonNumber/TotalExonNumberInTranscript
1035_944_315_R->Q_1035: TranscriptLength_PositionOfVariantInTranscript_PositionOfAminoAcidInPeptide_AminoAcidChange_AltAlleleTranscriptLength
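A small parser shows how such an annotation could be read back into named fields. It assumes the fields are joined with ":" in the order listed on the slide and that the last field is split on "_"; the authoritative format is the one documented in the wiki, and the names `parse_va`, `FIELDS` and `DETAIL_FIELDS` are hypothetical. Only a single transcript entry is handled.

```python
FIELDS = [
    "AlleleNumber", "GeneName", "GeneID", "Strand", "Type",
    "FractionOfTranscriptsAffected", "TranscriptName", "TranscriptID",
    "ExonStartPosGenomic_ExonEndPosGenomic",
    "ExonNumber/TotalExonNumberInTranscript",
]
DETAIL_FIELDS = [
    "TranscriptLength", "PositionOfVariantInTranscript",
    "PositionOfAminoAcidInPeptide", "AminoAcidChange",
    "AltAlleleTranscriptLength",
]

def parse_va(entry):
    """Parse one single-transcript VA annotation into a dict of strings."""
    if entry.startswith("VA="):
        entry = entry[len("VA="):]
    parts = entry.split(":")
    if len(parts) != len(FIELDS) + 1:
        raise ValueError(f"expected {len(FIELDS) + 1} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts[:-1]))
    details = parts[-1].split("_")       # last field packs five sub-values
    if len(details) != len(DETAIL_FIELDS):
        raise ValueError("unexpected number of sub-values in the last field")
    record.update(zip(DETAIL_FIELDS, details))
    return record

# The example from the slide, joined with ":" (assumed layout).
example = ("VA=1:C1orf159:ENSG00000131591.12:-:nonsynonymous:2/8:"
           "C1orf159-201:ENST00000294576.5:23468_23597:3/7:"
           "1035_944_315_R->Q_1035")
record = parse_va(example)
```

Values are kept as strings so that fractions such as 2/8 and coordinate pairs are not altered by the parser.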
(Some of the) questions that we should address
1. How to do transcriptomics on a big scale?
   technical covariates, batch effects, replicates
   low-level data processing
2. SNP calling from RNAseq data
3. How does the transcriptome vary and interact?
   quantitative/qualitative mRNA variation
   population variation of miRNAs
   interactions (mRNA-miRNA), coexpression networks
4. Catalogue of genetic variants in 1000g that affect transcriptome variation
   common eQTLs, sQTLs, variation QTLs, loss of function variants…
5. What are the mechanisms underlying regulatory variants?
   Functional annotation of regulatory variants
   Mapping of causal regulatory variants
6. Interpretation: population and evolutionary genetic analysis, disease aspects…
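As an illustration of what an eQTL test in question 4 boils down to, the sketch below fits expression against genotype dosage by ordinary least squares. It is a toy example, not the consortium's analysis pipeline: the function name `eqtl_slope` and the data are made up, and a real analysis would add normalization, covariates and multiple-testing control.

```python
def eqtl_slope(dosages, expression):
    """Least-squares slope and intercept of expression on genotype dosage.

    dosages:    alternative-allele counts per individual (0, 1 or 2)
    expression: normalized expression of one gene, same order as dosages
    """
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need two equal-length inputs with at least 2 values")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data in which each extra allele adds one unit of expression.
slope, intercept = eqtl_slope([0, 1, 2, 1, 0, 2], [1.0, 2.0, 3.0, 2.0, 1.0, 3.0])
```

The slope is the estimated change in expression per copy of the alternative allele.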
The consortium
UNIGE (Geneva): Manolis Dermitzakis, Stylianos Antonarakis, Tuuli Lappalainen, Thomas Giger, Emilie Falconnet, Luciana Romano, Alexandra Planchon, Ismael Padioleau, Alisa Yurovsky
CRG/CNAG/USC (Barcelona): Xavier Estivill, Ivo Gut, Roderic Guigo, Angel Carracedo Alvarez, Gabrielle Bertier, Micha Sammeth, Thasso Griber, Paolo Ribeca, Pedro Ferreira, Jean Monlong, Esther Lizano, Marc Friedländer, Marta Gut, Sergi Bertran Agullo
ICMB (Kiel): Stefan Schreiber, Philip Rosenstiel, Matthias Barann
MPIMG (Berlin): Hans Lehrach, Ralf Sudbrak, Marc Sultan, Vyacheslav Amstislavskiy
LUMC (Leiden): Gert-Jan van Ommen, Peter ‘t Hoen, Irina Pulyakhina
UU (Uppsala): Ann-Christine Syvänen, Olof Karlberg, Jonas Almlöf, Mathias Brännvall
HMGU (Munich): Thomas Meitinger, Tim Strom, Thomas Wieland, Thomas Schwarzmayr
EBI: Alvis Brazma, Natalja Kurbatova
Oxford University: Manuel Rivas
Massachusetts General Hospital: Daniel McArthur
ECACC: Bryan Bolton, Karen Ball, Edward Burnett, Jim Cooper
Who is missing??