geuvadis wp4: rna sequencing progress, aims and data

Geuvadis WP4: RNA sequencing

Progress, Aims and Data

Tuuli Lappalainen

University of Geneva

Geuvadis Analysis Group Meeting, April 16, 2012, Geneva

Genomics, meet transcriptomics

RNA sequencing of ~500 individuals from the 1000 Genomes

Populations: FIN, GBR, TSI, CEU, YRI

Population   Geuvadis   In 1000G Phase 1
TSI          93         92
GBR          96         86
FIN          95         89
CEU          92         79
YRI          89         77
TOTAL        465        423

Integrated haplotypes of SNPs, indels, and structural variants (~13M variants in total) + mRNAseq + miRNAseq

Why are we doing this?

Q: I have all these variants from my sequencing study, but I don't know what's functional.

A: Here's a pretty good catalogue of regulatory variants. We can also start to predict functional consequences of novel variants based on their properties.

Q: We might want to do RNAseq on a big scale. What do we get out of it? How should we do it?

A: At least we did lots of cool science. This is how we created the data and analyzed it.

Q: I want to use 1000g data in my research, but is there any functional data available?

A: Yes, this is the largest genome+transcriptome reference dataset thus far. You can use it in your own research (after our paper is out).

Samples

1. Transformed lymphoblastoid cell lines from Coriell & UNIGE

2. Cell culture at ECACC: cell pellets for RNA isolation + cell banks for all the partners

3. RNA extracted at UNIGE

4. Sequencing in 7 partner labs

Randomization of the sample processing

[Figure: samples per sequencing lab. Labs: UU, ICMB, MPIMG, HMGU, UNIGE, CRG/CNAG/USC, LUMC. Counts shown: 48, 72, 48, 48, 72, 96, 116+168.]

Sequencing

mRNAseq: 2 x 75bp, minimum of 20M mapping reads per sample

total ~15 billion mapping reads

miRNAseq: 1 x 36bp, minimum of 3M total reads per sample

total ~1 billion mapping reads

All sequencing on HiSeq with the latest TruSeq kits

Standardization of the methods as much as possible
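The per-sample minimums above can be checked mechanically once read counts are available. The sketch below is a minimal illustration, assuming a hypothetical mapping from sample ID to read count; the function name, the dictionary layout, and the example counts are made up for illustration and are not project data.

```python
# Minimum read requirements taken from the slide; everything else is hypothetical.
MIN_READS = {
    "mRNA": 20_000_000,   # minimum mapping reads per sample (2 x 75 bp)
    "miRNA": 3_000_000,   # minimum total reads per sample (1 x 36 bp)
}


def samples_below_minimum(read_counts, library_type):
    """Return the sorted sample IDs whose read count is below the minimum."""
    threshold = MIN_READS[library_type]
    return sorted(s for s, n in read_counts.items() if n < threshold)


# Made-up counts for illustration only (not real Geuvadis numbers).
example_counts = {
    "sampleA": 31_500_000,
    "sampleB": 18_200_000,
    "sampleC": 20_000_000,
}
print(samples_below_minimum(example_counts, "mRNA"))  # ['sampleB']
```

A sample sitting exactly at the threshold passes, since the requirement is stated as a minimum.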

Progress and timeline

2010: Pilot of 5 samples, 7 labs

[Timeline chart, 2011 to 2012: study design; sample selection; cell line shipments and growing; RNA extraction pilot; RNA extraction; sequencing; mapping; QC; paper submission. Other labels in the chart: Thomas / Tuuli; 10/12; 12.]

Documentation: wiki

http://www.geuvadis.org/group/geuvadis/wikis

Tech support from Gabrielle [email protected]

Contents of the WP4 Wiki

Analysis: Analysis results, methods, etc.

Data storage: Locations and descriptions of data files found in EBI (ENA/ArrayExpress or FTP site). The Wiki is only for sharing small result files, not actual data.

Partners and contact info: WP4 participants

Protocols: Protocols, from cells to fastq files

Samples: Information on the samples included in the project, including sample lists for sequencing

Teleconference minutes

Presentations: Presentation slides and abstracts

Documentation of any analysis that is used by the consortium is obligatory.

http://www.geuvadis.org/group/geuvadis/wikis

Data storage: ftp

ftp:ftp-private.ebi.ac.uk/upload/geuvadis/wp4_rnaseq/main_project/

Tech support from Natalja ([email protected])

Status of the data: mRNA

Fastqs: All filtered, uploaded to ftp, sample information sheets sorted out, checksums OK. 464 samples in total (1 failed sequencing QC).

Mapping, bwa (Tuuli/Ismael): All done and uploaded to the ftp site.

Mapping, GEM (Micha/Thasso/Paolo): GEM files are done. Bam conversion coming.

Quantifications, exon quantifications, bwa: all done and uploaded to the ftp site.

Quantifications, exon quantifications, GEM: deconvoluted from flux: ready to upload? Read counts: once the bams are done.

Quantifications, transcript quantifications from flux: ready to upload?

QC and normalization: No sample swaps. 5 samples that show signs of cross-contamination. Expression outliers: soon. QTL analysis needs normalization to remove technical variation.
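One simple way to picture "removing technical variation" is to center expression values within each sequencing lab, so that lab-level shifts are not mistaken for biological signal. The sketch below is only an illustration under that assumption; it is not the normalization used by the consortium, and the function name and toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_by_lab(values, labs):
    """Subtract each lab's mean from the values measured in that lab.

    values: expression values for one gene, one per sample
    labs:   lab label for each sample, in the same order
    """
    by_lab = defaultdict(list)
    for value, lab in zip(values, labs):
        by_lab[lab].append(value)
    lab_means = {lab: mean(vals) for lab, vals in by_lab.items()}
    return [value - lab_means[lab] for value, lab in zip(values, labs)]


# Toy log-expression values (made up): lab2 reads systematically lower.
expr = [5.0, 6.0, 3.0, 4.0]
labs = ["lab1", "lab1", "lab2", "lab2"]
print(center_by_lab(expr, labs))  # [-0.5, 0.5, -0.5, 0.5]
```

A correction of this kind only makes sense when samples are randomized across labs, so that a lab mean reflects processing rather than the individuals sequenced there.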

mRNA quality statistics

mRNA quality statistics: replicates

              Estimate   Std.Error   z value    p
(Intercept)    4.25749   0.01379     308.706    <10^-16
HG00355        0.27618   0.01271      21.729    <10^-16
NA06986       -0.65074   0.01041     -62.518    <10^-16
NA19095       -0.22808   0.01125     -20.265    <10^-16
NA20527        0.22239   0.01253      17.752    <10^-16
lab1_2        -0.19326   0.01556     -12.417    <10^-16
lab2          -0.22091   0.01547     -14.279    <10^-16
lab3          -1.17157   0.01329     -88.144    <10^-16
lab4          -0.34313   0.01509     -22.745    <10^-16
lab5           0.02166   0.01635       1.325    0.185
lab6          -0.09454   0.01591      -5.942    10^-9
lab7           0.26147   0.0174       15.027    <10^-16

reference: HG00117, lab1_batch1
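The z value and p columns follow from the first two columns: z is the estimate divided by its standard error, and p is a two-sided tail probability. The sketch below recomputes both for two rows of the table, assuming a Wald test against a standard normal distribution (the slide does not name the model that was fitted); the function name is illustrative.

```python
import math


def wald_z_and_p(estimate, std_error):
    """Return (z, two-sided p) for a Wald test against a standard normal."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Estimates and standard errors copied from the replicate table.
for name, est, se in [("lab5", 0.02166, 0.01635), ("lab6", -0.09454, 0.01591)]:
    z, p = wald_z_and_p(est, se)
    print(f"{name}: z = {z:.3f}, p = {p:.2g}")
```

Under this assumption, the recomputed values land close to the tabulated ones: lab5 is the only term that is not significant, and lab6 has p on the order of 10^-9.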

mRNA quality statistics: all full-coverage samples

Status of the data: miRNA

Fastqs: All except 48 from Kiel uploaded to ftp, sample information sheets sorted out, checksums OK.

Processing of the data ongoing (Marc F): trimming, mapping, QC.
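The "checksums OK" status implies comparing each uploaded file against a recorded digest. A minimal sketch of such a check follows, assuming MD5 digests (the slides do not say which algorithm was used); the function names and the small demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_ok(path, expected_hex):
    """Return True if the file's MD5 matches the expected digest."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Demo on a small temporary file standing in for an uploaded fastq.
content = b"@read1\nACGT\n+\nIIII\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".fastq") as tmp:
    tmp.write(content)
    demo_path = tmp.name

demo_match = checksum_ok(demo_path, hashlib.md5(content).hexdigest())
demo_mismatch = checksum_ok(demo_path, "0" * 32)
os.remove(demo_path)
print(demo_match, demo_mismatch)  # True False
```

Reading in chunks keeps memory use flat, which matters for fastq files that run to several gigabytes.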

Status of the data: genotypes

422 individuals from 1000g Phase 1 are OK: genotypes in the final format uploaded to the ftp site.

Imputation of the Phase 2 individuals: issues either with the input haplotypes from 1000g or with filtering of the reference panel…

Annotation of the variants: most of the information from the 1000g Functional Interpretation Group, plus additional info by Tuuli and Manny, will be included in the vcf files; format customized from VAT and documented in the wiki.

VA=1: AlleleNumber

C1orf159: GeneName

ENSG00000131591.12: GeneID

-: Strand

nonsynonymous: Type

2/8: FractionOfTranscriptsAffected

C1orf159-201: TranscriptName

ENST00000294576.5: TranscriptID

23468_23597: ExonStartPosGenomic_ExonEndPosGenomic

3/7: ExonNumber/TotalExonNumberInTranscript

1035_944_315_R->Q_1035: TranscriptLength_PositionOfVariantInTranscript_PositionOfAminoAcidInPeptide_AminoAcidChange_AltAlleleTranscriptLength
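The annotation fields listed above can be unpacked with a few lines of code. The sketch below assumes that the values are joined with colons in the order shown on the slide, with underscore-separated sub-fields, and that an entry describes a single transcript; the assembled example string, the function name, and the placeholder keys are illustrative, and the authoritative format is the one documented in the wiki.

```python
# Field order as listed on the slide; the two placeholder keys are hypothetical.
VA_FIELDS = [
    "AlleleNumber", "GeneName", "GeneID", "Strand", "Type",
    "FractionOfTranscriptsAffected", "TranscriptName", "TranscriptID",
    "ExonPositions", "ExonNumber", "VariantDetails",
]
DETAIL_FIELDS = [
    "TranscriptLength", "PositionOfVariantInTranscript",
    "PositionOfAminoAcidInPeptide", "AminoAcidChange",
    "AltAlleleTranscriptLength",
]


def parse_va(entry):
    """Parse one single-transcript VA entry into a dict.

    Assumes colon-separated fields in the order listed on the slide,
    with underscore-separated sub-fields in the last field.
    """
    if entry.startswith("VA="):
        entry = entry[3:]
    parts = entry.split(":")
    if len(parts) != len(VA_FIELDS):
        raise ValueError(f"expected {len(VA_FIELDS)} fields, got {len(parts)}")
    record = dict(zip(VA_FIELDS, parts))
    start, end = record.pop("ExonPositions").split("_")
    record["ExonStartPosGenomic"] = int(start)
    record["ExonEndPosGenomic"] = int(end)
    details = record.pop("VariantDetails").split("_")
    record.update(zip(DETAIL_FIELDS, details))
    return record


# Example string assembled from the values on the slide (assumed layout).
example = ("VA=1:C1orf159:ENSG00000131591.12:-:nonsynonymous:2/8:"
           "C1orf159-201:ENST00000294576.5:23468_23597:3/7:"
           "1035_944_315_R->Q_1035")
parsed = parse_va(example)
print(parsed["GeneName"], parsed["AminoAcidChange"])  # C1orf159 R->Q
```

Entries covering several transcripts or other variant types may carry a different number of sub-fields, so a real parser should follow the wiki documentation rather than this fixed layout.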

(Some of the) questions that we should address

1. How to do transcriptomics on a large scale? Technical covariates, batch effects, replicates; low-level data processing

2. SNP calling from RNAseq data

3. How does the transcriptome vary and interact? Quantitative/qualitative mRNA variation; population variation of miRNAs; interactions (mRNA-miRNA), coexpression networks

4. Catalogue of genetic variants in 1000g that affect transcriptome variation: common eQTLs, sQTLs, variation QTLs, loss-of-function variants…

5. What are the mechanisms underlying regulatory variants? Functional annotation of regulatory variants; mapping of causal regulatory variants

6. Interpretation: population and evolutionary genetic analysis, disease aspects…

The consortium

UNIGE (Geneva): Manolis Dermitzakis, Stylianos Antonarakis, Tuuli Lappalainen, Thomas Giger, Emilie Falconnet, Luciana Romano, Alexandra Planchon, Ismael Padioleau, Alisa Yurovsky

CRG/CNAG/USC (Barcelona): Xavier Estivill, Ivo Gut, Roderic Guigo, Angel Carracedo Alvarez, Gabrielle Bertier, Micha Sammeth, Thasso Griber, Paolo Ribeca, Pedro Ferreira, Jean Monlong, Esther Lizano, Marc Friedländer, Marta Gut, Sergi Bertran Agullo

ICMB (Kiel): Stefan Schreiber, Philip Rosenstiel, Matthias Barann

MPIMG (Berlin): Hans Lehrach, Ralf Sudbrak, Marc Sultan, Vyacheslav Amstislavskiy

LUMC (Leiden): Gert-Jan van Ommen, Peter 't Hoen, Irina Pulyakhina

UU (Uppsala): Ann-Christine Syvänen, Olof Karlberg, Jonas Almlöf, Mathias Brännvall

HMGU (Munich): Thomas Meitinger, Tim Strom, Thomas Wieland, Thomas Schwarzmayr

EBI: Alvis Brazma, Natalja Kurbatova

Oxford University: Manuel Rivas

Massachusetts General Hospital: Daniel McArthur

ECACC: Bryan Bolton, Karen Ball, Edward Burnett, Jim Cooper

Who is missing??
