Armstrong, 2004
Bioinformatics 2
Genomics & Proteomics
Biological Profiling
• Microarrays
– cDNA arrays
– oligonucleotide arrays
– whole genome arrays
• Proteomics
– yeast two hybrid
– PAGE techniques
Gene expression and microarrays
• Annealing properties of DNA
• ‘probes’ of known DNA on a substrate
• Label the target DNA (taken from sample)
• Add the labeled target DNA to substrate
• Look for co-localisation of label and known DNA spot on substrate.
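The hybridisation idea above can be sketched in a few lines of Python. This is a toy model, not a description of any real array chemistry: it assumes perfect Watson-Crick matches only, and the probe names and sequences are invented for illustration.

```python
# Toy model of probe/target annealing: perfect Watson-Crick matches only.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hybridising_spots(probes, labelled_targets):
    """Return the names of probe spots that some labelled target can anneal to.

    A target is treated as annealing when it contains the reverse
    complement of the probe, which is a deliberate simplification.
    """
    hits = []
    for name, probe in probes.items():
        site = reverse_complement(probe)
        if any(site in target for target in labelled_targets):
            hits.append(name)
    return hits


# Hypothetical probes (known DNA on the substrate) and labelled sample DNA.
probes = {"spot_A": "ATGGCC", "spot_B": "TTTAAA", "spot_C": "GGGCCC"}
targets = ["CCGGCCATTT", "AAGGGCCCAA"]
print(hybridising_spots(probes, targets))
```

Spots where label and probe co-localise are the ones returned; real hybridisation also tolerates partial mismatches, which this sketch ignores.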
Array technology
• Not really a new technology
• Old methods used nylon or nitrocellulose filter squares
• 100 x 100 spots on a square just larger than a CD cover
• Required large amount of sample DNA
• Limited number of spots on substrate.
Microarrays
• Glass substrate
• Very high density of probes
• Two main methods:
– robot spotted system
– substrate synthesized system (Affymetrix)
Why microarrays?
• What genes are expressed in a tissue, and how does that tissue respond to one of a number of factors:
– change in physical environment
– experience
– pharmacological manipulation
– influence of specific mutations
What do we actually get?
• A snapshot of the mRNA profile in a biological sample
• With the correct experimental conditions we can compare two situations
• Not all biological processes are regulated through mRNA expression levels
What can we learn?
• Find promoter regions
• Predict genetic interactions
• If we change one variable a network of gene responses should compensate
• Homeostasis is the fundamental principle of biology - almost all biological systems exist in a controlled state of negative feedback.
Robot spotting
• Robotically controlled print head with a small array of fine needles
• Needles dipped into DNA probes
• Needles spotted onto glass substrate.
• Clean needles, get more probe and offset next set of spots
• Needle mismatches are propagated over entire matrix.
Robot spotted array.
Affymetrix system
• Proprietary technology
• Synthesize oligonucleotides on the substrate
• Allows much higher density of probes
• Probe length is limited
• Multiple probes used per gene
• Reverse complement controls included
• Has the ability to look at splice variants
The experiment
• Tissue/biological sample from an experiment is obtained
• RNA is extracted from the tissue
• RNA is labeled with coloured dyes
• RNA is hybridised to the array
Analysis of microarray data
• Image processing
– Locate all the probe spots and identify them
– Measure the spot intensity
• Normalising for experimental artifacts
– Spot variability
– Dye pairing bias
– Background noise
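A minimal sketch of two of the normalisation steps listed above, assuming a two-colour array with per-spot foreground and background intensities. The function names, the intensity floor and the example numbers are illustrative assumptions, and median centring is only one of several ways to correct a global dye bias.

```python
import math
from statistics import median


def log_ratios(red, green, red_bg, green_bg, floor=1.0):
    """Background-subtract both channels and return log2(red/green) per spot.

    Net intensities are clamped to `floor` so that a spot dimmer than
    its local background does not produce a log of zero or a negative.
    """
    ratios = []
    for r, g, rb, gb in zip(red, green, red_bg, green_bg):
        r_net = max(r - rb, floor)
        g_net = max(g - gb, floor)
        ratios.append(math.log2(r_net / g_net))
    return ratios


def median_centre(values):
    """Subtract the median so that the typical spot has a log ratio of 0."""
    m = median(values)
    return [v - m for v in values]


# Made-up intensities: every spot reads about 2x brighter in red (a dye bias),
# and the third spot is additionally up-regulated.
red = [1100, 2100, 8100, 600]
green = [600, 1100, 2100, 350]
background = [100, 100, 100, 100]

raw = log_ratios(red, green, background, background)
print(raw)
print(median_centre(raw))
```

After centring, only the third spot keeps a non-zero log ratio, which is the behaviour one would hope for when most genes are unchanged between the two samples.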
Replication
• Gene expression levels are often multi-factorial
• Expression levels will vary independently of any experimentally controlled variable
• Multiple samples or mixed samples will give an overall baseline
• Limited by cost and sample availability
Data analysis
• The data is noisy
• The sensitivity is low
• 2 fold changes in gene expression are the normal threshold
• Start by looking for significant changes in individual gene expression levels between the two conditions or time points.
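The 2-fold threshold can be written as a filter on log2 ratios. The sketch below assumes one normalised expression value per gene per condition; the gene names and values are invented, and a real analysis would also use replicates and a statistical test rather than a bare cut-off.

```python
import math


def two_fold_changes(control, treated, threshold=2.0):
    """Return {gene: log2 fold change} for genes changing by >= `threshold`-fold."""
    cutoff = math.log2(threshold)
    changed = {}
    for gene, base in control.items():
        lfc = math.log2(treated[gene] / base)
        if abs(lfc) >= cutoff:  # keeps both up- and down-regulated genes
            changed[gene] = lfc
    return changed


# Hypothetical normalised expression levels in two conditions.
control = {"geneA": 100.0, "geneB": 200.0, "geneC": 400.0, "geneD": 150.0}
treated = {"geneA": 400.0, "geneB": 220.0, "geneC": 100.0, "geneD": 290.0}
print(two_fold_changes(control, treated))
```

Working in log2 space makes a doubling (+1) and a halving (-1) symmetric, so a single absolute cut-off catches both directions.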
Clustering
• Take the relative changes in gene expression per gene
• Cluster these changes into sets (normally around 100-1000)
• Gives us sets of genes that are affected to a similar degree by the experimental variable
• Can be used to help look for promoter regions
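One common way to cluster expression changes is k-means, sketched below. The slides do not say which clustering algorithm was used, so this is a stand-in; the profiles (log2 ratios across three conditions) and the starting centroids are made up, and fixed starting centroids are used only to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(profiles):
    """Column-wise mean of a list of equal-length profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def kmeans(profiles, centroids, iterations=10):
    """Assign each gene to its nearest centroid, then recompute the centroids."""
    centroids = [list(c) for c in centroids]
    assignment = {}
    for _ in range(iterations):
        assignment = {
            gene: min(
                range(len(centroids)),
                key=lambda k: squared_distance(profile, centroids[k]),
            )
            for gene, profile in profiles.items()
        }
        for k in range(len(centroids)):
            members = [profiles[g] for g, c in assignment.items() if c == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = mean_profile(members)
    return assignment


# Hypothetical log2 expression changes for five genes over three conditions.
profiles = {
    "g1": [2.0, 2.1, 1.9],
    "g2": [1.8, 2.2, 2.0],
    "g3": [-1.0, -1.2, -0.9],
    "g4": [-1.1, -0.8, -1.0],
    "g5": [0.1, 0.0, -0.1],
}
start = [[2.0, 2.0, 2.0], [-1.0, -1.0, -1.0], [0.0, 0.0, 0.0]]
print(kmeans(profiles, start))
```

Genes that end up in the same cluster respond to a similar degree, which is what makes their upstream regions candidates for a shared promoter search.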
Dealing with microarray data
• Current experiments are focussing on just a few samples over a few conditions to examine gene expression changes in response to an environmental change.
• Can we reuse this data to learn about networks of genes?
Predicting gene networks
• Very active area of research.
• Multiple microarrays contain information that we can use to predict networks
• The data from microarray experiments was not collected for that express purpose.
• The Microarray Gene Expression Database group are proposing management systems to help
MGED
• Microarray Gene Expression Database Group
• MIAME
• MAGE
• Ontologies
• Normalization
• http://www.mged.org/
MIAME
• Minimum amount of information required to unambiguously describe a microarray experiment.
• Describe the relevant details of the experiment.
• Allows biological repetition later or reuse of the data and comparison with other experiments.
MIAME Express
• Under development at EBI
• GUI interface to the MIAME database system
• http://www.ebi.ac.uk/microarray/MIAMExpress/miamexpress.html
MAGE
• Microarray And Gene Expression
• Group who are proposing two standards for representing and exchanging microarray data.
• MAGE-OM Object Model for microarray data developed using Unified Modelling Language (UML)
MAGE
• MAGE-ML - XML based model for communication
• MAGE-STK - toolkit for handling and converting between MAGE-OM and MAGE-ML on a variety of platforms
– Java and Perl APIs available
Excerpts from a Sample Description
courtesy of M. Hoffman, S. Schmidtke, Lion BioSciences
Organism: Mus musculus [ NCBI taxonomy browser ]
Cell source: in-house bred mice (contact: [email protected])
Sex: female [ MGED ]
Age: 3 - 4 weeks after birth [ MGED ]
Growth conditions: normal
controlled environment
20 - 22 °C average temperature
housed in cages according to German and EU legislation
specified pathogen free conditions (SPF)
14 hours light cycle
10 hours dark cycle
Developmental stage: stage 28 (juvenile (young) mice) [ GXD "Mouse Anatomical Dictionary" ]
Organism part: thymus [ GXD "Mouse Anatomical Dictionary" ]
Strain or line: C57BL/6 [ International Committee on Standardized Genetic Nomenclature for Mice ]
Genetic Variation: Inbr (J) 150. Origin: substrains 6 and 10 were separated prior to 1937. This substrain is now probably the most widely used of all inbred strains. Substrain 6 and 10 differ at the H9, Igh2 and Lv loci. Maint. by J,N, Ola. [ International Committee on Standardized Genetic Nomenclature for Mice ]
Treatment: in vivo [ MGED ] intraperitoneal injection of Dexamethasone into mice, 10 microgram per 25 g bodyweight of the mouse
Compound: drug [ MGED ] synthetic glucocorticoid Dexamethasone, dissolved in PBS
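A sample description like this one lends itself to a structured record in which each value carries the ontology or authority it was drawn from. The sketch below is only an illustration: the field names and the list of required fields are hypothetical and are not taken from MIAME or MAGE, which define their own authoritative checklists and object model.

```python
# Hypothetical required fields; the real MIAME checklist is defined by MGED.
REQUIRED_FIELDS = ("organism", "sex", "age", "organism_part", "strain", "treatment")

# Values copied from the sample description; "source" is the bracketed authority.
sample = {
    "organism": {"value": "Mus musculus", "source": "NCBI taxonomy browser"},
    "sex": {"value": "female", "source": "MGED"},
    "age": {"value": "3 - 4 weeks after birth", "source": "MGED"},
    "organism_part": {
        "value": "thymus",
        "source": 'GXD "Mouse Anatomical Dictionary"',
    },
    "strain": {
        "value": "C57BL/6",
        "source": "International Committee on Standardized Genetic "
                  "Nomenclature for Mice",
    },
    "treatment": {
        "value": "in vivo intraperitoneal injection of Dexamethasone",
        "source": "MGED",
    },
}


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or have an empty value."""
    return [name for name in required if not record.get(name, {}).get("value")]


print(missing_fields(sample))
```

Keeping the source next to each value is what allows a later experiment to be compared term by term, since both descriptions refer to the same controlled vocabulary.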
Biomaterial Concepts
• Environmental or experimental history: A description of the conditions the organism has been exposed to that are not one of the variables under study.
– Culture conditions: A description of the isolated environment used to grow organisms or parts of the organism.
  • atmosphere, humidity, temperature
  • light: The photoperiod and type (e.g., natural, restricted wavelength) of light exposure.
  • nutrients: The food provided to the organism (e.g., chow, fertilizer, DEMM 10%FBS, etc.).
  • medium: The physical state or matrix used to provide nutrients to the organism (e.g., liquid, agar, soil)
  • density range: The concentration range of the organism.
  • contaminant organisms: Organisms present that were not planned as part of the study (e.g., mycoplasma).
  • removal of contaminants: Steps taken to eliminate contaminant organisms.
  • host organism or organism parts: Organisms or organism parts used as a designed part of the culture (e.g., red blood cells, stromal cells).
– Generations: The number of cell divisions if the organism or organism part that is cultured is unicellular, otherwise the number of breedings.
– Clinical history: The organism's (i.e., the patient's) medical record.
– Husbandry: water, bedding, barrier facility, pathogen test results
– Preservation: seed dormancy, frozen storage
Biomaterial Concepts
• Treatment: The manipulation of the biomaterial for the purposes of generating one of the variables under study.
– somatic modification: The organism has had parts removed, added, or rearranged.
– genetic modification: The organism has had genes removed, added, or rearranged.
– starvation: The organism (or organism part) has been deprived of nutrients.
– infection: The organism (or organism part) has been exposed to a virus or pathogen.
– behavioral stimulus: The organism is forced to respond to a stimulus with some behavior (e.g., avoidance, obtaining a reward, etc.)
– agent-based treatment: The treatment is effected by a defined chemical, biological, or physical agent.
  • agent type: chemical (drugs), biological (macromolecule), physical (stress from light, temperature, etc.)
  • agent application: In vivo, in vitro, in situ; qualitative or quantitative
– treatment protocol: method of treatment
– treatment parameters: constant, variable
– treatment duration: length of treatment
Biomaterial Concepts
• Biomaterial preparation: A description of the state and condition of the biomaterial.
– Time of day when the biomaterial was generated (i.e., sampled). Pathological staging: pre or post mortem at sampling
– state at start of treatment (age, time of day)
– physico-chemical composition of the sample: amount of material, number of cells, purity.
– Extraction: Chemical extraction, Physical extraction
– protocol: method used.
– Pool types:
  • Multiple: Biomaterial prepared from multiple specimens, but same Organism, Genotype, Phenotype and treatment.
  • Individually: Biomaterial prepared from an individual specimen, but same Organism, Genotype, Phenotype and treatment
MGED Ontologies
• Develop standard ontologies for describing experimental procedures associated with microarray data.
• Ontologies specific for:
– sample (e.g. species, anatomical location etc)
– by concept
– array design
MGED Normalization
• Working group to discuss standards for normalization in microarray experiments
– Differences in labeling efficiency between dyes
– Differences in the power of the two lasers.
– Differing amounts of RNA labeled between the 2 channels.
– Spatial biases in ratios across the surface of the microarray.
The Transcriptome
• Microarrays work by revealing DNA-DNA binding.
• Transcriptional activators also bind DNA
• Spot genomic DNA onto glass slides
• Label protein extracts
• Hybridise to the genomic probes
• Reveals domains that include promoter regions
pause…
Gene networks and network inference
What is a gene network
• Genes do not act alone.
• Gene products interact with other genes
– Inhibitors
– Promoters
• The nature of genetic interactions is complex
– Takes time
– Can be binary, linear, stochastic etc
– Can involve many different genes
What makes boys boys and girls girls?
Sugar, Spice and synthetic Oestrogens?
Sex determination: a gene cascade (in flies…)
Runt, Sisterless, Scute, Daughterless, Deadpan, Extramacrochaetae
6 Genes detect X:A ratio
Females Males
Sex determination (in flies…)
Runt, Sisterless, Scute, Daughterless, Deadpan, Extramacrochaetae
6 Genes regulate ‘Sexlethal’
Sexlethal
+ effect
- effect
Sex determination (in flies…)
Runt, Sisterless, Scute, Daughterless, Deadpan, Extramacrochaetae
Sexlethal can then regulate itself...
Sexlethal
Sex determination (in flies…)
Downstream cascade builds...
Runt, Sisterless, Scute, Daughterless, Deadpan, Extramacrochaetae
Sexlethal transformer doublesex
Gene expression and time
1 Runt
2 Sisterless
3 Scute
4 Daughterless
5 Deadpan
6 Extramacrochaetae
7 Sexlethal
8 transformer
9 doublesex
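How a cascade unfolds over time can be illustrated with a toy Boolean model of the downstream part: an upstream signal switches on Sexlethal, Sexlethal then maintains itself, and transformer and doublesex follow one step later each. This is a sketch under strong assumptions: every edge is treated as simple activation with a one-step delay, and the six upstream genes are collapsed into a single `signal` flag because their individual signs are not given in the text.

```python
def step(state, signal):
    """Advance the toy cascade by one synchronous Boolean time step."""
    return {
        # Switched on by the upstream signal, then self-maintaining.
        "Sexlethal": signal or state["Sexlethal"],
        "transformer": state["Sexlethal"],
        "doublesex": state["transformer"],
    }


def simulate(signal_steps, total_steps):
    """Return the state at each time point; the signal is on for `signal_steps`."""
    state = {"Sexlethal": False, "transformer": False, "doublesex": False}
    history = [state]
    for t in range(total_steps):
        state = step(state, signal=t < signal_steps)
        history.append(state)
    return history


history = simulate(signal_steps=1, total_steps=4)
for t, state in enumerate(history):
    print(t, state)
```

Even though the signal is present for a single step, the self-regulation keeps Sexlethal on, and the downstream genes switch on in order. A time-series microarray would see this as staggered onsets across the genes.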
Gene microarrays
time
Gene sub networks
• Do all types of connections exist in gene networks?
• Milo et al studied the transcriptional regulatory networks in yeast and E. coli.
• Calculated all the three and four gene combinations possible and looked at their frequency
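Counting how often a particular three-gene wiring pattern occurs can be sketched as follows. The example counts one pattern, the feed-forward loop (x regulates y, y regulates z, and x also regulates z), in a small made-up directed graph. It is a brute-force illustration only: Milo et al. additionally compared the counts against randomised networks, which is not shown here.

```python
from itertools import permutations


def count_feed_forward_loops(edges):
    """Count ordered triples (x, y, z) with edges x->y, y->z and x->z."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    count = 0
    for x, y, z in permutations(nodes, 3):
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set:
            count += 1
    return count


# Hypothetical regulatory edges (regulator, target).
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "A")]
print(count_feed_forward_loops(edges))

# A pure three-gene cycle contains no feed-forward loop.
cycle = [("A", "B"), ("B", "C"), ("C", "A")]
print(count_feed_forward_loops(cycle))
```

Checking every ordered triple is cubic in the number of genes, which is fine for a sketch but would need a smarter enumeration for a genome-scale network.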
Milo et al. 2002 Network Motifs: Simple Building Blocks of Complex Networks. Science 298: 824-827
Biological Networks
Three node possibilities
Gene sub networks
Heavy bias in both yeast and E. coli towards these two sub network architectures
Gene Network Inference
• Gene micro-array data
• Learning from micro-array data
• Unsupervised Methods
• Supervised Methods
• Edinburgh Methods
Gene Network Inference
• Gene micro-array data
– Time Series array data
– Tests under ranges of conditions
• Unlike example - 1000s genes
• Lots of noise
• Clustering would group many of these genes together
• Aim: To infer as much of the network as possible
Learning from Gene arrays
• Big growth industry but difficult problem
• Initial attempts based on unsupervised methods:
– Basic clustering analysis - related genes
– Principal Component Analysis
– Self Organising Maps
– Bayesian Networks
Bayesian ‘gene’ networks
• Developed by Nir Friedman and Dana Pe’er
• Can be easily adapted to a supervised method
Learning Gene Networks
• The field is generally moving towards more supervised methods:
– Bayesian networks can use priors
– Support Vector machines
– Neural Networks
– Decision Trees
• New local approach uses modified genetic programming
GAGA
• Genetic Algorithms for Gene Analysis
• Represent the network sub-components in a string
• Mutate this string to produce new networks
• Assign a rough linear weight to each edge
• Select the best of these
• Refine the weights of each edge using backpropagation
• Select the fittest and start over
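The mutate-and-select loop above can be illustrated with a generic genetic algorithm. This is not the GAGA implementation: the network is encoded as one bit per possible edge, the fitness is a toy score against a known target network (in real inference the fitness would measure how well a candidate explains the expression data), and the weight refinement by backpropagation is left out. All names and parameter values are illustrative assumptions.

```python
import random


def fitness(candidate, target):
    """Toy fitness: number of edge slots that agree with the target network."""
    return sum(c == t for c, t in zip(candidate, target))


def mutate(candidate, rate, rng):
    """Flip each edge bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in candidate]


def evolve(target, population_size=20, generations=50, rate=0.05, seed=0):
    """Return the best network string found and the best fitness per generation."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    population = [
        [rng.randint(0, 1) for _ in target] for _ in range(population_size)
    ]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=lambda c: fitness(c, target), reverse=True)
        best_per_generation.append(fitness(population[0], target))
        survivors = population[: population_size // 2]  # select the fittest half
        children = [
            mutate(rng.choice(survivors), rate, rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children  # survivors are kept unchanged
    population.sort(key=lambda c: fitness(c, target), reverse=True)
    return population[0], best_per_generation


# Hypothetical 12-slot edge string standing in for a small network.
target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
best, progress = evolve(target)
print(fitness(best, target), "of", len(target), "edges match")
```

Because the fittest half is carried over unchanged, the best score can never drop from one generation to the next; only mutation is used here, with no crossover between strings.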
Island Model
• Set up individual islands for computation (servers)
• Allow these to interbreed from time to time
• Massively increased performance
• More accurate since we can use a huge initial search space
Can we combine network knowledge with gene inference?
• Scale free architecture
– Chance of new edges is proportional to existing ones
– Highly connected nodes may well be known to be lethal
• Network motifs
– Constrain the types of sub networks
• Prior Knowledge
– Many sub networks already known
Conclusions
• Network analysis is a big growth area
• Several promising fields starting to converge
– Complex systems analysis
– Using prior knowledge
– Application of advanced machine learning algorithms
– AI approaches show promise
Proteomics
• What is it?
– Reveal protein interactions
– Protein profiling in a sample
• Yeast two hybrid screening
• High throughput 2D PAGE
• Automatic analysis of 2D PAGE
Yeast two hybrid
• Use two mating strains of yeast
• In one strain fuse one set of genes to a transcription factor DNA binding domain
• In the other strain fuse the other set of genes to a transcriptional activating domain
• Where the two proteins bind, you get a functional transcription factor.
Data obtained
• Depending on sample, you get a profile of potential protein-protein interactions that can be used to predict functional protein complexes.
• False positives are frequent.
• Can be confirmed by affinity purification etc.
Proteomics - PAGE techniques
• Proteins can be run through a polyacrylamide gel (similar to that used to separate DNA molecules).
• Can be separated based on charge or mass.
• 2D PAGE separates a protein extract in two dimensions.
2D PAGE
[Figure: 2D gel, axes: charge and mass]
DiGE
• We want to compare two protein extracts in the way we can compare two mRNA extracts from two paired samples
• Differential Gel Electrophoresis
• Take two protein extracts, label one green and one red (Cy3 and Cy5)
DiGE
• The ratio of green:red shows the ratio of the protein across the samples.
Identifying a protein ‘blob’
• Unlike DNA microarrays, we do not normally know the identity of each ‘spot’ or blob on a protein gel.
• We do know two things about the proteins that comprise a blob:
– mass
– charge
Identifying a protein ‘blob’
• Mass and charge are themselves insufficient for positive identification.
• Recover the protein from selected blobs (this can be automated)
• Trypsin digest the proteins extracted from the blob (chops into small pieces)
Identifying a protein ‘blob’
• Take the small pieces and run them through a mass spectrometer. This gives an accurate measurement of the mass of each.
• The total mass of the protein plus the masses of its trypsin-digested fragments is often enough to identify a protein.
• The mass spec is known as a MALDI-TOF
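The identification step can be sketched as an in-silico digest followed by a mass comparison. The code below assumes the common trypsin rule (cleave after K or R unless the next residue is P), uses standard monoisotopic residue masses, and ignores missed cleavages and chemical modifications. The protein sequence and the observed masses are made up, and real search tools score matches against a whole sequence database.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def trypsin_digest(protein):
    """Split after K or R, except when the next residue is P."""
    # Zero-width split; requires Python 3.7 or later.
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def matched_masses(observed, protein, tolerance=0.01):
    """Return observed masses within `tolerance` Da of a predicted fragment."""
    predicted = [peptide_mass(p) for p in trypsin_digest(protein)]
    return [m for m in observed if any(abs(m - p) <= tolerance for p in predicted)]


candidate = "GASKPVRTLKAG"  # hypothetical candidate sequence
fragments = trypsin_digest(candidate)
print(fragments)
print([round(peptide_mass(p), 4) for p in fragments])
print(matched_masses([146.069, 500.0], candidate))
```

The more observed fragment masses a candidate protein explains, the more likely it is to be the protein in the blob, which is why a mixture of proteins makes this approach unreliable.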
Identifying a protein ‘blob’
MALDI-TOF output from myosin
Good for rapid identification of single proteins.
Does not work well with protein mixtures.
Identifying a protein ‘blob’
• When MALDI-derived information is insufficient, we need peptide sequence:
• Q-TOF allows short fragments of peptide sequences to be obtained.
• We now have a total mass for the protein, an exact mass for each trypsin fragment and some partial amino acid sequence for these fragments.