Dr Marion Wooldridge, 26th July 2002

Data quality, combining multiple data sources, and distribution fitting
Selected points and illustrations!



TRANSCRIPT

Page 1: Dr Marion Wooldridge 26th July 2002

Dr Marion Wooldridge
26th July 2002

Data quality, combining multiple data sources, and distribution fitting

Selected points and illustrations!

Page 2: Dr Marion Wooldridge 26th July 2002

Outline of talk

Data quality: what do we need?

Combining multiple data sources: when can we do it? An example explained

Fitting distributions: how do we do it? An example explained

Page 3: Dr Marion Wooldridge 26th July 2002

What do we mean by data?

Experimental results: numerical

Field data: numerical

Information on pathways/processes: words and numbers

Expert opinion: words to convert to numbers?!

Page 4: Dr Marion Wooldridge 26th July 2002

Why do we need data?

Model construction: risk pathways

what actually happens? do batches get mixed? is the product heated? etc.

Model inputs: estimate probabilities

estimate uncertainties; estimate variability

Page 5: Dr Marion Wooldridge 26th July 2002

Which data should we use?

The best available! However...

a risk assessment is pragmatic
a risk assessment is 'iterative'
there are always data deficiencies

these are often 'crucial'

So what does this tell us?...

Page 6: Dr Marion Wooldridge 26th July 2002

What do we mean by pragmatic?

Purpose of risk assessment...

To give useful info to the risk manager to aid decisions, usually in the short term

This leads to time constraints. This generally means using currently available data for decisions 'now', however incomplete or uncertain that data may be.

Page 7: Dr Marion Wooldridge 26th July 2002

What do we mean by 'iterative'?

Often several stages, with different purposes

'Is there really a problem?' - rapid preliminary assessment

'OK, so give me the answer' - refine assessment

'Did we get it right?' - revisit assessment if/when new data

Page 8: Dr Marion Wooldridge 26th July 2002

The minimum data quality required...

will be different at each stage. So, why is that?...

Page 9: Dr Marion Wooldridge 26th July 2002

'Is there really a problem?' Risk manager needs a rapid answer

'preliminary' risk assessment: identify data rapidly available

may be incomplete, anecdotal, old, from a different country, about a different strain, from a different species, etc. etc.

Still allows a decision to be made: a problem...

is highly likely - do more now!
may develop - keep a watching brief
is highly unlikely - but never zero risk!

Page 10: Dr Marion Wooldridge 26th July 2002

Stage 1: Conclusion...

Sometimes, poor quality data may be useful data

is it 'fit-for-purpose'?

Don't just throw it out!

Page 11: Dr Marion Wooldridge 26th July 2002

'OK, so give me the answer'

refine risk assessment: set up model

data on risk pathways; model input data

utilise 'best' data found; identify 'real' gaps and uncertainties; elicit expert opinions

data should be 'best' currently available, with time allowed to search all reasonable sources

still often incomplete and uncertain; checked by peer review

Page 12: Dr Marion Wooldridge 26th July 2002

'So the risk is...'

The best currently available data gives... the best currently available estimate

Which is... the best the risk manager can ever have 'now'!

Even if... it is still based on data gaps and uncertainties

which it will be! the 'best' data may still be poor data; allows targeted future data collection - may be lengthy

Page 13: Dr Marion Wooldridge 26th July 2002

Stage 2: Conclusion...

If a choice: need to identify the best data available

What makes it the best data?

What do we do if we can't decide? Multiple data sources for model!

What do we do with crucial data gaps? Often need expert opinion

How do we turn that into a number?

Page 14: Dr Marion Wooldridge 26th July 2002

'Did we get it right?'

revisit assessment if/when new data available

targeted data collection; may be years away...; but allows quality to be specified

not just 'best' but 'good'; will minimise uncertainty; should describe variability

Page 15: Dr Marion Wooldridge 26th July 2002

What makes data 'good'?

Having said all that - some data is 'good'! Two aspects:

Intrinsic features of the data: universally essential for high quality

Applicability to current situation: data selection affects quality of the assessment and output

General principles...

Page 16: Dr Marion Wooldridge 26th July 2002

High quality data should be...

Complete: for assessment in hand; required level of detail

Relevant: e.g. right bug, country, management system, date, etc.; nothing irrelevant

Succinct and transparent; fully referenced; presented in a logical sequence

Page 17: Dr Marion Wooldridge 26th July 2002

High quality data should have...

Full information on data source or provenance

Full reference if paper or similar
Name if pers. comm. or unpublished
Copy of web page or other ephemeral source
Date of data collection/elicitation
Affiliation and/or funding source of provider

Page 18: Dr Marion Wooldridge 26th July 2002

High quality data should have...

A description of the level of uncertainty and the amount of variability

uncertainty = lack of knowledge; variability = real-life situation

Units, where appropriate

Raw numerical data, where appropriate

Page 19: Dr Marion Wooldridge 26th July 2002

High quality study data should have...

Detailed information on study design, e.g.

Experimental or field based
Details of sample and sampling frame, including:

livestock species (inc. scientific name)/product definition/population sub-group definition

source (country, region, producer, retailer, etc.)
sample selection method (esp. for infection: clinical cases or random selection)
population size

season of collection
portion description or size
method of sample collection

Page 20: Dr Marion Wooldridge 26th July 2002

High quality microbiological data should have...

Information on microbiological methods, including:

Pathogen species, subspecies, strain

may include antibiotic resistance pattern

tests used, including any variation from published methods

sensitivity and specificity of tests; units used; precision of measurement

Page 21: Dr Marion Wooldridge 26th July 2002

High quality numerical data should have...

Results as raw data, including:

number tested
results given for all samples tested
for pathogens, number of micro-organisms, not just positive/negative

Page 22: Dr Marion Wooldridge 26th July 2002

A note on comparability...

High quality data sets can easily be checked for comparability

as they contain sufficient detail
are they describing or measuring the same thing?
are the levels of uncertainty similar?

Poor quality data sets are difficult to compare: they lack detail

may never know if describing or measuring the same thing! or they may be exactly the same data!! difficult/unwise to combine!

Page 23: Dr Marion Wooldridge 26th July 2002

...and on homogeneity...

homogeneity of input data aids comparability of model output

homogeneity achieved by, e.g., standardised methods

for sampling, testing, etc.

standard units
standard definitions

for pathogen, host, product, portion size, etc.

Page 24: Dr Marion Wooldridge 26th July 2002

Multiple data sources - why?

No data specific to the problem - but several studies with some relevance

e.g. one old, one from a different country, etc.

Several studies directly relevant to the problem - all information needs inclusion

e.g. regional prevalence studies available; a national estimate needed

Expert opinion - but a number of experts; may need to combine with other data

Page 25: Dr Marion Wooldridge 26th July 2002

Multiple data sources... how?

Model all separately: 'what-if' scenarios; several outputs; for data inputs and 'model uncertainty'

Fit a single distribution: point values

Weight alternatives: equally or differentially? expert opinion on weights? sample by weight distribution - single output; for data inputs and 'model uncertainty' (see the sketch below)
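
A minimal sketch of the 'sample by weight distribution' option, assuming Python with numpy rather than the spreadsheet tools used for the original work: on each Monte Carlo iteration one candidate data source is chosen with probability equal to its weight, and the model input is drawn from that source's distribution, giving a single combined output. All distributions and weights below are illustrative placeholders, not data from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical alternative input distributions (e.g. prevalence from 3 studies)
# and their weights; none of these values come from the talk's data.
sources = [
    lambda n: rng.beta(5, 20, n),   # study A
    lambda n: rng.beta(9, 15, n),   # study B
    lambda n: rng.beta(2, 30, n),   # study C
]
weights = np.array([0.5, 0.3, 0.2])

n_iter = 100_000
choice = rng.choice(len(sources), size=n_iter, p=weights)  # pick a source per iteration
samples = np.empty(n_iter)
for i, draw in enumerate(sources):
    mask = choice == i
    samples[mask] = draw(mask.sum())                        # sample from the chosen source

print(f"single combined input: mean={samples.mean():.3f}, sd={samples.std():.3f}")
```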

Page 26: Dr Marion Wooldridge 26th July 2002

Weighting example

Overall aim: to estimate the probability that a random bird from the GB broiler poultry flock will be campylobacter positive at the point of slaughter

Current need: to estimate the probability that a random flock is positive

Data available: flock-positive prevalence data from 4 separate studies

Ref: Hartnett E, Kelly L, Newell D, Wooldridge M, Gettinby G (2001). The occurrence of campylobacter in broilers at the point of slaughter, a quantitative risk assessment. Epidemiology and Infection 127: 195-206.

Page 27: Dr Marion Wooldridge 26th July 2002

Probability a flock is positive, P_fp, per data set

Raw data used to estimate P_fp for each data set

'standard' method: Beta distribution

r = positive flocks; s = flocks sampled

P_fp = Beta(r + 1, s - r + 1)

for P1, P2, P3, P4; describes uncertainty per data set

[Figure: the four Beta distributions (p1fp, p2fp, p3fp, p4fp) plotted as density f(x) against the probability a flock is positive, from 0 to 1.]
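
A minimal sketch of the Beta(r + 1, s - r + 1) step above, assuming Python with scipy in place of the spreadsheet tools used in the original work; the r and s counts below are made-up placeholders, not the study data.

```python
from scipy import stats

# hypothetical (r = positive flocks, s = flocks sampled) for each study
studies = {
    "P1": (12, 30),
    "P2": (8, 25),
    "P3": (40, 60),
    "P4": (5, 20),
}

for name, (r, s) in studies.items():
    p_fp = stats.beta(r + 1, s - r + 1)      # uncertainty distribution for P_fp
    lo, hi = p_fp.ppf([0.025, 0.975])        # 95% credible interval
    print(f"{name}: mean={p_fp.mean():.2f}, 95% interval=({lo:.2f}, {hi:.2f})")
```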

Page 28: Dr Marion Wooldridge 26th July 2002

Probability a random GB flock is positive - weighting method used

Data sets: 2 x poultry companies; 1 x colleagues' (published) epidemiological study; 1 x another published study

Other data available: market share, by total bird numbers, for studies 1, 2 and 3

Assumption made: study 4 was a 'rest of market' approximation

Weighting method: uses all available info; better than 'equal weighting'

Page 29: Dr Marion Wooldridge 26th July 2002

So - weights assigned...

based on market share per study:

P1 + P2 = 35%; w1 + w2 = 0.35 (combined here as the underlying data are confidential)
P3 = 50%; w3 = 0.50
P4 = 15%; w4 = 0.15

Probability a random flock is positive, P_fp:

P_fp = (P1_fp × w1) + (P2_fp × w2) + (P3_fp × w3) + (P4_fp × w4)
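
A minimal sketch of this weighted combination, assuming Python/numpy in place of the original spreadsheet model: on each iteration, each study's P_fp is drawn from its Beta(r + 1, s - r + 1) uncertainty distribution and the draws are combined with the weights above. The r, s counts are hypothetical placeholders, and studies 1 and 2 are merged here, mirroring the slide.

```python
import numpy as np

rng = np.random.default_rng(1)

# (weight, hypothetical r = positive flocks, hypothetical s = flocks sampled)
studies = {
    "P1+P2": (0.35, 20, 55),   # studies 1 and 2 merged, as on the slide
    "P3":    (0.50, 40, 60),
    "P4":    (0.15, 5, 20),
}

n_iter = 100_000
p_fp = np.zeros(n_iter)
for name, (w, r, s) in studies.items():
    p_fp += w * rng.beta(r + 1, s - r + 1, n_iter)   # weighted draw from each study

lo, hi = np.quantile(p_fp, [0.025, 0.975])
print(f"P_fp for a random GB flock: mean={p_fp.mean():.2f}, "
      f"95% interval=({lo:.2f}, {hi:.2f})")
```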

Page 30: Dr Marion Wooldridge 26th July 2002

So - when to combine? Depends on the info needed

in the example, for a random GB flock: must combine

but this loses info; in the example, no probability by source

Depends on the info available, and on assumptions about relevance/comparability; e.g. two studies, one 10 years old and one 5 years old:

if prevalence studies - leave separate? trend info? 'what-if' scenarios better?
if effects of heat on the organism - will this have altered? check method, detection efficacy, strain, etc. - but the age of the study per se is not relevant

Assessor's judgement

Page 31: Dr Marion Wooldridge 26th July 2002

Fitting distributions

Could fill a book (has done!)

Discrete or continuous?
Limited values 'v' any value (within the range); e.g.
number of chickens in a flock - discrete (1, 2, 3... etc.)
chicken bodyweight - continuous (within range min-max)

Parametric or non-parametric?
Theoretically described 'v' directly from data; e.g.
incubation period - theoretically lognormal - parametric
percentage of free-range flocks by region - direct from data

Truncated or not?
For unbounded distributions - e.g. incubation period > lifespan? (see the sketch below)
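
A minimal sketch of truncating an unbounded parametric distribution at a biological bound, assuming Python/numpy; the lognormal parameters and lifespan below are illustrative values only, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8          # hypothetical lognormal parameters (log-days)
lifespan = 60.0               # hypothetical upper bound (days)

draws = rng.lognormal(mu, sigma, size=100_000)
truncated = draws[draws <= lifespan]          # reject biologically impossible values
print(f"kept {truncated.size / draws.size:.1%} of draws; "
      f"max retained = {truncated.max():.1f} days")
```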

Page 32: Dr Marion Wooldridge 26th July 2002

Where do we start?

Discrete or continuous?
Inspect the data: whole numbers = limited values = discrete

Parametric or non-parametric?
Use parametric with care! - usually only if:

theoretical basis known and appropriate
has proved to be accurate even if theory undemonstrated
parameters define the distribution

non-parametric: generally more useful/appropriate for biological process data

for the selected distribution - biological plausibility (at least!)

gross error check...

Page 33: Dr Marion Wooldridge 26th July 2002

Distribution fitting example...

Aim: to estimate, for a random GB broiler flock at slaughter, its probable age

Data available: age at slaughter, in weeks, for a large number of GB broiler flocks

Ref: Hartnett E (2001). Human infection with Campylobacter spp. from chicken consumption: a quantitative risk assessment. PhD Thesis, Department of Statistics and Modelling Science, University of Strathclyde. 322 pages. Funded by, and work undertaken in, the Risk Research Department, The Veterinary Laboratories Agency, UK.

Page 34: Dr Marion Wooldridge 26th July 2002

Discrete or continuous?
Inspect the data... age given in whole weeks - but:

no reason why slaughter should be at specific weekly intervals
time is a continuous variable

Parametric or non-parametric?
No theoretical basis

non-parametric more appropriate

Other considerations: slaughter for meat

shouldn't be too young (too small) or too old (industry economics); suggests bounded distribution

Sequence of steps: 1. Decisions...

Page 35: Dr Marion Wooldridge 26th July 2002

2. What does the distribution look like? Scatter plot drawn...

Suggests triangular 'appearance'

[Figure: frequency plot of age at slaughter (x-axis 25 to 75, frequency 0 to 12), showing a roughly triangular shape.]

Page 36: Dr Marion Wooldridge 26th July 2002

3. Is this appropriate?

The Triangular distribution is... continuous, non-parametric, bounded

Sounds good so far!

'Checked' in BestFit (Palisade) - only AFTER the logic was considered - and it gave Triang as the best fit

Page 37: Dr Marion Wooldridge 26th July 2002

[Figure: comparison of the input distribution and the fitted Triangular distribution, density plotted over the range 28.0 to 64.0; legend: Input, Triang.]

4. Check: Rank 1 using Kolmogorov-Smirnov test in BestFit
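
A minimal sketch of steps 3 and 4 using Python/scipy in place of BestFit (Palisade): fit a Triangular distribution to age-at-slaughter data and check it with a Kolmogorov-Smirnov test. The ages below are simulated placeholders, not the study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
ages = rng.triangular(left=28, mode=42, right=64, size=500)   # stand-in data

c, loc, scale = stats.triang.fit(ages)             # maximum-likelihood Triang fit
ks = stats.kstest(ages, 'triang', args=(c, loc, scale))
print(f"fitted min={loc:.1f}, mode={loc + c * scale:.1f}, max={loc + scale:.1f}")
print(f"K-S statistic={ks.statistic:.3f}, p-value={ks.pvalue:.2f}")
```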

Page 38: Dr Marion Wooldridge 26th July 2002

Conclusion...

Use Triang in model! And it was...

Note: this was also an example of combining (point-value) data from multiple sources by fitting a single distribution

Page 39: Dr Marion Wooldridge 26th July 2002

Summary...

Acceptable data quality: requires judgement

Multiple data sets - to combine or not? requires judgement

Distribution - the most appropriate? requires judgement

Risk assessment is art plus science: assessor frequently uses judgement and makes assumptions

Only safeguard: transparency!