RANDOM SAMPLING: PRACTICAL APPLICATIONS IN eDISCOVERY
WELCOME
Thank you for joining
Numerous diverse attendees
Today’s topic and presenters
Question submission for later response
You will receive slides, recording and survey tomorrow
Coming up next month: SharePoint webinar
SPEAKERS
Matthew Verga
– Director, Content Marketing and eDiscovery Strategy
TODAY’S TOPICS
Sampling’s opaque ubiquity and the dark ages of discovery
Finding out what’s in a new dataset, aka estimating prevalence
Finding out how good a search string is, aka testing classifiers
Finding out how good your reviewers are, aka quality control
Finding out how much stuff you missed, aka measuring elusion
SAMPLING’S OPAQUE UBIQUITY AND THE DARK AGES OF DISCOVERY
INTRODUCTION
The topic of “sampling” comes up constantly when referring to:
– collections
– early case assessment
– review
o both human
o and technology-assisted
Before review software incorporated sophisticated sampling tools, practitioners were taking samples manually
INTRODUCTION, CONT.
When I started out, nearly 8 years ago, the best wisdom:
– Included iterative testing of search strings by partners or senior attorneys, who would informally sample the results of each revised search string to inform their next revision
– Suggested employing a 3-pass document review process with successively more senior attorneys performing each pass:
o The first pass reviewed everything
o The second pass re-reviewed a random 10% sample
o And the third pass re-reviewed a random 5% sample
INTRODUCTION, CONT.
But, but, but:
– Why is a search that returns more documents than expected invalid?
– How many search results are enough to sample?
– Why re-review 10% and 5%?
– What’s the basis?
o (Answer Key: It’s not necessarily; Not just however many “feels right”; Mostly because it’s what was done before; There isn’t much of one)
Turns out, law schools need to add statistics courses
FINDING OUT WHAT’S IN A NEW DATASET, AKA ESTIMATING PREVALENCE
ESTIMATING PREVALENCE
Finding out what’s in a new, unknown dataset
Prevalence
– Prevalence is the portion of a dataset that is relevant to a particular information need
– For example, if one third of a dataset was relevant in a case, the prevalence of relevant materials would be 33%
– Always known at the end of a document review project
Why estimate it at the beginning?
ESTIMATING PREVALENCE, CONT.
Knowing the prevalence of relevant materials can guide the selection of culling and review techniques to be employed
– (It can also provide a measuring stick for overall progress)
Knowing the prevalence of different subclasses of materials can guide decisions about resource allocation and prioritization
– (e.g., associates vs. contract attorneys vs. LPO)
Knowing the prevalence of specific features facilitates more accurate estimation of project costs
– (e.g., volume to review, volume to redact, volume to privilege log, etc.)
ESTIMATING PREVALENCE, CONT.
Estimating prevalence of one or more features of a new, unknown dataset is fundamentally valuable because it provides discovery intelligence for data-driven decision making, replacing gut-feelings and anecdotes with data and knowledge
ESTIMATING PREVALENCE, CONT.
Now that we know why estimating prevalence can be valuable, how do we do it?
Steps:
– Step 1: Identify your sampling frame
– Step 2: Determine your needed sample size
– Step 3: Take and review your simple random sample
– Step 4: Calculate your prevalence estimate
ESTIMATING PREVALENCE, CONT.
Step 1: Identify your sampling frame
Generally, the same pool that would be subjected to review:
– A pool with system files removed (de-NISTed)
– A pool with documents outside of any applicable date range removed
– A pool that has been de-duplicated
– A pool to which any other obvious, objective culling criteria have been applied
o (e.g., court-mandated key word or custodian filtering)
ESTIMATING PREVALENCE, CONT.
Step 2: Determine your sample size
The sample size you should take depends on:
– The strength of the measurement you wish to take
– The size of your sampling frame
– The prevalence of relevant material within the dataset
Let’s look at how each affects sample size
ESTIMATING PREVALENCE, CONT.
Step 2: Determine your sample size, cont.
– The strength of the measurement you want to take
Expressed through two values: confidence level and confidence interval
– Confidence level
o How certain you are about the results you get
o How many times out of 100 you would get the same results
o Typically 90%, 95%, or 99%
– Confidence interval
o How precise your results are
o How much uncertainty there is in your results
o Typically between +/-2% and +/-5%
ESTIMATING PREVALENCE, CONT.
Step 2: Determine your sample size, cont.
– The strength of the measurement you want to take
The higher the confidence level sought, the larger the sample size that will be required
The narrower the confidence interval sought (e.g., +/-2% rather than +/-5%), the larger the sample size that will be required
ESTIMATING PREVALENCE, CONT.
Step 2: Determine your sample size, cont.
– The size of your sampling frame
The larger the sampling frame, the larger the sample size needed, but only up to a point
– Required sample size eventually levels off
– The sample size needed for 1,000,000 documents is not much larger than the sample size needed for 100,000 documents
– Potential for cost savings compared to old methods
ESTIMATING PREVALENCE, CONT.
Step 2: Determine your sample size, cont.
– The prevalence of relevant material within the dataset
The required sample size decreases as the assumed prevalence moves away from 50% in either direction
– When taking a sample to estimate prevalence, prevalence is unknown
– So, the most conservative option (i.e., resulting in the largest sample size) should be used, which is 50%
– Most sampling calculators default to this; on some it cannot be changed
ESTIMATING PREVALENCE, CONT.
Step 2: Determine your sample size, cont.
An Example:
– Sampling frame of 1,000,000
– Desired strength of 95%, +/-2%
– Assumed prevalence of 50%
Resulting Sample Size: 2,396 Documents
Sampling calculators are integrated into most review tools and are also available online
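As a rough illustration of what such a calculator does, here is a minimal Python sketch. It assumes the standard normal-approximation formula for a proportion with a finite population correction; the slides do not say which formula their calculator used, and the function name `sample_size` and its parameters are hypothetical. Under these assumptions it reproduces the 2,396 figure from the example above.

```python
import math
from statistics import NormalDist


def sample_size(frame_size, confidence=0.95, margin=0.02, prevalence=0.5):
    """Sample size for estimating a proportion (assumed formula, see lead-in)."""
    # z-score for a two-sided confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively unlimited sampling frame
    n0 = (z ** 2) * prevalence * (1 - prevalence) / margin ** 2
    # Finite population correction: shrinks n0 for smaller frames
    n = n0 / (1 + (n0 - 1) / frame_size)
    return math.ceil(n)


# The example from the slide: 1,000,000 documents, 95%, +/-2%, 50% prevalence
print(sample_size(1_000_000))  # 2396

# A frame ten times smaller needs only a slightly smaller sample
print(sample_size(100_000))

# Assuming a prevalence away from 50% lowers the required sample size
print(sample_size(1_000_000, prevalence=0.25))
```

Calling the function with a higher confidence level or a smaller margin shows the other two effects described earlier: both increase the required sample size.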
ESTIMATING PREVALENCE, CONT.
Step 3: Take and review your simple random sample
Simple random sample: one in which every document has an equal chance of being selected
– Most modern review programs include sampling tools
– Spreadsheet programs can generate lists of random numbers
Ensure the highest quality review possible, as any errors in the review of the sample will be effectively amplified in the estimations based on that review
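For readers without a review tool at hand, drawing a simple random sample can be sketched in a few lines of Python. This is only an illustration: the document IDs are assumed to be the integers 1 through 1,000,000, and the helper name `draw_simple_random_sample` is hypothetical. A real project would sample the actual control numbers in the sampling frame.

```python
import random


def draw_simple_random_sample(document_ids, sample_size, seed=None):
    """Pick sample_size distinct documents, each with an equal chance."""
    # A seeded generator makes the draw reproducible and documentable
    rng = random.Random(seed)
    # random.sample selects without replacement, so no document repeats
    return rng.sample(document_ids, sample_size)


# Hypothetical sampling frame: document IDs 1 through 1,000,000
frame = range(1, 1_000_001)
sample = draw_simple_random_sample(frame, 2_396, seed=42)
print(len(sample), "documents selected for review")
```

Recording the seed alongside the sample is one way to show later how the sample was drawn.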
ESTIMATING PREVALENCE, CONT.
Step 4: Calculate your prevalence estimate
Resulting Sample Size: 2,396 Documents (95%, +/-2%)
– If review finds 599/2,396 (i.e., 25%) are relevant, then…
o You would have 95% confidence that the overall prevalence of relevant documents is between 23% and 27%
o You would have 95% confidence that between 230,000 and 270,000 of the 1,000,000-document sampling frame are relevant
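The arithmetic behind this step can be sketched as follows. Like the slide, this sketch simply applies the planned +/-2% margin to the observed rate rather than recomputing an interval from the sample itself, which is a simplification. The function name `prevalence_estimate` is hypothetical.

```python
def prevalence_estimate(relevant_in_sample, sample_size, frame_size, margin):
    """Project a sample's relevance rate onto the whole sampling frame."""
    # Observed prevalence in the reviewed sample
    p = relevant_in_sample / sample_size
    # Apply the planned margin, clamped to the valid 0-100% range
    low = max(0.0, p - margin)
    high = min(1.0, p + margin)
    return {
        "prevalence": p,
        "interval": (low, high),
        "documents": (round(low * frame_size), round(high * frame_size)),
    }


result = prevalence_estimate(599, 2_396, 1_000_000, 0.02)
print(f"Estimated prevalence: {result['prevalence']:.0%}")
print(f"Estimated relevant documents: {result['documents'][0]:,} to {result['documents'][1]:,}")
```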
ESTIMATING PREVALENCE, CONT.
Beyond relevant documents, you could measure the prevalence of:
– Privileged documents
– Documents requiring redaction
– Documents presenting HIPAA or FOIA issues
– Documents requiring other special handling
You can use this information to:
– Guide the selection of culling and review techniques
– Provide a measuring stick for overall progress
– Guide decisions about resource allocation
– Estimate project costs more accurately
FINDING OUT HOW GOOD A SEARCH STRING IS, AKA TESTING CLASSIFIERS
TESTING CLASSIFIERS
Finding out how good a search string is
Classifiers
– Tools, mechanisms or processes by which documents are classified into categories like responsive/non-responsive or privileged/non-privileged
– The tools, mechanisms, and processes employed could include:
o Keyword or Boolean searches
o Individual human reviewers
o Overall team review processes
o Machine categorization by latent semantic indexing
o Predictive coding by probabilistic latent semantic analysis
TESTING CLASSIFIERS, CONT.
What’s the value of testing classifiers?
Testing classifiers, like estimating prevalence, is a source of discovery intelligence to guide data-driven decision making
– First, testing classifiers can improve the selection and iterative refinement of search strings and other classifiers
– Second, testing provides a stronger basis for argument for or against search strings or classifiers during pre-trial proceedings
TESTING CLASSIFIERS, CONT.
The efficacy of classifiers is expressed through two values: recall and precision
– Recall is a measurement of how much of the material sought was returned by the classifier
o e.g., if 250,000 relevant documents exist and a search returns 125,000 of them, it has a recall of 50%, finding half of what is sought
o Low recall indicates under-inclusiveness
– Precision is a measurement of how much of the material returned by a classifier is actually wanted
o e.g., if a search returns 150,000 documents of which only 50,000 are relevant, it has a precision of just 33%, having returned 100,000 unwanted items
o Low precision indicates over-inclusiveness
TESTING CLASSIFIERS, CONT.
Testing classifiers before applying them to a full dataset requires the creation of a control set against which they can be tested
A control set is a representative sample of the full dataset that has already been classified by the best reviewers possible so that it can function as a gold standard
The classifier is run against the control set, and its classifications are compared against the experts’
TESTING CLASSIFIERS, CONT.
Example
– A simple random sample of 2,396 documents has been reviewed by subject matter experts for relevance to function as a control set
– A search string proposed by plaintiff is run against the control set
– It returns 1,200 out of the 2,396 documents, a mix of documents deemed relevant and not relevant by the subject matter experts
– How do you sort out the recall and precision?
TESTING CLASSIFIERS, CONT.
| | Relevant (SMEs) | Not Relevant (SMEs) |
|---|---|---|
| Returned (P. Search) | 400 | 800 |
| Not Returned (P. Search) | 600 | 596 |

There are 400 true positives
– Deemed relevant by SME and returned by P. Search
There are 596 true negatives
– Deemed not relevant by SME and not returned by P. Search
There are 800 false positives
– Deemed not relevant by SME but returned by P. Search
There are 600 false negatives
– Deemed relevant by SME but not returned by P. Search
TESTING CLASSIFIERS, CONT.
Calculating recall:
– Recall is the percentage of all relevant documents correctly returned
– 400 (TP) out of 1,000 (TP + FN) = .40 = 40%
– The P. Search has recall of 40%
Calculating precision:
– Precision is the percentage of the returned documents that are relevant
– 400 (TP) out of 1,200 (TP + FP) = .33 = 33%
– The P. Search has precision of 33%
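The same two calculations can be written as a short Python sketch. The counts are the ones from the control-set example; the function name `recall_precision` is hypothetical.

```python
def recall_precision(tp, fp, fn):
    """Recall and precision from true positive, false positive, false negative counts."""
    # Recall: share of all relevant documents that the classifier returned
    recall = tp / (tp + fn)
    # Precision: share of the returned documents that are relevant
    precision = tp / (tp + fp)
    return recall, precision


# Control-set counts for the plaintiff's proposed search
recall, precision = recall_precision(tp=400, fp=800, fn=600)
print(f"recall = {recall:.0%}, precision = {precision:.0%}")
# prints: recall = 40%, precision = 33%
```

True negatives (596 here) do not enter either formula, which is why a search can look good on overall agreement while still having poor recall.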
FINDING OUT HOW GOOD YOUR REVIEWERS ARE, AKA QUALITY CONTROL
QUALITY CONTROL
Finding out how good your reviewers are
Accuracy and error rate in document review
– Accuracy is the measure of how many reviewer determinations are correct
– Error rate is the measure of how many reviewer determinations are incorrect
o Together, accuracy and error rate should total 100%
Like testing classifiers, quality control involves the comparison of one set of classifications (the initial reviewer’s) to another (the more senior reviewer performing quality control review)
QUALITY CONTROL, CONT.
Lot acceptance sampling:
– Methodology employed in pharmaceutical manufacturing, military contract fulfillment, and other high-volume, quality-focused processes
How it works:
– A sampling protocol and maximum acceptable error rate are established
– Each lot has a random sample taken from it for testing
– If the acceptable error rate is exceeded, the entire lot is rejected without further evaluation
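A minimal Python sketch of that accept-or-reject decision, applied to a batch of coded documents, might look like the following. Everything here is illustrative: the function name `evaluate_lot`, the 50-document sample, the 5% threshold, and the made-up batch are assumptions, not values from the presentation.

```python
import random


def evaluate_lot(lot, qc_call, sample_size, max_error_rate, seed=None):
    """Sample a lot and accept or reject it based on the observed error rate.

    lot: dict mapping document ID -> the first-pass reviewer's call
    qc_call: function returning the QC reviewer's call for a document ID
    Assumes the lot is not empty.
    """
    rng = random.Random(seed)
    # Draw a simple random sample of document IDs from the lot
    sampled_ids = rng.sample(sorted(lot), min(sample_size, len(lot)))
    # An error is any sampled document where the two reviewers disagree
    errors = sum(1 for doc_id in sampled_ids if lot[doc_id] != qc_call(doc_id))
    error_rate = errors / len(sampled_ids)
    return {
        "sampled": len(sampled_ids),
        "errors": errors,
        "error_rate": error_rate,
        "accepted": error_rate <= max_error_rate,
    }


# Hypothetical batch of 200 documents in which every 10th call is wrong
truth = {doc_id: doc_id % 3 == 0 for doc_id in range(200)}
batch = {doc_id: (not rel if doc_id % 10 == 0 else rel) for doc_id, rel in truth.items()}

outcome = evaluate_lot(batch, truth.get, sample_size=50, max_error_rate=0.05, seed=1)
print(outcome)
```

A rejected lot would go back for re-review in full, and the per-lot outcomes could be logged by reviewer or team.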
QUALITY CONTROL, CONT.
In the context of document review, the “lot” can correspond to individual reviewer batches or to aggregations, such as an individual reviewer’s total completed review for each day
Statistics on batch rejection can be tracked by reviewer, by team, by source material, or by other useful properties
Choosing not to acknowledge or measure the error rate in a document review project does not mean that it does not exist
FINDING OUT HOW MUCH STUFF YOU MISSED, AKA MEASURING ELUSION
MEASURING ELUSION
Finding out how much stuff you missed
Elusion
– Elusion is a measure of how much relevant material was left behind
– Typically referenced in the context of predictive coding and the portion of a dataset left unreviewed by a technology-assisted review process
An elusion measurement can also be thought of as an estimation of prevalence specific to the remainder pile
MEASURING ELUSION, CONT.
Example
– Even after targeted collection and processing, 1,000,000 documents remain
– Use key word and Boolean searches to cull that 1,000,000 to 250,000 for human document review
– In addition to documenting the culling techniques employed and their rationales, an elusion/prevalence measurement can be taken
MEASURING ELUSION, CONT.
Example, cont.
– For a remainder pile of 750,000 documents
– Reviewing a simple random sample of just 16,229 documents
– Will allow you to measure the elusion
o (aka the prevalence of relevant materials in the remainder)
– With a 99% confidence level
– And a confidence interval of +/-1%
MEASURING ELUSION, CONT.
Example, cont.
– Assuming, hypothetically, that your review of that sample found 5% of that sample of the remainder to be relevant
o You would then be able to say with 99% confidence
o That between 4% and 6% of the remainder is relevant
– (aka 30,000-45,000 documents out of 750,000)
– You would also learn about what types of missed materials exist
o To develop a method of targeted retrieval
o To demonstrate low utility of retrieval
– (i.e., by showing material is duplicative or of low probative value relative to additional expense required for retrieval)
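The numbers in this elusion example can be checked with a short Python sketch. It assumes the common normal-approximation sample size formula with a finite population correction and a 50% assumed rate, which the slides do not spell out, and it applies the planned +/-1% margin directly to the observed 5% as the slide does. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def elusion_sample_size(remainder_size, confidence, margin, assumed_rate=0.5):
    """Sample size for measuring elusion in the remainder (assumed formula)."""
    # z-score for a two-sided confidence level (about 2.576 for 99%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * assumed_rate * (1 - assumed_rate) / margin ** 2
    # Finite population correction for the size of the remainder pile
    return math.ceil(n0 / (1 + (n0 - 1) / remainder_size))


def elusion_range(remainder_size, observed_rate, margin):
    """Range of relevant documents likely left behind in the remainder."""
    low = max(0.0, observed_rate - margin)
    high = min(1.0, observed_rate + margin)
    return round(low * remainder_size), round(high * remainder_size)


print(elusion_sample_size(750_000, confidence=0.99, margin=0.01))  # 16229
print(elusion_range(750_000, observed_rate=0.05, margin=0.01))  # (30000, 45000)
```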
REVIEW AND PREVIEW
REVIEW
Estimating prevalence
– Discovery intelligence
– Four steps
o Identify sampling frame
o Determine sample size (measurement strength)
o Take and review simple random sample
o Calculate prevalence estimate
Testing classifiers
– Evaluated for recall and precision
o Recall shows how under-inclusive a classifier is
o Precision shows how over-inclusive a classifier is
– Must be tested against control set
REVIEW, CONT.
Quality control of human document review
– Evaluated for accuracy and error rate
– Lot acceptance sampling
– Acceptable error rate
Measuring elusion
– Elusion is the prevalence of relevant materials in the remainder
– Measured the same way as estimating prevalence
– Useful both for process validation and for defensibility
ENGAGE WITH MODUS
Modus email coming tomorrow with slides, recording, survey and invite to next webinar
Visit us online at http://discovermodus.com/webinars/ for more information on the next webinar on January 27th at 1:00PM EST entitled:
– “Meeting the Microsoft SharePoint Discovery Challenge”
Make sure to visit our website for valuable white papers, blogs and other information at www.discovermodus.com
– White Paper: “Practical Applications of Random Sampling in eDiscovery”
Thank you!