random sampling practical applications in ediscovery

RANDOM SAMPLING: PRACTICAL APPLICATIONS IN eDISCOVERY

WELCOME

Thank you for joining

Numerous diverse attendees

Today’s topic and presenters

Question submission for later response

You will receive slides, recording and survey tomorrow

Coming up next month – SharePoint webinar

SPEAKERS

Matthew Verga – Director, Content Marketing and eDiscovery Strategy

TODAY’S TOPICS

Sampling’s opaque ubiquity and the dark ages of discovery

Finding out what’s in a new dataset, aka estimating prevalence

Finding out how good a search string is, aka testing classifiers

Finding out how good your reviewers are, aka quality control

Finding out how much stuff you missed, aka measuring elusion

SAMPLING’S OPAQUE UBIQUITY AND THE DARK AGES OF DISCOVERY

INTRODUCTION

The topic of “sampling” comes up constantly when referring to:
– collections
– early case assessment
– and review
  o both human
  o and technology-assisted

Before review software incorporated sophisticated sampling tools, practitioners were taking samples manually

INTRODUCTION, CONT.

When I started out, nearly 8 years ago, the best wisdom:

– Included iterative testing of search strings by partners or senior attorneys, who would informally sample the results of each revised search string to inform their next revision

– Suggested employing a 3-pass document review process with successively more senior attorneys performing each pass:
  o The first pass reviewed everything
  o The second pass re-reviewed a random 10% sample
  o And the third pass re-reviewed a random 5% sample

INTRODUCTION, CONT.

But, but, but:
– Why is a search that returns more documents than expected invalid?
– How many search results are enough to sample?
– Why re-review 10% and 5%?
– What’s the basis?
  o (Answer Key: It’s not necessarily; Not just however many “feels right”; Mostly because it’s what was done before; There isn’t much of one)

Turns out, law schools need to add statistics courses

FINDING OUT WHAT’S IN A NEW DATASET, AKA ESTIMATING PREVALENCE

ESTIMATING PREVALENCE

Finding out what’s in a new, unknown dataset

Prevalence
– Prevalence is the portion of a dataset that is relevant to a particular information need
– For example, if one third of a dataset was relevant in a case, the prevalence of relevant materials would be 33%
– Always known at the end of a document review project

Why estimate it at the beginning?

ESTIMATING PREVALENCE, CONT.

Knowing the prevalence of relevant materials can guide the selection of culling and review techniques to be employed
– (It can also provide a measuring stick for overall progress)

Knowing the prevalence of different subclasses of materials can guide decisions about resource allocation and prioritization
– (e.g., associates vs. contract attorneys vs. LPO)

Knowing the prevalence of specific features facilitates more accurate estimation of project costs
– (e.g., volume to review, volume to redact, volume to privilege log, etc.)


Estimating prevalence of one or more features of a new, unknown dataset is fundamentally valuable because it provides discovery intelligence for data-driven decision making, replacing gut-feelings and anecdotes with data and knowledge


Now that we know why estimating prevalence can be valuable, how do we do it?

Steps:
– Step 1: Identify your sampling frame
– Step 2: Determine your needed sample size
– Step 3: Take and review your simple random sample
– Step 4: Calculate your prevalence estimate


Step 1: Identify your sampling frame

Generally, the same pool that would be subjected to review:
– A pool with system files removed (de-NISTed)
– A pool with documents outside of any applicable date range removed
– A pool that has been de-duplicated
– A pool to which any other obvious, objective culling criteria have been applied
  o (e.g., court mandated key word or custodian filtering)
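The culling steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the document fields (`id`, `hash`, `date`), the `build_sampling_frame` helper, and the set of known system-file hashes are all hypothetical stand-ins for what a processing tool would supply.

```python
from datetime import date

def build_sampling_frame(documents, system_file_hashes, start, end):
    """Apply objective culling: de-NIST, date range, then de-duplication."""
    seen_hashes = set()
    frame = []
    for doc in documents:
        if doc["hash"] in system_file_hashes:    # de-NIST: known system file
            continue
        if not (start <= doc["date"] <= end):    # outside applicable date range
            continue
        if doc["hash"] in seen_hashes:           # exact duplicate already kept
            continue
        seen_hashes.add(doc["hash"])
        frame.append(doc)
    return frame

# Hypothetical collection of five documents
documents = [
    {"id": 1, "hash": "aaa", "date": date(2012, 3, 1)},   # kept
    {"id": 2, "hash": "aaa", "date": date(2012, 3, 1)},   # duplicate of 1
    {"id": 3, "hash": "sys", "date": date(2012, 4, 1)},   # system file
    {"id": 4, "hash": "bbb", "date": date(2009, 1, 1)},   # out of date range
    {"id": 5, "hash": "ccc", "date": date(2013, 6, 30)},  # kept
]
frame = build_sampling_frame(
    documents, {"sys"}, date(2011, 1, 1), date(2013, 12, 31)
)
print([doc["id"] for doc in frame])
```

Whatever survives these filters is the pool the sample is drawn from, so estimates apply to that pool only.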


Step 2: Determine your sample size

The sample size you should take depends on:
– The strength of the measurement you wish to take
– The size of your sampling frame
– The prevalence of relevant material within the dataset

Let’s look at how each affects sample size


Step 2: Determine your sample size, cont.
– The strength of the measurement you want to take

Expressed through two values: confidence level and confidence interval
– Confidence level
  o How certain you are about the results you get
  o How many times out of 100 repeated samples would give you a result within the stated interval
  o Typically 90%, 95%, or 99%
– Confidence interval (also called margin of error)
  o How precise your results are
  o How much uncertainty there is in your results
  o Typically between +/-2% and +/-5%


Step 2: Determine your sample size, cont.
– The strength of the measurement you want to take

The higher the confidence level sought, the larger the sample size that will be required

The narrower the confidence interval sought, the larger the sample size that will be required
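To see both effects in numbers, here is a minimal sketch. It assumes the standard normal-approximation formula for a proportion and a very large sampling frame; the slides do not say which formula any particular calculator uses, and `base_sample_size` is a name invented for this illustration.

```python
import math
from statistics import NormalDist

def base_sample_size(confidence_level, margin, prevalence=0.5):
    """Sample size for a very large frame (no adjustment for frame size)."""
    # Two-sided z-score for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return math.ceil(z ** 2 * prevalence * (1 - prevalence) / margin ** 2)

for level in (0.90, 0.95, 0.99):
    for margin in (0.05, 0.02):
        size = base_sample_size(level, margin)
        print(f"{level:.0%}, +/-{margin:.0%}: {size:,} documents")
```

Because the interval is squared in the denominator, tightening it costs far more documents than raising the confidence level does.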


Step 2: Determine your sample size, cont.
– The size of your sampling frame

The larger the sampling frame, the larger the sample size needed, but only up to a point
– Required sample size eventually levels off
– The sample size needed for 1,000,000 documents is not much larger than the sample size needed for 100,000 documents
– Potential for cost savings compared to old methods
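The leveling off can be illustrated with a short sketch, assuming the same normal-approximation formula plus a finite population correction (an assumption about how calculators typically work, not something the slides specify). The `sample_size` helper is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(frame_size, confidence_level=0.95, margin=0.02, prevalence=0.5):
    """Required sample size, adjusted for the size of the sampling frame."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n0 = z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    # Finite population correction: matters less and less as the frame grows
    return math.ceil(n0 / (1 + (n0 - 1) / frame_size))

for frame_size in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{frame_size:>10,} documents -> sample of {sample_size(frame_size):,}")
```

At 95%, +/-2%, going from 100,000 to 1,000,000 documents adds only a few dozen documents to the sample under these assumptions.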


Step 2: Determine your sample size, cont.
– The prevalence of relevant material within the dataset

Sample size decreases as prevalence increases or decreases from 50%
– When taking a sample to estimate prevalence, prevalence is unknown
– So, the most conservative option (i.e., resulting in the largest sample size) should be used, which is 50%
– Most sampling calculators default to this; on some it is not variable
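A quick sketch of why 50% is the conservative choice, again assuming the normal-approximation formula with a finite population correction. The `sample_size` helper and the 1,000,000-document frame are illustrative assumptions.

```python
import math
from statistics import NormalDist

def sample_size(frame_size, confidence_level, margin, prevalence):
    """Required sample size for a given assumed prevalence."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # prevalence * (1 - prevalence) peaks at 0.25 when prevalence is 50%
    n0 = z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / frame_size))

sizes = {
    p: sample_size(1_000_000, 0.95, 0.02, p)
    for p in (0.10, 0.25, 0.50, 0.75, 0.90)
}
for prevalence, size in sizes.items():
    print(f"assumed prevalence {prevalence:.0%}: sample of {size:,}")
```

The required size is symmetric around 50%, so assuming 25% or 75% gives the same, smaller, sample.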


Step 2: Determine your sample size, cont.

An Example:
– Sampling frame of 1,000,000
– Desired strength of 95%, +/-2%
– Assumed prevalence of 50%

Resulting Sample Size: 2,396 Documents

Sampling calculators are integrated into most review tools and are also available online
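For those who want to check a calculator by hand, the 2,396 figure is consistent with the normal-approximation formula plus a finite population correction. This is one common approach, assumed here; the symbols are labels chosen for this illustration: z is the z-score for the confidence level (about 1.96 for 95%), p the assumed prevalence, e the confidence interval as a decimal, and N the sampling frame size.

```latex
n_0 = \frac{z^2 \, p(1-p)}{e^2}
    = \frac{1.96^2 \times 0.5 \times 0.5}{0.02^2}
    \approx 2401
\qquad
n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}
  = \frac{2401}{1 + \frac{2400}{1{,}000{,}000}}
  \approx 2395.3
```

Rounding up to the next whole document gives 2,396.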


Step 3: Take and review your simple random sample

Simple random sample: one in which every document has an equal chance of being selected
– Most modern review programs include sampling tools
– Spreadsheet programs can generate lists of random numbers

Ensure the highest quality review possible, as any errors in the review of the sample will be effectively amplified in the estimations based on that review
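If your tool lacks a sampling feature, a simple random sample can also be drawn with a few lines of script. A minimal sketch, assuming documents are identified by sequential control numbers; the `draw_simple_random_sample` helper and the seed value are hypothetical.

```python
import random

def draw_simple_random_sample(document_ids, sample_size, seed=None):
    """Pick sample_size documents without replacement, each equally likely."""
    rng = random.Random(seed)  # a recorded seed makes the draw reproducible
    return rng.sample(document_ids, sample_size)

# Hypothetical sampling frame: control numbers 1 through 1,000,000
frame = range(1, 1_000_001)
sample = draw_simple_random_sample(frame, 2_396, seed=20150127)
print(len(sample), "documents selected; first five:", sorted(sample)[:5])
```

Recording the seed alongside the sample lets you document exactly how the sample was drawn if the process is later questioned.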


Step 4: Calculate your prevalence estimate

Resulting Sample Size: 2,396 Documents (95%, +/-2%)
– If review finds 599/2,396 (i.e., 25%) are relevant, then…
  o You would have 95% confidence that the overall prevalence of relevant documents is between 23% and 27%
  o You would have 95% confidence that between 230,000 and 270,000 of the 1,000,000 document sampling frame are relevant
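The arithmetic behind that estimate can be sketched as follows. The helper names are hypothetical, and the second function uses the normal approximation to show the margin actually observed at 25% prevalence, which is an addition for illustration rather than something stated on the slide.

```python
import math
from statistics import NormalDist

def prevalence_estimate(relevant_in_sample, sample_size, frame_size, margin):
    """Point estimate, percentage range, and projected document range."""
    p = relevant_in_sample / sample_size
    low, high = max(0.0, p - margin), min(1.0, p + margin)
    return p, (low, high), (round(low * frame_size), round(high * frame_size))

def observed_margin(relevant_in_sample, sample_size, confidence_level=0.95):
    """Margin implied by the observed prevalence (normal approximation)."""
    p = relevant_in_sample / sample_size
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return z * math.sqrt(p * (1 - p) / sample_size)

p, pct_range, doc_range = prevalence_estimate(599, 2_396, 1_000_000, 0.02)
print(f"prevalence {p:.0%}, range {pct_range[0]:.0%} to {pct_range[1]:.0%}")
print(f"projected relevant documents: {doc_range[0]:,} to {doc_range[1]:,}")
print(f"observed margin: +/-{observed_margin(599, 2_396):.2%}")
```

Because the sample was sized for the worst case of 50% prevalence, the margin observed at 25% comes out a little tighter than the planned +/-2%.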


Beyond relevant documents, you could measure the prevalence of:
– Privileged documents
– Documents requiring redaction
– Documents presenting HIPAA or FOIA issues
– Documents requiring other special handling

You can use this information to:
– Guide the selection of culling and review techniques
– Provide a measuring stick for overall progress
– Guide decisions about resource allocation
– Estimate project costs more accurately

FINDING OUT HOW GOOD A SEARCH STRING IS, AKA TESTING CLASSIFIERS

TESTING CLASSIFIERS

Finding out how good a search string is

Classifiers
– Tools, mechanisms or processes by which documents are classified into categories like responsive/nonresponsive or privileged/non-privileged
– The tools, mechanisms, and processes employed could include:
  o Keyword or Boolean searches
  o Individual human reviewers
  o Overall team review processes
  o Machine categorization by latent semantic indexing
  o Predictive coding by probabilistic latent semantic analysis

TESTING CLASSIFIERS, CONT.

What’s the value of testing classifiers?

Testing classifiers, like estimating prevalence, is a source of discovery intelligence to guide data-driven decision making

– First, testing classifiers can improve the selection and iterative refinement of search strings and other classifiers

– Second, testing provides a stronger basis for argument for or against search strings or classifiers during pre-trial proceedings


The efficacy of classifiers is expressed through two values: recall and precision

– Recall is a measurement of how much of the material sought was returned by the classifier
  o e.g., if 250,000 relevant documents exist and a search returns 125,000 of them, it has a recall of 50%, finding half of what is sought
  o Low recall means under-inclusiveness

– Precision is a measurement of how much of what a classifier returns is the material sought, rather than unwanted material
  o e.g., if a search returns 150,000 documents of which only 50,000 are relevant, it has a precision of just 33%, having returned 100,000 unwanted items
  o Low precision means over-inclusiveness


Testing classifiers before applying them to a full dataset requires the creation of a control set against which they can be tested

A control set is a representative sample of the full dataset that has already been classified by the best reviewers possible so that it can function as a gold standard

The classifier is run against the control set, and its classifications are compared against the experts’


Example
– A simple random sample of 2,396 documents has been reviewed by subject matter experts for relevance to function as a control set

– A search string proposed by plaintiff is run against the control set

– It returns 1,200 out of the 2,396 documents, a mix of documents deemed relevant and not relevant by the subject matter experts

– How do you sort out the recall and precision?


                            Relevant (SMEs)    Not Relevant (SMEs)
Returned (P. Search)              400                  800
Not Returned (P. Search)          600                  596

There are 400 true positives
– Deemed relevant by SME and returned by P. Search

There are 596 true negatives
– Deemed not relevant by SME and not returned by P. Search

There are 800 false positives
– Deemed not relevant by SME but returned by P. Search

There are 600 false negatives
– Deemed relevant by SME but not returned by P. Search


Calculating recall:
– Recall is the percentage of all relevant documents correctly returned
– 400 (TP) out of 1,000 (TP + FN) = .40 = 40%
– The P. Search has recall of 40%

Calculating precision:
– Precision is the percentage of the returned documents that are relevant
– 400 (TP) out of 1,200 (TP + FP) = .33 = 33%
– The P. Search has precision of 33%
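The same two calculations, written out as a short sketch using the counts from the control set example. The function and variable names are chosen for this illustration.

```python
def recall(true_positives, false_negatives):
    """Share of all relevant documents that the classifier returned."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives, false_positives):
    """Share of returned documents that are actually relevant."""
    return true_positives / (true_positives + false_positives)

# Counts from the plaintiff's search run against the 2,396-document control set
tp, fp, fn, tn = 400, 800, 600, 596

print(f"recall:    {recall(tp, fn):.0%}")
print(f"precision: {precision(tp, fp):.0%}")
```

Note that true negatives do not enter either calculation; they matter only for confirming that the four cells add up to the full control set.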

FINDING OUT HOW GOOD YOUR REVIEWERS ARE, AKA QUALITY CONTROL

QUALITY CONTROL

Finding out how good your reviewers are

Accuracy and error rate in document review
– Accuracy is the measure of how many reviewer determinations are correct
– Error rate is the measure of how many reviewer determinations are incorrect
  o Together, accuracy and error rate should total 100%

Like testing classifiers, involves the comparison of one set of classifications (the initial reviewer’s) to another (the more senior reviewer performing quality control review)

QUALITY CONTROL, CONT.

Lot acceptance sampling:
– Methodology employed in pharmaceutical manufacturing, military contract fulfillment, and other high-volume, quality-focused processes

How it works:
– A sampling protocol and maximum acceptable error rate is established
– Each lot has a random sample taken from it for testing
– If the acceptable error rate is exceeded, the entire lot is rejected without further evaluation
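The accept-or-reject logic can be sketched as below. This is a simplified illustration, not a full acceptance sampling plan: the `inspect_lot` helper, the field names, the sample size of 50, and the 5% threshold are all assumptions, and `qc_call` stands in for what the senior reviewer would code on re-review.

```python
import random

def inspect_lot(lot, sample_size, max_error_rate, seed=None):
    """Sample a lot of coded documents and accept or reject the whole lot."""
    if not lot:
        raise ValueError("lot must contain at least one document")
    rng = random.Random(seed)
    sample = rng.sample(lot, min(sample_size, len(lot)))
    # An error is any document where QC disagrees with the first reviewer
    errors = sum(1 for doc in sample if doc["reviewer_call"] != doc["qc_call"])
    error_rate = errors / len(sample)
    return {
        "error_rate": error_rate,
        "accuracy": 1 - error_rate,
        "accepted": error_rate <= max_error_rate,
    }

# Two hypothetical 200-document batches: one flawless, one entirely miscoded
clean_batch = [{"reviewer_call": "relevant", "qc_call": "relevant"}] * 200
sloppy_batch = [{"reviewer_call": "relevant", "qc_call": "not relevant"}] * 200

clean_result = inspect_lot(clean_batch, sample_size=50, max_error_rate=0.05, seed=1)
sloppy_result = inspect_lot(sloppy_batch, sample_size=50, max_error_rate=0.05, seed=1)
print("clean batch accepted: ", clean_result["accepted"])
print("sloppy batch accepted:", sloppy_result["accepted"])
```

A rejected lot would go back for full re-review rather than having only the sampled errors corrected.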

QUALITY CONTROL, CONT.

In the context of document review, the “lot” can correspond to individual reviewer batches or to aggregations, such as an individual reviewer’s total completed review for each day

Statistics on batch rejection can be tracked by reviewer, by team, by source material, or by other useful properties

Choosing not to acknowledge or measure the error rate in a document review project does not mean that it does not exist

FINDING OUT HOW MUCH STUFF YOU MISSED, AKA MEASURING ELUSION

MEASURING ELUSION

Finding out how much stuff you missed

Elusion
– Elusion is a measure of how much relevant material was left behind
– Typically referenced in the context of predictive coding and the portion of a dataset left unreviewed by a technology-assisted review process

An elusion measurement can also be thought of as an estimation of prevalence specific to the remainder pile

MEASURING ELUSION, CONT.

Example
– Even after targeted collection and processing, 1,000,000 documents remain

– Use key word and Boolean searches to cull that 1,000,000 to 250,000 for human document review

– In addition to documenting the culling techniques employed and their rationales, an elusion/prevalence measurement can be taken


Example, cont.
– For a remainder pile of 750,000 documents

– Reviewing a simple random sample of just 16,229 documents

– Will allow you to measure the elusion
  o (aka the prevalence of relevant materials in the remainder)

– With a 99% confidence level

– And a confidence interval of +/-1%


Example, cont.
– Assuming, hypothetically, that your review found 5% of that sample of the remainder to be relevant
  o You would then be able to say with 99% confidence
  o That between 4% and 6% of the remainder is relevant
    – (aka 30,000-45,000 documents out of 750,000)

– You would also learn about what types of missed materials exist
  o To develop a method of targeted retrieval
  o To demonstrate low utility of retrieval
    – (i.e., by showing material is duplicative or of low probative value relative to additional expense required for retrieval)
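The elusion example can be checked with the same kind of calculation used for prevalence. A minimal sketch, assuming the normal-approximation sample size formula with a finite population correction and a 50% assumed prevalence (the slides do not state which formula produced 16,229); the helper name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(frame_size, confidence_level, margin, prevalence=0.5):
    """Required sample size, adjusted for the size of the remainder pile."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n0 = z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / frame_size))

remainder = 750_000
margin = 0.01
needed = sample_size(remainder, confidence_level=0.99, margin=margin)

observed_elusion = 0.05  # hypothetical share of the sample found relevant
low, high = observed_elusion - margin, observed_elusion + margin
missed_docs = (round(low * remainder), round(high * remainder))

print(f"sample needed: {needed:,} documents")
print(f"elusion between {low:.0%} and {high:.0%}")
print(f"relevant documents left behind: {missed_docs[0]:,} to {missed_docs[1]:,}")
```

Under these assumptions the sketch reproduces both the 16,229-document sample and the 30,000 to 45,000 document range.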

REVIEW AND PREVIEW

REVIEW

Estimating prevalence
– Discovery intelligence
– Four steps
  o Identify sampling frame
  o Determine sample size (measurement strength)
  o Take and review simple random sample
  o Calculate prevalence estimate

Testing classifiers
– Evaluated for recall and precision
  o Recall is how under-inclusive
  o Precision is how over-inclusive
– Must be tested against control set

REVIEW, CONT.

Quality control of human document review
– Evaluated for accuracy and error rate
– Lot acceptance sampling
– Acceptable error rate

Measuring elusion
– Elusion is the prevalence of relevant materials in the remainder
– Measured the same way as estimating prevalence
– Useful both for process validation and for defensibility

ENGAGE WITH MODUS

Modus email coming tomorrow with slides, recording, survey and invite to next webinar

Visit us online at http://discovermodus.com/webinars/ for more information on our next webinar on January 27th at 1:00PM EST entitled:
– “Meeting the Microsoft SharePoint Discovery Challenge”

Make sure to visit our website for valuable white papers, blogs and other information at www.discovermodus.com
– White Paper: “Practical Applications of Random Sampling in eDiscovery”

Thank you!
