
High Quality Program Evaluation in Nonprofit Organizations

February 2015 DCPNI – Isaac Castillo - @Isaac_outcomes 1

Isaac D. Castillo, Deputy Director

DC Promise Neighborhood Initiative

@Isaac_outcomes

[email protected]

February 19, 2015


Why Bother With All of This?


Ultimately, you should be measuring outcomes or effectiveness for a single reason:

To better serve your clients / population.

Learning Objectives

• Understand methods to determine if your program is really leading to positive change (as opposed to change happening due to chance)

• Learn best practices in using and analyzing surveys and how to avoid common mistakes

• Identify best ways to balance costs and quality when doing program evaluation


Outputs vs Outcomes

• Output measures assess what you do and who you serve. Examples include:

– Served 100 youth during summer camp

– Provided 2,250 hours of tutoring during the academic year

– 9 out of 10 youth attended at least 75% of available art instruction classes offered

• Outcome measures assess changes in your target population. Examples include:

– 75% of youth increased their knowledge of local history during the summer camp

– 50% of youth increased math grades by one grade level during the academic year

– 25% fewer youth reported being involved in bullying over the last year
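To make the distinction concrete, here is a minimal Python sketch. The participant records, the field names (`classes_attended`, `pre_score`, `post_score`), and the figure of 20 classes offered are all invented for illustration; they are not data from any real program.

```python
# Hypothetical participant records; every value here is invented for illustration.
participants = [
    {"name": "A", "classes_attended": 18, "pre_score": 10, "post_score": 14},
    {"name": "B", "classes_attended": 16, "pre_score": 12, "post_score": 12},
    {"name": "C", "classes_attended": 9, "pre_score": 8, "post_score": 11},
    {"name": "D", "classes_attended": 20, "pre_score": 15, "post_score": 13},
]
CLASSES_OFFERED = 20  # assumed number of classes offered

# Outputs: what you did and who you served.
youth_served = len(participants)
attended_75_pct = sum(
    1 for p in participants if p["classes_attended"] / CLASSES_OFFERED >= 0.75
)

# Outcome: change in the target population.
increased = sum(1 for p in participants if p["post_score"] > p["pre_score"])
pct_increased = 100 * increased / youth_served

print(f"Output: served {youth_served} youth")
print(f"Output: {attended_75_pct} of {youth_served} attended at least 75% of classes")
print(f"Outcome: {pct_increased:.0f}% of youth increased their score")
```

The two output lines say nothing about whether anyone learned anything; only the last line describes a change in participants.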


Outputs

• Outputs DO:

– Tell you whether your program was implemented well. For example, they indicate whether a program:

   • delivered the intended number of sessions

   • reached its intended population

   • resulted in adequate participation levels

• Outputs DO NOT:

– Tell you if participants benefited from your program

– Serve as indicators of program success or effectiveness


Outcomes

• Outcomes DO:

– Tell you if participants benefited from your program

– Serve as indicators of program success or effectiveness

• Outcomes DO NOT:

– Tell you whether your program was implemented well (or provide clues about how your program improved participant outcomes)


What is Program Evaluation?

• Process to determine if your program / intervention / approach is effective.

• Need to define what ‘success’ is for your program first.

• Program evaluation does NOT need to be done by specialists or outsiders – but those people do add credibility and rigor (in most cases)


The Basics of Program Evaluation – An Example

• The concept of dieting – if you understand dieting, you understand the basics of program evaluation.

• What is the goal of dieting (how do you define dieting ‘success’)?

• How do you know if your diet ‘works’?


Data and Dieting


Person weighs 200 Pounds (90 Kilograms)

• Does that data point alone tell us anything?

• Context Matters – what if person is 4 feet tall and 10 years old?

• Timing Matters – is this at beginning, end, or middle of diet?

Could Be About More Than Weight


• Other things that could be measured:

– Body Mass Index (BMI)

– Physical fitness

– Blood measures (cholesterol levels)

– Own perceptions of health / feeling

– Appearance / muscle tone

Outcomes vs. Impact

• “Impact” gets used loosely. Precise meaning in evaluation world: impact = difference between program group outcomes and comparison group outcomes (usually measured through a randomized controlled trial, or RCT).

• “Outcomes” measure changes in your own participants. They help determine the effectiveness of your program.

• Be aware of the differences in terms and who your audience is.


How Can Nonprofits Measure Change?

• Easiest thing to do is to measure before and after for your participants.

• Can also compare to other groups.

• How and what you measure is just as important.


Traditional (Time Series)

• Most common type of program evaluation.

• Looking to see if things have changed over time.

• What was the situation before the program, and what was the situation after the program?

• Must measure same things, in same ways, at both points in time.


[Diagram: Before Program → Program Delivered → After Program]
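A before/after comparison can be sketched in a few lines of Python. The scores and participant IDs below are invented for illustration, and the sketch assumes the same test was given the same way at both points in time.

```python
from statistics import mean

# Hypothetical scores on the same test, given the same way, before and after the program.
pre_scores = {"P1": 10, "P2": 14, "P3": 9, "P4": 16}
post_scores = {"P1": 13, "P2": 15, "P3": 9, "P4": 20}

# Change for each participant (after minus before).
changes = {pid: post_scores[pid] - pre_scores[pid] for pid in pre_scores}

average_change = mean(changes.values())
improved = sum(1 for c in changes.values() if c > 0)

print(f"Average change: {average_change:+.1f} points")
print(f"{improved} of {len(changes)} participants improved")
```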

Comparison Group

• A time series study that compares to another group (that does not receive programming).

• More rigorous, but more challenging


[Diagram: Before Program → Program Delivered or No (or minimal) programming → After Program]
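One simple way to use a comparison group is to subtract its change from the program group's change (often called a difference-in-differences). The sketch below uses invented scores and is only one possible approach; it does not account for how participants ended up in each group.

```python
from statistics import mean

# Hypothetical before/after scores; all numbers are invented for illustration.
program_pre, program_post = [10, 12, 14, 16], [15, 16, 18, 19]
comparison_pre, comparison_post = [11, 12, 13, 16], [12, 13, 15, 16]

# Change over time within each group.
program_change = mean(program_post) - mean(program_pre)
comparison_change = mean(comparison_post) - mean(comparison_pre)

# Change in the program group beyond what the comparison group saw anyway.
difference = program_change - comparison_change

print(f"Program group change: {program_change:+.1f}")
print(f"Comparison group change: {comparison_change:+.1f}")
print(f"Difference between groups: {difference:+.1f}")
```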

Who / What Will You Evaluate?

• Need to define the population that will be evaluated.

• Need to define ‘success measures’ (outcomes) – what are you trying to achieve?

• Once these questions are answered, then need to consider which participants will be part of the evaluation (and maybe who gets programming).


In Time Series, This is Simple

• Usually just serve and evaluate those that enroll in the program:

• First come, first served is what is frequently used if there are too many potential participants.


[Diagram: Self-selection → Before Program → Program Delivered → After Program]

Comparison Groups Are More Complicated

• Can select by randomizing participants into groups:


[Diagram: Random Selection → Program Delivered]
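Random assignment can be as simple as shuffling the sign-up list and splitting it in half. This is a minimal sketch with placeholder names; the fixed seed and the even split are assumptions made for illustration, not a prescribed procedure.

```python
import random

# Hypothetical list of people who signed up; names are placeholders.
applicants = [f"Applicant {i}" for i in range(1, 21)]

rng = random.Random(2015)  # fixed seed so the assignment can be reproduced and audited
shuffled = applicants[:]   # copy so the original sign-up order is preserved
rng.shuffle(shuffled)

half = len(shuffled) // 2
program_group = shuffled[:half]     # receives programming
comparison_group = shuffled[half:]  # receives no (or minimal) programming

print(f"{len(program_group)} in program group, {len(comparison_group)} in comparison group")
```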

Compare across high/low dosage

• Can use self-selection:


[Diagram: Program Delivered; participants self-select into High Attendance and Low Attendance groups]

But How Do You Measure Change?

• Most common ways:

– Use data that someone else has collected (report cards, health status, etc.)

– Pre/post-tests or surveys – at least two points in time.

– Focus groups or interviews.

• Can (and should) combine these.


How can you quickly analyze data?

• You can do a lot in Excel.

• Think about assumptions and questions ahead of time.

• Think about your analysis before your program starts.

• Open-ended and text responses are time-consuming to analyze…

• But you can put numbers to a lot of things.


Importance of identified data

• Try to avoid use of anonymous or grouped data.

• Ideally, you would be able to match (and track) data at individual level.

• That means you need names or unique identifiers.

• Your analysis would then focus on those that have data at multiple points in time and whose data can be matched to the same individual.

• Different from whole group analysis (compare whole group at point 1 to whole group at point 2 – even though there are different people in groups).


Group vs. Individual Analysis

Participant      Pre-Test Score   Post-Test Score   Difference
Participant 1    10               No post-test      ??
Participant 2    20               No post-test      ??
Participant 3    10               10                0
Participant 4    20               20                0
Participant 5    No pre-test      20                ??
Participant 6    No pre-test      30                ??
Average:         15               20                +5
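The table can be worked through in Python to show how the two approaches disagree. The scores are taken from the table; the variable names and the use of `None` for a missing test are choices made for this sketch.

```python
from statistics import mean

# Scores from the table; None marks a missing test.
scores = {
    "Participant 1": (10, None),
    "Participant 2": (20, None),
    "Participant 3": (10, 10),
    "Participant 4": (20, 20),
    "Participant 5": (None, 20),
    "Participant 6": (None, 30),
}

# Whole-group analysis: average everyone who took each test, even if they are different people.
group_pre = mean(pre for pre, _ in scores.values() if pre is not None)
group_post = mean(post for _, post in scores.values() if post is not None)
group_difference = group_post - group_pre

# Individual analysis: keep only participants with both a pre-test and a post-test.
matched = {
    name: post - pre
    for name, (pre, post) in scores.items()
    if pre is not None and post is not None
}
matched_difference = mean(matched.values())

print(f"Whole-group difference: {group_difference:+}")
print(f"Matched-individual difference: {matched_difference:+}")
```

The whole-group comparison suggests a 5-point gain, while the only two participants who can actually be tracked did not change at all.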


Are There Some Things That Can’t be Measured?

• The key is properly defining what success looks like.

• Large and fuzzy concepts ARE difficult to measure.

• But their component parts can usually be measured.

• Let’s start with an example….


Your engagement in workshop

Category (Numerical Value): Description

Poor (1): Openly not paying attention to presentation. Not in room, or on unrelated internet sites (Facebook). Are you playing Candy Crush now? If so, you are in this category.

Fair (2): Not taking notes, but at least listening. Asking questions or making comments that are distracting or do not contribute positively to learning.

Good (3): Taking notes and listening actively to content, but not participating in any other way (no questions, no comments)

Excellent (4): Active listening / note-taking and asking questions. Questions push discussion in positive ways.
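Once a rubric assigns a number to each category, observations can be converted and summarized. The sketch below uses the rubric's four values; the sequence of ratings is invented for illustration.

```python
from statistics import mean

# Numerical values from the engagement rubric.
ENGAGEMENT_SCORE = {"Poor": 1, "Fair": 2, "Good": 3, "Excellent": 4}

# Hypothetical observer ratings of one attendee across a workshop; invented for illustration.
ratings = ["Fair", "Good", "Good", "Excellent", "Poor", "Good"]

# Convert each category label to its numerical value.
scores = [ENGAGEMENT_SCORE[r] for r in ratings]
average_score = mean(scores)

print(f"Scores over time: {scores}")
print(f"Average engagement: {average_score:.2f} out of 4")
```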


How can you analyze this?

[Chart: Graphing One Person’s Engagement]


Does adding a line help?

[Chart: one person’s engagement, with a line added]



What about a Trendline?

[Chart: one person’s engagement, with a trendline]



Comparing multiple points in time

• Easy to compare changes between two points in time (pre/post), but what if you have multiple data points?

• If you have data at four points in time, do you only compare first and last? What do you do with middle two points?

• What about 20 points of data? Still first and last, or do you want something that more accurately captures what happens over the entire time (like regressions / trendlines)?

• Is using the first/last data point even the best thing to do (will they be the most accurate)?
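A linear trendline is an ordinary least-squares fit, which can be computed by hand. The sketch below compares the trendline slope with a simple first/last comparison; the nine scores are invented, and `fit_trendline` is a helper written for this example.

```python
# Hypothetical engagement scores (1-4 rubric) at nine observation points; invented data.
times = [1, 2, 3, 4, 5, 6, 7, 8, 9]
scores = [2, 1, 3, 2, 3, 3, 4, 3, 4]


def fit_trendline(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_trendline(times, scores)
first_last_change = scores[-1] - scores[0]

print(f"Trendline: score = {slope:.2f} * time + {intercept:.2f}")
print(f"First-to-last change: {first_last_change:+}")
```

The slope uses every data point, so one unusually high or low first or last observation has less influence than it does in a first/last comparison.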


Avoiding common mistakes

• Collecting different data in different ways over time (post-test is different from pre-test).

• Should you even be giving pre-tests? (Retrospective post-then-pre-tests and normalization of skill over time).

• Are there things that shouldn’t be self-reported (too much bias)?

• Is a very complex outcome oversimplified?


How Detailed or Rigorous Does the Evaluation Need to Be?

• What do you want to do with the results?

– Prove to yourself the program works?

– Use the results to market/fundraise?

– Publish the results through your own materials?

– Publish the results in peer-reviewed journals?

• How ‘certain’ do you want to be about the results?

– Are you fine with some doubt?

– Will you be comfortable answering concerns and criticisms?

• Are you willing to live with negative results?


Costs and Rigor

• The more you want to do with the results, the more you need to spend on evaluation.

• Approximately 3% of an organization’s budget should be spent on evaluation activities.

• Can grow capacity over time – start small.

• Very little cost to do simple data collection – don’t overcomplicate at the beginning.


Some More Complex Data Questions

• Let’s try to delve into some deeper questions.

– What does your target population look like, and is it different from what you anticipated?

– Do you have a way to know if participants are re-enrolling in programming?

– How do you define a program participant? And what does it take to get a person ‘enrolled’?


Assessing your service population

• Basic demographics are easy place to start

• Can you include other characteristics to measure need/risk levels of populations?

– Income level (or proxies)

– Education level (or proxies)

– Other characteristics that are important

• Where do you get this data?

– Administrative sources (someone else collects)

– Screening tools (you collect)


How do you know if you have ‘repeat customers’?

• Does your data system have unique identifiers for participants?

• Does your data system have a way to track multiple enrollments in the same program at different periods of time?

• Does your program have distinct end points (and criteria for exit) and are those trackable?

• Big question: Are repeat customers a good thing? In some instances, they could actually be a negative outcome if participants repeat programs.
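With unique identifiers in place, finding repeat customers is a counting exercise. The enrollment records, ID format, and program names below are invented; the sketch defines a repeat customer as the same ID enrolling in the same program more than once, which is only one possible definition.

```python
from collections import Counter

# Hypothetical enrollment records: (unique participant ID, program, term). Invented data.
enrollments = [
    ("ID-001", "Tutoring", "Fall 2014"),
    ("ID-002", "Tutoring", "Fall 2014"),
    ("ID-001", "Tutoring", "Spring 2015"),
    ("ID-003", "Mentoring", "Fall 2014"),
    ("ID-002", "Mentoring", "Spring 2015"),
]

# Count enrollments per (participant, program) pair.
counts = Counter((pid, program) for pid, program, _term in enrollments)

# A repeat customer here means the same ID enrolled in the same program more than once.
repeat_customers = sorted(pid for (pid, _program), n in counts.items() if n > 1)

print(f"Repeat customers: {repeat_customers}")
```

Here ID-002 appears twice but in two different programs, so only ID-001 counts as a repeat under this definition.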


At What Point is Someone “In” Your Program?

• Do you have defined criteria as to when a participant is officially enrolled in your program?

– Is it when they fill out an intake form? Or when they have signed a consent form?

– Or is it when they attend their first (or second) event?

• What is the process to make this happen?

– What paperwork or other things do prospective participants have to go through?

– Do you have any idea of how long this process takes (with data – not just guesses)?
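If intake and first-attendance dates are recorded, the length of the enrollment process can be measured rather than guessed. The records and field names (`intake`, `first_event`) below are invented for illustration, and treating the first event as the point of enrollment is an assumption.

```python
from datetime import date
from statistics import median

# Hypothetical dates: when the intake form was completed and when the first event was attended.
records = [
    {"id": "ID-001", "intake": date(2015, 1, 5), "first_event": date(2015, 1, 12)},
    {"id": "ID-002", "intake": date(2015, 1, 7), "first_event": date(2015, 2, 4)},
    {"id": "ID-003", "intake": date(2015, 1, 20), "first_event": date(2015, 1, 23)},
]

# Days between intake and first event for each participant.
days_to_enroll = [(r["first_event"] - r["intake"]).days for r in records]
median_days = median(days_to_enroll)

print(f"Days from intake to first event: {days_to_enroll}")
print(f"Median: {median_days} days")
```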


How Can You Use This Data?

• Are you serving the ‘right’ population?

• Are your participants getting ‘enough’ service to obtain outcomes?

• Should you change who you serve?

• Should you change what you do?


Fictional Mentoring Program – Dosage by Quartile


Target: 50 hours of mentoring per year

30 hours

15 hours (also median)

6 hours

5% hit the 50 hour target
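A dosage summary like this one can be produced with Python's standard library. The 50-hour target comes from the slide; the list of hours is invented for illustration and is not the fictional program's underlying data.

```python
from statistics import quantiles

TARGET_HOURS = 50  # target from the slide

# Hypothetical mentoring hours per participant for one year; invented for illustration.
hours = [2, 4, 5, 6, 8, 10, 12, 14, 15, 16, 18, 22, 25, 28, 30, 33, 38, 42, 47, 55]

# Cut points that split participants into four equal-sized groups.
q1, med, q3 = quantiles(hours, n=4)

# Share of participants who reached the target.
hit_target = sum(1 for h in hours if h >= TARGET_HOURS)
pct_hit_target = 100 * hit_target / len(hours)

print(f"25th percentile: {q1} hours, median: {med} hours, 75th percentile: {q3} hours")
print(f"{pct_hit_target:.0f}% hit the {TARGET_HOURS}-hour target")
```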

What if we combine dosage and outcomes?


Target: 50 hours of mentoring per year

‘Best outcomes’ – avg of 48 hours

‘Moderate outcomes’ – avg of 12 hours

‘No change’ – avg of 6 hours

‘Negative change’ – avg of 2 hours

How does this help redefine targets?

We could add in population factors...

Outcome level                 Females    Males      Transgender
Strong positive outcomes      36 days    52 days    56 days
Moderate positive outcomes
No or negative outcomes

Average number of program days attended by each subpopulation

…or risk factors

Outcome level                 Very low income    Low Income    Moderate Income
Strong positive outcomes
Moderate positive outcomes
No or negative outcomes

Average number of program days attended by each subpopulation
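Each cell in a table like this is an average over the participants who share an outcome level and a subpopulation. The sketch below fills a few cells from invented records; none of the numbers come from the slides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (outcome level, income group, program days attended). Invented data.
records = [
    ("Strong positive", "Very low income", 30),
    ("Strong positive", "Very low income", 34),
    ("Strong positive", "Moderate income", 58),
    ("Strong positive", "Moderate income", 62),
    ("No or negative", "Very low income", 8),
    ("No or negative", "Moderate income", 12),
]

# Group days attended by table cell.
days_by_cell = defaultdict(list)
for outcome, group, days in records:
    days_by_cell[(outcome, group)].append(days)

# One average per table cell: (outcome level, subpopulation) -> average days attended.
average_days = {cell: mean(days) for cell, days in days_by_cell.items()}

for (outcome, group), avg in average_days.items():
    print(f"{outcome} / {group}: {avg} days")
```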

And then you can delve into cells to ask and answer questions.

• It took moderate income participants far more days than very low income participants to see effects. Why?

• Does this mean we should exclude moderate income participants (and low income)?

• Should we change our dosage target?

• Should we change how we define and measure the outcomes?


Contact Information

Isaac Castillo

E-mail: [email protected]

Twitter: @Isaac_outcomes
