
Evaluating the Educational Effectiveness of an Intervention Programme for Social-Emotional Learning

Paper presented at the BERA 2010 Conference

Mary K. Sheard & Steven M. Ross

September 2010


Abstract

The paper presents a critical reflection on the first phase of a longitudinal randomised evaluation of Together 4 All (T4A), a social and emotional intervention programme recently introduced in the Craigavon area of Northern Ireland. Implementation of the T4A programme began in six Primary Schools in September 2008 as a pilot phase of the programme’s development. The evaluation uses quantitative and qualitative methods to explore the experiences of T4A among a range of key stakeholders in the school community, and to investigate possible changes in children’s personal development, attitudes and behaviour associated with the implementation of the programme. Observation measures were developed to capture a range of teaching and pupil behaviours associated with the aims of the T4A programme. Classroom observations and playtime observations feature importantly in the evaluation. The paper reports on the development and reliability of the observation measures of teaching behaviours and pupil behaviours in the classroom and pupil behaviours in the playground, and on the findings obtained from the observation measures in the first year of the evaluation. The paper also reports on the statistical analysis of the first-year data from the observation measures and from teacher ratings, completed at the beginning and end of the first year and a half of implementation. Findings from these measures showed directional gains that favoured the intervention. From a socio-cultural perspective, the paper argues that an evaluation of the educational effectiveness of a teaching programme should carefully consider the processes of implementation and pedagogic change as well as outcomes related to pupil achievement and attainment. Conclusions are drawn about the reported evaluation process and the contribution made by the observation measures in particular to our understanding of educational effectiveness in the context of a social and emotional intervention programme.

Key Words: evaluation, intervention, observation

Introduction

The paper presents a critical reflection on the first year of a longitudinal randomised evaluation of the Together 4 All (T4A) programme for schools, a social and emotional intervention programme based on the PATHS (Promoting Alternative Thinking Strategies) prevention curriculum developed in the US. Firstly, the paper offers insights into the methods and conduct of the early phases of the evaluation and preliminary findings from observations of teaching and pupil behaviours and from teacher ratings of pupil behaviour, after implementation periods of one year and six months respectively. Secondly, the paper discusses methodological, analytical and ethical issues highlighted by the evaluation process and considers the implications these raise for research into educational effectiveness. While the focus will be primarily on findings relating to the impacts of the T4A programme on children’s social-emotional development, the paper will also consider related issues that the evaluation has raised thus far: the role of classroom observation and rating scales in evaluating educational effectiveness, fidelity of implementation, and the concept of ‘control’.


The purposes of the evaluation study are to assess the implementation progress and impacts of T4A. In the pilot phase of the programme’s development, T4A was recently introduced into six Primary Schools in Craigavon, an area of religious, cultural, social and economic diversity in Northern Ireland. Given the turbulent history of Northern Ireland, a major goal in adopting the programme was to inculcate in this generation of children positive attitudes toward citizenship, respect for others, and recognizing and expressing feelings. The main evaluation questions are:

What are the impacts of the T4A Programme on the social-emotional development of primary school children in the four communities?

What is the implementation fidelity of the T4A Programme for schools?

What are the trends for the well-being of children over time (where well-being refers to feeling good and functioning effectively)?

Implementation fidelity refers to the degree of fit between the original programme and its application. The present study conceptualises implementation fidelity in terms of the following components identified by Eames et al. (2009) and Mihalic et al. (2002): adherence (whether the programme is being delivered as designed and intended); exposure (the degree to which the length and frequency of programme delivery comply with the programme specification); integrity (whether the programme is implemented as intended with the specified materials and training); differentiation (whether the programme content for each lesson differs as intended and follows the intended sequence); quality of programme delivery (the manner of practitioners’ delivery and use of methods and strategies); and practitioner responsiveness (the extent to which practitioners ‘enact’ and apply the programme’s principles and are involved in the activities and content of the programme).

The paper takes a socio-cultural theoretical perspective. It argues that the T4A programme represents the socio-cultural theory of learning through its focus on children’s interactions with other people, objects and events in the environment and on activities that require cognitive and communicative participation. As an intervention programme to enhance social-emotional learning, mutual respect and understanding, and pro-social behaviour, the T4A programme is an agent of cultural change and social action based on cultural knowledge and an understanding of cultural practices, meanings, belief systems and value systems. Central to this is the view that, because communities organise themselves through conflict as well as through co-operation, individuals are often prevented from learning to see the world as others do, and may even be led to believe that there is only one way of seeing or doing and that that is the best way (Lemke, 2009). This has been an issue historically, and still is, in some communities in Northern Ireland. From this socio-cultural perspective, teachers, parents and other significant adults are more experienced social partners from whom children learn the social practices and cultural conventions of social interactions. The social-emotional language of T4A and its associated discourses and visual representations (for example, compliment slips, Child for the Day, and control signals) may be viewed as a resource for cultural change. The evaluation study of T4A seeks to identify how the programme impacts on the pedagogic behaviours of teachers as pupils’ social partners and agents of cultural change. Identifying changes in teaching behaviours is central to evaluating programme effects on children’s social interaction and pro-social behaviours in and out of the classroom.

Method

The evaluation draws on quantitative and qualitative methods to explore a range of key stakeholders’ experiences of T4A and to investigate possible changes in children’s personal development, attitudes and behaviour associated with the implementation of the programme. While the research instruments used in the evaluation importantly include individual assessments of pupils’ social-emotional competence, interviews with key stakeholders, and teachers’ programme implementation ratings, the main data sources considered in this paper are classroom observations of teaching behaviour (COBs teaching) and of pupil behaviour (COBs pupil behaviours), playtime observations of pupil behaviour (POBs), and teacher ratings of pupil behaviour. These particular data sources are reported here because they provided sub-sets of data that could be analysed discretely and meaningfully at a relatively early stage of the evaluation, when other interim and posttest data had not yet been collected. In addition, the authors considered that the observation measures and outcomes, which are a main feature of the evaluation, would be of particular interest to other researchers working in the field at the present time.

i) Classroom Observations (COBs) and playtime observations (POBs) of behaviour

Based on a systematic design process that included programme review, expert feedback, and an iterative development approach, the COB was designed to focus on 11 teaching and 9 pupil behaviours and the POB on 10 pupil behaviours. The COB teaching behaviours related to four domains: (I) Managing Behaviour and Problems, (II) Supporting Emotional Development, (III) Facilitating Peer Interactions, and (IV) Supporting Mutual Respect and Understanding, as shown in Figure 1.

Figure 1. Model of COB Teaching Behaviours

The 9 COB pupil behaviours encompass children’s positive interactions with and respect for others (listening, complimenting, co-operating, showing mutual respect and understanding), feelings (coping, self-expression, identifying feelings in others), and classroom behaviour (compliance, engagement/involvement). The 10 identified POB behaviours relate to positive or negative play activity (taking turns, complimenting others, criticizing others, taunting others, assisting others, being physically aggressive), social interactions (avoidance, including others, excluding others), and complying with playground rules.


A supplementary rubric was used to rate the degree of mutual respect and understanding (MRU) demonstrated overall.

Inter-rater Reliability Study. Trained COB/POB observers visited classrooms or play areas at randomly selected or pre-arranged times (depending on the conditions of the study) for a 15-minute interval. An inter-rater reliability study was conducted by pairing nine trained observers in different combinations and assigning each pair to observe the same classroom (COB) or playground (POB) for the full observation interval. In total, the pairs conducted 46 COB observations of teaching behaviour, 50 COB observations of pupil behaviour, and 25 POB observations of pupil behaviour. For the reliability trials, the pair members were given separate rating forms to complete independently and a consensual form on which to record shared ratings, based on a discussion of individual ratings and resolution of any disagreements. The consensual ratings were treated as actual data for the T4A evaluation study. The pair members were told to observe silently and then complete the rating form independently, as they would if they were conducting a solo observation. When both independent sets were completed, they were to seal the forms in an envelope. Only then were they to take out the consensual rating form and complete it together, choosing whichever score they, as a team, felt was most accurate for each behaviour. Inter-rater reliability analyses, consisting of descriptive item analyses and Cohen’s Kappa statistics, were conducted on the COB teacher and pupil behaviour ratings and the POB pupil behaviour ratings. Results from all three analyses were strongly supportive of the instruments’ reliability.

COB teaching behaviour.

Across all 506 items, 80% of pair members chose the identical rating (0 to 3) as their counterparts. In no case did two pair members choose ratings that were more than one level apart.

Behaviours with the strongest inter-rater agreement were Social Problem Solving (91%), Provision of Interpersonal Support (87%), Providing Feedback on Peer Interactions (87%), Supporting Mutual Respect and Understanding (85%), and Positive Behaviour Modeling (85%). COB teacher behaviours associated with the lowest levels of agreement were Attention/Engagement (65%), Supporting Peer Interaction (70%), and Emotion Expression (71%).

Cohen’s Kappa coefficient for COB teacher behaviours was +.63, indicating “substantial” inter-rater agreement (Landis & Koch, 1977). The overall Pearson r of +.82 likewise reflected very strong inter-rater agreement.

COB pupil behaviour.

Across all pairs and items, 81% of pair members selected the identical rating as their counterparts.

Individual behaviours associated with the strongest inter-rater agreement were Compliance (96%), Positive Coping Strategies (94%), and Engagement/Involvement (92%). Behaviours associated with the lowest levels of agreement were Cooperative in Learning and Shows Mutual Respect and Understanding (both 68%).

An overall Kappa statistic of +.73 was obtained, indicating “substantial” inter-rater agreement. The close similarity of pair members’ ratings is further reflected in the very high overall Pearson r of +.90.


POB pupil behaviours.

Members of observer pairs chose the identical 0-3 ratings on 85% of the items.

The highest POB inter-rater agreement was found for the MRU Rubric (100%) and the behaviours of Complied with Playground Rules (96%) and Individual Avoided Social Contact (88%). Only the behaviour of Included Others (64%) was associated with less than 75% agreement.

“Substantial” inter-rater agreement for the POB was indicated by the overall Kappa statistic of +.73. The overall Pearson r of +.80 is also supportive of high consistency of ratings.

The COB-teacher, COB-pupil, and POB-pupil measures all received strong support from both the item analyses and Kappa statistics. Specifically, their overall percentages of identical ratings across all observers and behaviours were 80%, 81%, and 85%, respectively. No behaviour on any instrument was associated with less than 62% agreement. Kappa statistics were +0.63, +0.73, and +0.73, respectively, all indicative of “substantial” inter-rater reliability.
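The two agreement indices reported above can be illustrated with a minimal Python sketch. It assumes the standard unweighted form of Cohen's Kappa for two raters (the paper does not say whether a weighted variant was used), and the ratings in the example are invented for illustration rather than taken from the study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items on which two raters chose the identical rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa: observed agreement corrected for chance."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    categories = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical 0-3 ratings from one observer pair (NOT study data).
observer_1 = [0, 1, 2, 3, 2, 1, 0, 3, 2, 2]
observer_2 = [0, 1, 2, 3, 1, 1, 0, 3, 2, 3]

print(f"identical ratings: {percent_agreement(observer_1, observer_2):.0%}")
print(f"kappa: {cohens_kappa(observer_1, observer_2):.2f}")
```

Kappa is lower than the raw percentage of identical ratings because it discounts the agreement two raters would reach by chance given how often each uses every rating; values between .61 and .80 fall in the band that Landis and Koch (1977) label “substantial”.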

ii) Teacher ratings of pupil behaviour

The Teacher Ratings of Pupil Behaviour instrument (adapted from the Strengths and Difficulties Questionnaire; Goodman, 1997, 1999) was administered at the baseline stage in Autumn 2008 and again at the initial posttest in Spring 2009. Respondents were Intervention and Control teachers of children enrolled in P1/P2 (4 and 5 year olds, n = 798) and P5/P6 classes (9 and 10 year olds, n = 913). Following removal of surveys that were not usable (due to failure to follow directions or missing data), complete sets of Sweep 1 and Sweep 2 surveys were obtained for 673 (84.3%) P1/P2 pupils and 757 (82.9%) P5/P6 pupils. Table 1 summarizes the sampling outcomes for usable surveys.

Table 1. Sample sizes for analyses of teacher ratings

                   Class Level
Treatment Group    P1     P2     P1/P2    P5     P6     P5/P6    Total
Intervention       150    133    283      165    202    367      650
Control            206    184    390      216    174    390      780
Total              356    317    673      393    382    757      1430

Two parallel versions of the survey, appropriate for the ages of the participating pupils, were developed for use in P1/P2 and P5/P6. Following removal of overlapping or non-essential items in the original SDQ scales, the final instruments used for data analysis each contained 26 items. The first eight items on the P1/P2 version and the first seven items on the P5/P6 version used a 6-point scale (1=Never or Almost Never; 6=Almost Always). The remaining 18 and 19 items, respectively, used a 3-point scale (1=Not True; 3=Certainly True).

Preliminary Findings

i) Classroom Observations (COBs) and playtime observations (POBs)

To provide a comprehensive examination of the data, we conducted analyses separately by class level (P2, P3, P6, P7) and for all class levels combined. The primary analyses were 2-way (Treatment x Rating) chi-square tests for independence, with associated descriptive results (cross-tab tables).
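As an illustration of the Treatment x Rating test for independence described above, the following Python sketch computes the Pearson chi-square statistic and its degrees of freedom for a contingency table. The function name and the counts are hypothetical; the paper does not report the cell frequencies or the software used.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-way table.

    `table` is a list of rows, e.g. one row per treatment and one column
    per rating level (0 to 3).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if treatment and rating were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts (NOT study data): rows = treatment, columns = rating 0..3.
hypothetical_counts = [
    [5, 10, 20, 15],   # intervention observations
    [12, 18, 14, 6],   # control observations
]
statistic, dof = chi_square_independence(hypothetical_counts)
print(f"chi-square = {statistic:.2f}, df = {dof}")
```

With two treatments and four rating levels the test has 3 degrees of freedom, so a statistic above 7.815 would be significant at the .05 level.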


Observations of classrooms and playgrounds revealed both similarities and differences between the intervention and control treatments. A summary of behaviours, by observation instrument, reflecting significant treatment differences for all classes combined is provided in Table 2.

Table 2. Behaviours significantly differing between intervention and control treatments (all classes combined)

Instrument      Favours Intervention Treatment                    Favours Control Treatment
COB Teaching    Supporting Peer Interactions                      Positive Behaviour Management
                Providing Feedback on Peer Interactions
                Supporting MRU
COB Pupil       Compliments Others                                Is Compliant
                Demonstrates MRU                                  Demonstrates Attention/Engagement
POB Pupil       Took Turns in Participating in Games or Sports    Complied with Playground Rules
                MRU rubric

On the COB-Teaching Behaviours measure (see Table 2), the intervention classes were significantly more likely to exhibit Supporting Peer Interactions, Providing Feedback on Peer Interactions, and Supporting Mutual Respect and Understanding. Control classes, however, were significantly more likely to exhibit Positive Behaviour Management. Across both treatments, the behaviours most likely to be observed were Positive Behaviour Management, Gaining Pupil Attention/Engagement, and Supporting Peer Interaction. Least likely to be seen were Provision of Interpersonal Support and Facilitating Social Problem Solving. It should be noted that the frequently observed set is associated with everyday classroom teaching activities, whereas the rarely observed set requires the teacher to intervene in response to an emotional or social problem that occurs during the interval.

On the COB-Pupil behaviours measure (see Table 2), intervention classes were significantly more likely to exhibit Compliments Others and Shows Mutual Respect and Understanding. Control classes, however, were more likely to exhibit Is Compliant and Demonstrates Engagement/Involvement. Directional results favouring the intervention over control classes were also indicated for Cooperative in Learning. Overall, prosocial behaviours taught by T4A were somewhat more visible (significantly or directionally) in intervention classes. The more routine positive classroom behaviours of complying with rules and demonstrating engagement were more common in control classes. As would be expected, the latter behaviours were the most pervasive of all 9 target pupil behaviours across intervention and control classes. Least observed were Displays Positive Coping Strategies and Identifies Others’ Feelings.

On the POB measure, significant differences between treatments favoured intervention classes on Took Turns in Participating in Games or Sports and on the MRU rubric. Control classes were favoured on Complied with Playground Rules. Most often observed across both treatments were Complied with Playground Rules and Took Turns in Participating in Games or Sports. Least common were Shouted At or Taunted Others, Criticised Others, and Complimented Others.

Although these analyses were conducted at a relatively early phase (after only one year) of the T4A implementation, they reflect some consistent findings across observation measures. On all three measures, the intervention contexts exhibited significantly higher MRU practices or behaviours than did the control contexts. In general, although some of the differences were small, there were directional tendencies for prosocial behaviours and feeling-related behaviours to be more common in the intervention classes. Control contexts tended to feature more compliance with rules than did intervention classes. Interpretation of the latter trend would be highly speculative without further data, but it could possibly reflect the influence of T4A in fostering more open and pupil-centred environments. Observation analyses in subsequent sweeps should provide further insights into these possible T4A impacts.

ii) Teacher ratings of pupil behaviour

Several types of analyses were conducted on the data. First, we conducted a factor analysis on the combined P1/P2 and combined P5/P6 data. Second, we compared Intervention and Control teachers on individual items within the resultant factors for the (a) baseline (Sweep 1), using t tests for independent samples; and (b) posttest (Sweep 2), using analysis of covariance (ANCOVA) to control for baseline scores. Third, we averaged item scores within factors, and compared intervention and control teachers on factor means on the baseline (using t tests) and posttest (using ANCOVA) ratings. Fourth, we analyzed the results for each of the four class levels separately to examine trends relative to the combined-class results.

Factor Analysis

A principal axis analysis was conducted to extract key factors. Initial eigenvalues and the scree plots were used to determine the optimal number of factors to retain for rotation and analysis. The resulting factors were then orthogonally rotated using the Kaiser-Varimax method, which minimizes the number of variables that have high loadings on each factor. The next step was to assign each item to the dimension on which it had the highest factor loading. The minimum item loading accepted was an absolute value of 0.40. Each factor was then labelled to reflect the most salient underlying construct for its component items.

P1/P2 Teacher Ratings

P1/P2 factors. As shown in Table 4, a total of five components with eigenvalues of 1.00 or greater were extracted for the P1/P2 item ratings. These five components explained a relatively large amount of variance (65%). Factor 1 explained considerably more variance than the other four factors (31% compared to 11%, 9%, 8%, and 5%, respectively). Based on the items that loaded highly on each, the major constructs assessed by the five factors are:

Factor 1 Empathy, Coping, Co-operation
Factor 2 Actively Assists Others
Factor 3 Negative Affect
Factor 4 Fighting and Aggression
Factor 5 Socially Withdrawn
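The item-assignment rule described under Factor Analysis (each item goes to the factor on which it loads most highly, provided the absolute loading reaches 0.40) can be sketched in Python as follows. The function name, item labels and loadings are hypothetical, since the rotated loading matrix is not reported here.

```python
def assign_items_to_factors(loadings, threshold=0.40):
    """Map each item to the 1-based index of its highest-loading factor.

    `loadings` maps an item label to its list of rotated factor loadings.
    Items whose largest absolute loading falls below `threshold` are
    mapped to None (not assigned to any factor).
    """
    assignments = {}
    for item, row in loadings.items():
        best = max(range(len(row)), key=lambda k: abs(row[k]))
        assignments[item] = best + 1 if abs(row[best]) >= threshold else None
    return assignments


# Hypothetical rotated loadings on three factors (NOT study data).
hypothetical_loadings = {
    "item_a": [0.72, 0.10, -0.05],
    "item_b": [0.15, -0.55, 0.20],
    "item_c": [0.30, 0.25, 0.35],
}
print(assign_items_to_factors(hypothetical_loadings))
```

In this invented example item_a is assigned to the first factor, item_b to the second (its loading is negative but exceeds the cut-off in absolute value), and item_c to none.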

Table 3 presents a summary of baseline and posttest results indicating the number and percentage of statistically significant and directional (of any size) outcomes favouring the Intervention and Control groups. At baseline, Control pupils were rated significantly more positively than Intervention pupils on five items (19% of the total) and were directionally higher on 18 items (69%). After adjusting for baseline scores, posttest comparisons were significant on only two items, both favouring the control pupils: copes well with disappointment (effect size = -0.16) and often has temper tantrums (ES = -0.20). However, reversing the negative trend at baseline, intervention means were directionally higher on 13 items (50%), with the largest differences indicated on: thinks things out before acting (ES = +0.12), nervous or clingy (ES = +0.11), initiates interactions with others (ES = +0.11), and kind to younger pupils (ES = +0.10).


Table 3. Summary of baseline and posttest treatment comparisons for P1/P2 combined

Assessment and           Significant     Directional     Total
Treatment Advantage      f      %        f      %        f      %
Baseline
  Intervention           0      0        7      27       7      27
  Control                5      19       13     50       18     69
  Neither                                1      4        1      4
Posttest
  Intervention           0      0        13     50       13     50
  Control                2      8        7      27       9      35
  Neither                                4      15       4      15

Note. All percentages are based on a 26-item total.

Combined P1/P2 item results by factor. Intervention and Control ratings were also compared on each factor by averaging component item scores within factors (see Table 4). The t-test results for the baseline (Sweep 1) assessment indicated only one significant effect: an advantage for the Control over the Intervention treatment (p < .05, ES = -0.16) on Factor 1, Empathy, Coping, and Co-operation. Posttest results, as summarized in Table 4, revealed no significant effects or strong effect sizes.

Table 4. A summary of treatment comparisons on factor scores for P1/P2 combined

Factor                           Mean        Mean        Adj. Mean   Adj. Mean   SD          Adj. ES   p
                                 (Interv.)   (Control)   (Interv.)   (Control)   (Control)
                                 N=283       N=390
Empathy, Coping, Co-operation    3.48        3.58        3.53        3.55        0.78        -0.02     .72
Actively Helps Others            2.41        2.48        2.41        2.48        0.48        -0.15     .07
Negative Affect                  1.14        1.12        1.14        1.13        0.27        -0.03     .65
Fighting and Aggression          1.17        1.20        1.17        1.20        0.32        +0.09     .21
Socially Withdrawn               1.19        1.17        1.20        1.16        0.34        -0.09     .20

Separate P1 and P2 item analyses. To determine whether the treatment comparisons were consistent for the two class levels, we analyzed the posttest results for each separately using ANCOVA, with baseline scores as the covariate. A summary of the posttest outcomes is provided in Table 5. Especially noteworthy are the directly contrasting patterns for P1 and P2. Specifically, at the P1 level, after adjustment for baseline scores, the posttest results strongly favoured the Control treatment, which had a statistically significant advantage on 5 items (19%) and a directional advantage on a total of 19 items (73%). By comparison, the Intervention treatment had no significant effects and only five directional advantages. Items that significantly favoured the Control group were:

Understands other people’s feelings (ES = -0.20)

Listens to others’ point of view (ES = -0.27)

Resolves problems with other children (ES = -0.28)

Copes well with disappointment or frustration (ES = -0.23)

Helpful if someone is hurt, upset, or feeling ill (ES = -0.37)

Table 5. Summary of separate P1 and P2 posttest treatment comparisons

Class Level and          Significant     Directional     Total
Treatment Advantage      f      %        f      %        f      %
P1
  Intervention           0      0        5      19       5      19
  Control                5      19       14     54       19     73
  Neither                                2      8        2      8
P2
  Intervention           8      31       11     42       19     73
  Control                0      0        7      27       7      27
  Neither                                0      0        0      0

Note. All percentages are based on a 26-item total.

Conversely, at the P2 level, the intervention treatment was significantly superior on 8 items (31%) and directionally superior on a total of 19 (73%). The control treatment, however, had no significant effects and a directional superiority on seven items (27%). Items reflecting significant intervention treatment effects were:

Understands other people’s feelings (ES = +0.26)

Listens to others’ point of view (ES = +0.28)

Expresses needs and feelings appropriately (ES = +0.35)

Initiates interactions and joins others (ES = +0.33)

Considerate of other people’s feelings (ES = +0.24)

Nervous or clingy in new situations (ES = +0.24)

Thinks things out before acting (ES = +0.29)

Sees tasks through to the end, good attention span (ES = +0.22)

P5/P6 Teacher Ratings

P5/P6 factors. For P5/P6, a total of five components (with eigenvalues over 1.00) were extracted. As for P1/P2, these components explained a relatively large amount of variance (64%). Factor 1 explained considerably more variance than the other four factors (27% compared to 13%, 10%, 8%, and 6%, respectively). The fifth factor, however, consisted of only two items, “picked on or bullied” and “gets on better with adults than with children,” which were not reflective of a conceptually coherent construct. Thus, it was decided to analyze outcomes for these items separately rather than as a category. For the first four factors, item loadings suggested measurement of the following constructs:

Factor 1 Empathy and Cooperation
Factor 2 Reflectivity and Perseverance
Factor 3 Negative Affect
Factor 4 Fighting and Aggression

Baseline and posttest results for combined P5/P6 groups are summarized in Table 6 in terms of the number and percentage of statistically significant and directional outcomes favouring the Intervention and Control groups. At baseline, Control pupils were rated significantly more positively than Intervention pupils on two items (8% of the total) and were directionally higher on 17 items (65%). Intervention pupils had 0 significant and 9 (35%) directional advantages. After adjusting for baseline scores, posttest comparisons were significant on five items, three of which favoured the intervention group (recognizes and labels feelings accurately, ES = +0.15; picked on or bullied by other children, ES = +0.17; gets on with adults better than children, ES = +0.27) and two of which favoured the control group (thinks things out before acting, ES = -0.17; sees tasks through to the end, ES = -0.13). In total, Intervention pupils received directionally higher ratings on 14 items (54%) and Control pupils on 12 items (46%).

Table 6. Summary of baseline and posttest treatment comparisons for P5/P6 combined

Assessment and           Significant     Directional     Total
Treatment Advantage      f      %        f      %        f      %
Baseline
  Intervention           0      0        9      35       9      35
  Control                2      8        15     57       17     65
  Neither                                0      0        0      0
Posttest
  Intervention           3      12       11     42       14     54
  Control                2      8        10     38       12     46
  Neither                                0      0        0      0

Note. All percentages are based on a 26-item total.

Combined P5/P6 item results by factor. Intervention and Control ratings were compared on each factor by averaging component item scores within factors. No significant findings were indicated for the baseline assessment. However, two significant but directly contrasting effects occurred on the posttest (see Table 7). The Control treatment was superior on Factor 2, Reflectivity and Perseverance (p < .01, ES = -0.14), whereas the Intervention treatment was superior on Factor 3, Negative Affect (p < .01, ES = +0.23), where lower ratings of negative affect are the favourable outcome.
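To make concrete what “adjusting for baseline scores” involves, the sketch below computes ANCOVA-adjusted posttest means using the textbook single-covariate formula: each group's posttest mean is shifted by the pooled within-group regression slope times the gap between that group's baseline mean and the grand baseline mean. This is a generic illustration with invented scores, not a reconstruction of the authors' analysis, and it assumes equal slopes in both groups.

```python
def ancova_adjusted_means(groups):
    """Adjusted posttest means for a one-covariate ANCOVA.

    `groups` maps a group name to a list of (baseline, posttest) pairs.
    Assumes a common (pooled within-group) regression slope.
    """
    all_baseline = [pre for pairs in groups.values() for pre, _ in pairs]
    grand_baseline_mean = sum(all_baseline) / len(all_baseline)

    sum_xy = 0.0  # pooled within-group cross-products
    sum_xx = 0.0  # pooled within-group baseline sum of squares
    group_means = {}
    for name, pairs in groups.items():
        mean_pre = sum(pre for pre, _ in pairs) / len(pairs)
        mean_post = sum(post for _, post in pairs) / len(pairs)
        group_means[name] = (mean_pre, mean_post)
        sum_xy += sum((pre - mean_pre) * (post - mean_post) for pre, post in pairs)
        sum_xx += sum((pre - mean_pre) ** 2 for pre, _ in pairs)

    if sum_xx == 0:
        raise ValueError("baseline scores do not vary within groups")
    slope = sum_xy / sum_xx

    return {
        name: mean_post - slope * (mean_pre - grand_baseline_mean)
        for name, (mean_pre, mean_post) in group_means.items()
    }


# Hypothetical (baseline, posttest) ratings (NOT study data).
hypothetical_scores = {
    "intervention": [(1, 2), (2, 3), (3, 4)],
    "control": [(2, 2), (3, 3), (4, 4)],
}
print(ancova_adjusted_means(hypothetical_scores))
```

In this invented example both groups have the same raw posttest mean, but the intervention group started lower at baseline, so its adjusted mean is higher than the control group's. The same logic explains how a group rated less positively at baseline can show an advantage after adjustment.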

Table 7. A summary of treatment comparisons on factor scores for P5/P6 combined

Factor                           Mean        Mean        Adj. Mean   Adj. Mean   SD          ES        p
                                 (Interv.)   (Control)   (Interv.)   (Control)   (Control)
                                 N=367       N=390
Empathy and Cooperation          3.88        3.91        3.92        3.89        0.71        +0.04     .45
Reflectivity and Perseverance    2.34        2.40        2.33        2.41        0.56        -0.14     .00*
Negative Affect                  1.21        1.29        1.21        1.30        0.39        +0.23     .00*
Fighting and Aggression          1.14        1.15        1.14        1.15        0.36        +0.03     .58

*Statistically significant.
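The paper does not state how the effect sizes were computed. One reading that is consistent with the values in Table 7 is the difference between adjusted means divided by the Control group standard deviation, with the sign reversed for negatively worded constructs so that a positive ES always favours the Intervention. The Python sketch below applies that assumed formula to two rows of Table 7; the function name and the `higher_is_better` flag are illustrative only.

```python
def adjusted_effect_size(adj_mean_interv, adj_mean_control, sd_control,
                         higher_is_better=True):
    """Assumed ES: adjusted mean difference in Control-SD units.

    For constructs where lower ratings are desirable (e.g. Negative
    Affect), the sign is reversed so positive values favour Intervention.
    """
    difference = adj_mean_interv - adj_mean_control
    if not higher_is_better:
        difference = -difference
    return difference / sd_control


# Values taken from Table 7 (P5/P6 combined).
es_reflectivity = adjusted_effect_size(2.33, 2.41, 0.56)
es_negative_affect = adjusted_effect_size(1.21, 1.30, 0.39, higher_is_better=False)

print(f"Reflectivity and Perseverance: {es_reflectivity:+.2f}")
print(f"Negative Affect: {es_negative_affect:+.2f}")
```

Under this assumption the two computed values round to -0.14 and +0.23, matching the ES column of Table 7. Because the published means are themselves rounded to two decimals, other rows may differ from the reported ES in the second decimal place.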

Separate P5 and P6 item analyses. As for P1 and P2, we analyzed the posttest results for P5 and P6 separately using ANCOVA, with baseline scores as the covariate. A summary of the posttest outcomes is provided in Table 8. In P5, the Intervention treatment had a statistically significant advantage on five items (19%) and a directional advantage on a total of 15 items (58%). By comparison, the Control treatment had two (8%) significant effects and a total of 11 (42%) directional advantages. Items that significantly favoured the Intervention group were:

Provides help (ES = +0.19)

Often fights with other children, bullies them (ES = +0.16)

Often unhappy, downhearted, or tearful (ES = +0.23)

Nervous or clingy in new situations (ES = +0.29)

Gets on better with adults than with children (ES = +0.22)

Those favouring the Control treatment were:

Kind to younger children (ES = -0.22)

Thinks things out before acting (ES = -0.37)

As Table 8 shows, at the P6 level, results were mixed, with the Intervention treatment having the only significant effect, on “gets on better with adults” (ES = +0.37), and each treatment being favoured directionally on close to half of the items.

Table 8. Summary of separate P5 and P6 posttest treatment comparisons

Class Level and          Significant     Directional     Total
Treatment Advantage      f      %        f      %        f      %
P5
  Intervention           5      19       10     38       15     58
  Control                2      8        9      34       11     42
  Neither                                0      0        0      0
P6
  Intervention           1      4        11     42       12     46
  Control                0      0        14     54       14     54
  Neither                                0      0        0      0

Note. All percentages are based on a 26-item total.

Summary of findings from Teacher Ratings of Pupil Behaviour.

In Sweep 1 (baseline), Control teachers rated both P1/P2 and P5/P6 pupils more positively than did Intervention teachers. This tendency was particularly evident for the P1/P2 cohort.

In Sweep 2, the ratings for the Intervention group improved relative to those for the Control group when compared with baseline, particularly for the P5/P6 cohort. In fact, after statistically adjusting for baseline ratings, the P5/P6 Intervention group mean significantly surpassed the Control group mean on three items, whereas the Control mean was superior on only two items. Directional differences favoured the Intervention group on 14 (54%) items and the Control group on 12 (46%). Of note, however, when analyses were conducted separately for class levels, the most favourable outcomes for the Intervention treatment occurred at P2, with significant advantages indicated on 8 (31%) items and directional advantages on a total of 19 (73%) items. The least favourable were at P1, with 0 (0%) significant and only 5 (19%) directional advantages. These results, although preliminary given the early stage of implementation at the time of the Sweep 2 data collection, suggest that older pupils (P2 and beyond) may have greater readiness to grasp the strategies and model the behaviours than the younger pupils (in P1) participating in the evaluation. As the pupils are followed longitudinally over the next year or more, impacts on both cohorts should become more evident.

Methodological, analytical and ethical issues highlighted by the evaluation process: a critical gaze

While preliminary with respect to the early implementation of the intervention at the time of the data collection, findings from classroom and playtime observations and teacher ratings of pupil behaviour suggest that the T4A programme for schools impacts positively on children’s social-emotional development through the provision of interpersonal, instructional and environmental supports for teaching and learning. The findings are therefore consistent with those from a meta-analysis to be reported by Durlak et al (in press) on the positive impact of SEL programmes, where, compared to controls, SEL participants demonstrated significant improvements in social and emotional skills, attitudes and behaviour. However, the interesting question is whether and, if so, how the preliminary findings reported here are sustained or enhanced over the evaluation’s subsequent two years. Our immediate critical attention, meanwhile, may usefully be drawn to a number of issues relating to evaluation design, process and analysis in pursuit of evidence of educational effectiveness that the current evaluation has highlighted thus far. We discuss these issues below as i) identifying the ‘Optimal Evaluation Moment’; ii) the role of classroom observation in evaluating educational effectiveness; iii) measuring fidelity of implementation; iv) securing evidence of SEL outcomes; v) the concept of ‘Control’; and vi) on triangulating data and seeking deeper understandings.

i) Identifying ‘Optimal Evaluation Moments’.

From a socio-cultural perspective, teachers, parents and other significant adults are more experienced social partners from whom children learn the social practices and cultural conventions of social interactions. The need for teachers to learn and adopt the social practices and cultural conventions associated with an SEL programme may be underestimated. Pre-requisites for children’s learning from an SEL programme would seem to be increased teacher knowledge and understanding of SEL teaching and learning strategies, followed by changes in teacher attitudes about eliciting and managing SEL, and ultimately by changes in teaching behaviours associated with promoting children’s SEL development. These considerations raise the design question of when the ‘optimal evaluation moment’ occurs. That is to say, how should an evaluation of educational effectiveness best be designed and timed to coincide with a secure level of implementation and so capture meaningful learner outcomes? What processes build towards this optimal ‘moment’, and can they be effectively incorporated and assimilated within the evaluation design? When and how can the pedagogic factors and influences that mediate learner outcomes best be evaluated? A raft of individual child assessments comprises the main pretest and posttest measures, and the evaluation team favoured delaying the posttest of individual child assessments for 18 months from the pretest, to ensure an adequate interim period for possible SEL development to take place. However, the decision to conduct more frequent teacher ratings of pupil behaviour and observations of teaching and pupil behaviour allows for a more detailed and complex analysis over the evaluation’s duration. Of particular interest is capturing possible evidence over time of variations and trends in teaching and learner behaviours associated with the programme’s implementation, without being overly intrusive or assessing children more than is necessary. Therefore, choosing the ‘optimal evaluation moments’ within the continuum of effectiveness research presents one of the main design and methodological challenges to educational evaluators.

ii) The role of classroom observation and rating scales in evaluating educational effectiveness.

The PATHS program, on which T4A is based, did not use observations of teaching and pupil behaviours. The present evaluation introduces a new, robust and reliable measure for evaluating teaching and pupil behaviours relating to the development of social-emotional learning and mutual respect and understanding. Based on a systematic design process including programme review, expert feedback, and an iterative development approach, the COB was designed to focus on 11 teaching and 9 pupil behaviours and the POB on 10 pupil behaviours. This would seem to offer a relevant contribution to the field, particularly in view of the findings by Durlak et al (in press) that interpersonal, instructional and environmental supports such as proactive classroom management and co-operative learning produce better school performance.

The primary data recorded by observers are the frequency ratings for each target behaviour. Supplementary data are brief verbal descriptions of the actual instances of behaviours observed. For the target behaviour frequency ratings, the following four-point scale is used: 0 = Not Applicable; 1 = Not Seen; 2 = Rarely/Occasionally; 3 = Frequently. The “Not Applicable” category is used when a behaviour is precluded from occurring during the entire 15-minute interval. Examples are if children are performing independent work in silence or if the teacher is engaged in whole-class teaching for the entire observation period. There is much debate and controversy around the validity and reliability of rating scales in educational research, particularly around the robustness, sensitivity, and capacity of rating scales to faithfully represent the phenomenon under scrutiny (Cohen et al, 2000). However, it is our view that the four-point scale used in the present evaluation meets the above requirements, minimizes the risk of unreliability associated with higher degrees of inference about events observed, and is fit for purpose. This view is supported by the positive findings of the inter-rater reliability study (Kappa statistics were +0.63, +0.73, and +0.73, respectively, for teaching behaviours, classroom pupil behaviours and playtime pupil behaviours, all indicative of “substantial” inter-rater reliability), by the quality of data produced, and by reports of the observer users. Moreover, the complexity of the 20 social and emotional behaviours to be considered in each 15-minute observation period should not be under-estimated.
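
For readers unfamiliar with the Kappa statistic reported above, the following is a minimal sketch of how agreement between two observers on the four-point scale can be computed. It assumes an unweighted Cohen's kappa for two raters; the paper does not state which variant was used, and the ratings shown are hypothetical illustrations, not study data. The function name `cohens_kappa` is likewise our own.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters coding the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per code.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical code throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for ten target behaviours in one observation period:
# 0 = Not Applicable, 1 = Not Seen, 2 = Rarely/Occasionally, 3 = Frequently.
observer_1 = [3, 2, 2, 1, 0, 3, 2, 1, 1, 3]
observer_2 = [3, 2, 1, 1, 0, 3, 2, 2, 1, 3]

kappa = cohens_kappa(observer_1, observer_2)
print(f"kappa = {kappa:.2f}")
```

In this invented example the observers agree on 8 of 10 behaviours, but kappa is lower than 0.80 because some of that agreement would be expected by chance. Values between 0.61 and 0.80 are conventionally labelled “substantial” agreement.
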
Rather than using an extended scale, we would argue that an observation of such complex and multiple behaviours is best represented on a ‘simple’ and reliable four-point scale that combines frequency or saliency of the target behaviour with duration, or the length of the ‘active time interval’ during which the target behaviour could occur (in this case 7.5 or more minutes). One area for improvement, however, is the requirement for additional contextual information to support analysis and meaningful interpretation of ‘not applicable’ and ‘not seen’ category entries, in order to understand more clearly features of teaching and learning that appear to preclude or constrain social-emotional teaching and learning behaviours. At the same time, the limitations of observing teacher practices given time and contextual factors should not be under-estimated. That is, observers might visit classrooms many times and not see particular behaviours simply because those behaviours do not lend themselves to regular, frequent practice.

iii) Measuring fidelity of implementation

Research into educational effectiveness will necessarily be concerned with the fidelity of implementation at practitioner level, where ‘implementation fidelity’ may be defined as the degree of fit between the original programme and its application. It follows that measures of implementation fidelity strengthen effectiveness research. In the present study, the fidelity components of adherence, exposure, integrity and differentiation are measured indirectly through interviews with T4A coaches who support programme delivery and development, through interviews with teachers and through teacher surveys. However, quality of programme delivery (the manner of practitioners’ delivery and use of methods and strategies) and practitioner responsiveness (the extent to which practitioners ‘enact’ and apply the programme’s principles and are involved in the activities and content of the programme) are measured in part by the randomised classroom observations of teaching behaviours. It may be argued that, while randomly assigned, classroom observations were nonetheless expected over the data collection period, and that indicators of implementation fidelity are only ever partial and tentative. However, the observation schedule would seem to contribute useful information on implementation fidelity when triangulated with that from other measures in the evaluation.

Mihalic et al (2002) suggest that the more a programme developer is involved in the implementation of the programme, the more faithfully and effectively the programme will be delivered. An important consideration for the present evaluation, therefore, is the high level of support invested by T4A programme developers in assuring implementation fidelity. This has included T4A coaches working closely with teachers in the intervention schools, leadership training for the principals of the intervention schools, and supported networking and training for in-school T4A co-ordinators. For true reporting of educational effectiveness, therefore, evaluations should use a range of methods to consider and measure implementation fidelity on multiple levels. In this sense, effectiveness research should take a broad ecological perspective.

iv) Securing meaningful evidence of SEL outcomes

In an evaluation of the effectiveness of a SEL programme, it would seem appropriate to consider the following six learner outcomes used by Durlak et al (in press) in their meta-analysis: social and emotional skills, attitudes towards self and others, positive social behaviours, conduct problems, emotional distress, and academic performance. The teacher ratings of pupil behaviour and the observations of pupil behaviour in the present study address all the outcomes with the exception of academic performance. In our view, programme implementation would need to be securely embedded over a longer period of time before impact on pupil academic performance could be measured with any confidence. Conflation effects are also a strong possibility, particularly as the T4A programme pilot coincided with the introduction of a revised statutory national curriculum in Northern Ireland. We believe that a more relevant and reliable indicator of impact on academic performance will be to track patterns and trends of pupil assessments longitudinally and after a more sustained period of implementation, using Standard Assessment Tasks (SAT) data and schools’ tracking systems for assessing pupil progress from the year prior to their point of entry into the programme, for comparison with expected trends.

Similar arguments influenced our choice of analyses. For example, we chose not to cluster the observation data by teacher or school because of the small number of schools (n = 6) in each treatment and the fairly small number of observations per type per school (20 for classroom observations and 8 for playtime observations). With these numbers, statistical power would be unacceptably low for detecting intervention effects. Clustering by teacher would leave far fewer observations per cluster and thus be unreliable. Our analysis, however, makes adjustments for inflation in the Type I error rate (falsely rejecting the null hypothesis) as a result of aggregating the data across schools. Our rationale for not including a clustering design with the early teacher rating data in the preliminary analysis was that liberal analyses with high power would be more appropriate since the intervention was fairly recent and implementation was at an early stage. There might nonetheless be some wisdom in considering whether the recent ratings data could be analysed by clustering within teachers, as a replication analysis using basic and HLM analyses. The power of such analyses would probably be low, however, since the units of analysis (teachers) would not be very large in number within the focus year groups.
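
The trade-off between clustering and power described above can be illustrated with the standard design effect, 1 + (m - 1) × ICC, where m is the number of observations per cluster. The sketch below is illustrative only: the school and observation counts come from the text, but the intraclass correlation (ICC) of 0.10 is a hypothetical value, not one estimated from the evaluation data, and the paper does not state that this particular adjustment was the one applied.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    total = n_clusters * cluster_size
    return total / design_effect(cluster_size, icc)


# Counts reported in the text for one treatment arm.
schools_per_arm = 6
classroom_obs_per_school = 20

# Hypothetical ICC, chosen purely for illustration.
assumed_icc = 0.10

deff = design_effect(classroom_obs_per_school, assumed_icc)
n_eff = effective_sample_size(schools_per_arm, classroom_obs_per_school, assumed_icc)
print(f"design effect = {deff:.2f}, effective n per arm = {n_eff:.1f}")
```

Under this assumed ICC, the 120 classroom observations in a treatment arm would carry roughly the information of 41 independent observations, which shows why treating observations as independent gains power at the cost of an inflated Type I error rate that then needs adjustment.
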

v) The concept of ‘Control’

In a randomised controlled evaluation, the concept of the control treatment is central to the design and to the analysis of findings. However, the concept is problematic. While not receiving the intervention, control schools will become involved in other educational initiatives and programmes over the duration of an evaluation period, the effects of which will often be unknown to evaluators of the intervention programme but may be critical with respect to reporting effectiveness findings. In the present evaluation, introduction of the Revised Curriculum for Northern Ireland coincided with the introduction of the T4A pilot programme for schools. The Control Schools were legally required to introduce the Personal Development and Mutual Understanding (PDMU) component of the revised curriculum, which addresses similar content to the T4A programme. The concept of ‘control’ is therefore one of difference in the processes supporting implementation and pedagogic change as well as differences in the content and delivery of the T4A and PDMU programmes. This highlights the importance of mixed method approaches to ensure that the true nature of the ‘control’ is faithfully represented when reporting findings of educational effectiveness.

vi) On triangulating data and seeking deeper understandings

Preliminary findings from Teacher Ratings of Pupil Behaviour and evaluators’ observations corroborate positive effects of the T4A programme on children’s SEL. However, perhaps a more critical question is ‘How do observation findings correspond with teachers’ ratings of pupil behaviour?’ We have yet to analyse data in this way, but data collected more recently and contiguously will allow this important analytical development. Such analysis will be needed to determine whether and how the observation tool might complement teachers’ ratings of pupil behaviour in the future, as a combined method for evaluating ongoing programme effectiveness where implementation is sustained over time. Such an analytical approach to seeking deeper understandings through the corroboration of contiguous yet distinct data may have implications for future research into educational effectiveness.

References

Cohen, L., Manion, L. & Morrison, K. (2000). Research Methods in Education. 5th Edition. London: Routledge.

Durlak, J.A., Weissberg, R.P., Dymnicki, A.B., Taylor, R.D. & Schellinger, K.B. (in press). The Impact of Enhancing Pupils’ Social and Emotional Learning: A Meta-analysis of School-based Universal Interventions. To appear in Child Development (2011).

Eames, C., Daley, D., Hutchings, J., Whitaker, C.J., Jones, K., Hughes, J.C. & Bywater, T. (2009). Treatment fidelity as a predictor of behaviour change in parents attending group-based parent training. Child Care Health and Development, Vol. 35, No. 5, pp. 603-612.

Goodman, R. (1997). The Strengths and Difficulties Questionnaire: a research note. Journal of Child Psychology and Psychiatry 38, 581-586.

Goodman, R. (1999). The extended version of the strengths and difficulties questionnaire as a guide to child psychiatric caseness and consequent burden. Journal of Child Psychology and Psychiatry 40, 791-799.

Lemke, J.L. (2009). Articulating Communities: Sociocultural Perspectives on Science Education. http://academic.brooklyn.cuny.edu/education/jlemke/papers/jrst-1.htm Retrieved 18/12/2009.

Mihalic, S., Fagan, A., Irwin, K., Ballard, D. & Elliott, D.S. (2002). Blueprints for violence prevention. Replications: Factors for implementation success. Boulder, CO: University of Colorado.

Zins, J.E., Weissberg, R.P., Wang, M.C. & Walberg, H.J. (Eds) (2004). Building Academic Success on Social and Emotional Learning. What Does The Research Say? New York: Teachers College Press.

This document was added to the Education-line collection on 13 April 2011
