
HOW TO CATCH A CHEATER

Dr.ssa MANUELA RAVAGNAN, DEMETRA OPINIONI.NET S.R.L.
Dott. MARCO FORNEA, DEMETRA OPINIONI.NET S.R.L.
WWW.OPINIONI.NET

21ST GENERAL ONLINE RESEARCH CONFERENCE, 6 MARCH - 8 MARCH 2019, COLOGNE

RELEVANCE & RESEARCH QUESTION:

Online surveys are self-administered by respondents seeking incentives for completing questionnaires. Some respondents use minimal cognitive effort in order to finish quickly and collect the incentive. This can trigger behaviour such as not reading the questions carefully, racing through the survey or intentionally cheating, resulting in poor data quality. This paper investigates the behaviour of cheaters among online respondents from a non-probability-based panel, analyzing seven techniques for detecting cheaters, applied in different ways, in order to find an efficient methodology that eliminates as many cheaters as possible without removing honest panelists.

Methods & Data:

We used data from 2 web surveys conducted in Italy (during January 2019) on members of our own panel, Opinioni.net, which comprises 20,558 active panelists. The 2 surveys considered in our study have the following characteristics: a sample size of 1,073 for the first dataset and of 1,004 for the second, the same target population and a food-consumption topic. Sample members were stratified by geographic area, gender and age in order to be representative of the Italian population. In both questionnaires we asked a particular question: "Are you affected, or have you been affected in the past, by one or more of the following long-term illnesses or pathological conditions?". We considered suffering from "Allergies" to be the target variable of our studies. The first survey was used as a training set to determine a method for identifying cheaters. In particular, we analyzed the estimates of the target variable for each check, and for any combination thereof, against the estimate for the same question from the Multi-purpose Survey on Families "Health conditions and use of health services" conducted by Istat in 2016, in order to assess the quality of our data. Once the method was defined, we validated it using the second survey as a test set.

THE TECHNIQUES USED TO DETECT CHEATERS ARE:

1. Unlikely events (unlik): At the beginning of a questionnaire there are often screening questions aimed at detecting whether the respondent has the characteristics required to access the survey. The respondent may claim to have all the required characteristics, even when this is not the case, in order to proceed with the questionnaire and gain the final incentive. To identify this type of cheater it is sufficient to insert, among the screening questions, a question asking whether the respondent has one or more characteristics that he or she would not normally have.

2. Bad open questions (open): In a mandatory open-ended question, participants can give an inappropriate or unreliable answer (e.g. "Asdfhjkl") as a way to indicate the lack of a meaningful answer, or simply as a means of not leaving the space blank.
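The poster does not describe how such answers were identified (in practice this is often a manual or semi-automated review). Purely as an illustration, the following Python sketch shows one possible heuristic flag for low-quality open answers; the thresholds and patterns are assumptions, not the authors' method.

```python
import re

def looks_like_junk(answer: str, min_length: int = 3) -> bool:
    """Heuristic flag for open-ended answers with no meaningful content.

    Purely illustrative rules: very short answers, answers with no vowels,
    a single repeated character, or long consonant runs typical of
    keyboard mashing such as 'Asdfhjkl'.
    """
    text = answer.strip().lower()
    if len(text) < min_length:
        return True
    if not re.search(r"[aeiou]", text):                  # no vowels at all
        return True
    if re.fullmatch(r"(.)\1+", text):                    # e.g. 'aaaaaaa'
        return True
    if re.search(r"[bcdfghjklmnpqrstvwxyz]{5,}", text):  # keyboard mashing
        return True
    return False

# Examples
print(looks_like_junk("Asdfhjkl"))                        # True
print(looks_like_junk("I usually buy fresh vegetables"))  # False
```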

3. Direct instruction in question body (direct_1, direct_2): these are questions containing a direct instruction in the body, which aims to check whether the text of the question has been read carefully. In this case we used two radio-button questions:
1. direct_1: "To continue click on Friday..." (the answer options were the days of the week)
2. direct_2: "To demonstrate that you have read this instruction, please do not answer the question below. Instead click on the Next button to continue filling out the questionnaire."

4. Consistency checks (coer_1, coer_2): These validation checks consist of two or more related questions placed at different points of the questionnaire. The answer to the second question should correspond to, or at least not contradict, the answer to the first one.
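The poster does not give the wording of coer_1 and coer_2, so the following Python sketch uses an invented pair of related questions purely to illustrate the logic of a consistency check: the second answer must not contradict the first.

```python
# Illustrative consistency check: the question pair and its coding are
# hypothetical, not the actual coer_1/coer_2 items used in the surveys.
def fails_consistency(respondent: dict) -> bool:
    """Flag a contradiction between two related answers.

    Example pair: q_drinks_coffee ('yes'/'no') asked early in the
    questionnaire, and q_cups_per_day (integer) asked later.
    Answering 'no' while reporting one or more cups per day is contradictory.
    """
    says_no = respondent.get("q_drinks_coffee") == "no"
    reports_cups = respondent.get("q_cups_per_day", 0) > 0
    return says_no and reports_cups

# Examples
print(fails_consistency({"q_drinks_coffee": "no", "q_cups_per_day": 3}))   # True
print(fails_consistency({"q_drinks_coffee": "yes", "q_cups_per_day": 2}))  # False
```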

5. Speeder checks (time): Speeders are survey participants who finish too quickly. The difficulty with this kind of method is setting the cut-off that defines "how quick is too quick". We considered two reference values based on completion time:
1. 33% of the mean completion time, after excluding outliers.
2. 48% of the median completion time.
For each path in the survey we set the cut-off as the mean of these two values.
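A minimal Python sketch of this cut-off rule, assuming completion times in seconds are available for each respondent on a given survey path; the 1.5*IQR rule used here to exclude outliers is an assumption, since the poster does not specify how outliers were removed.

```python
import statistics

def speeder_cutoff(times_seconds: list[float]) -> float:
    """Cut-off below which a respondent is treated as a speeder.

    Mean of two reference values: 33% of the outlier-trimmed mean and
    48% of the median of the completion times for one survey path.
    Outliers are trimmed with a simple 1.5*IQR rule (an assumption).
    """
    times = sorted(times_seconds)
    q1, _, q3 = statistics.quantiles(times, n=4)
    iqr = q3 - q1
    trimmed = [t for t in times if q1 - 1.5 * iqr <= t <= q3 + 1.5 * iqr]
    ref_mean = 0.33 * statistics.mean(trimmed)
    ref_median = 0.48 * statistics.median(times)
    return (ref_mean + ref_median) / 2

def is_speeder(completion_time: float, cutoff: float) -> bool:
    return completion_time < cutoff

# Example with invented completion times (seconds)
times = [420, 510, 380, 600, 455, 95, 530, 610, 470, 1800]
cutoff = speeder_cutoff(times)
print(round(cutoff, 1), [t for t in times if is_speeder(t, cutoff)])
```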

6. Straightlining checks (straigh_1, straigh_2, straigh_3): Straightlining occurs when survey respondents give identical answers (or answers with a predictable pattern) to the items of a battery of questions that share the same response scale. To catch this type of cheater, we evaluated the average of the absolute differences between adjacent scores, combined with the time taken to complete the set of questions. When the answers form a straight line (for example 1,1,1,1...), the average of the differences is around 0; when they have a "zig-zag" shape, such as 1,2,1,2... or 1,2,3,2,1..., this score is around 1. This kind of behaviour also reduces completion time, because the respondent is not engaged with the questionnaire. In trials, it was found that the average CAWI respondent needs 300 milliseconds to understand a single word in a sentence. Multiplying this factor by the number of words in the question gives the time needed to read and understand the question properly. For these reasons, respondents for whom the average of the differences was 0 or 1 and whose completion time was less than the required reading time were considered to have failed the straightlining checks.
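A minimal Python sketch of this rule for one battery, assuming item scores on a shared numeric scale and the battery's completion time in seconds; the tolerance used to decide that the average difference is "around 0" or "around 1" is an assumption.

```python
WORD_TIME_SECONDS = 0.3  # ~300 ms per word, as reported for the average CAWI respondent

def fails_straightlining(scores: list[int],
                         completion_seconds: float,
                         battery_word_count: int,
                         tolerance: float = 0.1) -> bool:
    """Flag straightlining / zig-zag answering in a battery of items.

    A respondent fails when the mean absolute difference between adjacent
    scores is close to 0 (flat line) or close to 1 (zig-zag) AND the
    battery was completed faster than the estimated reading time
    (word count * 300 ms). The tolerance of 0.1 is illustrative.
    """
    diffs = [abs(b - a) for a, b in zip(scores, scores[1:])]
    mean_diff = sum(diffs) / len(diffs)
    patterned = mean_diff <= tolerance or abs(mean_diff - 1) <= tolerance
    required_reading_time = battery_word_count * WORD_TIME_SECONDS
    return patterned and completion_seconds < required_reading_time

# Examples with invented data: a battery of about 120 words
print(fails_straightlining([3] * 10, 12.0, 120))                   # flat line, too fast -> True
print(fails_straightlining([1, 2, 1, 2, 1, 2, 1, 2], 10.0, 120))   # zig-zag, too fast -> True
print(fails_straightlining([2, 4, 3, 5, 1, 3, 4, 2], 55.0, 120))   # varied, slower -> False
```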

7. Trap questions (fake brands/names) (fake): These consist of incorporating fictitious ("ghost") brands or names in a question. For one survey we used a radio-button question: "Have you ever heard of NAME OF FAKE SERVICE?" (Yes/No). For the other we incorporated a fake brand name among real ones in a yes/no grid.

We analyzed the percentage of people suffering from allergies in different sub-groups defined by the number of failed checks.

[Figure: Suffering allergies (%) by number of failed checks - first dataset (training set)]
Benchmark 10.7% | Total 15.84% | Failed none 16.83% | Failed 1 13.68% | Failed 2 12.16% | Failed 3 25% | Failed 4+ 20.83%

In the sub-groups of those who fail 3 checks or 4+ checks there is a considerable increase in those who claim to be allergic, perhaps because the "Yes" option was the first one and they did not read the question properly. This could be an indication that those who fail 3 or more checks should be eliminated from the survey.

[Figure: Suffering allergies (%) by number of failed checks - second dataset (test set)]
Benchmark 10.7% | Total 12.85% | Failed none 12.7% | Failed 1 12.58% | Failed 2 13.95% | Failed 3 10.34% | Failed 4+ 20%

In this case we found the same result as in the first data set for the group of those who fail 4+ checks, while for the group of those who fail 3 checks there is a considerable decrease in the number of people who claim to be allergic, and the percentage is quite similar to the benchmark. It may therefore not be necessary to eliminate those who fail 3 checks, but only those failing 4+.
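A minimal pandas sketch of this sub-group analysis, assuming a data frame with one row per respondent, a 0/1 allergy indicator and one boolean column per quality check; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical respondent-level data: allergy indicator plus one boolean
# flag per quality-control check (True = failed that check).
check_cols = ["direct_1", "direct_2", "unlik", "fake", "open",
              "coer_1", "coer_2", "time", "straigh_1", "straigh_2"]

df = pd.DataFrame({
    "allergy":  [1, 0, 0, 1, 0, 1, 0, 0],
    "direct_1": [False, False, True, False, False, False, False, False],
    "direct_2": [False, True, True, False, False, False, False, False],
    "unlik":    [False, False, True, True, False, False, False, False],
    "fake":     [True, False, True, False, False, True, False, False],
    "open":     [False, False, False, False, False, False, False, False],
    "coer_1":   [False, False, True, False, False, False, False, False],
    "coer_2":   [False, False, False, False, False, False, False, False],
    "time":     [False, False, False, True, False, False, False, False],
    "straigh_1":[False, False, False, False, False, False, False, False],
    "straigh_2":[False, False, True, False, False, False, False, False],
})

df["n_failed"] = df[check_cols].sum(axis=1)

# Frequency of respondents by number of failed checks
# (mirroring the frequency tables reported below)
print(df["n_failed"].value_counts().sort_index())

# Estimated share suffering from allergies in each sub-group of failed checks
print(df.groupby("n_failed")["allergy"].mean().mul(100).round(2))
```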

For each quality-control question we also looked at the percentage of respondents who fail it, together with the estimate of the target variable computed on the sample after excluding those who fail that single check (the benchmark is the Istat estimate of 10.7%).

[Figure: Participants failing each quality-control question, and estimated share suffering from allergies after excluding them - first dataset]
Check:        direct_1  direct_2  unlik   fake    open    coer_1  coer_2  time    straigh_1  straigh_2
% failing:    0.75      6.15      6.99    36.07   2.33    1.49    0.93    1.21    1.96       4.94
Estimate (%): 15.77     15.49     15.73   16.62   15.74   15.61   15.62   16.04   15.97      15.88

There is a high percentage of respondents who fail the trap questions (fake brands/names), which also corresponds to a higher estimate of the target variable than for the other checks. This type of question can be tricky: there will always be some percentage of respondents who honestly, but mistakenly, think they have heard of a brand even though it does not exist. Punishing the innocent can bias the sample, and indeed the estimate of the target variable rises. Another interesting result is the difference in the percentage of respondents who fail the two direct instructions in the question body: for the second check the answer options were "Yes", "No" or "I do not know", while for the first check the answer options were the days of the week, which are unusual compared with the classic response options and may have caught the respondent's attention.

[Figure: Participants failing each quality-control question, and estimated share suffering from allergies after excluding them - second dataset]
Check:        direct_1  direct_2  unlik   fake    open    coer_1  time    straigh_1  straigh_2  straigh_3
% failing:    0.9       5.08      8.76    3.78    5.38    2.89    0.8     0.2        11.45      26.29
Estimate (%): 12.96     12.8      12.77   12.73   12.84   12.72   12.85   12.87      12.6       12.57

In this second data set we chose to incorporate a fake brand among real ones in a yes/no grid and, compared with the first data set, there is a considerable reduction in the number of respondents who fail the check. We checked for straightlining in 3 batteries placed at the beginning, in the middle and at the end of the questionnaire: the later the battery appears, and hence the more time respondents have already spent on the questionnaire, the more respondents fail the check.
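A minimal pandas sketch of how the per-check estimates above can be computed: the share of allergy sufferers after dropping the respondents who fail one given check. The data frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical respondent-level data: 0/1 allergy indicator and one
# boolean flag per quality-control check (True = failed that check).
df = pd.DataFrame({
    "allergy":  [1, 0, 1, 0, 1, 0, 0, 1],
    "fake":     [True, False, True, False, False, False, True, False],
    "direct_1": [False, False, True, False, False, False, False, False],
    "unlik":    [False, True, False, False, True, False, False, False],
})
check_cols = ["fake", "direct_1", "unlik"]

# For each check: percentage failing it, and the allergy estimate on the
# sample that remains after excluding those who fail that single check.
for check in check_cols:
    pct_failing = 100 * df[check].mean()
    estimate_excl = 100 * df.loc[~df[check], "allergy"].mean()
    print(f"{check:>9}: {pct_failing:5.2f}% fail, estimate excluding them = {estimate_excl:.2f}%")
```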

Finally, we compared the estimates of those suffering from allergies for different sub-groups based on the cumulative number of failed checks. There is no clear improvement of the estimate of the target variable under any of these conditions. Keeping only those who fail no checks (i.e. removing every respondent who fails at least one) retains only 57.6% of the first sample, yet it worsens the estimate, so it is possible that we would be eliminating too much information. On the other hand, those who fail 3 or more checks are too few to have an impact on the estimate of a dichotomous variable. For the second data set, the results are quite comparable to those of the first.

[Figure: Suffering allergies (%) for sub-samples failing at most N checks - first dataset (training set)]
Benchmark 10.7% | Total 15.84% | Failed none 16.83% | Failed ≤1 15.73% | Failed ≤2 15.48% | Failed ≤3 15.73% | Failed ≤4 15.65%

[Figure: Suffering allergies (%) for sub-samples failing at most N checks - second dataset (test set)]
Benchmark 10.7% | Total 12.85% | Failed none 12.7% | Failed ≤1 12.66% | Failed ≤2 12.57% | Failed ≤3 12.7% | Failed ≤4 12.6%

Number of checks failed - first dataset (n = 1,073)
Checks failed   Frequency   Percentage
0               618         57.6
1               329         30.7
2               74          6.9
3               28          2.6
4               12          1.1
5               7           0.7
6               5           0.5
Total           1,073       100.0

Number of checks failed - second dataset (n = 1,004)
Checks failed   Frequency   Percentage
0               567         56.5
1               302         30.1
2               86          8.6
3               29          2.9
4               8           0.8
5               7           0.7
6               5           0.5
Total           1,004       100.0

CONCLUSION:

The two biggest factors in determining the number of participants eliminated from a study are:
1) The way the quality-control question is designed. We found that the way the direct instruction in the question body and the trap question with fake brands are designed significantly changes the percentage of people who fail the checks. The position of the questions in the questionnaire also influences whether the checks are failed.
2) The number of quality-control questions asked. Removing respondents who fail a single quality-control question does not improve data quality. In our analysis, participants flagged for removal should fail at least 3 quality-control measurements.
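As a recap, a minimal Python sketch of the resulting cleaning rule: count how many quality-control checks each respondent fails and remove those who fail at least 3. The per-check flags are assumed to come from the individual detection rules (for example the sketches shown earlier), and the respondent data shown here is hypothetical.

```python
import pandas as pd

MIN_FAILED_CHECKS_TO_REMOVE = 3  # threshold suggested by this analysis

# Hypothetical respondent-level flags, one boolean column per check
# (True = failed). In practice these would come from the individual
# detection rules (speeder cut-off, straightlining rule, trap question, ...).
flags = pd.DataFrame({
    "unlik":     [False, True,  False, True],
    "open":      [False, True,  False, False],
    "direct_1":  [False, True,  False, True],
    "coer_1":    [False, False, False, True],
    "time":      [False, True,  False, False],
    "straigh_1": [False, False, False, False],
    "fake":      [False, True,  False, False],
}, index=["resp_A", "resp_B", "resp_C", "resp_D"])

n_failed = flags.sum(axis=1)
to_remove = n_failed >= MIN_FAILED_CHECKS_TO_REMOVE

print(n_failed)
print("Removed respondents:", list(flags.index[to_remove]))  # resp_B (5 fails), resp_D (3 fails)
```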