Download - Testing for Success

> Testing for Success: Elements of a Successful Testing Program

> Agenda

§ Why Test?
§ Problem Diagnosis
§ Deciding what to Test
§ Test Execution and Measurement
§ Test Reporting

December 2011 © Datalicious Pty Ltd

> Why Test?



1. Why does your business/organisation exist?

2. How can your business/organisation improve?


EVERYONE’S GOT AN OPINION

> Why Test?

1. Systematic innovation
2. Avoid costly mistakes
3. Know why things go right, know why things go wrong
4. Better employee engagement

§ Requires planning and governance!


> Problem Diagnosis



> What is the business problem?


Analytics and metrics frameworks

Acquisition, Up-Sell, Retention, Advocacy

> Case Study


> Further Diagnosis

PROBLEM: Sales through online

– Not enough site traffic
– High home page bounce rate
– Low conversion on product page
– Checkout fallout

> Further Diagnosis II


Source: www.feng-gui.com

> Sometimes the small things count



> Further diagnosis III

Wrong message? Wrong channel? Wrong person? Wrong time?

> Testing as risk mitigation

Test Channel (rows) vs. Roll-out Channel (columns)

Test Channel | Press | TV | Radio | Outdoor
eDM/DM | Offer, Creative, Call-to-Action | Call-to-Action | Offer, Call-to-Action | Offer, Call-to-Action
Paid Search | Offer | Offer | Offer | Offer
Display Media | - | Creative | Offer, Call-to-Action | Creative, Offer, Call-to-Action

> Testing as standard practice

[Chart: % uplift in sales over time, Test Market vs. Control Market (no ATL)]

> Deciding what to Test




Don’t reinvent the wheel


> What are the solution(s)?



What are your visitors trying to achieve by visiting your site?


> Consumer Empathy


1. Make it visible – People can’t convert if they can’t find your ‘Buy Now’ button

2. Make it relevant – Need to resolve consumer reservations/questions

3. Make it easy – Easy navigation, easy form completion, easy to read, quick page load

> Start with the basics…

1. The headline
– Have a headline!
– Headline should be concrete
– Headline should be the first thing visitors look at

2. Call to action
– Don’t have too many calls to action
– Have an actionable call to action
– Have a big, prominent, visible call to action

3. Social proof
– Logos, number of users, testimonials, case studies, media coverage, etc.




> Case Study


> Further Examples


TEST A EXISTING

> Further Examples


EXISTING

TEST

> Deciding What to Test

Test Selection Checklist

§ Is the measurement infrastructure in place already?
§ Can I readily execute the solution?
§ Do I have enough sample to draw valid conclusions?
§ Will this prove the value of testing in the business?

> Do you have the reporting?


Test Channel

ATL DM eDM Online

Response Channel

Online

Mailroom

Call Centre

Bricks & Mortar

Channels in Aggregate ✔

✔

✔ ✔

For each of Segment X, Y and Z...

> Offline conversions from online


[Diagram labels: Online Ad Campaign, Cookie, Website.com Research, Phone Orders, Retail Orders, Online Orders, Online Order Confirmation, Virtual Order Confirmation]

Tying offline conversions back to online campaign and research behaviour using standard cookie technology, by triggering virtual online order confirmation pages for offline sales using email receipts.

> Search call to action for offline


> OTP Response


– Different numbers for different media channels
– Different numbers for different product categories
– Different numbers for different conversion steps
– Call origin becoming useful to shape call script
– Feasible to pause numbers to improve integrity

> Whose help do you need?


Technology/IT

Analytics!

Creative Agency

UX Agency Your boss, Your boss’ boss

Customer Contact Management

> Proving the Value


GO BIG

> How much sample do I need?


Inputs:
– # of Segments, # of Treatments
– BAU/Baseline Conversion Rate
– Time in Market [Digital Only]
– Expected Δ in Conversion

Output: n (the sample size required)

> Statistical Significance

Q. How much am I willing to accept that the difference in the results between my test group and control group may have been due to chance?

A. Not much. I want to be confident that if I repeated the test 100 times, then I would observe this difference 95 times.

This is ‘95% confidence’

> Type I and Type II Error


Type I: Accept result to be true when it’s actually false (false positives)

Type II: Accept result to be false when it’s actually true (false negatives)

> Estimating Sample Size (%s)

$$
n = (1.645 + 1.282)^2 \times \frac{p_1(1 - p_1) + p_2(1 - p_2)}{\Delta^2}
$$

Where:
n = estimated sample size for each group
p1 = expected conversion rate for your test treatment
p2 = expected conversion rate for your control treatment
Δ = expected minimum percentage point difference between test and control results

The value of 1.645 reflects that we accept a Type I error probability of .05
The value of 1.282 reflects that we accept a Type II error probability of .10

> Estimating Sample Size (%s)

Typical Champion (control) vs. Challenger (test) A|B test, typical Champion response rate of 2.5%.

• Only going to replace Champion with Challenger if Challenger response rate is 3.0% (0.5% points is a meaningful difference)

$$
n = (1.645 + 1.282)^2 \times \frac{0.025 \times 0.975 + 0.030 \times 0.970}{0.005^2} \approx 18{,}326
$$

Sample size = 18,326 for each of the Champion and Challenger groups.
If 1.0% points is our meaningful difference (a Challenger response rate of 3.5%), then the sample size is only about 4,982.
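The worked example above can be reproduced with a short Python sketch. The function name and argument names are illustrative assumptions; the z-values 1.645 and 1.282 are the ones quoted on the slide.

```python
import math

Z_ALPHA = 1.645  # accepted Type I error probability of .05
Z_BETA = 1.282   # accepted Type II error probability of .10


def sample_size_proportions(p_test, p_control, z_alpha=Z_ALPHA, z_beta=Z_BETA):
    """Estimated sample size per group for comparing two conversion rates."""
    delta = p_test - p_control  # minimum meaningful difference
    variance = p_test * (1 - p_test) + p_control * (1 - p_control)
    return (z_alpha + z_beta) ** 2 * variance / delta ** 2


# Champion at 2.5%, Challenger must reach 3.0% to be worth rolling out.
n_half_point = sample_size_proportions(0.030, 0.025)
print(math.ceil(n_half_point))  # 18326

# Same Champion, but only a 1.0% point lift (3.5%) counts as meaningful.
n_one_point = sample_size_proportions(0.035, 0.025)
print(math.ceil(n_one_point))
```

Because Δ is squared in the denominator, doubling the meaningful difference cuts the required sample to roughly a quarter.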

> Estimating Sample Size ($s)

$$
n = \frac{(1.645 + 1.282)^2 \times (s_1^2 + s_2^2)}{\Delta^2}
$$

Where:
n = number of observations for each group
s1 = expected standard deviation of value for your test treatment
s2 = expected standard deviation of value for your control treatment
Δ = expected minimum difference in value between test and control results

The value of 1.645 reflects an accepted Type I error probability of .05
The value of 1.282 reflects an accepted Type II error probability of .10

> Standard Deviation

$$
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
$$

Where:
n = number of observations
xi = the result for the ith observation
x̄ = mean (average) for your data

Standard deviation is a measure of the variability of your results: whether some of your results are quite different to your mean (average) result or whether they are quite similar.
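As a minimal sketch, the sample standard deviation (the n − 1 version defined on this slide) can be computed by hand and checked against Python's standard library. The order values below are hypothetical and are not taken from the deck.

```python
import math
import statistics


def sample_std_dev(values):
    """Sample standard deviation with the n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


# Hypothetical order values in dollars, for illustration only.
order_values = [15, 20, 25, 20, 20]
by_hand = sample_std_dev(order_values)
from_library = statistics.stdev(order_values)
print(round(by_hand, 2), round(from_library, 2))
```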

> Estimating Sample Size ($s)

Typical Champion (control) vs. Challenger (test) A|B test, typical Champion mean response value of $20, typical response rate of 5%.

• Only going to replace Champion with Challenger if Challenger mean response value is $30 ($10 is a meaningful difference)

• Standard deviation of Champion results is $5 (based on past results). We’ll assume the same for the Challenger.

$$
n = \frac{(1.645 + 1.282)^2 \times (5^2 + 5^2)}{10^2} \approx 4.3
$$

Number of observations = 4.3 (~5) for each of the Champion and Challenger groups. Then divide through by the expected response rate to get a minimum sample size of 86 for each of the Challenger and Control groups (4.3/0.05).
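The two-step calculation above (observations needed, then the contact volume needed to generate them) can be sketched in Python as follows. Function names are illustrative assumptions.

```python
import math


def observations_for_means(sd_test, sd_control, delta, z_alpha=1.645, z_beta=1.282):
    """Responders needed per group to detect a difference in mean value."""
    return (z_alpha + z_beta) ** 2 * (sd_test ** 2 + sd_control ** 2) / delta ** 2


def contacts_needed(observations, response_rate):
    """Scale responders up to the number of people who must be contacted."""
    return math.ceil(observations / response_rate)


observations = observations_for_means(sd_test=5, sd_control=5, delta=10)
print(round(observations, 1))                # 4.3
print(contacts_needed(observations, 0.05))   # 86
```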

> Further Complexity I

If we wanted to test the performance of Challenger vs. Champion for different segments of consumers:

Response Rate | Champion | Challenger
Segment A | % | %
Segment B | % | %
Segment C | % | %

Using the same assumptions as in the earlier example, we need 18,326 per cell: 18,326 × 6 = 109,956 in total.

> Further Complexity II

If we wanted to test the performance of Challenger vs. Champion for different segments of consumers AND had 3 different types of Challenger creative:

Response Rate | Champion | Challenger #1 | Challenger #2 | Challenger #3
Segment A | % | % | % | %
Segment B | % | % | % | %
Segment C | % | % | % | %

Using the same assumptions as in the earlier example, we need 18,326 per cell: 18,326 × 12 = 219,912 in total.

> Further Complexity III

If we wanted to test the performance of Challenger creative that was specifically customised for different segments of consumers, then we’re actually only running 6 test cells:

Response Rate | Champion | Challenger #1 | Challenger #2 | Challenger #3
Segment A | % | % | - | -
Segment B | % | - | % | -
Segment C | % | - | - | %

Using the same assumptions as in the earlier example, we need 18,326 per cell: 18,326 × 6 = 109,956 in total.

> Multivariate Testing

Multivariate Testing (commonly called MVT) is a term used for testing different variations of typical elements of a landing page, direct mail letter, etc. The aim is to determine which combination delivers the best result.

Element #1: Prominent headline
Element #2: Call to action
Supporting content
Element #3: Social proof / trust
Terms and conditions

§ Element #1 – 2 variations (1 existing, 1 new)
§ Element #2 – 2 variations (1 existing, 1 new)
§ Element #3 – 2 variations (1 existing, 1 new)

> MVT – Full Factorial

A full factorial design requires every unique combination of page elements and can therefore be very sample hungry.

Treatment | Headline | Call to Action | Social Proof
1 | H1 | CTA1 | SP1
2 | H1 | CTA1 | SP2
3 | H1 | CTA2 | SP1
4 | H1 | CTA2 | SP2
5 | H2 | CTA1 | SP1
6 | H2 | CTA1 | SP2
7 | H2 | CTA2 | SP1
8 | H2 | CTA2 | SP2

To calculate the number of treatments, multiply the number of variations for each element together: 2 × 2 × 2 = 8
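A full factorial treatment list can be generated mechanically rather than typed by hand. This sketch uses the element labels from the table; the dictionary layout is an illustrative assumption.

```python
from itertools import product

# Variations per element, using the labels from the slide.
elements = {
    "Headline": ["H1", "H2"],
    "Call to Action": ["CTA1", "CTA2"],
    "Social Proof": ["SP1", "SP2"],
}

# Every unique combination of one variation per element.
treatments = list(product(*elements.values()))

for number, combination in enumerate(treatments, start=1):
    print(number, *combination)

print("treatments:", len(treatments))  # 2 x 2 x 2 = 8
```

Adding a third variation to any one element would raise the count to 12, which is why full factorial designs become sample hungry so quickly.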

> MVT – Fractional Factorial

The alternative is a fractional factorial design, which tests a smaller set of element combinations. The design should be ‘balanced’: every variation is tested the same number of times and each combination of variations occurs the same number of times.

Treatment | Headline | Call to Action | Social Proof
2 | H1 | CTA1 | SP2
3 | H1 | CTA2 | SP1
5 | H2 | CTA1 | SP1
8 | H2 | CTA2 | SP2

(Treatments 1, 4, 6 and 7 are not run.)

Reduced sample requirements: 4 × 18,326 = 73,304
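One way to sanity-check that a fraction is balanced is to count how often each variation, and each pair of variations across two elements, appears. This is a rough sketch of that check, not a design generator; `is_balanced` and its arguments are hypothetical names.

```python
from collections import Counter
from itertools import combinations

levels = [["H1", "H2"], ["CTA1", "CTA2"], ["SP1", "SP2"]]

# The four treatments kept in the fractional design (2, 3, 5 and 8).
fraction = [
    ("H1", "CTA1", "SP2"),
    ("H1", "CTA2", "SP1"),
    ("H2", "CTA1", "SP1"),
    ("H2", "CTA2", "SP2"),
]


def is_balanced(design, levels):
    """True if every variation and every cross-element pair occurs equally often."""
    for i, options in enumerate(levels):
        counts = Counter(row[i] for row in design)
        if len({counts[option] for option in options}) != 1:
            return False
    for i, j in combinations(range(len(levels)), 2):
        counts = Counter((row[i], row[j]) for row in design)
        if len({counts[(a, b)] for a in levels[i] for b in levels[j]}) != 1:
            return False
    return True


print(is_balanced(fraction, levels))

# Counter-example: keeping only the H1 treatments never tests H2.
only_h1 = [
    ("H1", "CTA1", "SP1"),
    ("H1", "CTA1", "SP2"),
    ("H1", "CTA2", "SP1"),
    ("H1", "CTA2", "SP2"),
]
print(is_balanced(only_h1, levels))
```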

> Layout Before Content

§ Phase #1: A|B test
– Test the same landing page content in completely different layouts

§ Phase #2: MV test
– Then test different content element combinations within the winning layout

§ Phase #3: MV test (if req’d)
– Test with reduced set of elements

Element #1: Prominent headline
Element #2: Call to action
Supporting content
Element #3: Social proof / trust
Terms and conditions

> Case Study


§ Yes, the measurement infrastructure is in place
§ I can readily execute the test design
§ I have enough sample to draw valid conclusions
§ Yes, this design will prove the value of testing in my business

> Execution & Measurement




Before you leap…

> Sample Selection

§ Each sample needs to be alike in terms of its predisposition to conversion

Conversion: low rate credit card application form completion

TEST | CONTROL
18-34 | 35-64
Mostly Male | Mostly Female
Mostly Low Income | Mostly High Income

> Timing is Important


[Chart: Sales over time, marking the ‘Burst’ Non-BAU ATL Campaign and the Ideal Test Window]

> A|A Testing

§ Set up a test that splits your visitors 50/50 between the same treatment
– Check that sample sizes are actually 50/50
– There should be no difference in your conversion rates
– Are volumes of conversions matching other reporting?
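The first two A|A checks can be sketched as below. The pooled two-proportion z-score is a standard technique brought in here for illustration and is not described in the deck; the visitor and conversion counts are hypothetical.

```python
import math


def aa_check(visitors_a, conversions_a, visitors_b, conversions_b):
    """Return (share of traffic sent to A, z-score of the conversion-rate gap)."""
    share_a = visitors_a / (visitors_a + visitors_b)
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate: both arms show the same treatment, so assume one true rate.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return share_a, (rate_a - rate_b) / std_err


# Hypothetical A|A results, for illustration only.
share_a, z = aa_check(5020, 251, 4980, 249)
print(f"share in A: {share_a:.1%}, z-score: {z:.2f}")
```

A share far from 50% or a z-score well beyond ±1.96 would suggest the split or the tracking needs attention before a real test is run.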

> Measuring your performance


§ Proportions (conversion rates)
§ Means (average $s)
§ Variability of Means (standard deviation)

§ Use confidence intervals

Would my winning treatment still be the winner across all my customers/visitors/consumers?

> Confidence Intervals


[Charts: Conversion Rate by Treatment (A, B, C) and Revenue per Response by Treatment (A, B, C)]


> Confidence Interval (%s)

$$
\hat{p} \pm 1.96 \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}
$$

Where:
p̂ = response rate
n = sample size for treatment

The value of 1.96 reflects a 95% confidence level

> Confidence Interval Estimation

Typical Champion (control) vs. Challenger (test) A|B Test

Treatment | Champion | Challenger
Mailed | 60,850 | 52,812
Responses | 1,055 | 455
Response Rate (%) | 1.7 | 0.9

$$
1.7\% \pm 1.96 \times \sqrt{\frac{0.017(1 - 0.017)}{60850}} = 1.7\% \pm 0.10\%
$$

$$
0.9\% \pm 1.96 \times \sqrt{\frac{0.009(1 - 0.009)}{52812}} = 0.9\% \pm 0.08\%
$$

1.60% ≤ Champion ≤ 1.80%
0.82% ≤ Challenger ≤ 0.98%
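A minimal Python sketch of the interval calculation, using the rounded response rates from the table (1.7% and 0.9%). The function name is an illustrative assumption.

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """95% confidence interval for a response rate (z = 1.96)."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


champion_low, champion_high = proportion_ci(0.017, 60850)
challenger_low, challenger_high = proportion_ci(0.009, 52812)

print(f"Champion:   {champion_low:.2%} to {champion_high:.2%}")
print(f"Challenger: {challenger_low:.2%} to {challenger_high:.2%}")
```

The two intervals do not overlap, which is a quick visual signal that the Champion's higher response rate is unlikely to be due to chance.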



$$
p_1 - p_2 \pm 1.96 \times \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}
$$

Where:
p1 = response rate for challenger
p2 = response rate for champion
n1 = sample size for challenger
n2 = sample size for champion

Treatment | Champion | Challenger
Mailed | 60,850 | 52,812
Responses | 1,055 | 455

$$
0.9\% - 1.7\% \pm 1.96 \times \sqrt{\frac{0.009(1 - 0.009)}{52812} + \frac{0.017(1 - 0.017)}{60850}} = -0.8\% \pm 0.13\%
$$

-0.93% ≤ Difference Between Challenger and Champion ≤ -0.67%
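The interval for the difference between Challenger and Champion can be sketched the same way. The function name is an illustrative assumption; the inputs are the rounded rates and mailed volumes from the table.

```python
import math


def difference_ci(p1, n1, p2, n2, z=1.96):
    """95% confidence interval for the difference p1 - p2 between two response rates."""
    difference = p1 - p2
    margin = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return difference - margin, difference + margin


# p1, n1 = Challenger; p2, n2 = Champion.
diff_low, diff_high = difference_ci(0.009, 52812, 0.017, 60850)
print(f"{diff_low:.2%} to {diff_high:.2%}")
```

Since the whole interval sits below zero, the Challenger is performing worse than the Champion at the 95% confidence level.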

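> Control Group Sample Size

$$
p_1 - p_2 \pm 1.96 \times \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}
$$

Setting the ± term equal to the desired precision m, and labelling the two groups c (control) and t (test), gives:

Rearranged:

$$
n_c = \frac{p_c(1 - p_c)}{\left(\frac{m}{1.96}\right)^2 - \frac{p_t(1 - p_t)}{n_t}}
$$

Where:
nc = sample size for control group
nt = sample size for test group
pc = forecast response rate for control group
pt = forecast response rate for test group
m = desired level of precision (% that is a meaningful difference)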

> Control Group Sample Size

We have 50,000 customers that we could include in our test design. What would our control sample need to be if we tested 40,000 customers, our ‘natural’ cross-sell rate was 1.0% and an incremental response rate of 1.0% points would be deemed to be meaningful?

$$
n_c = \frac{0.01(1 - 0.01)}{\left(\frac{0.01}{1.96}\right)^2 - \frac{0.02(1 - 0.02)}{40{,}000}} \approx 387
$$

nc = 387

This result suggests we could actually test more of our available customer base than we might have initially expected (~49,600).
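A small Python sketch of the control group calculation follows. The function name and the guard against a non-positive denominator are illustrative assumptions; the inputs are the figures from the example.

```python
def control_group_size(p_control, p_test, n_test, precision, z=1.96):
    """Control group size needed to detect `precision` given a fixed test group."""
    denominator = (precision / z) ** 2 - p_test * (1 - p_test) / n_test
    if denominator <= 0:
        # The test group alone is too small to reach this precision.
        raise ValueError("test group too small for the requested precision")
    return p_control * (1 - p_control) / denominator


n_control = control_group_size(
    p_control=0.01,   # 'natural' cross-sell rate
    p_test=0.02,      # natural rate plus the 1.0% point uplift
    n_test=40_000,
    precision=0.01,
)
print(round(n_control, 1))
```

The result lands between 387 and 388, so a control group of a few hundred customers is enough under these assumptions.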

> Confidence intervals ($s)

$$
\bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}
$$

Where:
x̄ = mean revenue among treatment responders
s = standard deviation of revenue among the treatment’s responders
n = number of responders to the treatment

The value of 1.96 reflects a 95% level of confidence.

> Standard Deviation (reminder)

$$
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
$$

Where:
n = number of observations
xi = the result for the ith observation
x̄ = mean (average) for your data

Standard deviation is a measure of the variability of your results: whether some of your results are quite different to your mean (average) result or whether they are quite similar.



$$
\bar{x}_1 - \bar{x}_2 \pm 1.96 \times \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
$$

Where:
x̄1 = mean value among responders to a treatment
x̄2 = mean value among responders to a different treatment
s1 = std. dev. of value among one treatment’s responders
s2 = std. dev. of value among the other treatment’s responders
n1 = number of responders to the treatment
n2 = number of responders to the other treatment

The value of 1.96 reflects a 95% level of confidence.

This assumes n1 and n2 are sufficiently large to estimate the std. dev. in the population with the std. dev. of the sample.

Treatment | Champion | Challenger
Mailed | 60,850 | 52,812
Responses | 1,055 | 455
Total Value | $36,925 | $38,675
Mean Value | $35 | $85
Std Dev | $30 | $50

$$
85 - 35 \pm 1.96 \times \sqrt{\frac{50^2}{455} + \frac{30^2}{1055}} = 50 \pm 4.9
$$

At a minimum, we should expect an incremental $45.1 per response if we rolled out the Challenger creative as BAU (although our total amount of incremental revenue would be less).
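The difference-in-means interval can be checked with a short Python sketch. The function name is an illustrative assumption; the means, standard deviations and responder counts come from the table.

```python
import math


def mean_difference_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """95% confidence interval for the difference between two mean values."""
    difference = mean1 - mean2
    margin = z * math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return difference - margin, difference + margin


# Treatment 1 = Challenger, treatment 2 = Champion.
value_low, value_high = mean_difference_ci(85, 50, 455, 35, 30, 1055)
print(f"${value_low:.1f} to ${value_high:.1f}")
```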

> Case Study




> Main Effects


Typical Landing Page Test

Treatment | Headline | Call to Action | Social Proof | Visitors Tested | Conversions | Conversion Rate
1 | H1 | CTA1 | SP1 | 1237 | 456 | 37%
2 | H1 | CTA1 | SP2 | 1456 | 345 | 24%
3 | H1 | CTA2 | SP1 | 1245 | 234 | 19%
4 | H1 | CTA2 | SP2 | 2123 | 432 | 20%
5 | H2 | CTA1 | SP1 | 1342 | 234 | 17%
6 | H2 | CTA1 | SP2 | 1102 | 123 | 11%
7 | H2 | CTA2 | SP1 | 1365 | 700 | 51%
8 | H2 | CTA2 | SP2 | 1243 | 643 | 52%

Treatments #7 and #8 were the clear winners, and it looks as if the Headline and Call-to-Action were much bigger drivers of positive performance than the Social Proof. Let’s check this!

> Main Effects



Treatment | Headline | Call to Action | Social Proof | Visitors Tested | Conversion Rate
1 | H1 | CTA1 | SP1 | 1237 | 37%
2 | H1 | CTA1 | SP2 | 1456 | 24%
3 | H1 | CTA2 | SP1 | 1245 | 19%
4 | H1 | CTA2 | SP2 | 2123 | 20%
5 | H2 | CTA1 | SP1 | 1342 | 17%
6 | H2 | CTA1 | SP2 | 1102 | 11%
7 | H2 | CTA2 | SP1 | 1365 | 51%
8 | H2 | CTA2 | SP2 | 1243 | 52%

Avg H1 = 24% (treatments 1 to 4)
Avg H2 = 33% (treatments 5 to 8)

The Main Effect of the Headline is simply the (weighted) average conversion rate for Headline 2 less the (weighted) average conversion rate for Headline 1 (33% - 24% = 9%).

> Main Effects



Element | Main Effect
Headline | 9.4%
Call to Action | 11.1%
Social Proof | 5.3%

In actual fact, it was variations in Call to Action that had the most positive impact on our results, improving conversions by 11.1% points.
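The main effects can be recomputed from the treatment-level counts with a short Python sketch. Weighting is done by pooling visitors and conversions across the treatments that share a variation; the tuple layout and function names are illustrative assumptions.

```python
# (headline, call to action, social proof, visitors tested, conversions)
results = [
    ("H1", "CTA1", "SP1", 1237, 456),
    ("H1", "CTA1", "SP2", 1456, 345),
    ("H1", "CTA2", "SP1", 1245, 234),
    ("H1", "CTA2", "SP2", 2123, 432),
    ("H2", "CTA1", "SP1", 1342, 234),
    ("H2", "CTA1", "SP2", 1102, 123),
    ("H2", "CTA2", "SP1", 1365, 700),
    ("H2", "CTA2", "SP2", 1243, 643),
]


def weighted_rate(rows):
    """Visitor-weighted conversion rate: total conversions over total visitors."""
    return sum(row[4] for row in rows) / sum(row[3] for row in rows)


def main_effect(rows, position, new, existing):
    """Weighted rate for the new variation less the rate for the existing one."""
    new_rate = weighted_rate([row for row in rows if row[position] == new])
    existing_rate = weighted_rate([row for row in rows if row[position] == existing])
    return new_rate - existing_rate


headline_effect = main_effect(results, 0, "H2", "H1")
cta_effect = main_effect(results, 1, "CTA2", "CTA1")
social_proof_effect = main_effect(results, 2, "SP2", "SP1")

for name, effect in [("Headline", headline_effect),
                     ("Call to Action", cta_effect),
                     ("Social Proof", social_proof_effect)]:
    print(f"{name}: {effect:+.1%}")
```

Computed this way (variation 2 less variation 1), the Social Proof effect comes out negative: SP2 converted about 5.3% points lower than SP1, so the 5.3% in the table is the size of the effect rather than an improvement.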

> Interaction Effects

Treatment | Headline | Call to Action | Social Proof | Visitors Tested | Conversion Rate
1 | H1 | CTA1 | SP1 | 1237 | 37%
2 | H1 | CTA1 | SP2 | 1456 | 24%
7 | H2 | CTA2 | SP1 | 1365 | 51%
8 | H2 | CTA2 | SP2 | 1243 | 52%
3 | H1 | CTA2 | SP1 | 1245 | 19%
4 | H1 | CTA2 | SP2 | 2123 | 20%
5 | H2 | CTA1 | SP1 | 1342 | 17%
6 | H2 | CTA1 | SP2 | 1102 | 11%

An interaction effect is present where the performance of one element depends on which variation of another element is present. In this example, we are looking at whether the results for each of the Headlines depend on which Call-to-Action is shown.




Treatment | Headline | Call to Action | Social Proof | Visitors Tested | Conversion Rate
1 | H1 | CTA1 | SP1 | 1237 | 37%
2 | H1 | CTA1 | SP2 | 1456 | 24%
3 | H1 | CTA2 | SP1 | 1245 | 19%
4 | H1 | CTA2 | SP2 | 2123 | 20%
5 | H2 | CTA1 | SP1 | 1342 | 17%
6 | H2 | CTA1 | SP2 | 1102 | 11%
7 | H2 | CTA2 | SP1 | 1365 | 51%
8 | H2 | CTA2 | SP2 | 1243 | 52%

The first step is to create weighted average response rates for each combination of the two factors (ignoring Social Proof).

Wtd Avg H1CTA1 = 30% (treatments 1 and 2)
Wtd Avg H1CTA2 = 20% (treatments 3 and 4)
Wtd Avg H2CTA1 = 14% (treatments 5 and 6)
Wtd Avg H2CTA2 = 51% (treatments 7 and 8)




Headline vs. Call to Action

Headline | CTA1 | CTA2 | Diff
H1 | 30% | 20% | -10%
H2 | 14% | 51% | 37%
Diff | -16% | 31% |

The next step is to calculate the difference in performance of one factor across different variants of the other factor. If the difference of this difference is non-zero (or not very close to zero), then you have an interaction effect. For example, there is an interaction effect between the Headline and Call to Action as the difference in the difference in performance is non-zero (31% - (-16%) = 47%). This is a very large interaction when compared to the Main Effects!

[Chart: conversion rate (0% to 60%) for H1 and H2, with one line each for CTA1 and CTA2]
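The difference-of-differences for Headline and Call to Action can be sketched in Python from the treatment-level counts (visitors and conversions from the landing page test). Names and data layout are illustrative assumptions; because the code works from unrounded rates, its result differs slightly from the 47% obtained from the rounded table.

```python
# (headline, call to action, visitors tested, conversions), Social Proof ignored
cells = [
    ("H1", "CTA1", 1237, 456), ("H1", "CTA1", 1456, 345),
    ("H1", "CTA2", 1245, 234), ("H1", "CTA2", 2123, 432),
    ("H2", "CTA1", 1342, 234), ("H2", "CTA1", 1102, 123),
    ("H2", "CTA2", 1365, 700), ("H2", "CTA2", 1243, 643),
]


def cell_rate(headline, cta):
    """Visitor-weighted conversion rate for one Headline/CTA combination."""
    rows = [c for c in cells if c[0] == headline and c[1] == cta]
    return sum(r[3] for r in rows) / sum(r[2] for r in rows)


# Effect of switching H1 -> H2 under each call to action.
headline_lift_cta1 = cell_rate("H2", "CTA1") - cell_rate("H1", "CTA1")
headline_lift_cta2 = cell_rate("H2", "CTA2") - cell_rate("H1", "CTA2")

# A non-zero difference between the two lifts indicates an interaction.
interaction = headline_lift_cta2 - headline_lift_cta1
print(f"H2 vs H1 with CTA1: {headline_lift_cta1:+.1%}")
print(f"H2 vs H1 with CTA2: {headline_lift_cta2:+.1%}")
print(f"interaction: {interaction:+.1%}")
```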




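Headline vs. Social Proof

Headline | SP1 | SP2 | Diff
H1 | 28% | 22% | -6%
H2 | 34% | 33% | -1%
Diff | 6% | 11% |

[Chart: conversion rate (0% to 40%) for H1 and H2, with one line each for SP1 and SP2]

Call to Action vs. Social Proof

Call to Action | SP1 | SP2 | Diff
CTA1 | 27% | 18% | -9%
CTA2 | 36% | 32% | -4%
Diff | 9% | 14% |

[Chart: conversion rate (0% to 40%) for CTA1 and CTA2, with one line each for SP1 and SP2]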

> Reporting




Document Everything!

> 1. Describe the test

§ Describe the outcome(s) you’re trying to influence
§ Describe your target audience
§ Describe the different treatments, including copies of creative

> 2. Justify the test design

§ Detail why you’ve chosen the particular outcome you’re trying to influence
§ Detail why you’ve chosen the consumers you are trying to influence
§ Detail why your intervention should work
– Past test results/Usability test/Case studies
– Marketer’s intuition/logic

> 3. Results & Conclusions

§ Detail all the performance results
§ Discuss your hypotheses
§ Future tests
§ ‘Meta’ reporting of your test program


> Case Study


> List of (Some) Resources

§ http://visualwebsiteoptimizer.com/case-studies.php
§ http://www.whichtestwon.com/
§ http://www.feng-gui.com
§ http://www.smashingmagazine.com/2010/06/24/the-ultimate-guide-to-a-b-testing



Contact us [email protected]

Learn more

blog.datalicious.com

Follow us
twitter.com/datalicious

Data > Insights > Action
