Assurance Scoring: Using Machine Learning and Analytics to Reduce Risk in the Public Sector Matt Thomson 17/11/2016

Upload: south-west-data-meetup

Post on 21-Apr-2017


TRANSCRIPT

Page 1: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector

Assurance Scoring: Using Machine Learning and Analytics to Reduce Risk in the Public Sector

Matt Thomson, 17/11/2016

Page 2: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector

Copyright © Capgemini 2012. All Rights Reserved


Outline

• Introduction
• Traditional Fraud Detection
• Assurance Scoring
• Machine Learning
• Business Rules
• Anomaly Detection
• Graph Links

Page 3: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Who am I?

Matt Thomson
• Senior Data Scientist at Capgemini
• PhD in Astrophysics (http://arxiv.org/abs/1010.3315)
• Several years' experience in fraud detection

Capgemini
• Big Data Analytics team
• ~100 Data Scientists, Big Data Engineers and Data Analysts
• Focus on Open Source and Big Data technologies to solve client problems
• Sponsoring the meetup today!

Page 4: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Introduction to the Problem

Public sector constantly working in an environment of reduced resources

Want to provide a better service but with greater efficiency

Therefore very important that limited resources are focussed correctly

Assurance Scoring: use ML and other analytical methods to identify the least risky people or applications, so that investigators' resources can be targeted at the most risky

Page 5: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Hypothetical Example – 2016 Olympics tickets

Imagine running the application process for selling tickets to the 2016 Olympics

• Avoid selling tickets to touts/resellers
• Vast majority of people applying for tickets are genuine
• Fraud detection with a big class imbalance problem (<0.1%)
• Avoid the approach of investigating each person applying

Let's say we know from the 2012 Olympics which people ended up reselling their tickets: this is the training data

Use ML to identify the 30% (say) least likely to be touts – fast tracked

Investigators focus on the high risk
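
A minimal sketch of the fast-tracking step, assuming every application already has a model risk score where higher means more likely to be a tout. The function name `fast_track_split`, the application ids and the scores are made up for illustration; only the 30% figure comes from the slide.

```python
def fast_track_split(scores, fast_track_fraction=0.30):
    """Split application ids into (fast_tracked, to_investigate).

    scores: dict mapping application id -> risk score (higher = riskier).
    """
    ranked = sorted(scores, key=scores.get)  # least risky first
    n_fast = int(len(ranked) * fast_track_fraction)
    return ranked[:n_fast], ranked[n_fast:]


# Made-up risk scores for ten hypothetical applications.
example_scores = {
    "app-01": 0.02, "app-02": 0.40, "app-03": 0.01, "app-04": 0.75,
    "app-05": 0.10, "app-06": 0.05, "app-07": 0.33, "app-08": 0.90,
    "app-09": 0.20, "app-10": 0.15,
}
fast_tracked, to_investigate = fast_track_split(example_scores)
print(fast_tracked)  # the three lowest-scoring applications
```

The remaining applications are not labelled as fraudulent; they are simply the ones left for investigators to prioritise.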

Page 6: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Traditional Fraud Detection

Identify Historical Training Data

Feature Engineering

Model Training and Evaluation

Model Execution

Feedback

Page 7: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Assurance Scoring

Focus on low-risk

Allows resources to be better focussed

Not limited to Machine Learning

Built using Python! Pandas, Scikit-learn, etc.

Scala version using Spark MLlib

Page 8: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Assurance Scoring

Page 9: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


POLE ‘Analytical’ Data Layer

Disparate data sources - Atomic Layer

Atomic data is Transformed and Loaded into POLE

POLE Layer

Person, Object, Location, Event

Page 10: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


POLE ‘Analytical’ Data Layer

POLE contains ALL entities from the Atomic Layer, plus their inter-linkages
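
One way to picture a POLE layer is as a set of typed entities plus the links between them. The sketch below is an assumption about shape only, not the deck's actual schema: `Entity`, `PoleStore` and all the example keys are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    kind: str  # "person", "object", "location" or "event"
    key: str   # identifier within that kind


@dataclass
class PoleStore:
    entities: set = field(default_factory=set)
    links: set = field(default_factory=set)  # each link is an unordered pair

    def add_link(self, a, b):
        """Register both entities and the link between them."""
        self.entities.update((a, b))
        self.links.add(frozenset((a, b)))

    def neighbours(self, entity):
        """Return every entity directly linked to the given one."""
        return {other for link in self.links if entity in link
                for other in link if other != entity}


# Hypothetical entities for one ticket application.
applicant = Entity("person", "applicant-1")
card = Entity("object", "payment-card-1")
address = Entity("location", "address-1")
application = Entity("event", "ticket-application-1")

pole = PoleStore()
pole.add_link(applicant, application)
pole.add_link(application, card)
pole.add_link(applicant, address)
```

Keeping the links alongside the entities is what lets later steps ask questions such as "which other applications share this payment card?".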

Page 11: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Assurance Scoring

Page 12: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Machine learning

Diagram labels (machine learning framework):

• Input Data, Vector Build
• Manipulate, Explore Data
• Feature extraction and selection: Transform, Selection
• Model Building: Model, with Training, Validation and Test sets
• Variety of output files: logs, graphics, pickle models, etc.
• Testing: unit tests, monitoring tests and integration tests
• Framework: structure, flexibility, consistency

Page 13: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Machine Learning: Feature Engineering

SQL, Python

Transform

Explore

Select

Ask questions, validate

Refine features

• Feature Extraction

• Data exploration

• Feature selection

Historical Data
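
As an illustration of the feature extraction step, the sketch below turns one raw application record into a flat feature dictionary. The record layout and every feature name here are assumptions invented for the Olympics example; the deck does not list its actual features.

```python
from datetime import datetime


def extract_features(application):
    """Build a flat feature dict from one raw (hypothetical) application."""
    submitted = datetime.fromisoformat(application["submitted_at"])
    tickets = application["tickets"]
    return {
        "n_events": len({t["event"] for t in tickets}),
        "n_tickets": sum(t["quantity"] for t in tickets),
        "total_value_gbp": sum(t["price_gbp"] * t["quantity"] for t in tickets),
        "submitted_hour": submitted.hour,
    }


# Made-up application record.
raw_application = {
    "submitted_at": "2016-03-01T23:15:00",
    "tickets": [
        {"event": "athletics-final", "price_gbp": 150, "quantity": 4},
        {"event": "swimming-heats", "price_gbp": 40, "quantity": 2},
    ],
}
features = extract_features(raw_application)
```

In the explore and select steps, features like these would be checked against the historical labels and refined or dropped.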

Page 14: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Machine Learning: Model Building

Training

Validation

Test

Split Datasets

Build Models

Hyper-parameter tuning

Selected features

Models

Training results

Validation results

Test results

Compare Models
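
A minimal sketch of the "Split Datasets" step using only the standard library. The 60/20/20 proportions, the fixed seed and the name `split_dataset` are assumptions for illustration; the deck does not state how its splits were made.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows reproducibly and cut them into three disjoint sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],                 # fit the models
        shuffled[n_train:n_train + n_val],  # tune hyper-parameters, compare models
        shuffled[n_train + n_val:],         # final check, touched once
    )


train_rows, validation_rows, test_rows = split_dataset(range(100))
```

With a class imbalance as strong as in the Olympics example, a stratified split (keeping the fraud rate similar in each set) would probably be a better choice than this plain shuffle.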

Page 15: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Low risk? High risk? Depends on classifier’s threshold

• True positives: applications the model correctly classifies as high risk

• True negatives: applications the model correctly classifies as low risk

• False positives: applications the model scores as high risk but are in fact low risk

• False negatives: applications the model scores as low risk but are in fact high risk
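
The four outcomes above can be counted for any threshold. This is a sketch with made-up scores and labels; `confusion_counts` is a hypothetical helper, and treating a score at or above the threshold as "high risk" is an assumption.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP/FP/TN/FN, treating score >= threshold as 'high risk'."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, is_high_risk in zip(scores, labels):
        predicted_high = score >= threshold
        if predicted_high and is_high_risk:
            counts["tp"] += 1
        elif predicted_high and not is_high_risk:
            counts["fp"] += 1
        elif not predicted_high and not is_high_risk:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up model scores and true outcomes (True = actually high risk).
scores = [0.05, 0.20, 0.45, 0.60, 0.80, 0.95]
labels = [False, False, True, False, True, True]

at_half = confusion_counts(scores, labels, threshold=0.5)
at_low = confusion_counts(scores, labels, threshold=0.3)
```

Lowering the threshold from 0.5 to 0.3 removes the one false negative in this toy data. For assurance scoring, false negatives matter most, because they are risky applications that would be fast-tracked.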

Page 16: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Assurance Scoring

Page 17: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Business Rules

Identifying fraud has often been done using deterministic rules

Look for transactions near a threshold or at the end of the day

Primarily data queries on your feature vector

Olympics example: anyone applying for more than £10,000 of tickets
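
Business rules of this kind can be written as plain predicates over the feature vector. In the sketch below the £10,000 limit comes from the slide; the second rule (flagging values just under the limit), the 5% margin, the rule names and the field `total_value_gbp` are assumptions added for illustration.

```python
LIMIT_GBP = 10_000


def exceeds_value_limit(application):
    """Rule from the slide: more than 10,000 GBP of tickets."""
    return application["total_value_gbp"] > LIMIT_GBP


def just_under_limit(application, margin=0.05):
    """Hypothetical rule: value within 5% below the limit."""
    value = application["total_value_gbp"]
    return (1 - margin) * LIMIT_GBP <= value <= LIMIT_GBP


RULES = {
    "high_value": exceeds_value_limit,
    "near_threshold": just_under_limit,
}


def triggered_rules(application, rules=RULES):
    """Return the names of all rules the application triggers."""
    return [name for name, rule in rules.items() if rule(application)]


print(triggered_rules({"total_value_gbp": 12_500}))
```

Because each rule is deterministic, an investigator can always see exactly why an application was flagged.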

Page 18: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Assurance Scoring

Page 19: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Anomaly Detection

Use the training data to create a baseline of applications by postcode (say)

If a particular postcode has a larger than expected number of applications, then those cases are pushed into the high-risk bucket
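
A minimal sketch of the postcode baseline idea, assuming the baseline and current data cover comparable periods so that raw counts can be compared directly. The function name, the `ratio` and `min_count` parameters and the example data are all made up; the deck does not say how "larger than expected" is defined.

```python
from collections import Counter


def anomalous_postcodes(baseline_postcodes, current_postcodes,
                        ratio=3.0, min_count=5):
    """Flag postcodes whose current count is far above the baseline count."""
    baseline = Counter(baseline_postcodes)
    current = Counter(current_postcodes)
    flagged = set()
    for postcode, count in current.items():
        # Treat unseen postcodes as having an expected count of 1.
        expected = max(baseline.get(postcode, 0), 1)
        if count >= min_count and count > ratio * expected:
            flagged.add(postcode)
    return flagged


# Made-up data: one postcode district per application.
baseline = ["BS1"] * 10 + ["BA2"] * 10 + ["EX4"] * 10
current = ["BS1"] * 11 + ["BA2"] * 9 + ["EX4"] * 40

flagged = anomalous_postcodes(baseline, current)
```

Applications from a flagged postcode would then be routed to the high-risk bucket regardless of their individual model score.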

Page 20: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Assurance Scoring

Page 21: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Graph Links - Matching

Key part of assurance scoring – bringing data together from disparate sources

Probability of Match: 80%

Attribute          Data Source 1    Data Source 2
Name               Matt Thomson     Matthew Thosmon
Phone Number       07123 456 789    07123 456 798
Favourite Sport    Football         Cricket
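
To make the matching idea concrete, here is a sketch that scores the two records from the table with a weighted string similarity. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the deck does not say which matching method produced the 80% figure. The weights are assumptions, and the result is a similarity score rather than a calibrated probability.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(record_a, record_b, weights):
    """Weighted average of per-attribute similarities."""
    total_weight = sum(weights.values())
    weighted = sum(weight * similarity(record_a[attr], record_b[attr])
                   for attr, weight in weights.items())
    return weighted / total_weight


source_1 = {"name": "Matt Thomson", "phone": "07123 456 789",
            "favourite_sport": "Football"}
source_2 = {"name": "Matthew Thosmon", "phone": "07123 456 798",
            "favourite_sport": "Cricket"}

# Assumed weights: name and phone identify a person far better than a hobby.
weights = {"name": 4, "phone": 4, "favourite_sport": 1}

score = match_score(source_1, source_2, weights)
print(f"match score: {score:.2f}")
```

Pairs scoring above a chosen cut-off would be linked in the graph, so that data about the same person from different sources is brought together.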

Page 22: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Assurance Scoring

Page 23: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


Further Details

Come and find [email protected] / @MattGThomson

Assurance Scoring brochure: http://ow.ly/4nbEUI

Blogs:

• Introduction: https://www.capgemini.com/node/1380596
• Integrating multiple techniques: http://bit.ly/24BmszV
• Machine Learning: http://bit.ly/1QTMGnq
• Many more on other topics

Page 24: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector


We’re Hiring!

Data Science: https://www.uk.capgemini.com/careers/jobs/data-scientist-0

Big Data Engineer: https://www.uk.capgemini.com/careers/jobs/big-data-engineer

[email protected]

Page 25: Assurance Scoring: using machine learning and analytics to reduce risk in the public sector

The information contained in this presentation is proprietary. © 2012 Capgemini. All rights reserved.

www.capgemini.com

About Capgemini

With more than 120,000 people in 40 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2011 global revenues of EUR 9.7 billion. Together with its clients, Capgemini creates and delivers business and technology solutions that fit their needs and drive the results they want. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience™, and draws on Rightshore®, its worldwide delivery model.

Rightshore® is a trademark belonging to Capgemini