FISCAL: A MACHINE LEARNING APPROACH


Post on 06-Dec-2020


TABLE OF CONTENTS

• Start of the secondment

• What does FISCAL do?

• FISCAL approach

• Our approach

• Preliminary results

• Next steps

START OF THE SECONDMENT

Preparation

• Meeting at the beginning of October with Glen, Michele, the CFO, the Head of Data and a RHUL facilitator

• Meeting at the beginning of November with Serena at FISCAL

• Secondment agreements for RHUL and EDB:

  • Intellectual Property

  • Data sensitivity

  • Publishing

First weeks

• Understanding their software

• Understanding the data

• Literature research

WHAT DOES FISCAL DO?

FISCAL Company

• Located in Reading, UK (40 miles west of London)

• ±35 employees

• 6 developers (mostly C#)

FISCAL homepage: “At FISCAL, we pride ourselves on delivering industry-leading forensic software solutions that empower procure-to-pay (P2P) teams to protect organizational spend.”

Meaning: FISCAL builds and sells software to companies. The software:

• loads all the invoices from the company's database

• tries to find duplicates to avoid unnecessary payments by the company

FISCAL APPROACH

Dataset of invoices (customer):

ID   Invoice Number   Date         Amount   Supplier Name
01   INV09218         02/09/2018   2000     Flour Inc.
02   293AKDL          02/09/2017   1999     Seeds & Nuts Shop
03   293ALKL          13/12/2018   890      NUTS SEEDS
04   09218AB          11/12/2016   2000     Fl0ur 1nc.
…    …                …            …

FISCAL APPROACH

[Figure: six single invoices (Invoice #1 to Invoice #6) are grouped into Group A (Invoice #1, Invoice #3), Group B (Invoice #2, Invoice #6) and Group C (Invoice #4, Invoice #5); the groups yield the invoice pairs Pair #1, Pair #2 and Pair #3.]

FISCAL APPROACH

Tests are designed that give a similarity score, e.g. if the invoice dates are the same then +20.

Invoice pair   Dupl. test #1   Dupl. test #2   Dupl. test #3   Total score
Pair #1        10              20              -50             -20
Pair #2        20              50              20              90
Pair #3        20              10              0               30
…              …               …               …               …

A rectangular cut is applied -> If the total score is > 0 then possible duplicate!
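The scoring-and-cut rule described on this slide can be sketched in a few lines of Python. This is only an illustration: the test scores are copied from the table above, while the function names and the `cut` parameter are assumptions made for the example, not FISCAL's actual implementation.

```python
# Test scores per invoice pair, copied from the slide's table.
PAIR_TEST_SCORES = {
    "Pair #1": [10, 20, -50],
    "Pair #2": [20, 50, 20],
    "Pair #3": [20, 10, 0],
}


def total_score(test_scores):
    """Add up the scores of all duplicate tests for one invoice pair."""
    return sum(test_scores)


def possible_duplicates(pair_scores, cut=0):
    """Return the pairs whose total score lies strictly above the cut."""
    return [
        pair
        for pair, scores in pair_scores.items()
        if total_score(scores) > cut
    ]


flagged = possible_duplicates(PAIR_TEST_SCORES)
print(flagged)
```

With the table's numbers, Pair #1 (total -20) is dropped and Pair #2 and Pair #3 are passed on as possible duplicates.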

FISCAL APPROACH

[Figure: the invoice pairs that pass the cut (Pair #2, Pair #3) are returned to the customer.]

FISCAL APPROACH

[Figure: the customer receives a long list of invoice pairs (Pair #2, Pair #3, Pair #2021, Pair #3005).]

All need to be checked by the customer!

OUR APPROACH

Project goal: Reduce the number of returned invoice duplicate candidates

How:

• Improve their invoice grouping scheme (blocking/bucketing)

• Improve similarity measurement between invoice pairs

• Introduce a new machine learning classification model

Why:

• A lot of literature available

• Clear machine learning approach => Binary classification problem (duplicate or not)

OUR APPROACH – BLOCKING/BUCKETING SCHEME

Goal of the blocking scheme:

• Group together invoices that are similar according to some function, so that only invoices within the same group need to be paired and the number of pairs to compare decreases

Suggested grouping functions:

• Based on date

• Based on paid amount

• Based on phonetic encoding of supplier names

• Sorted neighbourhood approach

• Suffix-Array approach

Still very much in the literature research phase!
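As an illustration of one of the suggested grouping functions, the sorted neighbourhood approach can be sketched as follows. This is a minimal sketch under stated assumptions: the sorting key (lower-cased supplier name), the window size and the function name are choices made for the example; the records reuse the dummy invoices from the earlier dataset slide.

```python
from itertools import combinations


def sorted_neighbourhood_pairs(records, key, window=3):
    """Sort records by a key and pair each record only with the
    records that fall inside the same sliding window."""
    ordered = sorted(records, key=key)
    pairs = set()
    for start in range(len(ordered)):
        block = ordered[start:start + window]
        for a, b in combinations(block, 2):
            pairs.add((a["id"], b["id"]))
    return pairs


# Dummy invoices from the dataset slide (ID and supplier name only).
invoices = [
    {"id": "01", "supplier": "Flour Inc."},
    {"id": "02", "supplier": "Seeds & Nuts Shop"},
    {"id": "03", "supplier": "NUTS SEEDS"},
    {"id": "04", "supplier": "Fl0ur 1nc."},
]

pairs = sorted_neighbourhood_pairs(
    invoices, key=lambda r: r["supplier"].lower(), window=2
)
print(sorted(pairs))
```

With a window of 2, only 3 of the 6 possible pairs are generated, and the two "Flour" variants (invoices 01 and 04) still end up next to each other. A larger window trades more comparisons for a lower risk of missing a duplicate.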

OUR APPROACH – SIMILARITY METRICS

Now we have a large set of paired invoices => Now what?

• Quantify the similarity by comparing the variables of the invoices

• Many different similarity functions!

• Finding the best function and variable combination

Pair #2:

Invoice ID   Invoice Number
Invoice #1   INV09218
Invoice #2   09218AB

Variables: x1 = INV09218, x2 = 09218AB

Similarity function: f(x1, x2)

Similarity metric: f(x1, x2) = 0.743
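To make the idea of a similarity function f(x1, x2) concrete, here is a small sketch of one candidate, the Dice coefficient on character bigrams. The slide does not say which function produced 0.743, so this sketch is not expected to reproduce that value; the function names are assumptions made for the example.

```python
def bigrams(text):
    """Return the set of overlapping two-character substrings."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_coefficient(x1, x2):
    """Dice coefficient on bigram sets: 2 * |A ∩ B| / (|A| + |B|)."""
    a, b = bigrams(x1), bigrams(x2)
    if not a or not b:
        # Strings shorter than two characters have no bigrams.
        return 1.0 if x1 == x2 else 0.0
    return 2 * len(a & b) / (len(a) + len(b))


score = dice_coefficient("INV09218", "09218AB")
print(round(score, 3))
```

For the two invoice numbers above, the shared bigrams are 09, 92, 21 and 18, giving 2 * 4 / (7 + 6) ≈ 0.615.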

OUR APPROACH – NEURAL NETWORK IMPLEMENTATION

Dataset of pairs with many similarity variables AND duplicate labels

Build a model that takes the metrics and calculates the probability of a pair containing a duplicate -> Neural Network:

• Well-documented application for binary classification

• Many easy-to-use and fast Python libraries available (Keras/TensorFlow)

Invoice pairs   Metric #1   Metric #2   Metric #3   Duplicate
Pair #1         0.878       0.723       0.984       Yes
Pair #2         0.123       0.423       0.223       No
…               …           …           …           …
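The mapping from similarity metrics to a duplicate probability can be illustrated with the smallest possible network: a single sigmoid unit, which is equivalent to logistic regression. This is a deliberately simplified stand-in written in plain Python, not the Keras/TensorFlow model the slides have in mind. The training data are only the two rows from the table above, and the learning rate and epoch count are arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, metrics):
    """Probability that a pair is a duplicate, given its metrics."""
    z = sum(w * m for w, m in zip(weights, metrics)) + bias
    return sigmoid(z)


def train(samples, labels, learning_rate=0.5, epochs=2000):
    """Stochastic gradient descent on the binary cross-entropy loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for metrics, label in zip(samples, labels):
            # Gradient of the loss with respect to the pre-activation z.
            error = predict(weights, bias, metrics) - label
            weights = [
                w - learning_rate * error * m
                for w, m in zip(weights, metrics)
            ]
            bias -= learning_rate * error
    return weights, bias


# The two labelled pairs from the slide (1 = duplicate, 0 = not).
samples = [[0.878, 0.723, 0.984], [0.123, 0.423, 0.223]]
labels = [1, 0]

weights, bias = train(samples, labels)
p_duplicate = predict(weights, bias, samples[0])
p_non_duplicate = predict(weights, bias, samples[1])
print(round(p_duplicate, 3), round(p_non_duplicate, 3))
```

A real model would add hidden layers and need far more labelled pairs; the point here is only that the output is a probability rather than a hard cut on a summed score.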

PRELIMINARY RESULTS

• Randomly choose a set of similarity functions on invoice number only

• Already some promising results => Framework works

NEXT STEPS

• Find the optimal blocking scheme as preprocessing step

• Optimize the similarity function and variable combination

• Increase training dataset

• Tweak model hyperparameters:

• Number of layers/nodes

• Loss function, minimization algorithm, learning rate, decay rate, momentum…

• Weight initialization

• Unsupervised learning options => Autoencoder anomaly detection


Back-up

STEP 1 – INDEXING/BLOCKING

• Cleaned data are ready to be compared.

• Potentially, each record needs to be compared with all the other records:

  • This leads to a total of N!/(2(N-2)!) = N(N-1)/2 comparisons, which does not scale to very large databases.

  • It is also not useful: a lot of comparisons will be between records that clearly don’t match (they do not have any feature in common).

To avoid comparing records that clearly don’t match, blocking criteria can be applied.
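The growth of the comparison count can be checked with a short calculation. N!/(2(N-2)!) is the binomial coefficient "N choose 2", which simplifies to N(N-1)/2. The block sizes in the second function call are purely hypothetical and serve only to show the effect of blocking.

```python
import math


def comparison_count(n_records):
    """Number of unordered record pairs: N! / (2 (N-2)!) = N (N-1) / 2."""
    return math.comb(n_records, 2)


def blocked_comparison_count(block_sizes):
    """Comparisons needed when records are only compared within blocks."""
    return sum(comparison_count(size) for size in block_sizes)


full = comparison_count(10**6)
# Hypothetical split of the same records into 1000 blocks of 1000.
blocked = blocked_comparison_count([1000] * 1000)
print(full, blocked)
```

For one million records the full comparison needs 499,999,500,000 pairs, while the hypothetical blocked version needs 499,500,000, a factor of about 1000 fewer.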

BLOCKING CRITERIA

Blocking criteria filter out record pairs that are very unlikely to correspond to matches:

• Split the database into smaller blocks

• Compare records only within the same block

Example of blocking criteria:

• We can use, for example, buckets 2, 3 and 5 of the NXG scoring:

  • Block 1: same invoice amount and invoice date.

  • Block 2: same invoice amount and numeric invoice number.

  • Block 3: same invoice date and numeric invoice number.

• Generate the permutations of pairs within the 3 blocks.

Other blocking ideas:

• We can try phonetic encoding algorithms, applied to either strings or numbers (for example supplier names, invoice numbers)

• These criteria will probably create blocks larger than the NXG buckets, so we need to find the best way to apply them

• To be tested, studies ongoing.

BLOCKING CRITERIA: EXAMPLE

[Figure: three blocks, each producing permutations of pairs. Block 1 (same date, amount): (B1,B15), (B3,B17), (B5,B8) … Block 2 (same amount, numeric invoice N.): (R3,R10), (R7,R1), (R8,R12) … Block 3 (same date, numeric invoice N.): (T5,T3), (T8,T17), (T28,T2) … All pairs are collected in one list of pair IDs: (B1,B15), (B3,B17), (B5,B8), (R3,R10), (R7,R1), (R8,R12), (T5,T3), (T8,T17), (T28,T2).]

• Create a unique DataFrame with all the pairs generated from the different blocks.
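Generating pairs inside several blocks and merging them into one duplicate-free collection can be sketched as follows. This is an assumption-laden illustration: the field names, the toy records and the reading of "numeric invoice number" as a precomputed `numeric_number` field are all hypothetical, and a plain Python set stands in for the DataFrame mentioned on the slide.

```python
from collections import defaultdict
from itertools import combinations

# The three blocking criteria from the slides, as tuples of field names.
BLOCKING_CRITERIA = [
    ("amount", "date"),
    ("amount", "numeric_number"),
    ("date", "numeric_number"),
]


def block_pairs(records, key_fields):
    """Pair up all records that agree on every field in key_fields."""
    blocks = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        blocks[key].append(record["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def candidate_pairs(records):
    """Union of the pairs from all blocks; each pair appears once."""
    pairs = set()
    for key_fields in BLOCKING_CRITERIA:
        pairs |= block_pairs(records, key_fields)
    return pairs


# Hypothetical toy records.
records = [
    {"id": "A", "amount": 2000, "date": "2018-09-02", "numeric_number": "09218"},
    {"id": "B", "amount": 2000, "date": "2018-09-02", "numeric_number": "11111"},
    {"id": "C", "amount": 2000, "date": "2016-12-11", "numeric_number": "09218"},
    {"id": "D", "amount": 890, "date": "2018-12-13", "numeric_number": "293"},
]

pairs = candidate_pairs(records)
print(sorted(pairs))
```

Here (A, B) comes from Block 1 and (A, C) from Block 2; record D shares no block with any other record, so it is never compared.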

STEP 2 – RECORD COMPARISON (SIMILARITY METRICS)

Goal: Compare record pairs to determine overall similarity.

• This allows pairs to be classified as Duplicate or Not duplicate

• The more features (variables) we have to compare, the more precise the similarity score will be.

Main steps for comparison:

1. Compare several record variables (invoice date, invoice number, supplier names, etc.)

  • Build new variables that help to find similarity in record pairs (if the existing variables are not enough).

2. Apply similarity comparison functions, either on strings or on numbers, to find similarity scores.

  • Several similarity comparison functions exist in the literature (Dice coefficient, Damerau-Levenshtein distance, Jaro-Winkler coefficient, N-Gram distance)

  • Studies ongoing to find the best ones.

3. Evaluate the sum of scores, to be used in the next step: classification of pairs as duplicate or not duplicate invoices.
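As an example of one of the functions named in step 2, the Jaro-Winkler coefficient can be sketched in plain Python. This follows the commonly used definition with a prefix scale of 0.1 and a maximum prefix length of 4; these defaults and the function names are assumptions of the sketch, and a tested library implementation would normally be preferred in practice.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and not further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The textbook pair MARTHA/MARHTA has six matching characters and one transposition, giving a Jaro similarity of 17/18 ≈ 0.9444 and, with the shared prefix "MAR", a Jaro-Winkler score of about 0.9611.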

SIMILARITY CLASSIFICATION - EXAMPLE

Record pair (B1,B15):

IDs of pairs   Invoice Number   Supplier Name   Supplier Reference   Invoice Date   Sum of similarity
B1             641091           Brian           7356                 23102016
B15            641375           Bryan           1824                 23102016
Scores         0.5              0.8             0.0                  1.0            2.3

Similarity degree:

• Similarity of 1.0 -> exact match between two values

• Similarity of 0.0 -> total dissimilarity between two values

• Similarity between 0.0 and 1.0 -> some degree of similarity between two attribute values

Sum of similarity:

• 0 ≤ Sum of Sim ≤ 4

• 4 is the maximum score because there are 4 variables

• Based on the evaluated sum of similarities, pairs can be classified as “Duplicates” or “Not Duplicates”
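The per-variable scores and their sum can be reproduced with a simple normalized edit-distance similarity. The slides do not state which function was used for this example; the sketch below uses plain Levenshtein distance (not the Damerau variant), which happens to give the same scores for this record pair. The function and field names are assumptions made for the illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical strings, 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Record pair (B1, B15) from the slide.
b1 = {"number": "641091", "supplier": "Brian", "reference": "7356", "date": "23102016"}
b15 = {"number": "641375", "supplier": "Bryan", "reference": "1824", "date": "23102016"}

scores = {field: similarity(b1[field], b15[field]) for field in b1}
sum_of_similarity = sum(scores.values())
print(scores, round(sum_of_similarity, 2))
```

The invoice numbers differ in 3 of 6 characters (0.5), the supplier names in 1 of 5 (0.8), the references in all 4 (0.0) and the dates in none (1.0), for a sum of 2.3.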

RECORD PAIR CLASSIFICATION

IDs of pairs   Sum of similarity   Classification (IsDuplicate)
(B1,B15)       2.3                 No
(B3,B17)       3.7                 Yes
(B5,B8)        1.5                 No
(R3,R10)       3.1                 Yes
(R7,R1)        2.5                 No
(R8,R12)       1.1                 No
(T5,T17)       3.2                 Yes

Sum > 3 -> Duplicate

Sum < 3 -> No Duplicate

This is a binary classification problem that can be solved with a supervised classification ML algorithm.

CHOOSING MODEL

• Dataset of pairs with many similarity variables -> Now what?

• Current approach: Calculate the total score and apply a rectangular cut

• Proposed approach: Build a model that takes the metrics and calculates the probability of a pair containing a duplicate -> Neural Network

• Why neural networks:

  • Good at finding complex and abstract patterns (the pattern indicating a duplicate)

  • Very good at generalizing, i.e. recognizing patterns that they were not trained on

  • Well documented for binary classification problems:

    • Powerful Python libraries available (TensorFlow)

    • Literature to build and improve a model for your specific problem

WHAT IS A NEURAL NETWORK

[Figure: neural network diagram. Inputs computed for a transaction pair: Dice coefficient, Damerau distance, Jaro coefficient. Output: duplicate probability.]

THE FISCAL APPROACH

Example invoices (dummy):

ID        Invoice Amount   Invoice Number   Invoice Date   Supplier Ref.   Supplier Name     Paid Date
0021032   123000           INV20K2C         14/09/17       3291029         INSIGHTS Inc.     23/12/17
0021102   120300           20K2CA           14/09/17       3291101         MC ITN INSIGHTS   22/11/17

Example tests:

• Add to the score if invoice numbers are the same after a mangling process

• Add to the score if supplier names contain the same words

• Subtract from the score if the invoices were paid on different dates

THE FISCAL APPROACH

Duplicate identification process:

• Create pairs of invoices based on three criteria sets (buckets/blocks)

• Assign a risk score [-100, 100] based on a set of tests that quantify similarity

• Deduplicate with a separate algorithm and pass on the pairs with a score > 0

DRAWBACKS AND POSSIBILITIES

• Scalability:

  • 10^6 invoices => (10^6)! / (2! (10^6 - 2)!) => ~10^11 possible pairs (more precisely, about 5 × 10^11)

  • Improve bucketing/blocking scheme

• High false positive rate:

  • Still thousands of pairs are returned to the customer => Need to be checked by hand!

  • Improve similarity tests and duplicate identification

• Fraudulent invoices:

  • Not taken into account

  • Whole separate and new field w.r.t. deduplication

GENERAL STRATEGY

[Figure: pipeline diagram. Raw data cleaning -> Indexing/blocking of cleaned data -> Comparison of record pairs -> Classification (Is Duplicate / Not Duplicate), together with a "Train Deep Learning model" step.]

• Raw data cleaning is not needed because our data are already cleaned.