FISCAL: A MACHINE LEARNING APPROACH
TABLE OF CONTENTS
• Start of the secondment
• What does FISCAL do?
• FISCAL approach
• Our approach
• Preliminary results
• Next steps
START OF THE SECONDMENT
Preparation
• Meeting at the beginning of October with Glen, Michele, the CFO, the Head of Data and a RHUL facilitator
• Meeting at the beginning of November with Serena at FISCAL
• Secondment agreements for RHUL and EDB:
  • Intellectual Property
  • Data sensitivity
  • Publishing
First weeks
• Understanding their software
• Understanding the data
• Literature research
WHAT DOES FISCAL DO?
FISCAL Company
• Located in Reading, UK (40 miles west of London)
• ±35 employees
• 6 developers (mostly C#)
FISCAL homepage:
“At FISCAL, we pride ourselves on delivering industry-leading forensic software solutions that empower procure-to-pay (P2P) teams to protect organizational spend.”
Meaning:
FISCAL builds and sells software to companies. The software:
• loads all the invoices from the company's database
• tries to find duplicates so that the company avoids unnecessary payments
FISCAL APPROACH
Dataset of invoices (from the customer):
ID  Invoice Number  Date        Amount  Supplier Name
01  INV09218        02/09/2018  2000    Flour Inc.
02  293AKDL         02/09/2017  1999    Seeds & Nuts Shop
03  293ALKL         13/12/2018  890     NUTS SEEDS
04  09218AB         11/12/2016  2000    Fl0ur 1nc.
…   …               …           …       …
FISCAL APPROACH
[Figure: single invoices (#1 to #6, …) are sorted into groups (Group A: #1, #3; Group B: #2, #6; Group C: #4, #5), and invoice pairs (Pair #1, Pair #2, Pair #3, …) are built within the groups]
FISCAL APPROACH
Tests are designed that each add to or subtract from a similarity score, e.g. if the invoice dates are the same then +20
Invoice pair Dupl. test #1 Dupl. test #2 Dupl. test #3 Total Score
Pair #1 10 20 -50 -20
Pair #2 20 50 20 90
Pair #3 20 10 0 30
… … … … …
Rectangular cut is applied -> If total score is >0 then possible duplicate!
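The scoring and cut described on this slide can be sketched as follows. This is only an illustration: the three tests, their weights and the field names are hypothetical, not FISCAL's actual tests.

```python
# Hypothetical tests and weights, for illustration only.
def same_date(a, b):
    return 20 if a["date"] == b["date"] else 0

def same_amount(a, b):
    return 50 if a["amount"] == b["amount"] else 0

def supplier_mismatch(a, b):
    # Penalise pairs whose supplier names differ (case-insensitive).
    return 0 if a["supplier"].lower() == b["supplier"].lower() else -50

TESTS = [same_date, same_amount, supplier_mismatch]

def total_score(a, b):
    """Sum the contributions of all duplicate tests for one invoice pair."""
    return sum(test(a, b) for test in TESTS)

def is_candidate(a, b, cut=0):
    """Apply the cut: a pair is a possible duplicate if its total score exceeds it."""
    return total_score(a, b) > cut

inv_a = {"date": "02/09/2018", "amount": 2000, "supplier": "Flour Inc."}
inv_b = {"date": "02/09/2018", "amount": 2000, "supplier": "FLOUR INC."}
inv_c = {"date": "13/12/2018", "amount": 890, "supplier": "NUTS SEEDS"}

print(total_score(inv_a, inv_b), is_candidate(inv_a, inv_b))
print(total_score(inv_a, inv_c), is_candidate(inv_a, inv_c))
```

Every pair that passes the cut is returned, so the choice of tests and weights directly controls how many candidates the customer has to check.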
FISCAL APPROACH
[Figure: the invoice pairs that pass the cut (Pair #2, Pair #3, …, Pair #2021, Pair #3005) are returned to the customer]
All need to be checked by the customer!
OUR APPROACH
Project goal:
Reduce the number of returned invoice duplicate candidates
How:
• Improve their invoice grouping scheme (blocking/bucketing)
• Improve similarity measurement between invoice pairs
• Introduce a new machine learning classification model
Why:
• A lot of literature available
• Clear machine learning approach => binary classification problem (duplicate or not)
OUR APPROACH – BLOCKING/BUCKETING SCHEME
Goal of the blocking scheme:
• Group together invoices that are similar according to some function, to decrease the number of pairs that have to be compared
Suggested grouping functions:
• Based on date
• Based on paid amount
• Based on phonetic encoding of supplier names
• Sorted neighbourhood approach
• Suffix-Array approach
Still very much in the literature research phase!
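As one example of the grouping functions listed above, a minimal sketch of the sorted neighbourhood approach: sort the records by a key and pair each record only with its nearest neighbours in that order. The sort key (the paid amount) and the window size are assumptions chosen for illustration.

```python
def sorted_neighbourhood_pairs(records, key, window=3):
    """Return index pairs of records that lie within `window` positions
    of each other after sorting by `key`."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Pair record i with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs

# Amounts taken from the example dataset of invoices.
invoices = [
    {"id": "01", "amount": 2000},
    {"id": "02", "amount": 1999},
    {"id": "03", "amount": 890},
    {"id": "04", "amount": 2000},
]

pairs = sorted_neighbourhood_pairs(invoices, key=lambda r: r["amount"], window=2)
print(sorted(pairs))
```

With a window of 2 only neighbouring records are paired, so the four invoices give 3 candidate pairs instead of all 6.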
OUR APPROACH – SIMILARITY METRICS
Now we have a large set of paired invoices => Now what?
• Quantify the similarity by comparing the variables of the invoices
• Many different similarity functions!
• Find the best combination of similarity function and variable
Pair #2:
Invoice ID  Invoice Number
Invoice #1  INV09218
Invoice #2  09218AB
Variables: x1 = INV09218, x2 = 09218AB
Similarity function: f(x1, x2)
Similarity metric: f(x1, x2) = 0.743
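To make the idea of a similarity function f(x1, x2) concrete, here is a minimal sketch of one candidate, the Dice coefficient on character bigrams. The slide does not say which function produced the 0.743, so this sketch is not expected to reproduce that value.

```python
def bigrams(text):
    """Split a string into overlapping two-character pieces (case-insensitive)."""
    text = text.upper()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def dice_coefficient(a, b):
    """Dice coefficient on character bigrams: 2 * shared / (|A| + |B|)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 1.0 if a.upper() == b.upper() else 0.0
    remaining = list(grams_b)
    shared = 0
    for gram in grams_a:
        if gram in remaining:
            remaining.remove(gram)  # each bigram of b can be matched only once
            shared += 1
    return 2 * shared / (len(grams_a) + len(grams_b))

x1, x2 = "INV09218", "09218AB"
print(round(dice_coefficient(x1, x2), 3))
```

The two invoice numbers share the bigrams 09, 92, 21 and 18, so the function returns 2 * 4 / (7 + 6).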
OUR APPROACH – NEURAL NETWORK IMPLEMENTATION
Dataset of pairs with many similarity variables AND duplicate labels
Build a model that takes the metrics and calculates the probability of a pair containing a duplicate -> Neural Network:
• Well-documented approach for binary classification
• Many fast and easy-to-use Python libraries available (Keras/TensorFlow)
Invoice pairs Metric #1 Metric #2 Metric #3 Duplicate
Pair #1 0.878 0.723 0.984 Yes
Pair #2 0.123 0.423 0.223 No
… … … … …
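A minimal sketch of the training step on such a table. To stay dependency-free it swaps the proposed Keras/TensorFlow network for a single sigmoid neuron (logistic regression) trained by gradient descent in plain Python. The first two rows come from the table above; the other two rows and all hyperparameters are made up for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability that the pair described by metrics x is a duplicate."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def train(rows, labels, epochs=500, lr=0.5, seed=0):
    """Stochastic gradient descent on the binary cross-entropy loss."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in rows[0]]
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = predict(weights, bias, x) - y  # dLoss/dz for a sigmoid output
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias

metrics = [
    [0.878, 0.723, 0.984],  # Pair #1 from the table
    [0.123, 0.423, 0.223],  # Pair #2 from the table
    [0.910, 0.800, 0.950],  # hypothetical extra duplicate
    [0.300, 0.200, 0.100],  # hypothetical extra non-duplicate
]
is_duplicate = [1, 0, 1, 0]

weights, bias = train(metrics, is_duplicate)
for row in metrics:
    print(row, round(predict(weights, bias, row), 3))
```

A real network would add hidden layers and be trained on far more labelled pairs; the input and output shapes stay the same.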
PRELIMINARY RESULTS
• Randomly chose a set of similarity functions, applied to the invoice number only
• Already some promising results => Framework works
NEXT STEPS
• Find the optimal blocking scheme as a preprocessing step
• Optimize the combination of similarity function and variable
• Increase the training dataset
• Tweak model hyperparameters:
  • Number of layers/nodes
  • Loss function, minimization algorithm, learning rate, decay rate, momentum…
  • Weight initialization
• Unsupervised learning options => Autoencoder anomaly detection
Back-up
STEP 1 – INDEXING/BLOCKING
• Cleaned data are ready to be compared.
• Potentially each record needs to be compared with all the other records:
  • This leads to a total of N!/(2!(N-2)!) = N(N-1)/2 comparisons, which does not scale to very large databases.
  • It is also not useful: many comparisons will be between records that clearly don’t match (they do not have any feature in common).
To avoid comparing records that clearly don’t match, blocking criteria can be applied.
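A short sketch of how quickly the number of comparisons grows without blocking, using the N(N-1)/2 count of unordered pairs. The database sizes are arbitrary examples.

```python
import math

def n_comparisons(n_records):
    """Number of unordered record pairs: N! / (2! (N-2)!) = N(N-1)/2."""
    return math.comb(n_records, 2)

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} records -> {n_comparisons(n):,} comparisons")
```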
BLOCKING CRITERIA
Blocking criteria filter out record pairs that are very unlikely to correspond to matches
• Split the database into smaller blocks
• Compare records only within the same block
Example of blocking criteria:
• We can use, for example, buckets 2, 3 and 5 of the NXG scoring:
  • Block 1: Same invoice amount, invoice date.
  • Block 2: Same invoice amount, numeric invoice number.
  • Block 3: Same invoice date, numeric invoice number.
• Generate the pairs within each of the 3 blocks.
Other blocking ideas:
• We can try phonetic encoding algorithms, applied to either strings or numbers (for example supplier names, invoice numbers)
• These criteria will probably create blocks larger than the NXG buckets, so we need to find the best way to apply them
• To be tested, studies ongoing.
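To illustrate the phonetic encoding idea, here is a sketch that blocks supplier names by a simplified Soundex code. It is an assumption that Soundex would be the encoding used; the slides do not name one. The simplification is that H and W are treated like vowels, which differs from standard Soundex, and non-letters are dropped.

```python
from collections import defaultdict

CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Simplified Soundex: first letter plus up to three consonant digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = CODES.get(letter, "")
        if digit and digit != previous:  # skip vowels and repeated codes
            encoded += digit
        previous = digit
    return (encoded + "000")[:4]

suppliers = ["Flour Inc.", "Seeds & Nuts Shop", "NUTS SEEDS", "Fl0ur 1nc."]
blocks = defaultdict(list)
for name in suppliers:
    blocks[soundex(name)].append(name)
print(dict(blocks))
```

In this example the two spellings of the flour supplier land in the same block, while the two nut suppliers do not, because the code depends on word order. That is one reason the choice of criteria still needs testing.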
BLOCKING CRITERIA: EXAMPLE
[Figure: pairs are generated within each block]
Block 1 (same date, amount): (B1,B15), (B3,B17), (B5,B8) …
Block 2 (same amount, numeric invoice number): (R3,R10), (R7,R1), (R8,R12) …
Block 3 (same date, numeric invoice number): (T5,T3), (T8,T17), (T28,T2) …
• Create a single DataFrame with all the pairs generated from the different blocks.
IDs of pairs: (B1,B15), (B3,B17), (B5,B8), (R3,R10), (R7,R1), (R8,R12), (T5,T3), (T8,T17), (T28,T2), …
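The three blocks and the merged list of pairs can be sketched as follows. The invoices are made up, a plain set stands in for the DataFrame to keep the example dependency-free, and "numeric invoice number" is assumed to mean the digits of the invoice number.

```python
from collections import defaultdict
from itertools import combinations

def numeric_part(number):
    """Assumed meaning of 'numeric invoice number': keep only the digits."""
    return "".join(c for c in number if c.isdigit())

BLOCK_KEYS = {
    "block_1": lambda r: (r["amount"], r["date"]),
    "block_2": lambda r: (r["amount"], numeric_part(r["number"])),
    "block_3": lambda r: (r["date"], numeric_part(r["number"])),
}

def candidate_pairs(records):
    """Generate pairs within each block and merge them without repeats."""
    pairs = set()
    for key in BLOCK_KEYS.values():
        blocks = defaultdict(list)
        for record in records:
            blocks[key(record)].append(record["id"])
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)

# Hypothetical invoices.
invoices = [
    {"id": "A1", "amount": 2000, "date": "02/09/2018", "number": "INV09218"},
    {"id": "A2", "amount": 2000, "date": "02/09/2018", "number": "09218AB"},
    {"id": "A3", "amount": 2000, "date": "11/12/2016", "number": "09218"},
    {"id": "A4", "amount": 890, "date": "13/12/2018", "number": "293"},
]

print(candidate_pairs(invoices))
```

The pair (A1, A2) is produced by all three blocks but appears only once in the merged result.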
STEP 2 – RECORD COMPARISON (SIMILARITY METRICS)
Goal: Compare record pairs to determine overall similarity.
• This will allow us to classify pairs as Duplicate or Not duplicate
• The more features (variables) we have to compare, the more precise the similarity score will be.
Main steps for comparison:
1. Compare several record variables (invoice date, invoice number, supplier names, etc.)
  • Build new variables that will be useful to find similarity in record pairs (if the variables that we have are not enough).
2. Apply similarity comparison functions, either on strings or on numbers, to find similarity scores.
  • Several similarity comparison functions in the literature (Dice coefficient, Damerau-Levenshtein distance, Jaro-Winkler coefficient, N-gram distance)
  • Studies ongoing to find the best ones.
3. Evaluate the sum of scores, to be used in the next step: classification of pairs as duplicate or not duplicate invoices.
SIMILARITY CLASSIFICATION - EXAMPLE
Record pair (B1,B15):
IDs of pairs  Invoice Number  Supplier Name  Supplier Reference  Invoice Date  Sum of similarity
B1            641091          Brian          7356                23102016
B15           641375          Bryan          1824                23102016
Scores        0.5             0.8            0.0                 1.0           2.3
Similarity degree:
• Similarity of 1.0 -> exact match between two values
• Similarity of 0.0 -> total dissimilarity between two values
• Similarity between 0.0 and 1.0 -> some degree of similarity between two values
Sum of similarity
• 0 ≤ Sum of Sim ≤ 4
• 4 is the maximum score because there are 4 variables
• From the evaluated Sum of Sim, pairs can be classified as “Duplicates” or “Not Duplicates”
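A minimal sketch of the per-variable scores and their sum for the pair (B1,B15). The slide does not say which similarity function was used; as an assumption, this sketch uses a normalised Damerau-Levenshtein distance in its restricted "optimal string alignment" form, which for this pair gives the same four scores as the table.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and swaps of adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

def similarity(a, b):
    """Scale the distance to [0, 1]: 1.0 is an exact match, 0.0 nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return (longest - osa_distance(a, b)) / longest

b1 = {"invoice_number": "641091", "supplier_name": "Brian",
      "supplier_reference": "7356", "invoice_date": "23102016"}
b15 = {"invoice_number": "641375", "supplier_name": "Bryan",
       "supplier_reference": "1824", "invoice_date": "23102016"}

scores = {field: similarity(b1[field], b15[field]) for field in b1}
sum_of_similarity = sum(scores.values())
print(scores, round(sum_of_similarity, 2))
```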
RECORD PAIR CLASSIFICATION
IDs of pairs  Sum of similarity  Classification (IsDuplicate)
(B1,B15)      2.3                No
(B3,B17)      3.7                Yes
(B5,B8)       1.5                No
(R3,R10)      3.1                Yes
(R7,R1)       2.5                No
(R8,R12)      1.1                No
(T5,T17)      3.2                Yes
Sum > 3 -> Duplicate
Sum < 3 -> Not Duplicate
This is a binary classification problem that can be solved with a supervised classification ML algorithm.
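Before a learned model is introduced, the cut in the table amounts to a one-line rule. A minimal sketch, using the sums from the table; the slide does not say how a sum of exactly 3 is handled, so treating it as "No" here is an assumption.

```python
CUT = 3.0

def classify(sum_of_similarity, cut=CUT):
    """Label a pair as duplicate ("Yes") if its similarity sum exceeds the cut."""
    return "Yes" if sum_of_similarity > cut else "No"

sums = {
    ("B1", "B15"): 2.3,
    ("B3", "B17"): 3.7,
    ("B5", "B8"): 1.5,
    ("R3", "R10"): 3.1,
    ("R7", "R1"): 2.5,
    ("R8", "R12"): 1.1,
    ("T5", "T17"): 3.2,
}

labels = {pair: classify(total) for pair, total in sums.items()}
for pair, label in labels.items():
    print(pair, sums[pair], label)
```

A supervised model replaces the fixed cut with a decision boundary learned from labelled pairs.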
CHOOSING MODEL
• Dataset of pairs with many similarity variables -> Now what?
• Current approach: Calculate total score and apply rectangular cut
• Proposed approach: Build a model that takes the metrics and calculates the probability of a pair containing a duplicate -> Neural Network
• Why neural networks:
  • Good at finding complex and abstract patterns (the pattern indicating a duplicate)
  • Very good at generalizing, i.e. recognizing patterns that they were not trained on
  • Well documented for binary classification problems:
    • Powerful Python libraries available (TensorFlow)
    • Literature to build and improve the model for your specific problem
WHAT IS A NEURAL NETWORK
[Figure: neural network that takes the similarity metrics of a transaction pair as inputs (Dice coefficient, Damerau distance, Jaro coefficient) and outputs the duplicate probability]
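A sketch of what such a network computes for one transaction pair: a forward pass through one hidden layer to a sigmoid output. The layer sizes and all weights are hypothetical and untrained, and the three inputs are assumed to be scaled to [0, 1] with 1 meaning most similar.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, untrained weights: 3 inputs -> 2 hidden nodes -> 1 output.
HIDDEN_WEIGHTS = [[1.0, 0.5, 1.0], [-1.0, 0.5, -0.5]]
HIDDEN_BIAS = [-1.0, 0.5]
OUTPUT_WEIGHTS = [2.0, -2.0]
OUTPUT_BIAS = -0.5

def duplicate_probability(metrics):
    """Forward pass: weighted sums, ReLU in the hidden layer, sigmoid output."""
    hidden = [
        relu(sum(w * x for w, x in zip(weights, metrics)) + bias)
        for weights, bias in zip(HIDDEN_WEIGHTS, HIDDEN_BIAS)
    ]
    z = sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden)) + OUTPUT_BIAS
    return sigmoid(z)

# Inputs in the order: Dice coefficient, Damerau similarity, Jaro coefficient.
print(round(duplicate_probability([0.9, 0.9, 0.9]), 3))
print(round(duplicate_probability([0.1, 0.1, 0.1]), 3))
```

Training adjusts the weights so that labelled duplicates get probabilities close to 1 and non-duplicates close to 0.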
THE FISCAL APPROACH
Example invoices (dummy):
ID       Invoice Amount  Invoice Number  Invoice Date  Supplier Ref.  Supplier Name    Paid Date
0021032  123000          INV20K2C        14/09/17      3291029        INSIGHTS Inc.    23/12/17
0021102  120300          20K2CA          14/09/17      3291101        MC ITN INSIGHTS  22/11/17
Example tests:
• Add to the score if invoice numbers are the same after a mangling process
• Add to the score if supplier names contain the same words
• Subtract from the score if the invoices were paid on different dates
THE FISCAL APPROACH
Duplicate identification process:
• Create pairs of invoices based on three criteria sets (buckets/blocks)
• Assign a risk score in the range [-100, 100] based on a set of tests that quantify similarity
• Deduplicate with a separate algorithm and pass on the pairs with a score > 0
DRAWBACKS AND POSSIBILITIES
• Scalability:
  • 10^6 invoices => (10^6)! / (2! (10^6 - 2)!) ≈ 5 × 10^11 possible pairs
  • Improve bucketing/blocking scheme
• High false positive rate:
  • Still thousands of pairs are returned to the customer => Need to be checked by hand!
  • Improve similarity tests and duplicate identification
• Fraudulent invoices:
  • Not taken into account
  • A whole separate and new field w.r.t. deduplication
GENERAL STRATEGY
[Figure: flowchart: Raw Data Cleaning -> Indexing/Blocking cleaned data -> Comparison of record pairs -> Classification (Is Duplicate / Not Duplicate), with a separate "Train Deep Learning model" box]
• Raw data cleaning is not needed because our data are already cleaned.