![Page 1: Relational Data Pre-Processing Techniques for Improved Securities Fraud Detection Andrew Fast, Lisa Friedland, Marc Maier, Brian Taylor, and David Jensen](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649cc05503460f9498736b/html5/thumbnails/1.jpg)
Relational Data Pre-Processing Techniques for Improved
Securities Fraud Detection
Andrew Fast, Lisa Friedland, Marc Maier,
Brian Taylor, and David Jensen
Knowledge Discovery Laboratory
Department of Computer Science
University of Massachusetts Amherst
Henry G. Goldberg and John Komoroske
National Association of Securities Dealers
Financial Industry Regulatory Authority
Reps are required to file disclosures for incidents ranging in severity from customer complaints to criminal charges.
Individual characteristics are not distinctive,…
…it is the collection of related entities that sets this rep apart.
Age
Years in Industry
…
Goals
Overall Goal
Identify new methods and improve existing methods for detecting securities fraud that consider the relationships among reps, branches, and firms.
Goals of This Talk
1) Describe the pre-processing steps needed to prepare the data for knowledge discovery.
2) Demonstrate that pre-processing is both necessary and beneficial for knowledge discovery in relational domains.
The Knowledge Discovery Process
Data Pre-Processing
Challenge 1: Inputs
Challenge 2: Class Labels
Challenge 1: Preparing Inputs
Small-scale social structure between reps is important.
Branch Associations
Tribes
NOT IN RAW DATA
(Cortes, Pregibon, and Volinsky 2001) (Neville et al. 2005)
Identifying Branch Associations
110 Wall Street , NEW YORK, NY, 10005
311 S. WACKER DRIVE , CHICAGO, IL, 60606
1400 World Trade Center, St. Paul, MN, 55101
30 East 7th Street Suite 1400, St. Paul, MN, 55101
110 Wall Street , NY, NY, 10005
110 Wall Street , MANHATTAN, NY, 10005
110 Wall Street , 22ND FLOOR, NEW YORK, NY, 10005
30 East 7th Street Suite 1400, St. Paul, MN, 55101
311 S. WACKER DRIVE , CHICARGO, OH, 60606
1400 World Trade Center, St. Paul, MN, 55101
Self-Reported Addresses are messy!
Use String Matching Algorithm
Inferred Branches
Matched 70% (~3.35 million)
Unmatched 30% (~1.43 million)
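A minimal sketch of how approximate string matching could group messy self-reported addresses into inferred branches. The slides do not name the matching algorithm actually used, so this example substitutes Python's standard `difflib.SequenceMatcher`. The helper names, the greedy grouping strategy, and the 0.8 threshold are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", address.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_addresses(addresses, threshold=0.8):
    """Greedy grouping: attach each address to the first group whose
    first member is similar enough, otherwise start a new group.
    The threshold is a hypothetical value, not taken from the talk."""
    groups = []
    for addr in addresses:
        for group in groups:
            if similarity(addr, group[0]) >= threshold:
                group.append(addr)
                break
        else:
            groups.append([addr])
    return groups


if __name__ == "__main__":
    sample = [
        "110 Wall Street , NEW YORK, NY, 10005",
        "110 Wall Street , NY, NY, 10005",
        "30 East 7th Street Suite 1400, St. Paul, MN, 55101",
    ]
    for group in group_addresses(sample):
        print(group)
```

Greedy grouping compares every address against every group, which would be too slow for millions of records; a real pipeline would likely block on zip code or firm first. That blocking step is also an assumption, not something the slides describe.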
Identifying Tribes
Why did they move?
1. Looking for a better location to commit fraud.
2. Friends inviting friends to better jobs.
3. Geographic Limitations.
4. Branches merged or acquired.
Anomalous Movement
Background Movement
Identifying Tribes
Tribe - |trīb| noun
A group of reps that works together at many statistically unlikely branches.
• Reps in tribes found by our algorithm:
– move between branches in more zip-codes than the rest of the population.
– have more disclosures than the rest of the population.
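A rough sketch of the idea behind the tribe definition above: pairs of reps who share many branches score higher, and sharing a small (unlikely) branch counts for more than sharing a large one. This is not the algorithm from the tribes talk. The scoring rule (negative log of the fraction of reps who worked at a branch) and every name below are assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def pair_scores(employment):
    """employment maps rep id -> set of branch ids the rep has worked at.
    Returns {(rep_a, rep_b): score} for every pair sharing a branch.
    Hypothetical scoring: each shared branch adds -log(share of all reps
    who ever worked there), so rare branches add more."""
    n_reps = len(employment)
    branch_members = defaultdict(set)
    for rep, branches in employment.items():
        for branch in branches:
            branch_members[branch].add(rep)

    scores = defaultdict(float)
    for members in branch_members.values():
        surprise = -math.log(len(members) / n_reps)
        for a, b in combinations(sorted(members), 2):
            scores[(a, b)] += surprise
    return dict(scores)


if __name__ == "__main__":
    employment = {
        "r1": {"A", "B", "C"},
        "r2": {"A", "B", "C"},
        "r3": {"A"},
        "r4": {"A"},
    }
    ranked = sorted(pair_scores(employment).items(), key=lambda kv: -kv[1])
    print(ranked[0])
```

Pairs with the highest scores would be candidates for closer inspection; how the threshold is set, and how time overlap at a branch is handled, is left out of this sketch.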
For more details on how to find tribes please see talk by Lisa Friedland:
Finding Tribes: Identifying Close-Knit Individuals from Employment Patterns
R11: Pattern Discovery (II) , Tuesday 10:30 am ~ 11:50 am, Regency 2
The Knowledge Discovery Process
Challenge 2: Class Label
Whether a rep is planning to commit fraud is unknown …
… instead we use a surrogate class label called a risk score based on disclosure history.
With NASD guidance, we assigned a weight to each disclosure type based on its severity.
For reps, sum over disclosures to determine risk score for a given year.
For branches, compute average risk score over reps.
[Figure: disclosure weights added to give a rep's risk score; rep risk scores combined to give a branch's risk score]
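The two aggregation steps above can be sketched as follows. The weight values are placeholders: the real weights were set with NASD guidance and are not given in the slides. The function names and the choice to count reps without disclosures as zero are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical severity weights, for illustration only.
WEIGHTS = {
    "customer_complaint": 1.0,
    "bankruptcy": 2.0,
    "regulatory_action": 3.0,
}


def rep_risk_scores(disclosures):
    """disclosures: iterable of (rep_id, year, disclosure_type).
    Returns {(rep_id, year): summed weight of that rep's disclosures}."""
    scores = defaultdict(float)
    for rep_id, year, disclosure_type in disclosures:
        scores[(rep_id, year)] += WEIGHTS[disclosure_type]
    return dict(scores)


def branch_risk_score(rep_ids, year, rep_scores):
    """Average rep risk score over all reps at a branch in a given year.
    Reps with no disclosures that year contribute 0 (an assumption)."""
    if not rep_ids:
        return 0.0
    total = sum(rep_scores.get((rep_id, year), 0.0) for rep_id in rep_ids)
    return total / len(rep_ids)


if __name__ == "__main__":
    disclosures = [
        ("r1", 2005, "customer_complaint"),
        ("r1", 2005, "regulatory_action"),
        ("r2", 2005, "bankruptcy"),
    ]
    scores = rep_risk_scores(disclosures)
    print(branch_risk_score(["r1", "r2", "r3"], 2005, scores))
```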
Risk Scores to Class Labels
Simple Approach
1) Sort by risk score
2) Choose Top N
• But risk scores are not distributed evenly.
• Scores vary by:
– Geography
– Year
– Firm Demographics
• Want high-risk, but relative to similar branches or reps.
Creating a Normalized Class Label
[Figure: grid of bins for 2005, zip-code regions 0 to 9 by branch category]
Small Branch, Small Firm
Small Branch, Large Firm
Medium Branch, Small Firm
Medium Branch, Large Firm
Large Branch, Large Firm
Stratify by year, zip-code, and firm demographics.
For each bin, choose top 5% of all scores, as long as also above median of non-zero scores.
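A minimal sketch of the per-bin labeling rule just stated: take the top 5% of scores in a bin, and keep only those that are also above the median of the bin's non-zero scores. The function name, rounding the 5% count up, and treating "above" as strictly greater are assumptions; the slides do not specify how ties or small bins are handled.

```python
import math
from statistics import median


def label_bin(scores, top_fraction=0.05):
    """scores maps entity id -> risk score for the entities in ONE bin.
    Returns the set of entity ids labeled high-risk in that bin."""
    nonzero = [s for s in scores.values() if s > 0]
    if not nonzero:
        return set()  # nobody in this bin has any disclosures
    floor = median(nonzero)
    # Assumption: round the top-5% count up so small bins can yield a label.
    n_top = math.ceil(top_fraction * len(scores))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {entity for entity in ranked[:n_top] if scores[entity] > floor}


if __name__ == "__main__":
    bin_scores = {f"b{i}": 0.0 for i in range(36)}
    bin_scores.update({"b36": 1.0, "b37": 2.0, "b38": 3.0, "b39": 10.0})
    print(sorted(label_bin(bin_scores)))
```

Applying this function independently to every bin yields the normalized class label; applying it once to the whole population corresponds to a global label without stratification.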
The Knowledge Discovery Process
kdl.cs.umass.edu/proximity
Learning Models of Risk
• Two approaches for learning models:
– Model highest-risk entities from each bin with a single model.
– Model each bin independently (stratified).
• Two possible class labels:
– Normalized Class Label
– Global Class Label (no normalization)
Evaluation
Random = 0.5
Non-Normalized = 0.72
Normalized = 0.77
Stratified Norm = 0.62
Stratified Non-Norm = 0.62
Evaluation
Random = 0.5
Non-Normalized = 0.79
Normalized = 0.80
Stratified Norm = 0.69
Stratified Non-Norm = 0.70
Relational Probability Tree
(Neville et al. 2003)
Challenge 1: Scorecard
Branch Associations: REQUIRED
• Enabled all other analyses.
• Branch features appeared many times in learned trees.
Tribes
• Useful as a local pattern.
• Not widespread enough to be useful as a global pattern despite desirable characteristics.
Challenge 2: Scorecard
Normalized Class Label
• Allows improved branch models.
– Risk score on branches is an aggregate over reps.
– Normalization accounts for discrepancies in sizes.
• No improvement in rep models.
– No aggregation is required for reps.
– Normalization has no effect.
Stratified Models
• Stratified models do not improve performance.
– Small number of positive instances in each bin leads to lack of generalization.
Summary
• Finding small-scale social structure is critical for relational domains.
– Branch matching is essential: all analyses required branches.
– Tribes are an interesting local pattern, but not widespread enough for global modeling.
• Continuing to explore other techniques to capitalize on the dynamic structure of complex domains.
• Under the right circumstances, normalization can be useful when creating a class label for relational domains.
– Helpful when aggregating over lower-level entities.
Consolidation and Link Formation
Consolidation - identify entities of interest
Link Formation - construct structured relations between consolidated entities.
(Goldberg and Senator 1995)
– Raw data don’t always explicitly contain entities and structure of interest.
– These are critical for knowledge discovery in relational data.
– In the securities industry, social structure among reps is useful for detecting misconduct.
(Neville et al. 2005)
For more details on tribes please see talk by Lisa Friedland:
Finding Tribes: Identifying Close-Knit Individuals from Employment Patterns
R11: Pattern Discovery (II) , Tuesday 10:30 am ~ 11:50 am, Regency 2
How did we do?
– Reps in tribes move between branches in more zip-codes than the general population of reps.
– Reps in tribes are almost 8 times more likely to be at high risk of fraud.
Identifying Tribes
• Rather than limiting ourselves to static relations, we can consider groups of reps who coordinate their movement from branch to branch.
• Not all group movement is coordinated. Consider:
– Two firms in a small town. (Geography)
– Branch is sold to another firm. (Acquisitions and Mergers)
– Branch is closed. (Branch Closings)
• Not all coordinated movement is nefarious.
– Friends inviting friends to better jobs.
Tribe - |trīb| noun
A group of reps that coordinate their movement between statistically unlikely branches.
Challenge 2: Scorecard
• Normalized Class Label
– Allows better models for branches, not for reps.
– Risk score on branches is an aggregate over reps. Normalization accounts for discrepancies in sizes.
– No aggregation is needed for reps. A high-scoring rep is high-risk no matter which bin they belong to.
• Stratified Models
– Stratified models perform worse than combined models.
– The number of positive instances per bin is small, so the models do not generalize well.
Challenge 2: Class Label
• “Will commit fraud in future” flag is not given.
– Create a surrogate class label or risk score from the collection of disclosures on reps.
• Risk is not uniformly distributed across the data.
– Market conditions vary over time. (Temporal)
– Laws vary from state to state. (Geographic)
– A small firm may have different market pressures than a large firm. (Demographic)
Weighted Risk Score
• Assign each disclosure type a score based on its severity.
– Customer complaints
– Bankruptcy
– Regulatory action
• Sum over disclosure types for each rep for each year.
• Average over reps for branches.
• Who is high-risk?
– Order entire population by risk score, choose top n.
– But risk scores are not uniformly distributed.
Check out your broker:
http://brokercheck.finra.org/
http://www.nasdbrokercheck.com
Normalizing Risk
• Stratify data into bins with uniform risk score.
– Consider each year independently.
– Segment USA by zip-code.
– Create 5 categories of branches based on branch and firm size:
  • Small branch, small firm.
  • Small branch, large firm.
  • Medium branch, small firm.
  • Medium branch, large firm.
  • Large branch, large firm.
• For each year, 5 branch types × 10 zip-code regions = 50 bins.
• Who is high-risk?
– Within each of the bins, order the population by risk score and choose the top.
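A small sketch of how branches might be assigned to the 50 bins per year described above. It assumes the 10 zip-code regions are given by the first digit of the zip code, which the slides do not state explicitly. It also takes the branch category as an input, because the size thresholds behind "small", "medium", and "large" are not given. All names are hypothetical.

```python
from collections import defaultdict

BRANCH_TYPES = (
    "small branch, small firm",
    "small branch, large firm",
    "medium branch, small firm",
    "medium branch, large firm",
    "large branch, large firm",
)


def bin_key(year, zip_code, branch_type):
    """Return the (year, zip region, branch type) bin for one branch.
    Assumption: the zip region is the first digit of the zip code."""
    if branch_type not in BRANCH_TYPES:
        raise ValueError(f"unknown branch type: {branch_type!r}")
    return (year, zip_code.strip()[0], branch_type)


def group_by_bin(branches):
    """branches: iterable of dicts with keys id, year, zip, type, score.
    Returns {bin key: {branch id: risk score}}."""
    bins = defaultdict(dict)
    for branch in branches:
        key = bin_key(branch["year"], branch["zip"], branch["type"])
        bins[key][branch["id"]] = branch["score"]
    return dict(bins)


if __name__ == "__main__":
    branches = [
        {"id": "b1", "year": 2005, "zip": "10005", "score": 4.0,
         "type": "small branch, large firm"},
        {"id": "b2", "year": 2005, "zip": "10280", "score": 0.0,
         "type": "small branch, large firm"},
        {"id": "b3", "year": 2005, "zip": "60606", "score": 1.5,
         "type": "small branch, large firm"},
    ]
    for key, members in group_by_bin(branches).items():
        print(key, members)
```

Each resulting bin can then be ranked on its own, so a branch is compared only with branches of the same year, region, and size category.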