jan25 - ottawa machine learning meetup
Post on 08-Feb-2017
TRANSCRIPT
CLASSIFYING OMNIBUS BILLS
OTTAWA MACHINE LEARNING MEETUP - JAN. 25TH, 2016
SAMUEL WITHERSPOON, MATHEW SONKE
DISCLAIMER
THIS IS OUR FIRST ITERATION AND IS A WORK IN PROGRESS.
PURPOSE
WE WANT TO SHOW HOW WE MOVE FROM START TO FIRST SET OF RESULTS IN AN ML PROBLEM
SUMMARY OF EFFORT
≈ 50 HOURS SPENT
≈ 120 BILLS MANUALLY CLASSIFIED
SOURCE CODE: https://github.com/switherspoon/MachineLearningMeetup
WHAT IS AN OMNIBUS BILL?
TYPICALLY VERY LONG
TYPICALLY LOTS OF OTHER BILLS MODIFIED
For Example Bill C-51
A BILL THAT HAS A WIDE VARIETY OF TOPICS
THAT DEFINITION INFORMED OUR FEATURES
LENGTH OF BILL
DIVERSITY OF TOPICS IN THE BILL
NUMBER OF OTHER BILLS MODIFIED/REFERENCED
FEATURES:
WHAT DOES AN OMNIBUS BILL LOOK LIKE?
BILL C-51 - 41st PARLIAMENT, 2nd SESSION: 4/19 PAGES
BILL C-54 - 41st PARLIAMENT, 2nd SESSION: 1/1 PAGE
GETTING STARTED
WE USED PYTHON3 WITH:
1. NLTK (http://www.nltk.org/) - FOR NLP
2. SCIKIT-LEARN (http://scikit-learn.org/stable/) - FOR CLASSIFIER
3. GENSIM (https://radimrehurek.com/gensim/) - FOR TOPIC MODEL
4. PSYCOPG2 (http://initd.org/psycopg/) - FOR DATA EXTRACT
ALL INSTALLED WITH PIP3
GETTING STARTED (CONT…)
WE SOURCED OUR DATA FROM:
https://openparliament.ca/
http://parl.gc.ca
DATA ANALYSIS
MANUALLY SKIMMED AND EXTRACTED FEATURES FROM ≈120 BILLS AND BUILT A SPREADSHEET
link: https://docs.google.com/spreadsheets/d/1kpbX78NZQ9bJHGVPoSmLE4LcE4Hht1UXxXg90gV1CVU/edit?usp=sharing
MODEL FEATURES
LENGTH OF BILL
NUMBER OF BILLS REFERENCED
AVERAGE SEMANTIC DISTANCE OF TOPICS IN EACH BILL
THE MODEL
THE CLASSIFIER
NAIVE BAYES
EASY
FAST
UNDERSTANDABLE
WORKS WELL WITH SMALL TRAINING SET (MAYBE NOT THIS SMALL)
LENGTH OF BILL
LENGTH OF RAW STRING READ IN FROM FILES
AS EASY AS: len(raw)
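The whole feature really is one call, as the slide says. A minimal sketch (the sample string is illustrative, not a real bill):

```python
# Length-of-bill feature: the character count of the raw string read from file.
def bill_length(raw: str) -> int:
    return len(raw)

print(bill_length("An Act to amend the Criminal Code"))  # 33
```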
NUMBER OF BILLS REFERENCED
(1) DATA RETRIEVAL
(2) PREPROCESSING
(3) NAMED ENTITY RECOGNITION (NER)
(1) DATA RETRIEVAL
2 DATA SETS TO COLLECT
• CONSOLIDATED LIST OF ACTS
• FULL TEXT OF BILLS
DATA RETRIEVAL CONT…
LIST OF ACTS PROVIDED BY GOVERNMENT OF CANADA (http://laws-lois.justice.gc.ca/eng/acts/)
WE NEEDED A WEB SCRAPER AS NO API IS AVAILABLE
• SCRAPY IS POWERFUL BUT NO PYTHON3 SUPPORT
• IMPORT.IO WORKED WELL FOR OUR NEEDS
DATA RETRIEVAL CONT…
TEXT OF BILLS RETRIEVED FROM OPENPARLIAMENT DATABASE USING SQL
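A sketch of the psycopg2 retrieval step. The table and column names (`bills_billtext`, `text_en`, `session`) and the DSN are assumptions standing in for the actual openparliament schema:

```python
# Pull bill text for one parliamentary session from a Postgres database.
def bill_text_query() -> str:
    # Parameterized SQL: psycopg2 substitutes %s safely at execute() time.
    return "SELECT text_en FROM bills_billtext WHERE session = %s"

def fetch_bills(dsn: str, session: str) -> list:
    import psycopg2  # pip3 install psycopg2
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(bill_text_query(), (session,))
        return [row[0] for row in cur.fetchall()]
```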
(2) PREPROCESSING
OPENPARLIAMENT DATABASE ISN’T PERFECT
• REMOVED DUPLICATES
• VERIFIED SESSION NUMBER WAS CORRECT
• CONVERTED EVERYTHING TO LOWERCASE
(3) NAMED ENTITY RECOGNITION
MANY APPROACHES TO THIS
• HAND-CRAFTED GRAMMAR BASED
• STATISTICAL MODELS
• MATCHING AGAINST A LIBRARY
NAMED ENTITY RECOGNITION CONT…
WE NOTICED COMMON PHRASES LIKE “AMENDS”, “RELATED AMENDMENTS”, “REPLACED BY” WHEN REFERENCING ACTS
ULTIMATELY WE MATCHED BILL TEXT AGAINST A LIBRARY
• THIS GAVE US GOOD RESULTS WITH LITTLE CODE
• WON’T ALWAYS WORK
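The library-matching approach can be sketched in a few lines. The act list here is a tiny stand-in for the full scraped list of consolidated acts:

```python
# Find which known act titles are referenced in a bill's text by simple
# case-insensitive substring matching against a library of act names.
ACTS = ["Criminal Code", "Income Tax Act", "Fisheries Act"]

def referenced_acts(bill_text: str, acts=ACTS) -> set:
    lowered = bill_text.lower()
    return {act for act in acts if act.lower() in lowered}

bill = ("This enactment amends the Criminal Code and makes "
        "related amendments to the Income Tax Act.")
print(len(referenced_acts(bill)))  # 2
```

As the slide warns, this won't always work: it misses acts absent from the library and any title rendered with different wording.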
SEMANTIC DISTANCE OF TOPICS
HYPOTHESIS:
SINCE AN OMNIBUS BILL HAS MANY DIFFERENT TOPICS THE AVERAGE DISTANCE BETWEEN TOPICS IN AN OMNIBUS BILL WILL BE GREATER THAN A NON-OMNIBUS BILL.
SEMANTIC DISTANCE OF TOPICS PROCEDURE
(1) PREPROCESS A BILL
(2) LDA TOPIC MODELLING ON THE BILL
(3) SEMANTIC SIMILARITY (DISTANCE MEASURE)
(4) AVERAGE TOPIC DISTANCE OF THE BILL
(1) PREPROCESSING
• READ IN FILES
• TOKENIZE WORDS
• REMOVE STOP WORDS
• IGNORE WORD ORDER (BAG OF WORDS)
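The preprocessing steps above, as a small sketch. The talk used NLTK's tokenizer and stop-word list; this stand-in uses a regex tokenizer and a tiny stop-word set so it runs without corpus downloads:

```python
# Tokenize, remove stop words, and ignore word order (bag of words).
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "to", "a", "an", "in"}  # NLTK's list is far longer

def preprocess(raw: str) -> Counter:
    tokens = re.findall(r"[a-z]+", raw.lower())          # tokenize words
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return Counter(tokens)                               # counts only: order ignored

bag = preprocess("An Act to amend the Criminal Code and the Criminal Records Act")
print(bag["criminal"])  # 2
```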
(2) LDA TOPIC MODELING
• PROBABILISTIC TOPIC MODEL
• WE ARE NOT USING IT IN ITS OPTIMAL APPLICATION
• PROBABILISTICALLY PRESUMES DOCUMENTS CONTAIN A HIDDEN STRUCTURE BUILT AROUND TOPICS
• IGNORES WORD ORDER
LDA CONT…
• MANY BILLS TOO SHORT FOR MEANINGFUL ANALYSIS W/ LDA
• BILLS THAT ARE TOO SHORT GET AN AGGREGATE SIMILARITY SCORE OF ‘1’
• THIS IS A REALLY BAD WORKAROUND
• WE IGNORE THE LDA TOPIC WEIGHTS/PROBABILITIES
• THIS IS AN OPTIMIZATION PROBLEM
MORE READING: https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
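A sketch of the LDA step with gensim, including the too-short workaround from the slides. `num_topics=5` and `MIN_TOKENS=100` are illustrative choices, not the values from the talk:

```python
# LDA topic extraction per bill, skipping bills that are too short.
MIN_TOKENS = 100  # assumed threshold

def too_short(tokens) -> bool:
    """Bills below the token threshold skip LDA and get similarity score 1."""
    return len(tokens) < MIN_TOKENS

def lda_topics(token_docs, num_topics=5):
    from gensim import corpora, models  # pip3 install gensim
    dictionary = corpora.Dictionary(token_docs)
    corpus = [dictionary.doc2bow(doc) for doc in token_docs]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # One list of (word, weight) pairs per topic; the weights are ignored
    # downstream, as the slides note.
    return [lda.show_topic(i) for i in range(num_topics)]
```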
(3) SEMANTIC SIMILARITY
LIN SIMILARITY
BUT WHAT DOES THIS MEAN???
WORDNET
A HIERARCHICAL TREE OF WORDS WITH MORE GENERAL WORDS AT THE ROOT AND MORE SPECIFIC WORDS AT THE LEAVES
SIMILARITY CONT…
LIN SIMILARITY
*OVERSIMPLIFICATION* THERE IS A GRAPH/NETWORK OF SYNONYMS - LIN SIMILARITY IS THE SHORTEST DISTANCE TO THE FIRST COMMON ANCESTOR (LOWEST COMMON ANCESTOR)
SIMILARITY CONT…
SCORES ARE BETWEEN 0 AND 1
>0.8 MEANS VERY SIMILAR
<0.2 MEANS NOT VERY SIMILAR
i.e. CAT & DOG = 0.88 OR 0.89 (BROWN AND SEMCOR IC)
HOUND & DOG = 0.88 OR 0.87 (BROWN AND SEMCOR IC)
CHAIR & DOG = 0.16 OR 0.18 (BROWN AND SEMCOR IC)
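A sketch of the Lin similarity lookups behind those numbers. The NLTK calls need the `wordnet` and `wordnet_ic` corpora downloaded, and taking the first noun sense of each word is a simplification:

```python
# Lin similarity between two nouns via WordNet and an information-content file.
def lin_sim(word_a: str, word_b: str, ic_file: str = "ic-brown.dat") -> float:
    from nltk.corpus import wordnet as wn, wordnet_ic
    ic = wordnet_ic.ic(ic_file)  # "ic-semcor.dat" gives the second column of scores
    a = wn.synsets(word_a, pos=wn.NOUN)[0]
    b = wn.synsets(word_b, pos=wn.NOUN)[0]
    return a.lin_similarity(b, ic)

def similarity_band(score: float) -> str:
    """The slide's rule of thumb: >0.8 very similar, <0.2 not very similar."""
    if score > 0.8:
        return "very similar"
    if score < 0.2:
        return "not very similar"
    return "in between"
```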
(4) AVG. TOPIC DISTANCE IN A BILL
WE CREATED AN AVERAGE SIMILARITY SCORE FOR EACH BILL:
SUM OF ALL COMPARED SCORES/TOTAL NUMBER OF COMPARISONS
THERE ARE FLAWS IN THIS APPROACH
• NOUN ONLY
• NO WEIGHTING
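The averaging step can be sketched directly from the formula: sum of all compared scores over the total number of comparisons. The toy scores here echo the cat/dog/chair examples and stand in for real Lin similarity calls:

```python
# Average pairwise similarity over all unique word pairs in a bill's topics.
from itertools import combinations

def average_similarity(words, sim) -> float:
    pairs = list(combinations(words, 2))
    if not pairs:
        return 1.0  # the slides' score for bills too short to compare
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

# Toy similarity function standing in for Lin similarity.
toy = {frozenset(("cat", "dog")): 0.88,
       frozenset(("cat", "chair")): 0.16,
       frozenset(("dog", "chair")): 0.16}
score = average_similarity(["cat", "dog", "chair"],
                           lambda a, b: toy[frozenset((a, b))])
print(round(score, 2))  # 0.4
```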
CLASSIFICATION!
WE WERE RUNNING OUT OF TIME…
WE WANTED TO COMPARE:
• NAIVE BAYES
• RANDOM FOREST DECISION TREE
• SVM
WE COMPARED:
• NAIVE BAYES!
CLASSIFIER COMPARISON
:(
NAIVE BAYES
• GAUSSIAN
• MULTINOMIAL
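The two scikit-learn Naive Bayes variants named above, fit on made-up feature rows of (bill length, bills referenced, average similarity). The values and labels are toy data, not the real training set:

```python
# Fit a Gaussian or Multinomial Naive Bayes classifier with scikit-learn.
def train_nb(features, labels, kind="gaussian"):
    from sklearn.naive_bayes import GaussianNB, MultinomialNB  # pip3 install scikit-learn
    model = GaussianNB() if kind == "gaussian" else MultinomialNB()
    return model.fit(features, labels)

X = [[1000, 1, 0.9], [50000, 40, 0.3], [800, 0, 0.95], [60000, 55, 0.25]]
y = [0, 1, 0, 1]  # 1 = omnibus (toy labels)
model = train_nb(X, y)
print(model.predict([[55000, 50, 0.3]])[0])
```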
MODEL EVALUATION
WE WON’T SHOW YOU ACCURACY BECAUSE…
CLASS IMBALANCE!
•9 OMNIBUS BILLS IN 120 BILLS
•7.5% CHANCE A BILL IS AN OMNIBUS BILL
•A CLASSIFIER COULD HAVE 92.5% ACCURACY BY PICKING ‘NOT OMNIBUS’ EVERY TIME!
PRECISION: True Positives / (True Positives + False Positives)
RECALL: True Positives / (True Positives + False Negatives)
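Both metrics fall straight out of the definitions. The example below scores the degenerate always-"not omnibus" classifier from the previous slide: 92.5% accurate, yet precision and recall are both zero:

```python
# Precision and recall computed directly from true/predicted labels (1 = omnibus).
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 9 omnibus bills in 120; predict "not omnibus" every time.
p, r = precision_recall([1] * 9 + [0] * 111, [0] * 120)
print(p, r)  # 0.0 0.0
```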
BUT WE HAVE A CLASS IMBALANCE PROBLEM
PRETENDING WE DON’T HAVE A PROBLEM
CLASS IMBALANCE SOLUTION
REMOVE THE IMBALANCE!!!!
WE WENT FROM 65 TRAINING EXAMPLES TO 25 TO 11 BY REMOVING NEGATIVE EXAMPLES
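A sketch of that undersampling step: keep every omnibus example and randomly drop "not omnibus" examples until the requested count remains. The seed and example data are illustrative:

```python
# Undersample the majority class to fight class imbalance.
import random

def undersample(examples, labels, n_negatives, seed=0):
    rng = random.Random(seed)
    pos = [(x, y) for x, y in zip(examples, labels) if y == 1]
    neg = [(x, y) for x, y in zip(examples, labels) if y == 0]
    neg = rng.sample(neg, min(n_negatives, len(neg)))
    return pos + neg

# 5 omnibus : 60 not-omnibus, cut down to 5:20 as in the slides.
sampled = undersample(list(range(65)), [1] * 5 + [0] * 60, n_negatives=20)
print(len(sampled))  # 25
```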
RESULTS
TRUE CLASS IMBALANCE (5:60)
NEW (5:20)
RATIOS ARE (# OMNIBUS : # NOT OMNIBUS)
REMOVING EVEN MORE
NEW (5:20)
NEWEST (5:6)
FINAL TRAINING SET
CONCLUSIONS
EITHER NEED:
(1) SUBSTANTIALLY MORE DATA, OR
(2) BETTER ACCURACY ON TOPIC EXTRACTION AND NAMED ENTITY RECOGNITION
LOTS OF ROOM FOR IMPROVEMENT
WE STILL THINK THREE FEATURES IS ENOUGH
NEED TO DO MORE WORK CLEANING/VALIDATING OUR INPUT DATA
CONCLUSIONS CONT…
WE ARE PERFORMING BETTER THAN RANDOM GUESSING!
WE WOULD LOVE HELP IMPROVING OUR APPROACH
WAYS TO IMPROVE
USE MORE COMPLEX NER IMPLEMENTATION TO IMPROVE ACCURACY
LINKED TOPIC MODELLING
IMPROVE WORD SIMILARITY APPROACH TO INCLUDE WEIGHTINGS
EXPERIMENT WITH DOCUMENT VECTORS AND NEURAL NETS
USE DIFFERENT DISTRIBUTIONS FOR DIFFERENT FEATURES (OPTIMIZATION OF CLASSIFIER)
TRY TF/IDF AS A DIFFERENT METHOD FOR MEASURING THE ‘SEMANTIC DIFFERENCE’ IN A DOCUMENT
EXPERIMENT WITH OTHER CLASSIFIERS
EXPERIMENT WITH MORE FEATURES
…
QUESTIONS?
Machine learning is no cakewalk.
Can we form a group to help Ottawa companies achieve greater success with ML?
What would this group do? Who would be in it?
How would it be funded? Do we have the local talent? What about protecting IP?
Who would make the decisions? Why bother?
We want your feedback! If you'd like to participate in ongoing discussions, please leave
us your contact info.
RELATIVE OPERATING CHARACTERISTICS (ROC)
[ROC plot: TRUE POSITIVE RATE vs. FALSE POSITIVE RATE, axes 0 to 1, with curves for Random Guess, Gaussian, and Multinomial]