decision trees machine learning
TRANSCRIPT
Decision Trees&
Machine LearningCS16: Introduction to Data Structures & Algorithms
Summer 2021
Machine Learning‣ Algorithms that use data to design algorithms
‣ Allows us to design algorithms‣ that predict the future (e.g., picking stocks)‣ even when we don’t know how (e.g., facial
recognition)2
Learning Algodata Algo Algo
input
output
3CS 147
Applications of ML‣ Agriculture
‣ Astronomy
‣ Bioinformatics
‣ Classifying DNA
‣ Computer Vision
‣ Finance
‣ Linguistics
‣ Medical diagnostics
‣ Insurance
‣ Economics
‣ Advertising
‣ Self-driving cars
‣ Recommendation systems (e.g., Netflix)
‣ Search engines
‣ Translations
‣ Robotics
‣ Risk assessment
‣ Drug discovery
‣ Fraud discovery
‣ Computational Anatomy
Classes of ML‣ Supervised learning‣ learn to make accurate predictions from training data
‣ Unsupervised learning‣ find patterns in data without training data
‣ Reinforcement learning‣ improve performance with positive and negative
feedback
5
Supervised Learning‣ Make accurate predictions/classifications
‣ Is this email spam?‣ Will the snowstorm cancel class?‣ Will this flight be delayed?‣ Will this candidate win the next election?
‣ How can our algorithm predict the future?‣ We train it using “training data” which are past examples‣ Examples of emails classified as spam and of emails classified as non-spam‣ Examples of snowstorms that have lead to cancelations and of snowstorms that
have not‣ Examples of flights that have been delayed and of flights that have left on time‣ Examples of candidates that won and of candidates that have lost
6
Supervised Learning‣ Training data is a collection of examples
‣ An example includes an input and its classification‣ inputs: flights, snowstorms, candidates, …‣ classifications: delayed/non-delayed, canceled/not canceled, win/lose
‣ But how do we represent inputs for our algorithm?‣ What is a student? what is a flight? what is an email?
‣ We have to choose attributes that describe the inputs
‣ flight is represented by: source, destination, airline, number of passengers, …
‣ snowstorm is represented by: duration, expected inches, winds, …
‣ candidate is represented by: district, political affiliation, experience, …
7
Example: Waiting for a Table‣ Design algorithm that predicts if patron will wait
for a table‣ What are the inputs?‣ the “context” of the patron’s decision
‣ What are the attributes of this context?‣ is patron hungry? is the line long?
8
Example: Waiting for a Table?‣ Input attributes
‣ A1: Alternatives = {Yes, No}‣ A2: Bar = {Yes, No}‣ A3: Fri/Sat = {Yes, No}‣ A4: Hungry = {Yes, No}‣ A5: Patrons = {None, Some, Full}‣ A6: Price = {$, $$, $$$}‣ A7: Raining = {Yes, No}‣ A8: Reservation = {Yes, No}‣ A9: Type = {French, Italian, Thai, Burger}‣ A10: Wait = {10-30, 30-60, >60}
‣ Classification: {Yes, No}
9
Training Data
10
S. Russel & P. Norvig. Artificial Intelligence - A Modern Approach
Supervised Learning‣ Classification‣ If classifications are from a finite set ‣ ex: spam/not spam, delayed/not delayed
‣ Regression‣ If classifications are real numbers‣ ex: temperature
11
Decision Trees‣ A decision tree maps ‣ inputs represented by attributes… ‣ …to a classification
‣ Examples‣ snowstorm_dt(12h,8”,strong winds) returns Yes
‣ flight_dt(DL,PVD,Paris,night,no_storm,…) returns No
‣ restaurant_dt(estimate,hungry,patrons,…) returns No
12
Decision Tree Example
13
Decision Tree Example
14
Our Goal: Learning a Decision Tree
15
Training Data
Decision Tree
Learn
What is a Good Decision Tree?
16
What is a Good Decision Tree?‣ Consistent with training data
‣ classifies training examples correctly
‣ Performs well on future examples‣ classifies future inputs correctly
‣ As small as possible‣ Efficient classification
‣ How can we find a small decision tree?‣ there are possible decision
trees‣ so brute force is not possible
17
⌦(22n
)
Iterative Dichotomizer 3 (ID3) Algorithm
18
Data
(Learned) Decision Tree
ID3 Ross Quinlan
ID3
‣ Starting at root‣ node is either an attribute node or a classification node (leaf)‣ outgoing edges are labeled with attribute values
‣ children are either a classification node or another attribute node
‣ Tree should be as small as possible19
ID3
20
Patrons
Some
FullNone
4xYes2xYes4xNo2xNo
6xYes6xNo
Type
French
Thai Burger Italian
1xYes1xNo
2xYes2xNo
2xYes2xNo
1xYes1xNo
6xYes6xNo
ID3
21
French
Thai Burger Italian
1xYes1xNo
2xYes2xNo
2xYes2xNo
1xYes1xNo
Type
6xYes6xNo
Uncertainty about whetherwe should wait or not
ID3
22
Type
French
Thai Burger Italian
1xYes1xNo
2xYes2xNo
2xYes2xNo
1xYes1xNo
Patrons
Some
FullNone
4xYes2xYes4xNo2xNo
6xYes6xNo
6xYes6xNo
ID3‣ Start at root with entire training data‣ Choose attribute that creates a “good split”‣ Attribute “splits” data into subsets‣ good split: children with subsets that are unmixed (with
same classification)‣ bad split: children with subsets that are mixed (with
different classification)
‣ Children with unmixed subsets lead to a classification‣ Children with mixed subsets handled with recursion
23
ID3
24
Type
French
Thai Burger Italian
1xYes1xNo
2xYes2xNo
2xYes2xNo
1xYes1xNo
Patrons
Some
FullNone
4xYes2xYes4xNo2xNo
How do we distinguish
from“bad” attributes “good” attributes
many mixed subsets many unmixed subsets
6xYes6xNo
6xYes6xNo
ID3‣ How do we decide if attribute is good?‣ Compute entropy of each child ‣ quantifies how mixed/alike it is
‣ quantifies amount of certainty/uncertainty
‣ Combine the entropies of all the children‣ Compare combined entropy of children to entropy
of node‣ This is called the information gain
25
Entropy‣ Entropy of a dataset of examples
26
H(data) = �✓
#Yes
#Yes+#No· log2
✓#Yes
#Yes+#No
◆
+
✓1� #Yes
#Yes+#No
◆· log2
✓1� #Yes
#Yes+#No
◆◆
<latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit><latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit><latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit><latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit>
log(0) = 0<latexit sha1_base64="NDBv5UwjCJzY/ZoaB4AHXijEA4w=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBahXkoignoQil48VjC2kIay2W7apZvdsLsRSujP8OJBxav/xpv/xm2bg7Y+GHi8N8PMvCjlTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLTBHqE8ml6kRYU84E9Q0znHZSRXEScdqORrdTv/1ElWZSPJhxSsMEDwSLGcHGSkGXy0HdPUXXyO1Va27DnQEtE68gNSjQ6lW/un1JsoQKQzjWOvDc1IQ5VoYRTieVbqZpiskID2hgqcAJ1WE+O3mCTqzSR7FUtoRBM/X3RI4TrcdJZDsTbIZ60ZuK/3lBZuLLMGcizQwVZL4ozjgyEk3/R32mKDF8bAkmitlbERlihYmxKVVsCN7iy8vEP2tcNbz781rzpkijDEdwDHXw4AKacAct8IGAhGd4hTfHOC/Ou/Mxby05xcwh/IHz+QNcSY+I</latexit><latexit sha1_base64="NDBv5UwjCJzY/ZoaB4AHXijEA4w=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBahXkoignoQil48VjC2kIay2W7apZvdsLsRSujP8OJBxav/xpv/xm2bg7Y+GHi8N8PMvCjlTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLTBHqE8ml6kRYU84E9Q0znHZSRXEScdqORrdTv/1ElWZSPJhxSsMEDwSLGcHGSkGXy0HdPUXXyO1Va27DnQEtE68gNSjQ6lW/un1JsoQKQzjWOvDc1IQ5VoYRTieVbqZpiskID2hgqcAJ1WE+O3mCTqzSR7FUtoRBM/X3RI4TrcdJZDsTbIZ60ZuK/3lBZuLLMGcizQwVZL4ozjgyEk3/R32mKDF8bAkmitlbERlihYmxKVVsCN7iy8vEP2tcNbz781rzpkijDEdwDHXw4AKacAct8IGAhGd4hTfHOC/Ou/Mxby05xcwh/IHz+QNcSY+I</latexit><latexit sha1_base64="NDBv5UwjCJzY/ZoaB4AHXijEA4w=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBahXkoignoQil48VjC2kIay2W7apZvdsLsRSujP8OJBxav/xpv/xm2bg7Y+GHi8N8PMvCjlTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLTBHqE8ml6kRYU84E9Q0znHZSRXEScdqORrdTv/1ElWZSPJhxSsMEDwSLGcHGSkGXy0HdPUXXyO1Va27DnQEtE68gNSjQ6lW/un1JsoQKQzjWOvDc1IQ5VoYRTieVbqZpiskID2hgqcAJ1WE+O3mCTqzSR7FUtoRBM/X3RI4TrcdJZDsTbIZ60ZuK/3lBZuLLMGcizQwVZL4ozjgyEk3/R32mKDF8bAkmitlbERlihYmxKVVsCN7iy8vEP2tcNbz781rzpkijDEdwDHXw4AKacAct8IGAhGd4hTfHOC/Ou/Mxby05xcwh/IHz+QNcSY+I</latexit><latexit sha1_base64="NDBv5UwjCJzY/ZoaB4AHXijEA4w=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBahXkoignoQil48VjC2kIay2W7apZvdsLsRSujP8OJBxav/xpv/xm2bg7Y+GHi8N8PMvCjlTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLTBHqE8ml6kRYU84E9Q0znHZSRXEScdqORrdTv/1ElWZSPJhxSsMEDwSLGcHGSkGXy0HdPUXXyO1Va27DnQEtE68gNSjQ6lW/un1JsoQKQzjWOvDc1IQ5VoYRTieVbqZpiskID2hgqcAJ1WE+O3mCTqzSR7FUtoRBM/X3RI4TrcdJZDsTbIZ60ZuK/3lBZuLLMGcizQwVZL4ozjgyEk3/R32mKDF8bAkmitlbERlihYmxKVVsCN7iy8vEP2tcNbz781rzpkijDEdwDHXw4AKacAct8IGAhGd4hTfHOC/Ou/Mxby05xcwh/IHz+QNcSY+I</latexit>
[ ]
Entropy‣ Entropy of a dataset of examples
‣ Intuition:‣ H(perfectly mixed) = 1‣ H(all the same) = 0
27
H(data) = �✓
#Yes
#Yes+#No· log2
✓#Yes
#Yes+#No
◆
+
✓1� #Yes
#Yes+#No
◆· log2
✓1� #Yes
#Yes+#No
◆◆
<latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit><latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit><latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit><latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit>
log(0) = 0<latexit sha1_base64="NDBv5UwjCJzY/ZoaB4AHXijEA4w=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBahXkoignoQil48VjC2kIay2W7apZvdsLsRSujP8OJBxav/xpv/xm2bg7Y+GHi8N8PMvCjlTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLTBHqE8ml6kRYU84E9Q0znHZSRXEScdqORrdTv/1ElWZSPJhxSsMEDwSLGcHGSkGXy0HdPUXXyO1Va27DnQEtE68gNSjQ6lW/un1JsoQKQzjWOvDc1IQ5VoYRTieVbqZpiskID2hgqcAJ1WE+O3mCTqzSR7FUtoRBM/X3RI4TrcdJZDsTbIZ60ZuK/3lBZuLLMGcizQwVZL4ozjgyEk3/R32mKDF8bAkmitlbERlihYmxKVVsCN7iy8vEP2tcNbz781rzpkijDEdwDHXw4AKacAct8IGAhGd4hTfHOC/Ou/Mxby05xcwh/IHz+QNcSY+I</latexit><latexit sha1_base64="NDBv5UwjCJzY/ZoaB4AHXijEA4w=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBahXkoignoQil48VjC2kIay2W7apZvdsLsRSujP8OJBxav/xpv/xm2bg7Y+GHi8N8PMvCjlTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLTBHqE8ml6kRYU84E9Q0znHZSRXEScdqORrdTv/1ElWZSPJhxSsMEDwSLGcHGSkGXy0HdPUXXyO1Va27DnQEtE68gNSjQ6lW/un1JsoQKQzjWOvDc1IQ5VoYRTieVbqZpiskID2hgqcAJ1WE+O3mCTqzSR7FUtoRBM/X3RI4TrcdJZDsTbIZ60ZuK/3lBZuLLMGcizQwVZL4ozjgyEk3/R32mKDF8bAkmitlbERlihYmxKVVsCN7iy8vEP2tcNbz781rzpkijDEdwDHXw4AKacAct8IGAhGd4hTfHOC/Ou/Mxby05xcwh/IHz+QNcSY+I</latexit><latexit sha1_base64="NDBv5UwjCJzY/ZoaB4AHXijEA4w=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBahXkoignoQil48VjC2kIay2W7apZvdsLsRSujP8OJBxav/xpv/xm2bg7Y+GHi8N8PMvCjlTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLTBHqE8ml6kRYU84E9Q0znHZSRXEScdqORrdTv/1ElWZSPJhxSsMEDwSLGcHGSkGXy0HdPUXXyO1Va27DnQEtE68gNSjQ6lW/un1JsoQKQzjWOvDc1IQ5VoYRTieVbqZpiskID2hgqcAJ1WE+O3mCTqzSR7FUtoRBM/X3RI4TrcdJZDsTbIZ60ZuK/3lBZuLLMGcizQwVZL4ozjgyEk3/R32mKDF8bAkmitlbERlihYmxKVVsCN7iy8vEP2tcNbz781rzpkijDEdwDHXw4AKacAct8IGAhGd4hTfHOC/Ou/Mxby05xcwh/IHz+QNcSY+I</latexit><latexit sha1_base64="NDBv5UwjCJzY/ZoaB4AHXijEA4w=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBahXkoignoQil48VjC2kIay2W7apZvdsLsRSujP8OJBxav/xpv/xm2bg7Y+GHi8N8PMvCjlTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLTBHqE8ml6kRYU84E9Q0znHZSRXEScdqORrdTv/1ElWZSPJhxSsMEDwSLGcHGSkGXy0HdPUXXyO1Va27DnQEtE68gNSjQ6lW/un1JsoQKQzjWOvDc1IQ5VoYRTieVbqZpiskID2hgqcAJ1WE+O3mCTqzSR7FUtoRBM/X3RI4TrcdJZDsTbIZ60ZuK/3lBZuLLMGcizQwVZL4ozjgyEk3/R32mKDF8bAkmitlbERlihYmxKVVsCN7iy8vEP2tcNbz781rzpkijDEdwDHXw4AKacAct8IGAhGd4hTfHOC/Ou/Mxby05xcwh/IHz+QNcSY+I</latexit>
[ ]
ID3 - Information Gain
28
Type
French
Thai Burger Italian
1xYes1xNo
2xYes2xNo
2xYes2xNo
1xYes1xNo
Patrons
Some
FullNone
4xYes2xYes4xNo2xNo
6xYes6xNo
6xYes6xNo
1
entropy
1
entropy
1
entropy
1
entropy
0 0 0.8
entropy entropy entropyWeighted Sum
Weighted Sum
entropy
1
remainder
1- = 0entropy
1
remainder
0.4- = 0.6
ID3 - Notation
29
Type
French
Thai Burger Italian
1xYes1xNo
2xYes2xNo
2xYes2xNo
1xYes1xNo
6xYes6xNodata0
data2 data3data1 data4
Information gain‣ Entropy of a dataset of examples
‣ Remainder of an attribute
‣ Information gain of an attribute
30
Rem(Att) =dX
i=1
#datai#data1 + · · ·+#datad
·H(datai)<latexit sha1_base64="bhQeD3I6bm4u58OIK2OjRa+Dlb4=">AAACYXicbVFNT9swGHYyGNAxCHDk8ooKqQipSriwHZBgHODIJgpITRccx6EWdhLZb5CqyH9oP2c3Tlz4Ibgph/LxSpYePx+y/TitpDAYho+e/2Vh8evS8krn2+r3tfVgY/PKlLVmfMBKWeqblBouRcEHKFDym0pzqlLJr9P706l+/cC1EWVxiZOKjxS9K0QuGEVHJcGkiU0Of7iyvRadINo9OILY1CppxFFk/2YQ55qyJu62jowiTYS1b/aRhX2IWVaigf15IbN2xsN5bz6+10mCbtgP24GPIHoF3eMz+HdLCLlIgv9xVrJa8QKZpMYMo7DCUUM1Cia57cS14RVl9/SODx0sqOJm1LQVWdh1TAZ5qd0qEFp2PtFQZcxEpc6pKI7Ne21KfqYNa8x/jBpRVDXygs0OymsJWMK0b8iE5gzlxAHKtHB3BTamrk90vzItIXr/5I9gcND/2Y9+uzJ+kdksk22yQ3okIofkmJyTCzIgjDx5i96at+49+x0/8DdnVt97zWyRN+NvvwA+PbZf</latexit><latexit sha1_base64="2CojO+9lrlrcFJZSlyilaAQHBFo=">AAACYXicbVFNT9swGHayAV3HR+iOXF5RTSpCqhIuwAEJ2GEc2bQOpKZEjuMUCzuJ7DdIVfAf2s/Zbadd+CG4KYfy8UqWHj8fsv04raQwGIb/PP/Dx5XVtc6n7uf1jc2tYLv325S1ZnzESlnq65QaLkXBRyhQ8utKc6pSya/Su29z/eqeayPK4hfOKj5RdFqIXDCKjkqCWRObHH5yZQctOkO0e3ACsalV0oiTyN5kEOeasibut46MIk2EtS/2kYV9iFlWooH9ZSGzdsHDxWA5vtdNgn44DNuBtyB6Bv3T7/AnTh6ml0nwN85KViteIJPUmHEUVjhpqEbBJLfduDa8ouyOTvnYwYIqbiZNW5GFr47JIC+1WwVCyy4nGqqMmanUORXFW/Nam5PvaeMa86NJI4qqRl6wxUF5LQFLmPcNmdCcoZw5QJkW7q7AbqnrE92vzEuIXj/5LRgdDI+H0Q9XxjlZTIfskF0yIBE5JKfkglySEWHkv7fibXpb3qPf9QO/t7D63nPmC3kx/s4T5dK3nQ==</latexit><latexit sha1_base64="2CojO+9lrlrcFJZSlyilaAQHBFo=">AAACYXicbVFNT9swGHayAV3HR+iOXF5RTSpCqhIuwAEJ2GEc2bQOpKZEjuMUCzuJ7DdIVfAf2s/Zbadd+CG4KYfy8UqWHj8fsv04raQwGIb/PP/Dx5XVtc6n7uf1jc2tYLv325S1ZnzESlnq65QaLkXBRyhQ8utKc6pSya/Su29z/eqeayPK4hfOKj5RdFqIXDCKjkqCWRObHH5yZQctOkO0e3ACsalV0oiTyN5kEOeasibut46MIk2EtS/2kYV9iFlWooH9ZSGzdsHDxWA5vtdNgn44DNuBtyB6Bv3T7/AnTh6ml0nwN85KViteIJPUmHEUVjhpqEbBJLfduDa8ouyOTvnYwYIqbiZNW5GFr47JIC+1WwVCyy4nGqqMmanUORXFW/Nam5PvaeMa86NJI4qqRl6wxUF5LQFLmPcNmdCcoZw5QJkW7q7AbqnrE92vzEuIXj/5LRgdDI+H0Q9XxjlZTIfskF0yIBE5JKfkglySEWHkv7fibXpb3qPf9QO/t7D63nPmC3kx/s4T5dK3nQ==</latexit><latexit sha1_base64="AIBnFzLOhWi2JHEzoGiTa9O/yNY=">AAACYXicbVFNS8MwGE7r9/yq8+glOISJMFov6kHw47KjilNhnSVNUw1L2pK8FUbJn/TmyYs/xKzbYXO+EHjyfJDkSVwIrsH3vxx3aXlldW19o7G5tb2z6+01n3ReKsp6NBe5eomJZoJnrAccBHspFCMyFuw5Ht6O9ecPpjTPs0cYFWwgyVvGU04JWCryRlWoU/zApGnX6BrAHONLHOpSRhW/DMxrgsNUEVqFrdqRECARN2ZuHxh8gkOa5KDxyayQGDPhcbc9Gz9uRF7L7/j14EUQTEELTecu8j7DJKelZBlQQbTuB34Bg4oo4FQw0whLzQpCh+SN9S3MiGR6UNUVGXxkmQSnubIrA1yzs4mKSK1HMrZOSeBd/9XG5H9av4T0fFDxrCiBZXRyUFoKDDke940TrhgFMbKAUMXtXTF9J7ZPsL8yLiH4++RF0DvtXHSCe791dTNtYx0doEPURgE6Q1eoi+5QD1H07aw4O86u8+M2XM9tTqyuM83so7lxD34B9yq0rg==</latexit>
Gain(Att) = H(data0)� Rem(Att)<latexit sha1_base64="SPj7aZrFYj9woaAL0BHWfCVaFTI=">AAACJXicbVBNSwMxEM36WetX1aOXYBHag2VXBPVQqHrQo4pVoS1lNs1qaDa7JLNCWfbXePGvePFQRfDkXzHd9uDXg8Cb92aYzPNjKQy67oczNT0zOzdfWCguLi2vrJbW1q9NlGjGmyySkb71wXApFG+iQMlvY80h9CW/8fsnI//mgWsjInWFg5h3QrhTIhAM0ErdUj1tm4CeglBZJadHiFm1Ts/GVQ8Qsq5bpTs0ry95+L2vWyq7NTcH/Uu8CSmTCc67pWG7F7Ek5AqZBGNanhtjJwWNgkmeFduJ4TGwPtzxlqUKQm46aX5mRret0qNBpO1TSHP1+0QKoTGD0LedIeC9+e2NxP+8VoLBQScVKk6QKzZeFCSSYkRHmdGe0JyhHFgCTAv7V8ruQQNDm2zRhuD9Pvkvae7WDmvexV65cTxJo0A2yRapEI/skwY5I+ekSRh5JM9kSF6dJ+fFeXPex61TzmRmg/yA8/kFd56jhQ==</latexit><latexit sha1_base64="SPj7aZrFYj9woaAL0BHWfCVaFTI=">AAACJXicbVBNSwMxEM36WetX1aOXYBHag2VXBPVQqHrQo4pVoS1lNs1qaDa7JLNCWfbXePGvePFQRfDkXzHd9uDXg8Cb92aYzPNjKQy67oczNT0zOzdfWCguLi2vrJbW1q9NlGjGmyySkb71wXApFG+iQMlvY80h9CW/8fsnI//mgWsjInWFg5h3QrhTIhAM0ErdUj1tm4CeglBZJadHiFm1Ts/GVQ8Qsq5bpTs0ry95+L2vWyq7NTcH/Uu8CSmTCc67pWG7F7Ek5AqZBGNanhtjJwWNgkmeFduJ4TGwPtzxlqUKQm46aX5mRret0qNBpO1TSHP1+0QKoTGD0LedIeC9+e2NxP+8VoLBQScVKk6QKzZeFCSSYkRHmdGe0JyhHFgCTAv7V8ruQQNDm2zRhuD9Pvkvae7WDmvexV65cTxJo0A2yRapEI/skwY5I+ekSRh5JM9kSF6dJ+fFeXPex61TzmRmg/yA8/kFd56jhQ==</latexit><latexit sha1_base64="SPj7aZrFYj9woaAL0BHWfCVaFTI=">AAACJXicbVBNSwMxEM36WetX1aOXYBHag2VXBPVQqHrQo4pVoS1lNs1qaDa7JLNCWfbXePGvePFQRfDkXzHd9uDXg8Cb92aYzPNjKQy67oczNT0zOzdfWCguLi2vrJbW1q9NlGjGmyySkb71wXApFG+iQMlvY80h9CW/8fsnI//mgWsjInWFg5h3QrhTIhAM0ErdUj1tm4CeglBZJadHiFm1Ts/GVQ8Qsq5bpTs0ry95+L2vWyq7NTcH/Uu8CSmTCc67pWG7F7Ek5AqZBGNanhtjJwWNgkmeFduJ4TGwPtzxlqUKQm46aX5mRret0qNBpO1TSHP1+0QKoTGD0LedIeC9+e2NxP+8VoLBQScVKk6QKzZeFCSSYkRHmdGe0JyhHFgCTAv7V8ruQQNDm2zRhuD9Pvkvae7WDmvexV65cTxJo0A2yRapEI/skwY5I+ekSRh5JM9kSF6dJ+fFeXPex61TzmRmg/yA8/kFd56jhQ==</latexit><latexit sha1_base64="SPj7aZrFYj9woaAL0BHWfCVaFTI=">AAACJXicbVBNSwMxEM36WetX1aOXYBHag2VXBPVQqHrQo4pVoS1lNs1qaDa7JLNCWfbXePGvePFQRfDkXzHd9uDXg8Cb92aYzPNjKQy67oczNT0zOzdfWCguLi2vrJbW1q9NlGjGmyySkb71wXApFG+iQMlvY80h9CW/8fsnI//mgWsjInWFg5h3QrhTIhAM0ErdUj1tm4CeglBZJadHiFm1Ts/GVQ8Qsq5bpTs0ry95+L2vWyq7NTcH/Uu8CSmTCc67pWG7F7Ek5AqZBGNanhtjJwWNgkmeFduJ4TGwPtzxlqUKQm46aX5mRret0qNBpO1TSHP1+0QKoTGD0LedIeC9+e2NxP+8VoLBQScVKk6QKzZeFCSSYkRHmdGe0JyhHFgCTAv7V8ruQQNDm2zRhuD9Pvkvae7WDmvexV65cTxJo0A2yRapEI/skwY5I+ekSRh5JM9kSF6dJ+fFeXPex61TzmRmg/yA8/kFd56jhQ==</latexit>
H(data) = �✓
#Yes
#Yes+#No· log2
✓#Yes
#Yes+#No
◆
+
✓1� #Yes
#Yes+#No
◆· log2
✓1� #Yes
#Yes+#No
◆◆
<latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit><latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit><latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit><latexit sha1_base64="CUCt8fc/HqYRscmeEmG1tEW24LM=">AAADGXiclVJNb9QwEHVSPkr42sKRy4gV1VZVo6RCajkgVXDpCRWJpUXr1cpxnF2rThzsCdIqyu/opX+lFw6AOMKJf4M3G6HS5UBHsvw8b+Z5PJ6kVNJiFP3y/LUbN2/dXr8T3L13/8HD3saj91ZXhosh10qbk4RZoWQhhihRiZPSCJYnShwnp68X/PEnYazUxTucl2Kcs2khM8kZOtdkw4sOBzW1GaQMWbMFL2FnkyqR4YBmhvGa9lv2g7BNc/mw3eE3umkoTzVSpaeT3WUq/H8uUCOnM9yCDoRAabBJP1Yshe2glQuX2yCGnWsId7qrtcXXUflTXtCBIJj0+lEYtQarIO5An3R2NOn9oKnmVS4K5IpZO4qjEsc1Myi5Ek1AKytKxk/ZVIwcLFgu7Lhuv7aBZ86TQqaNWwVC672cUbPc2nmeuMic4cxe5RbOf3GjCrP9cS2LskJR8OVFWaUANSzmBFJpBEc1d4BxI12twGfM9Q3dNC2aEF998ioY7oYvwvjt8/7Bq64b6+QJeUoGJCZ75IAckiMyJNw78y68L95X/9z/7H/zvy9Dfa/LeUz+Mv/nb1d/+DI=</latexit>
ID3 Pseudocode
31
id3_algorithm(data, attributes, parent_data):if data is empty:
return new node w/ most frequent classification in parent_data else if all examples in data have same classification:
return new node with that classificationelse if attributes is empty:
return new node w/ most frequent classification in dataelse:
A = attribute with largest information gaintree = new decision tree with root Afor each value a of attribute A
new_data = all examples in data such that example.A = asubtree = id3_algorithm(new_data, attributes - A, data)attach subtree to tree with branch labeled “a”
return tree
No examples w/ this combination of attribute values
Examples w/ this combination of
attribute values but they have different
classifications
Testing Decision Trees‣ ID3 learns a decision tree from training data‣ How good is this decision tree?
‣ How accurately does it classify inputs it hasn’t seen?‣ New flights, new snowstorms, new patrons, …
‣ Split examples into two non-overlapping sets ‣ training data & testing data‣ by assigning each example to one set uniformly at random
‣ Accuracy of decision tree‣ # of correct classifications on testing data / # of testing examples
32
ML Applications‣ Learned algorithms are different than classical algorithms
‣ A learned algorithm depends on training data‣ A learned algorithm depends on features
‣ Learned algorithms are being embedded everywhere‣ Traditional applications
‣ spell checking‣ ad targeting‣ recommendations‣ speech & handwriting recognition‣ computer vision
33
ML Applications‣ Newer applications‣ News feeds, self-driving cars‣ insurance companies, medical systems, K-12 education‣ Risk assessment
‣ These have serious impact on society‣ should news be tailored to your political beliefs?‣ should car save driver or pedestrian?‣ should we deny freedoms based on risk assessments?
34
Risk Assessment‣ Learned algorithms that predict risk‣ Insurance premiums‣ Credit‣ Crime‣ Recidivism
‣ Bonds
‣ Sentencing
35
Criminal Risk Assessment‣ ML in criminal justice system‣ better predict who will commit new crimes
‣ Purported goals‣ a fairer system‣ less people in jail‣ for less time
36
Criminal Risk Assessment Tools‣ COMPAS by Northpointe predicts‣ Risk of new violent crime‣ Risk of general recidivism‣ Pretrial risk (failure to appear)
‣ Public Safety Assessment by Arnold Foundation‣ helps Judges make release/detention decisions
37
Criminal Risk Assessment Tools‣ Used in ‣ Arizona, Colorado, Delaware, Kentucky, Louisiana,
Maryland, New York, Oklahoma, Virginia, Washington, Wisconsin, …
‣ Often adopted without independent study of effectiveness
‣ Example ‣ New York started using tool in 2001
‣ Studied it only in 201238
ProPublica Study‣ In 2016 ProPublica conducted a study of COMPAS
‣ 7000 arrests in Broward County, FL
‣ between 2013 and 2014
‣ OK predictions for all crimes (misdemeanors included)
‣ 61% of people labeled high risk committed new crimes
‣ But unreliable for violent crimes
‣ 20% of people labeled high risk committed new violent crimes
39
ProPublica Study‣ Found significant racial disparities‣ Out of people labeled high risk but didn’t re-offend‣ 44.9% were African American
‣ 23.5% were white
‣ Out of people labeled low risk but did re-offend‣ 28% were African American
‣ 47.7% were white
‣ Study accounted for ‣ Criminal history, age and gender
40
Algorithms‣ Algorithms are impacting our lives‣ You are learning how to ‣ design algorithms‣ analyze algorithms‣ implement algorithms
‣ But don’t forget that… ‣ …your algorithms will impact people
41
Algorithms‣ Algorithms can do a lot amazing things‣ But they can also‣ deny people their freedom‣ addict people‣ sway elections‣ expose people’s private lives‣ …
42
References‣ Artificial Intelligence - A Modern Approach by S. Russel & P. Norvig‣ Machine Bias by ProPublica
‣ https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
‣ New York State COMPAS-Probation Risk and Need Assessment Study
‣ http://www.northpointeinc.com/downloads/research/DCJS_OPCA_COMPAS_Probation_Validity.pdf
‣ Evaluating the Predictive Validity of the COMPAS Risk and Needs Assessment System by Northpointe‣ http://www.northpointeinc.com/files/publications/Criminal-Justice-
Behavior-COMPAS.pdf
43
References‣ Fairness, Accountability and Transparency in ML‣ http://fatml.mysociety.org
‣ Center for Humane Technology‣ http://humanetech.com/
44