A Beginner's Guide to Machine Learning with Scikit-Learn
DESCRIPTION
This talk was given at the PyData NYC 2013 conference (http://vimeo.com/79517341) and will be given at PyTennessee 2014. Scikit-learn is one of the best-known machine learning modules for Python. But how does it work, and what, for that matter, is machine learning? For those who have programming experience but are new to machine learning, this talk gives a beginner-level overview of how machine learning can be useful, the important machine learning concepts, and how to implement them with scikit-learn. We’ll use real-world data to look at supervised and unsupervised machine learning algorithms and at why scikit-learn is useful for these tasks.
TRANSCRIPT
![Page 1: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/1.jpg)
A Beginner’s Guide to Machine Learning with Scikit-Learn
Sarah Guido
PyTennessee 2014
![Page 2: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/2.jpg)
All about me
• Grad student at the University of Michigan
• Data analyst for HathiTrust
• Organizer of Ann Arbor PyLadies chapter
![Page 3: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/3.jpg)
My talk
• Machine learning and scikit-learn
• Supervised and unsupervised learning
• Preprocessing, validation and testing, strategies for machine learning
![Page 4: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/4.jpg)
What is machine learning?
• Application of algorithms that learn from examples
• Representation and generalization
![Page 5: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/5.jpg)
Why should we care?
• Useful in everyday life
  • Email spam, handwriting analysis, stock market analysis, Netflix
• Especially useful in data analysis
  • Feature extraction, linear regression, classification, clustering
![Page 6: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/6.jpg)
Machine Learning Vocab
• Instance
• Feature
• Class
• Categorical
  • Nominal
  • Ordinal
• Continuous
![Page 7: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/7.jpg)
Machine Learning Vocab
[Figure labels: Feature, Class, Instance]
![Page 8: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/8.jpg)
Scikit-Learn
• Machine learning module
• Open-source
• Built-in datasets
• Good resources for learning
![Page 9: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/9.jpg)
Scikit-Learn
• Model = EstimatorObject()
• Model.fit(dataset.data, dataset.target)
  • dataset.data = dataset
  • dataset.target = labels
• Model.predict(dataset.data)
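The fit/predict pattern above can be sketched on one of scikit-learn's built-in datasets. This is a minimal sketch, not the code from the talk: the iris dataset and the `GaussianNB` estimator are assumptions chosen only because they ship with the library.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Built-in dataset (assumed for illustration):
# .data holds the feature values, .target holds the labels.
dataset = load_iris()

model = GaussianNB()                        # Model = EstimatorObject()
model.fit(dataset.data, dataset.target)     # learn from features and labels
predictions = model.predict(dataset.data)   # one predicted label per instance

print(predictions[:5])
```

Predicting on the same data the model was fitted on only shows the shape of the API; it says little about how well the model generalizes.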
![Page 10: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/10.jpg)
Scikit-Learn
• Supervised
• Unsupervised
• Semi-supervised
• Reinforcement learning
• Neural networks
• …and many more!
![Page 11: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/11.jpg)
Supervised learning
• Labeled data
• You know what you’re looking for
• Classification: predict categorical labels
• Regression: predict continuous target variables
![Page 12: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/12.jpg)
Classification
• Categorical variables
• Relationship between instance and feature
• Classification algorithms == classifiers
![Page 13: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/13.jpg)
Classification
• Naïve Bayes classifier
• Assumes features are independent of one another, given the class
• Fast performance
• Decent classifier
![Page 14: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/14.jpg)
Classification
• Car evaluation dataset (UCI)
• Features: buying price, maintenance price, number of doors, number of seats, size of the trunk, and safety ranking
• Labels: unacceptable, acceptable, good, or very good
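A rough sketch of how a Naïve Bayes classifier might be fitted to data shaped like the car evaluation dataset. The six rows and their labels below are made up for illustration and are not taken from the real file; `OrdinalEncoder` and `CategoricalNB` are one possible choice of tools, not necessarily what the talk used.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows laid out like the car evaluation data:
# buying, maint, doors, persons, lug_boot, safety
rows = [
    ["vhigh", "vhigh", "2", "2", "small", "low"],
    ["high", "med", "4", "4", "med", "high"],
    ["med", "low", "4", "more", "big", "med"],
    ["low", "low", "5more", "more", "big", "high"],
    ["low", "med", "3", "4", "med", "high"],
    ["vhigh", "high", "2", "2", "small", "med"],
]
# Illustrative labels only (unacceptable, acceptable, good, very good).
labels = ["unacc", "acc", "good", "vgood", "good", "unacc"]

# Turn each string category into an integer code, column by column.
encoder = OrdinalEncoder()
X = encoder.fit_transform(rows).astype(int)

# Naive Bayes variant for categorical features.
model = CategoricalNB()
model.fit(X, labels)
predicted = model.predict(X)

# Classify a new, equally hypothetical car built from categories seen above.
new_car = encoder.transform([["low", "low", "4", "4", "big", "high"]]).astype(int)
print(model.predict(new_car))
```

With the real dataset, the rows would be loaded from the UCI file rather than typed by hand, and the model would be evaluated on data it was not trained on.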
![Page 15: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/15.jpg)
Classification
![Page 16: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/16.jpg)
Classification
![Page 17: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/17.jpg)
Classification
![Page 18: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/18.jpg)
Unsupervised algorithms
• Unlabeled data
• You might have no idea what you’re looking for
• Clustering: splitting observations into groups
• Dimensionality reduction: flatten data to fewer dimensions
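Dimensionality reduction can be sketched with principal component analysis (PCA), one common technique among several. The iris dataset and the choice of two components are assumptions made for this illustration; note that the labels are never used.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# 150 instances with 4 features each; the labels are ignored (unsupervised).
X = load_iris().data

# Project the 4-dimensional data down to 2 dimensions.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```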
![Page 19: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/19.jpg)
Clustering
• Exploring the data
• Similar objects in the same group
• Distance between data points
![Page 20: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/20.jpg)
Clustering
• K-means clustering
• Three steps
  • Chooses initial cluster centers
  • Assigns each data instance to the nearest cluster center
  • Recalculates each cluster center
• Efficient
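The three steps can be sketched in plain NumPy. This is a simplified illustration with a fixed number of iterations and random initial centers, not scikit-learn's implementation (the library version is `sklearn.cluster.KMeans`). The function name `k_means` and the synthetic two-blob data are hypothetical.

```python
import numpy as np

def k_means(X, k, n_iter=10, seed=0):
    """Simplified k-means: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose initial cluster centers (here: k distinct data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each instance to its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recalculate each center as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Synthetic data (assumed for illustration): two blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

centers, labels = k_means(X, k=2)
print(centers)
```

Steps 2 and 3 repeat until the assignments stop changing; the sketch simply runs a fixed number of rounds to stay short.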
![Page 21: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/21.jpg)
Clustering
![Page 22: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/22.jpg)
Clustering
![Page 23: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/23.jpg)
Clustering
![Page 24: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/24.jpg)
Data preprocessing
• Encoding categorical features
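Most scikit-learn estimators expect numbers, so string categories need to be encoded first. A minimal sketch, assuming a single hypothetical "safety" feature and a few made-up labels: `OneHotEncoder` turns a feature into one 0/1 column per category, and `LabelEncoder` turns class labels into integers.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# One categorical feature, one row per instance (made-up values).
safety = [["low"], ["med"], ["high"], ["med"]]

# One 0/1 column per category; the result is sparse, so convert it to view it.
feature_encoder = OneHotEncoder()
one_hot = feature_encoder.fit_transform(safety).toarray()
print(feature_encoder.categories_)
print(one_hot)

# Class labels become integer codes.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(["unacc", "acc", "good", "acc"])
print(label_encoder.classes_, y)
```

One-hot encoding avoids implying an order between categories; for ordinal features an integer code per level can be enough.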
![Page 25: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/25.jpg)
Data preprocessing
![Page 26: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/26.jpg)
Data preprocessing
![Page 27: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/27.jpg)
Data preprocessing
• Split the dataset into training and test data
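A minimal sketch of the split using `train_test_split`. The iris dataset, the 25% test size, and the fixed `random_state` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the instances for testing; the rest is for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

The model is fitted on the training portion only, and the test portion is kept aside to check how it does on unseen data.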
![Page 28: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/28.jpg)
Validation and testing
• Model evaluation
• Cross-validation
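Cross-validation can be sketched with `cross_val_score`, which fits and scores the model on several different train/test splits. The estimator, the dataset, and the choice of 5 folds are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the held-out test set.
scores = cross_val_score(GaussianNB(), X, y, cv=5)

print(scores)          # one accuracy value per fold
print(scores.mean())   # average accuracy across folds
```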
![Page 29: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/29.jpg)
Good strategies
• Avoid overfitting
• Use lots of data
• Intuition fails in high dimensions
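One way to spot overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below assumes an unconstrained decision tree on the iris dataset, chosen only because such trees tend to memorize their training set; it is not an example from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# No depth limit, so the tree is free to fit the training data very closely.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

A large gap between the two numbers suggests the model has learned the training set rather than the underlying pattern.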
![Page 30: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/30.jpg)
My materials
• Scikit-learn.org documentation and tutorials
• Machine learning class at U of M
• Scikit-learn talks
![Page 31: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/31.jpg)
Resources
• Scikit-learn documentation and tutorials
  • scikit-learn.org/stable/documentation.html
• Other resources
  • http://archive.ics.uci.edu/ml/datasets.html
  • Mldata.org
• Videos
  • Scikit-learn tutorial: http://vimeo.com/53062607
  • Intro to scikit-learn: http://vimeo.com/72859487
![Page 32: A Beginner's Guide to Machine Learning with Scikit-Learn](https://reader033.vdocuments.us/reader033/viewer/2022061223/54c674b34a79593d1c8b4588/html5/thumbnails/32.jpg)
Contact me!
• @sarah_guido
• Linkedin.com/sarahguido
• github.com/sarguido