Sung Park PREDICT 422 Group Project Presentation
TEXT MINING DATA SCIENCE JOBS IN R
Sung Park, MSPA Candidate August 20, 2015
Northwestern University PREDICT 422-DL Section 55
SUMMARY
• Introduction
• Resources
• Data Source
• Data Extraction
• Data Preparation
• Supervised Learning
INTRODUCTION
• Exploration of web scraping and text mining capabilities in R
  • Unstructured data
  • Kaggle.com job postings
• Classification using machine learning algorithm
  • Data scientists vs. non-data scientists
RESOURCES
• Text Analytics Tutorial in R
  • Timothy D’Auria, Boston Decision, LLC
  • https://www.youtube.com/watch?v=j1V2McKbkLo
• Web Scraping Tutorial in R
  • Sharon Machlis, Computerworld
  • https://www.youtube.com/watch?v=TPLMQnGw0Vk
• Data Science in R: A Case Study Approach to Computational Reasoning and Problem Solving
  • Deborah Nolan and Duncan Temple Lang
• Google and Stack Overflow
DATA SOURCE
• Kaggle.com/jobs
• August 17, 2015
• 1,025 Job Postings
  • Data Scientist
  • Big Data Engineer
  • Data Science Architect
  • Data Analyst
  • Marketing Analyst
  • Statistician
  • Data Science Director
DATA EXTRACTION
• Extracted job links
  • XML Package
  • xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
• Extracted job posting text
  • rvest Package
  • html_text(html_nodes(htmlpage, "div.postcontent"))
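The two extraction calls can be tried without touching the live site. The sketch below is only an illustration: the HTML strings are hypothetical stand-ins (not the real Kaggle markup), and the variable names `listing`, `posting`, `links`, and `posting_text` are assumptions. Only the XPath expression and the CSS selector come from the slide.

```r
library(XML)    # htmlParse(), xpathSApply()
library(rvest)  # read_html(), html_nodes(), html_text()

# Hypothetical listing page: two job links and one unrelated link
listing <- '<html><body>
  <h3><a href="/jobs/101/data-scientist">Data Scientist</a></h3>
  <h3><a href="/jobs/102/data-analyst">Data Analyst</a></h3>
  <h3><a href="/about">About</a></h3>
</body></html>'

# Parse the string and keep only href attributes that start with "/jobs"
doc <- htmlParse(listing, asText = TRUE)
links <- as.character(
  xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
)

# Hypothetical job posting page with the body text in div.postcontent
posting <- '<html><body>
  <div class="postcontent">We need a data scientist with R skills.</div>
</body></html>'

# Select the posting body with a CSS selector and extract its text
htmlpage <- read_html(posting)
posting_text <- html_text(html_nodes(htmlpage, "div.postcontent"))
```

In a real run, each relative link in `links` would presumably be prefixed with the site's base URL and fetched in a loop before the text is extracted.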
DATA PREPARATION
• Cleaned the text data
  • tm Package
  • tm_map()
    • Remove punctuation
    • Remove white spaces
    • Lower-casing
    • Remove stopwords
      • “a”, “the”, “and”, “but”, etc.
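A minimal sketch of these cleaning steps with `tm_map()` follows. The two sample sentences are made up, and the order of the transformations is an assumption: whitespace is collapsed last here because removing stopwords leaves extra gaps behind.

```r
library(tm)

# Made-up posting snippets standing in for the scraped text
raw <- c(
  "We are hiring a Data Scientist, and the role is in Boston!",
  "The   analyst will build reports; but no modeling."
)

corpus <- VCorpus(VectorSource(raw))

# Apply the four cleaning steps listed on the slide
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned strings back out of the corpus for inspection
cleaned <- unname(vapply(corpus, as.character, character(1)))
```

Wrapping `tolower` in `content_transformer()` keeps each element a proper `tm` document rather than a bare character vector.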
DATA PREPARATION
• Created the term document matrix (TDM)
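As a rough illustration of building a TDM with the `tm` package, the sketch below uses three made-up documents. The call to `removeSparseTerms()` and its threshold are assumptions; the slides do not say how the term list was reduced.

```r
library(tm)

# Three made-up, already-cleaned postings
docs <- c(
  "data scientist machine learning python",
  "marketing analyst excel reports",
  "data scientist statistics machine learning"
)

corpus <- VCorpus(VectorSource(docs))

# Rows are terms, columns are documents
tdm <- TermDocumentMatrix(corpus)

# Assumed step: drop terms that appear in too few documents.
# With 3 documents and sparse = 0.5, only terms found in at
# least 2 documents are kept.
tdm_small <- removeSparseTerms(tdm, sparse = 0.5)

# Transpose so each row is a posting and each column a term,
# which is the layout most classifiers expect
m <- t(as.matrix(tdm_small))
```

The transposed matrix `m` can then be joined with a class label per posting before modeling.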
DATA PREPARATION
• TDM consists of 959 job postings and 73 terms
  • 375 data scientists and 584 non-data scientists
• Split TDM into training set and test set
  • 864 job postings in training sample
  • 95 job postings in test sample
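The split sizes can be reproduced with a simple random draw. This is a sketch only: the seed, the placeholder label vector, and the use of `sample()` are assumptions, since the slides do not show how the split was made.

```r
set.seed(422)  # arbitrary seed, for reproducibility only

n_postings <- 959
n_test <- 95

# Placeholder labels matching the class counts on the slide
labels <- factor(c(
  rep("data scientist", 375),
  rep("non-data scientist", 584)
))

# Draw 95 row indices for the test set; the rest form the training set
test_idx <- sample(n_postings, n_test)
train_idx <- setdiff(seq_len(n_postings), test_idx)

length(train_idx)  # 864
length(test_idx)   # 95
```

With a real matrix `m` of postings by terms, the two sets would be `m[train_idx, ]` and `m[test_idx, ]`.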
SUPERVISED LEARNING
• K-Nearest Neighbor
  • Find the K value with the highest classification accuracy
  • K=8 shows the best result with 82.98% accuracy rate
  • Confusion matrix shows the model correctly predicted 22 out of 35 data scientist job postings
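The search over K might look like the sketch below, which uses `knn()` from the `class` package on synthetic term counts. The data, the term names, and the range of K values are all assumptions made for illustration, so the numbers it produces will not match the slide.

```r
library(class)

set.seed(422)  # arbitrary seed

# Synthetic term counts: "ds" postings mention machine/learning more,
# "non_ds" postings mention marketing more
n <- 200
y <- factor(rep(c("ds", "non_ds"), each = n / 2))
x <- cbind(
  machine   = rpois(n, ifelse(y == "ds", 3, 0.5)),
  learning  = rpois(n, ifelse(y == "ds", 3, 0.5)),
  marketing = rpois(n, ifelse(y == "ds", 0.5, 3))
)

# Hold out 40 rows as a test set
test_idx <- sample(n, 40)
train_x <- x[-test_idx, ]
test_x  <- x[test_idx, ]
train_y <- y[-test_idx]
test_y  <- y[test_idx]

# Try each K and record the share of correct test predictions
ks <- 1:15
acc <- sapply(ks, function(k) {
  pred <- knn(train_x, test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
best_k <- ks[which.max(acc)]

# Confusion matrix for the best K
best_pred <- knn(train_x, test_x, cl = train_y, k = best_k)
conf_mat <- table(predicted = best_pred, actual = test_y)
```

Choosing K on the same test set that reports the final accuracy tends to give an optimistic figure; cross-validating on the training set (for example with `knn.cv()` from the same package) is a common alternative.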
SUPERVISED LEARNING
• Classification Decision Tree (Gini index)
  • The classification accuracy rate is 96.8%
  • Confusion matrix shows the model correctly predicted 30 out of 33 data scientist job postings
  • Key terms for tree construction:
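For readers who want to try a Gini-based classification tree, here is a minimal sketch on synthetic data. The slides do not name the package used; `rpart` is an assumption, as are the data frame `jobs` and its term columns.

```r
library(rpart)

set.seed(422)  # arbitrary seed

# Synthetic postings with three term-count columns
n <- 300
label <- factor(rep(c("ds", "non_ds"), each = n / 2))
jobs <- data.frame(
  label      = label,
  machine    = rpois(n, ifelse(label == "ds", 3, 0.5)),
  statistics = rpois(n, ifelse(label == "ds", 2, 0.5)),
  marketing  = rpois(n, ifelse(label == "ds", 0.5, 3))
)

test_idx <- sample(n, 60)
train <- jobs[-test_idx, ]
test  <- jobs[test_idx, ]

# Classification tree using the Gini index as the splitting criterion
fit <- rpart(label ~ ., data = train, method = "class",
             parms = list(split = "gini"))

# Confusion matrix and accuracy on the held-out rows
pred <- predict(fit, newdata = test, type = "class")
conf_mat <- table(predicted = pred, actual = test$label)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Terms ranked by their contribution to the splits
fit$variable.importance
```

The `variable.importance` component is one way to see which terms drive the tree; plotting the fitted tree shows the same information split by split.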
SUPERVISED LEARNING
• Bagging
  • The classification accuracy rate is 96.8%
  • Confusion matrix shows the same results as the classification tree
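Bagging can be sketched by hand: fit one tree per bootstrap sample of the training data and take a majority vote. The version below is an illustration on synthetic data, not the presenter's code; the number of trees `B` and the use of `rpart` are assumptions.

```r
library(rpart)

set.seed(422)  # arbitrary seed

# Synthetic postings with three term-count columns
n <- 300
label <- factor(rep(c("ds", "non_ds"), each = n / 2))
jobs <- data.frame(
  label      = label,
  machine    = rpois(n, ifelse(label == "ds", 3, 0.5)),
  statistics = rpois(n, ifelse(label == "ds", 2, 0.5)),
  marketing  = rpois(n, ifelse(label == "ds", 0.5, 3))
)

test_idx <- sample(n, 60)
train <- jobs[-test_idx, ]
test  <- jobs[test_idx, ]

B <- 25  # number of bootstrap trees (assumed)

# Each column of `votes` holds one tree's predictions for the test rows
votes <- sapply(seq_len(B), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit <- rpart(label ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = test, type = "class"))
})

# Majority vote per test row (ties go to the alphabetically first class)
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(bagged == as.character(test$label))
```

A packaged alternative is `randomForest()` with `mtry` set to the number of predictors, which amounts to bagging.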
QUESTIONS? COMMENTS?