P02. ODS Slack exploration

Upload: lviv-data-science-summer-school

Post on 23-Jan-2018

OpenDataScience Slack

Officially started 12.03.2015

● Started as a platform for open data science communication

● The biggest data science community in the world

As of 25.07.2017

● 1.3M messages, 5400 users (2000 weekly active)

Most active channels:

#deep_learning, #theory_and_practice, #visualization, #_general,

#_meetings, #_jobs, #big_data, #python, #r, #datasets, #nlp, #edu_courses

Our data:

● Users

● Messages and Threads

● Time response

● Reactions

● Replies

Graphs

Channels 2016

Node size - PageRank

Clustering - walktrap

Data - messages in top-40 channels

Channels 2017

Node size - PageRank

Clustering - walktrap

Data - messages in top-40 channels
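The slides do not say how edges between channels were built, so the following is only a minimal sketch of how PageRank node sizes could be computed. It assumes an undirected graph whose edge weight is the number of users active in both channels; the `shared_users` numbers are made up, and walktrap clustering is not covered here.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on an undirected graph.

    edges: dict mapping (node_a, node_b) -> positive weight.
    """
    # Build a symmetric weighted adjacency map.
    adj = defaultdict(dict)
    for (a, b), w in edges.items():
        adj[a][b] = adj[a].get(b, 0.0) + w
        adj[b][a] = adj[b].get(a, 0.0) + w

    nodes = sorted(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    strength = {v: sum(adj[v].values()) for v in nodes}

    for _ in range(iterations):
        # Every node first receives the "teleport" share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        # Then each node passes its rank to neighbours, split by edge weight.
        for v in nodes:
            for u, w in adj[v].items():
                new_rank[u] += damping * rank[v] * w / strength[v]
        rank = new_rank
    return rank


# Hypothetical edge weights: number of users active in both channels.
shared_users = {
    ("deep_learning", "theory_and_practice"): 120,
    ("deep_learning", "nlp"): 80,
    ("theory_and_practice", "nlp"): 40,
    ("nlp", "datasets"): 10,
}
ranks = pagerank(shared_users)
for channel in sorted(ranks, key=ranks.get, reverse=True):
    print(f"#{channel}: {ranks[channel]:.3f}")
```

Channels that share many users with other well-connected channels end up with the largest nodes.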

Most active users in all threads

Node size - weighted degree

Clustering - modularity

Data - threads in all channels

Most active users in _random_flood channel

Node size - weighted degree

Clustering - modularity

Data - threads in _random_flood

Most active users in career channel

Node size - weighted degree

Clustering - modularity

Data - threads in career channel
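A possible way to get the weighted degree used for node sizes in the user graphs, sketched under an assumption the slides do not confirm: two users are linked when they post in the same thread, and the edge weight is the number of threads they share. User names and threads below are invented; modularity clustering is left out.

```python
from collections import Counter
from itertools import combinations


def weighted_degree(threads):
    """threads: list of lists with the user names that posted in each thread."""
    edge_weight = Counter()
    for participants in threads:
        # Count each pair of distinct participants once per thread.
        for a, b in combinations(sorted(set(participants)), 2):
            edge_weight[(a, b)] += 1

    degree = Counter()
    for (a, b), w in edge_weight.items():
        degree[a] += w
        degree[b] += w
    return edge_weight, degree


# Hypothetical threads.
threads = [
    ["anna", "boris", "anna"],
    ["anna", "boris", "clara"],
    ["clara", "dmytro"],
]
edge_weight, degree = weighted_degree(threads)
print(degree.most_common())
```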

Plots

Analysis of users’ activity

Detection of curious users

● Curious user - a user who asks for help

● NLP techniques: preprocessing, regular expressions
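A rough sketch of what regex-based detection of help requests might look like. The pattern list is hypothetical and in English for readability; the actual expressions used in the project are not given in the slides.

```python
import re
from collections import Counter

# Hypothetical markers of a help request: a question mark or a typical phrase.
HELP_PATTERN = re.compile(
    r"\?|\b(?:help|how (?:do|can|to)|does anyone|any advice)\b",
    re.IGNORECASE,
)


def normalize(text):
    # Drop Slack-style <links> and <@mentions>, collapse whitespace, lowercase.
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def count_help_requests(messages):
    """messages: iterable of (user, text). Returns help-request counts per user."""
    counts = Counter()
    for user, text in messages:
        if HELP_PATTERN.search(normalize(text)):
            counts[user] += 1
    return counts


# Hypothetical messages.
messages = [
    ("u1", "How do I install xgboost on Windows"),
    ("u1", "Does anyone know a good NLP dataset?"),
    ("u2", "Here is the link <https://example.com/?q=1>"),
    ("u2", "Thanks, that worked"),
    ("u3", "Can someone help me with pandas"),
]
curious = count_help_requests(messages)
print(curious.most_common())
```

Links are stripped before matching so that a "?" inside a URL is not counted as a question.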

Expert detection

● Experts - users with the highest number of specific reactions to their messages in threads

Troll detection

● Trolls - users with the highest number of specific reactions to their messages
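Both definitions reduce to counting selected reactions per author. A minimal sketch, assuming each message carries a dict of reaction counts and a flag for thread membership; the reaction sets and the message layout are hypothetical, since the slides do not list which reactions were used.

```python
from collections import Counter

# Hypothetical reaction sets.
EXPERT_REACTIONS = {"+1", "heavy_plus_sign", "100"}
TROLL_REACTIONS = {"-1", "facepalm", "troll"}


def rank_users(messages, reaction_names, threads_only=False, top=3):
    """Sum the selected reactions received by each author and return the top users."""
    totals = Counter()
    for msg in messages:
        if threads_only and not msg["in_thread"]:
            continue
        for name, count in msg["reactions"].items():
            if name in reaction_names:
                totals[msg["user"]] += count
    return totals.most_common(top)


# Hypothetical messages.
messages = [
    {"user": "anna", "in_thread": True, "reactions": {"+1": 4, "100": 1}},
    {"user": "anna", "in_thread": False, "reactions": {"+1": 7}},
    {"user": "boris", "in_thread": True, "reactions": {"+1": 2, "facepalm": 1}},
    {"user": "clara", "in_thread": False, "reactions": {"troll": 3, "-1": 2}},
]
experts = rank_users(messages, EXPERT_REACTIONS, threads_only=True)
trolls = rank_users(messages, TROLL_REACTIONS)
print("experts:", experts)
print("trolls:", trolls)
```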

Model Info

Response Time Prediction

Problem Info

Problem:

Predict time of response in thread

Data:

12500 threads, time period 2016 - 2017

Tool:

Regression models

Data Stats

Features:

● Text of main message

● Day and hour of main message

● Length of main message

● Channel

● Mentioned users

● Links in text

● Historical activity


Target variable:

● Waiting time for response
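A sketch of how some of these features and the target could be derived from a single message. It assumes a Slack-export-like record with a `ts` string in epoch seconds, mentions written as `<@USERID>` and links as `<https://...|label>`; the field names and the example record are assumptions, and the text and historical-activity features are left out.

```python
import re
from datetime import datetime, timezone

MENTION_RE = re.compile(r"<@([A-Z0-9]+)>")
LINK_RE = re.compile(r"<(https?://[^>|]+)(?:\|[^>]*)?>")


def thread_features(message):
    """Simple features of the first message in a thread."""
    posted = datetime.fromtimestamp(float(message["ts"]), tz=timezone.utc)
    text = message["text"]
    return {
        "weekday": posted.weekday(),  # Monday = 0
        "hour": posted.hour,          # hour of day in UTC
        "length": len(text),
        "channel": message["channel"],
        "n_mentions": len(MENTION_RE.findall(text)),
        "n_links": len(LINK_RE.findall(text)),
    }


def waiting_minutes(first_ts, reply_ts):
    """Target variable: minutes between the main message and the first reply."""
    return (float(reply_ts) - float(first_ts)) / 60.0


# Hypothetical record.
example = {
    "ts": "1500984000.000000",
    "channel": "deep_learning",
    "text": "<@U123ABC> any idea why loss is NaN? see <https://example.com/log|log>",
}
feats = thread_features(example)
target = waiting_minutes(example["ts"], "1500991140.000000")
print(feats, target)
```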

Applied Approaches

Approaches:

● Lasso regression (scikit-learn): MAE = 149 min

● XGBoost regression: MAE = 140 min

● LightGBM regression: MAE = 119 min

Best results: LightGBM
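For reference, MAE (mean absolute error) is the average of |actual - predicted| over all threads, here in minutes. The sketch below computes it and compares against a constant median prediction, which is a common sanity-check baseline for MAE; all numbers are invented and are not the project's data.

```python
from statistics import median


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical waiting times in minutes.
actual = [5, 30, 240, 60]
predicted = [10, 20, 200, 90]

model_mae = mean_absolute_error(actual, predicted)

# Baseline: always predict the median waiting time.
baseline = [median(actual)] * len(actual)
baseline_mae = mean_absolute_error(actual, baseline)

print(model_mae, baseline_mae)
```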

Plot for real and predicted response time (in minutes) for deep_learning channel:

Further Work

Future improvements:

● New features (for example, the number of active users in the channel and the number of threads opened before the new thread)

● Also use answers in the channel

● Reduce dimensionality

● Take into account the topic of the thread

Users Classification

Problem:

Classify users by messages

Data:

users - 1771 (select top 100)

12 channels, time period 2016 - 2017

messages - 120 997

User Classification Tools

Linear models - best accuracy 19.85%

● LogisticRegression - 19.85% (CountVectorizer, OneVsRestClassifier)

● LinearSVC - 15.63%

LSTM - accuracy 16.56%
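To put these accuracies in context: with 100 candidate users, guessing uniformly at random is right about 1% of the time. The sketch below shows that number together with a majority-class baseline; the label list is invented, because the slides do not give the class distribution.

```python
from collections import Counter


def uniform_guess_accuracy(n_classes):
    """Expected accuracy when guessing uniformly at random among n_classes."""
    return 1.0 / n_classes


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def accuracy(true_labels, predicted_labels):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


chance = uniform_guess_accuracy(100)

# Hypothetical message authors.
labels = ["u1"] * 5 + ["u2"] * 3 + ["u3"] * 2
majority = majority_baseline_accuracy(labels)
toy_accuracy = accuracy(["u1", "u2", "u3", "u1"], ["u1", "u2", "u1", "u2"])

print(chance, majority, toy_accuracy)
```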

Channels Classification

Problem:

Classify channels by messages

Data:

messages - 120 997

12 channels, time period 2016 - 2017:

#career, #big_data, #kaggle_crackers, #lang_python, #lang_r, #theory_and_practice,

#nlp, #welcome, #bayesian, #_meetings, #datasets, #deep_learning

Channel Classification Tools

Preprocessing - pymorphy2 lemmatization, exclude: stop_words/url/emoji
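A minimal sketch of this kind of cleaning, using only the standard library. The regular expressions and the stop-word set are hypothetical stand-ins, and lemmatization is passed in as a function so that a pymorphy2-based lemmatizer can be plugged in; by default tokens are left unchanged.

```python
import re

URL_RE = re.compile(r"<?https?://\S+>?")
EMOJI_RE = re.compile(r":[a-z0-9_+\-]+:")  # Slack-style :emoji_name: codes
TOKEN_RE = re.compile(r"[^\W\d_]+")        # runs of letters, Unicode-aware

# Hypothetical stop-word set; a real list would be much longer.
STOP_WORDS = {"и", "в", "на", "the", "a", "is"}


def preprocess(text, lemmatize=lambda token: token):
    """Lowercase, drop URLs and emoji codes, tokenize, drop stop words, lemmatize."""
    text = URL_RE.sub(" ", text.lower())
    text = EMOJI_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    return [lemmatize(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The model is overfitting :fire: see <https://example.com/a_b>")
print(tokens)
```

URLs are removed before emoji codes so that the colon in "https:" cannot be mistaken for part of an emoji name.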

Linear models - best accuracy 55.33%

● LogisticRegression - 55.33% (CountVectorizer, OneVsRestClassifier)

● LinearSVC - 51.52%

CNN (with fasttext embeddings) - accuracy 51.67%

LSTM (with fasttext embeddings) - accuracy 55.42%

Ensemble - accuracy 58.17%
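The slides do not say how the ensemble was built. One common option is soft voting, which averages the class probabilities of the individual models; the sketch below assumes that approach, and the probabilities in it are invented.

```python
def soft_vote(model_probs, weights=None):
    """Average per-class probabilities from several models and pick the best class.

    model_probs: list of dicts mapping class name -> probability (same keys in each).
    weights: optional list of model weights; equal weights by default.
    """
    if weights is None:
        weights = [1.0] * len(model_probs)
    total = sum(weights)
    classes = model_probs[0].keys()
    averaged = {
        c: sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total
        for c in classes
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of three models for one message.
probs_logreg = {"nlp": 0.50, "deep_learning": 0.40, "career": 0.10}
probs_cnn = {"nlp": 0.30, "deep_learning": 0.60, "career": 0.10}
probs_lstm = {"nlp": 0.35, "deep_learning": 0.55, "career": 0.10}

label, averaged = soft_vote([probs_logreg, probs_cnn, probs_lstm])
print(label, averaged)
```

In this example the logistic regression alone would pick #nlp, while the averaged vote picks #deep_learning.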

Channels Classification eli5

Our Team

Volodymyr Medentsiy

Vadym Korshunov

Ganna Kaplun

Andrii Skliar

Yana Mosiichuk

Kateryna Bobrovnyk

Vitalii Radchenko