Applying a Logistic Regression Model to Build a Celebrity Popularity Ranking System
Author: Thuc VX <[email protected]>
Company: NHN Vietnam – Search service division
Project description - Build a ranking system for celebrities in the Line2day service
Architecture
Systems collaboration
- "Celeb ranking" runs every day at 23:00
- Line2day fetches the scores every day at 05:00
General architecture
Collecting data module
- The repository is a text file whose contents are in JSON format
- The data-collecting module is run every day by crontab
- The data file is served over HTTP by an Apache web server
Platform
- Web server: Apache web server
- Data format for information exchange: JSON (JSON library for Java: http://www.json.org)
- HTTPS client: JSSE in the JDK
- Programming language: Java for all modules (the Java JDK is already on server gvbatch01)
Celeb Ranking theory
Features
1. Youtube channel's subscribers
2. Official Facebook page's likes
3. Number of articles in Line2day - returned from the Line2day API when getting the celeb list
4. Number of Google search results
5. Number of documents in the vnews collection
Recency attributes
Idx  Name            Description                                             Coefficient
1    videoRecency    Number of videos in Line2day in last 3 days             1
2    articleRecency  Number of articles in Line2day in last 3 days           0.01
3    vnewsRecency    Number of documents in vnews collection in last 3 days  0.005
4    photoRecency    Number of photos in Line2day in last 3 days             0.001
5    postRecency     Number of posts of celeb in Facebook in last 1 day      0.005
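Ranking function
Use the Logistic Regression model.
Logistic Regression function (standard form):
F(t) = 1 / (1 + e^(-t))
Suppose t is a linear combination of the feature values x1, ..., xn:
t = w0 + w1·x1 + ... + wn·xn
The final score adds a recency term:
Score(t) = F(t) + β · Recency
where Recency is the vector of recency attributes and β holds their coefficients.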
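As a rough illustration of the score formula, here is a minimal Python sketch. It assumes the standard logistic function for F, a linear combination of already-scaled feature values for t, and the recency coefficients from the "Recency attributes" table. The function names, the `weights` and `intercept` parameters, and the input layout are hypothetical; the real weights would come from the trained model.

```python
import math

# Recency coefficients copied from the "Recency attributes" table.
RECENCY_COEFFICIENTS = {
    "videoRecency": 1.0,
    "articleRecency": 0.01,
    "vnewsRecency": 0.005,
    "photoRecency": 0.001,
    "postRecency": 0.005,
}


def logistic(t):
    """Standard logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def celeb_score(features, weights, intercept, recency):
    """Score(t) = F(t) + beta . Recency.

    features / weights: equal-length lists (hypothetical, assumed pre-scaled).
    recency: dict mapping recency attribute name to its raw count.
    """
    t = intercept + sum(w * x for w, x in zip(weights, features))
    recency_term = sum(
        RECENCY_COEFFICIENTS[name] * count for name, count in recency.items()
    )
    return logistic(t) + recency_term
```

For example, `celeb_score([0.0, 0.0], [1.0, 1.0], 0.0, {"videoRecency": 2})` gives F(0) = 0.5 plus a recency term of 2, so 2.5. Because videoRecency has coefficient 1, a single recent video can outweigh the whole logistic part, which is bounded by 1.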
Machine learning tool: http://devwiki.nhncorp.vn/index.php/Logregdrvr
Phase 2
Features
Idx  Name              Description                                                                      Apply in
1    Youtube           Youtube channel's subscribers                                                    Phase 1
2    Facebook          Official Facebook page's likes
3    Line2day          Number of articles in Line2day - returned from Line2day API when get celeb list
4    Google Web        Number of relevant documents in Google Web search
5    Google News       Number of relevant documents in Google News search
5.1  GoogleTrend       Result from Google trend
6    articleClicks     Number of clicks on articles talking about celeb                                 Phase 2
7    LikesPerPost      Average number of likes per post in Facebook
8    musicVideoClicks  Number of clicks on videos talking about celeb                                   Phase 3
9    PhotoClicks       Number of clicks on photos talking about celeb
10   Rating            Celeb Rating Score
11   Followers         Number of followers in Line2day: this feature from Mobile App
12   Photos            Number of photos in Line2day
13   Videos            Number of videos in Line2day
Note:
- To obtain the Celeb Rating feature, the Line2day mobile app and website have to provide a function that lets readers vote for a celeb.
- Try other Google search parameters:
http://moz.com/ugc/the-ultimate-guide-to-the-google-search-parameters
On the articleClicks feature
- Check the correlation between #clicks and the pre-classification
- Use #clicks as a feature and evaluate the model
Recency attributes
Idx  Name                  Description                                         Coefficient
1    videoRecency          Number of videos in Line2day in last 3 days         1
2    articleRecency        Number of articles in Line2day in last 3 days       0.01
4    photoRecency          Number of photos in Line2day in last 3 days         0.001
5    postRecency           Number of posts of celeb in Facebook in last 1 day  0.005
6    FacebookTalkingAbout  TalkingAbout in Facebook
PreProcess feature data
Check for abnormal Google search results:
- Check the number of search results returned by Google search
- Process:
1. Detect abnormal data: by min / max, or automatically by the app
2. Fix the abnormal data
Method for detecting abnormal data:
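We use a statistical method:
- Calculate the modified Z-score:
M_i = 0.6745 · (x_i - median(x)) / MAD
Where MAD = median(|x_i - median(x)|) is the median absolute deviation.
- If |M_i| > 3.5, x_i is likely an outlier.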
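The detection rule can be sketched in Python as follows. This is only an illustration under the usual definition of the modified Z-score (constant 0.6745, median absolute deviation as the scale). The function names are hypothetical, and returning all-zero scores when the MAD is 0 is an assumption made here, not something the text specifies.

```python
from statistics import median


def modified_z_scores(values):
    """Return the modified Z-score of every value.

    M_i = 0.6745 * (x_i - median) / MAD, with MAD = median(|x_i - median|).
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Assumption: with no spread estimate, flag nothing as an outlier.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def find_outliers(values, threshold=3.5):
    """Return the values whose |modified Z-score| exceeds the threshold."""
    scores = modified_z_scores(values)
    return [v for v, s in zip(values, scores) if abs(s) > threshold]
```

With made-up search result counts such as `[100, 102, 98, 101, 99, 10000]`, only `10000` is flagged. Unlike the ordinary Z-score, the median-based version is barely affected by the outlier itself.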
Method for fixing data:
1. Build a regression function between the search result count and the classification: the fit should be just right.
2. Every new celebrity must have a classification number.
3. Existing celebs get their classification from their popularity.
4. From the classification value, we calculate the Google search result count using the regression function.
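A minimal sketch of steps 1 and 4, assuming a simple straight-line (ordinary least squares) regression from classification to search result count. The text does not say which regression form is used, so the linear form, the function names, and the sample numbers in the usage note are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_search_results(classification, slope, intercept):
    """Replace an abnormal search result count by the regression estimate."""
    return slope * classification + intercept
```

For instance, fitting on clean data with classifications `[1, 2, 3, 4]` and counts `[1000, 2000, 3000, 4000]` gives slope 1000 and intercept 0, so a celeb with classification 5 whose count was flagged as abnormal would be assigned an estimate of 5000. Only non-outlier observations should go into the fit.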
Generate data for missing social URL
- Build a regression function between the Facebook (or Youtube) feature and the Google search feature
- If the social URL is missing, predict the feature value with this regression function
Name search service
Requirements:
- Count as accurately as possible the number of times a celebrity appears on the Internet
- Count only articles that mention the celeb, not other people with the same name
- Synonym dictionary: support names written in different formats, such as Tara, t-ara, ...
Solution:
The search string is managed in CMV (a Line2day tool). After the search string is edited, the data is imported into the Line2day database, and Line2day returns it to the Celeb Ranking System through its API.
Search string: provided by Line2day. This string carries two pieces of information:
- Different forms of the name: name1, name2, ...
- Different keywords: keyword1, keyword2, ...
The query for the search engine has the following format:
Query = ("name1" | "name2" ...) & ("keyword1" | "keyword2" | ...)
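The query format above could be assembled as in the following Python sketch. The function name and the behaviour for an empty keyword list (emit only the name group) are assumptions; the text only defines the case where both names and keywords are present.

```python
def build_query(names, keywords):
    """Build ("name1" | "name2" ...) & ("keyword1" | "keyword2" ...)."""
    if not names:
        raise ValueError("at least one name is required")
    name_part = " | ".join(f'"{name}"' for name in names)
    if not keywords:
        # Assumption: without keywords, search on the name variants only.
        return f"({name_part})"
    keyword_part = " | ".join(f'"{keyword}"' for keyword in keywords)
    return f"({name_part}) & ({keyword_part})"
```

Using the name variants mentioned in the requirements and two made-up keywords, `build_query(["Tara", "t-ara"], ["singer", "kpop"])` produces `("Tara" | "t-ara") & ("singer" | "kpop")`. The keyword group is what filters out other people with the same name.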
Evaluation - Evaluating the logistic regression model
http://kiemtailieu.com/khoa-hoc-tu-nhien/tai-lieu/danh-gia-mo-hinh-hoi-qui-logistic/2.html
Use Pearson correlation: the correlation coefficient between the target value and the predicted value
Use Kendall tau-b correlation, for example in R: cor.test(exer, smoke, method="kendall")
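For readers without R at hand, both correlation measures can be sketched in plain Python. This is an illustrative implementation of the textbook definitions, not the tool used in the project; the function names are hypothetical and the O(n^2) pair loop is only suitable for small lists.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_b(xs, ys):
    """Kendall tau-b: (C - D) / sqrt((pairs untied in x) * (pairs untied in y))."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: counts toward neither factor
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    if untied_x == 0 or untied_y == 0:
        raise ValueError("tau-b is undefined for constant input")
    return (concordant - discordant) / math.sqrt(untied_x * untied_y)
```

For a target ranking `[1, 2, 3, 4, 5]` and a predicted ranking `[1, 2, 3, 5, 4]`, Pearson gives 0.9 and Kendall tau-b gives 0.8 (one discordant pair out of ten).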
Use DCG (Discounted Cumulative Gain), normalized by the DCG of the ideal ordering
The normalized rate lies in the range [0, 1]. A higher rate reflects a better model.
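A rate bounded by [0, 1] corresponds to the normalized form of DCG (nDCG), where the DCG of the predicted ordering is divided by the DCG of the ideal ordering. The sketch below assumes the common log2 discount and non-negative relevance values (for example the pre-judged classification of each celeb); the function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), i starting at 1."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances_in_predicted_order):
    """DCG of the predicted ordering divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    if ideal == 0:
        return 0.0  # assumption: no relevant items means nothing to reward
    return dcg(relevances_in_predicted_order) / ideal
```

Here the input list holds the true relevance of each celeb, listed in the order the model ranked them. A perfect ranking such as `ndcg([3, 2, 1])` gives 1.0, while a reversed ranking such as `ndcg([1, 2, 3])` gives a value below 1.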
Diagnostics
- Residuals: between the real rank and the predicted rank
- Confidence interval
- Outliers: identify the observations with the highest residuals
- Covariance: between the pre-judged and predicted values