NoLimit Research Stack Tech Talk - March 11th 2016


Page 1: NoLimit Research Stack

NoLimit Research Stack

Tech Talk - March 11th 2016

Page 2: NoLimit Research Stack

Contents

I. Overview
II. APIs
   A. Entity Extractor
   B. Summarizer
   C. Category Classifier
   D. Topic
   E. Next Project(s)
III. Supporting Tools
   A. Gerbang-API
   B. Demo Master

Page 3: NoLimit Research Stack

Overview - Introduction

The NoLimit Research Team is responsible for developing internal NLP APIs for Bahasa Indonesia (Ananta Pandu & Anggrahita Bayu).

Currently, the APIs are:

A. Entity Extractor
B. Summarizer
C. Category Classifier
D. Topic

Page 4: NoLimit Research Stack

Overview - The Architecture

[Architecture diagram]

APIs (each exposed as a web service): Entity Extractor, Summarizer, Category Classifier, Topic Classifier

Supporting Tools: Gerbang-API, Demo Master

Page 5: NoLimit Research Stack

Overview - The Architecture (2)

[Architecture diagram, by technology]

Web service layer: NodeJS (x4)

APIs: Scala (Entity Extractor), NodeJS (Summarizer), NodeJS (Category Classifier), Python (Topic Classifier)

Supporting Tools: NodeJS (Gerbang-API), ReactJS (Demo Master)

Page 6: NoLimit Research Stack

The APIs

A. Entity Extractor
B. Summarizer
C. Category Classifier
D. Topic Classifier

Link: http://demo.api.nlp.nolimitid.com/

Page 7: NoLimit Research Stack

Entity Extractor

Get entities from an online news text

http://demo.api.nlp.nolimitid.com/page/3

Text: String
Entities: Object (City, Country, Company, Event, JobTitle, Organization, Person, Product)

Page 8: NoLimit Research Stack

Entity Extractor (2)

Page 9: NoLimit Research Stack

Entity Extractor (3)

- Built using Scala.
- Gets entities from text using an HMM (https://en.wikipedia.org/wiki/Hidden_Markov_model); see the sketch below.
- Dataset (news + entities) provided by NoLimit. Total: 3000 articles.
- Previously built using Java + Weka. Changed from simple classification to HMM because HMM is known to recognize patterns better than simple classification.
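To make the HMM step concrete, here is a minimal, illustrative Viterbi-decoding sketch in Python. This is not the production Scala code: the label set, transition table, and emission model below are hypothetical placeholders that a trained model would estimate from the labelled news articles.

# Illustrative only: Viterbi decoding for HMM token tagging.
# The label set and probability tables are hypothetical placeholders.
import math

states = ["O", "Person", "Organization"]
start_p = {"O": 0.8, "Person": 0.1, "Organization": 0.1}
trans_p = {  # P(label_t | label_t-1)
    "O":            {"O": 0.8, "Person": 0.1, "Organization": 0.1},
    "Person":       {"O": 0.6, "Person": 0.3, "Organization": 0.1},
    "Organization": {"O": 0.6, "Person": 0.1, "Organization": 0.3},
}

def emit_p(state, token):
    # Hypothetical emission model P(token | label): capitalized tokens are
    # more likely to be entities.
    if token.istitle():
        return {"O": 0.2, "Person": 0.4, "Organization": 0.4}[state]
    return {"O": 0.9, "Person": 0.05, "Organization": 0.05}[state]

def viterbi(tokens):
    # Most likely label sequence for a token sequence (log-space scores).
    V = [{s: (math.log(start_p[s] * emit_p(s, tokens[0])), [s]) for s in states}]
    for t in range(1, len(tokens)):
        V.append({})
        for s in states:
            score, path = max(
                (V[t - 1][prev][0] + math.log(trans_p[prev][s] * emit_p(s, tokens[t])),
                 V[t - 1][prev][1])
                for prev in states)
            V[t][s] = (score, path + [s])
    return max(V[-1].values())[1]

print(viterbi("Joko Widodo mengunjungi kantor Telkom di Bandung".split()))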

Page 10: NoLimit Research Stack

Entity Extractor - V1.0

V1.0 built using Java + Weka

- Tag each token with simple classification using the Weka API.
- Good, because:
  - Weka makes experimenting easier
- Bad, because:
  - Java 1.7 syntax for higher-order functions is worse than NodeJS
  - Simple classification < HMM
  - Confusing Weka API
  - Weka's models have large file sizes

Page 11: NoLimit Research Stack

Entity Extractor - V2.0

V2.0 built using NodeJS

- Implement HMM in NodeJS.
- Good, because:
  - JSON as literal objects & arrays
  - Libraries (lodash, bluebird, nalapa https://github.com/anpandu/nalapa)
  - Easy to deploy (npm install + pm2 start)
- Bad, because:
  - NodeJS apps run in a single thread
  - Had to create multiple instances as micro-services, which is too complicated for a single endpoint

Page 12: NoLimit Research Stack

Entity Extractor - V3.0 (current)

V3.0 built using Scala

- Implement HMM in Scala; actually just ported the code to Scala.
- Good, because:
  - Simpler multithreading than NodeJS (parallel map, Akka, etc.)
  - Safer because of native immutability (var vs val)
  - Statically typed languages are generally faster than dynamically typed ones
- Bad, because:
  - Fewer libraries
  - Longer test time

Page 13: NoLimit Research Stack

Entity Extractor (4)

Next :

- Updateable dataset
- Scheduled re-training (realtime?)

Page 14: NoLimit Research Stack

Entity Extractor - Conclusion

- HMM is better than simple classification because predicting the label of each token in a text is a pattern-recognition problem: an HMM models the sequence of tokens together with their labels.

- Developing the entity extractor in Scala is the best solution so far because it lets us implement parallel computation more easily than NodeJS, and it performs better too.

Page 15: NoLimit Research Stack

Summarizer

Get a shorter version of an online news text

http://demo.api.nlp.nolimitid.com/page/2

Text: String
Summary: String

Page 16: NoLimit Research Stack

Summarizer (2)

Page 17: NoLimit Research Stack

Summarizer (3)

- Built using NodeJS (core: https://github.com/anpandu/nodejs-text-summarizer).
- Gets a summary from a text:
  - Split the text into sentences.
  - Score each sentence using Word Form Similarity, Word Order Similarity, and Word Semantic Similarity (P. y. Zhang 2009).
  - Take the n best sentences to form a summary.
- No dataset. The scoring formula is coded straight from the paper (P. y. Zhang 2009).

Page 18: NoLimit Research Stack

Summarizer - Preprocess

- Split text into sentences
- Split sentences into tokens
- Remove stopwords, replace non-ASCII characters, etc. (see the sketch below)
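A minimal preprocessing sketch in Python (the production summarizer is NodeJS). The stopword list here is a tiny placeholder, not NoLimit's actual list.

# Illustrative preprocessing sketch; STOPWORDS is a placeholder list.
import re

STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk"}

def to_sentences(text):
    # Naive sentence splitter on ., !, ? boundaries.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def to_tokens(sentence):
    # Drop non-ASCII characters, lowercase, tokenize, remove stopwords.
    sentence = sentence.encode("ascii", "ignore").decode()
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]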

Page 19: NoLimit Research Stack

Summarizer - Scoring

- The best sentences for a summary are the ones most similar to the other sentences.
- Score each sentence with:
  - Word Form Similarity (WFS)
  - Word Order Similarity (WOS)
  - Word Semantic Similarity (WS)
- The WFS, WOS, and WS formulas are in (P. y. Zhang 2009).
- Sum the scores (SUM = a*WFS + b*WOS + c*WS; a+b+c=1).
- Take the best n sentences.
- Sort them by their original order and join (see the sketch below).
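A Python sketch of the selection logic only. The wfs, wos, and ws arguments stand in for the Word Form / Word Order / Word Semantic similarity formulas from (P. y. Zhang 2009), which are not reproduced here; the weights a, b, c are placeholder values with a+b+c=1.

# Sentence selection sketch; wfs/wos/ws are placeholder similarity functions.
def score_sentence(i, sentences, wfs, wos, ws, a=0.4, b=0.3, c=0.3):
    # A sentence's score is its summed weighted similarity to every other sentence.
    return sum(a * wfs(sentences[i], s) + b * wos(sentences[i], s) + c * ws(sentences[i], s)
               for j, s in enumerate(sentences) if j != i)

def summarize(sentences, n, wfs, wos, ws):
    # Take the n best-scoring sentences, then restore their original order.
    best = sorted(range(len(sentences)),
                  key=lambda i: score_sentence(i, sentences, wfs, wos, ws),
                  reverse=True)[:n]
    return " ".join(sentences[i] for i in sorted(best))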

Page 20: NoLimit Research Stack

Category Classifier

Determine the category of an online news text

http://demo.api.nlp.nolimitid.com/page/1

Text: String
Category: Object

Page 21: NoLimit Research Stack

Category Classifier (2)

Page 22: NoLimit Research Stack

Category Classifier (3)

- Built using NodeJS (core: https://github.com/anpandu/indonesian-news-category-classifier).
- Tags an article with a category (12 categories in total):
  - Split the article into tokens.
  - Remove stopwords, unique-ify, etc.
  - Use tf-idf scores as features to train the model (12 scores, 12 features).
  - Train using SVM (https://github.com/nicolaspanel/node-svm), because a tf-idf score is a float, which works well in a feature vector.
- Tested on the training set: f-measure of 90%.
- Dataset provided by NoLimit, tuples of (article + category), 48000 articles in total. (A generic sketch follows below.)
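A generic tf-idf + SVM sketch in Python with scikit-learn, just to illustrate the idea. The production classifier runs on NodeJS with node-svm and uses 12 per-category tf-idf scores as features, which is not reproduced here; the articles and labels below are placeholders.

# Generic tf-idf + SVM sketch; not the production 12-feature scheme.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

articles = ["harga saham naik tajam di bursa", "timnas menang dua gol tadi malam"]
labels = ["ekonomi", "olahraga"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(articles, labels)
print(model.predict(["saham perbankan turun"]))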

Page 23: NoLimit Research Stack

Topic (1)

Use cases:

1. Identify a predetermined number of topics from a collection of documents (news articles)

2. Classify a single article into a certain topic

Page 24: NoLimit Research Stack

Topic (2)

- Built using Python 3.X and its awesome supporting libraries:
  - Numpy (array processing): http://www.numpy.org/
  - TextBlob (NLTK-based text processor): https://textblob.readthedocs.org/en/dev/
  - Scikit-Learn (machine learning: classifiers, clustering, quality analysis): http://scikit-learn.org/
  - JSON (JSON handler), Pickle (Python object-to-file)
  - LDA

Page 25: NoLimit Research Stack

Topic (3): Latent Dirichlet Allocation (LDA)

Here be the Leviathan

Indeed, any hope of overcoming him is false; Shall one not be overwhelmed at the sight of him?

-- Job 41:9 NKJV

Page 26: NoLimit Research Stack

Latent Dirichlet Allocation (2)

Page 27: NoLimit Research Stack

Latent Dirichlet Allocation (3)

Notation:
- Wd,n = n-th word in document d
- Zd,n = topic id of the n-th word in document d
- θd = distribution of topics in document d
- βk = probability of each word occurring in topic k
- α = prior weight of a topic in a document
- η = prior weight of a word in a topic
- N = number of words, D = number of documents, K = number of topics

Generative process:
- βk ~ DirichletV(η)
- θd ~ DirichletK(α)
- z ~ CategoricalK(θd)
- w ~ CategoricalV(βz)

[Plate diagram: α → θd → Zd,n → Wd,n ← βk ← η, with plates over N, D, and K]
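A toy numpy simulation of the generative process above; the vocabulary size, topic count, document count, document length, and hyperparameters are arbitrary illustration values.

# Toy simulation of the LDA generative process; all sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 50, 3, 4, 20        # vocab size, topics, documents, words per doc
eta, alpha = 0.01, 0.1

beta = rng.dirichlet([eta] * V, size=K)       # beta_k ~ Dirichlet_V(eta)
docs = []
for d in range(D):
    theta = rng.dirichlet([alpha] * K)        # theta_d ~ Dirichlet_K(alpha)
    words = []
    for n in range(N):
        z = rng.choice(K, p=theta)            # z ~ Categorical_K(theta_d)
        w = rng.choice(V, p=beta[z])          # w ~ Categorical_V(beta_z)
        words.append(int(w))
    docs.append(words)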

Page 28: NoLimit Research Stack

Latent Dirichlet Allocation (4): in Python

# X = matrix of word counts for the vocabulary across all documents
# num_topics = predetermined number of topics

import lda

model = lda.LDA(num_topics)
model.fit(X)  # the leviathan roars after this function's invocation

topic_word = model.topic_word_
# (array, shape = [n_topics, n_features]) point estimate of the topic-word distributions

doc_topic = model.doc_topic_
# (array, shape = [n_samples, n_topics]) point estimate of the document-topic distributions
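One way to build the word-count matrix X expected above is scikit-learn's CountVectorizer (scikit-learn is already part of this stack); the documents here are placeholders.

# Building X with CountVectorizer; documents are placeholders.
from sklearn.feature_extraction.text import CountVectorizer

documents = ["pemerintah umumkan kebijakan baru", "timnas lolos ke babak final"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents).toarray()   # shape = [n_docs, n_vocab]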

Page 29: NoLimit Research Stack

Next Project(s)

- Entity extractor + sentiment
- Quotation extraction
- Opinion Mining
- Credit Card scoring by social media

Page 30: NoLimit Research Stack

Challenges

- How to make dataset/model updates happen automatically
- Balancing accuracy vs. speed

Page 31: NoLimit Research Stack

Supporting Tools

A. Gerbang-API
B. Demo Master

Page 32: NoLimit Research Stack

Gerbang-API

- Built with NodeJS (sailsjs).
- Unifies all APIs into one.
- Actually just a simple app that re-routes endpoints.
- Next:
  - Move to a simpler framework (restify?)
  - User authorization + tokens
  - Rate limiting
  - Logger

Page 33: NoLimit Research Stack

Demo Master

- Small web app to host documentation + demos.
- Link: http://demo.enrichment.nolimitid.com/
- Next:
  - Upgrade to a 3rd-party framework (probably https://github.com/tripit/slate)

Page 34: NoLimit Research Stack

Demo Master (2)

Old Demo Master

Page 35: NoLimit Research Stack

Demo Master (3)

Next Demo Master (with slate)

Page 36: NoLimit Research Stack

Questions?