
NoLimit Research Stack

Tech Talk - March 11th 2016

Contents

I. Overview
II. APIs
  A. Entity Extractor
  B. Summarizer
  C. Category Classifier
  D. Topic
  E. Next Project(s)
III. Supporting Tools
  A. Gerbang-API
  B. Demo Master

Overview - Introduction

The NoLimit Research Team is responsible for developing internal NLP APIs for Bahasa Indonesia (Ananta Pandu & Anggrahita Bayu).

Currently, the APIs are:

A. Entity Extractor
B. Summarizer
C. Category Classifier
D. Topic

Overview - The Architecture

[Architecture diagram]

APIs (each exposed as a web service): Entity Extractor, Summarizer, Category Classifier, Topic Classifier

Supporting Tools: Gerbang-API, Demo Master

Overview - The Architecture (2)

[Same diagram, labeled by technology]

APIs: the web-service layer of all four APIs is NodeJS; the cores are Scala (Entity Extractor), NodeJS (Summarizer), NodeJS (Category Classifier), and Python (Topic Classifier)

Supporting Tools: Gerbang-API (NodeJS), Demo Master (ReactJS)

The APIs

A. Entity Extractor
B. Summarizer
C. Category Classifier
D. Topic Classifier

Link: http://demo.api.nlp.nolimitid.com/

Entity Extractor

Get entities from an online news text

http://demo.api.nlp.nolimitid.com/page/3

Input: Text (String)
Output: Entities (Object: City, Country, Company, Event, JobTitle, Organization, Person, Product)
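For illustration, a call might look like this (a hypothetical example: the text and entity values are invented; only the field shape comes from the description above):

text = "Presiden Joko Widodo meresmikan pabrik baru PT Maju Jaya di Jakarta."

entities = {
    "Person": ["Joko Widodo"],
    "JobTitle": ["Presiden"],
    "Company": ["PT Maju Jaya"],
    "City": ["Jakarta"],
    "Country": [],
    "Event": [],
    "Organization": [],
    "Product": []
}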

Entity Extractor (2)

Entity Extractor (3)

- Built using Scala.
- Gets entities from text using an HMM (https://en.wikipedia.org/wiki/Hidden_Markov_model).
- Dataset (news + entities) provided by NoLimit. Total: 3000 articles.
- Previously built using Java + Weka. Changed from simple classification to an HMM because HMMs are known to recognize patterns better than per-token classification (a minimal sketch of the idea follows).
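A minimal sketch of the HMM idea in Python (the production core is Scala; the label set, probabilities, and tokens below are toy assumptions): Viterbi decoding picks the most likely label sequence for the whole sentence, instead of classifying each token on its own.

# Minimal Viterbi sketch: pick the most likely label sequence for a sentence.
# Probabilities below are toy numbers, not NoLimit's trained model.

states = ["O", "PERSON"]                      # toy label set
start = {"O": 0.8, "PERSON": 0.2}             # P(first label)
trans = {"O": {"O": 0.7, "PERSON": 0.3},      # P(next label | label)
         "PERSON": {"O": 0.6, "PERSON": 0.4}}
emit = {"O": {"presiden": 0.5, "jokowi": 0.1},        # P(token | label)
        "PERSON": {"presiden": 0.1, "jokowi": 0.8}}

def viterbi(tokens):
    # V[i][s] = best probability of any label path ending in state s at token i
    V = [{s: start[s] * emit[s].get(tokens[0], 1e-6) for s in states}]
    back = [{}]
    for i, tok in enumerate(tokens[1:], start=1):
        V.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[i-1][p] * trans[p][s])
            V[i][s] = V[i-1][prev] * trans[prev][s] * emit[s].get(tok, 1e-6)
            back[i][s] = prev
    # walk backwards from the best final state
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

print(viterbi(["presiden", "jokowi"]))   # -> ['O', 'PERSON']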

Entity Extractor - V1.0

V1.0 built using Java + Weka

- Tag each token with simple classification using the Weka API.
- Good, because:
  - Weka makes experimenting easier
- Bad, because:
  - Java 1.7 syntax for higher-order functions is worse than NodeJS
  - Simple classification < HMM
  - Confusing Weka API
  - Weka's models have large file sizes

Entity Extractor - V2.0

V2.0 built using NodeJS

- Implements the HMM in NodeJS.
- Good, because:
  - JSON as literal objects & arrays
  - Libraries (lodash, bluebird, nalapa https://github.com/anpandu/nalapa)
  - Easy to deploy (npm install + pm2 start)
- Bad, because:
  - NodeJS apps run in a single thread
  - Had to create multiple instances as micro-services; too complicated for a single endpoint

Entity Extractor - V3.0 (current)

V3.0 built using Scala

- Implements the HMM in Scala; essentially a port of the NodeJS code.
- Good, because:
  - Simpler multithreading than NodeJS (parallel map, Akka, etc.)
  - Safer thanks to native immutability (var vs val)
  - Statically typed languages are generally faster than dynamically typed ones
- Bad, because:
  - Fewer libraries
  - Longer test times

Entity Extractor (4)

Next:

- Updateable dataset
- Scheduled re-training (realtime?)

Entity Extractor - Conclusion

- An HMM is better than simple classification because predicting the label of each token in a text is a pattern-recognition problem: the HMM models the sequence of tokens and their labels.

- Developing the entity extractor in Scala is the best solution so far because it makes parallel computation easier to implement than in NodeJS, and it performs better too.

Summarizer

Get a shorter version of an online news text

http://demo.api.nlp.nolimitid.com/page/2

Input: Text (String)
Output: Summary (String)

Summarizer (2)

Summarizer (3)

- Built using NodeJS (core: https://github.com/anpandu/nodejs-text-summarizer).
- Gets a summary from a text:
  - Split the text into sentences.
  - Score each sentence using Word Form Similarity, Word Order Similarity, and Word Semantic Similarity (P. Y. Zhang 2009).
  - Take the n best sentences to form the summary.
- No dataset; the scoring formula is coded straight from the paper (P. Y. Zhang 2009).

Summarizer - Preprocess

- Split text into sentences
- Split each sentence into tokens
- Remove stopwords, replace non-ASCII characters, etc.

Summarizer - Scoring

- The best sentences for the summary are the ones most similar to the other sentences
- Score each sentence with:
  - Word Form Similarity (WFS)
  - Word Order Similarity (WOS)
  - Word Semantic Similarity (WS)
- The WFS, WOS, and WS formulas are in (P. Y. Zhang 2009)
- Sum the scores (SUM = a*WFS + b*WOS + c*WS; a + b + c = 1)
- Take the best n sentences
- Sort them by their original order and join (see the sketch below)
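A minimal sketch of the whole pipeline in Python (the production core is NodeJS; a plain word-overlap similarity stands in here for the WFS/WOS/WS combination from the paper, and the stopword list is a toy assumption):

# Sketch of extractive scoring: the real system combines WFS, WOS, and WS
# (P. Y. Zhang 2009); here plain word overlap stands in for all three.
import re

STOPWORDS = {"yang", "dan", "di", "ke", "dari"}   # tiny toy stopword list

def sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence):
    return {w for w in re.findall(r"\w+", sentence.lower())} - STOPWORDS

def similarity(a, b):
    # stand-in for a*WFS + b*WOS + c*WS with a + b + c = 1
    return len(a & b) / (len(a | b) or 1)

def summarize(text, n=2):
    sents = sentences(text)
    toks = [tokens(s) for s in sents]
    # a sentence scores high if it is similar to the other sentences
    scores = [sum(similarity(t, u) for u in toks if u is not t) for t in toks]
    best = sorted(sorted(range(len(sents)), key=lambda i: -scores[i])[:n])
    return " ".join(sents[i] for i in best)   # keep original sentence order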

Category Classifier

Determine the category of an online news text

http://demo.api.nlp.nolimitid.com/page/1

Input: Text (String)
Output: Category (Object)

Category Classifier (2)

Category Classifier (3)

- Built using NodeJS (core: https://github.com/anpandu/indonesian-news-category-classifier).
- Tags an article with a category (12 categories in total):
  - Split the article into tokens.
  - Remove stopwords, de-duplicate, etc.
  - Use tf-idf scores as features to train the model (12 scores, 12 features).
- Trained using an SVM (https://github.com/nicolaspanel/node-svm), since a tf-idf score is a float, which is excellent for feature vectors (see the sketch below).
- Tested on the training set: f-measure of 90%.
- Dataset provided by NoLimit, tuples of (article, category), 48000 articles in total.
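A sketch of the per-category feature idea in Python, with scikit-learn's SVC standing in for node-svm (the corpora, category names, and exact scoring are toy assumptions; the production system computes 12 such scores, one per category):

# For each category, score an article by the tf-idf weight its tokens carry
# in that category's training corpus, then feed the scores to an SVM.
import math
from collections import Counter
from sklearn.svm import SVC

corpus = {                      # toy per-category training tokens
    "politik": ["presiden", "pemilu", "partai", "presiden"],
    "olahraga": ["gol", "pemain", "liga", "gol"],
}
categories = sorted(corpus)

def category_score(tokens, cat):
    tf = Counter(corpus[cat])
    def idf(w):
        # words rare across categories count more
        df = sum(1 for c in categories if w in corpus[c])
        return math.log(len(categories) / df) if df else 0.0
    return sum(tf[w] * idf(w) for w in set(tokens))

def features(tokens):
    return [category_score(tokens, c) for c in categories]   # 12 in production

X = [features(doc) for doc in (["presiden", "partai"], ["gol", "liga"])]
y = ["politik", "olahraga"]
clf = SVC().fit(X, y)
print(clf.predict([features(["pemilu", "presiden"])]))   # -> ['politik']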

Topic (1)

Use cases:

1. Identify a predetermined number of topics from a collection of documents (news articles)

2. Classify a single article into a certain topic

Topic (2)

- Built using Python 3.x and its awesome supporting libraries:
  - NumPy (array processing): http://www.numpy.org/
  - TextBlob (NLTK-based text processor): https://textblob.readthedocs.org/en/dev/
  - scikit-learn (machine learning: classification, clustering, quality analysis): http://scikit-learn.org/
  - json (JSON handling), pickle (Python object-to-file serialization)
  - lda (LDA implementation)

Topic (3): Latent Dirichlet Allocation (LDA)

Here be the Leviathan

Indeed, any hope of overcoming him is false; Shall one not be overwhelmed at the sight of him?

-- Job 41:9 NKJV

Latent Dirichlet Allocation (2)

Latent Dirichlet Allocation (3)

Wd,n = n-th word in document d
Zd,n = topic id of the n-th word in document d
θd = distribution of topics in document d
βk = distribution of words in topic k
α = prior weight of a topic in a document
η = prior weight of a word in a topic
N = number of words, D = number of documents, K = number of topics, V = vocabulary size

Generative process:

βk ~ DirichletV(η)
θd ~ DirichletK(α)
Zd,n ~ CategoricalK(θd)
Wd,n ~ CategoricalV(βZd,n)

[Plate diagram: α → θd → Zd,n → Wd,n ← βk ← η, with plates over N (words), D (documents), and K (topics)]
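To make the notation concrete, here is the generative story sampled with NumPy (a toy sketch: the sizes V, K, D, N and the Dirichlet priors below are assumptions, not values from the talk):

import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 6, 2, 3, 5                           # vocab, topics, docs, words per doc

beta = rng.dirichlet(np.full(V, 0.1), size=K)     # βk ~ DirichletV(η)
theta = rng.dirichlet(np.full(K, 0.5), size=D)    # θd ~ DirichletK(α)

docs = []
for d in range(D):
    z = rng.choice(K, size=N, p=theta[d])         # Zd,n ~ CategoricalK(θd)
    w = [rng.choice(V, p=beta[k]) for k in z]     # Wd,n ~ CategoricalV(βZd,n)
    docs.append(w)

print(docs)                                       # D documents of N word ids each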

Latent Dirichlet Allocation (4): in Python

# X = matrix of word counts over the vocabulary for all documents
# num_topics = predetermined number of topics

import lda

model = lda.LDA(num_topics)
model.fit(X)  # the leviathan roars after this function's invocation

topic_word = model.topic_word_
# (array, shape = [n_topics, n_features]) point estimate of the topic-word distributions

doc_topic = model.doc_topic_
# (array, shape = [n_samples, n_topics]) point estimate of the document-topic distributions
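As a follow-up, the fitted matrices can be read directly, e.g. to list the most probable words per topic (vocab here is an assumed list of vocabulary strings aligned with the columns of X):

import numpy as np

for k, dist in enumerate(topic_word):
    top = np.argsort(dist)[-5:][::-1]    # ids of the 5 most probable words in topic k
    print("topic", k, [vocab[i] for i in top])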

Next Project(s)

- Entity extractor + sentiment
- Quotation extraction
- Opinion mining
- Credit card scoring from social media

Challenges

- How to update the dataset/model automatically
- Balancing accuracy vs. speed

Supporting Tools

A. Gerbang-API
B. Demo Master

Gerbang-API

- Built with NodeJS (sailsjs).
- Unifies all APIs into one.
- Actually just a simple app that re-routes endpoints (see the sketch below).
- Next:
  - Move to a simpler framework (restify?)
  - User authorization + tokens
  - Rate limiting
  - Logging
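For illustration only, a gateway that "just re-routes endpoints" can be sketched with the Python standard library (the real Gerbang-API is NodeJS/sailsjs; the route table and ports below are assumptions):

# Map path prefixes to backend APIs and forward the request.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

ROUTES = {                          # hypothetical internal endpoints
    "/entity": "http://localhost:3001",
    "/summary": "http://localhost:3002",
}

class Gateway(BaseHTTPRequestHandler):
    def do_GET(self):
        for prefix, backend in ROUTES.items():
            if self.path.startswith(prefix):
                with urlopen(backend + self.path) as resp:
                    body = resp.read()
                self.send_response(resp.status)
                self.end_headers()
                self.wfile.write(body)
                return
        self.send_error(404)

HTTPServer(("", 8080), Gateway).serve_forever()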

Demo Master

- Small web app for saving documentation + demos.
- Link: http://demo.enrichment.nolimitid.com/
- Next:
  - Upgrade to a 3rd-party framework (probably https://github.com/tripit/slate)

Demo Master (2)

Old Demo Master

Demo Master (3)

Next Demo Master (with slate)

Questions?
