Post on 21-Jan-2018
NoLimit Research Stack
Tech Talk - March 11th 2016
Contents
I. Overview
II. APIs
  A. Entity Extractor
  B. Summarizer
  C. Category Classifier
  D. Topic
  E. Next Project(s)
III. Supporting Tools
  A. Gerbang-API
  B. Demo Master
Overview - Introduction
The NoLimit Research Team is responsible for developing internal NLP APIs for Bahasa Indonesia (Ananta Pandu & Anggrahita Bayu).
Currently, the APIs are:
A. Entity Extractor
B. Summarizer
C. Category Classifier
D. Topic
Overview - The Architecture
[Diagram: four APIs, each exposed as a web service — Entity Extractor, Summarizer, Category Classifier, Topic Classifier — with two supporting tools, Gerbang-API and Demo Master, in front of them]
Overview - The Architecture (2)
[Diagram: each API is wrapped as a NodeJS web service; the cores are Entity Extractor in Scala, Summarizer and Category Classifier in NodeJS, and Topic Classifier in Python; Gerbang-API is NodeJS, Demo Master is ReactJS]
The APIs
A. Entity Extractor
B. Summarizer
C. Category Classifier
D. Topic Classifier
Link: http://demo.api.nlp.nolimitid.com/
Entity Extractor
Get entities from an online news text
http://demo.api.nlp.nolimitid.com/page/3
Input — Text: String
Output — Entities: Object (City, Country, Company, Event, JobTitle, Organization, Person, Product)
Entity Extractor (2)
Entity Extractor (3)
- Built using Scala.
- Extracts entities from text using an HMM (https://en.wikipedia.org/wiki/Hidden_Markov_model).
- Dataset (news + entities) provided by NoLimit, 3000 articles in total.
- Previously built using Java + Weka. Changed from simple per-token classification to an HMM, because HMMs are known to recognize sequence patterns better than simple classification.
Entity Extractor - V1.0
V1.0 built using Java + Weka
- Tags each token with simple classification via the Weka API.
- Good, because:
  - Weka makes experimenting easier.
- Bad, because:
  - Java 1.7 syntax for higher-order functions is worse than NodeJS.
  - Simple classification < HMM.
  - Confusing Weka API.
  - Weka's models have large file sizes.
Entity Extractor - V2.0
V2.0 built using NodeJS
- Implements the HMM in NodeJS.
- Good, because:
  - JSON as literal objects & arrays.
  - Libraries (lodash, bluebird, nalapa https://github.com/anpandu/nalapa).
  - Easy to deploy (npm install + pm2 start).
- Bad, because:
  - NodeJS apps run in a single thread.
  - Had to create multiple instances as micro-services, which is too complicated for a single endpoint.
Entity Extractor - V3.0 (current)
V3.0 built using Scala
- Implements the HMM in Scala (actually just a port of the code to Scala).
- Good, because:
  - Simpler multithreading than NodeJS (parallel map, Akka, etc.).
  - Safer thanks to native immutability (var vs val).
  - Statically typed languages are generally faster than dynamically typed ones.
- Bad, because:
  - Fewer libraries.
  - Longer test times.
Entity Extractor (4)
Next:
- Updateable dataset
- Scheduled re-training (realtime?)
Entity Extractor - Conclusion
- An HMM is better than simple classification because predicting the label of each token in a text is a pattern-recognition problem: an HMM models the sequence of tokens and their labels.
- Developing the entity extractor in Scala is the best solution so far because it makes parallel computation easier to implement than in NodeJS, and performance is better too.
Summarizer
Get shorter version of an online news text
http://demo.api.nlp.nolimitid.com/page/2
Input — Text: String
Output — Summary: String
Summarizer (2)
Summarizer (3)
- Built using NodeJS (core: https://github.com/anpandu/nodejs-text-summarizer).
- Gets a summary from a text:
  - Split the text into sentences.
  - Score each sentence using Word Form Similarity, Word Order Similarity, and Word Semantic Similarity (P. y. Zhang 2009).
  - Take the n best sentences to form the summary.
- No dataset; the scoring formula is coded straight from the paper (P. y. Zhang 2009).
Summarizer - Preprocess
- Split text into sentences.
- Split sentences into tokens.
- Remove stopwords, replace non-ASCII characters, etc.
Summarizer - Scoring
- The best sentences for a summary are the ones most similar to the other sentences.
- Score each sentence with:
  - Word Form Similarity (WFS)
  - Word Order Similarity (WOS)
  - Word Semantic Similarity (WS)
- The WFS, WOS, and WS formulas are in (P. y. Zhang 2009).
- Sum the scores (SUM = a*WFS + b*WOS + c*WS; a+b+c=1).
- Take the best n sentences.
- Sort them by their original order and join.
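The scoring-and-selection loop can be sketched as below. This is only the skeleton: the similarity function here is a simplified token-overlap stand-in (so b and c default to 0), not the actual WFS/WOS/WS formulas from (P. y. Zhang 2009).

```python
# Toy summarizer skeleton: score each sentence by its average similarity
# to the others, pick the n best, then restore original order.
# word_form_sim is a simplified stand-in, NOT the Zhang 2009 formula.

def word_form_sim(s1, s2):
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def summarize(sentences, n=2, a=1.0, b=0.0, c=0.0):
    # SUM = a*WFS + b*WOS + c*WS with a+b+c=1; only WFS is sketched here.
    scores = []
    for i, s in enumerate(sentences):
        others = [x for j, x in enumerate(sentences) if j != i]
        wfs = sum(word_form_sim(s, o) for o in others) / max(len(others), 1)
        scores.append((a * wfs, i))
    # take the n highest-scoring sentences, then re-sort by position
    best = sorted(sorted(scores, reverse=True)[:n], key=lambda p: p[1])
    return [sentences[i] for _, i in best]

print(summarize(["the cat sat", "the cat ran", "dogs bark loudly"], n=2))
# → ['the cat sat', 'the cat ran']
```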
Category Classifier
Determine the category of an online news text
http://demo.api.nlp.nolimitid.com/page/1
Input — Text: String
Output — Category: Object
Category Classifier (2)
Category Classifier (3)
- Built using NodeJS (core: https://github.com/anpandu/indonesian-news-category-classifier).
- Tags an article with a category (12 categories in total):
  - Split the article into tokens.
  - Remove stopwords, unique-ify, etc.
  - Use tf-idf scores as features to train the model (12 scores, 12 features).
- Trained using an SVM (https://github.com/nicolaspanel/node-svm), since a tf-idf score is a float, which fits well into a feature vector.
- Tested on the training set: f-measure of 90%.
- Dataset provided by NoLimit, tuples of (article, category), 48000 articles in total.
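One way to read the "12 tf-idf scores, 12 features" design is: treat each category's training articles as one big document, and score a new article against every category document. The sketch below does that in plain Python with 3 toy categories instead of 12; the data and the exact feature construction are my illustration, not the production NodeJS code.

```python
import math
from collections import Counter

# Toy category documents (in production: concatenated training articles).
category_docs = {
    "sport":    "gol bola pemain liga skor",
    "politics": "presiden partai pemilu menteri dpr",
    "tech":     "aplikasi internet teknologi ponsel data",
}

def tfidf_features(article_tokens, category_docs):
    """One tf-idf score per category = the article's feature vector."""
    n_docs = len(category_docs)
    features = {}
    for cat, doc in category_docs.items():
        doc_tokens = doc.split()
        tf = Counter(doc_tokens)
        score = 0.0
        for tok in article_tokens:
            if tok not in tf:
                continue
            # document frequency across the category documents
            df = sum(1 for d in category_docs.values() if tok in d.split())
            idf = math.log(n_docs / df)
            score += (tf[tok] / len(doc_tokens)) * idf
        features[cat] = score
    return features

feats = tfidf_features("pemain mencetak gol di liga".split(), category_docs)
print(max(feats, key=feats.get))
# → sport
```

In the real pipeline these per-category scores become the input vector for the SVM rather than being argmax-ed directly.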
Topic (1)
Use cases:
1. Identify a predetermined number of topics from a collection of documents (news articles)
2. Classify a single article into a certain topic
Topic (2)
- Built using Python 3.x and its awesome supporting libraries:
  - Numpy (array processing): http://www.numpy.org/
  - TextBlob (NLTK-based text processor): https://textblob.readthedocs.org/en/dev/
  - Scikit-Learn (machine learning: classifiers, clustering, quality analysis): http://scikit-learn.org/
  - JSON (JSON handling), Pickle (Python object-to-file)
  - LDA
Topic (3): Latent Dirichlet Allocation (LDA)
Here be the Leviathan
Indeed, any hope of overcoming him is false; Shall one not be overwhelmed at the sight of him?
-- Job 41:9 NKJV
Latent Dirichlet Allocation (2)
Latent Dirichlet Allocation (3)
Notation:
- W_{d,n} = n-th word in document d
- Z_{d,n} = topic id of the n-th word in document d
- θ_d = distribution of topics in document d
- β_k = probability distribution of words in topic k
- α = prior weight of a topic in a document
- η = prior weight of a word in a topic
- N = number of words, D = number of documents, K = number of topics
Generative process:
- β_k ~ Dirichlet_V(η)
- θ_d ~ Dirichlet_K(α)
- z ~ Categorical_K(θ_d)
- w ~ Categorical_V(β_z)
[Plate diagram: α → θ_d → Z_{d,n} → W_{d,n} ← β_k ← η, with plates N, D, K]
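The generative process can be written out directly in a few lines of Numpy. This samples toy documents from the model (it does not do inference); all sizes and hyperparameters are made up for illustration.

```python
import numpy as np

# LDA's generative story, straight from the notation:
#   beta_k  ~ Dirichlet_V(eta)      topic-word distributions
#   theta_d ~ Dirichlet_K(alpha)    per-document topic mixture
#   z       ~ Categorical_K(theta_d)
#   w       ~ Categorical_V(beta_z)

rng = np.random.default_rng(0)
V, K, D, N = 6, 2, 3, 5        # vocab size, topics, documents, words/doc
alpha, eta = 0.5, 0.5

beta = rng.dirichlet([eta] * V, size=K)      # shape (K, V)
docs = []
for d in range(D):
    theta = rng.dirichlet([alpha] * K)       # topic mix of document d
    words = []
    for n in range(N):
        z = rng.choice(K, p=theta)           # topic of the n-th word
        w = rng.choice(V, p=beta[z])         # word drawn from that topic
        words.append(w)
    docs.append(words)

print(len(docs), len(docs[0]))
# → 3 5
```

Inference (what `lda.LDA` does) runs this story backwards: given only the words, it estimates β and θ.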
Latent Dirichlet Allocation (4): in Python
# X = matrix of word counts for the vocabulary across all documents
# num_topics = predetermined number of topics
import lda

model = lda.LDA(n_topics=num_topics)
model.fit(X)  # the leviathan roars after this function's invocation

topic_word = model.topic_word_  # array, shape [n_topics, n_features]: point estimate of the topic-word distributions
doc_topic = model.doc_topic_  # array, shape [n_samples, n_topics]: point estimate of the document-topic distributions
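Topics are usually inspected by listing the highest-weighted words in each row of topic_word_. The matrix and vocabulary below are made-up stand-ins for a fitted model's output, just to show the pattern.

```python
# Pretend this came from model.topic_word_ after fitting.
vocab = ["gol", "liga", "presiden", "partai", "data", "aplikasi"]
topic_word = [
    [0.40, 0.35, 0.05, 0.05, 0.10, 0.05],  # looks like a sports topic
    [0.05, 0.05, 0.40, 0.35, 0.05, 0.10],  # looks like a politics topic
]

def top_words(dist, vocab, n=2):
    """Return the n highest-weighted words of one topic row."""
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    return [vocab[i] for i in order[:n]]

for k, dist in enumerate(topic_word):
    print(f"topic {k}: {top_words(dist, vocab)}")
# → topic 0: ['gol', 'liga']
# → topic 1: ['presiden', 'partai']
```

LDA only returns distributions; putting a human-readable name on each topic is a manual step.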
Next Project(s)
- Entity extractor + sentiment
- Quotation extraction
- Opinion mining
- Credit card scoring from social media
Challenges
- How to make dataset/model updates happen automatically
- Balancing accuracy vs. speed
Supporting Tools
A. Gerbang-API
B. Demo Master
Gerbang-API
- Built with NodeJS (sailsjs).
- Unifies all APIs into one.
- Actually just a simple app that re-routes endpoints.
- Next:
  - Move to a simpler framework (restify?)
  - User authorization + tokens
  - Rate limiting
  - Logging
Demo Master
- Small web app for documentation + demos.
- Link: http://demo.enrichment.nolimitid.com/
- Next:
  - Upgrade to a 3rd-party framework (probably https://github.com/tripit/slate)
Demo Master (2)
Old Demo Master
Demo Master (3)
Next Demo Master (with slate)
Questions?