Case study of Rujhaan.com (a social news app)
![Page 1: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/1.jpg)
Case Study of Rujhaan.com
November 2014 Meetup
Rahul Jain
@rahuldausa
![Page 2: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/2.jpg)
About Me…
• Big-data/Search Consultant based out of Hyderabad, India
• Provide consulting services and solutions for Solr, Elasticsearch and other Big Data solutions (Apache Hadoop and Spark)
• Organizer of two Meetup groups in Hyderabad
  • Hyderabad Apache Solr/Lucene
  • Big Data Hyderabad
![Page 3: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/3.jpg)
What does it do?
Rujhaan, which means "#interest", is a news app that aggregates trending #news and #trends, along with the #buzz around them from social media.
It also works as a content discovery tool where users can see information based on their interests (under development).
![Page 4: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/4.jpg)
What I am going to talk about
• Introduction
• Software Stack
  • Crawler
  • Apache Solr
  • MongoDB
  • Redis
• Machine Learning stack
  • Classification
  • Clustering
  • NER
  • POS Tagging
![Page 6: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/6.jpg)
![Page 7: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/7.jpg)
Trends : Arpita Khan
http://www.rujhaan.com/topic/Arpita-Khan.html
![Page 8: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/8.jpg)
Trends : Phil Hughes
http://www.rujhaan.com/topic/Phillip-Hughes.html
![Page 9: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/9.jpg)
Technology Stack
![Page 10: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/10.jpg)
Major challenge:
Response time of 500ms is Critical
![Page 11: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/11.jpg)
High level Flow: Processing
[Diagram: processing pipeline with 11 numbered steps. Components shown: Internet, Fetch, Managed Cache, Parse, MongoDB, HTML Cleaner, Junk/Spam Cleaner (Text), Scoring, Language Detection, Classification/Clustering, Summary (most meaningful text of story), Social Media, Topics Extraction, Apache Solr]
![Page 12: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/12.jpg)
High level Flow: View
[Diagram: request flow with 5 numbered steps. Components shown: Internet, HAProxy, Nginx, Managed Cache, Tomcat (App), Redis, MongoDB, Apache Solr]
![Page 13: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/13.jpg)
Current Traffic Stats
Traffic:
• 16k users/month
• ~38k pageviews/month
• 200k requests/day by 24+ bots
• Traffic growing by 60-70%/month
• Alexa rank: ~211000
![Page 14: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/14.jpg)
Application Stack
• Crawler
• Apache Solr
• MongoDB
• Redis
![Page 15: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/15.jpg)
Crawler
http://www.codeproject.com/Articles/13486/A-Simple-Crawler-Using-C-Sockets
• A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner.
• Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.
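A minimal sketch of the crawl loop described above, not the crawler used by Rujhaan. It assumes a caller-supplied `fetch` function (hypothetical) that returns a page's HTML or `None`; here it is backed by an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting at seed_url.

    fetch is a hypothetical caller-supplied function: URL -> HTML string,
    or None when the page cannot be retrieved.
    Returns a dict mapping each visited URL to its HTML (the stored copy).
    """
    visited = {}
    queued = {seed_url}
    frontier = deque([seed_url])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in queued:
                queued.add(absolute)
                frontier.append(absolute)
    return visited


# In-memory stand-in for the web (made-up pages, for illustration only).
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}
pages = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

A production crawler would also need politeness delays, robots.txt handling and URL normalization, which are left out here.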
![Page 16: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/16.jpg)
How does it work?
http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/Web
![Page 17: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/17.jpg)
Search@ApacheSolr
• Enterprise search platform built on Apache Lucene
• Open source
• Highly reliable, scalable, fault tolerant
• Supports distributed indexing (SolrCloud), replication, and load-balanced querying
• http://lucene.apache.org/solr
![Page 18: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/18.jpg)
High level overview
Source: http://www.slideshare.net/erikhatcher/solr-search-at-the-speed-of-light
![Page 19: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/19.jpg)
Apache Solr - Features
• full-text search
• faceted search (similar to GroupBy clause in RDBMS)
• scalability
– caching
– replication
– distributed search
• near real-time indexing
• geospatial search
• and many more : highlighting, database integration, rich document
(e.g., Word, PDF) handling
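To make the GROUP BY comparison concrete, here is a hedged sketch that builds a Solr facet request URL. `q`, `rows`, `facet`, `facet.field` and `wt` are standard Solr query parameters; the host, the core name `news` and the field name `category_label` are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, field, query="*:*"):
    """Build a Solr /select URL that returns per-value counts for `field`.

    Roughly comparable to:
        SELECT field, COUNT(*) FROM docs GROUP BY field
    """
    params = [
        ("q", query),            # match all documents by default
        ("rows", 0),             # only the facet counts are wanted, no documents
        ("facet", "true"),       # turn faceting on
        ("facet.field", field),  # field whose values are counted
        ("wt", "json"),          # ask for a JSON response
    ]
    return base_url + "/select?" + urlencode(params)


# Hypothetical host and core name.
url = facet_query_url("http://localhost:8983/solr/news", "category_label")
print(url)
```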
![Page 20: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/20.jpg)
Database: #MongoDB
• Document-oriented NoSQL database
• Dynamic schema
• JSON based
• Fast reads and writes
• Well suited to non-relational data
Stats:
• 2 million tweets
• 70k news articles
• ~25GB raw HTML unstructured data
• ~16GB structured data
![Page 21: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/21.jpg)
Why NoSQL
• Large Volume of Data
• Dynamic Schemas
• Auto-sharding
• Replication
• Horizontally Scalable
* Some of the above can be achieved with enterprise-class RDBMS software, but at a very high cost.
![Page 22: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/22.jpg)
Major NoSQL Categories
• Document databases
  • pair each key with a complex data structure known as a document
  • MongoDB
• Graph databases
  • store information about networks, such as social connections
  • Neo4j
Contd.
![Page 23: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/23.jpg)
Major NoSQL Categories
• Key-Value stores
  • every single item in the database is stored as an attribute name (or "key") together with its value
  • Riak, Voldemort, Redis
• Wide-column stores
  • store data together in columns instead of rows
  • Google’s Bigtable, Cassandra and HBase
![Page 24: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/24.jpg)
Sample Record (JSON)
{
"_id" : ObjectId("53f087c69144ca452acadfb0"),
"id" : "7a622c50e95d4debb1376d4f6e2d0a47",
"title" : "Yelp Swings To Profitability In Strong Q2 With $88.8M In Revenue, EPS Of $0.04",
"summary_gs" : "Today after the bell Yelp reported its second-quarter financial performance, including revenue of $88.79 million, and a profit of $0.04 per share. The company had net income of $2.7 million in the period, up from a $878,000 loss in the year-ago quarter. Investors had expected Yelp to lose 3 cents per share on revenue of $86.32 million. The company’s revenue tally for its most recent quarter is up 61 percent on a year-over-year basis. The company also reported strong guidance for its third quarter, with revenues forecasted to land in the $98 to $99 million range. ",
"link" : "http://techcrunch.com/2014/07/30/yelp-swings-to-profitability-in-strong-q2-with-88-8m-in-revenue-eps-of-0-04/",
"category_label" : "business",
"image_url" : "http://tctechcrunch2011.files.wordpress.com/2014/04/yelp-earnings.jpg",
"score" : 38.0,
"boost" : 1.0,
"keywords" : ["news", "yelp", "revenue"]
}
![Page 25: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/25.jpg)
Cache: #Redis
• Advanced in-memory key-value store
• Extremely fast
• Response time on the order of 5-10 ms
• Provides cache behavior (set, get) with advanced data structures like hashes, lists, sets, sorted sets, bitmaps, etc.
• http://redis.io/
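The set/get cache behavior can be sketched with the cache-aside pattern. This is an illustration under stated assumptions, not the application's actual code: `InMemoryCache` is a hypothetical stand-in that mimics only the `get` and `setex` calls of a Redis client so the example runs without a server, and the key format, TTL and backend function are made up.

```python
import json
import time


class InMemoryCache:
    """Hypothetical stand-in for a Redis client; supports get() and setex() only."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # entry has expired
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def get_topic(cache, topic, load_from_backend, ttl_seconds=60):
    """Cache-aside read: try the cache first, fall back to the slower backend."""
    key = "topic:" + topic  # assumed key naming scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    data = load_from_backend(topic)
    cache.setex(key, ttl_seconds, json.dumps(data))
    return data


backend_calls = []


def slow_backend(topic):
    """Placeholder for a database or search query; returns made-up data."""
    backend_calls.append(topic)
    return {"topic": topic, "stories": 3}


cache = InMemoryCache()
first = get_topic(cache, "Arpita-Khan", slow_backend)
second = get_topic(cache, "Arpita-Khan", slow_backend)  # served from the cache
print(first == second, len(backend_calls))
```

Only the first call reaches the backend, which is how a cache helps keep page responses within a tight time budget.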
![Page 26: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/26.jpg)
Machine Learning
• Classification
• Clustering
• NER (Named Entity Recognition)
• Summarization (Relevant text)
• Topics Extraction
![Page 27: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/27.jpg)
ML Workflow
![Page 28: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/28.jpg)
Classification
• Classify a document into a predefined category.
  – For example, news can be classified into business, politics, finance, etc.
• Documents can be text or images.
• A popular choice is the Naive Bayes classifier.
• Steps:
  – Step 1: Train the program (build a model) using a training set labeled with categories, e.g. sports, cricket, news.
  – The classifier computes, for each word, the probability that it makes a document belong to each of the considered categories.
  – Step 2: Test the model against a test data set.
• http://en.wikipedia.org/wiki/Naive_Bayes_classifier
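The two steps above can be sketched as a small multinomial Naive Bayes classifier with Laplace smoothing. This is a minimal illustration in plain Python, not the classifier used in the project; the training sentences and category names are made up.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Step 1: build a model from (text, category) pairs."""
    doc_counts = Counter()              # documents seen per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocabulary = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Made-up training set, for illustration only.
TRAINING_SET = [
    ("shares rise as company reports record revenue", "business"),
    ("quarterly profit beats investor expectations", "business"),
    ("batsman scores century in final test match", "sports"),
    ("team wins the cricket series after close match", "sports"),
]
model = train(TRAINING_SET)

# Step 2: try the model on unseen text.
print(classify("company revenue and profit", model))
print(classify("cricket match tonight", model))
```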
![Page 29: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/29.jpg)
Clustering
• Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
• The groups are not predefined.
• For example, these keywords
  – “man’s shoe”
  – “women’s shoe”
  – “women’s t-shirt”
  – “man’s t-shirt”
  can be clustered into 2 categories: “shoe” and “t-shirt”, or “man” and “women”.
• Popular methods are K-means clustering and hierarchical clustering.
![Page 30: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/30.jpg)
K-means Clustering
http://pypr.sourceforge.net/kmeans.html
• partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
• http://en.wikipedia.org/wiki/K-means_clustering
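A minimal sketch of this procedure (Lloyd's algorithm) on 2-D points. The points and the starting centroids are made up for illustration; fixed starting centroids are used instead of random ones so the run is reproducible.

```python
def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its points; repeat until nothing changes."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance to every centroid.
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(p[0] for p in cluster) / len(cluster)
                mean_y = sum(p[1] for p in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


# Made-up data: two visibly separate groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 10.0)]
centroids, clusters = kmeans(POINTS, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Here k is implied by the number of starting centroids (two).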
![Page 31: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/31.jpg)
Summarization
• Finding the most relevant text related to story/article
• Multiple approaches are possible, with differing accuracy.
• Below is our approach:
[Diagram: flow with 7 numbered steps connecting these boxes: Cleaned Text; Cluster based on stop words; Score each cluster; Find low value cluster; Take highest score cluster; Sentence Extractor; Some more scoring…; Summary text]
* A summary can also be content generated by the computer system, i.e. the story rewritten in its own sentences (out of scope).
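For comparison, here is a sketch of a much simpler extractive technique: scoring each sentence by the average frequency of its non-stop words and keeping the best one. This is a swapped-in baseline, not the cluster-based approach from the diagram; the stop-word list and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, far from complete).
STOP_WORDS = {
    "a", "an", "and", "the", "of", "in", "on", "to", "is", "was",
    "for", "its", "it", "had", "from", "up", "with", "that", "at", "by",
}


def split_sentences(text):
    """Naive split on whitespace that follows ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lower-cased tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    frequency = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        if not tokens:
            return 0.0
        return sum(frequency[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


# Made-up text, for illustration only.
TEXT = "Solr indexes news stories. Solr serves news search quickly. The weather was nice."
print(summarize(TEXT))
```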
![Page 32: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/32.jpg)
POS (Part of Speech) Tagging
• The process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph.
• Traditionally 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection.
• “This is a sample sentence” will be output as
• This/DT is/VBZ a/DT sample/NN sentence/NN
• We use Stanford MaxentTagger
• http://nlp.stanford.edu/software/tagger.shtml
| Number | Tag | Description |
|---|---|---|
| 1 | CC | Coordinating conjunction |
| 2 | CD | Cardinal number |
| 3 | DT | Determiner |
| 7 | JJ | Adjective |
| 8 | JJR | Adjective, comparative |
| 9 | JJS | Adjective, superlative |
| 10 | LS | List item marker |
| 11 | MD | Modal |
| 12 | NN | Noun, singular or mass |
| 13 | NNS | Noun, plural |
| 14 | NNP | Proper noun, singular |
| 15 | NNPS | Proper noun, plural |
| 16 | PDT | Predeterminer |
| 17 | POS | Possessive ending |
| 18 | PRP | Personal pronoun |
| 19 | PRP$ | Possessive pronoun |
| 20 | RB | Adverb |
| 21 | RBR | Adverb, comparative |
| 22 | RBS | Adverb, superlative |
| 23 | RP | Particle |
| 24 | SYM | Symbol |
| 25 | TO | to |
| 26 | UH | Interjection |
| 28 | VBD | Verb, past tense |
| 32 | VBZ | Verb, 3rd person singular present |
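Tagged output in the `word/TAG` form shown above is easy to post-process. The sketch below only parses such a string into pairs and filters by tag prefix; it does not run the Stanford tagger itself, and the function name is hypothetical.

```python
def parse_tagged(tagged_text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        # rpartition splits on the last "/", so words containing "/" survive.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged("This/DT is/VBZ a/DT sample/NN sentence/NN")

# All noun tags (NN, NNS, NNP, NNPS) share the "NN" prefix.
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(pairs)
print(nouns)
```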
![Page 33: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/33.jpg)
NER
• Identifying named entities such as person names, locations, and organizations in a text
• Needs a pre-built trained model.
![Page 34: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/34.jpg)
Machine Learning Stack
• Stanford NER & Tagger
• LingPipe
• OpenNLP
• Carrot2
![Page 36: Case study of Rujhaan.com (A social news app )](https://reader033.vdocuments.us/reader033/viewer/2022052910/559b9a511a28abb3798b4788/html5/thumbnails/36.jpg)
Thanks!
@rahuldausa on Twitter and SlideShare
http://www.linkedin.com/in/rahuldausa
For Solr, Lucene, Elasticsearch, Machine Learning, IR, join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
For Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting edge technologies, join us @ http://www.meetup.com/Big-Data-Hyderabad/