2/7/11
INFSCI 2480
RSS Feeds
Document Filtering
Yi-ling Lin
02/02/2011
Feed? RSS? Atom?
RSS = Rich Site Summary
RSS = RDF (Resource Description Framework) Site Summary
RSS = Really Simple Syndication
Atom
Feeds
Feed = “A document (often XML-based) which contains content items, often summaries of stories or weblog posts with web links to longer versions”
Feed > RSS, Atom (RSS and Atom are specific feed formats)
RSS 2.0
RSS 0.92
RSS 0.91
RSS 1.0
Atom
RSS Versions
Version distribution collected by an RSS search engine (Feb 2010)
Most to least common: 2.0 > 1.0 > 0.91 > 0.92
http://www.syndic8.com/stats.php?Section=rss#tabtable
Comparison of RSS versions
(O = yes, X = no)
Feature | RSS 0.91 | RSS 0.92 | RSS 2.0
Categories on channel or item | X | O | O
Elements on the channel: language, copyright, docs, lastBuildDate, managingEditor, pubDate, rating, skipDays, skipHours, generator, ttl | X | X | O
Item enclosures | X | O | O
Elements on items: authors, comments, pubDate | X | X | O
Item count limitation | 15 | X | X
Notes | Channel-level metadata only | Allows both channel and item metadata | Modularized
Revealing RSS in Web pages
RSS content structure
RSS 0.90 to 2.0 family
XML
<channel> and <item> parts
Feed information (channel)
Each article's content (item)
Additional features with higher versions (0.90 to 2.0)
RSS 1.0 and Atom use different formats
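The channel/item split described above can be sketched with Python's standard library. This is only an illustration: the sample feed, its URLs, and the helper name `read_channel` are made up here and are not from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal RSS 2.0 document, invented for this sketch.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>A made-up feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>Summary of the first post</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <description>Summary of the second post</description>
    </item>
  </channel>
</rss>"""


def read_channel(xml_text):
    """Return (channel title, [(item title, item link), ...])."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")           # feed information
    title = channel.findtext("title")
    items = [(item.findtext("title"), item.findtext("link"))
             for item in channel.findall("item")]  # one entry per article
    return title, items


feed_title, feed_items = read_channel(SAMPLE_RSS)
print(feed_title)
for item_title, item_link in feed_items:
    print(" ", item_title, item_link)
```

This only covers the RSS 0.9x/2.0 layout. RSS 1.0 and Atom use XML namespaces and a different element layout, so the same paths would not match them.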
RSS 0.92
RSS 2.0
RSS 1.0 “uses RDF” http://www.w3.org/RDF/
Atom
In more detail...
Specifications
RSS 0.91: http://www.rssboard.org/rss-0-9-1-netscape
RSS 2.0: http://cyber.law.harvard.edu/rss/rss.html
Parsing RSS Feeds
Problem: extract text from the RSS structure
Feeds are XML
Parsers
SAX
DOM
Out-of-the-box parser
SAX and DOM
SAX (Simple API for XML): serial access parser
Stream of XML data goes in
Event-driven parsing
DOM (Document Object Model)
Uses the hierarchical (tree) structure of the document for parsing
SAX Example!
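The original slide image is not in this transcript. As a stand-in, here is a minimal sketch of event-driven parsing with Python's `xml.sax`; the sample feed and the handler name `ItemTitleHandler` are assumptions made for illustration.

```python
import xml.sax

# Hypothetical feed fragment, invented for this sketch.
SAMPLE = b"""<rss version="2.0"><channel><title>Example Feed</title>
<item><title>First post</title></item>
<item><title>Second post</title></item>
</channel></rss>"""


class ItemTitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> found inside an <item>."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.in_title = False
        self.buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == "item":
            self.in_item = True
        elif name == "title" and self.in_item:
            self.in_title = True
            self.buffer = []

    def characters(self, content):
        # May be called several times for one text node, so accumulate.
        if self.in_title:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "title" and self.in_title:
            self.titles.append("".join(self.buffer))
            self.in_title = False
        elif name == "item":
            self.in_item = False


handler = ItemTitleHandler()
xml.sax.parseString(SAMPLE, handler)
sax_titles = handler.titles
print(sax_titles)
```

The parser never builds a tree. The handler has to track its own state (`in_item`, `in_title`) as events stream past, which is why the channel-level title is skipped here.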
DOM Example!
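The original slide image is not in this transcript. As a stand-in, here is a minimal sketch of tree-based parsing with Python's `xml.dom.minidom`; the sample feed and the function name `item_titles` are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical feed fragment, invented for this sketch.
SAMPLE = """<rss version="2.0"><channel><title>Example Feed</title>
<item><title>First post</title></item>
<item><title>Second post</title></item>
</channel></rss>"""


def item_titles(xml_text):
    """Load the whole document into a tree, then walk it for item titles."""
    dom = minidom.parseString(xml_text)
    titles = []
    for item in dom.getElementsByTagName("item"):
        title_node = item.getElementsByTagName("title")[0]
        titles.append(title_node.firstChild.data)  # text inside <title>
    return titles


dom_titles = item_titles(SAMPLE)
print(dom_titles)
```

Because the full tree is in memory, the code can navigate from each item to its children directly, at the cost of holding the entire document at once.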
Ready-made Parser
Universal Feed Parser <http://www.feedparser.org>
Core Attributes
Normalizes RSS/Atom syntax into common attribute names
However, not always
updated (taken from any of the following)
/atom10:feed/atom10:updated
/atom03:feed/atom03:modified
/rss/channel/pubDate
/rss/channel/dc:date
/rdf:RDF/rdf:channel/dc:date
/rdf:RDF/rdf:channel/dcterms:modified
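To make the idea of normalization concrete, the sketch below looks for an "updated" value in several places and returns the first one found. It uses only the standard library and covers just three of the paths listed above. It is not the Universal Feed Parser's own code, and the names `CANDIDATES` and `find_updated` are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# (expected root tag, path below the root), tried in this order.
CANDIDATES = [
    ("{http://www.w3.org/2005/Atom}feed", "atom:updated"),
    ("rss", "channel/pubDate"),
    ("rss", "channel/dc:date"),
]


def find_updated(xml_text):
    """Return the first non-empty 'updated'-like value, or None."""
    root = ET.fromstring(xml_text)
    for root_tag, path in CANDIDATES:
        if root.tag != root_tag:
            continue
        value = root.findtext(path, namespaces=NS)
        if value:
            return value.strip()
    return None


# Made-up sample documents, one per supported path.
ATOM = ('<feed xmlns="http://www.w3.org/2005/Atom">'
        '<updated>2011-02-02T10:00:00Z</updated></feed>')
RSS = ('<rss version="2.0"><channel>'
       '<pubDate>Wed, 02 Feb 2011 10:00:00 GMT</pubDate></channel></rss>')
RSS_DC = ('<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">'
          '<channel><dc:date>2011-02-02</dc:date></channel></rss>')

for sample in (ATOM, RSS, RSS_DC):
    print(find_updated(sample))
```

Note that the returned strings are still in different date formats, which is where a ready-made parser's date parsing becomes useful.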
Advanced features
Date parsing
HTML sanitization
Content normalization
Namespace handling
and more...
Document classification
Probability Calculation
Pr(word|classification)
Ex. Pr(“drug”|spam) = 80 spam docs containing “drug” / 100 spam docs in total = 0.8
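The calculation above can be sketched with two count dictionaries. The names `fc` (feature counts) and `cc` (category counts) follow the dictionaries mentioned later in these slides; the functions `train` and `fprob` and the training data are assumptions made for illustration.

```python
from collections import defaultdict

fc = defaultdict(lambda: defaultdict(int))  # word -> category -> number of docs
cc = defaultdict(int)                       # category -> number of docs


def train(words, category):
    """Record one document: each distinct word counts once per document."""
    for word in set(words):
        fc[word][category] += 1
    cc[category] += 1


def fprob(word, category):
    """Pr(word|category) = docs in category containing word / docs in category."""
    if cc[category] == 0:
        return 0.0
    return fc[word][category] / cc[category]


# Reproduce the slide's numbers: 100 spam docs, 80 of them contain "drug".
for i in range(100):
    train(["drug", "offer"] if i < 80 else ["offer"], "spam")

print(fprob("drug", "spam"))
```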
Weighted Probability
Doc1[… money …](s), Doc2[… money …](s), Doc3[… money …](s), Doc4[……](s), Doc5[……](ns)
(s) = spam, (ns) = no-spam
Pr(“money”|spam) = 3/4 = 0.75
Pr(“money”|no-spam) = 0/1 = 0
Pr = 0.5 (we don’t know) may be a better starting point than Pr = 0 (never)
Ex. After finding one spam instance
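The slides do not give a formula for the weighted probability. One common form, shown here as an assumption, blends the observed probability with an assumed prior of 0.5 and lets the observed value dominate as the word is seen more often. The function name, `weight`, and `assumed_prob` are hypothetical.

```python
def weighted_prob(basic_prob, total_count, weight=1.0, assumed_prob=0.5):
    """Weighted average of an assumed prior and the observed probability.

    basic_prob  -- observed Pr(word|category)
    total_count -- number of docs (all categories) in which the word appeared
    weight      -- how many observations the assumed prior is worth (assumption)
    """
    return (weight * assumed_prob + total_count * basic_prob) / (weight + total_count)


# "money" appeared in 3 documents overall in the example above.
print(weighted_prob(0.75, 3))  # spam side
print(weighted_prob(0.0, 3))   # no-spam side: small, but no longer exactly 0
print(weighted_prob(0.0, 0))   # never-seen word: falls back to the prior
print(weighted_prob(1.0, 1))   # word seen once, in a spam document
```

With these settings a word that has never been seen gets 0.5, and each new observation pulls the estimate away from 0.5 toward the observed ratio.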
Naive Bayesian Classifier
Goal = Pr(Category|Document)
Ex. Pr(Spam|Doc1) = 0.001, Pr(No-spam|Doc1) = 0.5 → Doc1 = No-spam
What we have = Pr(Feature|Category)
Process = Pr(Feature|Category) → Pr(Document|Category) → Pr(Category|Document)
Pr(Document|Category)
Pr(Document|Category) = Pr(Feature1|Cat) * Pr(Feature2|Cat) * Pr(Feature3|Cat) * … * Pr(FeatureN|Cat)
Pr(A ^ B) = Pr(A) * Pr(B)
Assumption: A and B are independent of each other
Not true in practice: e.g. social vs. Web, social vs. probability
But still useful
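Under the independence assumption, Pr(Document|Category) is just a product over the document's features. A minimal sketch follows; the function name `doc_prob` and the per-word probabilities are made up for illustration.

```python
def doc_prob(features, feature_probs):
    """Multiply Pr(feature|category) over all features of one document."""
    p = 1.0
    for feature in features:
        p *= feature_probs[feature]
    return p


# Hypothetical Pr(word|spam) values.
spam_probs = {"money": 0.75, "free": 0.5, "meeting": 0.25}

p_doc_given_spam = doc_prob(["money", "free"], spam_probs)
print(p_doc_given_spam)
```

A single feature with probability 0 drives the whole product to 0, which is one reason a weighted (smoothed) feature probability is preferred over the raw ratio.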
Pr(Category|Document)
Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)
Thomas Bayes
Pr(Category|Document) = Pr(Document|Category) * Pr(Category) / Pr(Document)
Pr(Category) = # of docs in Cat / total # of docs
Pr(Document) = constant (the same for every category, so it can be ignored when comparing categories)
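Since Pr(Document) is the same for every category, it is enough to compare Pr(Document|Category) * Pr(Category). The sketch below does that and also shows that normalizing afterwards does not change the ranking. The function names and the input numbers are assumptions for illustration.

```python
def category_scores(doc_given_cat, docs_per_cat):
    """Return Pr(Document|Cat) * Pr(Cat) for every category (unnormalized)."""
    total_docs = sum(docs_per_cat.values())
    scores = {}
    for cat, likelihood in doc_given_cat.items():
        prior = docs_per_cat[cat] / total_docs  # Pr(Category)
        scores[cat] = likelihood * prior
    return scores


def normalize(scores):
    """Divide by the sum over categories so the values add up to 1.

    Assumes at least one score is non-zero.
    """
    total = sum(scores.values())
    return {cat: score / total for cat, score in scores.items()}


# Hypothetical inputs: 4 spam and 1 no-spam training documents.
scores = category_scores({"spam": 0.375, "no-spam": 0.05},
                         {"spam": 4, "no-spam": 1})
posteriors = normalize(scores)
best = max(scores, key=scores.get)
print(scores, posteriors, best)
```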
Choosing a Category
Take the one with the highest probability
What if Pr(Spam|Doc) = 0.000001 and Pr(No-spam|Doc) = 0.0000005?
Answer may be “Not sure”
Thresholding
If Pr(Spam|Doc) > 3 * Pr(No-spam|Doc),
then spam
→ which is more reasonable
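The thresholding rule can be sketched as follows. The threshold of 3 for spam comes from the slide; the function name `classify`, the default threshold of 1.0 for other categories, and the "unknown" label are assumptions.

```python
def classify(scores, thresholds, default="unknown"):
    """Pick the best category only if it beats every other one by its threshold."""
    best = max(scores, key=scores.get)
    threshold = thresholds.get(best, 1.0)
    for cat, score in scores.items():
        if cat == best:
            continue
        # The winner must be strictly greater than threshold * runner-up.
        if score * threshold >= scores[best]:
            return default
    return best


thresholds = {"spam": 3.0}

# The slide's borderline case: spam is only 2x more likely, so "not sure".
print(classify({"spam": 0.000001, "no-spam": 0.0000005}, thresholds))
# A clear case: spam is far more than 3x more likely.
print(classify({"spam": 0.3, "no-spam": 0.01}, thresholds))
```

Setting a higher threshold for spam than for no-spam reflects the idea that wrongly filtering a good document is usually the more costly mistake.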
Persisting Trained Classifier
Classifier in Python
Dictionaries in memory: fc, cc
They disappear after quitting the Python interpreter
Should be saved to disk
MySQL: client/server RDBMS
SQLite: file-based RDBMS
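A minimal sketch of the SQLite option, using Python's built-in `sqlite3` module. The table layout, the column names, and the functions `save_fc` and `load_fc` are assumptions made for illustration; only the `fc` dictionary name comes from the slides.

```python
import sqlite3


def save_fc(conn, fc):
    """Write feature counts (word -> category -> count) to a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fc ("
        "feature TEXT, category TEXT, n INTEGER, "
        "PRIMARY KEY (feature, category))"
    )
    rows = [(feature, category, n)
            for feature, cats in fc.items()
            for category, n in cats.items()]
    conn.executemany("INSERT OR REPLACE INTO fc VALUES (?, ?, ?)", rows)
    conn.commit()


def load_fc(conn):
    """Rebuild the nested dictionary from the table."""
    fc = {}
    for feature, category, n in conn.execute(
            "SELECT feature, category, n FROM fc"):
        fc.setdefault(feature, {})[category] = n
    return fc


# ":memory:" keeps the sketch self-contained; pass a file name such as
# "classifier.db" (hypothetical) to keep the data between sessions.
conn = sqlite3.connect(":memory:")
save_fc(conn, {"money": {"spam": 3}, "meeting": {"no-spam": 1}})
restored = load_fc(conn)
print(restored)
```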
Python shelve
Put/get any picklable Python object into/from disk files
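A minimal sketch of saving and reloading the two dictionaries with `shelve`. The keys "fc" and "cc" mirror the dictionary names used in these slides; the file name and the helper functions are assumptions made for illustration.

```python
import os
import shelve
import tempfile


def save_counts(path, fc, cc):
    """Store both dictionaries under fixed keys in a shelf file."""
    with shelve.open(path) as db:
        db["fc"] = fc
        db["cc"] = cc


def load_counts(path):
    """Read both dictionaries back from the shelf file."""
    with shelve.open(path) as db:
        return db["fc"], db["cc"]


def roundtrip(fc, cc):
    """Save to a temporary location and load again, to show nothing is lost."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "classifier")  # hypothetical file name
        save_counts(path, fc, cc)
        return load_counts(path)


fc_loaded, cc_loaded = roundtrip({"money": {"spam": 3}},
                                 {"spam": 4, "no-spam": 1})
print(fc_loaded, cc_loaded)
```

In real use the shelf would live at a fixed path rather than in a temporary directory, so the trained counts survive after the interpreter exits.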
Alternative Methods
Supervised learning methods
Neural network
Support Vector Machine
Decision Tree
Software packages
Weka, R, SPSS Clementine, etc.
Weka Example
Example Data
Weather condition → To play or not to play?
4 attributes, 1 class variable