Information surprise or how to find interesting data


Page 1: Information surprise or how to find interesting data

Surprise!

Information Surprise or how to discover data

Oleksandr Pryymak /Sasha/ @opryymak [email protected]

Page 2: Information surprise or how to find interesting data

What is a ‘surprise’?

Page 3: Information surprise or how to find interesting data

Define Surprise!

surprise

[countable] an event, a piece of news, etc. that is unexpected or that happens suddenly. SYNONYMS: shock, … , eye-opener

[uncountable, countable] a feeling caused by something happening suddenly or unexpectedly. SYNONYMS: astonishment, ...

(Oxford Advanced Learner's Dictionary)

Page 4: Information surprise or how to find interesting data

Cat explores

Page 5: Information surprise or how to find interesting data

Cat explores

meh

Page 6: Information surprise or how to find interesting data

Cat meets unexpected

Page 7: Information surprise or how to find interesting data

Cat meets unexpected

wow

Page 8: Information surprise or how to find interesting data

Quantify Surprise!

measured in … wows?

Page 9: Information surprise or how to find interesting data

Quantify: complexity can measure any content type. Note: complex is not random!

Measures of complexity:
1. Subjective rating
2. # Distinct elements
3. # Dimensions
4. # Control parameters
5. Minimal description
6. Information content
7. Minimal generator
8. Minimum energy

Abdallah, S., & Plumbley, M. (2009). Information dynamics: patterns of expectation and surprise in the perception of music. Connection Science, 21(2-3), 89-117.


Page 10: Information surprise or how to find interesting data

Surprise Quants in academia

Neuro/Cognitive Science: How do we perceive information?

Machine Learning: How do we measure differences?

Page 11: Information surprise or how to find interesting data

"... machine that constantly tells you what you already know is just irritating. So software alerts users only to surprises..."

Horvitz, E., Apacible, J., Sarin, R., & Liao, L. Prediction, Expectation, and Surprise: Methods, Designs, and Study of a Deployed Traffic Forecasting Service.

Friston, K. (2010). The free-energy principle: a unified brain theory?. Nature Reviews Neuroscience, 11(2), 127-138.

Surprise Quants in academia

Neuro/Cognitive Science: How do we perceive information?

Machine Learning: How do we measure differences?

Page 12: Information surprise or how to find interesting data

Machine Learning & Neuro/Cognitive Science

Surprise Quants in academia

Itti, L., & Baldi, P. F. (2005). Bayesian surprise attracts human attention. In Advances in neural information processing systems (pp. 547-554).

Page 13: Information surprise or how to find interesting data

Surprise Quants in academia

Itti, L., & Baldi, P. F. (2005). Bayesian surprise attracts human attention. In Advances in neural information processing systems (pp. 547-554).

meh

wow

meh

Page 14: Information surprise or how to find interesting data

Typical ML applications: Unsupervised Learning

1. Decision trees (inf. gain)
2. MaxEnt principle
3. ...

Specifically after ‘surprise’:
4. One-class classification
5. Anomaly detection
6. Novelty measure

Pimentel, M. A., Clifton, D. A., Clifton, L., & Tarassenko, L. (2014). A review of novelty detection. Signal Processing, 99, 215-249.

Page 15: Information surprise or how to find interesting data

Model of a cat [diagram]: the Data (stream) feeds an Element (attention window); the Data Model (expectations) asks: Surprising? (interesting, new). If wow: act and Update the model. If meh: ignore.

Page 16: Information surprise or how to find interesting data

Model of a cat’s surprise

Surprising? (interesting, new)

Page 17: Information surprise or how to find interesting data

Quantify surprisal /self-information/

The surprise /information/ in observing the occurrence of an event having probability p.

Axioms: I(p) ≥ 0; I(1) = 0; the less probable the event, the larger the surprise (p1 ≤ p2 ⇒ I(p1) ≥ I(p2)); I(p1 · p2) = I(p1) + I(p2) for independent events.

Derive: I(p) = −log2(p), measured in bits (or wows).

Flipping a fair coin provides 1 bit of new information: I(1/2) = −log2(1/2) = 1 bit.
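A minimal sketch of surprisal in Python; the function name and the example probabilities are illustrative, not taken from the slides:

```python
import math

def surprisal(p):
    """Self-information of an event with probability p, in bits ("wows")."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)

print(surprisal(0.5))       # 1.0 -- a fair coin flip carries 1 bit
print(surprisal(1 / 1024))  # 10.0 -- rarer events are more surprising
```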

Page 18: Information surprise or how to find interesting data

Surprisal applications: selecting an information source.

Oleksandr Pryymak. Achieving Accurate Opinion Consensus in Large Multi-Agent Systems. University of Southampton, Doctoral Thesis, 170pp., 2013.

Page 19: Information surprise or how to find interesting data

Model of a cat [diagram, repeated]: the Data (stream) feeds an Element (attention window); the Data Model (expectations) asks: Surprising? (interesting, new). If wow: act and Update the model. If meh: ignore.

Page 20: Information surprise or how to find interesting data

Model of a cat’s knowledge: the Data Model (expectations).

Page 21: Information surprise or how to find interesting data

Quantify ‘knowledge’ /entropy/

The Shannon entropy is the expected value of the self-information:
H(X) = −Σ p(x) log2 p(x), with a maximum of log2(n) for n equally likely outcomes.

Notes:
1. The maximum-entropy distribution is the least informative one.
2. Statistical-mechanics entropy and information entropy are essentially the same quantity.

Entropy of a Bernoulli trial, X ∈ {0,1}: H = −p log2 p − (1−p) log2(1−p), maximal at p = 1/2.
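A minimal entropy sketch in the same spirit; the function name and the example distributions are illustrative, not the talk's code:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: the expected surprisal over a distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25] * 4))   # 2.0  -- uniform over n outcomes gives log2(n)
print(entropy([0.5, 0.5]))   # 1.0  -- Bernoulli entropy peaks at p = 0.5
print(entropy([0.9, 0.1]))   # ~0.47 -- predictable outcomes carry less
```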

Page 22: Information surprise or how to find interesting data

Entropy applications: analysis of a GeoIP ISP database binary.

Analyzing unknown binary files using information entropy: http://yurichev.com/blog/entropy/

Page 23: Information surprise or how to find interesting data

Entropy applications: visualizing the OS X ksh binary (see binvis.io).

Visualizing entropy in binary files: http://corte.si/posts/visualisation/entropy/index.html

1, 2: cryptic signature

Page 24: Information surprise or how to find interesting data

Model of a cat’s discovery [diagram]: the same loop, Element (attention window) → Surprising? (interesting, new) against the Data Model (expectations) → wow (act) or meh (ignore), now asking: what has changed?

Page 25: Information surprise or how to find interesting data

Quantify ‘discovery’ /information gain/

The Kullback–Leibler divergence /relative entropy, information gain/ is a measure of the information lost when Q is used to approximate P: the expected number of extra bits required to recode samples from P using a code optimized for Q,

DKL(P || Q) = Σ P(x) log2( P(x) / Q(x) ).

("KL-Gauss-Example" by T. Nathan Mundhenk)

Not a true metric: it is asymmetric, DKL(P || Q) ≠ DKL(Q || P).
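A hedged sketch of the KL divergence in Python, illustrating the asymmetry; the two example distributions are made up:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits: extra bits needed to encode samples from P
    with a code optimized for Q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0/q) is taken as 0
        if qi == 0:
            return math.inf     # Q misses part of P's support
        total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p))  # two different numbers
```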

Page 26: Information surprise or how to find interesting data

Quantify ‘discovery surprise’

Symmetric KL distances: all of them result in the same performance.

Pinto, D., Benedí, J. M., & Rosso, P. (2007). Clustering narrow-domain short texts by using the Kullback-Leibler distance. In Computational Linguistics and Intelligent Text Processing.

Page 27: Information surprise or how to find interesting data

Calculating KLD

Data sparseness problem: the divergence is often ∞.
Solutions:
- drop components from the calculation
- smoothing (a minimal sketch follows below)
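A minimal sketch of one possible smoothing scheme, additive (Laplace) smoothing; the slide's exact formula is not reproduced here, and the function name and toy counts are invented for illustration:

```python
def smoothed_distribution(counts, vocabulary, alpha=1.0):
    """Additive (Laplace) smoothing: every vocabulary term gets a non-zero
    probability, so the KLD against this distribution stays finite."""
    total = sum(counts.get(t, 0) for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocabulary}

hour_counts = {"election": 120, "kyiv": 80}      # toy counts, not real data
vocab = ["election", "kyiv", "snow"]
p = smoothed_distribution(hour_counts, vocab)
print(round(sum(p.values()), 6), p["snow"] > 0)  # 1.0 True
```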

Page 28: Information surprise or how to find interesting data

Surprise in Tweets: a KLD application

Page 29: Information surprise or how to find interesting data

Surprise in Tweets: a KLD application

Page 30: Information surprise or how to find interesting data

Explore data with search engines: Elasticsearch + Kibana = faceted data exploration.

Page 31: Information surprise or how to find interesting data

Whole dataset

I still hope to find where I left this partition

Page 32: Information surprise or how to find interesting data

Whole dataset

Page 33: Information surprise or how to find interesting data

Whole dataset

Page 34: Information surprise or how to find interesting data

Whole dataset (timeline annotations):
- MH17: July 17, 2014
- Annexation of Crimea: Feb 20 - March 20, 2014
- Presidential elections: May 25, 2014
- Experiment SET: Feb 1 - 28, 2014

Page 35: Information surprise or how to find interesting data

Experiment dataset: Feb 2014 (5.64M tweets)

Page 36: Information surprise or how to find interesting data

Experiment dataset: English

Page 37: Information surprise or how to find interesting data

Pipeline [diagram]: Stream (tweets) → Timeslot (attention window) → KLD against the last 8 timeslots (data model) → interesting/new? → new event (act) and Update the model, or meh (ignore).

Page 38: Information surprise or how to find interesting data

Simplistic topic modeling

- tweets are super short
+ important events are widely discussed
+ events change vocabulary
- timeslot aggregation favors the predominant event

Document is a timeslot. Model:
- bag of words
- frequency threshold: > 200 tweets
- term frequency (naive)
- tokenizer: https://github.com/jaredks/tweetokenize + a few touches

Page 39: Information surprise or how to find interesting data

Simplistic topic modeling

Document is a timeslot. Model:
- bag of words
- frequency threshold: > 200 tweets
- term frequency (naive)
- tokenizer: https://github.com/jaredks/tweetokenize + a few touches
(see the sketch below)
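A rough sketch of the timeslot bag-of-words model described above; the regex tokenizer, the function names, and the raw-count threshold are simplifications rather than the talk's actual code:

```python
import re
from collections import Counter, defaultdict

# The talk uses github.com/jaredks/tweetokenize "+ a few touches"; a plain
# regex tokenizer stands in for it so the sketch is self-contained.
def tokenize(text):
    return re.findall(r"[#@]?\w+", text.lower())

def timeslot_term_frequencies(tweets, min_count=200):
    """Group (timestamp, text) tweets into hourly timeslots and build a naive
    bag-of-words term-frequency model per slot. The slide's threshold is
    "> 200 tweets" per term; a raw term count is used here for simplicity."""
    slots = defaultdict(Counter)
    for timestamp, text in tweets:                       # timestamp: datetime
        slot = timestamp.replace(minute=0, second=0, microsecond=0)
        slots[slot].update(tokenize(text))
    return {slot: Counter({t: c for t, c in counts.items() if c >= min_count})
            for slot, counts in slots.items()}
```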

Page 40: Information surprise or how to find interesting data

Vocabulary diversity: follows daily cycles.

(ran out of disk space)

Page 41: Information surprise or how to find interesting data

Test a domain-specific hack

Vocabulary: catastrophe

Page 42: Information surprise or how to find interesting data

Vocabulary slots: KLD. How surprising the vocabulary of each hour is against the whole dataset.

Beware: on this scale individual hours are small, but events are plentiful

Higher KLD on sparse data

Lower KLD on dense data

Page 43: Information surprise or how to find interesting data

Vocabulary slots: KLD smoothed. Smoothing did not change the peaks.

new minimum

Page 44: Information surprise or how to find interesting data

Vocabulary slots: rolling KLD. How surprising the vocabulary of each hour is against the last 24h.

Less variation on dense data

Page 45: Information surprise or how to find interesting data

Vocabulary slots: rolling KLD. How surprising the vocabulary of each hour is against the last 8h.

Page 46: Information surprise or how to find interesting data

Vocabulary slots: rolling KLD. How surprising the vocabulary of each hour is against the last 4h.
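A hedged sketch of the rolling KLD computation, reusing kl_divergence() and smoothed_distribution() from the sketches above; the function name and the exact pooling of the history window are assumptions:

```python
from collections import Counter

def rolling_kld(slot_counts, slot_order, window=8, alpha=1.0):
    """Surprise of each timeslot's vocabulary against the pooled vocabulary
    of the previous `window` slots. slot_counts maps slot -> Counter of term
    frequencies; slot_order lists the slots in chronological order."""
    vocab = sorted({t for c in slot_counts.values() for t in c})
    scores = {}
    for i, slot in enumerate(slot_order[1:], start=1):
        history = Counter()
        for prev in slot_order[max(0, i - window):i]:
            history.update(slot_counts[prev])
        p = smoothed_distribution(slot_counts[slot], vocab, alpha)
        q = smoothed_distribution(history, vocab, alpha)
        scores[slot] = kl_divergence([p[t] for t in vocab],
                                     [q[t] for t in vocab])
    return scores
```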

Page 47: Information surprise or how to find interesting data

Event detection problem: outlier detection
- rate change of the ‘surprise’ (a sketch follows below)

Compare against:
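One plausible way to turn "rate change of the surprise" into an outlier rule, sketched as a z-score over hour-to-hour jumps; the rule and the threshold are assumptions, and scores is a list of per-slot surprise values aligned with slot_order:

```python
import statistics

def surprise_rate_outliers(slot_order, scores, z_thresh=3.0):
    """Flag slots whose hour-to-hour jump in the surprise score is more than
    z_thresh standard deviations above the average jump."""
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    mu = statistics.mean(deltas)
    sigma = statistics.pstdev(deltas)
    if sigma == 0:
        return []
    return [slot for slot, d in zip(slot_order[1:], deltas)
            if (d - mu) / sigma > z_thresh]
```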

Page 48: Information surprise or how to find interesting data

Rolling KLD outliers. Events: detected rate change.

Page 49: Information surprise or how to find interesting data

Rolling KLD outlier tokens: annotate events with the most surprising tokens (a sketch follows below).
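A plausible reading of the token-annotation step: rank tokens by their per-term contribution to the hour-vs-history KL divergence. The function name and this decomposition are assumptions; p and q are smoothed distributions as in the earlier sketches:

```python
import math

def top_surprising_tokens(p, q, vocab, k=10):
    """Rank tokens by their contribution p(t) * log2(p(t)/q(t)) to the
    hour-vs-history KLD; the biggest contributors label the event."""
    contrib = {t: p[t] * math.log2(p[t] / q[t])
               for t in vocab if p.get(t, 0) > 0 and q.get(t, 0) > 0}
    return sorted(contrib, key=contrib.get, reverse=True)[:k]
```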

Page 50: Information surprise or how to find interesting data

Further dataset limitation

prime events

Page 51: Information surprise or how to find interesting data

Rolling KLD outliers: Feb 19-28

Page 52: Information surprise or how to find interesting data

Find representative tweets [diagram]: compare the Timeslot (attention window) against the last 8 timeslots (data model); KLD picks the surprising tweets, −KLD the least surprising; Update the set of surprising tweets.

1. Detect distinct features
2. Find elements representing distinct features

(a sketch follows below)
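A rough stand-in for the representative-tweet selection, reusing tokenize() from the topic-modeling sketch; the coverage-based scoring is an assumption, not the talk's method:

```python
def representative_tweets(texts_in_slot, surprising_tokens, top_n=5):
    """Pick tweets from the surprising timeslot that cover the most of its
    surprising tokens ('find elements representing distinct features')."""
    wanted = set(surprising_tokens)
    scored = sorted(((len(wanted & set(tokenize(text))), text)
                     for text in texts_in_slot),
                    key=lambda pair: pair[0], reverse=True)
    return [text for score, text in scored[:top_n] if score > 0]
```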

Page 53: Information surprise or how to find interesting data

Surprising tweets (link ➥)

Only from users with 500+ followers

Page 54: Information surprise or how to find interesting data

The only spam/bot tweet selected, from the first time slot, when the prior is uniform. Notice: the dataset is not filtered!

Pages 55-61: Information surprise or how to find interesting data (no transcribed content)

To improve in the Tweets app:

1. Benchmark: ‘hot’ events from the media
2. Fight bots
   a. spam (repetitions, bots)
   b. ‘forced’ opinions
   c. filter low quality
3. Topic model
   a. not just Term Frequency
   b. split topics (!)

Page 62: Information surprise or how to find interesting data

Questions?

art by www.facebook.com/Marysya.Rudska