Accountability Hack 2014 - Parli-N-Grams

Parli-N-Grams Giuseppe Sollazzo @puntofisso Accountability Hack 2014

Upload: giuseppe-sollazzo

Post on 29-Jun-2015

DESCRIPTION

My presentation for Accountability Hack: A Hansard ngram extractor and viewer

TRANSCRIPT

Page 1: Accountability Hack 2014 - Parli-N-Grams

Parli-N-Grams

Giuseppe Sollazzo

@puntofisso

Accountability Hack 2014

Page 2: Accountability Hack 2014 - Parli-N-Grams

Parli-N-Grams

A search and analysis tool for Hansard

Page 3: Accountability Hack 2014 - Parli-N-Grams

The best search lets you discover things while you look for them

Page 10: Accountability Hack 2014 - Parli-N-Grams

N-Grams?

An N-Gram is a sequence of N words
● 1-gram: fox
● 2-gram: brown fox
● 3-gram: quick brown fox
● 4-gram: the quick brown fox
● ...
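
The definition on this slide can be sketched in a few lines of Python. This is an illustration only, not the project's PHP parser: the function name `ngrams` and the whitespace tokenisation are assumptions made for the example.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive words in text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption: words are separated by whitespace; punctuation is not stripped.
    words = text.split()
    # A text shorter than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    sentence = "the quick brown fox"
    for n in range(1, 5):
        print(n, ngrams(sentence, n))
```

The slide's examples ("fox", "brown fox", ...) are the last n-gram of each size for this sentence; the sketch returns all of them.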

Page 20: Accountability Hack 2014 - Parli-N-Grams

Tech Stack

Harvesting/parsing: PHP
Front-End: jQuery, JavaScript
UI: Bootswatch, Bootstrap

Page 25: Accountability Hack 2014 - Parli-N-Grams

Next time, PLAN!

Harvesting 6.4GB is slow
Parsing 6.4GB is slower
● Especially in PHP
Running grep because you’ve forgotten to extract data beforehand is slow AND stupid
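
One way to avoid grepping the raw text for every query is to count the n-grams once, in a single pass, and look them up afterwards. The sketch below is a hypothetical illustration of that idea, not the code used in the hack: the name `count_ngrams`, the lowercasing and the whitespace tokenisation are all assumptions.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(lines: Iterable[str], max_n: int = 3) -> Counter:
    """Count every 1-gram up to max_n-gram in a single pass over the lines."""
    counts: Counter = Counter()
    for line in lines:
        # Assumption: case-insensitive matching and whitespace-separated words.
        words = line.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    sample = ["The quick brown fox", "the brown fox"]
    counts = count_ngrams(sample)
    print(counts["brown fox"])  # 2
```

Because `lines` can be any iterable, an open file object can be passed directly, so the input does not have to fit in memory; the counts themselves still do.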

Page 26: Accountability Hack 2014 - Parli-N-Grams

Next time, PLAN!

Most data is available
Extraction is still running for 1-grams...

Page 27: Accountability Hack 2014 - Parli-N-Grams

Next time, PLAN!

sed s/=\'\'/=\'\\\\\'/g $filename |
sed s/\'\'\ /\\\\''\'\'\ /g |
sed "s/$/;/g" |
sed "s/\([a-z]\)'\(s\)/\1\\\'\2/g" |
sed "s/\([A-Z]\)'\(s\)/\1\\\'\2/g" |
sed "s/\([a-z]\)'\(l\)/\1\\\'\2/g" |
sed "s/\([a-z]\)'\(r\)/\1\\\'\2/g" |
sed "s/\(n\)'\(t\)/\1\\\'\2/g" |
sed "s/\(o\)'\(c\)/\1\\\'\2/g" |
sed "s/\(e\)'\(v\)/\1\\\'\2/g" |
sed "s/\(I\)'\(v\)/\1\\\'\2/g" |
sed "s/\(u\)'\(v\)/\1\\\'\2/g" |
sed "s/\([a-z]\)'\([A-Z]\)/\1\\\'\2/g" |
sed "s/\(O\)'\([a-z]\)/\1\\\'\2/g" |
sed "s/\(O\)'\([A-Z]\)/\1\\\'\2/g" |
sed "s/\(I\)'\(m\)/\1\\\'\2/g" |
sed "s/\([A-Z]\)'\(l\)/\1\\\'\2/g" |
sed "s/\([a-z]\)'\([a-z]\)/\1\\\'\2/g" |
sed "s/\([a-z]\)\'-\([a-z]\)/\1\\\'-\2/g" |
sed "s/\([A-Z]\)'\([A-Z]\)/\1\\\'\2/g" |
sed "s/\([A-Z]\)'\([a-z]\)/\1\\\'\2/g" |
sed "s/'\([a-z]\)'\([a-z]\)/\\'\1\\\'\2/g" |
sed "s/-'n\\\'/-\\\'n\\\'/g" |
sed "s/-'\([a-z]\)/-\\\'\1/g" |
sed "s/-o'-/-o\\\'-/g" |
sed "s/ght'-le/ght\\\'-le/g" |
sed "s/cats'-meat/cats\\\'-meat/g" |
sed "s/n'-roll/n\\\'-roll/g" |
sed "s/sou'-w/sou\\\'-w/g" |
sed "s/gleaf'-for/gleaf\\\'-for/g"
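
Most of the sed rules above appear to do the same job: put a backslash in front of an apostrophe that sits between two letters (don't, O'Neill, I'm). As a hedged sketch, that subset can be written as a single regular expression. The function name is hypothetical, and the hyphenated cases (such as n'-roll) and the first three rules are deliberately not covered.

```python
import re

# An apostrophe with an ASCII letter immediately before and after it.
INNER_APOSTROPHE = re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])")


def escape_inner_apostrophes(text: str) -> str:
    """Prefix each word-internal apostrophe with a backslash."""
    # In the replacement template, \\ stands for one literal backslash.
    return INNER_APOSTROPHE.sub(r"\\'", text)


if __name__ == "__main__":
    print(escape_inner_apostrophes("I'm sure O'Neill didn't say 'no'"))
```

Apostrophes that open or close a quoted phrase are left alone, since they do not have a letter on both sides.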

Page 28: Accountability Hack 2014 - Parli-N-Grams

Available on

http://github.com/puntofisso/AccHack14
http://parli-n-grams.puntofisso.net

Page 29: Accountability Hack 2014 - Parli-N-Grams

Thank you!

Parli-N-Grams

Giuseppe Sollazzo
@puntofisso

Accountability Hack 2014