Accountability Hack 2014 - Parli-N-Grams

Post on 29-Jun-2015

DESCRIPTION

My presentation for Accountability Hack: A Hansard ngram extractor and viewer

TRANSCRIPT

Parli-N-Grams

Giuseppe Sollazzo

@puntofisso

Accountability Hack 2014

Parli-N-Grams

A search and analysis tool for Hansard

The best search lets you discover things while you look for them

N-Grams?

An N-Gram is a sequence of N words
● 1-gram: fox
● 2-gram: brown fox
● 3-gram: quick brown fox
● 4-gram: the quick brown fox
● ...
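The definition can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the project (which used PHP); the function name `ngrams` is hypothetical, and words are assumed to be separated by whitespace.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined by spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the quick brown fox".split()
for n in range(1, 5):
    # The last n-gram for each n is the one that ends in "fox".
    print(f"{n}-gram: {ngrams(words, n)[-1]}")
```

For each N, the last entry is the N-gram ending in "fox", which reproduces the four examples in the list.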

Tech Stack

Harvesting/parsing: PHP
Front-End: jQuery, JavaScript
UI: Bootswatch, Bootstrap

Next time, PLAN!

Harvesting 6.4GB is slow
Parsing 6.4GB is slower
● Especially in PHP
Running grep because you’ve forgotten to extract data beforehand is slow AND stupid
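The grep point can be made concrete: if n-gram counts are extracted once, up front, later lookups never need to rescan the raw text. A minimal sketch in Python (not the PHP used for the project), assuming plain-text input lines, whitespace tokenisation, and n-grams that do not cross line boundaries. `count_ngrams` and its parameters are hypothetical names, not taken from the Parli-N-Grams code.

```python
from collections import Counter
from typing import Dict, Iterable

def count_ngrams(lines: Iterable[str], max_n: int = 4) -> Dict[int, Counter]:
    """Count every 1..max_n-gram in a single pass over the input lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = line.lower().split()
        for n in range(1, max_n + 1):
            # Slide a window of n words across the line.
            for i in range(len(words) - n + 1):
                counts[n][" ".join(words[i:i + n])] += 1
    return counts

sample = ["the quick brown fox", "the lazy brown fox"]
counts = count_ngrams(sample)
# A lookup is now a dictionary access rather than a scan of the source text.
print(counts[2]["brown fox"])
```

On a multi-gigabyte corpus the counters would probably need to be written to disk or a database rather than held in memory; the sketch only shows the single-pass shape.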

Most data is available
Extraction is still running for 1-grams...

sed s/=\'\'/=\'\\\\\'/g $filename | sed s/\'\'\ /\\\\''\'\'\ /g | sed "s/$/;/g" | sed "s/\([a-z]\)'\(s\)/\1\\\'\2/g" | sed "s/\([A-Z]\)'\(s\)/\1\\\'\2/g" | sed "s/\([a-z]\)'\(l\)/\1\\\'\2/g" | sed "s/\([a-z]\)'\(r\)/\1\\\'\2/g" | sed "s/\(n\)'\(t\)/\1\\\'\2/g" | sed "s/\(o\)'\(c\)/\1\\\'\2/g" | sed "s/\(e\)'\(v\)/\1\\\'\2/g" | sed "s/\(I\)'\(v\)/\1\\\'\2/g" | sed "s/\(u\)'\(v\)/\1\\\'\2/g" | sed "s/\([a-z]\)'\([A-Z]\)/\1\\\'\2/g" | sed "s/\(O\)'\([a-z]\)/\1\\\'\2/g" | sed "s/\(O\)'\([A-Z]\)/\1\\\'\2/g" | sed "s/\(I\)'\(m\)/\1\\\'\2/g" | sed "s/\([A-Z]\)'\(l\)/\1\\\'\2/g" | sed "s/\([a-z]\)'\([a-z]\)/\1\\\'\2/g" | sed "s/\([a-z]\)\'-\([a-z]\)/\1\\\'-\2/g" | sed "s/\([A-Z]\)'\([A-Z]\)/\1\\\'\2/g" | sed "s/\([A-Z]\)'\([a-z]\)/\1\\\'\2/g" | sed "s/'\([a-z]\)'\([a-z]\)/\\'\1\\\'\2/g" | sed "s/-'n\\\'/-\\\'n\\\'/g" | sed "s/-'\([a-z]\)/-\\\'\1/g" | sed "s/-o'-/-o\\\'-/g" | sed "s/ght'-le/ght\\\'-le/g" | sed "s/cats'-meat/cats\\\'-meat/g" | sed "s/n'-roll/n\\\'-roll/g" | sed "s/sou'-w/sou\\\'-w/g" | sed "s/gleaf'-for/gleaf\\\'-for/g"
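Most of the substitutions in the pipeline above appear to do one thing: put a backslash before an apostrophe that sits between two letters (as in "don't" or "O'Brien"), leaving other quote characters alone, and append a semicolon to every line. Assuming that reading is right, the letter-apostrophe-letter cases could be sketched with a single regular expression. The name `escape_inner_apostrophes` is hypothetical. The sketch deliberately does not reproduce the first two substitutions or the hyphenated special cases, and it may differ from the sed chain on unusual input.

```python
import re

# An apostrophe with an ASCII letter immediately before and after it.
INNER_APOSTROPHE = re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])")

def escape_inner_apostrophes(line: str) -> str:
    """Backslash-escape in-word apostrophes and append ';' to the line."""
    return INNER_APOSTROPHE.sub(r"\\'", line) + ";"

# The quotes that delimit the value are untouched; only the one in "don't" is escaped.
print(escape_inner_apostrophes("VALUES ('don't')"))
```

If the goal was to load the text into a database, passing values through the driver's parameter binding would avoid hand-written escaping altogether.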

Available on

http://github.com/puntofisso/AccHack14
http://parli-n-grams.puntofisso.net

Thank you!

Parli-N-Grams

Giuseppe Sollazzo

@puntofisso

Accountability Hack 2014
