munging solo: the joy of small data

Mini Munging: the Joy of Small Data Rob Miller https://robm.me.uk/ | @robmil


Post on 18-Aug-2015


TRANSCRIPT

Who are you?

What do you do?

I mung data

mung (mʌŋ), verb: To transform data from one form into another, unrecognisable one.

My Toolkit

• Hadoop

• Elasticsearch

• Cassandra

My data is small

Your data is (probably) small too

The Small Data Toolkit

• The command line

• Pipelines

• Ruby!

The shell is a programming environment

Text is a universal interface

Pipelines are incredibly powerful

$ head -1 log.csv
[email protected],login,2015-07-20 13:10:11

$ cat log.csv | cut -d, -f1 | sort | uniq -c
25 [email protected]
[email protected]

$ cat log.csv | grep '^[email protected],' | cut -d, -f3 | cut -d' ' -f1 | sort | uniq -c
61 2015-01-20
42 2015-06-18
4 2015-07-20
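To try these pipelines without real logs, here is a minimal sketch that builds a tiny log.csv in the same three-column shape (user, event, timestamp). The addresses, events and timestamps are made up for illustration, and alice@example.com is a hypothetical user.

```shell
# Hypothetical sample data in the same shape as the log on the slides:
# user,event,timestamp
cat > log.csv <<'EOF'
alice@example.com,login,2015-07-20 13:10:11
bob@example.com,login,2015-07-20 13:12:40
alice@example.com,logout,2015-07-20 14:01:02
alice@example.com,login,2015-07-21 09:00:00
EOF

# Events per user: keep column 1, sort so identical lines are adjacent,
# then count each run of identical lines.
cut -d, -f1 log.csv | sort | uniq -c

# Events per day for one user: filter that user's rows, keep the
# timestamp column, then keep only the date part before the space.
grep '^alice@example.com,' log.csv | cut -d, -f3 | cut -d' ' -f1 | sort | uniq -c
```

uniq only collapses adjacent duplicates, which is why sort comes before it in both pipelines.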

Free functionality, free parallelism, composable & modular
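The parallelism is free because the shell starts every stage of a pipeline as its own process, and the stages run at the same time. A rough sketch that makes this visible, using sleep as a stand-in for real work (assumption: a date command that supports +%s, as GNU and BSD date do):

```shell
# Three stages that each take about one second. The stages of a pipeline
# run concurrently, so the whole pipeline finishes in roughly one second
# rather than three.
start=$(date +%s)
sleep 1 | sleep 1 | sleep 1
end=$(date +%s)
elapsed=$((end - start))
echo "pipeline took about ${elapsed}s"
```

In a real pipeline the stages are connected by their input and output, so a stage such as sort still has to read all of its input before it can print anything, while stages such as grep and cut can work on lines as they arrive.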

“For the same amount of data I was able to use my laptop to get the results in about 12 seconds (270MB/sec), while the Hadoop cluster took about 26 minutes (1.14MB/sec)”

Adam Drake, “Command-line tools can be 235x faster than your Hadoop cluster”, http://bit.ly/1sS01aP

And Ruby fits here too!

$ ruby -e
$ ruby -ne
$ ruby -pe
$ ruby -F -ane
$ ruby -r
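For reference, a short sketch of what each of these switches does: -e runs code given on the command line, -n wraps it in a loop over every input line (available as $_), -p does the same and prints $_ after each iteration, -a splits each line into the array $F, -F sets the separator that -a splits on, and -r requires a library first. The sample input is made up, and the sketch assumes a ruby with the standard json library on the PATH.

```shell
# -e: run the code given on the command line
ruby -e 'puts 1 + 1'

# -n: run the code once per input line; the current line is in $_
printf 'a\nb\n' | ruby -ne 'puts $_.upcase'

# -p: like -n, but print $_ after each iteration
printf 'a\nb\n' | ruby -pe '$_.upcase!'

# -a with -F: split each line on the given separator into the array $F
printf 'x,login\ny,logout\n' | ruby -F, -ane 'puts $F[0]'

# -r: require a library before running the code
ruby -rjson -e 'puts JSON.generate([1, 2])'
```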

$ head -1 log.csv
[email protected],login,2015-07-20 13:10:11

$ cat log.csv | cut -d, -f1 | cut -d@ -f2 | ruby -rresolv -ne 'puts Resolv.getaddress(chomp)' | sort | uniq -c | sort -rn
24 10.0.42.1
3 10.27.100.8

Start with coreutils, then throw in Ruby

$ ruby -rcsv -rbitly -e 'b = Bitly.new("user", "foo"); CSV.filter { |r| r.each { |f| f.replace b.shorten(f).short_url if f =~ /^https?:/ } }' urls.csv > urls-shortened.csv

% nokogiri -e 'puts @doc.css("img").map { |i| "https:" + i["src"] }' https://en.wikipedia.org/wiki/Unix | xargs -n1 -P4 wget

Your shell is a programming environment

Ruby fits into this world perfectly

Most data is small

Go and mung stuff!

The most useful bits of coreutils

cat

grep

head

tail

split

wc

sort

shuf

uniq

comm

cut

paste

join

tr

column
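A few of these are less familiar than grep or sort. Here is a small sketch of comm, paste, join and tr on made-up files; the file names and contents are illustrative only.

```shell
# Two small, already sorted lists of made-up user names.
printf 'alice\nbob\ncarol\n' > monday.txt
printf 'bob\ncarol\ndave\n' > tuesday.txt

# comm compares two sorted files; -12 hides the lines unique to either
# file, leaving only the lines they share.
comm -12 monday.txt tuesday.txt

# paste glues files together line by line, here with a comma in between.
paste -d, monday.txt tuesday.txt

# join matches lines from two sorted files on their shared first field.
printf 'alice 30\nbob 25\n' > ages.txt
printf 'alice London\nbob Leeds\n' > cities.txt
join ages.txt cities.txt

# tr translates characters, here lower case to upper case.
tr 'a-z' 'A-Z' < monday.txt
```

comm and join both expect sorted input, so in a pipeline they usually come after sort.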

Text Processing with Ruby

• Published by Pragmatic Bookshelf

• Currently in beta

• https://pragprog.com/book/rmtpruby/text-processing-with-ruby