munging solo: the joy of small data

Post on 18-Aug-2015

Mini Munging: the Joy of Small Data

Rob Miller https://robm.me.uk/ | @robmil

Who are you?

What do you do?

I mung data

mung (mʌŋ), verb: To transform data from one form into another, unrecognisable one.

My Toolkit

• Hadoop

• Elasticsearch

• Cassandra

My data is small

Your data is (probably) small too

The Small Data Toolkit

• The command line

• Pipelines

• Ruby!

The shell is a programming environment

Text is a universal interface

Pipelines are incredibly powerful

$ head -1 log.csv
fred@example.com,login,2015-07-20 13:10:11

$ cat log.csv | cut -d, -f1 | sort | uniq -c
 25 fred@example.com
107 bob@example.net

$ cat log.csv | grep '^bob@example.net,' | cut -d, -f3 | cut -d' ' -f1 | sort | uniq -c
 61 2015-01-20
 42 2015-06-18
  4 2015-07-20
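The pipelines above can be tried end to end with a tiny log. This is a minimal sketch: the five sample rows are made up for illustration and only mirror the shape of the log.csv line shown earlier (email, event, timestamp).

```shell
# Hypothetical sample data in the same shape as the log.csv line above.
log=$(mktemp)
cat > "$log" <<'EOF'
fred@example.com,login,2015-07-20 13:10:11
bob@example.net,login,2015-07-20 13:12:40
bob@example.net,logout,2015-07-20 13:15:02
bob@example.net,login,2015-06-18 09:01:55
fred@example.com,logout,2015-07-20 13:20:00
EOF

# Events per user: field 1 is the email address.
# sort groups identical lines together so that uniq -c can count them.
per_user=$(cut -d, -f1 "$log" | sort | uniq -c)
printf '%s\n' "$per_user"

# Events per day for one user: field 3 is the timestamp, and the date
# is the part of it before the space.
per_day=$(grep '^bob@example.net,' "$log" | cut -d, -f3 | cut -d' ' -f1 | sort | uniq -c)
printf '%s\n' "$per_day"

rm -f "$log"
```

Each stage does one small job, so swapping the grep pattern or the cut field is enough to ask a different question of the same file.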

Free functionality, free parallelism, composable & modular

“For the same amount of data I was able to use my laptop to get the results in about 12 seconds (270MB/sec), while the Hadoop cluster took about 26 minutes (1.14MB/sec)”

Adam Drake, “Command-line tools can be 235x faster than your Hadoop cluster”, http://bit.ly/1sS01aP

And Ruby fits here too!

$ ruby -e
$ ruby -ne
$ ruby -pe
$ ruby -F -ane
$ ruby -r

$ head -1 log.csv
fred@example.com,login,2015-07-20 13:10:11

$ cat log.csv | cut -d, -f1 | cut -d@ -f2 | ruby -rresolv -ne 'puts Resolv.getaddress(chomp)' | sort | uniq -c | sort -rn
 24 10.0.42.1
  3 10.27.100.8

Start with coreutils, then throw in Ruby

$ ruby -rcsv -rbitly -e 'b = Bitly.new("user", "foo"); CSV.filter { |r| r.each { |f| f.replace b.shorten(f).short_url if f =~ /^https?:/ } }' urls.csv > urls-shortened.csv

% nokogiri -e 'puts @doc.css("img").map { |i| "https:" + i["src"] }' https://en.wikipedia.org/wiki/Unix | xargs -n1 -P4 wget
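The xargs half of that command is where the parallelism comes from. A minimal offline sketch, with an echo standing in for wget and made-up file names; -P is not in POSIX, but both GNU and BSD xargs support it.

```shell
# -n1 passes one argument per invocation; -P4 runs up to four at once.
# "sh -c ... _" is a stand-in for wget: $1 is the argument xargs appends.
fetched=$(
  printf '%s\n' a.png b.png c.png d.png |
    xargs -n1 -P4 sh -c 'echo "fetched $1"' _ |
    sort   # parallel jobs finish in no fixed order, so sort for stable output
)
printf '%s\n' "$fetched"
```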

Your shell is a programming environment

Ruby fits into this world perfectly

Most data is small

Go and mung stuff!

The most useful bits of coreutils

cat

grep

head

tail

split

wc

sort

shuf

uniq

comm

cut

paste

join

tr

column
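A few of the less familiar tools in that list, sketched on made-up files. The file names and contents are assumptions for illustration; comm and join both expect their inputs to be sorted already.

```shell
dir=$(mktemp -d)
printf 'alice@example.org\nbob@example.net\n' > "$dir/known.txt"
printf 'bob@example.net\nfred@example.com\n'  > "$dir/seen.txt"
printf 'bob@example.net,Bob\nfred@example.com,Fred\n' > "$dir/names.csv"
printf 'bob@example.net,107\nfred@example.com,25\n'   > "$dir/counts.csv"

# comm compares two sorted files; -12 hides the lines unique to either
# file, leaving only the lines common to both.
both=$(comm -12 "$dir/known.txt" "$dir/seen.txt")

# join matches lines from two sorted files on their first field;
# -t, sets the field separator to a comma.
joined=$(join -t, "$dir/names.csv" "$dir/counts.csv")

# tr translates characters, here lower case to upper case.
upper=$(printf 'login\n' | tr 'a-z' 'A-Z')

# paste -s joins all input lines into one, separated by the -d character.
one_line=$(printf 'a\nb\nc\n' | paste -s -d, -)

printf '%s\n' "$both" "$joined" "$upper" "$one_line"
rm -rf "$dir"
```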

Text Processing with Ruby

• Published by Pragmatic Bookshelf

• Currently in beta

• https://pragprog.com/book/rmtpruby/text-processing-with-ruby
