munging solo: the joy of small data
$ head -1 [email protected],login,2015-07-20 13:10:11
$ cat log.csv | cut -d, -f1 | sort | uniq -c
25 [email protected]
[email protected]
$ cat log.csv | grep '^[email protected],' | cut -d, -f3 | cut -d' ' -f1 | sort | uniq -c
61 2015-01-20
42 2015-06-18
4 2015-07-20
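To try these pipelines without the real log, here is a minimal sketch that builds a throwaway file in the same email,action,timestamp shape. The rows and addresses are invented for illustration and are not from the original data.

```shell
#!/bin/sh
# Hypothetical sample log (made-up rows): email,action,timestamp
log=$(mktemp)
cat > "$log" <<'EOF'
alice@example.com,login,2015-07-20 13:10:11
bob@example.com,login,2015-07-20 13:12:40
alice@example.com,logout,2015-07-20 14:02:03
alice@example.com,login,2015-07-21 09:00:00
EOF

# Events per user: take column 1, sort so duplicates are adjacent, count them
per_user=$(cut -d, -f1 "$log" | sort | uniq -c | sort -rn)
echo "$per_user"

# Events per day for one user: take the timestamp column, keep the date half
per_day=$(grep '^alice@example.com,' "$log" | cut -d, -f3 | cut -d' ' -f1 | sort | uniq -c)
echo "$per_day"

rm -f "$log"
```

uniq -c only counts adjacent duplicates, which is why every pipeline sorts first.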
“For the same amount of data I was able to use my laptop to get the results in about 12 seconds (270MB/sec), while the Hadoop cluster took about 26 minutes (1.14MB/sec)”
Adam Drake, “Command-line tools can be 235x faster than your Hadoop cluster”, http://bit.ly/1sS01aP
$ head -1 [email protected],login,2015-07-20 13:10:11
$ cat log.csv | cut -d, -f1 | cut -d@ -f2 | ruby -rresolv -ne 'puts Resolv.getaddress(chomp)' | sort | uniq -c | sort -rn
24 10.0.42.1
3 10.27.100.8
$ ruby -rcsv -rbitly -e 'b = Bitly.new("user", "foo"); CSV.filter { |r| r.each { |f| f.replace b.shorten(f).short_url if f =~ /^https?:/ } }' urls.csv > urls-shortened.csv
% nokogiri -e 'puts @doc.css("img").map { |i| "https:" + i["src"] }' https://en.wikipedia.org/wiki/Unix | xargs -n1 -P4 wget
The most useful bits of coreutils
cat
grep
head
tail
split
wc
sort
shuf
uniq
comm
cut
paste
join
tr
column
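A few of the tools in this list do not appear in the earlier examples, so here is a rough sketch of join, comm and paste working together. The two files and their contents are hypothetical, made up only to have something to run against.

```shell
#!/bin/sh
# Hypothetical sample data (invented for this sketch), already sorted by key
tmp=$(mktemp -d)
printf 'alice,admin\nbob,staff\ncarol,staff\n' > "$tmp/users.csv"
printf 'alice,login\nbob,login\nbob,logout\n' > "$tmp/events.csv"

# join: pair up rows sharing the first field (inputs must be sorted on it)
joined=$(join -t, "$tmp/users.csv" "$tmp/events.csv")
echo "$joined"

# comm -23: lines only in the first file, i.e. users with no events at all
cut -d, -f1 "$tmp/users.csv" | sort -u > "$tmp/all"
cut -d, -f1 "$tmp/events.csv" | sort -u > "$tmp/seen"
missing=$(comm -23 "$tmp/all" "$tmp/seen")
echo "$missing"

# paste -s: fold the lines of a file into one comma-separated row
active=$(paste -sd, "$tmp/seen")
echo "$active"

rm -rf "$tmp"
```

Both join and comm expect sorted input, so a sort step usually sits in front of them in a real pipeline.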
Text Processing with Ruby
• Published by Pragmatic Bookshelf
• Currently in beta
• https://pragprog.com/book/rmtpruby/text-processing-with-ruby