Indexing Big Data in the Cloud


Me

Scott Stults

Co-Founder of OpenSource Connections

Solr / Lucene

Bash / Python / Java


Eric


Big Data


Big Data Wrangler


How?

Address a Real Project

Be Agile

Make Small Mistaeks Fast

Succeed BIG


USPTO Goals

Prototype Search UX

Prove Solr:

Scales

Integrates

Excels


Scale?


Our Approach

KISS

YAGNI

(This space intentionally left blank)


Minimal Flair


Record Everything!


Some Numbers

Doc Count 1.1 Million

Zip Files 313

Docs per Zip File 4,000

Zip File Size 75M

File Size 300M
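
As a rough cross-check of the numbers above, here is a minimal sketch in shell arithmetic. It assumes "75M" means roughly 75 megabytes per zip file and that 4,000 is a rounded per-zip figure; the slide states neither explicitly.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope totals from the slide's figures.
# Assumption: "75M" is the approximate size of each zip file in megabytes.
zip_files=313
docs_per_zip=4000   # the slide's round figure
zip_size_mb=75

# Upper bound on documents if every zip held the full 4,000.
max_docs=$(( zip_files * docs_per_zip ))

# Total compressed data to fetch and unpack, in megabytes.
total_zip_mb=$(( zip_files * zip_size_mb ))

echo "max docs: $max_docs"
echo "total zip size (MB): $total_zip_mb"
```

Under those assumptions the upper bound of about 1.25 million documents sits a little above the stated 1.1 million doc count, which would fit with zips that are not all full.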


Testing

Start some servers

Process a batch

Check the clock
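
The "check the clock" step might be sketched as a small wrapper that times whatever command processes a batch. The names `time_batch` and `process_batch` are hypothetical and are not taken from the project.

```shell
#!/usr/bin/env bash
# time_batch: run a command and report its elapsed wall-clock seconds.
# Usage: time_batch <label> <command> [args...]
# (hypothetical helper, not from the original project)
time_batch() {
  local label=$1
  shift
  local start end status
  start=$(date +%s)
  "$@"                      # run the batch command as given
  status=$?                 # remember its exit status
  end=$(date +%s)
  echo "$label: $(( end - start ))s (exit $status)"
  return "$status"
}
```

A call such as `time_batch batch-01 ./process_batch.sh batch-01` (script name assumed) would print the label, the elapsed seconds, and the exit status, which is enough to record a timing for each test run.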


start_nodes

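# Launch $MAX_NODES m1.large instances and save the tool's output for later steps.
# Each block device mapping reads device=[snapshot]:[size in GiB]:[delete on termination].
start_nodes() {
  ec2-run-instances ami-1b814f72 \
    --block-device-mapping '/dev/sdb=snap-48adde35::true' \
    --block-device-mapping '/dev/sdi1=:10:false' \
    --block-device-mapping '/dev/sdi2=:10:false' \
    --block-device-mapping '/dev/sdi3=:20:false' \
    --instance-type m1.large \
    --key uspto-proto \
    --instance-count "$MAX_NODES" \
    --group default > ~/run-output
}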
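
Since `start_nodes` saves its output to `~/run-output`, a later step presumably reads the new instance IDs back from that file. The sketch below shows one way that might look. It assumes the legacy EC2 API tools format, in which each instance appears on a line whose first field is `INSTANCE` and whose second field is the instance ID. The helper name `list_instance_ids` is hypothetical.

```shell
#!/usr/bin/env bash
# list_instance_ids: print the instance IDs recorded in a run-output file.
# Assumes lines of the form "INSTANCE<TAB>i-xxxxxxxx<TAB>..." as written by
# the legacy ec2-run-instances tool. Defaults to ~/run-output.
# (hypothetical helper, not from the original project)
list_instance_ids() {
  awk '$1 == "INSTANCE" { print $2 }' "${1:-$HOME/run-output}"
}
```

The resulting list could then be passed to whatever step connects to each node.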


Gut Check

How fast can we do this?

What can we do in parallel?


Scaling

Raise our instance limit

xargs -P

GNU parallel
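
A minimal sketch of the `xargs -P` approach: fan a per-file command out over every zip file in a directory, with a fixed number of jobs running at once. The function name `run_parallel` and the worker script in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# run_parallel: run a command once per *.zip file in a directory,
# keeping up to <jobs> invocations running concurrently via xargs -P.
# Usage: run_parallel <dir> <jobs> <command> [args...]
# The file path is appended as the last argument to <command>.
# (hypothetical helper, not from the original project)
run_parallel() {
  local dir=$1 jobs=$2
  shift 2
  # -print0 / -0 keep file names with spaces intact; -n 1 passes one file per job.
  find "$dir" -name '*.zip' -print0 |
    xargs -0 -n 1 -P "$jobs" "$@"
}
```

For example, `run_parallel ./zips 8 ./process_zip.sh` (script name assumed) would keep eight workers busy until the directory is exhausted. GNU parallel supports the same pattern through its `-j` option.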


Shortcomings

SSH?

Error recovery

One Solr


Alternatives

CloudFormation

Puppet / Chef

Multiple Cores / Shards

Hadoop


Success


Victory Lap


Instances / Time


Thank You

https://github.com/sstults/patent-indexing

@scottstults

#o19s
