Indexing Big Data in the Cloud
DESCRIPTION
Amazon Web Services offers a quick and easy way to build a scalable search platform. That flexibility is especially useful when an initial data load requires hardware that is no longer needed for day-to-day searching and adding new documents. This presentation will cover one such approach, capable of enlisting hundreds of worker nodes to ingest data, tracking their progress, and relinquishing them back to the cloud when the job is done. The data set discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. The same basic approach was also used to make three sizes of PNG thumbnails of the patent grant TIFF images. In that case, 150 worker nodes were used to generate 1.6 TB of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer and tricks for using XSLT on very large XML documents.
TRANSCRIPT
Indexing Big Data in the Cloud
Indexing Big Data in the Cloud 2
Me
Scott Stults
Co-Founder of OpenSource Connections
Solr / Lucene
Bash / Python / Java
Indexing Big Data in the Cloud 3
Eric
Indexing Big Data in the Cloud 4
Big Data
Indexing Big Data in the Cloud 5
Big Data Wrangler
Indexing Big Data in the Cloud 6
How?
Address a Real Project
Be Agile
Make Small Mistaeks Fast
Succeed BIG
Indexing Big Data in the Cloud 7
USPTO Goals
Prototype Search UX
Prove Solr:
Scales
Integrates
Excels
Indexing Big Data in the Cloud 8
Scale?
Indexing Big Data in the Cloud 9
Our Approach
KISS
YAGNI
(This space intentionally left blank)
Indexing Big Data in the Cloud 10
Minimal Flair
Indexing Big Data in the Cloud 11
Record Everything!
Indexing Big Data in the Cloud 12
Some Numbers
Doc Count: 1.1 Million
Zip Files: 313
Docs per Zip File: 4,000
Zip File Size: 75M
File Size: 300M
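A quick back-of-the-envelope check of these figures can help size the job before any servers are started. The sketch below only multiplies the numbers from the slide; reading "File Size" as the unzipped size of one archive is an assumption, and the variable names are illustrative.

```shell
# Figures copied from the slide.
ZIP_FILES=313
DOCS_PER_ZIP=4000
ZIP_SIZE_MB=75
FILE_SIZE_MB=300   # assumption: unzipped size of one archive

total_docs=$((ZIP_FILES * DOCS_PER_ZIP))
total_zipped_mb=$((ZIP_FILES * ZIP_SIZE_MB))
total_unzipped_mb=$((ZIP_FILES * FILE_SIZE_MB))

echo "Documents (upper bound): $total_docs"
echo "Compressed total:        $total_zipped_mb MB"
echo "Uncompressed total:      $total_unzipped_mb MB"
```

The document product comes out somewhat above the stated 1.1 million, which suggests that 4,000 docs per zip file is a rounded figure rather than an exact one.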
Indexing Big Data in the Cloud 13
Testing
Start some servers
Process a batch
Check the clock
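The "check the clock" step could be as simple as a wrapper that times one batch. This is a minimal sketch, not the script used in the project: `time_batch` is a hypothetical helper, `sleep 1` stands in for real batch work, and `date +%s` is assumed to be available (it is on GNU and BSD systems).

```shell
# time_batch LABEL COMMAND [ARGS...]
# Runs the command, then prints the label, wall-clock seconds, and exit status.
time_batch() {
    label=$1
    shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label elapsed=$((end - start))s status=$status"
    return "$status"
}

# Stand-in for a real batch; replace with the actual processing command.
time_batch "sample-batch" sleep 1
```

Appending each line to a log file gives a cheap record of how batch time changes as the node count changes.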
Indexing Big Data in the Cloud 14
start_nodes
start_nodes() {
  ec2-run-instances ami-1b814f72 \
    --block-device-mapping '/dev/sdb=snap-48adde35::true' \
    --block-device-mapping '/dev/sdi1=:10:false' \
    --block-device-mapping '/dev/sdi2=:10:false' \
    --block-device-mapping '/dev/sdi3=:20:false' \
    --instance-type m1.large \
    --key uspto-proto \
    --instance-count $MAX_NODES \
    --group default > ~/run-output
}
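Since start_nodes saves its output to ~/run-output, that file can later be used to hand the nodes back. The sketch below is an assumption about how that might look, not code from the talk: `instance_ids` and `stop_nodes` are hypothetical names, the output is assumed to be tab-separated with INSTANCE records carrying the ID in the second field, and the IDs in the demo are made up.

```shell
# Print the instance IDs found in ec2-run-instances output read from stdin.
# Assumes tab-separated records whose first field is INSTANCE.
instance_ids() {
    awk -F '\t' '$1 == "INSTANCE" { print $2 }'
}

# Hypothetical counterpart to start_nodes: terminate whatever it started.
stop_nodes() {
    ids=$(instance_ids < ~/run-output)
    if [ -n "$ids" ]; then
        # Left unquoted on purpose so each ID becomes its own argument.
        ec2-terminate-instances $ids
    fi
}

# Demo with made-up records; no AWS call is made here.
printf 'RESERVATION\tr-00000000\t000000000000\tdefault\nINSTANCE\ti-aaaa1111\tami-1b814f72\tpending\n' | instance_ids
```

Keeping the parsing in its own function makes it easy to check against a saved output file before pointing it at live instances.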
Indexing Big Data in the Cloud 15
Gut Check
How fast can we do this?
What can we do in parallel?
Indexing Big Data in the Cloud 16
Scaling
Raise our instance limit
xargs -P
GNU parallel
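As an illustration of the xargs -P approach, the sketch below fans a list of file names out to a fixed number of concurrent workers. The file names and the `process_all` helper are hypothetical, and the `echo` stands in for the real per-file work. The `-P` option is a GNU and BSD extension rather than POSIX.

```shell
# Read one file name per line on stdin and hand each to its own worker
# process, running up to $WORKERS (default 4) at a time.
process_all() {
    xargs -P "${WORKERS:-4}" -n 1 sh -c 'echo "processed $1"' _
}

# Completion order is not deterministic, so sort before logging or comparing.
results=$(printf '%s\n' batch1.zip batch2.zip batch3.zip | process_all | sort)
echo "$results"
```

GNU parallel supports the same pattern through its `-j` option and adds conveniences such as keeping each job's output grouped.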
Indexing Big Data in the Cloud 17
Shortcomings
SSH?
Error recovery
One Solr
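One lightweight way to address the error-recovery gap would be to wrap each remote step in a retry loop. This is a sketch of that general idea, not something shown in the talk: `retry` and `RETRY_DELAY` are hypothetical names, and a real job would likely also want to log which batch failed.

```shell
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Re-runs the command until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    max=$1
    shift
    attempt=1
    while :; do
        "$@" && return 0
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        # Pause between attempts; defaults to 1 second.
        sleep "${RETRY_DELAY:-1}"
    done
}

# Example: a command that succeeds immediately needs only one attempt.
retry 3 true && echo "ok"
```

Retries only help with transient failures such as a dropped connection; a batch that fails every time still needs to be recorded and handled separately.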
Indexing Big Data in the Cloud 18
Alternatives
CloudFormation
Puppet / Chef
Multiple Cores / Shards
Hadoop
Indexing Big Data in the Cloud 19
Success
Indexing Big Data in the Cloud 20
Victory Lap
Indexing Big Data in the Cloud 21
Instances / Time
Indexing Big Data in the Cloud 22
Thank You
https://github.com/sstults/patent-indexing
@scottstults
#o19s