Big Data at AWS Chicago User Group
Most of the slides from the Sept 23rd 2014 AWS User Group in Chicago.
Talks:
"AWS Storage Options" - Ben Blair, CTO at MarkITx @stochastic_code
"APIs and Big Data in AWS" - Kin Lane, API Evangelist @kinlane [coming soon]
"Democratizing Data Analysis with Amazon Redshift" - Bill Wanjohi @billwanjohi and Michelangelo D'Agostino @MichelangeloDA, Civis Analytics
Sponsored by Cohesive and Civis Analytics.
![Page 1: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/1.jpg)
AWS Chicago User Group
Big Data Day
![Page 2: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/2.jpg)
Have an idea for a meetup? Talk to me:
Margaret Walker, CohesiveFT
Tweet: @MargieWalker #AWSChicago
Sponsors & Hosts
#AWSChicago
![Page 3: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/3.jpg)
Agenda
6:00 pm Introductions
6:05 pm Short Talks
"AWS Storage Options" - Ben Blair, CTO at MarkITx @stochastic_code
"APIs and Big Data in AWS" - Kin Lane, API Evangelist @kinlane
"Democratizing Data Analysis with Amazon Redshift" - Bill Wanjohi @billwanjohi and Michelangelo D'Agostino @MichelangeloDA, Civis Analytics
6:45 pm Q & A
7:00 pm Networking, drinks and pizza
![Page 4: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/4.jpg)
Next Meetups:
October 15?
Nov 12
Let’s drink at re:Invent
![Page 5: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/5.jpg)
Keep it Secret, Keep it Safe
(and Fast and Available would be nice too)
![Page 6: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/6.jpg)
Hi
Ben Blair
CTO @ MarkITx
We live on AWS
![Page 7: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/7.jpg)
TL;DW
• Use IAM roles for access control
• Use DynamoDB for online storage & transactions
• Use Redshift for offline storage & analysis
• Use S3 to keep *everything*
![Page 8: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/8.jpg)
It’s hard to keep a secret
Use IAM EC2 roles instead
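As a rough illustration of the role-based approach, the sketch below builds the two JSON policy documents an EC2 role typically needs: a trust policy that lets EC2 assume the role, and a permissions policy. The bucket name and the choice of read-only S3 access are assumptions made for the example, not something stated in the talk.

```python
import json


def ec2_trust_policy():
    """Trust policy that lets EC2 instances assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def read_only_bucket_policy(bucket):
    """Permissions policy granting read access to objects in one bucket.

    The bucket name is a hypothetical placeholder supplied by the caller.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": ["arn:aws:s3:::%s/*" % bucket],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(ec2_trust_policy(), indent=2))
    print(json.dumps(read_only_bucket_policy("example-bucket"), indent=2))
```

With a role like this attached to an instance, the AWS SDKs can pick up temporary credentials on their own, so no long-lived secret has to be stored in code or configuration.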
![Page 9: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/9.jpg)
3rd normal form, anyone?
Data duplication is OK
Optimize for each context
![Page 10: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/10.jpg)
Interactive Data goes in DynamoDB
If your users read or write it, and it’s not huge, it should probably go into DynamoDB
![Page 11: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/11.jpg)
Why DynamoDB
• Works with tests. Tests are good.
• Predictable Performance & Cost
• Low Maintenance
![Page 12: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/12.jpg)
Why Not DynamoDB
• Vendor lock-in vs Cassandra
• Can’t add / change indexes (but that’s ok)
• Need to watch utilization
![Page 13: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/13.jpg)
SimpleDB
No, just no
![Page 14: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/14.jpg)
ElastiCache
Good place to end, bad place to start
![Page 15: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/15.jpg)
RDS
Hosted SQL Goodness
![Page 16: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/16.jpg)
Redshift
Seriously wonderful
![Page 17: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/17.jpg)
Redshift vs RDS
• Start with RDS
• Redshift is actually very cheap
• RDS for simple reporting on small data sets
• Redshift for all other analysis
![Page 18: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/18.jpg)
S3
Store Everything.
You won’t, and you’ll regret it later.
![Page 19: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/19.jpg)
EBS
Distributed Availability > Instance Recovery
![Page 20: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/20.jpg)
Names Matter
Distributed systems care about your keyspace even when you don’t
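A toy sketch of why key names matter: if a store splits work by key prefix, date-ordered keys all pile into one prefix, while keys carrying a short hash prefix spread out. The prefix histogram is only a stand-in for real partitioning, which is internal to each service. The key format and the `salted` helper are hypothetical.

```python
import hashlib
from collections import Counter


def salted(key, width=2):
    """Prefix a key with a few hex characters of its own SHA-256 hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return digest[:width] + "/" + key


def prefix_histogram(keys, prefix_len=1):
    """Count keys per leading prefix, a toy stand-in for range partitions."""
    return Counter(key[:prefix_len] for key in keys)


# Hypothetical date-ordered keys: every one starts with the same characters.
keys = ["2014-09-23/event-%05d" % i for i in range(1000)]

plain = prefix_histogram(keys)
spread = prefix_histogram([salted(key) for key in keys])

print("distinct prefixes, plain keys: ", len(plain))
print("distinct prefixes, salted keys:", len(spread))
```

The trade-off is that salted keys no longer list in date order, so this only pays off when even load matters more than ordered scans.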
![Page 21: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/21.jpg)
@stochastic_code
github.com/markitx
![Page 22: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/22.jpg)
"APIs and Big Data in AWS"
Kin Lane, API Evangelist
@kinlane
Slides on GitHub
![Page 23: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/23.jpg)
Democratizing Data Analysis with Amazon Redshift
Michelangelo D’Agostino - Civis Analytics Senior Data Scientist
Bill Wanjohi - Civis Analytics Senior Engineer
![Page 24: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/24.jpg)
What you’ll learn
● advantages of Redshift
● some pitfalls
● workflows and recommendations on best practices
![Page 25: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/25.jpg)
Why should you listen?
● 18 months of heavy Redshift use
● Two complementary perspectives: The Scientist and The Engineer
![Page 26: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/26.jpg)
Michelangelo @MichelangeloDA
![Page 27: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/27.jpg)
Bill @billwanjohi
![Page 28: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/28.jpg)
Life before Redshift
● collaborated on monolithic Vertica analytics database
● dozens of TB of data
● scaled from 4 to 20 server blades
● dozens of concurrent users across departments (hundreds total)
● arbitrary SQL allowed/encouraged
![Page 29: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/29.jpg)
Our early requirements
● SQL language
● low starting cost
● easy to integrate with OSS, other DBs
● performant on large data sets
● minimal database administration
![Page 30: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/30.jpg)
Choosing Redshift
● timing: first full release in Feb 2013
● drastically cheaper to start than other commercial offerings
● very similar to our previous choice, HP Vertica
● many fewer administration tasks
![Page 31: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/31.jpg)
Basics
● RDBMS
● MPP/Columnar
  ○ Supports window functions
  ○ Few enforceable constraints
  ○ No concept of an index
● Redshift <= ParAccel <= PostgreSQL 8
  ○ Postgres drivers work
  ○ ORM requires mocking
● Most data I/O via S3 service
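To make the S3-based I/O concrete, here is a minimal sketch that builds the two SQL statements involved: `UNLOAD` to write a query result to S3 and `COPY` to load files from S3. The table, bucket, and role ARN are hypothetical. The `IAM_ROLE` clause is an assumption chosen for brevity (the speakers do not say how they authorized access). Both statements rely on the default pipe delimiter.

```python
def sql_literal(text):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def unload_statement(query, s3_prefix, iam_role):
    """Build an UNLOAD that writes a query result to S3 as gzip files."""
    return "UNLOAD (%s) TO %s IAM_ROLE %s GZIP;" % (
        sql_literal(query),
        sql_literal(s3_prefix),
        sql_literal(iam_role),
    )


def copy_statement(table, s3_prefix, iam_role):
    """Build a COPY that loads gzip files from S3 into a table.

    The table name is inserted as-is, so it must come from trusted code.
    """
    return "COPY %s FROM %s IAM_ROLE %s GZIP;" % (
        table,
        sql_literal(s3_prefix),
        sql_literal(iam_role),
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    role = "arn:aws:iam::123456789012:role/example-redshift-role"
    prefix = "s3://example-bucket/exports/events_"
    print(unload_statement("SELECT * FROM events WHERE state = 'IL'", prefix, role))
    print(copy_statement("events_copy", prefix, role))
```

The generated strings can be run through any Postgres driver connected to the cluster, since Postgres drivers work with Redshift.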
![Page 32: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/32.jpg)
Things analytics DBs are good at
● Big aggregates
● Parallel I/O
● Merge joins between tables
![Page 33: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/33.jpg)
Things they’re not good at
● Updates
● Retrieval of individual records
● Enforcing data quality
![Page 34: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/34.jpg)
How’s it worked out?
Pretty good!
● adequate performance
  ○ big step up from traditional RDBMS
  ○ comparable to other analytics DBs
● easy to stand up new clusters
● cheaper clusters now available
● most workflows can live entirely in-database
● S3 is a good broker for what can’t
![Page 35: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/35.jpg)
Data Science Workflow
Our custom plumbing syncs tables from dozens of source databases into Redshift at varying refresh frequencies.
![Page 36: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/36.jpg)
Data Science Workflow
We’ve found that SQL just invites so many more people to the analytics game.
Analysts and data scientists run exploratory SQL and build up complex tables for statistical modeling, utilizing crazy joins, aggregates and rollup features.
Redshift supports powerful window functions.
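For readers unfamiliar with window functions, the sketch below computes a per-donor running total. It runs against an in-memory SQLite database (version 3.25 or newer) purely so the example is self-contained. SQLite is a stand-in here, not what the speakers used, and the `donations` table is hypothetical. The explicit `ROWS BETWEEN` frame is included because Redshift expects one when an aggregate window has an `ORDER BY`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donations (donor TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO donations VALUES (?, ?, ?)",
    [("ann", 1, 10), ("ann", 2, 5), ("bob", 1, 7), ("ann", 3, 20), ("bob", 2, 3)],
)

# Running total of each donor's amounts, ordered by day within the donor.
running_totals = conn.execute(
    """
    SELECT donor, day, amount,
           SUM(amount) OVER (
               PARTITION BY donor
               ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM donations
    ORDER BY donor, day
    """
).fetchall()

for row in running_totals:
    print(row)
```

The same result without a window function would need a self-join or a correlated subquery, which is part of why this feature matters for exploratory analysis.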
![Page 37: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/37.jpg)
Predictive Modeling
Data is pulled directly from Redshift into Python/R to train statistical models
![Page 38: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/38.jpg)
Predictive Modeling
For simple linear models, scoring is done directly in Redshift via SQL.
For more complicated models, data is exported from Redshift to S3 with an UNLOAD SQL command, processed in EMR, and loaded back into Redshift with a COPY command.
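Scoring a linear model in SQL amounts to writing the fitted equation as a column expression. A minimal sketch, assuming the coefficients were fitted elsewhere (for example in Python or R). The table and column names are hypothetical, and this is not necessarily how Civis generates its scoring queries.

```python
def linear_score_sql(table, intercept, coefficients, id_column="id"):
    """Build a SELECT that scores every row with a fitted linear model.

    coefficients maps column names to weights. Names are inserted as-is,
    so they must come from trusted code rather than user input.
    """
    terms = [repr(float(intercept))]
    for column, weight in sorted(coefficients.items()):
        terms.append("%r * %s" % (float(weight), column))
    return "SELECT %s, %s AS score FROM %s;" % (
        id_column,
        " + ".join(terms),
        table,
    )


if __name__ == "__main__":
    # Hypothetical model: two features and an intercept.
    print(
        linear_score_sql(
            "voters",
            0.5,
            {"age": 0.02, "turnout_2012": 1.5},
            id_column="voter_id",
        )
    )
```

Because the arithmetic runs inside the database, no rows leave the cluster, which is what makes this path cheaper than the S3 and EMR round trip.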
![Page 39: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/39.jpg)
Hurdles we’ve faced along the way
● inconsistent runtimes
● catalog contention
● bugs (databases are hard)
● resizing
● too easy to end up with uncompressed data
● “missing” PostgreSQL functionality
● complex workload management
![Page 40: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/40.jpg)
Setup Recommendations
● at least two nodes
● send 35-day snapshots to other regions
● at-rest encryption
● enforce SSL
● provision with boto or AWS CLI
● cluster isolation to hide objects
● buy 3-year reservations
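Some of these recommendations can be captured as provisioning parameters. The sketch below only builds the request dictionaries and makes no AWS calls. The keys are meant to match the keyword arguments of boto3's Redshift `create_cluster` and `enable_snapshot_copy` calls as I understand them, so check them against the current documentation before use. All identifiers and the node type are hypothetical. SSL enforcement, isolation, and reservations are not covered.

```python
def cluster_request(identifier, node_type, nodes, username, password):
    """Arguments for an encrypted multi-node cluster with 35-day snapshots."""
    if nodes < 2:
        raise ValueError("the recommendation above is at least two nodes")
    return {
        "ClusterIdentifier": identifier,
        "ClusterType": "multi-node",
        "NodeType": node_type,
        "NumberOfNodes": nodes,
        "MasterUsername": username,
        "MasterUserPassword": password,
        "Encrypted": True,  # at-rest encryption
        "AutomatedSnapshotRetentionPeriod": 35,  # days
    }


def snapshot_copy_request(identifier, destination_region):
    """Arguments for copying automated snapshots to another region."""
    return {
        "ClusterIdentifier": identifier,
        "DestinationRegion": destination_region,
        "RetentionPeriod": 35,  # days kept in the destination region
    }


if __name__ == "__main__":
    # Hypothetical values; read the password from a secret store in practice.
    print(cluster_request("example-cluster", "dc2.large", 2, "admin", "CHANGE-ME"))
    print(snapshot_copy_request("example-cluster", "us-west-2"))
```

Keeping the parameters in code rather than in console clicks makes it easy to stand up an identical cluster later.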
![Page 41: Big data at AWS Chicago User Group](https://reader033.vdocuments.us/reader033/viewer/2022042518/557d2ca7d8b42a90748b4c3c/html5/thumbnails/41.jpg)
We’re Hiring!
Through research, experimentation, and iteration, we’re transforming how organizations do analytics. Our clients range in scale and focus from local to international, all empowered by our individual-level, data-driven approach.
civisanalytics.com/apply