
Hands Dirty with Data (Com as Mãos Sujas de Dados)

Julio Faerman

1981269-3

FACOM TechWeek 2014

http://jfaerman.com.br/facom14

“In practice, the theory is different...”

16 years

2000+ employees

40 million users

http://aws.amazon.com/solutions/case-studies/netflix/

http://techblog.netflix.com/2013/12/netflix-presentation-videos-from-aws.html

Amazon Web Services for 100% of streaming

34.2% of all downstream traffic during primetime

Amazon Simple Storage Service

• Durable, scalable and fast storage (99.999999999%)
• 2+ Trillion (10^12) objects
• 1.1+ Million RPS
• Native HTTP/S
• And more: Permissions, Static Hosting, Logging, Versioning, Archival and Expiration Lifecycle, Torrent, Tags, Redundancy, Requester Pays, Cryptography, Reduced Redundancy

http://aws.amazon.com/s3/
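To make the eleven nines concrete, here is a minimal arithmetic sketch. It assumes the 99.999999999% figure is read as an annual, per-object durability; the function name `expected_annual_losses` is hypothetical and exists only for this illustration.

```python
from fractions import Fraction

# Durability figure quoted on the slide, kept as an exact fraction
# to avoid floating-point noise at eleven decimal places.
DURABILITY_PERCENT = Fraction("99.999999999")


def expected_annual_losses(num_objects):
    """Expected number of objects lost per year.

    Assumption: the durability figure applies per object, per year.
    """
    loss_probability = 1 - DURABILITY_PERCENT / 100
    return num_objects * loss_probability


if __name__ == "__main__":
    # 2 trillion objects, the store size quoted on the slide.
    print(expected_annual_losses(2 * 10**12))
    # 10 million objects, as a smaller comparison point.
    print(expected_annual_losses(10**7))
```

Under that assumption, the loss probability per object is 1 in 10^11 per year, so the expected loss scales linearly with the number of objects stored.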

“Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from televisions, laptops, and mobile devices every hour captured by our log data pipeline, plus dimension data from Cassandra supplied by our Aegisthus pipeline.”

http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html

“87% Cost Reduction per Streaming Start.”

http://youtu.be/XBgkZxAljbs

“In terms of scale, we have a 10 petabyte data warehouse on S3.”

http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html

Structured, Relational, On-Line: GB-TB-PB

Semi-structured, Map Reduce, Batch: TB-PB-EB

Once upon a time…

Structured

On-Line

Semi-structured

Unstructured

Distributed Cache

In-Memory Data Grid

Map Reduce

ETL (Extract-Transform-Load)

Graph Database

Document Database

Columnar Database

Real Time

Machine Learning

Relational Database

http://nathanmarz.com/

Data Structure Server

Stream Processing

Rule Engine

Amazon Elastic MapReduce

• Distributed processing with Apache Hadoop
• Near linear scalability
• Resizable and disposable Clusters
• Apache Hadoop ecosystem: Hive, Pig, Impala, Spark, ..., …, …
• Instant automatic provisioning
• Simplified Administration
• 5.5M+ Clusters

http://aws.amazon.com/elasticmapreduce/
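As a reminder of the programming model that Hadoop distributes across a cluster, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word count as the job. It is plain Python, not Hadoop or EMR code, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the sketch only shows the data flow.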

http://aws.amazon.com/solutions/case-studies/pinterest/

50K -> 17M users in 9 months

12- employees

48M users

8 billion objects

400+ TB of data

April 2013:

• 400+ Web Engines
• 400+ API Engines
• 70x2+ MySQL DBs
• 100+ Redis Instances
• 230+ Memcache Instances
• 10 Redis Task Manager
• 500 Redis Task Processors
• 80 Sharded Solr
• 20 HBase
• 12 Kafka + Azkaban
• 8 ZooKeeper Instances
• 12 Varnish

http://www.infoq.com/presentations/scaling-pinterest

Amazon Relational Database Service

• MySQL, PostgreSQL, Oracle or SQL Server
• Highly Available (Multi-AZ)
• Read-Replicas
• Automated Backup, Patching and Scaling

http://aws.amazon.com/rds/

Amazon ElastiCache

• Memcached and Redis
• Replication
• Backup and Restore
• Managed patch management, failure detection and recovery
• Elastic
• Reliable

http://aws.amazon.com/elasticache/

Amazon Redshift

• Petabyte Scale Data Warehousing
• Massively parallel OnLine Analytic Processing
• Resizable without downtime
• Managed provisioning and administration
• Compatible with PostgreSQL

http://aws.amazon.com/redshift/

Amazon Redshift Architecture

Leader Node
• SQL endpoint
• Stores metadata
• Coordinates query execution

Compute Nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH

Two hardware platforms
• Optimized for data processing
• DW1: HDD; scale from 2TB to 1.6PB
• DW2: SSD; scale from 160GB to 256TB

[Diagram: SQL Clients/BI Tools connect over JDBC/ODBC to the Leader Node. Compute Nodes (128GB RAM, 16TB disk, 16 cores each) are interconnected over 10 GigE (HPC). Ingestion, Backup and Restore go through Amazon S3 / DynamoDB / SSH.]

ETL from EMR/Hive to Amazon Redshift through Amazon S3

EMR -> S3 -> Redshift

Extract & Transform, then Load

Unstructured, Unclean

Structured, Clean

Columnar, Compressed
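The three data states on this slide (unstructured and unclean, structured and clean, columnar and compressed) can be sketched in a few lines of plain Python. This is an illustration only: it uses no EMR, S3 or Redshift API, and the `user_id,event,seconds` record layout and all function names are assumptions made for the example.

```python
import json
import zlib

COLUMN_NAMES = ("user_id", "event", "seconds")


def extract_and_transform(raw_lines):
    """Turn unclean text lines into structured rows, dropping malformed ones.

    Assumed input layout: 'user_id,event,seconds'.
    """
    rows = []
    for line in raw_lines:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip lines that do not match the assumed layout
        rows.append(
            {"user_id": parts[0], "event": parts[1].lower(), "seconds": int(parts[2])}
        )
    return rows


def to_columnar(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {name: [] for name in COLUMN_NAMES}
    for row in rows:
        for name in COLUMN_NAMES:
            columns[name].append(row[name])
    return columns


def load_compressed(columns):
    """Store each column separately as a compressed blob."""
    return {
        name: zlib.compress(json.dumps(values).encode("utf-8"))
        for name, values in columns.items()
    }


def read_column(store, name):
    """Decompress and decode a single column without touching the others."""
    return json.loads(zlib.decompress(store[name]).decode("utf-8"))


if __name__ == "__main__":
    raw = ["u1, PLAY ,30", "garbage", "u2,stop,x", "u3,Pause,5"]
    store = load_compressed(to_columnar(extract_and_transform(raw)))
    print(read_column(store, "seconds"))
```

Storing each column on its own is what lets an analytic query read only the columns it needs, and values of the same type tend to compress well together.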

Amazon Redshift at Pinterest Today

• 16 node 256TB cluster
• 2TB data per day
• 100+ regular users
• 500+ queries per day: 75% <= 35 seconds, 90% <= 2 minutes
• Operational effort <= 5 hours/week

Shazam @ Super Bowl

http://www.allthingsdistributed.com/2012/06/amazon-dynamodb-growth.html

Relational Index vs. Key-Value

B-Tree vs. Distributed Hash Table

O(log n) vs.
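The contrast between an ordered index and a hash-based key-value lookup can be sketched as follows. A binary search over a sorted list stands in for the tree index and needs a number of probes that grows with log n, while a hash function maps a key straight to a partition in one step, which is typically constant time on average. The use of SHA-256 and the function names are assumptions for illustration, not a description of how any particular database partitions its keys.

```python
import hashlib


def sorted_index_lookup(sorted_keys, target):
    """Binary search over sorted keys.

    Returns (position or None, number of probes), so the logarithmic
    growth of the probe count can be observed directly.
    """
    lo, hi, probes = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] == target:
            return mid, probes
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes


def hash_partition(key, num_partitions):
    """Map a string key to a partition number with a single hash computation."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    keys = list(range(1_000_000))
    position, probes = sorted_index_lookup(keys, 765_432)
    print(position, probes)
    print(hash_partition("user#42", 8))
```

The trade-off is that the ordered index supports range scans, while the hash lookup only answers "give me exactly this key".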

Amazon DynamoDB

• NoSQL Database
• Provisioned Throughput
• Seamless Scalability
• Zero Admin
• Single digit millisecond latency

http://aws.amazon.com/dynamodb/

~5TB in databases

1 billion requests/month

67,000 requests/minute

34 million recommendations/day

4 million products

27 million users

"We can't afford the luxury of throwing information away"

Availability Zone

Tomcat 6

The Beginning

Stage 1

Availability Zone

Tomcat 6, EhCache

Backup

Stage 2

Availability Zone

Tomcat 6, EhCache, NewRelic

MySQL Primary

Availability Zone

MySQL Secondary

EBS RAID0, EBS RAID0

Replication

Availability Zone, Availability Zone

Stage 3

Availability Zone

Tomcat 6 + EhCache

Nginx HAProxy

Availability Zone Availability Zone

MySQL 1

EBS RAID0

MySQL 2

EBS RAID0

Replication

Memcached

Elastic Load Balancer

Stage 4

Auto Scaling group

Nginx, HAProxy, Jetty, EhCache

Availability Zone

Memcached

Availability Zone, Availability Zone

region, region

Architecture Evolution

Amazon Kinesis

Amazon Data Pipeline

Coming up in the next episodes…

http://aws.amazon.com/datapipeline/

http://aws.amazon.com/kinesis/

Where to begin?

http://aws.amazon.com/training/intro_series/

http://aws.amazon.com/free/

http://aws.amazon.com/training/

https://www.youtube.com/user/AmazonWebServices

http://aws.amazon.com/podcasts/aws-podcast/

http://aws.amazon.com/blogs/aws/

http://awshub.com.br

Julio Faerman

faermanj@amazon.com

http://jfaerman.com.br/facom14

Thank you! Questions?
