
Cloud Analytics: Data Warehousing

Marco Serafini

COMPSCI 590S, Lecture 19


Trivia
• How does Amazon make money?
• Selling books? Entertainment?


Cloud Computing
• Shared resources
  • Multiple tenants sharing resources (with isolation)
  • Economy of scale
• Elastic provisioning
  • Can easily add and remove resources on the fly
  • Pay as you go: pay only when resources are used
• Different flavors
  • IaaS, PaaS, SaaS
  • Public and private clouds


Cloud Offerings
• Computing nodes
  • Example: AWS EC2
  • Full nodes with local storage and a pre-installed OS
  • Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable, ...
• Storage services
  • Example: AWS S3
  • Key-value stores (put/get), file systems (sketch below)
• Higher-level services
  • Example: DBMS
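
As a concrete illustration of the storage-service flavor, here is a minimal sketch of using S3 as a put/get key-value store through boto3; the bucket and object names are placeholders, not part of the lecture.

```python
# Minimal sketch: S3 as a put/get key-value store via boto3.
# Bucket and key names below are placeholders.
import boto3

s3 = boto3.client("s3")

# put: store a value under a key
s3.put_object(Bucket="example-analytics-bucket",
              Key="datasets/users.csv",
              Body=b"id,name\n1,alice\n2,bob\n")

# get: read the value back
obj = s3.get_object(Bucket="example-analytics-bucket", Key="datasets/users.csv")
print(obj["Body"].read().decode())
```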


Other Variants
• Spot instances
  • Allocated in real time based on live bidding
  • Can be revoked at any time (with notice)
• Serverless computing
  • Example: AWS Lambda
• Each of these services comes with its own pricing


Storage Disaggregation
• Use remote storage instead of local storage
• The network is fast
  • Remote and local storage can have the same throughput
• Advantages
  • Can use cloud storage services like S3
  • No configuration or provisioning needed
  • Cheaper
• Cost of disaggregated storage
  • Storage nodes can have weak CPUs and limited memory
  • Storage is cheap


Remote vs. Local Storage


Goals
• Easily parallelize single-threaded code
• Eliminate cluster management overhead
  • Deployment of nodes
  • Installation
  • Configuration
• Even cloud offerings have their complexities
  • Many instance types
  • Many services
• Solution: serverless functions


Serverless Functions
• Single-threaded code
• Invoked through HTTP requests (sketch below)
• Cloud platform takes care of
  • Deployment
  • Load balancing
  • Performance isolation
• No need to
  • Deploy servers
  • Configure clusters
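
A minimal sketch of what such a function looks like on AWS Lambda: a single-threaded handler invoked per request (for example over HTTP via API Gateway), with deployment, load balancing, and isolation left to the platform. The event fields are illustrative assumptions.

```python
# Sketch of a serverless function: single-threaded, invoked per request.
# The "numbers" field of the event is an illustrative assumption.
def handler(event, context):
    numbers = event.get("numbers", [])
    return {"count": len(numbers), "sum": sum(numbers)}
```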


State and Fault Tolerance
• State is lost after execution
• Inputs and outputs need to be persisted
• Fault tolerance (sketch below)
  • Re-execute the function
  • Require atomic writes to check what has succeeded
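
One possible sketch of this pattern, assuming S3 holds the persisted outputs: a whole-object put is atomic, so the presence of the output key tells a retry whether a previous execution already succeeded. Bucket and key names are placeholders.

```python
# Sketch: fault tolerance by re-execution, with outputs persisted atomically.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "example-analytics-bucket"   # placeholder

def run_task(task_id, compute):
    out_key = f"outputs/{task_id}"
    try:
        s3.head_object(Bucket=BUCKET, Key=out_key)
        return                         # output exists: an earlier run succeeded
    except ClientError:
        pass                           # no output yet: (re-)execute
    result = compute()                 # all state lives inside this call
    s3.put_object(Bucket=BUCKET, Key=out_key, Body=result)  # atomic whole-object write
```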


Registering Functions
• Registering a new Lambda function is slow
• Solution (sketch below)
  • Register a single generic Lambda function
  • Serialize the code that needs to be executed
  • Store the code (and the input data) on S3
  • The generic Lambda function loads the code and executes it
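
A sketch of this pattern, assuming cloudpickle for serializing the user code and S3 for staging; the bucket, key layout, and event wiring are illustrative assumptions.

```python
# Sketch of the single generic Lambda pattern (names and key layout assumed).
import boto3
import cloudpickle

s3 = boto3.client("s3")
BUCKET = "example-analytics-bucket"   # placeholder

# Client side: serialize the user function and its input, stage them on S3.
def submit(task_id, func, arg):
    s3.put_object(Bucket=BUCKET, Key=f"code/{task_id}", Body=cloudpickle.dumps(func))
    s3.put_object(Bucket=BUCKET, Key=f"input/{task_id}", Body=cloudpickle.dumps(arg))

# The one registered Lambda handler: load the code and input, run, persist output.
def handler(event, context):
    task_id = event["task_id"]
    func = cloudpickle.loads(
        s3.get_object(Bucket=BUCKET, Key=f"code/{task_id}")["Body"].read())
    arg = cloudpickle.loads(
        s3.get_object(Bucket=BUCKET, Key=f"input/{task_id}")["Body"].read())
    s3.put_object(Bucket=BUCKET, Key=f"output/{task_id}",
                  Body=cloudpickle.dumps(func(arg)))
    return {"status": "done"}
```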


Remote Storage Scalability


Semantics
• Map is easy
  • Execute one function per element of the list
• Map + single reducer
  • E.g., parallel featurization + single-server ML
• MapReduce
  • Many Lambdas needed, many small intermediate files
  • Use Redis, an in-memory key-value store (sketch below)
• Parameter server
  • Use Redis
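
A rough sketch of the MapReduce case, assuming a Redis instance reachable from the Lambdas: mappers push small intermediate values into per-key Redis lists instead of writing many tiny S3 objects, and reducers drain those lists. The host name and key layout are assumptions.

```python
# Sketch: Redis as the store for small intermediate MapReduce data.
import redis

r = redis.Redis(host="redis.example.internal", port=6379)   # placeholder host

# Map phase (runs in many Lambdas): emit intermediate (key, value) pairs.
def map_task(words):
    for word in words:
        r.rpush(f"shuffle:{word}", 1)          # one Redis list per reduce key

# Reduce phase (one Lambda per key): drain the list and aggregate.
def reduce_task(word):
    values = r.lrange(f"shuffle:{word}", 0, -1)
    return word, sum(int(v) for v in values)
```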


The Cost of Scaling Up
• Using more nodes does not always imply higher cost
• Lower latency → lower cost per node, since each node is billed for less time
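
A toy calculation of this point under pay-per-second billing; the rate and running times are made up for illustration.

```python
# Toy example: more nodes, same total cost, lower latency (numbers invented).
rate = 0.0001                        # price per node-second (illustrative)

one_node_cost  = 1  * 1000 * rate    # 1 node running for 1000 s  -> 0.10
ten_nodes_cost = 10 * 100  * rate    # 10 nodes running for 100 s -> 0.10

print(one_node_cost, ten_nodes_cost) # same total cost, 10x lower latency
```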



Shared-Nothing and the Cloud
• Shared-nothing architecture
  • Each node has its own disk and memory
  • All nodes are “symmetric”
• Challenges
  • Heterogeneous workloads
    • No one-size-fits-all hardware configuration
  • Membership changes
    • Large data shuffles when a node fails or is removed
  • Online upgrades
    • Similar to changing all the nodes in the system


Architecture
• Data Storage
  • Based on S3: high throughput, high latency
  • Also used for intermediate data
• Virtual Warehouses
  • Responsible for query execution
  • Stateless (restarted in their entirety)
  • Shared cache (low latency on hot data; most data is cold)
• Cloud Services
  • Query parsing, access control, optimization
  • Snapshot isolation with multi-versioning
  • Metadata on an external key-value store


Advantages
• Storage on S3 is cheaper
• Use expensive local disks only for hot data
• All services (except storage) are stateless
  • Simpler fault tolerance and membership changes
  • Example: online upgrade


SparkSQL: Spark + DBMS
• Extend Spark with
  • Simple, high-level SQL-like operators
  • Query optimization
• No need to transfer data across systems
  • ETL, query processing, and complex analytics in one system


DataFrames
• Collection of rows with a homogeneous schema
  • Like a table in a DBMS
  • Can be manipulated like an RDD
• DataFrame operations (sketch below)
  • Similar to Python pandas or R data frames
  • Evaluated lazily (query planning is postponed)
  • Can optimize across multiple queries
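
A minimal PySpark sketch of these operations: the transformations build up a plan lazily, and nothing executes until an action such as show(). The input path is a placeholder.

```python
# Sketch: lazy DataFrame operations in SparkSQL (PySpark).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframes-demo").getOrCreate()

df = spark.read.json("s3a://example-analytics-bucket/events.json")   # lazy
per_user = (df.filter(F.col("country") == "US")                      # lazy
              .groupBy("user_id")
              .agg(F.count("*").alias("events")))                    # still lazy

per_user.show()   # action: triggers planning, optimization, and execution
```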


Advantages
• Relational structure enables query optimization
• In-memory caching using a columnar representation
  • Better compression
• Mix SQL-like operators and arbitrary code (sketch below)
  • More flexible than UDFs in DBMSs
  • Can optimize across multiple SQL operations
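
A small sketch of mixing relational operators with arbitrary Python code via a UDF; the data and the normalization logic are invented for illustration.

```python
# Sketch: arbitrary Python code (a UDF) inside a relational pipeline.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, " Alice ", "US"), (2, None, "DE")],
                           ["user_id", "name", "country"])

@udf(returnType=StringType())
def normalize(name):
    return name.strip().lower() if name else None   # arbitrary Python logic

cleaned = (df.withColumn("name_norm", normalize(df["name"]))   # UDF in the plan
             .where("country = 'US'")                          # SQL-like operator
             .select("user_id", "name_norm"))
cleaned.show()
```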


Catalyst
• Query optimizer of SparkSQL
• Rule-based optimization
  • Rule: find a pattern and transform it
  • Used for both logical and physical plans
  • Rules can be customized
• Code generation
  • Directly outputs bytecode (as opposed to interpreting a plan)
  • Much more CPU efficient
• Flexible data sources
  • Can change the physical representation of DataFrames
  • Still use the optimizer
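
One way to see Catalyst at work is to print the plans it derives for a query: explain() shows the parsed, analyzed, and optimized logical plans plus the physical plan (including whole-stage code generation) without running the query. The mode argument assumes Spark 3.x.

```python
# Sketch: inspecting the plans produced by Catalyst.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (spark.range(1_000_000)            # DataFrame with a single "id" column
              .filter("id % 2 = 0")
              .selectExpr("id * 10 AS x"))

query.explain(mode="extended")   # logical plans + physical plan, no execution
```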
