Post on 18-Jun-2020

Accessing Data in the Cloud

Using SAS to read data from Amazon Simple Storage Service (S3)

seleritysas.com

What is Amazon Simple Storage Service (S3)?

• An object store, not a file system

• Write once, read many (WORM)

• Eventually consistent

• 99.999999999% durability

• Unlimited storage capacity

• Highly scalable and available data storage

• Low latency and high throughput performance

What Public Data is Available in S3?

• AWS Public Datasets
  • https://aws.amazon.com/public-datasets/
  • Geospatial and Environmental Datasets
  • Genomics and Life Science Datasets
  • Datasets for Machine Learning
  • Regulatory and Statistical Data

• awesome-public-datasets
  • https://github.com/caesar0301/awesome-public-datasets

• NYC Taxi and Limousine Commission
  • http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

What is the typical workflow to use raw data from S3?

• Download the data file from S3 to your PC using HTTP/HTTPS

• Upload/Import the data to SAS

What would make this more efficient?

• Cutting out the middle-man (your local PC)

How can we have S3 communicate directly with the SAS Server?

• Use the FILENAME URL access method
  ✓ Easy to implement
  ✗ File is retrieved using the HTTP protocol (serially)
  ✗ The slowest of all options, subject to timeouts for very large files

• Use PROC S3 to download files to the SAS Server’s filesystem
  ✓ Very fast, as it uses parallel downloads
  ✗ Only available from SAS 9.4M4
  ✗ Only works with secure S3 files, not public S3 files
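The contrast between a serial HTTP transfer and a parallel download can be illustrated with a generic sketch. The slides do not say how PROC S3 splits its work, so this is not its implementation. It only shows one common way a client might divide an object into byte ranges that could each be requested concurrently with an HTTP Range header. The function names are hypothetical.

```python
def split_into_ranges(total_bytes, parts):
    """Split an object of total_bytes into up to `parts` contiguous,
    inclusive (start, end) byte ranges that together cover every byte once."""
    if total_bytes <= 0 or parts <= 0:
        raise ValueError("total_bytes and parts must be positive")
    parts = min(parts, total_bytes)          # never create empty ranges
    base, extra = divmod(total_bytes, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` ranges carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def range_header(start, end):
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


# Illustration only: a 10-byte object split across 3 workers.
example_ranges = split_into_ranges(10, 3)
example_headers = [range_header(s, e) for s, e in example_ranges]
```

A serial transfer corresponds to a single range covering the whole object. A parallel client would issue one request per range and reassemble the pieces in order.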

How can we have S3 communicate directly with the SAS Server?

• Use the AWS CLI to download files to the SAS Server’s filesystem
  ✓ Very fast, as it uses parallel downloads
  ✗ Need to install the AWS CLI on the SAS Server
  ✗ Need the ability to run X commands on the SAS Server

• “Mount” the S3 storage on the SAS Server
  ✓ Treat it like a local disk
  ✗ S3 is not designed for block storage/access
  ✗ Potential issues with current storage driver implementations
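For the AWS CLI option, the command that the SAS Server would need to run might look like the one assembled below. This is a sketch, not the command used in the presentation. The local target path is hypothetical, and `--no-sign-request` (the AWS CLI option for unauthenticated requests) assumes the bucket allows anonymous reads. The code only builds the command line and does not execute it.

```python
import shlex


def build_s3_copy_command(s3_uri, local_path, anonymous=True):
    """Return the argument list for an `aws s3 cp` download."""
    args = ["aws", "s3", "cp", s3_uri, local_path]
    if anonymous:
        # Skip request signing; only valid for publicly readable objects.
        args.append("--no-sign-request")
    return args


# Hypothetical local path on the SAS Server; the S3 URI is the one from the slides.
cmd = build_s3_copy_command(
    "s3://nyc-tlc/trip data/yellow_tripdata_2017-01.csv",
    "/tmp/yellow_tripdata_2017-01.csv",
)

# Shell-quoted form, suitable for pasting into a terminal or a host command.
command_line = shlex.join(cmd)
```

Because the object key contains a space, the S3 URI has to be quoted when the command is passed through a shell, which is why the sketch uses `shlex.join` rather than joining with plain spaces.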

Example: NYC Trip Data in S3

• NYC Yellow Cab trip data for January 2017
  • 9,710,124 records
  • CSV format
  • 815 MB

• Location
  • Bucket: nyc-tlc
  • Object Key: trip data/yellow_tripdata_2017-01.csv

• HTTP Protocol: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv

• S3 Protocol: “s3://nyc-tlc/trip data/yellow_tripdata_2017-01.csv”
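The two location forms above are both derived from the same bucket and object key. The sketch below shows one way to build them, assuming the path-style endpoint `s3.amazonaws.com` used on the slide. The helper name is hypothetical. Note that the slide's HTTPS URL encodes the space in the key as `+`, which `quote_plus` reproduces, while the S3 URI keeps the key verbatim.

```python
from urllib.parse import quote_plus


def s3_locations(bucket, key):
    """Return (https_url, s3_uri) for an object, given its bucket and key."""
    # Keep "/" unescaped so the key's path structure survives;
    # spaces become "+", matching the URL shown on the slide.
    encoded_key = quote_plus(key, safe="/")
    https_url = f"https://s3.amazonaws.com/{bucket}/{encoded_key}"
    s3_uri = f"s3://{bucket}/{key}"
    return https_url, s3_uri


https_url, s3_uri = s3_locations("nyc-tlc", "trip data/yellow_tripdata_2017-01.csv")
```

The HTTPS form is what a URL-based reader such as the FILENAME URL access method would be given, and the `s3://` form is what S3-aware tools such as the AWS CLI expect.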

FILENAME URL Access Method

NOTE: The data set WORK.YELLOW_TRIPDATA_2017_01 has 9710124 observations and 17 variables.
      real time           36.09 seconds
      cpu time            33.85 seconds

PROC S3

NOTE: PROCEDURE S3 used (Total process time):
      real time           3.77 seconds
      cpu time            6.31 seconds

NOTE: PROCEDURE IMPORT used (Total process time):
      real time           26.75 seconds
      cpu time            26.75 seconds

AWS CLI

NOTE: DATA statement used (Total process time):
      real time           5.80 seconds
      cpu time            0.00 seconds

NOTE: PROCEDURE IMPORT used (Total process time):
      real time           26.59 seconds
      cpu time            26.59 seconds
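The log timings from the three approaches can be put side by side with a little arithmetic. The sketch below uses only the real-time figures and the 815 MB file size quoted in the slides. It assumes that the DATA step in the AWS CLI log is the step that ran the download, and that the FILENAME URL figure covers both retrieval and import in a single step. Neither is stated explicitly in the logs.

```python
FILE_SIZE_MB = 815  # file size quoted on the example slide

# Real-time seconds taken from the three log excerpts.
timings = {
    "FILENAME URL": {"read and import": 36.09},
    "PROC S3": {"download": 3.77, "import": 26.75},
    "AWS CLI": {"download": 5.80, "import": 26.59},
}

# End-to-end elapsed time per method.
totals = {
    method: round(sum(steps.values()), 2)
    for method, steps in timings.items()
}

# Approximate download throughput where the download is a separate step.
download_rates = {
    method: round(FILE_SIZE_MB / steps["download"], 1)
    for method, steps in timings.items()
    if "download" in steps
}

for method, total in totals.items():
    print(f"{method}: {total} s total")
```

On these figures the download itself is only a few seconds for PROC S3 and the AWS CLI, and most of the elapsed time in both cases is PROC IMPORT parsing the CSV.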

Questions?

Contact

[email protected]

1300 727 757

seleritysas.com