
Page 1:

Migrating the plot2txt processing pipeline to AWS

bill brouwer / plot2txt.com

plot 2 txt

http://www.plot2txt.com 01/16

Page 2:

Overview

● The ('backend') processing flow of plot2txt now runs as a cloud service on AWS

● AWS offers a rich technology stack; the following pieces are used in this work:

– Processing algorithms/service → lambda functions

– NoSQL DB for saving metadata etc. → dynamoDB

– Logs → CloudWatch

– Storage → S3

– Access Control(s) → IAM

Page 3:

Architecture

● Three lambda functions are driven by uploads to three different S3 containers (0 → 2), i.e., when an object is placed in a bucket, the appropriate lambda function is triggered

[Architecture diagram: S3 buckets S3 0 – S3 3, Up/Down, dynamoDB input table and output table]

Page 4:

Setup

● Local development environment:

>uname -a
Linux bill-ThinkPad-W530 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

● Create an AWS account, launch web console:

Page 5:

Setup

Page 6:

Setup

● Assuming python (e.g., 2.7) is available, install the AWS command line interface and configure it, after producing admin creds at the web console (click on the IAM link, follow the directions for adding a new user with ADMIN privileges)

● The folder ~/.aws must exist after this step, with the desired config and creds files

>sudo pip install awscli

>aws configure

AWS Access Key ID [****************KHAA]:

AWS Secret Access Key [****************dTr0]:

Default region name [us-east-1]:

Default output format [json]:

>ls ~/.aws
config  credentials

Page 7:

Setup

● Also need to establish:

– PHP dev environment (for serving upload/browse web pages)

● create a composer.json file in the PHP project dir:

{
  "require": {
    "aws/aws-sdk-php": "2.*"
  }
}

● Install composer & create the env (a vendor/ directory should appear):

>curl -sS https://getcomposer.org/installer | php
>php composer.phar install

● Run Apache/PHP locally for testing (sudo cp test pages to /var/www/html etc.)

● All php files include app/start.php with creds:

Page 8:

Setup

use Aws\S3\S3Client;

require 'vendor/autoload.php';

$s3 = S3Client::factory(array(
    'region' => 'us-east-1',
    'version' => 'latest',
    'credentials' => array(
        'key' => 'xxxxxx',
        'secret' => 'xxxxxx',
    )
));

Page 9:

dynamoDB

● Using the CLI or web console, create dynamoDB tables:

Page 10:

dynamoDB

● Five key tables for p2t processing flow:

– dailyQuota → track upload size on a daily basis

– userQuota → for use with table above

– outputFiles → output meta-data from the processing flow (last lambda function)

● User key, time, input file key, size, output filename, URL for download

– processingJobs → meta-data from the input side (first lambda function)

● User key, time, input file key, size, processing files

– uploadDetails → meta-data from the point of upload

● User key, original filename, new random string key
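One of the items above can be sketched in dynamoDB's attribute-value wire format; the attribute names below are assumptions for illustration, not the actual uploadDetails schema.

```javascript
// Sketch of an uploadDetails putItem payload (attribute names assumed).
// 'S' marks a string attribute in dynamoDB's wire format.
function makeUploadDetailsItem(user, originalName, randomKey) {
  return {
    TableName: 'uploadDetails',
    Item: {
      user:     { S: user },
      original: { S: originalName },
      key:      { S: randomKey }
    }
  };
}
```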

Page 11:

dynamoDB

● Example: dailyQuota logging (PHP upload method)

$client = $sdk->createDynamoDb();

$result = $client->putItem(array(
    'TableName' => 'dailyQuota',
    'Item' => array(
        'user' => array('S' => $user),
        'time' => array('N' => (string) $t),
        'size' => array('N' => (string) $cumulative_size)
    )
));
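The quota bookkeeping behind dailyQuota can be sketched as a pure function; the limit and the shape of the inputs here are illustrative assumptions, not the actual plot2txt logic.

```javascript
// Sketch of a daily-quota check: sum the sizes already logged today and
// decide whether a new upload would push the user over the limit.
function quotaExceeded(todaysSizes, newSize, dailyLimit) {
  var used = todaysSizes.reduce(function(a, b) { return a + b; }, 0);
  return used + newSize > dailyLimit;
}
```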

Page 12:

dynamoDB

● Example: output details browsing (PHP browse method)

$client = $sdk->createDynamoDb();

// milliseconds
$t = strtotime("-2 days") * 1000;

$iterator = $client->getIterator('Query', array(
    'TableName' => 'outputFiles',
    'KeyConditions' => array(
        'email' => array(
            'AttributeValueList' => array(array('S' => 'user_handle')),
            'ComparisonOperator' => 'EQ'
        ),
        'time' => array(
            'AttributeValueList' => array(array('N' => (string) $t)),
            'ComparisonOperator' => 'GT'
        )
    )
));

Page 13:

Lambda functions

● Can create lambdas from the command line, e.g.,

> aws lambda create-function --region us-east-1 --function-name CreateThumbnail --zip-file fileb://textTN.zip --role arn:aws:iam::4856xxxxxxxx:role/lambda_s3_exec_role --handler CreateThumbnail.handler --runtime nodejs --timeout 10 --memory-size 1024

● Billing is a function of the memory-size used and the execution time (<= timeout)

● The region must be consistent with the S3 buckets and any other resources used, e.g., dynamoDB

● Obviously the S3 buckets and other resources must be created first

– Pay particular attention to access controls for S3; e.g., it is easy to make buckets publicly available via a simple URL, which may not be what you want :)
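Since billing scales with memory-size × execution time, a run's cost can be estimated in GB-seconds; the per-GB-second price is passed in as a placeholder argument, not a quoted AWS rate.

```javascript
// Estimate a lambda invocation's cost: (memory in GB) x (seconds) x price.
function lambdaCost(memoryMB, durationSec, pricePerGBSecond) {
  var gbSeconds = (memoryMB / 1024) * durationSec;
  return gbSeconds * pricePerGBSecond;
}
```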

Page 14:

Lambda functions

● For initial effort(s), the web console is more helpful:

– Upload zip file for function, or point to S3 location for large zip files

– Configure test event

– Debug from logged output

– Quickly change timeout length/memory consumed

Page 15:

Lambda functions

● e.g., make a new test event

Page 16:

Lambda functions

● e.g., edit the event details

Page 17:

Lambda functions

● Check CloudWatch logs for problems; common issues:

– Permissions

– Timeout

– Missing dependencies

Page 18:

Lambda functions

● Pay attention to roles & policies; you will need to update the simple S3 access role, e.g., if a lambda function accesses dynamoDB

– use IAM console to edit existing role/policy, or create a new one

Page 19:

Lambda functions

● Lambdas can time out/fail to complete for a variety of reasons, e.g.,

– node.js module or [something else] unavailable

– Premature termination

● The body of the (node.js) function must set the state of the context for successful termination, e.g.,

exports.handler = function(event, context) {
    context.done();
};

● The async nature of node.js program control/flow is liable to cause some consternation; two npm modules make development easier and help avoid timeouts/premature termination:

– Callback count

– Async waterfall

Page 20:

Lambda functions

● callback-count → track your callbacks, only proceed when a set number complete (works like a thread barrier)

– https://www.npmjs.com/package/callback-count

// from the webpage
// initialize
var counter = callbackCount(3, done);

// use throughout callbacks
counter.next(); counter.next(); counter.next();

// once the limit specified in callbackCount is reached, execute the following
function done() {
    callback(null, arg1);
}
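The barrier behaviour is small enough to sketch from scratch; this stand-in mimics callback-count's next()/done contract for illustration, but is not the npm module itself.

```javascript
// Minimal stand-in for callback-count: fire `done` once `limit` callbacks
// have each called next(), like a thread barrier.
function makeCounter(limit, done) {
  var n = 0;
  return {
    next: function() {
      n += 1;
      if (n === limit) done();
    }
  };
}
```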

Page 21:

Lambda functions

● async-waterfall → Run asynchronous tasks, cascaded together

– https://www.npmjs.com/package/async-waterfall

// from the webpage
var waterfall = require('async-waterfall');

waterfall([

function(callback){

callback(null, 'one', 'two');

},

function(arg1, arg2, callback){

callback(null, 'three');

},

function(arg1, callback){

// arg1 now equals 'three'

callback(null, 'done');

}

], function (err, result) {

// result now equals 'done'

});
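The mechanics behind that cascade are worth seeing once. This from-scratch sketch (not the npm module) threads each task's callback results into the next task; run against the three tasks above, the final result is 'done'.

```javascript
// Minimal waterfall: each task calls callback(err, ...results); the results
// become the arguments of the next task, and the final callback gets the last.
function waterfall(tasks, finalCb) {
  var i = 0;
  function next(err) {
    var args = Array.prototype.slice.call(arguments, 1);
    if (err || i === tasks.length) {
      return finalCb.apply(null, [err].concat(args));
    }
    var task = tasks[i++];
    task.apply(null, args.concat([next]));
  }
  next(null);
}
```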

Page 22:

Lambda functions

● Can work out the particular lambda instance details by (e.g.,) running some basic linux/unix utilities in a bash script called from index.js, uploaded with the lambda function:

var cmd = './my_bash_script.sh ';
var exec = require('child_process').exec;

exec(cmd, function(error, stdout, stderr) {
    console.log(stdout);
    console.log(stderr);
    if (error !== null) {
        console.log(error);
    } else {
        callback(null, 'bash script complete');
    }
});

Page 23:

Lambda functions

● Typical image details:

>cat /proc/cpuinfo | grep "model name"
model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

>cat /proc/meminfo | grep "MemTotal"
MemTotal: 3858728 kB

>uname -a
Linux ip-10-0-89-9 3.14.48-33.39.amzn1.x86_64 #1 SMP Tue Jul 14 23:43:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

>pwd
/var/task

Page 24:

Lambda functions

● Points to note:

– Lambda compute resources appear to be sparse; however, they are very cost-effective and perfect for a large number of short-running tasks (<= 300 s)

– There are some curiously absent unix/linux tools (e.g., bc, zip); however, the kernel is similar to stock EC2 instances, so missing utilities can be obtained from (e.g.,) a test EC2 instance

– Utilities, bash scripts, required node modules, and any executables wrapped in bash/nodejs etc. must all be zipped up and supplied together

– Assume very little about the instances used for lambda functions

Page 25:

Lambda functions

– Consider using node-lambda-template for rapid development and testing: https://github.com/motdotla/node-lambda-template

– The lambda function working directory appears to be /var/task; the only disk with write permission is /tmp

– If a lambda function fails, there are multiple subsequent attempts and thus costs incurred

– No state on the instance, i.e., files are not persistent; use dynamoDB or S3, for example ...

Page 26:

Lambda functions

● Example: loop over output files, upload to final bucket

fs.readdir("/tmp", function(err, files) {
    if (err) {
        console.log(err);
        return;
    }
    files.forEach(function(f) {
        if (pth.extname(f) == '.zip') {
            AWS.config.region = 'us-east-1';
            var table = "outputFiles";
            var newLabel = f; // object key for the output bucket
            console.log('uploading: ' + newLabel);
            var body = fs.createReadStream("/tmp/" + newLabel);
            var s3_out = new AWS.S3({params: {Bucket: 'output', Key: newLabel}});
            s3_out.upload({Body: body}).
                on('httpUploadProgress', function(evt) { console.log(evt); }).
                send(function(err, data) {
                    console.log(err, data);
                    counter.next();
                });
        }
    });
});

Page 27:

Lambda functions

● Example: output metadata produced by the last lambda function, put into the outputFiles table:

var params = {
    TableName: 'outputFiles',
    Item: {
        "time": time,
        "email": putEmail,
        "Info": {
            "id": globalLabel,
            "file": newLabel,
            "url": url,
            "base": realFile
        }
    }
};

Page 28:

Lambda functions

● Example: a download link created using a pre-signed URL, generated in the lambda function / nodejs:

// expire in 2 days
var exp = 3600 * 24 * 2;
var url_params = {Bucket: 'my_bucket', Key: 'object_key', Expires: exp};
var url = s3.getSignedUrl('getObject', url_params);

Page 29:

Summary

● AWS provides many tools, objects, and APIs for complete cloud-based solutions

● Relatively intuitive and easy to develop with

● Event-driven lambda functions, storage (S3), and a NoSQL database (dynamoDB) combine to provide a powerful backend

● Development time and costs are minuscule compared to alternatives ...

Page 30:

Billing

● Cost for this development work thus far:
