
AWS Data Pipeline Developer Guide

API Version 2012-10-29

Amazon Web Services


What is AWS Data Pipeline? ..... 1
How Does AWS Data Pipeline Work? ..... 1
    Pipeline Definition ..... 2
    Lifecycle of a Pipeline ..... 4
    Task Runners ..... 5
    Pipeline Components, Instances, and Attempts ..... 10
    Lifecycle of a Pipeline Task ..... 11
Get Set Up ..... 12
Access the Console ..... 12
Install the Command Line Interface ..... 15
Deploy and Configure Task Runner ..... 19
Install the AWS SDK ..... 20
Granting Permissions to Pipelines with IAM ..... 21
Grant Amazon RDS Permissions to Task Runner ..... 23
Tutorial: Copy CSV Data from Amazon S3 to Amazon S3 ..... 25
    Using the AWS Data Pipeline Console ..... 27
    Using the Command Line Interface ..... 33
Tutorial: Copy Data From a MySQL Table to Amazon S3 ..... 40
    Using the AWS Data Pipeline Console ..... 42
    Using the Command Line Interface ..... 48
Tutorial: Launch an Amazon EMR Job Flow ..... 56
    Using the AWS Data Pipeline Console ..... 57
    Using the Command Line Interface ..... 63
Tutorial: Import/Export Data in Amazon DynamoDB With Amazon EMR and Hive ..... 69
    Part One: Import Data into Amazon DynamoDB ..... 69
        Using the AWS Data Pipeline Console ..... 74
        Using the Command Line Interface ..... 81
    Part Two: Export Data from Amazon DynamoDB ..... 90
        Using the AWS Data Pipeline Console ..... 92
        Using the Command Line Interface ..... 98
Tutorial: Run a Shell Command to Process MySQL Table ..... 107
    Using the AWS Data Pipeline Console ..... 109
Manage Pipelines ..... 116
    Using AWS Data Pipeline Console ..... 116
    Using the Command Line Interface ..... 121
Troubleshoot AWS Data Pipeline ..... 128
Pipeline Definition Files ..... 135
    Creating Pipeline Definition Files ..... 135
    Example Pipeline Definitions ..... 139
        Copy SQL Data to a CSV File in Amazon S3 ..... 139
        Launch an Amazon EMR Job Flow ..... 141
        Run a Script on a Schedule ..... 143
        Chain Multiple Activities and Roll Up Data ..... 144
        Copy Data from Amazon S3 to MySQL ..... 146
        Extract Apache Web Log Data from Amazon S3 using Hive ..... 148
        Extract Amazon S3 Data (CSV/TSV) to Amazon S3 using Hive ..... 150
        Extract Amazon S3 Data (Custom Format) to Amazon S3 using Hive ..... 152
Simple Data Types ..... 153
Expression Evaluation ..... 155
Objects ..... 161
    Schedule ..... 163
    S3DataNode ..... 165
    MySqlDataNode ..... 169
    DynamoDBDataNode ..... 172
    ShellCommandActivity ..... 176
    CopyActivity ..... 180
    EmrActivity ..... 184
    HiveActivity ..... 188
    ShellCommandPrecondition ..... 192
    Exists ..... 194
    S3KeyExists ..... 197
    S3PrefixNotEmpty ..... 200
    RdsSqlPrecondition ..... 203
    DynamoDBTableExists ..... 204
    DynamoDBDataExists ..... 204
    Ec2Resource ..... 205
    EmrCluster ..... 209
    SnsAlarm ..... 213
Command Line Reference ..... 215
Program AWS Data Pipeline ..... 231
    Make an HTTP Request to AWS Data Pipeline ..... 231
    Actions in AWS Data Pipeline ..... 234
Task Runner Reference ..... 235
Web Service Limits ..... 238
AWS Data Pipeline Resources ..... 241
Document History ..... 243


What is AWS Data Pipeline?

AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks.

For example, you can use AWS Data Pipeline to archive your web server's logs to Amazon Simple Storage Service (Amazon S3) each day and then run a weekly Amazon Elastic MapReduce (Amazon EMR) job flow over those logs to generate traffic reports.

In this example, AWS Data Pipeline would schedule the daily tasks to copy data and the weekly task to launch the Amazon EMR job flow. AWS Data Pipeline would also ensure that Amazon EMR waits for the final day's data to be uploaded to Amazon S3 before it begins its analysis, even if there is an unforeseen delay in uploading the logs.

AWS Data Pipeline handles the ambiguities of real-world data management. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up.

How Does AWS Data Pipeline Work?

Three main components of AWS Data Pipeline work together to manage your data:

• Pipeline definition specifies the business logic of your data management. For more information, see Pipeline Definition Files (p. 135).

• AWS Data Pipeline web service interprets the pipeline definition and assigns tasks to workers to move and transform data.

• Task Runners poll the AWS Data Pipeline web service for tasks and then perform those tasks. In the previous example, Task Runner would copy log files to Amazon S3 and launch Amazon EMR job flows. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by AWS Data Pipeline. For more information, see Task Runner (p. 5) and Custom Task Runner (p. 8).

The following illustration shows how these components work together. If the pipeline definition supports nonserialized tasks, AWS Data Pipeline can manage tasks for multiple task runners working in parallel.

Pipeline Definition

A pipeline definition is how you communicate your business logic to AWS Data Pipeline. It contains the following information:

• Names, locations, and formats of your data sources.

• Activities that transform the data.

• The schedule for those activities.

• Resources that run your activities and preconditions.

• Preconditions that must be satisfied before the activities can be scheduled.

• Ways to alert you with status updates as pipeline execution proceeds.

From your pipeline definition, AWS Data Pipeline determines the tasks that will occur, schedules them, and assigns them to task runners. If a task is not completed successfully, AWS Data Pipeline retries the task according to your instructions and, if necessary, reassigns it to another task runner. If the task fails repeatedly, you can configure the pipeline to notify you.

For example, in your pipeline definition, you might specify that in 2013, log files generated by your application will be archived each month to an Amazon S3 bucket. AWS Data Pipeline would then create 12 tasks, each copying over a month's worth of data, regardless of whether the month contained 28, 29, 30, or 31 days.
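Such a monthly run could be driven by a Schedule object along the following lines. This is only a sketch: the field names follow the Schedule object covered later in this guide, but the object name, dates, and period value are illustrative.

{
  "id" : "MonthlyArchiveSchedule",
  "type" : "Schedule",
  "startDateTime" : "2013-01-01T00:00:00",
  "endDateTime" : "2014-01-01T00:00:00",
  "period" : "1 month"
}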


You can create a pipeline definition in the following ways:

• Graphically, by using the AWS Data Pipeline console.

• Textually, by writing a JSON file in the format used by the command line interface.

• Programmatically, by calling the web service with either one of the AWS SDKs or the AWS Data Pipeline API.

A pipeline definition can contain the following types of components:

Data Node
The location of input data for a task or the location where output data is to be stored. The following data locations are currently supported:

• Amazon S3 bucket

• MySQL database

• Amazon DynamoDB

• Local data node

Activity
An interaction with the data. The following activities are currently supported:

• Copy to a new location

• Launch an Amazon EMR job flow

• Run a Bash script from the command line (requires a UNIX environment to run the script)

• Run a database query

• Run a Hive activity

Precondition
A conditional statement that must be true before an action can run. The following preconditions are currently supported:

• A command-line Bash script was successfully completed (requires a UNIX environment to run the script)

• Data exists

• A specific time or a time interval relative to another event has been reached

• An Amazon S3 location contains data

• An Amazon RDS or Amazon DynamoDB table exists

Schedule
Any or all of the following:

• The time that an action should start

• The time that an action should stop

• How often the action should run

Resource
A resource that can analyze or modify data. The following computational resources are currently supported:

• Amazon EMR job flow

• Amazon EC2 instance

Action
A behavior that is triggered when specified conditions are met, such as the failure of an activity. The following actions are currently supported:

• Amazon SNS notification

• Terminate action

For more information, see Pipeline Definition Files (p. 135).

Lifecycle of a Pipeline

After you create a pipeline definition, you create a pipeline and then add your pipeline definition to it. Your pipeline must be validated. After you have a valid pipeline definition, you can activate it. At that point, the pipeline runs and schedules tasks. When you are done with your pipeline, you can delete it.

The complete lifecycle of a pipeline is shown in the following illustration.

Task Runners

A task runner is an application that polls AWS Data Pipeline for tasks and then performs those tasks. You can either use Task Runner as provided by AWS Data Pipeline, or create a custom Task Runner application.

Task Runner

Task Runner is a default implementation of a task runner that is provided by AWS Data Pipeline. When Task Runner is installed and configured, it polls AWS Data Pipeline for tasks associated with pipelines that you have activated. When a task is assigned to Task Runner, it performs that task and reports its status back to AWS Data Pipeline. If your workflow requires non-default behavior, you'll need to implement that functionality in a custom task runner.

There are three ways you can use Task Runner to process your pipeline:

• AWS Data Pipeline installs Task Runner for you on resources that are launched and managed by the web service.

• You install Task Runner on a computational resource that you manage, such as a long-running Amazon EC2 instance, or an on-premise server.

• You modify the Task Runner code to create a custom Task Runner, which you then install on a computational resource that you manage.

Task Runner on AWS Data Pipeline-Managed Resources

When a resource is launched and managed by AWS Data Pipeline, the web service automatically installs Task Runner on that resource to process tasks in the pipeline. You specify a computational resource (either an Amazon EC2 instance or an Amazon EMR job flow) for the runsOn field of an activity object. When AWS Data Pipeline launches this resource, it installs Task Runner on that resource and configures it to process all activity objects that have their runsOn field set to that resource. When AWS Data Pipeline terminates the resource, Task Runner publishes all of its logs to an Amazon S3 location before it shuts down.

For example, suppose you use the EmrActivity action in a pipeline and specify an EmrCluster object in the runsOn field. When AWS Data Pipeline processes that activity, it launches an Amazon EMR job flow and uses a bootstrap step to install Task Runner onto the master node. This Task Runner then processes the tasks for activities that have their runsOn field set to that EmrCluster object. The following excerpt from a pipeline definition shows the relationship between the two objects.

{ "id" : "MyEmrActivity", "name" : "Work to perform on my data", "type" : "EmrActivity", "runsOn" : {"ref" : "MyEmrCluster"}, "preStepCommand" : "scp remoteFiles localFiles", "step" : "s3://myBucket/myPath/myStep.jar,firstArg,secondArg", "step" : "s3://myBucket/myPath/myOtherStep.jar,anotherArg", "postStepCommand" : "scp localFiles remoteFiles", "input" : {"ref" : "MyS3Input"}, "output" : {"ref" : "MyS3Output"}

API Version 2012-10-296

AWS Data Pipeline Developer GuideTask Runners

},{ "id" : "MyEmrCluster", "name" : "EMR cluster to perform the work", "type" : "EmrCluster", "hadoopVersion" : "0.20", "keypair" : "myKeyPair", "masterInstanceType" : "m1.xlarge", "coreInstanceType" : "m1.small", "coreInstanceCount" : "10", "instanceTaskType" : "m1.small", "instanceTaskCount": "10", "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-ha doop,arg1,arg2,arg3", "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-other-stuff,arg1,arg2"}

If you have multiple AWS Data Pipeline-managed resources in a pipeline, Task Runner is installed on each of them, and they all poll AWS Data Pipeline for tasks to process.

Task Runner on User-Managed Resources

You can install Task Runner on computational resources that you manage, such as a long-running Amazon EC2 instance or a physical server. This approach can be useful when, for example, you want to use AWS Data Pipeline to process data that is stored inside your organization's firewall. By installing Task Runner on a server in the local network, you can access the local database securely and then poll AWS Data Pipeline for the next task to run. When AWS Data Pipeline ends processing or deletes the pipeline, the Task Runner instance remains running on your computational resource until you manually shut it down. Similarly, the Task Runner logs persist after pipeline execution is complete.

You download Task Runner, which is in Java Archive (JAR) format, and install it on your computational resource. For more information about downloading and installing Task Runner, see Deploy and Configure Task Runner (p. 19). To connect a Task Runner that you've installed to the pipeline activities it should process, add a workerGroup field to the object, and configure Task Runner to poll for that worker group value. You do this by passing the worker group string as a parameter (for example, --workerGroup=wg-12345) when you run the Task Runner JAR file.
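For example, on the computer where you installed Task Runner, you might start it with a command like the following. This is a sketch: the JAR name and credentials path match the installation steps later in this guide, and wg-12345 is an illustrative worker group name.

java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=wg-12345

The following pipeline object excerpt shows an activity configured with the matching workerGroup value.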


{ "id" : "MyStoredProcedureActivity", "type" : "StoredProcedureActivity", "workerGroup" : "wg-12345", "command" : "mkdir new-directory"}

Custom Task Runner

If your data management requires behavior other than the default behavior provided by Task Runner, you need to create a custom task runner. Because Task Runner is an open-source application, you can use it as the basis for creating your custom implementation.

After you write the custom task runner, you install it on a computational resource that you own, such as a long-running EC2 instance or a physical server inside your organization's firewall. To connect your custom task runner to the pipeline activities it should process, add a workerGroup field to the object, and configure your custom task runner to poll for that worker group value.

For example, if you use the ShellCommandActivity action in a pipeline and specify a value for the workerGroup field, then when AWS Data Pipeline processes that activity, it passes the task to a task runner that polls the web service for work and specifies that worker group. The following excerpt from a pipeline definition shows how to configure the workerGroup field.

{
  "id" : "CreateDirectory",
  "type" : "ShellCommandActivity",
  "workerGroup" : "wg-67890",
  "command" : "mkdir new-directory"
}

When you create a custom task runner, you have complete control over how your pipeline activities are processed. The only requirement is that you communicate with AWS Data Pipeline as follows:

• Poll for tasks—Your task runner should poll AWS Data Pipeline for tasks to process by calling the PollForTask API. If tasks are ready in the work queue, PollForTask returns a response immediately. If no tasks are available in the queue, PollForTask uses long polling and holds the connection open for up to 90 seconds, during which time any newly scheduled tasks are handed to the task runner. Your task runner should not call PollForTask again on the same worker group until it receives a response, and this may take up to 90 seconds.

• Report progress—Your task runner should report its progress to AWS Data Pipeline by calling the ReportTaskProgress API each minute. If a task runner does not report its status after 5 minutes, then every 20 minutes afterwards (configurable), AWS Data Pipeline assumes the task runner is unable to process the task and reassigns it in a subsequent response to PollForTask.

• Signal completion of a task—Your task runner should inform AWS Data Pipeline of the outcome when it completes a task by calling the SetTaskStatus API. The task runner calls this action regardless of whether the task was successful. The task runner does not need to call SetTaskStatus for tasks canceled by AWS Data Pipeline.

Pipeline Components, Instances, and Attempts

There are three types of items associated with a scheduled pipeline:

• Pipeline Components — Pipeline components represent the business logic of the pipeline and are represented by the different sections of a pipeline definition. Pipeline components specify the data sources, activities, schedule, and preconditions of the workflow. They can inherit properties from parent components. Relationships among components are defined by reference. Pipeline components define the rules of data management; they are not a to-do list.

• Instances — When AWS Data Pipeline runs a pipeline, it compiles the pipeline components to create a set of actionable instances. Each instance contains all the information needed to perform a specific task. The complete set of instances is the to-do list of the pipeline. AWS Data Pipeline hands the instances out to task runners to process.

• Attempts — To provide robust data management, AWS Data Pipeline retries a failed operation. It continues to do so until the task reaches the maximum number of allowed retry attempts. Attempt objects track the various attempts, results, and failure reasons if applicable. Essentially, an attempt is an instance with a retry counter.

Note
Retrying failed tasks is an important part of a fault tolerance strategy, and AWS Data Pipeline pipeline definitions provide conditions and thresholds to control retries. However, too many retries can delay detection of an unrecoverable failure because AWS Data Pipeline does not report failure until it has exhausted all the retries that you specify. The extra retries may accrue additional charges if they are running on AWS resources. As a result, carefully consider when it is appropriate to exceed the AWS Data Pipeline default settings that control retries and related behavior.
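For example, you can tighten retry behavior on an individual activity in the pipeline definition. The following excerpt is a sketch only; it assumes an activity-level retry field such as maximumRetries, and the value shown is illustrative. Check the object reference for the retry-related fields your activity type actually supports.

{
  "id" : "MyCopyActivity",
  "type" : "CopyActivity",
  "input" : {"ref" : "MyS3Input"},
  "output" : {"ref" : "MyS3Output"},
  "maximumRetries" : "2"
}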


Lifecycle of a Pipeline Task

The following diagram illustrates how AWS Data Pipeline and a task runner interact to process a scheduled task.


Get Set Up for AWS Data Pipeline

There are several ways you can interact with AWS Data Pipeline:

• Console — a graphical interface you can use to create and manage pipelines. With it, you fill out web forms to specify the configuration details of your pipeline components. The AWS Data Pipeline console provides several templates, which are pre-configured pipelines for common scenarios. As you keep building your pipeline, a graphical representation of the components appears on the design pane. The arrows between the components indicate the connection between the components. Using the console is the easiest way to get started with AWS Data Pipeline. It creates the pipeline definition for you, and no JSON or programming knowledge is required. The console is available online at https://console.aws.amazon.com/datapipeline/. For more information about accessing the console, see Access the Console (p. 12).

• Command Line Interface (CLI) — an application you run on your local machine to connect to AWS Data Pipeline and create and manage pipelines. With it, you issue commands into a terminal window and pass in JSON files that specify the pipeline definition. Using the CLI is the best option if you prefer working from a command line. For more information, see Install the Command Line Interface (p. 15).

• Software Development Kit (SDK) — AWS provides an SDK with functions that call AWS Data Pipeline to create and manage pipelines. With it, you can write applications that automate the process of creating and managing pipelines. Using the SDK is the best option if you want to extend or customize the functionality of AWS Data Pipeline. You can download the AWS SDK for Java from http://aws.amazon.com/sdkforjava/.

• Web Service API — AWS provides a low-level interface that you can use to call the web service directly using JSON. Using the API is the best option if you want to create a custom SDK that calls AWS Data Pipeline. For more information, see AWS Data Pipeline API Reference.

In addition, there is the Task Runner application, which is a default implementation of a task runner. Depending on the requirements of your data management, you may need to install Task Runner on a computational resource such as a long-running Amazon EC2 instance or a physical server. For more information about when to install Task Runner, see Task Runner (p. 5). For more information about how to install Task Runner, see Deploy and Configure Task Runner (p. 19).

Access the Console

Topics

• Where Do I Go Now? (p. 15)


The AWS Data Pipeline console enables you to do the following:

• Create, save, and activate your pipeline

• View the details of all the pipelines associated with your account

• Modify your pipeline

• Delete your pipeline

You must have an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. When you create an AWS account, AWS automatically signs up the account for all AWS services, including AWS Data Pipeline. With AWS Data Pipeline, you pay only for what you use. For more information about AWS Data Pipeline usage rates, see AWS Data Pipeline.

If you have an AWS account already, skip to the next step. If you don't have an AWS account, use the following procedure to create one.

To create an AWS account

1. Go to AWS and click Sign Up Now.

2. Follow the on-screen instructions.

Part of the sign-up process involves receiving a phone call and entering a PIN using the phone keypad.

To access the console

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/.

2. If your account doesn't already have data pipelines, the console displays the following introductory screen that prompts you to create your first pipeline. This screen also provides an overview of the process for creating a pipeline, and links to relevant documentation and resources. Click Create Pipeline to create your pipeline.


If you already have pipelines associated with your account, the console displays the page listing all the pipelines associated with your account. Click Create New Pipeline to create your pipeline.


Where Do I Go Now?

You are now ready to start creating your pipelines. For more information about creating a pipeline, see the following tutorials:

• Tutorial: Copy CSV Data from Amazon S3 to Amazon S3 (p. 25)

• Tutorial: Copy Data From a MySQL Table to Amazon S3 (p. 40)

• Tutorial: Launch an Amazon EMR Job Flow (p. 56)

• Tutorial: Run a Shell Command to Process MySQL Table (p. 107)

Install the Command Line Interface

The AWS Data Pipeline command line interface (CLI) is a tool you can use to create and manage pipelines from a terminal window. It is written in Ruby and makes calls to the web service on your behalf.

Topics

• Install Ruby (p. 15)

• Install the RubyGems package management framework (p. 15)

• Install Prerequisite Ruby Gems (p. 16)

• Install the AWS Data Pipeline CLI (p. 17)

• Locate your AWS Credentials (p. 17)

• Create a Credentials File (p. 18)

• Verify the CLI (p. 18)

Install Ruby

The AWS Data Pipeline CLI requires Ruby 1.8.7. Some operating systems, such as Mac OS, come with Ruby pre-installed.

To verify the Ruby installation and version

• To check whether Ruby is installed, and which version, run the following command in a terminal window. If Ruby is installed, this command displays its version information.

ruby -v

If you don’t have Ruby 1.8.7 installed, use the following procedure to install it.

To install Ruby on Linux/Unix/Mac OS

• Download Ruby from http://www.ruby-lang.org/en/downloads/ and follow the installation instructions for your version of Linux/Unix/Mac OS.

Install the RubyGems package management framework

The AWS Data Pipeline CLI requires a version of RubyGems that is compatible with Ruby 1.8.7.


To verify the RubyGems installation and version

• To check whether RubyGems is installed, run the following command from a terminal window. If RubyGems is installed, this command displays its version information.

gem -v

If you don't have RubyGems installed, or have a version not compatible with Ruby 1.8.7, you need to download and install RubyGems before you can install the AWS Data Pipeline CLI.

To install RubyGems on Linux/Unix/Mac OS

1. Download RubyGems from http://rubyforge.org/frs/?group_id=126.

2. Install RubyGems using the following command.

sudo ruby setup.rb

Install Prerequisite Ruby Gems

The AWS Data Pipeline CLI requires Ruby 1.8.7 or greater, a compatible version of RubyGems, and the following Ruby gems:

• json (version 1.4 or greater)

• uuidtools (version 2.1 or greater)

• httparty (version 0.7 or greater)

• bigdecimal (version 1.0 or greater)

• nokogiri (version 1.4.4 or greater)

The following topics describe how to install the AWS Data Pipeline CLI and the Ruby environment it requires.

Use the following procedures to ensure that each of the gems listed above is installed.

To verify whether a gem is installed

• To check whether a gem is installed, run the following command from a terminal window. For example, if 'uuidtools' is installed, this command displays the name and version of the 'uuidtools' RubyGem.

gem search 'uuidtools'

If you don't have 'uuidtools' installed, then you need to install it before you can install the AWS Data Pipeline CLI.

To install 'uuidtools' on Windows/Linux/Unix/Mac OS

• Install 'uuidtools' using the following command.


sudo gem install uuidtools
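The remaining prerequisite gems can be installed the same way. For example, the following single command is a sketch that installs them all at once and lets RubyGems resolve compatible versions:

sudo gem install json httparty bigdecimal nokogiri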

Install the AWS Data Pipeline CLI

After you have verified the installation of your Ruby environment, you're ready to install the AWS Data Pipeline CLI.

To install the AWS Data Pipeline CLI on Windows/Linux/Unix/Mac OS

1. Download datapipeline-cli.zip from https://s3.amazonaws.com/datapipeline-us-east-1/software/latest/DataPipelineCLI/.

2. Unzip the compressed file. For example, on Linux/Unix/Mac OS use the following command:

unzip datapipeline-cli.zip

This uncompresses the CLI and supporting code into a new directory called dp-cli.

3. If you add the new directory, dp-cli, to your PATH variable, you can use the CLI without specifying the complete path. In this guide, we assume that you've updated your PATH variable, or that you run the CLI from the directory where it is installed.
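For example, on Linux/Unix/Mac OS you might add a line like the following to your shell profile. This is a sketch; replace the path with the location where you unzipped the CLI.

export PATH=$PATH:/path/to/dp-cli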

Locate your AWS Credentials

When you create an AWS account, AWS assigns you an access key ID and a secret access key. AWS uses these credentials to identify you when you interact with a web service. You need these keys for the next step of the CLI installation process.

Note
Your secret access key is a shared secret between you and AWS. Keep this key secret; we use it to bill you for the AWS services that you use. Never include the key in your requests to AWS, and never email it to anyone, even if a request appears to originate from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your secret access key.

The following procedure explains how to locate your access key ID and secret access key in the AWS Management Console.

To view your AWS access credentials

1. Go to the Amazon Web Services website at http://aws.amazon.com.

2. Click My Account/Console, and then click Security Credentials.

3. Under Your Account, click Security Credentials.

4. In the spaces provided, type your user name and password, and then click Sign in using our secure server.

5. Under Access Credentials, on the Access Keys tab, your access key ID is displayed. To view your secret key, under Secret Access Key, click Show.

Make a note of your access key ID and your secret access key; you will use them in the next section.


Create a Credentials File

When you request services from AWS Data Pipeline, you must pass your credentials with the request so that AWS can properly authenticate and eventually bill you. The command line interface obtains your credentials from a JSON document called a credentials file, which is stored in your home directory (~/).

Using a credentials file is the simplest way to make your AWS credentials available to the AWS Data Pipeline CLI.

The credentials file contains the following name-value pairs.

comment
An optional comment within the credentials file.

access-id
The access key ID for your AWS account.

private-key
The secret access key for your AWS account.

endpoint
The endpoint for AWS Data Pipeline in the region where you are making requests.

log-uri
The location of the Amazon S3 bucket where AWS Data Pipeline writes log files.

In the following example credentials file, AKIAIOSFODNN7EXAMPLE represents an access key ID, and wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY represents the corresponding secret access key. The value of log-uri specifies the location of your Amazon S3 bucket and the path to the log files for actions performed by the AWS Data Pipeline web service on behalf of your pipeline.

{
  "access-id": "AKIAIOSFODNN7EXAMPLE",
  "private-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "endpoint": "datapipeline.us-east-1.amazonaws.com",
  "port": "443",
  "use-ssl": "true",
  "region": "us-east-1",
  "log-uri": "s3://myawsbucket/logfiles"
}

After you replace the values for the access-id, private-key, and log-uri fields with the appropriate information, save the file as credentials.json in your home directory (~/).

Verify the CLI

To verify that the command line interface (CLI) is installed, use the following command.

datapipeline --help

If the CLI is installed correctly, this command displays the list of commands for the CLI.


Deploy and Configure Task Runner

Task Runner is an application that polls AWS Data Pipeline for scheduled tasks and processes the tasks assigned to it by the web service, reporting status as it does so.

Depending on your application, you may choose to:

• Have AWS Data Pipeline install and manage one or more Task Runner applications for you on computational resources managed by the web service. In this case, you do not need to install or configure Task Runner.

• Manually install and configure Task Runner on a computational resource such as a long-running Amazon EC2 instance or a physical server. To do so, use the following procedures.

• Manually install and configure a custom task runner instead of Task Runner. The procedures for doing so depend on the implementation of the custom task runner.

For more information about Task Runner and when and where it should be configured, see Task Runner (p. 5).

Note
You can only install Task Runner on Linux, UNIX, or Mac OS. Task Runner is not supported on the Windows operating system.

Topics

• Install Java (p. 19)

• Install Task Runner (p. 235)

• Start Task Runner (p. 235)

• Verify Task Runner (p. 236)

Install Java

Task Runner requires Java version 1.6 or later. To determine whether Java is installed, and the version that is running, use the following command:

java -version

If you do not have Java 1.6 or later installed on your computer, you can download the latest version from http://www.oracle.com/technetwork/java/index.html.

Install Task Runner

To install Task Runner, download TaskRunner-1.0.jar from the Task Runner download location and copy it into a folder. Additionally, download mysql-connector-java-5.1.18-bin.jar from http://dev.mysql.com/usingmysql/java/ and copy it into the same folder where you install Task Runner.
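For example, on Linux/Unix/Mac OS the following commands are a sketch of this step; the folder name and download locations are illustrative:

mkdir ~/taskrunner
cp ~/Downloads/TaskRunner-1.0.jar ~/Downloads/mysql-connector-java-5.1.18-bin.jar ~/taskrunner/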

Start Task Runner

In a new command prompt window that is set to the directory where you installed Task Runner, start Task Runner with the following command. The --config option points to your credentials file. The --workerGroup option specifies the name of your worker group, which must be the same value as specified in your pipeline for tasks to be processed.


java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myWorkerGroup

When Task Runner is active, it prints the path to where log files are written in the terminal window. The following is an example.

Logging to /myComputerName/.../dist/output/logs

Warning
If you close the terminal window, or interrupt the command with CTRL+C, Task Runner stops, which halts the pipeline runs.

Verify Task Runner

The easiest way to verify that Task Runner is working is to check whether it is writing log files. The log files are stored in the directory where you started Task Runner.

When you check the logs, make sure that you are checking logs for the current date and time. Task Runner creates a new log file each hour, where the hour from midnight to 1am is 00. So the format of the log file name is TaskRunner.log.YYYY-MM-DD-HH, where HH runs from 00 to 23, in UTC. To save storage space, any log files older than eight hours are compressed with GZip.
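For example, to watch the log for the current hour from the directory where you started Task Runner, you could run a command like the following (a sketch; substitute the current UTC date and hour):

tail -f TaskRunner.log.2012-10-29-14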

Install the AWS SDK

The easiest way to write applications that interact with AWS Data Pipeline or to implement a custom task runner is to use one of the AWS SDKs. The AWS SDKs provide functionality that simplifies calling the web service APIs from your preferred programming environment.

For more information about the programming languages and development environments that have AWS SDK support, see the AWS SDK listings.

If you are not writing programs that interact with AWS Data Pipeline, you do not need to install any of the AWS SDKs. You can create and run pipelines using the console or command line interface.

This guide provides examples of programming AWS Data Pipeline using Java. The following are examples of how to download and install the AWS SDK for Java.

To install the AWS SDK for Java using Eclipse

• Install the AWS Toolkit for Eclipse.

Eclipse is a popular Java development environment. The AWS Toolkit for Eclipse installs the latest version of the AWS SDK for Java. From Eclipse, you can easily modify, build, and run any of the samples included in the SDK.

To install the AWS SDK for Java

• If you are using a Java development environment other than Eclipse, download and install the AWS SDK for Java.


Granting Permissions to Pipelines with IAM

In AWS Data Pipeline, IAM roles determine what your pipeline can access and the actions it can perform. Additionally, when your pipeline creates a resource, such as an Amazon EC2 instance, IAM roles determine the EC2 instance's permitted resources and actions. When you create a pipeline, you specify one IAM role that governs your pipeline and another IAM role to govern your pipeline's resources (referred to as a "resource role"); these can be the same role. Carefully consider the minimum permissions necessary for your pipeline to perform work and define the IAM roles accordingly. It is important to note that even a modest pipeline might need access to resources and actions in various areas of AWS, for example:

• Accessing files in Amazon S3

• Creating and managing Amazon EMR clusters

• Creating and managing Amazon EC2 instances

• Accessing data in Amazon RDS or Amazon DynamoDB

• Sending notifications using Amazon SNS

When you use the AWS Data Pipeline console, you can choose a pre-defined, default IAM role and resource role or create a new one to suit your needs. However, when using the AWS Data Pipeline CLI, you must create a new IAM role and apply a policy to it yourself, for which you can use the following example policy. For more information about how to create a new IAM role and apply a policy to it, see Managing IAM Policies in the Using IAM guide.

Warning
Carefully review and restrict permissions in the following example policy to only the resources that your pipeline requires.

{ "Statement": [ { "Action": [ "s3:*" ], "Effect": "Allow", "Resource": [ "*" ] }, { "Action": [ "ec2:DescribeInstances", "ec2:RunInstances", "ec2:StartInstances", "ec2:StopInstances", "ec2:TerminateInstances" ], "Effect": "Allow", "Resource": [ "*" ] }, { "Action": [ "elasticmapreduce:*" ],

API Version 2012-10-2921

AWS Data Pipeline Developer GuideGranting Permissions to Pipelines with IAM

"Effect": "Allow", "Resource": [ "*" ] }, { "Action": [ "dynamodb:*" ], "Effect": "Allow", "Resource": [ "*" ] }, { "Action": [ "rds:DescribeDBInstances", "rds:DescribeDBSecurityGroups" ], "Effect": "Allow", "Resource": [ "*" ] }, { "Action": [ "sns:GetTopicAttributes", "sns:ListTopics", "sns:Publish", "sns:Subscribe", "sns:Unsubscribe" ], "Effect": "Allow", "Resource": [ "*" ] }, { "Action": [ "iam:PassRole" ], "Effect": "Allow", "Resource": [ "*" ] },{ "Action": [ "datapipeline:*" ], "Effect": "Allow", "Resource": [ "*" ] } ]}


After you define a role and apply its policy, you define a trusted entities list, which indicates the entities or services that are permitted to use your new role. You can use the following IAM trust relationship definition to allow AWS Data Pipeline and Amazon EC2 to use your new pipeline and resource roles. For more information about editing IAM trust relationships, see Modifying a Role in the Using IAM guide.

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "ec2.amazonaws.com",
          "datapipeline.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Grant Amazon RDS Permissions to Task Runner

Amazon RDS allows you to control access to your DB Instances using database security groups (DB Security Groups). A DB Security Group acts like a firewall controlling network access to your DB Instance. By default, network access is turned off to your DB Instances. You must modify your DB Security Groups to let Task Runner access your Amazon RDS instances. Task Runner gains Amazon RDS access from the instance on which it runs, so the accounts and security groups that you add to your Amazon RDS instance depend on where you install Task Runner.

To grant permissions to Task Runner

1. Sign in to the AWS Management Console and open the Amazon RDS console.

2. In the Amazon RDS: My DB Security Groups pane, click your Amazon RDS instance. In the DB Security Group pane, under Connection Type, select EC2 Security Group. Configure the fields in the EC2 Security Group pane as described below:

For Task Runner running on an EC2 Resource,

• AWS Account Id: Your AccountId

EC2 Security Group: Your Security Group Name

For a Task Runner running on an EMR Resource,

• AWS Account Id: Your AccountId

EC2 Security Group: ElasticMapReduce-master

AWS Account Id: Your AccountId

EC2 Security Group: ElasticMapReduce-slave


For a Task Runner running in your local environment (on-premise),

• CIDR: The IP address range of your on-premise machine, or firewall if your on-premise computer is behind a firewall.

To allow connection from an RdsSqlPrecondition

• AWS Account Id: 793385162516

EC2 Security Group: DataPipeline


Tutorial: Copy CSV Data from Amazon S3 to Amazon S3

After you read What is AWS Data Pipeline? (p. 1) and decide you want to use AWS Data Pipeline to automate the movement and transformation of your data, it is time to get started with creating data pipelines. To help you make sense of how AWS Data Pipeline works, let's walk through a simple task.

This tutorial walks you through the process of creating a data pipeline to copy data from one Amazon S3 bucket to another and then send an Amazon SNS notification after the copy activity completes successfully. You use the Amazon EC2 instance resource managed by AWS Data Pipeline for this copy activity.

Important
This tutorial does not employ the Amazon S3 API for high speed data transfer between Amazon S3 buckets. It is intended only for demonstration purposes to help new customers understand how to create a simple pipeline and the related concepts. For advanced information about data transfer using Amazon S3, see Working with Buckets in the Amazon S3 Developer Guide.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each object. For more information, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity
The activity that AWS Data Pipeline performs for this pipeline.

This tutorial uses the CopyActivity object to copy CSV data from one Amazon S3 bucket to another.

Important
There are distinct limitations regarding the CSV file format with CopyActivity and S3DataNode. For more information, see CopyActivity (p. 180).

Schedule
The start date, time, and the recurrence for this activity. You can optionally specify the end date and time.

Resource
The resource that AWS Data Pipeline must use to perform this activity.

This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to copy data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.


DataNodes
Input and output nodes for this pipeline.

This tutorial uses S3DataNode for both input and output nodes.

Action
The action that AWS Data Pipeline must take when the specified conditions are met.

This tutorial uses the SnsAlarm action to send Amazon SNS notifications to the email address you specify after the task finishes successfully. You must subscribe to the Amazon SNS topic ARN to receive the notifications.

The following steps outline how to create a data pipeline to copy data from one Amazon S3 bucket to another Amazon S3 bucket.

1. Create your pipeline definition

2. Validate and save your pipeline definition

3. Activate your pipeline

4. Monitor the progress of your pipeline

5. [Optional] Delete your pipeline

Before You Begin...

Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

• Set up the AWS Data Pipeline tools and interface you plan on using. For more information on interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create an Amazon S3 bucket as a data source.

For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Upload your data to your Amazon S3 bucket.

For more information, see Add an Object to a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Create another Amazon S3 bucket as a data target.

• Create an Amazon SNS topic for sending email notification and make a note of the topic Amazon Resource Name (ARN). For more information, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.

• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your own IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).

Note
Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.


Using the AWS Data Pipeline Console

Topics

• Create and Configure the Pipeline Definition Objects (p. 27)

• Validate and Save Your Pipeline (p. 30)

• Verify your Pipeline Definition (p. 30)

• Activate your Pipeline (p. 31)

• Monitor the Progress of Your Pipeline Runs (p. 31)

• [Optional] Delete your Pipeline (p. 33)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.

To create your pipeline definition

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.

2. Click Create Pipeline.

3. On the Create a New Pipeline page:

a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).

b. In Pipeline Description, enter a description.

c. Leave the Select Schedule Type: button set to the default type for this tutorial.

Note
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

d. Leave the Role boxes set to their default values for this tutorial.

Note
If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.

e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.

1. On the Pipeline: name of your pipeline page, select Add activity.

2. In the Activities pane:

a. Enter the name of the activity; for example, copy-myS3-data.

b. In the Type box, select CopyActivity.

c. In the Input box, select Create new: DataNode.

d. In the Output box, select Create new: DataNode.

e. In the Schedule box, select Create new: Schedule.


f. In the Add an optional field ... box, select RunsOn.

g. In the Runs On box, select Create new: Resource.

h. In the Add an optional field... box, select On Success.

i. In the On Success box, select Create new: Action.

j. In the left pane, separate the icons by dragging them apart.

You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline uses to perform the copy activity.

The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connection between the various objects.

Next, configure the run date and time for your pipeline.

To configure run date and time for your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.

2. In the Schedules pane:

a. Enter a schedule name for this activity (for example, copy-myS3-data-schedule).

b. In the Type box, select Schedule.

c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.

Note
AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.

d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).

e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.

To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.
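Behind the scenes, the console saves these settings as a Schedule object in your pipeline definition, roughly like the following sketch. The field names are based on the Schedule object covered later in this guide; your name, start date, and period will differ.

{
  "id" : "copy-myS3-data-schedule",
  "type" : "Schedule",
  "startDateTime" : "2012-11-20T00:00:00",
  "period" : "1 day"
}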

Next, configure the input and the output data nodes for your pipeline.


To configure the input and output data nodes of your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.

2. In the DataNodes pane:

a. In the DefaultDataNode1 Name box, enter the name for your input node (for example, MyS3Input).

In this tutorial, your input node is the Amazon S3 data source bucket.

b. In the Type box, select S3DataNode.

c. In the Schedule box, select copy-myS3-data-schedule.

d. In the Add an optional field... box, select File Path.

e. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-input/name of your data file).

f. In the DefaultDataNode2 Name box, enter the name for your output node (for example, MyS3Output).

In this tutorial, your output node is the Amazon S3 data target bucket.

g. In the Type box, select S3DataNode.

h. In the Schedule box, select copy-myS3-data-schedule.

i. In the Add an optional field... box, select File Path.

j. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your data file).

Next, configure the resource AWS Data Pipeline must use to perform the copy activity.

To configure the resource

1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.

2. In the Resources pane:

a. In the Name box, enter the name for your resource (for example, CopyDataInstance).

b. In the Type box, select Ec2Resource.

c. In the Schedule box, select copy-myS3-data-schedule.

d. Leave the Role and Resource Role boxes set to their default values for this tutorial.

Note: If you have created your own IAM roles, you can select them now.

Next, configure the SNS notification action AWS Data Pipeline must perform after the copy activity finishes successfully.

To configure the SNS notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.

2. In the Others pane:

a. In the DefaultAction1Name box, enter the name for your Amazon SNS notification (for example, CopyDataNotice).

b. In the Type box, select SnsAlarm.


c. In the Topic Arn box, enter the ARN of your Amazon SNS topic.

d. In the Message box, enter the message content.

e. In the Subject box, enter the subject line for your notification.

f. Leave the Role box set to the default value for this tutorial.
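For reference, the same notification expressed in a JSON pipeline definition (as used in the CLI section of this guide) might look roughly like the following sketch. The topic ARN, role, and text values are placeholders, and the lowercase field names are assumptions based on the console labels above rather than values produced by these steps:

{
  "id": "CopyDataNotice",
  "type": "SnsAlarm",
  "topicArn": "arn:aws:sns:us-east-1:111122223333:my-example-topic",
  "subject": "Copy activity finished",
  "message": "The Amazon S3 copy completed successfully.",
  "role": "test-role"
}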

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you are getting the validation error message, you have to fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.

2. AWS Data Pipeline validates your pipeline definition and returns either success or the error message. If you get an error message, click Close and then, in the right pane, click Errors.

3. The Errors pane lists the objects failing validation. Click the plus (+) sign next to the object names and look for an error message in red.

4. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

5. After you've fixed the errors listed in the Errors pane, click Save Pipeline.

6. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify your Pipeline Definition
It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.

2. On the List Pipelines page, check if your newly-created pipeline is listed.

AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition.

The Status column in the row listing your pipeline should show PENDING.

3. Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.

4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.


5. Click Close.

Next, activate your pipeline.

Activate your Pipeline
You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.

2. In the Pipeline: name of your pipeline page, click Activate.

A confirmation dialog box opens, confirming the activation.

3. Click Close.

Next, verify if your pipeline is running.

Monitor the Progress of Your Pipeline Runs
To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.


2. The Instance details: name of your pipeline page lists the status of each instance.

Note: If you do not see runs listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task, sent to the account you specified for receiving your Amazon SNS notification.

You can also check your Amazon S3 data target bucket to verify if the data was copied.

4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.

a. To troubleshoot failed or incomplete instance runs, click the triangle next to an instance. The Instance summary panel opens to show the details of the selected instance.

b. In the Instance summary pane, click View instance fields to see details of the fields associated with the selected instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example, @failureReason = Resource not healthy terminated.

c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.

d. In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt.

5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.

For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).


Important: Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline
Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. In the List Pipelines page, click the check box next to your pipeline.

2. Click Delete.

3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface
Topics

• Define a Pipeline in JSON Format (p. 33)

• Upload the Pipeline Definition (p. 38)

• Activate the Pipeline (p. 39)

• Verify the Pipeline Status (p. 39)

The following topics explain how to use the AWS Data Pipeline CLI to create and use pipelines to copy data from one Amazon S3 bucket to another. In this example, we perform the following steps:

• Create a pipeline definition using the CLI in JSON format

• Create the necessary IAM roles and define a policy and trust relationships

• Upload the pipeline definition using the AWS Data Pipeline CLI tools

• Monitor the progress of the pipeline

Define a Pipeline in JSON Format
This example scenario shows how to use JSON pipeline definitions and the AWS Data Pipeline CLI to schedule copying data between two Amazon S3 buckets at a specific time interval. This is the full pipeline definition JSON file followed by an explanation for each of its sections.

Note: We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{ "objects": [

API Version 2012-10-2933

AWS Data Pipeline Developer Guide[Optional] Delete your Pipeline

{ "id": "MySchedule", "type": "Schedule", "startDateTime": "2012-11-25T00:00:00", "endDateTime": "2012-11-26T00:00:00", "period": "1 day" }, { "id": "S3Input", "type": "S3DataNode", "schedule": { "ref": "MySchedule" }, "filePath": "s3://testbucket/file.txt" }, { "id": "S3Output", "type": "S3DataNode", "schedule": { "ref": "MySchedule" }, "filePath": "s3://testbucket/file-copy.txt" }, { "id": "MyEC2Resource", "type": "Ec2Resource", "schedule": { "ref": "MySchedule" }, "actionOnTaskFailure": "terminate", "actionOnResourceFailure": "retryAll", "maximumRetries": "1", "role": "test-role", "resourceRole": "test-role", "instanceType": "m1.medium", "instanceCount": "1", "securityGroups": [ "test-group", "default" ], "keyPair": "test-pair" }, { "id": "MyCopyActivity", "type": "CopyActivity", "runsOn": { "ref": "MyEC2Resource" }, "input": { "ref": "S3Input" }, "output": { "ref": "S3Output" }, "schedule": { "ref": "MySchedule" } }

API Version 2012-10-2934

AWS Data Pipeline Developer GuideDefine a Pipeline in JSON Format

]}

Schedule
The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule, and you may have more than one. The Schedule component is defined by the following fields:

{ "id": "MySchedule", "type": "Schedule", "startDateTime": "2012-11-22T00:00:00", "endDateTime":"2012-11-23T00:00:00", "period": "1 day"},

Note: In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name
The user-defined name for the pipeline schedule, which is a label for your reference only.

Type
The pipeline component type, which is Schedule.

startDateTime
The date/time (in UTC format) that you want the task to begin.

endDateTime
The date/time (in UTC format) that you want the task to stop.

period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation can only run one time.
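To make the effect of period concrete, the following is a minimal sketch (not part of this tutorial's definition) that keeps the same one-day window but shortens the period. Because "6 hours" evenly divides the window, the copy operation would be scheduled four times instead of once:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-25T00:00:00",
  "endDateTime": "2012-11-26T00:00:00",
  "period": "6 hours"
}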

Amazon S3 Data Nodes
Next, the input S3DataNode pipeline component defines a location for the input files; in this case, an Amazon S3 bucket location. The input S3DataNode component is defined by the following fields:

{ "id" : "S3Input", "type" : "S3DataNode", "schedule" : {"ref" : "MySchedule"}, "filePath" : "s3://testbucket/file.txt", "schedule": { "ref": "MySchedule" }},

Name
The user-defined name for the input location (a label for your reference only).

Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.


Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".

Path
The path to the data associated with the data node. The syntax for a data node is determined by its type. For example, a data node for a database follows a different syntax that is appropriate for a database table.
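In this tutorial the path identifies a single object through filePath. As a rough sketch of the folder-level alternative, an S3DataNode can instead use the directoryPath field; this is not used in this tutorial, and the bucket and prefix below are placeholders:

{
  "id": "S3InputFolder",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "directoryPath": "s3://testbucket/input/"
}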

Next, the output S3DataNode component defines the output destination location for the data. It follows the same format as the input S3DataNode component, except for the name of the component and a different path to indicate the target file.

{ "id" : "S3Output", "type" : "S3DataNode", "schedule" : {"ref" : "MySchedule"}, "filePath" : "s3://testbucket/file-copy.txt", "schedule": { "ref": "MySchedule" }},

Resource
This is a definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the resource after the task completes. The fields defined here control the creation and function of the Amazon EC2 instance that does the work. The Ec2Resource component is defined by the following fields:

{ "id": "MyEC2Resource", "type": "Ec2Resource", "schedule": {"ref": "MySchedule"}, "actionOnTaskFailure": "terminate", "actionOnResourceFailure": "retryAll", "maximumRetries": "1", "role" : "test-role", "resourceRole": "test-role", "instanceType": "m1.medium", "instanceCount": "1", "securityGroups": [ "test-group", "default" ], "keyPair": "test-pair"},

Name
The user-defined name for the resource, which is a label for your reference only.

Type
The type of computational resource to perform work; in this case, an Amazon EC2 instance. There are other resource types available, such as an EmrCluster type.

Schedule
The schedule on which to create this computational resource.

actionOnTaskFailure
The action to take on the Amazon EC2 instance if the task fails. In this case, the instance should terminate so that each failed attempt to copy data does not leave behind an idle, abandoned Amazon EC2 instance with no work to perform; such instances require manual termination by an administrator.


actionOnResourceFailure
The action to perform if the resource is not created successfully. In this case, retry the creation of an Amazon EC2 instance until it is successful.

maximumRetries
The number of times to retry the creation of this computational resource before marking the resource as a failure and stopping any further creation attempts. This setting works in conjunction with the actionOnResourceFailure field.

Role
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.

resourceRole
The IAM role of the account that creates resources, such as creating and configuring an Amazon EC2 instance on your behalf. Role and ResourceRole can be the same role, but specifying them separately provides greater granularity in your security configuration.

instanceType
The size of the Amazon EC2 instance to create. Ensure that you set the size of EC2 instance that best matches the load of the work that you want to perform with AWS Data Pipeline. In this case, we set an m1.medium EC2 instance. For more information about the different instance types and when to use each one, see the Amazon EC2 Instance Types topic at http://aws.amazon.com/ec2/instance-types/.

instanceCount
The number of Amazon EC2 instances in the computational resource pool to service any pipeline components depending on this resource.

securityGroups
The Amazon EC2 security groups to grant access to this EC2 instance. As the example shows, you can provide more than one group at a time (test-group and default).

keyPair
The name of the SSH public/private key pair to log in to the Amazon EC2 instance. For more information, see Amazon EC2 Key Pairs.

Activity
The last section in the JSON file is the definition of the activity that represents the work to perform. This example uses CopyActivity to copy data from a file in an Amazon S3 bucket to another file. The CopyActivity component is defined by the following fields:

{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "runsOn": { "ref": "MyEC2Resource" },
  "input": { "ref": "S3Input" },
  "output": { "ref": "S3Output" },
  "schedule": { "ref": "MySchedule" }
}

Name
The user-defined name for the activity, which is a label for your reference only.

Type
The type of activity to perform; in this case, CopyActivity.

runsOn
The computational resource that performs the work that this activity defines. In this example, we provide a reference to the Amazon EC2 instance defined previously. Using the runsOn field causes AWS Data Pipeline to create the EC2 instance for you. The runsOn field indicates that the resource exists in the AWS infrastructure, while the workerGroup value indicates that you want to use your own on-premises resources to perform the work (see the sketch after this list).

Schedule
The schedule on which to run this activity.

Input
The location of the data to copy.

Output
The target location for the data.
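As a rough illustration of the runsOn/workerGroup distinction, the following sketch replaces runsOn with workerGroup. The group name is a placeholder, and the sketch assumes you have registered your own Task Runner with that worker group, which is not part of this tutorial:

{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "workerGroup": "my-onpremises-group",
  "input": { "ref": "S3Input" },
  "output": { "ref": "S3Output" },
  "schedule": { "ref": "MySchedule" }
}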

Upload the Pipeline Definition
You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows:

ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline definition pipeline_file.json uploaded.

Note: For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE.


Activate the Pipeline
You must activate the pipeline, by using the --activate command-line parameter, before it begins performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

Verify the Pipeline Status
View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.

Note: It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note: AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status. Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.


Tutorial: Copy Data From a MySQL Table to Amazon S3

Topics

• Before You Begin ... (p. 41)

• Using the AWS Data Pipeline Console (p. 42)

• Using the Command Line Interface (p. 48)

This tutorial walks you through the process of creating a data pipeline to copy data (rows) from a table in a MySQL database to a CSV (comma-separated values) file in an Amazon S3 bucket and then send an Amazon SNS notification after the copy activity completes successfully. You will use the Amazon EC2 instance resource provided by AWS Data Pipeline for this copy activity.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object. For more information on pipeline definitions, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity
The activity that AWS Data Pipeline must perform for this pipeline.

This tutorial uses the CopyActivity to copy data from a MySQL table to an Amazon S3 bucket.

Schedule
The start date, time, and the duration for this activity. You can optionally specify the end date and time.

Resource
The resource that AWS Data Pipeline must use to perform this activity.

This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to copy data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.

DataNodes
The input and output nodes for this pipeline.

This tutorial uses MySQLDataNode for source data and S3DataNode for target data.


Action
The action that AWS Data Pipeline must take when the specified conditions are met.

This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify, after the task finishes successfully.

For information about the additional objects and fields supported by the copy activity, see CopyActivity (p. 180).

The following steps outline how to create a data pipeline to copy data from a MySQL table to an Amazon S3 bucket.

1. Create your pipeline definition

2. Create and configure the pipeline definition objects

3. Validate and save your pipeline definition

4. Verify that your pipeline definition is saved

5. Activate your pipeline

6. Monitor the progress of your pipeline

7. [Optional] Delete your pipeline

Before You Begin...
Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

Set up the AWS Data Pipeline tools and interface you plan on using. For more information on interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create an Amazon S3 bucket as a data source.

For more information, see Create a Bucket in Amazon Simple Storage Service Getting Started Guide.

• Create and launch a MySQL database instance as a data source.

For more information, see Launch a DB Instance in the Amazon Relational Database Service (RDS) Getting Started Guide.

Note: Make a note of the user name and the password you used for creating the MySQL instance. After you've launched your MySQL database instance, make a note of the instance's endpoint. You will need all this information in this tutorial.

• Connect to your MySQL database instance, create a table, and then add test data values to the newly created table.

For more information, go to Create a Table in the MySQL documentation.

• Create an Amazon SNS topic for sending email notification and make a note of the topic Amazon Resource Name (ARN). For more information, go to Create a Topic in the Amazon Simple Notification Service Getting Started Guide.

• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).


Note: Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.

Using the AWS Data Pipeline ConsoleTopics

• Create and Configure the Pipeline Definition Objects (p. 42)

• Validate and Save Your Pipeline (p. 45)

• Verify Your Pipeline Definition (p. 45)

• Activate your Pipeline (p. 46)

• Monitor the Progress of Your Pipeline Runs (p. 47)

• [Optional] Delete your Pipeline (p. 48)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.

To create your pipeline definition

1. Sign in to the AWS Management Console and open the AWS Data Pipeline Console.

2. Click Create Pipeline.

3. On the Create a New Pipeline page:

a. In the Pipeline Name box, enter a name (for example, CopyMySQLData).

b. In Pipeline Description, enter a description.

c. Leave the Select Schedule Type: button set to the default type for this tutorial.

Note: Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

d. Leave the Role boxes set to their default values for this tutorial.

Note: If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.

e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects
Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.

1. On the Pipeline: name of your pipeline page, click Add activity.

2. In the Activities pane:

a. Enter the name of the activity; for example, copy-mysql-data.


b. In the Type box, select CopyActivity.

c. In the Input box, select Create new: DataNode.

d. In the Schedule box, select Create new: Schedule.

e. In the Output box, select Create new: DataNode.

f. In the Add an optional field... box, select RunsOn.

g. In the Runs On box, select Create new: Resource.

h. In the Add an optional field... box, select On Success.

i. In the On Success box, select Create new: Action.

j. In the left pane, separate the icons by dragging them apart.

You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline will use to perform the copy activity.

The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connection between the various objects.

Next, configure the run date and time for your pipeline.

To configure run date and time for your pipeline,

1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.

2. In the Schedules pane:

a. Enter a schedule name for this activity (for example, copy-mysql-data-schedule).

b. In the Type box, select Schedule.

c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.

Note: AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.

d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).

e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.


To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline then starts launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.

Next, configure the input and the output data nodes for your pipeline.

To configure the input and output data nodes of your pipeline,

1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.

2. In the DataNodes pane:

a. In the DefaultDataNode1 Name box, enter the name for your input node (for example, MySQLInput).

In this tutorial, your input node is the Amazon RDS MySQL instance you just created.

b. In the Type box, select MySQLDataNode.

c. In the Username box, enter the user name you used when you created your MySQL database instance.

d. In the Connection String box, enter the endpoint of your MySQL database instance (for example, mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com).

e. In the *Password box, enter the password you used when you created your MySQL database instance.

f. In the Table box, enter the name of the source MySQL database table (for example, input-table).

g. In the Schedule box, select copy-mysql-data-schedule.

h. In the DefaultDataNode2 Name box, enter the name for your output node (for example, MyS3Output).

In this tutorial, your output node is the Amazon S3 data target bucket.

i. In the Type box, select S3DataNode.

j. In the Schedule box, select copy-mysql-data-schedule.

k. In the Add an optional field... box, select File Path.

l. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your csv file).

Next, configure the resource AWS Data Pipeline must use to perform the copy activity.

To configure the resource,

1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.

2. In the Resources pane:

a. In the Name box, enter the name for your resource (for example, CopyDataInstance).

b. In the Type box, select Ec2Resource.

c. In the Schedule box, select copy-mysql-data-schedule.

Next, configure the SNS notification action AWS Data Pipeline must perform after the copy activity finishes successfully.


To configure the SNS notification action,

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.

2. In the Others pane:

a. In the DefaultAction1Name box, enter the name for your Amazon SNS notification (for example, CopyDataNotice).

b. In the Type box, select SnsAlarm.

c. In the Message box, enter the message content.

d. Leave the entry in the Role box set to the default value.

e. In the Topic Arn box, enter the ARN of your Amazon SNS topic.

f. In the Subject box, enter the subject line for your notification.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline
You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline,

1. On the Pipeline: name of your pipeline page, click Save Pipeline.

2. AWS Data Pipeline validates your pipeline definition and returns either success or the error message.

3. If you get an error message, click Close and then, in the right pane, click Errors.

4. The Errors pane lists the objects failing validation.

Click the plus (+) sign next to the object names and look for an error message in red.

5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.

7. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify Your Pipeline Definition
It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition,

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.

2. On the List Pipelines page, check if your newly-created pipeline is listed.

AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition.


The Status column in the row listing your pipeline should show PENDING.

3. Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0's at this point.

4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.

5. Click Close.

Next, activate your pipeline.

Activate your Pipeline
You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline,

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.

2. In the Pipeline: name of your pipeline page, click Activate.

A confirmation dialog box opens, confirming the activation.

3. Click Close.

Next, verify that your pipeline is running.


Monitor the Progress of Your Pipeline Runs
To monitor the progress of your pipeline,

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.

2. The Instance details: name of your pipeline page lists the status of each instance.

Note: If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3. If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the copy activity. You should receive an email about the successful completion of this task, sent to the account you specified for receiving your Amazon SNS notification.

You can also check your Amazon S3 data target bucket to verify if the data was copied.

4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.

a. To troubleshoot failed or incomplete instance runs, click the triangle next to an instance. The Instance summary panel opens to show the details of the selected instance.

b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the additional details box will have an entry indicating the reason for failure; for example, @failureReason = Resource not healthy terminated.

c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.

d. In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt.

5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.


For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important: Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline
Deleting your pipeline will delete the pipeline definition, including all the associated objects. You will stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline,

1. In the List Pipelines page, click the check box next to your pipeline.

2. Click Delete.

3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface
Topics

• Define a Pipeline in JSON Format (p. 49)

• Schedule (p. 50)

• MySQL Data Node (p. 51)

• Amazon S3 Data Node (p. 51)

• Resource (p. 52)

• Activity (p. 53)

• Upload the Pipeline Definition (p. 54)

• Activate the Pipeline (p. 54)

• Verify the Pipeline Status (p. 55)

The following topics explain how to use the AWS Data Pipeline CLI to create a pipeline to copy data from a MySQL table to a file in an Amazon S3 bucket. In this example, we perform the following steps:

• Create a pipeline definition using the CLI in JSON format

• Create the necessary IAM roles and define a policy and trust relationships

• Upload the pipeline definition using the AWS Data Pipeline CLI tools

• Monitor the progress of the pipeline

To complete the steps in this example, you need a MySQL database instance with a table that contains data. To create a MySQL database using Amazon RDS, see Get Started with Amazon RDS - http://docs.aws.amazon.com/AmazonRDS/latest/GettingStartedGuide/Welcome.html. After you have an Amazon RDS instance, see the MySQL documentation to Create a Table - http://dev.mysql.com/doc/refman/5.5/en//creating-tables.html.

Define a Pipeline in JSON Format
This example scenario shows how to use JSON pipeline definitions and the AWS Data Pipeline CLI to copy data (rows) from a table in a MySQL database to a CSV (comma-separated values) file in an Amazon S3 bucket at a specified time interval. This is the full pipeline definition JSON file followed by an explanation for each of its sections.

Note: We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{ "objects": [ { "id": "MySchedule", "type": "Schedule", "startDateTime": "2012-11-25T00:00:00", "endDateTime": "2012-11-26T00:00:00", "period": "1 day" }, { "id": "MySQLInput", "type": "MySqlDataNode", "schedule": { "ref": "MySchedule" }, "table": "table_name", "username": "user_name", "*password": "my_password", "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name", "selectQuery": "select * from #{table}" }, { "id": "S3Output", "type": "S3DataNode", "filePath": "s3://testbucket/output/output_file.csv", "schedule": { "ref": "MySchedule" } }, { "id": "MyEC2Resource", "type": "Ec2Resource", "schedule": { "ref": "MySchedule" }, "actionOnTaskFailure": "terminate", "actionOnResourceFailure": "retryAll", "maximumRetries": "1", "role": "test-role", "resourceRole": "test-role", "instanceType": "m1.medium", "instanceCount": "1",

API Version 2012-10-2949

AWS Data Pipeline Developer GuideDefine a Pipeline in JSON Format

"securityGroups": [ "test-group", "default" ], "keyPair": "test-pair" }, { "id": "MyCopyActivity", "type": "CopyActivity", "runsOn": { "ref": "MyEC2Resource" }, "input": { "ref": "MySQLInput" }, "output": { "ref": "S3Output" }, "schedule": { "ref": "MySchedule" } } ]}

Schedule
The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule, and you may have more than one. The Schedule component is defined by the following fields:

{ "id": "MySchedule", "type": "Schedule", "startDateTime": "2012-11-22T00:00:00", "endDateTime":"2012-11-23T00:00:00", "period": "1 day"},

Note: In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name
The user-defined name for the pipeline schedule, which is a label for your reference only.

Type
The pipeline component type, which is Schedule.

startDateTime
The date/time (in UTC format) that you want the task to begin.

endDateTime
The date/time (in UTC format) that you want the task to stop.

period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to 1 day so that the pipeline copy operation can only run one time.


MySQL Data Node
Next, the input MySqlDataNode pipeline component defines a location for the input data; in this case, an Amazon RDS instance. The input MySqlDataNode component is defined by the following fields:

{
  "id": "MySQLInput",
  "type": "MySqlDataNode",
  "schedule": { "ref": "MySchedule" },
  "table": "table_name",
  "username": "user_name",
  "*password": "my_password",
  "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
  "selectQuery": "select * from #{table}"
},

Name
The user-defined name for the MySQL database, which is a label for your reference only.

Type
The MySqlDataNode type, which is an Amazon RDS instance using MySQL in this example.

Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".

Table
The name of the database table that contains the data to copy. Replace table_name with the name of your database table.

Username
The user name of the database account that has sufficient permission to retrieve data from the database table. Replace user_name with the name of your user account.

Password
The password for the database account, with the asterisk prefix to indicate that AWS Data Pipeline must encrypt the password value. Replace my_password with the correct password for your user account.

connectionString
The JDBC connection string for the CopyActivity object to connect to the database.

selectQuery
A valid SQL SELECT query that specifies which data to copy from the database table. Note that #{table} is a variable that re-uses the table name provided by the "table" variable in the preceding lines of the JSON file.
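If you need to copy only a subset of rows, you can change the query while still using the #{table} expression. The following is a sketch only; the status column and its value are hypothetical and depend on your table's schema:

{
  "id": "MySQLInput",
  "type": "MySqlDataNode",
  "schedule": { "ref": "MySchedule" },
  "table": "table_name",
  "username": "user_name",
  "*password": "my_password",
  "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
  "selectQuery": "select * from #{table} where status = 'active'"
}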

Amazon S3 Data Node
Next, the S3Output pipeline component defines a location for the output file; in this case, a CSV file in an Amazon S3 bucket location. The output S3DataNode component is defined by the following fields:

{
  "id": "S3Output",
  "type": "S3DataNode",
  "filePath": "s3://testbucket/output/output_file.csv",
  "schedule": { "ref": "MySchedule" }
},


Name
The user-defined name for the output location (a label for your reference only).

Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.

Path
The path to the data associated with the data node. The syntax for a data node is determined by its type. For example, a data node for a database follows a different syntax that is appropriate for a database table.

Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file, labeled "MySchedule".

Resource
This is a definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the resource after the task completes. The fields defined here control the creation and function of the Amazon EC2 instance that does the work. The Ec2Resource component is defined by the following fields:

{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "schedule": { "ref": "MySchedule" },
  "actionOnTaskFailure": "terminate",
  "actionOnResourceFailure": "retryAll",
  "maximumRetries": "1",
  "role": "test-role",
  "resourceRole": "test-role",
  "instanceType": "m1.medium",
  "instanceCount": "1",
  "securityGroups": [ "test-group", "default" ],
  "keyPair": "test-pair"
},

Name
The user-defined name for the resource, which is a label for your reference only.

Type
The type of computational resource to perform work; in this case, an Amazon EC2 instance. There are other resource types available, such as an EmrCluster type.

Schedule
The schedule on which to create this computational resource.

actionOnTaskFailure
The action to take on the Amazon EC2 instance if the task fails. In this case, the instance should terminate so that each failed attempt to copy data does not leave behind an idle, abandoned Amazon EC2 instance with no work to perform; such instances require manual termination by an administrator.

actionOnResourceFailure
The action to perform if the resource is not created successfully. In this case, retry the creation of an Amazon EC2 instance until it is successful.

maximumRetries
The number of times to retry the creation of this computational resource before marking the resource as a failure and stopping any further creation attempts. This setting works in conjunction with the actionOnResourceFailure field.


Role
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.

resourceRole
The IAM role of the account that creates resources, such as creating and configuring an Amazon EC2 instance on your behalf. Role and ResourceRole can be the same role, but specifying them separately provides greater granularity in your security configuration.

instanceType
The size of the Amazon EC2 instance to create. Ensure that you set the size of EC2 instance that best matches the load of the work that you want to perform with AWS Data Pipeline. In this case, we set an m1.medium EC2 instance. For more information about the different instance types and when to use each one, see the Amazon EC2 Instance Types topic at http://aws.amazon.com/ec2/instance-types/.

instanceCount
The number of Amazon EC2 instances in the computational resource pool to service any pipeline components depending on this resource.

securityGroups
The Amazon EC2 security groups to grant access to this EC2 instance. As the example shows, you can provide more than one group at a time (test-group and default).

keyPair
The name of the SSH public/private key pair to log in to the Amazon EC2 instance. For more information, see Amazon EC2 Key Pairs.

Activity
The last section in the JSON file is the definition of the activity that represents the work to perform. In this case, we use a CopyActivity component to copy data from the MySQL table to a file in an Amazon S3 bucket. The CopyActivity component is defined by the following fields:

{
  "id": "MyCopyActivity",
  "type": "CopyActivity",
  "runsOn": { "ref": "MyEC2Resource" },
  "input": { "ref": "MySQLInput" },
  "output": { "ref": "S3Output" },
  "schedule": { "ref": "MySchedule" }
}

Name
The user-defined name for the activity, which is a label for your reference only.

Type
The type of activity to perform; in this case, CopyActivity.

runsOn
The computational resource that performs the work that this activity defines. In this example, we provide a reference to the EC2 instance defined previously. Using the runsOn field causes AWS Data Pipeline to create the EC2 instance for you. The runsOn field indicates that the resource exists in the AWS infrastructure, while the workerGroup value indicates that you want to use your own on-premises resources to perform the work.

Schedule
The schedule on which to run this activity.

Input
The location of the data to copy.


Output
The target location for the data.

Upload the Pipeline Definition
You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows:

ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline definition pipeline_file.json uploaded.

Note: For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE.

Activate the Pipeline
You must activate the pipeline, by using the --activate command-line parameter, before it begins performing work. Use the following command.

On Linux/Unix/Mac OS:


./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

Verify the Pipeline Status
View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.

Note: It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note: AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status. Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.


Tutorial: Launch an Amazon EMR Job Flow

If you regularly run an Amazon EMR job flow, such as to analyze web logs or perform analysis of scientific data, you can use AWS Data Pipeline to manage your Amazon EMR job flows. With AWS Data Pipeline you can specify preconditions that must be met before the job flow is launched (for example, ensuring that today's data has been uploaded to Amazon S3), a schedule for repeatedly running the job flow, and the cluster configuration to use for the job flow. The following tutorial walks you through launching a simple job flow as an example. This can be used as a model for a simple Amazon EMR-based pipeline, or as part of a more involved pipeline.
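As a rough illustration of such a precondition (not used in this tutorial), a pipeline can declare an S3KeyExists precondition and attach it to the activity through the precondition field. The bucket, key, and IDs below are placeholders, and the exact field names should be checked against the precondition reference topics:

{
  "id": "TodaysDataExists",
  "type": "S3KeyExists",
  "s3Key": "s3://my-log-bucket/logs/today.gz"
},
{
  "id": "MyEmrActivity",
  "type": "EmrActivity",
  "precondition": { "ref": "TodaysDataExists" },
  "runsOn": { "ref": "MyEmrCluster" },
  "schedule": { "ref": "MySchedule" }
}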

This tutorial walks you through the process of creating a data pipeline for a simple Amazon EMR job flow to run a pre-existing Hadoop Streaming job provided by Amazon EMR, and then send an Amazon SNS notification after the task completes successfully. You will use the Amazon EMR cluster resource provided by AWS Data Pipeline for this task. This sample application is called WordCount, and can also be run manually from the Amazon EMR console.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object. For more information on pipeline definitions, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity
The activity that AWS Data Pipeline must perform for this pipeline.

This tutorial uses the EmrActivity to run a pre-existing Hadoop Streaming job provided by Amazon EMR.

Schedule
The start date, time, and the duration for this activity. You can optionally specify the end date and time.

Resource
The resource that AWS Data Pipeline must use to perform this activity.

This tutorial uses EmrCluster, a set of Amazon EC2 instances provided by AWS Data Pipeline, to run the job flow. AWS Data Pipeline automatically launches the Amazon EMR cluster and then terminates the cluster after the task finishes.

Action
The action that AWS Data Pipeline must take when the specified conditions are met.


This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify, after the task finishes successfully.

For more information about the additional objects and fields supported by the Amazon EMR activity, see EmrCluster (p. 209).

The following steps outline how to create a data pipeline to launch an Amazon EMR job flow.

1. Create your pipeline definition

2. Create and configure the pipeline definition objects

3. Validate and save your pipeline definition

4. Verify that your pipeline definition is saved

5. Activate your pipeline

6. Monitor the progress of your pipeline

7. [Optional] Delete your pipeline

Before You Begin...
Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

Set up the AWS Data Pipeline tools and interface you plan on using. For more information about interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create an Amazon SNS topic for sending email notification and make a note of the topic Amazon Resource Name (ARN). For more information, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.

Note: Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.

Using the AWS Data Pipeline ConsoleTopics

• Create and Configure the Pipeline Definition Objects (p. 58)

• Validate and Save Your Pipeline (p. 60)

• Verify Your Pipeline Definition (p. 60)

• Activate your Pipeline (p. 61)

• Monitor the Progress of Your Pipeline Runs (p. 61)

• [Optional] Delete your Pipeline (p. 63)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.


To create your pipeline definition

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.

2. Click Create Pipeline.

3. On the Create a New Pipeline page:

a. In the Pipeline Name box, enter a name (for example, MyEmrJob).

b. In Pipeline Description, enter a description.

c. Leave the Select Schedule Type: button set to the default type for this tutorial.

Note
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

d. Leave the Role boxes set to their default values for this tutorial.

e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.

1. On the Pipeline: name of your pipeline page, select Add activity.

2. In the Activities pane:

a. Enter the name of the activity; for example, my-emr-job

b. In the Type box, select EmrActivity.

c. In the Step box, enter the following value:

/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myawsbucket/wordcount/output/#{@scheduledStartTime},-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate

d. In the Schedule box, select Create new: Schedule.

e. In the Add an optional field .. box, select Runs On.

f. In the Runs On box, select Create new: EmrCluster.

g. In the Add an optional field .. box, select On Success.

h. In the On Success box, select Create new: Action.

You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline will use to launch an Amazon EMR job flow.
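Although the console builds the pipeline definition for you, it can help to see roughly what these choices translate to in a pipeline definition file. A minimal sketch of the resulting EmrActivity, assuming the object names used in this tutorial (the schedule, cluster, and notification objects are the ones you create in the following steps):

{
  "id": "my-emr-job",
  "type": "EmrActivity",
  "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myawsbucket/wordcount/output/#{@scheduledStartTime},-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate",
  "schedule": { "ref": "my-emr-job-schedule" },
  "runsOn": { "ref": "MyEmrCluster" },
  "onSuccess": { "ref": "EmrJobNotice" }
}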

The Pipeline: name of your pipeline pane shows a single activity icon for this pipeline.

Next, configure run date and time for your pipeline.


To configure run date and time for your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.

2. In the Schedules pane:

a. Enter a schedule name for this activity (for example, my-emr-job-schedule).

b. In the Type box, select Schedule.

c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.

Note
AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.

d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).

e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.

To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline will then start launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.
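In a pipeline definition file, this amounts to a Schedule object whose startDateTime is already in the past. A sketch, assuming today is 2012-11-20 (the dates are illustrative):

{
  "id": "my-emr-job-schedule",
  "type": "Schedule",
  "startDateTime": "2012-11-19T00:00:00",
  "endDateTime": "2012-11-21T00:00:00",
  "period": "1 day"
}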

Next, configure the resource AWS Data Pipeline must use to perform the Amazon EMR job.

To configure the resource

1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.

2. In the Resources pane:

a. In the Name box, enter the name for your EMR cluster (for example, MyEmrCluster).

b. Leave the Type box set to the default value.

c. In the Schedule box, select my-emr-job-schedule.

Next, configure the SNS notification action AWS Data Pipeline must perform after the Amazon EMR job finishes successfully.

To configure the SNS notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.

2. In the Others pane:

a. In the DefaultAction1Name box, enter the name for your Amazon SNS notification (for example, EmrJobNotice).

b. In the Type box, select SnsAlarm.

c. In the Message box, enter the message content.

d. Leave the entry in the Role box set to default.

e. In the Subject box, enter the subject line for your notification.

f. In the Topic Arn box, enter the ARN of your Amazon SNS topic.


You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you still get a validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.

2. AWS Data Pipeline validates your pipeline definition and returns either success or the error message.

3. If you get an error message, click Close and then, in the right pane, click Errors.

4. The Errors pane lists the objects failing validation.

Click the plus (+) sign next to the object names and look for an error message in red.

5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the Schedules object, click the Schedules pane to fix the error.

6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.

7. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify Your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.

2. On the List Pipelines page, check if your newly-created pipeline is listed.

AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition.

The Status column in the row listing your pipeline should show PENDING.

3. Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.

4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.


5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.

2. In the Pipeline: name of your pipeline page, click Activate.

A confirmation dialog box opens to confirm the activation.

3. Click Close.

Next, verify that your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.


Note
You can also view the job flows in the Amazon EMR console. The job flows spawned by AWS Data Pipeline on your behalf are displayed in the Amazon EMR console and billed to your AWS Account in the same manner as job flows that you launch manually. You can tell which job flows were spawned by AWS Data Pipeline by looking at the name of the job flow. Those spawned by AWS Data Pipeline have a name formatted as follows: job-flow-identifier_@emr-cluster-name_launch-time. For more information, see View Job Flow Details in the Amazon Elastic MapReduce Developer Guide.

2. The Instance details: name of your pipeline page lists the status of each instance in your pipeline definition.

Note
If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the activity. You should receive an email about the successful completion of this task at the email address you specified for receiving your Amazon SNS notification.

4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.

a. To troubleshoot the failed or the incomplete runs:

Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.

b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure. For example, @failureReason = Resource not healthy terminated.

c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.

d. In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt.

5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.


For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. In the List Pipelines page, click the check box next to your pipeline.

2. Click Delete.

3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

If you regularly run an Amazon EMR job flow to analyze web logs or perform analysis of scientific data, you can use AWS Data Pipeline to manage your Amazon EMR job flows. With AWS Data Pipeline, you can specify preconditions that must be met before the job flow is launched (for example, ensuring that today's data has been uploaded to Amazon S3). The following tutorial walks you through launching a job flow that can serve as a model for a simple Amazon EMR-based pipeline, or as part of a more involved pipeline.

The following code is the pipeline definition file for a simple Amazon EMR job flow that runs a pre-existing Hadoop streaming job provided by Amazon EMR. This sample application is called WordCount, and can also be run manually from the Amazon EMR console. In the following code, you should replace the Amazon S3 bucket location with the name of an Amazon S3 bucket that you own. You should also replace the start and end dates. To get job flows launching immediately, set startDateTime to a date one day in the past and endDateTime to one day in the future. AWS Data Pipeline then starts launching the "past due" job flows immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.

{ "objects": [ { "id": "Hourly", "type": "Schedule", "startDateTime": "2012-11-19T07:48:00", "endDateTime": "2012-11-21T07:48:00", "period": "1 hours" },

API Version 2012-10-2963

AWS Data Pipeline Developer Guide[Optional] Delete your Pipeline

{ "id": "MyCluster", "type": "EmrCluster", "masterInstanceType": "m1.small", "schedule": { "ref": "Hourly" } }, { "id": "MyEmrActivity", "type": "EmrActivity", "schedule": { "ref": "Hourly" }, "runsOn": { "ref": "MyCluster" }, "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-in put,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myawsbucket/word count/output/#{@scheduledStartTime},-mapper,s3n://elasticmapreduce/samples/word count/wordSplitter.py,-reducer,aggregate" } ]}

This pipeline has three objects:

• Hourly, which represents the schedule of the work. You can set a schedule as one of the fields on a transform. When you do, the transform runs according to that schedule, or in this case, hourly.

• MyCluster, which represents the set of Amazon EC2 instances used to run the job flow. You can specify the size and number of EC2 instances to run as the cluster. If you do not specify the number of instances, the job flow launches with two, a master node and a task node. You can add additional configurations to the cluster, such as bootstrap actions to load additional software onto the Amazon EMR-provided AMI (see the sketch after this list).

• MyEmrActivity, which represents the computation to process with the job flow. Amazon EMR supports several types of job flows, including streaming, Cascading, and Scripted Hive. The runsOn field refers back to MyCluster, using that as the specification for the underpinnings of the job flow.
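As a sketch of the cluster sizing options mentioned above, assuming the same EmrCluster fields used later in this guide's import tutorial (masterInstanceType, instanceCoreType, and instanceCoreCount); the instance types and count here are illustrative:

{
  "id": "MyCluster",
  "type": "EmrCluster",
  "masterInstanceType": "m1.small",
  "instanceCoreType": "m1.xlarge",
  "instanceCoreCount": "2",
  "schedule": { "ref": "Hourly" }
}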

To create a pipeline that launches an Amazon EMR job flow

1. Open a terminal window in the directory where you've installed the AWS Data Pipeline CLI. For more information about how to install the CLI, see Install the Command Line Interface (p. 15).

2. Create a new pipeline.

./datapipeline --credentials ./credentials.json --create MyEmrPipeline

When the pipeline is created, AWS Data Pipeline returns a success message and an identifier for the pipeline.

Pipeline with name 'MyEmrPipeline' and id 'df-07634391Y0GRTUD0SP0' created.


3. Add the JSON definition to the pipeline. This gives AWS Data Pipeline the business logic it needs to manage your data.

./datapipeline --credentials ./credentials.json --put MyEmrPipelineDefinition.df --id df-07634391Y0GRTUD0SP0

The following message is an example of a successfully uploaded pipeline.

State of pipeline id 'df-07634391Y0GRTUD0SP0' is currently 'PENDING'

4. Activate the pipeline.

./datapipeline --credentials ./credentials.json --activate --id df-07634391Y0GRTUD0SP0

If the pipeline definition you uploaded with --put is valid, this command activates the pipeline. If the pipeline is invalid, AWS Data Pipeline returns an error code indicating what the problems are.

5. Wait until the pipeline has had time to start running, then verify the pipeline's operation.

./datapipeline --credentials ./credentials.json --list-runs --id df-07634391Y0GRTUD0SP0

This returns information about the runs initiated by the pipeline, such as the following.

State of pipeline id 'df-07634391Y0GRTUD0SP0' is currently 'SCHEDULED'

The --list-runs command is fetching the last 4 days of pipeline runs. If this takes too long, use --help for how to specify a different interval with --start-interval or --schedule-interval.

Name             Scheduled Start        Status              ID                                    Started               Ended
-----------------------------------------------------------------------------------------------------------------------------
1. MyCluster      2012-11-19T07:48:00    FINISHED            @MyCluster_2012-11-19T07:48:00        2012-11-20T22:29:33   2012-11-20T22:40:46
2. MyEmrActivity  2012-11-19T07:48:00    FINISHED            @MyEmrActivity_2012-11-19T07:48:00    2012-11-20T22:29:31   2012-11-20T22:38:43
3. MyCluster      2012-11-19T08:03:00    RUNNING             @MyCluster_2012-11-19T08:03:00        2012-11-20T22:34:32
4. MyEmrActivity  2012-11-19T08:03:00    RUNNING             @MyEmrActivity_2012-11-19T08:03:00    2012-11-20T22:34:31
5. MyCluster      2012-11-19T08:18:00    CREATING            @MyCluster_2012-11-19T08:18:00        2012-11-20T22:39:31
6. MyEmrActivity  2012-11-19T08:18:00    WAITING_FOR_RUNNER  @MyEmrActivity_2012-11-19T08:18:00    2012-11-20T22:39:30

All times are listed in UTC and all command line input is treated as UTC.

Total of 6 pipeline runs shown from pipeline named 'MyEmrPipeline' where --start-interval 2012-11-16T22:41:32,2012-11-20T22:41:32

You can view job flows launched by AWS Data Pipeline in the Amazon EMR console. The job flows spawned by AWS Data Pipeline on your behalf are displayed in the Amazon EMR console and billed to your AWS Account in the same manner as job flows that you launch manually.

To check the progress of job flows launched by AWS Data Pipeline

1. Look at the name of the job flow to tell which job flows were spawned by AWS Data Pipeline. Those spawned by AWS Data Pipeline have a name formatted as follows: <job-flow-identifier>_@<emr-cluster-name>_<launch-time>.


2. Click on the Bootstrap Actions tab to display the bootstrap action that AWS Data Pipeline uses to install the AWS Data Pipeline Task Agent on the Amazon EMR clusters that it launches.


3. After one of the runs is complete, navigate to the Amazon S3 console and check that the time-stamped output folder exists and contains the expected results of the job flow.


Tutorial: Import/Export Data in Amazon DynamoDB With Amazon EMR and Hive

This is the first of a two-part tutorial that demonstrates how to bring together multiple AWS features to solve real-world problems in a scalable way through a common scenario: moving schema-less data in and out of Amazon DynamoDB using Amazon EMR and Hive. Complete part one before you move on to part two. This tutorial involves the following concepts and procedures:

• Using the AWS Data Pipeline console and command-line interface (CLI) to create and configure pipelines

• Creating and configuring Amazon DynamoDB tables

• Creating and allocating work to Amazon EMR clusters

• Querying and processing data with Hive scripts

• Storing and accessing data using Amazon S3

Part One: Import Data into Amazon DynamoDB

Topics

• Before You Begin... (p. 70)

• Create an Amazon SNS Topic (p. 73)

• Create an Amazon S3 Bucket (p. 74)

• Using the AWS Data Pipeline Console (p. 74)

• Using the Command Line Interface (p. 81)

The first part of this tutorial explains how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file in Amazon S3 to populate an Amazon DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work. The first part of the tutorial involves the following steps:

1. Create an Amazon DynamoDB table to store the data

2. Create and configure the pipeline definition objects


3. Upload your pipeline definition

4. Verify your results

Before You Begin...

Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

• Set up the AWS Data Pipeline tools and interface you plan on using. For more information on interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create an Amazon S3 bucket as a data source.

For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Create an Amazon DynamoDB table to store data as defined by the following procedure.

Be aware of the following:

• Imports may overwrite data in your Amazon DynamoDB table. When you import data from Amazon S3, the import may overwrite items in your Amazon DynamoDB table. Make sure that you are importing the right data into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times.

• Exports may overwrite data in your Amazon S3 bucket. When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export DynamoDB to S3 template will append the job's scheduled time to the Amazon S3 bucket path, which will help you avoid this problem.

• Import and Export jobs will consume some of your Amazon DynamoDB table's provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The Amazon EMR cluster will consume some read capacity during exports or write capacity during imports. You can control the percentage of the provisioned capacity that the import/export jobs consume with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio (see the sketch after this list). Be aware that these settings determine how much capacity to consume at the beginning of the import/export process and will not adapt in real time if you change your table's provisioned capacity in the middle of the process.

• Be aware of the costs. AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services that are being used. The import and export pipelines will create Amazon EMR clusters to read and write data and there are per-instance charges for each node in the cluster. You can read more about the details of Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing.
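As a rough sketch of how the write throughput setting surfaces in a pipeline definition, the import activity shown later in the CLI section of this tutorial carries a dynamoDBWritePercent field; lowering it reduces how much write capacity the import consumes. The 0.25 value here is illustrative:

{
  "id": "MyImportJob",
  "type": "EmrActivity",
  "dynamoDBOutputTable": "MyTable",
  "dynamoDBWritePercent": "0.25",
  "input": { "ref": "MyS3Data" },
  "runsOn": { "ref": "ImportCluster" },
  "schedule": { "ref": "MySchedule" }
}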

Create an Amazon DynamoDB Table

This section explains how to create an Amazon DynamoDB table that is a prerequisite for this tutorial. For more information, see Working with Tables in Amazon DynamoDB in the Amazon DynamoDB Developer Guide.

Note
If you already have an Amazon DynamoDB table, you can skip this procedure.


To create an Amazon DynamoDB table

1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.

2. Click Create Table.

3. On the Create Table / Primary Key page, enter a name (for example, MyTable) in the Table Name box.

Note
Your table name must be unique.

4. In the Primary Key section, for the Primary Key Type radio button, select Hash.

5. In the Hash Attribute Name field, select Number and enter Id in the text box.

6. Click Continue.

7. On the Create Table / Provisioned Throughput Capacity page, in the Read Capacity Units box, enter 5.

8. In the Write Capacity Units box, enter 5.


Note
In this example, we use read and write capacity unit values of five because the sample input data is small. You may need a larger value depending on the size of your actual input data set. For more information, see Provisioned Throughput in Amazon DynamoDB in the Amazon DynamoDB Developer Guide.

9. Click Continue.

10. On the Create Table / Throughput Alarms page, in the Send notification to box, enter your email address.


Create an Amazon SNS Topic

This section explains how to create an Amazon SNS topic and subscribe to receive notifications from AWS Data Pipeline regarding the status of your pipeline components. For more information, see Create a Topic in the Amazon SNS Getting Started Guide.

Note
If you already have an Amazon SNS topic ARN to which you have subscribed, you can skip this procedure.

To create an Amazon SNS topic

1. Sign in to the AWS Management Console and open the Amazon SNS console.

2. Click Create New Topic.

3. In the Topic Name field, type your topic name, such as my-example-topic, and select Create Topic.

4. Note the value from the Topic ARN field, which should be similar in format to this example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

To create an Amazon SNS subscription

1. Sign in to the AWS Management Console and open the Amazon SNS console.

2. In the navigation pane, select your Amazon SNS topic and click Create New Subscription.

3. In the Protocol field, choose Email.

4. In the Endpoint field, type your email address and select Subscribe.

Note
You must accept the subscription confirmation email to begin receiving Amazon SNS notifications at the email address you specify.


Create an Amazon S3 Bucket

This section explains how to create an Amazon S3 bucket as a storage location for your input and output files related to this tutorial. For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

Note
If you already have an Amazon S3 bucket configured with write permissions, you can skip this procedure.

To create an Amazon S3 bucket

1. Sign in to the AWS Management Console and open the Amazon S3 console.

2. Click Create Bucket.

3. In the Bucket Name field, type your bucket name, such as my-example-bucket, and select Create.

4. In the Buckets pane, select your new bucket and select Permissions.

5. Ensure that all user accounts that you want to access these files appear in the Grantee list.

Using the AWS Data Pipeline Console

Topics

• Start Import from the Amazon DynamoDB Console (p. 74)

• Create the Pipeline Definition using the AWS Data Pipeline Console (p. 75)

• Create and Configure the Pipeline from a Template (p. 76)

• Complete the Data Nodes (p. 76)

• Complete the Resources (p. 77)

• Complete the Activity (p. 78)

• Complete the Notifications (p. 78)

• Validate and Save Your Pipeline (p. 78)

• Verify your Pipeline Definition (p. 79)

• Activate your Pipeline (p. 79)

• Monitor the Progress of Your Pipeline Runs (p. 80)

• [Optional] Delete your Pipeline (p. 81)

The following topics explain how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file using the AWS Data Pipeline console.

Start Import from the Amazon DynamoDB Console

You can begin the Amazon DynamoDB import operation from within the Amazon DynamoDB console.

To start the data import

1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.

2. On the Tables screen, click your Amazon DynamoDB table and click the Import Table button.

3. On the Import Table screen, read the walkthrough and check the I have read the walkthrough box, then select Build a Pipeline. This opens the AWS Data Pipeline console so that you can choose a template to import the Amazon DynamoDB table data.


Create the Pipeline Definition using the AWS Data Pipeline Console

To create the new pipeline

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console, or arrive at the AWS Data Pipeline console through the Build a Pipeline button in the Amazon DynamoDB console.

2. Click Create new pipeline.

3. On the Create a New Pipeline page:

a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).

b. In Pipeline Description, enter a description.

c. Leave the Select Schedule Type: button set to the default type Time Series Style Scheduling for this tutorial. Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

d. Leave the Role boxes set to their default values for this tutorial, which are DataPipelineDefaultRole for the role and DataPipelineDefaultResourceRole for the resource role.

Note
If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.

e. Click Create a new Pipeline.


Create and Configure the Pipeline from a Template

On the Pipeline screen, click Templates and select Export S3 to DynamoDB. The AWS Data Pipeline console pre-populates a pipeline definition template with the base objects necessary to import data from Amazon S3.

Review the template and complete the missing fields. You start by choosing the schedule and frequency by which you want your data import operation to run.

To complete the schedule

• On the Pipeline screen, click Schedules.

a. In the ImportSchedule section, set Period to 1 Hours.

b. Set Start Date Time using the calendar to the current date, such as 2012-12-18, and the time to 00:00:00 UTC.

c. In the Add an optional field .. box, select End Date Time.

d. Set End Date Time using the calendar to the following day, such as 2012-12-19, and the time to 00:00:00 UTC.

Complete the Data Nodes

Next, you complete the data node objects in your pipeline definition template.

To complete the Amazon DynamoDB data node

1. On the Pipeline: name of your pipeline page, select DataNodes.

2. In the DataNodes pane:


a. Enter the Name; for example: DynamoDB.

b. In the MyDynamoDBData section, in the Table Name box, type the name of the Amazon DynamoDB table where you want to store the output data; for example: MyTable.

To complete the Amazon S3 data node

• In the DataNodes pane:

• In the MyS3Data section, in the Directory Path field, type a valid Amazon S3 directory path for the location of your source data, for example, s3://elasticmapreduce/samples/Store/ProductCatalog. This sample file is a fictional product catalog that is pre-populated with delimited data for demonstration purposes.

Complete the Resources

Next, you complete the resources that will run the data import activities. Many of the fields are auto-populated by the template; you only need to complete the empty fields.

To complete the resources

• On the Pipeline page, select Resources.

• In the Emr Log Uri box, type the path where you want to store Amazon EMR debugging logs, using the Amazon S3 bucket that you configured earlier in this tutorial; for example: s3://my-test-bucket/emr_debug_logs.


Complete the Activity

Next, you complete the activity that represents the steps to perform in your data import operation.

To complete the activity

1. On the Pipeline: name of your pipeline page, select Activities.

2. In the MyImportJob section, review the default options already provided. You are not required to manually configure any options in this section.

Complete the Notifications

Next, configure the SNS notification action AWS Data Pipeline must perform depending on the outcome of the activity.

To configure the SNS success, failure, and late notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.

2. In the Others pane:

a. In the LateSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

b. In the FailureSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

c. In the SuccessSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you still get a validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.

2. AWS Data Pipeline validates your pipeline definition and returns either success or the error message.

3. If you get an error message, click Close and then, in the right pane, click Errors.

4. The Errors pane lists the objects failing validation.

Click the plus (+) sign next to the object names and look for an error message in red.

5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.


6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.

7. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.

2. On the List Pipelines page, check if your newly-created pipeline is listed.

AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition.

The Status column in the row listing your pipeline should show PENDING.

3. Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.

4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.

5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.


To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.

2. In the Pipeline: name of your pipeline page, click Activate.

A confirmation dialog box opens to confirm the activation.

3. Click Close.

Next, verify that your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.

2. The Instance details: name of your pipeline page lists the status of each object in your pipeline definition.

Note
If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3. If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the import activity. You should receive an email about the successful completion of this task at the email address you specified for receiving your Amazon SNS notification.

You can also check your Amazon DynamoDB table to verify that the data was imported.

4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.

a. To troubleshoot the failed or the incomplete instance runs:

Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.

b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure; for example: @failureReason = Resource not healthy terminated.

c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.

d. In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt.


5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.

For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition, including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. In the List Pipelines page, click the check box next to your pipeline.

2. Click Delete.

3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics

• Define the Import Pipeline in JSON Format (p. 82)

• Schedule (p. 84)


• Amazon S3 Data Node (p. 84)

• Precondition (p. 85)

• Amazon EMR Cluster (p. 86)

• Amazon EMR Activity (p. 86)

• Upload the Pipeline Definition (p. 88)

• Activate the Pipeline (p. 89)

• Verify the Pipeline Status (p. 89)

• Verify Data Import (p. 90)

The following topics explain how to perform the steps in this tutorial using the AWS Data Pipeline CLI.

Define the Import Pipeline in JSON Format

This example pipeline definition shows how to use AWS Data Pipeline to retrieve data from a file in Amazon S3 to populate an Amazon DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work. Additionally, this pipeline sends Amazon SNS notifications if the pipeline succeeds, fails, or runs late. This is the full pipeline definition JSON file followed by an explanation for each of its sections.

Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{ "objects": [ { "id": "MySchedule", "type": "Schedule", "startDateTime": "2012-11-22T00:00:00", "endDateTime":"2012-11-23T00:00:00", "period": "1 day" }, { "id": "MyS3Data", "type": "S3DataNode", "schedule": { "ref": "MySchedule" }, "filePath": "s3://input_bucket/ProductCatalog", "precondition": { "ref": "InputReady" } }, { "id": "InputReady", "type": "S3PrefixNotEmpty", "role": "test-role", "s3Prefix": "#{node.filePath}" }, { "id": "ImportCluster", "type": "EmrCluster", "masterInstanceType": "m1.small", "instanceCoreType": "m1.xlarge", "instanceCoreCount": "1",

API Version 2012-10-2982

AWS Data Pipeline Developer GuideUsing the Command Line Interface

"schedule": { "ref": "MySchedule" }, "enableDebugging": "true", "emrLogUri": "s3://test_bucket/emr_logs" }, { "id": "MyImportJob", "type": "EmrActivity", "dynamoDBOutputTable": "MyTable", "dynamoDBWritePercent": "1.00", "s3MyS3Data": "#{input.path}", "lateAfterTimeout": "12 hours", "attemptTimeout": "24 hours", "maximumRetries": "0", "input": { "ref": "MyS3Data" }, "runsOn": { "ref": "ImportCluster" }, "schedule": { "ref": "MySchedule" }, "onSuccess": { "ref": "SuccessSnsAlarm" }, "onFail": { "ref": "FailureSnsAlarm" }, "onLateAction": { "ref": "LateSnsAlarm" }, "step": "s3://elasticmapreduce/libs/script-runner/script-run ner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoD BTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{dynamoDBOutputTable},-d,S3_INPUT_BUCK ET=#{s3MyS3Data},-d,DYNAMODB_WRITE_PERCENT=#{dynamoDBWritePercent},-d,DYNAMODB_EN DPOINT=dynamodb.us-east-1.amazonaws.com" }, { "id": "SuccessSnsAlarm", "type": "SnsAlarm", "topicArn": "arn:aws:sns:us-east-1:286192228708:mysnsnotify", "role": "test-role", "subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import succeeded",

"message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' succeeded at #{node.@actualEndTime}. JobId: #{node.id}" }, { "id": "LateSnsAlarm", "type": "SnsAlarm", "topicArn": "arn:aws:sns:us-east-1:286192228708:mysnsnotify", "role": "test-role", "subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import is taking a long time!",

API Version 2012-10-2983

AWS Data Pipeline Developer GuideUsing the Command Line Interface

"message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' has exceeded the late warning period '#{node.lateAfterTimeout}'. JobId: #{node.id}" }, { "id": "FailureSnsAlarm", "type": "SnsAlarm", "topicArn": "arn:aws:sns:us-east-1:286192228708:mysnsnotify", "role": "test-role", "subject": "DynamoDB table '#{node.dynamoDBOutputTable}' import failed!",

"message": "DynamoDB table '#{node.dynamoDBOutputTable}' import from S3 bucket '#{node.s3MyS3Data}' failed. JobId: #{node.id}. Error: #{node.errorMes sage}." } ]}

Schedule

The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule and you may have more than one. The Schedule component is defined by the following fields:

{
  "id": "MySchedule",
  "type": "Schedule",
  "startDateTime": "2012-11-22T00:00:00",
  "endDateTime": "2012-11-23T00:00:00",
  "period": "1 day"
},

Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name
The user-defined name for the pipeline schedule, which is a label for your reference only.

Type
The pipeline component type, which is Schedule.

startDateTime
The date/time (in UTC format) that you want the task to begin.

endDateTime
The date/time (in UTC format) that you want the task to stop.

period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to be 1 day so that the pipeline copy operation can only run one time.
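For example, with the values shown above, the interval from 2012-11-22T00:00:00 to 2012-11-23T00:00:00 spans exactly 24 hours, so a period of 1 day divides it evenly and yields a single run; a period such as 5 hours would not divide the interval evenly.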

Amazon S3 Data Node

Next, the S3DataNode pipeline component defines a location for the input file; in this case a tab-delimited file in an Amazon S3 bucket location. The input S3DataNode component is defined by the following fields:


{ "id": "MyS3Data", "type": "S3DataNode", "schedule": { "ref": "MySchedule" }, "filePath": "s3://input_bucket/ProductCatalog", "precondition": { "ref": "InputReady" }},

Name
The user-defined name for the input location (a label for your reference only).

Type
The pipeline component type, which is "S3DataNode" to match the location where the data resides, in an Amazon S3 bucket.

Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled "MySchedule".

Path
The path to the data associated with the data node. This path contains a sample product catalog input file that we use for this scenario. The syntax for a data node is determined by its type. For example, a data node for a file in Amazon S3 follows a different syntax than a data node for a database table (see the sketch after this list).

Precondition
A reference to a precondition that must evaluate as true for the pipeline to consider the data node to be valid. The precondition itself is defined later in the pipeline definition file.
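A minimal sketch of that contrast follows. The S3DataNode fields match the example above; the database node shown for comparison is an assumption for illustration only (its type and field names are not covered in this section), so treat it as a placeholder rather than a reference:

{
  "id": "MyS3Data",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://input_bucket/ProductCatalog"
},
{
  "id": "MyDatabaseData",
  "type": "MySqlDataNode",
  "schedule": { "ref": "MySchedule" },
  "table": "product_catalog"
}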

Precondition

Next, the precondition defines a condition that must be true for the pipeline to use the S3DataNode associated with this precondition. The precondition is defined by the following fields:

{
  "id": "InputReady",
  "type": "S3PrefixNotEmpty",
  "role": "test-role",
  "s3Prefix": "#{node.filePath}"
},

Name
The user-defined name for the precondition (a label for your reference only).

Type
The type of the precondition is S3PrefixNotEmpty, which checks an Amazon S3 prefix to ensure that it is not empty.

Role
The IAM role that provides the permissions necessary to access the S3DataNode.

S3Prefix
The Amazon S3 prefix to check for emptiness. This field uses an expression #{node.filePath} populated from the referring component, which in this example is the S3DataNode that refers to this precondition.


Amazon EMR Cluster

Next, the EmrCluster pipeline component defines an Amazon EMR cluster that processes and moves the data in this tutorial. The EmrCluster component is defined by the following fields:

{
  "id": "ImportCluster",
  "type": "EmrCluster",
  "masterInstanceType": "m1.small",
  "instanceCoreType": "m1.xlarge",
  "instanceCoreCount": "1",
  "schedule": { "ref": "MySchedule" },
  "enableDebugging": "true",
  "emrLogUri": "s3://test_bucket/emr_logs"
},

Name
The user-defined name for the Amazon EMR cluster (a label for your reference only).

Type
The computational resource type, which is an Amazon EMR cluster. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.

masterInstanceType
The type of Amazon EC2 instance to use as the master node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.

instanceCoreType
The type of Amazon EC2 instance to use as the core node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.

instanceCoreCount
The number of core Amazon EC2 instances to use in the Amazon EMR cluster.

Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled "MySchedule".

enableDebugging
Indicates whether to create detailed debug logs for the Amazon EMR job flow.

emrLogUri
Specifies an Amazon S3 location to store the Amazon EMR job flow debug logs if you enabled debugging with the previously mentioned enableDebugging field.

Amazon EMR Activity

Next, the EmrActivity pipeline component brings together the schedule, resources, and data nodes to define the work to perform, the conditions under which to do the work, and the actions to perform when certain events occur. The EmrActivity component is defined by the following fields:

{
  "id": "MyImportJob",
  "type": "EmrActivity",
  "dynamoDBOutputTable": "MyTable",
  "dynamoDBWritePercent": "1.00",
  "s3MyS3Data": "#{input.path}",
  "lateAfterTimeout": "12 hours",
  "attemptTimeout": "24 hours",
  "maximumRetries": "0",
  "input": { "ref": "MyS3Data" },
  "runsOn": { "ref": "ImportCluster" },
  "schedule": { "ref": "MySchedule" },
  "onSuccess": { "ref": "SuccessSnsAlarm" },
  "onFail": { "ref": "FailureSnsAlarm" },
  "onLateAction": { "ref": "LateSnsAlarm" },
  "step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{dynamoDBOutputTable},-d,S3_INPUT_BUCKET=#{s3MyS3Data},-d,DYNAMODB_WRITE_PERCENT=#{dynamoDBWritePercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
},

Name
The user-defined name for the Amazon EMR activity (a label for your reference only).

Type
The EmrActivity pipeline component type, which creates an Amazon EMR job flow to perform the defined work. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.

dynamoDBOutputTable
The Amazon DynamoDB table where the Amazon EMR job flow writes the output of the Hive script.

dynamoDBWritePercent
Sets the rate of write operations to keep your Amazon DynamoDB database instance provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusively. For more information, see Hive Options in the Amazon EMR Developer Guide.

s3MyS3Data
An expression that refers to the Amazon S3 location path of the input data defined by the S3DataNode labeled "MyS3Data".

lateAfterTimeout
The amount of time, after the schedule start time, that the activity can wait to start before AWS Data Pipeline considers it late.

attemptTimeout
The amount of time, after the schedule start time, that the activity has to complete before AWS Data Pipeline considers it as failed.

maximumRetries
The maximum number of times that AWS Data Pipeline retries the activity.

input
The Amazon S3 location path of the input data defined by the S3DataNode labeled "MyS3Data".


runsOn
A reference to the computational resource that will run the activity; in this case, an EmrCluster labeled "ImportCluster".

schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled "MySchedule".

onSuccess
A reference to the action to perform when the activity is successful. In this case, it is to send an Amazon SNS notification.

onFail
A reference to the action to perform when the activity fails. In this case, it is to send an Amazon SNS notification.

onLateAction
A reference to the action to perform when the activity is late. In this case, it is to send an Amazon SNS notification.

step
Defines the steps for the EMR job flow to perform. This step calls a Hive script named importDynamoDBTableFromS3 that is provided by Amazon EMR and is specifically designed to move data from Amazon S3 into Amazon DynamoDB. To perform more complex data transformation tasks, you would customize this Hive script and provide its name and path here. For more information about sample Hive scripts that show how to perform data transformation tasks, see Contextual Advertising using Apache Hive and Amazon EMR in AWS Articles and Tutorials.
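For example, if you wrote your own Hive script, the step value might point at that script instead. A sketch, assuming a hypothetical script stored at s3://my-example-bucket/my-import-script.q and the same -d variable-passing convention used above:

"step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-example-bucket/my-import-script.q,-d,DYNAMODB_OUTPUT_TABLE=#{dynamoDBOutputTable},-d,S3_INPUT_BUCKET=#{s3MyS3Data},-d,DYNAMODB_WRITE_PERCENT=#{dynamoDBWritePercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"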

Upload the Pipeline Definition

You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows:

ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline definition pipeline_file.json uploaded.

Note
For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.

On Linux/Unix/Mac OS:


./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE.

Activate the Pipeline

You must activate the pipeline, by using the --activate command-line parameter, before it will begin performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

Verify the Pipeline Status

View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.

Note
It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status. Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with their own final status.

Verify Data Import

Next, verify that the data import occurred successfully using the Amazon DynamoDB console to inspect the data in the table.

To verify the data import

1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.

2. On the Tables screen, click your Amazon DynamoDB table and click the Explore Table button.

3. On the Browse Items tab, columns that correspond to the data input file should display, such as Id, Price, and ProductCategory, as shown in the following screen. This indicates that the import operation from the file to the Amazon DynamoDB table occurred successfully.

Part Two: Export Data from Amazon DynamoDB

Topics

• Before You Begin ... (p. 91)

• Using the AWS Data Pipeline Console (p. 92)

• Using the Command Line Interface (p. 98)

This is the second of a two-part tutorial that demonstrates how to bring together multiple AWS features to solve real-world problems in a scalable way through a common scenario: moving schema-less data in and out of Amazon DynamoDB using Amazon EMR and Hive. This tutorial involves the following concepts and procedures:


• Using the AWS Data Pipeline console and command-line interface (CLI) to create and configure pipelines

• Creating and configuring Amazon DynamoDB tables

• Creating and allocating work to Amazon EMR clusters

• Querying and processing data with Hive scripts

• Storing and accessing data using Amazon S3

Before You Begin ...

You must complete part one of this tutorial to ensure that your Amazon DynamoDB table contains the necessary data to perform the steps in this section. For more information, see Part One: Import Data into Amazon DynamoDB (p. 69).

Additionally, be sure you've completed the following steps:

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

Set up the AWS Data Pipeline tools and interface you plan on using. For more information about interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create an Amazon S3 bucket as a data output location.

For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Ensure that you have the Amazon DynamoDB table that was created and populated with data in part one of this tutorial. This table will be your data source for part two of the tutorial. For more information, see Part One: Import Data into Amazon DynamoDB (p. 69).

Be aware of the following:

• Imports may overwrite data in your Amazon DynamoDB table. When you import data from Amazon S3, the import may overwrite items in your Amazon DynamoDB table. Make sure that you are importing the right data into the right table. Be careful not to accidentally set up a recurring import pipeline that will import the same data multiple times.

• Exports may overwrite data in your Amazon S3 bucket. When you export data to Amazon S3, you may overwrite previous exports if you write to the same bucket path. The default behavior of the Export DynamoDB to S3 template will append the job's scheduled time to the Amazon S3 bucket path, which will help you avoid this problem.

• Import and Export jobs will consume some of your Amazon DynamoDB table's provisioned throughput capacity. This section explains how to schedule an import or export job using Amazon EMR. The Amazon EMR cluster will consume some read capacity during exports or write capacity during imports. You can control the percentage of the provisioned capacity that the import/export jobs consume with the settings MyImportJob.myDynamoDBWriteThroughputRatio and MyExportJob.myDynamoDBReadThroughputRatio (see the fragment that follows this list). Be aware that these settings determine how much capacity to consume at the beginning of the import/export process and will not adapt in real time if you change your table's provisioned capacity in the middle of the process.

• Be aware of the costs. AWS Data Pipeline manages the import/export process for you, but you still pay for the underlying AWS services that are being used. The import and export pipelines will create Amazon EMR clusters to read and write data, and there are per-instance charges for each node in the cluster. You can read more about the details of Amazon EMR Pricing. The default cluster configuration is one m1.small instance master node and one m1.xlarge instance task node, though you can change this configuration in the pipeline definition. There are also charges for AWS Data Pipeline. For more information, see AWS Data Pipeline Pricing and Amazon S3 Pricing.
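The throughput ratio parameter names above are the ones exposed by the console templates. In the CLI pipeline definition shown later in this section, the corresponding control on the export activity is the dynamoDBReadPercent field. The following fragment is only a rough, hypothetical illustration of setting that field; the 0.40 value is an arbitrary example rather than a recommendation, and the other fields are abbreviated from the full definition shown later.

{
  "id": "MyExportJob",
  "type": "EmrActivity",
  "dynamoDBInputTable": "MyTable",
  "dynamoDBReadPercent": "0.40",
  "runsOn": { "ref": "ExportCluster" },
  "schedule": { "ref": "MySchedule" }
}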


Using the AWS Data Pipeline Console

Topics

• Start Export from the Amazon DynamoDB Console (p. 92)

• Create the Pipeline Definition using the AWS Data Pipeline Console (p. 93)

• Create and Configure the Pipeline from a Template (p. 93)

• Complete the Data Nodes (p. 94)

• Complete the Resources (p. 95)

• Complete the Activity (p. 95)

• Complete the Notifications (p. 96)

• Validate and Save Your Pipeline (p. 96)

• Verify your Pipeline Definition (p. 96)

• Activate your Pipeline (p. 97)

• Monitor the Progress of Your Pipeline Runs (p. 97)

• [Optional] Delete your Pipeline (p. 98)

The following topics explain how to perform the steps in part two of this tutorial using the AWS Data Pipeline console.

Start Export from the Amazon DynamoDB Console

You can begin the Amazon DynamoDB export operation from within the Amazon DynamoDB console.

To start the data export

1. Sign in to the AWS Management Console and open the Amazon DynamoDB console.

2. On the Tables screen, click your Amazon DynamoDB table and click the Export Table button.

3. On the Import / Export Table screen, select Build a Pipeline. This opens the AWS Data Pipeline console so that you can choose a template to export the Amazon DynamoDB table data.


Create the Pipeline Definition using the AWS Data Pipeline Console

To create the new pipeline

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console or arrive at the AWS Data Pipeline console through the Build a Pipeline button in the Amazon DynamoDB console.

2. Click Create new pipeline.

3. On the Create a New Pipeline page:

a. In the Pipeline Name box, enter a name (for example, CopyMyS3Data).

b. In Pipeline Description, enter a description.

c. Leave the Select Schedule Type: button set to the default type, Time Series Style Scheduling, for this tutorial. Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

d. Leave the Role boxes set to their default values for this tutorial, which are DataPipelineDefaultRole for the role and DataPipelineDefaultResourceRole for the resource role.

Note
If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.

e. Click Create a new Pipeline.

Create and Configure the Pipeline from a Template

On the Pipeline screen, click Templates and select Export DynamoDB to S3. The AWS Data Pipeline console pre-populates a pipeline definition template with the base objects necessary to export data from Amazon DynamoDB, as shown in the following screen.


Review the template and complete the missing fields. You start by choosing the schedule and frequency by which you want your data export operation to run.

To complete the schedule

• On the Pipeline screen, click Schedules.

a. In the DefaultSchedule1 section, set Name to ExportSchedule.

b. Set Period to 1 Hours.

c. Set Start Date Time using the calendar to the current date, such as 2012-12-18, and the time to 00:00:00 UTC.

d. In the Add an optional field .. box, select End Date Time.

e. Set End Date Time using the calendar to the following day, such as 2012-12-19, and the time to 00:00:00 UTC.

Complete the Data Nodes

Next, you complete the data node objects in your pipeline definition template.

To complete the Amazon DynamoDB data node

1. On the Pipeline: name of your pipeline page, select DataNodes.

2. In the DataNodes pane, in the Table Name box, type the name of the Amazon DynamoDB table that you created in part one of this tutorial; for example: MyTable.


To complete the Amazon S3 data node

• In the MyS3Data section, in the Directory Path field, type the path to the files where you want the Amazon DynamoDB table data to be written, which is the Amazon S3 bucket that you configured in part one of this tutorial. For example: s3://mybucket/output/MyTable.
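These console entries correspond roughly to objects in the underlying pipeline definition. The following is only a hypothetical sketch: the IDs are invented, and the field names (tableName, directoryPath) and the schedule reference are assumptions based on the console fields in this tutorial, so the template-generated definition may differ.

{
  "id": "MyDynamoDBData",
  "type": "DynamoDBDataNode",
  "schedule": { "ref": "ExportSchedule" },
  "tableName": "MyTable"
},
{
  "id": "MyS3Data",
  "type": "S3DataNode",
  "schedule": { "ref": "ExportSchedule" },
  "directoryPath": "s3://mybucket/output/MyTable"
}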

Complete the Resources

Next, you complete the resources that will run the data export activities. Many of the fields are auto-populated by the template, as shown in the following screen. You only need to complete the empty fields.

To complete the resources

• On the Pipeline page, select Resources.

• In the Emr Log Uri box, type the path where Amazon EMR debugging logs should be stored, using the Amazon S3 bucket that you configured in part one of this tutorial; for example: s3://mybucket/emr_debug_logs.

Complete the Activity

Next, you complete the activity that represents the steps to perform in your data export operation.

To complete the activity

1. On the Pipeline: name of your pipeline page, select Activities.

2. In the MyExportJob section, review the default options already provided. You are not required to manually configure any options in this section.


Complete the Notifications

Next, configure the SNS notification action AWS Data Pipeline must perform depending on the outcome of the activity.

To configure the SNS success, failure, and late notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.

2. In the Others pane:

a. In the LateSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

b. In the FailureSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

c. In the SuccessSnsAlarm section, in the Topic Arn box, enter the ARN of your Amazon SNS topic that you created earlier in the tutorial; for example: arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or is incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.

2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.

3. If you get an error message, click Close and then, in the right pane, click Errors.

4. The Errors pane lists the objects failing validation.

Click the plus (+) sign next to the object names and look for an error message in red.

5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.

7. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.

Verify your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.


To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.

2. On the List Pipelines page, check if your newly-created pipeline is listed.

AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition.

The Status column in the row listing your pipeline should show PENDING.

3. Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.

4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.

5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.

2. In the Pipeline: name of your pipeline page, click Activate.

Next, verify if your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.

2. The Instance details: name of your pipeline page lists the status of each object in your pipeline definition.

Note
If you do not see runs listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3. If the Status column of all the objects in your pipeline indicates FINISHED, your pipeline has successfully completed the export activity. You should receive an email about the successful completion of this task, to the account you specified for receiving your Amazon SNS notification.

You can also check your Amazon S3 data target bucket to verify that the data was copied.

4. If the Status column of any of your objects indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.

a. To troubleshoot the failed or the incomplete runs

Click the triangle next to a run; the Instance summary panel opens to show the details of the selected run.

b. Click View instance fields to see additional details of the run. If the status of your selected run is FAILED, the additional details box has an entry indicating the reason for failure; for example: @failureReason = Resource not healthy terminated.


c. You can use the information in the Instance summary panel and the View instance fields box to troubleshoot issues with your failed pipeline.

For more information about the status listed for the runs, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important
Your pipeline is running and is incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. In the List Pipelines page, click the check box next to your pipeline.

2. Click Delete.

3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics

• Define the Export Pipeline in JSON Format (p. 98)

• Schedule (p. 100)

• Amazon S3 Data Node (p. 101)

• Amazon EMR Cluster (p. 102)

• Amazon EMR Activity (p. 102)

• Upload the Pipeline Definition (p. 104)

• Activate the Pipeline (p. 105)

• Verify the Pipeline Status (p. 105)

• Verify Data Export (p. 106)

The following topics explain how to perform the steps in this tutorial using the AWS Data Pipeline CLI.

Define the Export Pipeline in JSON Format

This example pipeline definition shows how to use AWS Data Pipeline to retrieve data from an Amazon DynamoDB table to populate a tab-delimited file in Amazon S3, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work.


Additionally, this pipeline will send Amazon SNS notifications if the pipeline succeeds, fails, or runs late. The following is the full pipeline definition JSON file, followed by an explanation for each of its sections.

Note
We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and name the file using the .json file extension.

{ "objects": [ { "id": "MySchedule", "type": "Schedule", "startDateTime": "2012-11-22T00:00:00", "endDateTime":"2012-11-23T00:00:00", "period": "1 day" }, { "id": "MyS3Data", "type": "S3DataNode", "schedule": { "ref": "MySchedule" }, "filePath": "s3://output_bucket/ProductCatalog" }, { "id": "ExportCluster", "type": "EmrCluster", "masterInstanceType": "m1.small", "instanceCoreType": "m1.xlarge", "instanceCoreCount": "1", "schedule": { "ref": "MySchedule" }, "enableDebugging": "true", "emrLogUri": "s3://test_bucket/emr_logs" }, { "id": "MyExportJob", "type": "EmrActivity", "dynamoDBInputTable": "MyTable", "dynamoDBReadPercent": "0.25", "s3OutputBucket": "#{output.path}", "lateAfterTimeout": "12 hours", "attemptTimeout": "24 hours", "maximumRetries": "0", "output": { "ref": "MyS3Data" }, "runsOn": { "ref": "ExportCluster" }, "schedule": { "ref": "MySchedule" }, "onSuccess": { "ref": "SuccessSnsAlarm" }, "onFail": {

API Version 2012-10-2999

AWS Data Pipeline Developer GuideUsing the Command Line Interface

"ref": "FailureSnsAlarm" }, "onLateAction": { "ref": "LateSnsAlarm" }, "step": "s3://elasticmapreduce/libs/script-runner/script-run ner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/exportDynamoD BTableToS3,-d,DYNAMODB_INPUT_TABLE=#{dynamoDBInputTable},-d,S3_OUTPUT_BUCK ET=#{s3OutputBucket}/#{format(@actualStartTime,'YYYY-MM-dd_hh.mm')},-d,DY NAMODB_READ_PERCENT=#{dynamoDBReadPercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com" }, { "id": "SuccessSnsAlarm", "type": "SnsAlarm", "topicArn": "arn:aws:sns:us-east-1:286198878708:mysnsnotify", "role": "test-role", "subject": "DynamoDB table '#{node.dynamoDBInputTable}' export succeeded",

"message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' succeeded at #{node.@actualEndTime}. JobId: #{node.id}" }, { "id": "LateSnsAlarm", "type": "SnsAlarm", "topicArn": "arn:aws:sns:us-east-1:286198878708:mysnsnotify", "role": "test-role", "subject": "DynamoDB table '#{node.dynamoDBInputTable}' export is taking a long time!", "message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' has exceeded the late warning period '#{node.lateAfterTimeout}'. JobId: #{node.id}" }, { "id": "FailureSnsAlarm", "type": "SnsAlarm", "topicArn": "arn:aws:sns:us-east-1:286198878708:mysnsnotify", "role": "test-role", "subject": "DynamoDB table '#{node.dynamoDBInputTable}' export failed!",

"message": "DynamoDB table '#{node.dynamoDBInputTable}' export to S3 bucket '#{node.s3OutputBucket}' failed. JobId: #{node.id}. Error: #{node.er rorMessage}." } ]}

Schedule

The example AWS Data Pipeline JSON file begins with a section to define the schedule by which to copy the data. Many pipeline components have a reference to a schedule, and you may have more than one. The Schedule component is defined by the following fields:


{ "id": "MySchedule", "type": "Schedule", "startDateTime": "2012-11-22T00:00:00", "endDateTime":"2012-11-23T00:00:00", "period": "1 day"},

Note
In the JSON file, you can define the pipeline components in any order you prefer. In this example, we chose the order that best illustrates the pipeline component dependencies.

Name
The user-defined name for the pipeline schedule, which is a label for your reference only.

Type
The pipeline component type, which is Schedule.

startDateTime
The date/time (in UTC format) that you want the task to begin.

endDateTime
The date/time (in UTC format) that you want the task to stop.

period
The time period that you want to pass between task attempts, even if the task occurs only one time. The period must evenly divide the time between startDateTime and endDateTime. In this example, we set the period to be 1 day so that the pipeline copy operation can only run one time.
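For comparison, the console walkthrough earlier in this part used an hourly schedule over a one-day window (ExportSchedule). Expressed in the same form, purely as an illustrative sketch using the dates from that walkthrough, such a schedule would look like the following and would produce 24 runs, because the 1 hour period evenly divides the 24 hours between the start and end times.

{
  "id": "ExportSchedule",
  "type": "Schedule",
  "startDateTime": "2012-12-18T00:00:00",
  "endDateTime": "2012-12-19T00:00:00",
  "period": "1 hour"
},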

Amazon S3 Data Node

Next, the S3DataNode pipeline component defines a location for the output file; in this case, a tab-delimited file in an Amazon S3 bucket location. The output S3DataNode component is defined by the following fields:

{ "id": "MyS3Data", "type": "S3DataNode", "schedule": { "ref": "MySchedule" }, "filePath": "s3://output_bucket/ProductCatalog" }},

Name
The user-defined name for the output location (a label for your reference only).

Type
The pipeline component type, which is "S3DataNode" to match the data output location, in an Amazon S3 bucket.

Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled "MySchedule".

Path
The path to the data associated with the data node. This path is an empty Amazon S3 location where a tab-delimited output file will be written that has the contents of a sample product catalog in an Amazon DynamoDB table. The syntax for a data node is determined by its type. For example, a data node for a file in Amazon S3 follows a different syntax than a data node for a database table.

Amazon EMR Cluster

Next, the EmrCluster pipeline component defines an Amazon EMR cluster that processes and moves the data in this tutorial. The EmrCluster component is defined by the following fields:

{ "id": "ImportCluster", "type": "EmrCluster", "masterInstanceType": "m1.small", "instanceCoreType": "m1.xlarge", "instanceCoreCount": "1", "schedule": { "ref": "MySchedule" }, "enableDebugging": "true", "emrLogUri": "s3://test_bucket/emr_logs"},

Name
The user-defined name for the Amazon EMR cluster (a label for your reference only).

Type
The computational resource type, which is an Amazon EMR cluster. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.

masterInstanceType
The type of Amazon EC2 instance to use as the master node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.

instanceCoreType
The type of Amazon EC2 instance to use as the core node of the Amazon EMR cluster. For more information, see Amazon EC2 Instance Types in the Amazon EC2 Documentation.

instanceCoreCount
The number of core Amazon EC2 instances to use in the Amazon EMR cluster.

Schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled "MySchedule".

enableDebugging
Indicates whether to create detailed debug logs for the Amazon EMR job flow.

emrLogUri
Specifies an Amazon S3 location to store the Amazon EMR job flow debug logs if you enabled debugging with the previously-mentioned enableDebugging field.

Amazon EMR Activity

Next, the EmrActivity pipeline component brings together the schedule, resources, and data nodes to define the work to perform, the conditions under which to do the work, and the actions to perform when certain events occur. The EmrActivity component is defined by the following fields:

{ "id": "MyExportJob", "type": "EmrActivity",

API Version 2012-10-29102

AWS Data Pipeline Developer GuideUsing the Command Line Interface

"dynamoDBInputTable": "MyTable", "dynamoDBReadPercent": "0.25", "s3OutputBucket": "#{output.path}", "lateAfterTimeout": "12 hours", "attemptTimeout": "24 hours", "maximumRetries": "0", "output": { "ref": "MyS3Data" }, "runsOn": { "ref": "ExportCluster" }, "schedule": { "ref": "ExportPeriod" }, "onSuccess": { "ref": "SuccessSnsAlarm" }, "onFail": { "ref": "FailureSnsAlarm" }, "onLateAction": { "ref": "LateSnsAlarm" }, "step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elast icmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/exportDynamoDBTableToS3,-d,DYNAMODB_INPUT_TABLE=#{dynamoDBInputTable},-d,S3_OUTPUT_BUCKET=#{s3OutputBuck et}/#{format(@actualStartTime,'YYYY-MM-dd_hh.mm')},-d,DYNAMODB_READ_PERCENT=#{dy namoDBReadPercent},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"},

Name
The user-defined name for the Amazon EMR activity (a label for your reference only).

Type
The EmrActivity pipeline component type, which creates an Amazon EMR job flow to perform the defined work. For more information, see Overview of Amazon EMR in the Amazon EMR Developer Guide.

dynamoDBInputTable
The Amazon DynamoDB table that the Amazon EMR job flow reads as the input for the Hive script.

dynamoDBReadPercent
Set the rate of read operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusive. For more information, see Hive Options in the Amazon EMR Developer Guide.

s3OutputBucket
An expression that refers to the Amazon S3 location path for the output file defined by the S3DataNode labeled "MyS3Data".

lateAfterTimeout
The amount of time, after the schedule start time, that the activity can wait to start before AWS Data Pipeline considers it late.

attemptTimeout
The amount of time, after the schedule start time, that the activity has to complete before AWS Data Pipeline considers it failed.

maximumRetries
The maximum number of times that AWS Data Pipeline retries the activity.


output
The Amazon S3 location path of the output data defined by the S3DataNode labeled "MyS3Data".

runsOn
A reference to the computational resource that will run the activity; in this case, an EmrCluster labeled "ExportCluster".

schedule
A reference to the schedule component that we created in the preceding lines of the JSON file labeled "MySchedule".

onSuccess
A reference to the action to perform when the activity is successful. In this case, it is to send an Amazon SNS notification.

onFail
A reference to the action to perform when the activity fails. In this case, it is to send an Amazon SNS notification.

onLateAction
A reference to the action to perform when the activity is late. In this case, it is to send an Amazon SNS notification.

step
Defines the steps for the EMR job flow to perform. This step calls a Hive script named exportDynamoDBTableToS3 that is provided by Amazon EMR and is specifically designed to move data from Amazon DynamoDB to Amazon S3. To perform more complex data transformation tasks, you would customize this Hive script and provide its name and path here. For more information about sample Hive scripts that show how to perform data transformation tasks, see Contextual Advertising using Apache Hive and Amazon EMR in AWS Articles and Tutorials.

Upload the Pipeline Definition

You can upload a pipeline definition file using the AWS Data Pipeline CLI tools. For more information, see Install the Command Line Interface (p. 15).

To upload your pipeline definition, use the following command.

On Linux/Unix/Mac OS:

./datapipeline --create pipeline_name --put pipeline_file

On Windows:

ruby datapipeline --create pipeline_name --put pipeline_file

Where pipeline_name is the label for your pipeline and pipeline_file is the full path and file name for the file with the .json file extension that defines your pipeline.

If your pipeline validates successfully, you receive the following message:

Pipeline with name pipeline_name and id df-AKIAIOSFODNN7EXAMPLE created. Pipeline definition pipeline_file.json uploaded.

Note
For more information about any errors returned by the --create command or other commands, see Troubleshoot AWS Data Pipeline (p. 128).

Ensure that your pipeline appears in the pipeline list by using the following command.


On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

The list of pipelines includes details such as Name, Id, State, and UserId. Take note of your pipeline ID, because you use this value for most AWS Data Pipeline CLI commands. The pipeline ID is a unique identifier using the format df-AKIAIOSFODNN7EXAMPLE.

Activate the Pipeline

You must activate the pipeline by using the --activate command-line parameter before it will begin performing work. Use the following command.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --activate --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

Verify the Pipeline Status

View the status of your pipeline and its components, along with its activity attempts and retries, with the following command.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On Windows:

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

Where df-AKIAIOSFODNN7EXAMPLE is the identifier for your pipeline.

The --list-runs command displays a list of pipeline components and details such as Name, Scheduled Start, Status, ID, Started, and Ended.

Note
It is important to note the difference between the Scheduled Start date/time and the Started time. It is possible to schedule a pipeline component to run at a certain time (Scheduled Start), but the actual start time (Started) could be later due to problems or delays with preconditions, dependencies, failures, or retries.

Note
AWS Data Pipeline may backfill a pipeline, which happens when you define a Scheduled Start date/time for a date in the past. In that situation, AWS Data Pipeline immediately runs the pipeline components the number of times the activity should have run if it had started on the Scheduled Start time. When this happens, you see pipeline components run back-to-back at a greater frequency than the period value that you specified when you created the pipeline. AWS Data Pipeline returns your pipeline to the defined frequency only when it catches up to the number of past runs.

Successful pipeline runs are indicated by all the activities in your pipeline reporting the FINISHED status. Your pipeline frequency determines how many times the pipeline runs, and each run has its own success or failure, as indicated by the --list-runs command. Resources that you define in your pipeline, such as Amazon EC2 instances, may show the SHUTTING_DOWN status until they are finally terminated after a successful run. Depending on how you configured your pipeline, you may have multiple Amazon EC2 resources, each with its own final status.

Verify Data Export

Next, verify that the data export occurred successfully by viewing the output file contents.

To view the export file contents

1. Sign in to the AWS Management Console and open the Amazon S3 console.

2. On the Buckets pane, click the Amazon S3 bucket that contains your file output (the example pipeline uses the output path s3://output_bucket/ProductCatalog) and open the output file with your preferred text editor. The output file name is an identifier value with no extension, such as this example: ae10f955-fb2f-4790-9b11-fbfea01a871e_000000.

3. Using your preferred text editor, view the contents of the output file and ensure that there is delimited data that corresponds to the Amazon DynamoDB source table, such as Id, Price, and ProductCategory, as shown in the following screen. This indicates that the export operation from Amazon DynamoDB to the output file occurred successfully.


Tutorial: Run a Shell Command to Process MySQL Table

This tutorial walks you through the process of creating a data pipeline to use a script stored in an Amazon S3 bucket to process a MySQL table, write the output in a comma-separated values (CSV) file in an Amazon S3 bucket, and then send an Amazon SNS notification after the task completes successfully. You will use the Amazon EC2 instance resource provided by AWS Data Pipeline for this shell command activity.

The first step in the pipeline creation process is to select the pipeline objects that make up your pipeline definition. After you select the pipeline objects, you add fields for each pipeline object. For more information on pipeline definitions, see Pipeline Definition (p. 2).

This tutorial uses the following objects to create a pipeline definition:

Activity
The activity that AWS Data Pipeline must perform for this pipeline.

This tutorial uses the ShellCommandActivity to process the data in the MySQL table and write the output in a CSV file.

Schedule
The start date, time, and the duration for this activity. You can optionally specify the end date and time.

Resource
The resource that AWS Data Pipeline must use to perform this activity.

This tutorial uses Ec2Resource, an Amazon EC2 instance provided by AWS Data Pipeline, to run a command for processing the data. AWS Data Pipeline automatically launches the Amazon EC2 instance and then terminates the instance after the task finishes.

DataNodes
Input and output nodes for this pipeline.

This tutorial uses two input nodes and one output node. The first input node is the MySQLDataNode that contains the MySQL table. The second input node is the S3DataNode that contains the script. The output node is the S3DataNode for storing the CSV file.

Action
The action that AWS Data Pipeline must take when the specified conditions are met.

This tutorial uses the SnsAlarm action to send an Amazon SNS notification to the email address you specify after the task finishes successfully.
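Before you start clicking through the console, it can help to see how these objects might fit together in a pipeline definition JSON file. The following is a minimal, hypothetical sketch only, not the definition that the console generates: the object IDs and example values are taken from the console steps later in this tutorial, while the field names (for example, connectionString, scriptUri, filePath) are assumptions inferred from the console field labels and may differ from the actual CLI schema.

{
  "objects": [
    {
      "id": "run-mysql-script-schedule",
      "type": "Schedule",
      "startDateTime": "2012-12-18T00:00:00",
      "endDateTime": "2012-12-19T00:00:00",
      "period": "1 day"
    },
    {
      "id": "MySQLTableInput",
      "type": "MySQLDataNode",
      "schedule": { "ref": "run-mysql-script-schedule" },
      "connectionString": "mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com",
      "table": "mysql-input-table",
      "username": "my_mysql_user",
      "*password": "my_mysql_password"
    },
    {
      "id": "MySQLScriptOutput",
      "type": "S3DataNode",
      "schedule": { "ref": "run-mysql-script-schedule" },
      "filePath": "s3://my-data-pipeline-output/my-output.csv"
    },
    {
      "id": "RunScriptInstance",
      "type": "Ec2Resource",
      "schedule": { "ref": "run-mysql-script-schedule" },
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "RunDailyScriptNotice",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:403EXAMPLE:my-example-topic",
      "role": "DataPipelineDefaultRole",
      "subject": "Daily MySQL script succeeded",
      "message": "The shell command activity finished successfully."
    },
    {
      "id": "run-my-script",
      "type": "ShellCommandActivity",
      "schedule": { "ref": "run-mysql-script-schedule" },
      "scriptUri": "s3://my-script/myscript.txt",
      "input": { "ref": "MySQLTableInput" },
      "output": { "ref": "MySQLScriptOutput" },
      "runsOn": { "ref": "RunScriptInstance" },
      "onSuccess": { "ref": "RunDailyScriptNotice" }
    }
  ]
}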


For more information about the additional objects and fields supported by the shell command activity, see ShellCommandActivity (p. 176).

The following steps outline how to create a data pipeline to run a script stored in an Amazon S3 bucket.

1. Create your pipeline definition

2. Create and configure the pipeline definition objects

3. Validate and save your pipeline definition

4. Verify that your pipeline definition is saved

5. Activate your pipeline

6. Monitor the progress of your pipeline

7. [Optional] Delete your pipeline

Before you begin ...

Be sure you've completed the following steps.

• Set up an Amazon Web Services (AWS) account to access the AWS Data Pipeline console. For more information, see Access the Console (p. 12).

Set up the AWS Data Pipeline tools and interface you plan on using. For more information on interfaces and tools you can use to interact with AWS Data Pipeline, see Get Set Up for AWS Data Pipeline (p. 12).

• Create and launch a MySQL database instance as a data source.

For more information, see Launch a DB Instance in the Amazon Relational Database Service (RDS) Getting Started Guide.

Note
Make a note of the user name and the password you used for creating the MySQL instance. After you've launched your MySQL database instance, make a note of the instance's endpoint. You will need all this information in this tutorial.

• Connect to your MySQL database instance, create a table, and then add test data values to the newly-created table.

For more information, see Create a Table in the MySQL documentation.

• Create an Amazon S3 bucket as a source for the script.

For more information, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Create a script to read the data in the MySQL table, process the data, and then write the results in a CSV file. The script must run on an Amazon EC2 Linux instance.

Note
The AWS Data Pipeline computational resources (Amazon EMR job flow and Amazon EC2 instance) are not supported on Windows in this release.

• Upload your script to your Amazon S3 bucket.

For more information, see Add an Object to a Bucket in the Amazon Simple Storage Service Getting Started Guide.

• Create another Amazon S3 bucket as a data target.

• Create an Amazon SNS topic for sending email notification and make a note of the topic Amazon Resource Name (ARN). For more information on creating an Amazon SNS topic, see Create a Topic in the Amazon Simple Notification Service Getting Started Guide.


• [Optional] This tutorial uses the default IAM role policies created by AWS Data Pipeline. If you would rather create and configure your IAM role policy and trust relationships, follow the instructions described in Granting Permissions to Pipelines with IAM (p. 21).

Note
Some of the actions described in this tutorial can generate AWS usage charges, depending on whether you are using the AWS Free Usage Tier.

Using the AWS Data Pipeline Console

Topics

• Create and Configure the Pipeline Definition Objects (p. 109)

• Validate and Save Your Pipeline (p. 112)

• Verify your Pipeline Definition (p. 113)

• Activate your Pipeline (p. 113)

• Monitor the Progress of Your Pipeline Runs (p. 114)

• [Optional] Delete your Pipeline (p. 115)

The following sections include the instructions for creating the pipeline using the AWS Data Pipeline console.

To create your pipeline definition

1. Sign in to the AWS Management Console and open the AWS Data Pipeline console.

2. Click Create Pipeline.

3. On the Create a New Pipeline page:

a. In the Pipeline Name box, enter a name (for example, RunDailyScript).

b. In Pipeline Description, enter a description.

c. Leave the Select Schedule Type: button set to the default type for this tutorial.

Note
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

d. Leave the Role boxes set to their default values for this tutorial.

Note
If you have created your own IAM roles and would like to use them in this tutorial, you can select them now.

e. Click Create a new pipeline.

Create and Configure the Pipeline Definition Objects

Next, you define the Activity object in your pipeline definition. When you define the Activity object, you also define the objects that AWS Data Pipeline must use to perform this activity.


1. On the Pipeline: name of your pipeline page, click Add activity.

2. In the Activities pane:

a. Enter the name of the activity; for example, run-my-script.

b. In the Type box, select ShellCommandActivity.

c. In the Schedule box, select Create new: Schedule.

d. In the Add an optional field .. box, select Script Uri.

e. In the Script Uri box, enter the path to your uploaded script; for example, s3://my-script/myscript.txt.

f. In the Add an optional field .. box, select Input.

g. In the Input box, select Create new: DataNode.

h. In the Add an optional field .. box, select Output.

i. In the Output box, select Create new: DataNode.

j. In the Add an optional field .. box, select RunsOn.

k. In the Runs On box, select Create new: Resource.

l. In the Add an optional field .. box, select On Success.

m. In the On Success box, select Create new: Action.

n. In the left pane, separate the icons by dragging them apart.

You've completed defining your pipeline definition by specifying the objects AWS Data Pipeline will use to perform the shell command activity.

The Pipeline: name of your pipeline pane shows the graphical representation of the pipeline you just created. The arrows indicate the connection between the various objects.

Next, configure run date and time for your pipeline.

To configure run date and time for your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click Schedules.

2. In the Schedules pane:

a. Enter a schedule name for this activity (for example, run-mysql-script-schedule).

b. In the Type box, select Schedule.


c. In the Start Date Time box, select the date from the calendar, and then enter the time to start the activity.

Note
AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only.

d. In the Period box, enter the duration for the activity (for example, 1), and then select the period category (for example, Days).

e. [Optional] To specify the date and time to end the activity, in the Add an optional field box, select endDateTime, and enter the date and time.

To get your pipeline to launch immediately, set Start Date Time to a date one day in the past. AWS Data Pipeline will then start launching the "past due" runs immediately in an attempt to address what it perceives as a backlog of work. This backfilling means you don't have to wait an hour to see AWS Data Pipeline launch its first job flow.

Next, configure the input and the output data nodes for your pipeline.

To configure the input and output data nodes of your pipeline

1. On the Pipeline: name of your pipeline page, in the right pane, click DataNodes.

2. In the DataNodes pane:

a. In the DefaultDataNode1 Name box, enter the name for your MySQL data source node (for example, MySQLTableInput).

b. In the Type box, select MySQLDataNode.

c. In the Connection String box, enter the endpoint of your MySQL database instance (for example, mydbinstance.c3frkexample.us-east-1.rds.amazonaws.com).

d. In the Table box, enter the name of the source database table (for example, mysql-input-table).

e. In the Schedule box, select run-mysql-script-schedule.

f. In the *Password box, enter the password you used when you created your MySQL database instance.

g. In the Username box, enter the user name you used when you created your MySQL database instance.

h. In the DefaultDataNode2 Name box, enter the name for the data target node for your CSV file (for example, MySQLScriptOutput).

i. In the Type box, select S3DataNode.

j. In the Schedule box, select run-mysql-script-schedule.

k. In the Add an optional field .. box, select File Path.

l. In the File Path box, enter the path to your Amazon S3 bucket (for example, s3://my-data-pipeline-output/name of your csv file).

Next, configure the resource AWS Data Pipeline must use to run your script.

To configure the resource

1. On the Pipeline: name of your pipeline page, in the right pane, click Resources.

2. In the Resources pane:

a. In the Name box, enter the name for your resource (for example, RunScriptInstance).


b. In the Type box, select Ec2Resource.

c. Leave the Resource Role and Role boxes set to the default values for this tutorial.

d. In the Schedule box, select run-mysql-script-schedule.

Next, configure the SNS notification action AWS Data Pipeline must perform after your script runs successfully.

To configure the SNS notification action

1. On the Pipeline: name of your pipeline page, in the right pane, click Others.

2. In the Others pane:

a. In the DefaultAction1 Name box, enter the name for your Amazon SNS notification (for example, RunDailyScriptNotice).

b. In the Type box, select SnsAlarm.

c. In the Topic Arn box, enter the ARN of your Amazon SNS topic.

d. In the Subject box, enter the subject line for your notification.

e. In the Message box, enter the message content.

f. Leave the entry in the Role box set to default.

You have now completed all the steps required for creating your pipeline definition. Next, validate and save your pipeline.

Validate and Save Your Pipeline

You can save your pipeline definition at any point during the creation process. As soon as you save your pipeline definition, AWS Data Pipeline looks for syntax errors and missing values in your pipeline definition. If your pipeline is incomplete or is incorrect, AWS Data Pipeline throws a validation error. If you plan to continue the creation process later, you can ignore the error message. If your pipeline definition is complete and you are getting the validation error message, you'll have to fix the errors in the pipeline definition before activating your pipeline.

To validate and save your pipeline

1. On the Pipeline: name of your pipeline page, click Save Pipeline.

2. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.

3. If you get an error message, click Close and then, in the right pane, click Errors.

4. The Errors pane lists the objects failing validation.

Click the plus (+) sign next to the object names and look for an error message in red.

5. When you see the error message, click the specific object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

6. After you've fixed the errors listed in the Errors pane, click Save Pipeline.

7. Repeat the process until your pipeline is validated.

Next, verify that your pipeline definition has been saved.


Verify your Pipeline Definition

It is important to verify that your pipeline was correctly initialized from your definitions before you activate it.

To verify your pipeline definition

1. On the Pipeline: name of your pipeline page, click Back to list of pipelines.

2. On the List Pipelines page, check if your newly-created pipeline is listed.

AWS Data Pipeline has created a unique Pipeline ID for your pipeline definition.

The Status column in the row listing your pipeline should show PENDING.

3. Click on the triangle icon next to your pipeline. The Pipeline Summary panel below shows the details of your pipeline runs. Because your pipeline is not yet activated, you should see only 0s at this point.

4. In the Pipeline summary panel, click View all fields to see the configuration of your pipeline definition.

5. Click Close.

Next, activate your pipeline.

Activate your Pipeline

You must activate your pipeline to start creating and processing runs based on the specifications in your pipeline definition.

To activate your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View pipeline.

2. In the Pipeline: name of your pipeline page, click Activate.


A confirmation dialog box opens, confirming the activation.

3. Click Close.

Next, verify if your pipeline is running.

Monitor the Progress of Your Pipeline Runs

To monitor the progress of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details.

2. The Instance details: name of your pipeline page lists the status of each instance in your pipeline definition.

Note
If you do not see instances listed, depending on when your pipeline was scheduled, either click the End (in UTC) date box and change it to a later date or click the Start (in UTC) date box and change it to an earlier date. Then click Update.

3. If the Status column of all the instances in your pipeline indicates FINISHED, your pipeline has successfully completed the shell command activity. You should receive an email about the successful completion of this task, to the account you specified for receiving your Amazon SNS notification.

You can also check your Amazon S3 data target bucket to verify that the data was processed.

4. If the Status column of any of your instances indicates a status other than FINISHED, either your pipeline is waiting for some precondition to be met or it has failed.

a. To troubleshoot the failed or the incomplete instance runs

Click the triangle next to an instance; the Instance summary panel opens to show the details of the selected instance.

b. Click View instance fields to see additional details of the instance. If the status of your selected instance is FAILED, the details box has an entry indicating the reason for failure. For example, @failureReason = Resource not healthy terminated.

c. In the Instance summary pane, in the Select attempt for this instance box, select the attempt number.

d. In the Instance summary pane, click View attempt fields to see details of fields associated with the selected attempt.


5. To take an action on your incomplete or failed instance, select an action (Rerun | Cancel | Mark Finished) from the Action column of the instance.

You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.

For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

Important
Your pipeline is running and incurring charges. For more information, see AWS Data Pipeline pricing. If you would like to stop incurring the AWS Data Pipeline usage charges, delete your pipeline.

[Optional] Delete your Pipeline

Deleting your pipeline deletes the pipeline definition including all the associated objects. You stop incurring charges as soon as your pipeline is deleted.

To delete your pipeline

1. In the List Pipelines page, click the check box next to your pipeline.

2. Click Delete.

3. In the confirmation dialog box, click Delete to confirm the delete request.


Manage Pipelines

Topics

• Using AWS Data Pipeline Console (p. 116)

• Using the Command Line Interface (p. 121)

You can use either the AWS Data Pipeline console or the AWS Data Pipeline command line interface (CLI) to view the details of your pipeline or to delete your pipeline.

Using AWS Data Pipeline Console

Topics

• View pipeline definition (p. 116)

• View details of each instance in an active pipeline (p. 117)

• Modify pipeline definition (p. 119)

• Delete a Pipeline (p. 121)

With the AWS Data Pipeline console, you can:

• View the pipeline definition of any pipeline associated with your account

• View the details of each instance in your pipeline and use the information to troubleshoot a failed instance run

• Modify pipeline definition

• Delete pipeline

The following sections walk you through the steps for managing your pipeline. Before you begin, be sure that you have at least one pipeline associated with your account, have access to the AWS Management Console, and have opened the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/.

View pipeline definition

If you are signed in and have opened the AWS Data Pipeline console, your screen shows a list of pipelines associated with your account.


The Status column in the pipeline listing displays the current state of your pipelines. A pipeline is SCHEDULED if the pipeline definition has passed validation and is activated, is currently running, or has completed its run. A pipeline is PENDING if the pipeline definition is incomplete or might have failed the validation step that all pipelines go through before being saved.

If you want to modify or complete your pipeline definition, see Modify pipeline definition (p. 119).

To view the pipeline definition of your pipeline

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details (or click View pipeline, if PENDING).

2. If your pipeline is SCHEDULED,

a. On the Instance details: name of your pipeline page, click View pipeline.

b. The Pipeline: name of your pipeline [The pipeline is active.] page opens.

This is your pipeline definition page. As indicated in the title of the page, this pipeline is active.

3. To view the object definitions in your pipeline definition, on the Pipeline: name of your pipeline page, click the object icons in the design pane. The corresponding object pane on the right panel opens.

4. You can also click the object panes on the right panel to view the objects and the associated fields.

5. If your pipeline definition graph does not fit in the design pane, use the pan buttons on the right side of the design pane to slide the canvas.

6. Click Back to list of pipelines to get back to the List Pipelines page.

View details of each instance in an active pipeline

If you are signed in and have opened the AWS Data Pipeline console, your screen looks similar to this:


The Status column in the pipeline listing displays the current state of your pipelines. Your pipeline is active if the status is SCHEDULED. A pipeline is in SCHEDULED state if the pipeline definition has passed validation and is activated, is currently running, or has completed its run. You can view the pipeline definition, the runs list, and the details of each run of an active pipeline. For information on modifying an active pipeline, see Modify pipeline definition (p. 119).

To retrieve the details of your active pipeline

1. On the List Pipelines page, identify your active pipeline, and then click the small triangle that is next to the pipeline ID.

2. In the Pipeline summary pane, click View fields to see additional information on your pipeline definition.

3. Click Close to close the View fields box, and then click the triangle of your active pipeline again toclose the Pipeline Summary pane.

4. In the row that lists your active pipeline, click View instance details.

5. The Instance details: name of your pipeline page lists all the instances of your active pipeline.

Note
If you do not see the list of instances, click the End (in UTC) date box, change it to a later date, and then click Update.

6. You can also use the Filter Object, Start, or End date-time fields to filter the number of instances returned based on either their current status or the date range in which they were launched. Filtering the results is useful because, depending on the pipeline age and scheduling, the instance run history can be very large.

7. If the Status column of all the runs in your pipeline displays the FINISHED state, your pipeline has successfully completed running.


If the Status column of any one of your runs indicates a status other than FINISHED, your pipeline is either running, waiting for some precondition to be met, or has failed.

8. Click the triangle next to an instance to show the details of the selected instance.

9. In the Instance summary pane. click View instance fields to see details of fields associated withthe selected instance. If the status of your selected instance is FAILED, the additional details boxhas an entry indicating the reason for failure. For example, @failureReason = Resource nothealthy terminated.

10. In the Instance summary pane, in the Select attempt for this instance box, select the attemptnumber.

11. In the Instance summary pane, click View attempt fields to see details of fields associated withthe selected attempt.

12. To take an action on your incomplete or failed instance, select an action (Rerun|Cancel|MarkFinished) from the Action column of the instance.

13. You can use the information in the Instance summary pane and the View instance fields box to troubleshoot issues with your failed pipeline.

For more information about instance status, see Interpret Pipeline Status Details (p. 129). For more information about troubleshooting the failed or incomplete instance runs of your pipeline, see AWS Data Pipeline Problems and Solutions (p. 131).

14. Click Back to list of pipelines to get back to the List Pipelines page.

Modify pipeline definition

If your pipeline is in a PENDING state, either your pipeline definition is incomplete or your pipeline might have failed the validation step that all pipelines go through before saving. If your pipeline is active, you may need to change some aspect of it. However, if you are modifying the pipeline definition of an active pipeline, you must keep in mind the following rules:

• Cannot change the Default object

• Cannot change the schedule of an object

• Cannot change the dependencies between objects

• Cannot add, delete, or modify reference fields for existing objects; only changes to non-reference fields are allowed

• New objects cannot reference a previously existing object in their output field; only references in the input field are allowed

Follow the steps in this section to either complete or modify your pipeline definition.


To modify your pipeline definition

1. On the List Pipelines page, in the Details column of your pipeline, click View instance details (or click View pipeline, if PENDING).

2. If your pipeline is SCHEDULED,

a. On the Instance details: name of your pipeline page, click View pipeline.

b. The Pipeline: name of your pipeline [The pipeline is active.] page opens.

This is your pipeline definition page. As indicated in the title of the page, this pipeline is active.

3. To complete or modify your pipeline definition:

a. On the Pipeline: name of your pipeline page, click the object panes in the right side panel and finish defining the objects and fields of your pipeline definition.

Note
If you are modifying an active pipeline, some fields are grayed out and inactive. You cannot modify those fields.

b. Skip the next step and follow the steps to validate and save your pipeline definition.

4. To edit your pipeline definition:

a. On the Pipeline: name of your pipeline page, click the Errors pane. The Errors pane lists the objects of your pipeline that failed validation.

b. Click on the plus (+) sign next to the object names and look for an error message in red.

c. Click the object pane where you see the error and fix it. For example, if you see an error message in the DataNodes object, click the DataNodes pane to fix the error.

To validate and save your pipeline definition

1. Click Save Pipeline. AWS Data Pipeline validates your pipeline definition and returns either a success or an error message.

2. If you get an Error! message, click Close and then, on the right side panel, click Errors to see the objects that did not pass validation.

Fix the errors and save. Repeat this step until your pipeline definition passes validation.


Activate and verify your pipeline

1. After you've saved your pipeline definition with no validation errors, click Activate.

2. To verify that your pipeline definition has been activated, click Back to list of pipelines.

3. On the List Pipelines page, check whether your newly-created pipeline is listed and the Status column displays SCHEDULED.

Delete a Pipeline

When you no longer require a pipeline, such as a pipeline created during application testing, you should delete it to remove it from active use. Deleting a pipeline puts it into a deleting state. When the pipeline is in the deleted state, its pipeline definition and run history are gone. Therefore, you can no longer perform operations on the pipeline, including describing it.

You can't restore a pipeline after you delete it, so be sure that you won’t need the pipeline in the future before you delete it.

To delete your pipeline

1. On the List Pipelines page, click the check box next to your pipeline.

2. Click Delete.

3. In the confirmation dialog box, click Delete to confirm the delete request.

Using the Command Line Interface

Topics

• Install the AWS Data Pipeline Command-Line Client (p. 122)

• Command-Line Syntax (p. 122)

• Setting Credentials for the AWS Data Pipeline Command Line Interface (p. 122)

• List Pipelines (p. 124)

• Create a New Pipeline (p. 124)

• Retrieve Pipeline Details (p. 124)

• View Pipeline Versions (p. 125)

• Modify a Pipeline (p. 126)

• Delete a Pipeline (p. 126)

The AWS Data Pipeline Command-Line Client (CLI) is one of three ways to interact with AWS Data Pipeline. The other two are the AWS Data Pipeline console (a graphical user interface) and the APIs (calling functions in the AWS Data Pipeline SDK). For more information, see What is AWS Data Pipeline? (p. 1).


Install the AWS Data Pipeline Command-Line Client

Install the AWS Data Pipeline Command-Line Client (CLI) as described in Install the Command Line Interface (p. 15).

Command-Line Syntax

Use the AWS Data Pipeline CLI from your operating system's command-line prompt by typing the CLI tool name "datapipeline" followed by one or more parameters. However, the syntax of the command is different on Linux/Unix/Mac OS compared to Windows. Linux/Unix/Mac users must use the "./" prefix for the CLI command, and Windows users must specify "ruby" before the CLI command. For example, to view the CLI help text on Linux/Unix/Mac, the syntax is:

./datapipeline --help

However, to perform the same action on Windows, the syntax of the command is:

ruby datapipeline --help

Other than the prefix, the AWS Data Pipeline CLI syntax is the same between operating systems.

Note
For brevity, we do not list all the operating system syntax permutations for each example in this documentation. Instead, we simply refer to commands as in the following example.

datapipeline --help

Setting Credentials for the AWS Data Pipeline Command Line Interface

To connect to the AWS Data Pipeline web service and process your commands, the CLI needs the account details of an AWS account that has permissions to create and/or manage data pipelines. There are three ways to pass your credentials into the CLI:

• Implicitly, using a JSON file.

• Explicitly, by specifying a JSON file at the command line.

• Explicitly, by specifying credentials using a series of command-line options.

To Set Your Credentials Implicitly with a JSON File

• The easiest and most common way is implicitly, by creating a JSON file named credentials.json in either your home directory or the directory where the CLI is installed. For example, when you use the CLI with Windows, the folder may be c:\datapipeline-cli\amazon\datapipeline. When you do this, the CLI loads the credentials implicitly and you do not need to specify any credential information at the command line. Verify the credentials file syntax using the following example JSON file, where you replace the example access-id and private-key values with your own:

{ "access-id": "AKIAIOSFODNN7EXAMPLE", "private-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", "endpoint": "datapipeline.us-east-1.amazonaws.com",

API Version 2012-10-29122

AWS Data Pipeline Developer GuideInstall the AWS Data Pipeline Command-Line Client

"port": "443", "use-ssl": "true", "region": "us-east-1", "log-uri": "log-uri": "s3://myawsbucket/logfiles"}

After setting your credentials file, test the CLI using the following command, which uses implicit credentials to call the CLI and display a list of all the data pipelines those credentials can access.

datapipeline --list-pipelines

To Set Your Credentials Explicitly with a JSON File

• The next easiest way to pass in credentials is explicitly, using the --credentials option to specify the location of a JSON file. With this method, you’ll have to add the --credentials option to each command-line call. This can be useful if you are connecting to a machine through SSH to run the command-line client remotely, or if you are testing various sets of credentials. For information about what to include in the JSON file, go to How to Format the JSON File.

For example, the following command explicitly uses the credentials stored in a JSON file to call the CLI and display a list of all the data pipelines those credentials can access.

datapipeline --credentials /my-directory/my-credentials.json --list-pipelines

To Set Your Credentials Using Command-Line Options

• The final way to pass in credentials is to specify them using a series of options at the command line. This is the most verbose way to pass in credentials, but may be useful if you are scripting the CLI and want the flexibility of changing credential information without having to edit a JSON file. The options you’ll use in this scenario are --access-key, --secret-key, and --endpoint.

For example, the following command explicitly uses credentials specified at the command line to call the CLI and display a list of all the data pipelines those credentials can access.

datapipeline --access-key my-access-key-id --secret-key my-secret-access-key --endpoint datapipeline.us-east-1.amazonaws.com --list-pipelines

In the preceding command, my-access-key-id would be replaced with your AWS Access Key ID, my-secret-access-key would be replaced with your AWS Secret Access Key, and --endpoint would specify the endpoint the CLI should use when contacting the AWS Data Pipeline web service. For more information about how to locate your AWS security credentials, go to Locating Your AWS Security Credentials.

Note
Because you are passing in your credentials at every command-line call, you may wish to take additional security precautions to ensure the privacy of your command-line calls, such as clearing auto-complete when you are done with your terminal session. You also should not store this script in an unsecured file.
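
If you script the CLI as described above, one approach is a small wrapper script that passes the credential options on every call. The following is a minimal sketch, not part of the product; the key values, script name, and endpoint are placeholders, and the security caution in the preceding note applies to any file that contains keys.

#!/bin/sh
# Hypothetical wrapper around the AWS Data Pipeline CLI that supplies
# credentials explicitly on each call. Replace the placeholder values
# with your own, and do not store this file in an unsecured location.
ACCESS_KEY="my-access-key-id"
SECRET_KEY="my-secret-access-key"
ENDPOINT="datapipeline.us-east-1.amazonaws.com"

./datapipeline --access-key "$ACCESS_KEY" --secret-key "$SECRET_KEY" \
  --endpoint "$ENDPOINT" "$@"

You could then run, for example, ./dp.sh --list-pipelines, where dp.sh is whatever name you save the script under.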


List Pipelines

A simple example that also helps you confirm that your credentials are set correctly is to get a list of the currently running pipelines using the --list-pipelines command. This command returns the names and identifiers of all pipelines that you have permission to access.

datapipeline --list-pipelines

This is how you get the ID of pipelines that you want to work with using the CLI, because many commands require you to specify the pipeline ID using the --id parameter.

For more information, see --list-pipelines (p. 221).

Create a New Pipeline

The first step to create a new data pipeline is to define your data activities and their data dependencies using a pipeline definition file. The syntax and usage of the pipeline definition is described in Pipeline Definition Language Reference in the AWS Data Pipeline Developer’s Guide.

Once you’ve written the details of the new pipeline using JSON syntax, save them to a text file with the extension .json. You’ll then specify this pipeline definition file as part of the input when creating the new pipeline.
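
For illustration only, a minimal pipeline definition file might look like the following sketch. It borrows the Default, Schedule, and ShellCommandActivity objects used in the examples later in this guide; the ids, dates, worker group, and command are placeholders, not required values.

{
  "objects" : [
    {
      "id" : "Default",
      "workerGroup" : "myWorkerGroup"
    },
    {
      "id" : "MySchedule",
      "type" : "Schedule",
      "startDateTime" : "2012-11-12T00:00:00",
      "endDateTime" : "2012-11-13T00:00:00",
      "period" : "1 day"
    },
    {
      "id" : "MyActivity",
      "type" : "ShellCommandActivity",
      "schedule" : {"ref" : "MySchedule"},
      "command" : "echo hello"
    }
  ]
}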

After creating your pipeline definition file, you can create a new pipeline by calling the --create (p. 217) action of the AWS Data Pipeline CLI, as shown below.

datapipeline --create my-pipeline --put my-pipeline-file.json

If you leave off the --put option, as shown following, AWS Data Pipeline creates an empty pipeline. You can then use a subsequent --put call to attach a pipeline definition to the empty pipeline.

datapipeline --create pipeline_name

The --put parameter does not activate a pipeline by default. You must explicitly activate a pipeline before it will begin doing work, using the --activate command and specifying a pipeline ID as shown below.

datapipeline --activate --id pipeline_id

For more information about creating pipelines, see the --create (p. 217) and --put actions.

Retrieve Pipeline Details

Using the CLI, you can retrieve all the information about a pipeline, which includes the pipeline definition and the run attempt history of the pipeline components.

Retrieving the Pipeline Definition

To get the complete pipeline definition, use the --get command. The pipeline objects are returned in alphabetical order, not in the order in which they appear in the pipeline definition file that you uploaded, and the slots for each object are also returned in alphabetical order.

You can specify an output file to receive the pipeline definition, but the default is to print the information to standard output (which is typically your terminal screen).


The following example prints the pipeline definition to a file named output.txt.

datapipeline --get --file output.txt --id df-00627471SOVYZEXAMPLE

The following example prints the pipeline definition to standard output (stdout).

datapipeline --get --id df-00627471SOVYZEXAMPLE

It's a good idea to retrieve the pipeline definition before you submit modifications, because it’s possible that another user or process changed the pipeline definition after you last worked with it. By downloading a copy of the current definition and using that as the basis for your modifications, you can be sure that you are working with the most recent pipeline definition.

It’s also a good idea to retrieve the pipeline definition again after you modify it to ensure that the updatewas successful.

For more information, see --get, --g (p. 219).

Retrieving the Pipeline Run History

To retrieve a history of the times that a pipeline has run, use the --list-runs command. This command has options that you can use to filter the number of runs returned based on either their current status or the date-range in which they were launched. Filtering the results is useful because, depending on the pipeline’s age and scheduling, the run history can be very large.

This example shows how to retrieve information for all runs.

datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

This example shows how to retrieve information for all runs that have completed.

datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --status finished

This example shows how to retrieve information for all runs launched in the specified time frame.

datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --start-interval "2011-09-02", "2011-09-11"

For more information, see --list-runs (p. 222).

View Pipeline Versions

There are two versions of a pipeline that you can view with the CLI. The "active" version is the version of the pipeline that is currently running. The "latest" version is created when you edit a running pipeline; it begins as a copy of the "active" pipeline until you edit it. When you upload the edited pipeline, it becomes the "active" version and the previous "active" version is no longer accessible. A new "latest" version is created if you edit the pipeline again, repeating the previously described cycle.

To retrieve a specific version of a pipeline, use the --version parameter with the --get command, specifying the version name of the pipeline. For example, the following command retrieves the "active" version of a pipeline.

datapipeline --get --version active --id df-00627471SOVYZEXAMPLE
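
Based on the version names described above, the "latest" version can presumably be retrieved the same way by substituting the version name; treat the following as a sketch rather than confirmed syntax.

datapipeline --get --version latest --id df-00627471SOVYZEXAMPLE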


For more information, see --get, --g (p. 219).

Modify a Pipeline

After you’ve created a pipeline, you may need to change some aspect of it. To do this, get the current pipeline definition and save it to a file, update the pipeline definition file, and upload the updated pipeline definition to AWS Data Pipeline using the --put command.

The following rules apply when you modify a pipeline definition:

• Cannot change the Default object

• Cannot change the schedule of an object

• Cannot change the dependencies between objects

• Cannot add, delete, or modify reference fields for existing objects; only changes to non-reference fields are allowed

• New objects cannot reference a previously existing object in their output field; only references in the input field are allowed

It's a good idea to retrieve the pipeline definition before you submit modifications, because it’s possible that another user or process changed the pipeline definition after you last worked with it. By downloading a copy of the current definition and using that as the basis for your modifications, you can be sure that you are working with the most recent pipeline definition.

The following example prints the pipeline definition to a file named output.txt.

datapipeline --get --file output.txt --id df-00627471SOVYZEXAMPLE

Update your pipeline definition file and save it as my-updated-file.txt. The following example uploads the updated pipeline definition.

datapipeline --put my-updated-file.txt --id df-00627471SOVYZEXAMPLE

You can retrieve the pipeline definition using --get to ensure that the update was successful.

When you use --put to replace the pipeline definition file, the previous pipeline definition is completely replaced. Currently there is no way to change only a portion, such as a single object, of a pipeline; you must include all previously defined objects in the updated pipeline definition.

For more information, see --put (p. 224) and --get, --g (p. 219).

Delete a Pipeline

When you no longer require a pipeline, such as a pipeline created during application testing, you should delete it to remove it from active use. Deleting a pipeline puts it into a deleting state. When the pipeline is in the deleted state, its pipeline definition and run history are gone. Therefore, you can no longer perform operations on the pipeline, including describing it.

You can't restore a pipeline after you delete it, so be sure that you won’t need the pipeline in the future before you delete it.

To delete a pipeline, use the --delete command, specifying the identifier of the pipeline. For example, the following command deletes a pipeline.

datapipeline --delete --id df-00627471SOVYZEXAMPLE


For more information, see --delete (p. 218).


Troubleshoot AWS Data Pipeline

Topics

• Proactively Monitor Your Pipeline (p. 128)

• Verify Your Pipeline Status (p. 129)

• Interpret Pipeline Status Details (p. 129)

• Error Log Locations (p. 130)

• AWS Data Pipeline Problems and Solutions (p. 131)

When you have a problem while using AWS Data Pipeline, the most common symptom is that a pipeline won't run. Since there are several possible causes, this topic explains how to track the status of your AWS Data Pipeline pipelines, get notifications when problems occur, and gather more information. After you have enough information to narrow the list of potential problems, this topic guides you to solutions. To get the most benefit from these troubleshooting steps and scenarios, you should use the console or CLI to gather the required information.

Proactively Monitor Your Pipeline

The best way to detect problems is to monitor your pipelines proactively from the start. You can configure pipeline components to inform you of certain situations or events, such as when a pipeline component fails or doesn't begin by its scheduled start time. AWS Data Pipeline makes it easy to configure notifications using Amazon SNS.

Using the AWS Data Pipeline CLI, you can configure a pipeline component to send Amazon SNS notifications on failures. Add the following code to your pipeline definition JSON file. This example also demonstrates how to use the AWS Data Pipeline expression language to insert details about the specific execution attempt, denoted by the #{node.interval.start} and #{node.interval.end} variables:

Note
You must create an Amazon SNS topic to use for the Topic ARN value in the following example. For more information, see the Create a Topic documentation at http://docs.aws.amazon.com/sns/latest/gsg/CreateTopic.html.

{ "id" : "FailureNotify", "type" : "SnsAlarm",

API Version 2012-10-29128

AWS Data Pipeline Developer GuideProactively Monitor Your Pipeline

"subject" : "Failed to run pipeline component", "message": "Error for interval #{node.interval.start}..#{node.inter val.end}.", "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"},

You must also associate the notification with the pipeline component that you want to monitor, as shown by the following example. In this example, using the onFail action, the component sends a notification if a file doesn't exist in an Amazon S3 bucket:

{ "id": "S3Data", "type": "S3DataNode", "schedule": { "ref": "MySchedule" }, "filePath": "s3://mys3bucket/file.txt", "precondition": {"ref":"ExampleCondition"}, "onFail": {"ref":"FailureNotify"}},

Using the steps mentioned in the previous examples, you can also use the onLateAction and onSuccess pipeline component fields to notify you when a pipeline component has not been scheduled on time or has succeeded, respectively. You should configure notifications for any critical tasks in a pipeline; if you add notifications to the Default object in a pipeline, they automatically apply to all components in that pipeline. Pipeline components get the ability to send notifications through their IAM roles. Do not modify the default IAM roles unless your situation demands it; otherwise, notifications may not work.
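
As a sketch of the Default-object approach described above (assuming the FailureNotify alarm defined earlier in this section), the following fragment applies the failure notification to every component in the pipeline:

{
  "id" : "Default",
  "onFail" : {"ref" : "FailureNotify"}
},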

Verify Your Pipeline Status

When you notice a problem with a pipeline, use the console or CLI to check the status of your pipeline components and look for error messages.

To locate your pipeline ID using the CLI, run this command:

ruby datapipeline --list-pipelines

After you have the pipeline ID, view the status of the pipeline components using the CLI. In this example, replace the example pipeline ID with your own:

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE

On the list of pipeline components, look at the status column of each component and pay special attention to any components that indicate a status of FAILED, WAITING_FOR_RUNNER, or CANCELLED. Additionally, look at the Scheduled Start column and match it with a corresponding value for the Actual Start column to ensure that the tasks occur with the timing that you expect.
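
To narrow the listing to problem components, you can filter by status in the same way as the earlier --status finished example; whether failed is accepted as a filter value here is an assumption based on the status names above.

ruby datapipeline --list-runs --id df-AKIAIOSFODNN7EXAMPLE --status failed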

Interpret Pipeline Status Details

The various status levels displayed in the AWS Data Pipeline console and CLI indicate the condition of a pipeline and its components. Pipelines have a SCHEDULED status if they have passed validation and are ready, currently performing work, or done with their work. PENDING status means the pipeline is not able to perform work for some reason; for example, the pipeline definition might be incomplete or might have failed the validation step that all pipelines go through before activation. The pipeline status is simply an overview of a pipeline; to see more information, view the status of individual pipeline components. You can do this by clicking through a pipeline in the console or retrieving pipeline component details using the CLI.

A pipeline component has the following available status values:

CHECKING_PRECONDITIONS
    The component is checking to ensure that all its default and user-configured preconditions are met before performing its work.

WAITING_FOR_RUNNER
    The component is waiting for its worker client to retrieve it as a work item. The component and worker client relationship is controlled by the runsOn or the workerGroup field defined by that component.

CREATING
    The component or resource is in the process of being started, such as an Amazon EC2 instance.

VALIDATING
    The pipeline definition is in the process of being validated by AWS Data Pipeline.

RUNNING
    The resource is running and ready to receive work.

CANCELLED
    The component was pre-emptively cancelled by a user or AWS Data Pipeline before it could run. This can happen automatically when a failure occurs in a different component or resource that this component depends on.

PAUSED
    The component has been paused and is not currently performing work.

FINISHED
    The component has completed its assigned work.

SHUTTING_DOWN
    The resource is shutting down after successfully performing its defined work.

FAILED
    The component or resource encountered an error and stopped working. When a component or resource fails, it can cause cancellations and failures to cascade to other components that depend on it.

Error Log Locations

This section explains the various logs that AWS Data Pipeline writes, which you can use to determine the source of certain failures and errors.

Task Runner Logs

Task Runner writes a log file named TaskRunner.log on the local computer where it runs, in the <your home directory>/output/logs directory, where <your home directory> is the directory where you extracted the AWS Data Pipeline CLI tools. In this directory, Task Runner also creates several nested directories that are named after the pipeline ID that it ran, with subdirectories for the year, month, day, and attempt number in the format <pipeline ID>/<year>/<month>/<day>/<pipeline object attempt ID_Attempt=X>. In these folders, Task Runner writes three files:

• <Pipeline Attempt ID>_Attempt_<number>_main.log.gz - This archive logs the step-by-step executionof Task Runner work items (both succeeded and failed) along with any error messages that weregenerated.

• <Pipeline Attempt ID>_Attempt_<number>_stderr.log.gz - This archive logs only error messagesthat occurred while Task Runner processed tasks.


• <Pipeline Attempt ID>_Attempt_<number>_stdout.log.gz - This log provides any standard outputtext if provided by certain tasks.

Pipeline Logs

You can configure pipelines to create log files in a location, such as in the following example, where you use the Default object in a pipeline to cause all pipeline components to use that log location by default (you can override this by configuring a log location in a specific pipeline component).

To configure the log location using the AWS Data Pipeline CLI in a pipeline JSON file, begin your pipeline file with the following text:

{ "objects": [{ "id":"Default", "logUri":"s3://mys3bucket/error_logs"},...

After you configure a pipeline log directory, Task Runner creates a copy of the logs in your directory, with the same formatting and file names described in the previous section about Task Runner logs.
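
The per-component override mentioned above might look like the following sketch; the activity shown and its bucket path are placeholders, not values from a tested pipeline.

{
  "id" : "MyCopyActivity",
  "type" : "CopyActivity",
  "logUri" : "s3://mys3bucket/copy_activity_logs"
},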

AWS Data Pipeline Problems and Solutions

This topic provides various symptoms of AWS Data Pipeline problems and the recommended steps to solve them.

Pipeline Stuck in Pending Status

A pipeline that appears stuck in the PENDING status indicates a fundamental error in the pipeline definition. Ensure that you did not receive any errors when you submitted your pipeline using the AWS Data Pipeline CLI or when you attempted to save or activate your pipeline using the AWS Data Pipeline console. Additionally, check that your pipeline has a valid definition.

To view your pipeline definition on the screen using the CLI:

ruby datapipeline --get --id df-EXAMPLE_PIPELINE_ID

Ensure that the pipeline definition is complete, check your closing braces, verify required commas, and check for missing references and other syntax errors. It is best to use a text editor that can visually validate the syntax of JSON files.
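
One quick way to check the JSON syntax before submitting a definition is to run it through a JSON parser. The following sketch assumes Python is installed on your computer; my-pipeline-file.json is a placeholder for your own file, and the command prints the parsed document or reports the location of the first syntax error.

python -m json.tool my-pipeline-file.json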

Pipeline Component Stuck in Waiting for Runner Status

If your pipeline is in the SCHEDULED state and one or more tasks appear stuck in the WAITING_FOR_RUNNER state, ensure that you set a valid value for either the runsOn or workerGroup fields for those tasks. If both values are empty or missing, the task cannot start because there is no association between the task and a worker to perform the tasks. In this situation, you've defined work but haven't defined what computer will do that work. If applicable, verify that the workerGroup value assigned to the pipeline component is exactly the same name and case as the workerGroup value that you configured for Task Runner.

Another potential cause of this problem is that the endpoint and access key provided to Task Runner are not the same as those used by the AWS Data Pipeline console or the computer where the AWS Data Pipeline CLI tools are installed. You might have created new pipelines with no visible errors, but Task Runner polls the wrong location due to the difference in credentials, or polls the correct location with insufficient permissions to identify and run the work specified by the pipeline definition.
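
The following fragment sketches the task-to-worker association described above. The activity id, command, and schedule are placeholders; the important point is that the workerGroup value must match, in name and case, the worker group that Task Runner was configured to poll.

{
  "id" : "MyShellActivity",
  "type" : "ShellCommandActivity",
  "schedule" : {"ref" : "MySchedule"},
  "command" : "echo hello",
  "workerGroup" : "myWorkerGroup"
},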

Pipeline Component Stuck in Checking Preconditions Status

If your pipeline is in the SCHEDULED state and one or more tasks appear stuck in the CHECKING_PRECONDITIONS state, make sure your pipeline's initial preconditions have been met. If the preconditions of the first object in the logic chain are not met, none of the objects that depend on that first object will be able to move out of the CHECKING_PRECONDITIONS state.

For example, consider the following excerpt from a pipeline definition. In this case, the InputData object has a precondition 'Ready' specifying that the data must exist before the InputData object is complete. If the data does not exist, the InputData object remains in the CHECKING_PRECONDITIONS state, waiting for the data specified by the filePath field to become available. Any objects that depend on InputData likewise remain in a CHECKING_PRECONDITIONS state, waiting for the InputData object to reach the FINISHED state.

{ "id": "InputData", "type": "S3DataNode", "filePath": "s3://elasticmapreduce/samples/wordcount/wordSplitter.py", "schedule":{"ref":"MySchedule"}, "precondition": "Ready" },{ "id": "Ready", "type": "Exists"...

Also, check that your objects have the proper permissions to access the data. In the preceding example, if the information in the credentials field did not have permissions to access the data specified in the filePath field, the InputData object would get stuck in a CHECKING_PRECONDITIONS state because it cannot access that data, even if the data exists.

Run Doesn't Start When Scheduled

Check that you have properly specified the dates in your schedule objects and that the startDateTime and endDateTime values are in UTC format, such as in the following example:

{ "id": "MySchedule", "startDateTime": "2012-11-12T19:30:00", "endDateTime":"2012-11-12T20:30:00", "period": "1 Hour", "type": "Schedule"},


Pipeline Components Run in Wrong Order

You might notice that the start and end times of your pipeline components are running in the wrong order, or in a different sequence than you expect. It is important to understand that pipeline components can start running simultaneously if their preconditions are met at start-up time. In other words, pipeline components do not execute sequentially by default; if you need a specific execution order, you must control the execution order with preconditions and dependsOn fields. Verify that the dependsOn field is populated with a reference to the correct prerequisite pipeline components, and that all the necessary pointers between components are present to achieve the order you require.
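
A sketch of ordering two activities with the dependsOn field follows; the ids and script paths are placeholders, and the fragment assumes both activities share a schedule defined elsewhere in the pipeline. ProcessData does not start until PrepareData finishes.

{
  "id" : "PrepareData",
  "type" : "ShellCommandActivity",
  "schedule" : {"ref" : "MySchedule"},
  "command" : "/home/userName/prepare.sh"
},
{
  "id" : "ProcessData",
  "type" : "ShellCommandActivity",
  "schedule" : {"ref" : "MySchedule"},
  "command" : "/home/userName/process.sh",
  "dependsOn" : {"ref" : "PrepareData"}
},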

EMR Cluster Fails With Error: The security token included in the request is invalid

Verify your IAM roles, policies, and trust relationships as described in Granting Permissions to Pipelines with IAM (p. 21).

Insufficient Permissions to Access Resources

Permissions that you set on IAM roles determine whether AWS Data Pipeline can access your EMR clusters and EC2 instances to run your pipelines. Additionally, IAM provides the concept of trust relationships that go further to allow creation of resources on your behalf. For example, when you create a pipeline that uses an EC2 instance to run a command to move data, AWS Data Pipeline can provision this EC2 instance for you. If you encounter problems, especially those involving resources that you can access manually but AWS Data Pipeline cannot, verify your IAM roles, policies, and trust relationships as described in Granting Permissions to Pipelines with IAM (p. 21).

Creating a Pipeline Causes a Security Token Error

You receive the following error when you try to create a pipeline:

Failed to create pipeline with 'pipeline_name'. Error: UnrecognizedClientException - The security token included in the request is invalid.

Cannot See Pipeline Details in the Console

The AWS Data Pipeline console pipeline filter applies to the scheduled start date for a pipeline, without regard to when the pipeline was submitted. It is possible to submit a new pipeline using a scheduled start date that occurs in the past, which the default date filter may not show. To see the pipeline details, change your date filter to ensure that the scheduled pipeline start date fits within the date range filter.

Error in remote runner Status Code: 404, AWS Service: Amazon S3

This error means that Task Runner could not access your files in Amazon S3. Verify that:

• You have credentials correctly set

• The Amazon S3 bucket that you are trying to access exists

• You are authorized to access the Amazon S3 bucket


Access Denied - Not Authorized to Perform Function datapipeline:

In the Task Runner logs, you may see an error that is similar to the following:

• ERROR Status Code: 403

• AWS Service: DataPipeline

• AWS Error Code: AccessDenied

• AWS Error Message: User: arn:aws:sts::XXXXXXXXXXXX:federated-user/i-XXXXXXXX is not authorized to perform: datapipeline:PollForWork.

Note
In this error message, PollForWork may be replaced with the names of other AWS Data Pipeline permissions.

This error message indicates that the IAM role you specified needs the additional permissions necessary to interact with AWS Data Pipeline. Ensure that your IAM role policy contains the following lines, where PollForWork is replaced with the name of the permission you want to add (use * to grant all permissions):

{"Action": [ "datapipeline:PollForWork" ],"Effect": "Allow","Resource": ["*"]}


Pipeline Definition Files

Topics

• Creating Pipeline Definition Files (p. 135)

• Example Pipeline Definitions (p. 139)

• Simple Data Types (p. 153)

• Expression Evaluation (p. 155)

• Objects (p. 161)

The AWS Data Pipeline web service receives a pipeline definition file as input. This file specifies objects for the data nodes, activities, schedules, and computational resources for the pipeline.

Creating Pipeline Definition Files

To create a pipeline definition file, you can use either the AWS Data Pipeline console interface or a text editor that supports saving files using the UTF-8 file format.

This topic describes creating a pipeline definition file using a text editor.

Topics

• Prerequisites (p. 135)

• General Structure of a Pipeline Definition File (p. 136)

• Pipeline Objects (p. 136)

• Pipeline fields (p. 136)

• User-Defined Fields (p. 137)

• Expressions (p. 138)

• Saving the Pipeline Definition File (p. 139)

Prerequisites

Before you create your pipeline definition file, you should determine the following:

• Objectives and tasks you need to accomplish

• Location and format of your source data (data nodes) and how often you update them


• Calculations or changes to the data (activities) you need

• Dependencies and checks (preconditions) that indicate when tasks are ready to run

• Frequency (schedule) you need for the pipeline to run

• Validation tests to confirm your data reached the destination

• How you want to be notified about success and failure

• Performance, volume, and runtime goals that suggest using other AWS services like EMR to process your data

General Structure of a Pipeline Definition File

The first step in pipeline creation is to compose pipeline definition objects in a pipeline definition file. The following example illustrates the general structure of a pipeline definition file. This file defines two objects, which are delimited by '{' and '}', and separated by a comma. The first object defines two name-value pairs, known as fields. The second object defines three fields.

{ "objects" : [ { "name1" : "value1", "name2" : "value2" }, { "name1" : "value3", "name3" : "value4", "name4" : "value5" } ]}

Pipeline Objects

When creating a pipeline definition file, you must select the types of pipeline objects that you'll need, add them to the pipeline definition file, and then add the appropriate fields. For more information about pipeline objects, see Objects (p. 161).

For example, you could create a pipeline definition object for an input data node and another for the output data node. Then create another pipeline definition object for an activity, such as processing the input data using Amazon EMR.

Pipeline fields

After you know which object types to include in your pipeline definition file, you add fields to the definition of each pipeline object. Field names are enclosed in quotes, and are separated from field values by a space, a colon, and a space, as shown in the following example.

"name" : "value"

The field value can be a text string, a reference to another object, a function call, an expression, or an ordered list of any of the preceding types. For more information about the types of data that can be used for field values, see Simple Data Types (p. 153). For more information about functions that you can use to evaluate field values, see Expression Evaluation (p. 155).
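
The following fragment sketches each kind of field value in a single object. The field names and referenced ids are placeholders (user-defined names must begin with "my" as described later in this section); the expression reuses the format function shown under Expression Evaluation, and the list uses the same JSON array form as the securityGroups field in the Copy Data from Amazon S3 to MySQL example.

{
  "id" : "ExampleObject",
  "myStringField" : "a text string",
  "myReferenceField" : {"ref" : "AnotherObject"},
  "myExpressionField" : "#{format(myDateTime,'YYYY-MM-dd hh:mm:ss')}",
  "myListField" : ["value1", "value2"]
}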


Fields are limited to 2048 characters. Objects can be 20 KB in size, which means that you can't add many large fields to an object.

Each pipeline object must contain the following fields: id and type, as shown in the following example. Other fields may also be required based on the object type. Select a value for id that's meaningful to you, and is unique within the pipeline definition. The value for type specifies the type of the object. Specify one of the supported pipeline definition object types, which are listed in the topic Objects (p. 161).

{ "id": "MyCopyToS3", "type": "CopyActivity"}

For more information about the required and optional fields for each object, see the documentation for the object.

To include fields from one object in another object, use the parent field with a reference to the object. For example, object "B" includes its fields, "B1" and "B2", plus the fields from object "A", "A1" and "A2".

{ "id" : "A", "A1" : "value", "A2" : "value"},{ "id" : "B", "parent" : {"ref" : "A"}, "B1" : "value", "B2" : "value"}

You can define common fields in an object named "Default". These fields are automatically included in every object in the pipeline definition file that doesn't explicitly set its parent field to reference a different object.

{ "id" : "Default", "onFail" : {"ref" : "FailureNotification"}, "maximumRetries" : "3", "workerGroup" : "myWorkerGroup"}

User-Defined Fields

You can create user-defined or custom fields on your pipeline components and refer to them with expressions. The following example shows a custom field named "myCustomField" and a custom reference field named "my_customFieldReference" added to an S3DataNode:

{ "id": "S3DataInput", "type": "S3DataNode", "schedule": {"ref": "TheSchedule"}, "filePath": "s3://bucket_name", "myCustomField": "This is a custom value in a custom field.",

API Version 2012-10-29137

AWS Data Pipeline Developer GuideUser-Defined Fields

"my_customFieldReference": {"ref":"AnotherPipelineComponent"}},

A custom field must have a name prefixed with the word "my" in all lower-case letters, followed by a capital letter or underscore character, as in the preceding example "myCustomField". A user-defined field can be either a string value or a reference to another pipeline component, as shown by the preceding example "my_customFieldReference".

Note
On user-defined fields, AWS Data Pipeline only checks for valid references to other pipeline components, not any custom field string values that you add.

Expressions

Expressions enable you to share a value across related objects. Expressions are processed by the AWS Data Pipeline web service at runtime, ensuring that all expressions are substituted with the value of the expression.

Expressions are delimited by "#{" and "}". You can use an expression in any pipeline definition object where a string is legal.

The following expression calls one of the AWS Data Pipeline functions. For more information, see Expression Evaluation (p. 155).

#{format(myDateTime,'YYYY-MM-dd hh:mm:ss')}

Referencing Fields and Objects

To reference a field on the current object in an expression, use the node keyword. This keyword is available with alarm and precondition objects.

In the following example, the filePath field references the id field in the same object to form a file name. The value of filePath evaluates to "s3://mybucket/ExampleDataNode.csv".

{ "id" : "ExampleDataNode", "type" : "S3DataNode", "schedule" : {"ref" : "ExampleSchedule"}, "filePath" : "s3://mybucket/#{node.filePath}.csv", "precondition" : {"ref" : "ExampleCondition"}, "onFail" : {"ref" : "FailureNotify"}}

You can use an expression to reference objects that include another object, such as an alarm or precondition object, using the node keyword. For example, the precondition object "ExampleCondition" is referenced by the previously described "ExampleDataNode" object, so "ExampleCondition" can reference field values of "ExampleDataNode" using the node keyword. In the following example, the value of path evaluates to "s3://mybucket/ExampleDataNode.csv".

{ "id" : "ExampleCondition", "type" : "Exists"}


Note
You can create pipelines that have dependencies, such as tasks in your pipeline that depend on the work of other systems or tasks. If your pipeline requires certain resources, add those dependencies to the pipeline using preconditions that you associate with data nodes and tasks so your pipelines are easier to debug and more resilient. Additionally, keep your dependencies within a single pipeline when possible, because cross-pipeline troubleshooting is difficult.

As another example, you can use an expression to refer to the date and time range created by a Schedule object. For example, the message field uses the @scheduledStartTime and @scheduledEndTime runtime fields from the Schedule object that is referenced by the data node or activity that references this object in its onFail field.

{ "id" : "FailureNotify", "type" : "SnsAlarm", "subject" : "Failed to run pipeline component", "message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.", "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic"},

Saving the Pipeline Definition File

After you have completed the pipeline definition objects in your pipeline definition file, save the pipeline definition file using UTF-8 encoding.

You must submit the pipeline definition file to the AWS Data Pipeline web service. There are two primary ways to submit a pipeline definition file: using the AWS Data Pipeline command line interface or using the AWS Data Pipeline console.
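
For example, using the CLI commands described earlier in this guide, you can create a pipeline from a saved definition file and then activate it (the pipeline name, file name, and pipeline ID are placeholders):

datapipeline --create my-pipeline --put my-pipeline-file.json
datapipeline --activate --id pipeline_id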

Example Pipeline Definitions

This section contains a collection of example pipelines that you can quickly use for various scenarios, once you are familiar with AWS Data Pipeline. For more detailed, step-by-step instructions for creating and using pipelines, we recommend that you read one or more of the detailed tutorials available in this guide, for example Tutorial: Copy Data From a MySQL Table to Amazon S3 (p. 40) and Tutorial: Import/Export Data in Amazon DynamoDB With Amazon EMR and Hive (p. 69).

Copy SQL Data to a CSV File in Amazon S3

This example pipeline definition shows how to set a precondition or dependency on the existence of data gathered within a specific hour of time before copying the data (rows) from a table in a SQL database to a CSV (comma-separated values) file in an Amazon S3 bucket. The prerequisites and steps listed in this example pipeline definition are based on a MySQL database and table created using Amazon RDS.

Prerequisites

To set up and test this example pipeline definition, see Get Started with Amazon RDS to complete the following steps:

1. Sign up for Amazon RDS.

2. Launch a MySQL DB instance.

3. Authorize access.


4. Connect to the MySQL DB instance.

Note
The database name in this example, mydatabase, is the same as the one in the Amazon RDS Getting Started Guide.

After you connect to your MySQL DB instance, use the MySQL command line client to:

1. Use your test MySQL database and create a table named adEvents.

USE myDatabase;
CREATE TABLE IF NOT EXISTS adEvents (eventTime DATETIME, eventId INT, siteName VARCHAR(100));

2. Insert test data values into the newly created table named adEvents.

INSERT INTO adEvents (eventTime, eventId, siteName) values ('2012-06-17 10:00:00', 100, 'Sports');
INSERT INTO adEvents (eventTime, eventId, siteName) values ('2012-06-17 10:00:00', 200, 'News');
INSERT INTO adEvents (eventTime, eventId, siteName) values ('2012-06-17 10:00:00', 300, 'Finance');

Example Pipeline Definition

The eight pipeline definition objects in this example are defined with the objectives of having:

• A default object with values that can be used by all subsequent objects

• A precondition object that resolves to true when the data exists in a referencing object

• A schedule object that specifies beginning and ending dates and a duration or period of time in a referencing object

• A data input or source object with MySQL connection information, a query string, and references to the precondition and schedule objects

• A data output or destination object pointing to your specified Amazon S3 bucket

• An activity object for copying from MySQL to Amazon S3

• Amazon SNS notification objects used for signalling success and failure of a referencing activity object

{ "objects" : [ { "id" : "Default", "onFail" : {"ref" : "FailureNotify"}, "onSuccess" : {"ref" : "SuccessNotify"}, "maximumRetries" : "3", "workerGroup" : "myWorkerGroup" }, { "id" : "Ready", "type" : "Exists" }, {

API Version 2012-10-29140

AWS Data Pipeline Developer GuideCopy SQL Data to a CSV File in Amazon S3

"id" : "CopyPeriod", "type" : "Schedule", "startDateTime" : "2012-06-13T10:00:00", "endDateTime" : "2012-06-13T11:00:00", "period" : "1 hour" }, { "id" : "SqlTable", "type" : "MySqlDataNode", "schedule" : {"ref" : "CopyPeriod"}, "table" : "adEvents", "username": "user_name", "*password": "my_password", "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name", "selectQuery" : "select * from #{table} where eventTime >= '#{@scheduled StartTime.format('YYYY-MM-dd HH:mm:ss')}' and eventTime < '#{@scheduledEnd Time.format('YYYY-MM-dd HH:mm:ss')}'", "precondition" : {"ref" : "Ready"} }, { "id" : "OutputData", "type" : "S3DataNode", "schedule" : {"ref" : "CopyPeriod"}, "filePath" : "s3://S3BucketNameHere/#{@scheduledStartTime}.csv" }, { "id" : "mySqlToS3", "type" : "CopyActivity", "schedule" : {"ref" : "CopyPeriod"}, "input" : {"ref" : "SqlTable"}, "output" : {"ref" : "OutputData"}, "onSuccess" : {"ref" : "SuccessNotify"} }, { "id" : "SuccessNotify", "type" : "SnsAlarm", "subject" : "Pipeline component succeeded", "message": "Success for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.", "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic" }, { "id" : "FailureNotify", "type" : "SnsAlarm", "subject" : "Failed to run pipeline component", "message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.", "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic" } ]}

Launch an Amazon EMR Job Flow

This example pipeline definition provisions an Amazon EMR cluster and runs a job step on that cluster once a day, governed by the existence of the specified Amazon S3 path.


Example Pipeline Definition

The value for workerGroup should match the value that you specified for Task Runner.

Replace myOutputPath and myLogPath with your own values.

{ "objects" : [ { "id" : "Default", "onFail" : {"ref" : "FailureNotify"}, "maximumRetries" : "3", "workerGroup: "myWorkerGroup" }, { "id" : "Daily", "type" : "Schedule", "period" : "1 day", "startDateTime: "2012-06-26T00:00:00", "endDateTime" : "2012-06-27T00:00:00" }, { "id" : "InputData", "type" : "S3DataNode", "schedule" : {"ref" : "Daily"}, "filePath" : "s3://myBucket/#{@scheduledEndTime.format('YYYY-MM-dd')}", "precondition" : {"ref" : "Ready"} }, { "id" : "Ready", "type" : "S3DirectoryNotEmpty", "prefix" : "#{node.filePath}", }, { "id" : "MyCluster", "type" : "EmrCluster", "masterInstanceType" : "m1.small", "schedule" : {"ref" : "Daily"}, "enableDebugging" : "true", "logUri": "s3://myLogPath/logs" }, { "id" : "MyEmrActivity", "type" : "EmrActivity", "input" : {"ref" : "InputData"}, "schedule" : {"ref" : "Daily"}, "onSuccess" : "SuccessNotify", "runsOn" : {"ref" : "MyCluster"}, "preStepCommand" : "echo Starting #{id} for day #{@scheduledStartTime} >> /tmp/stepCommand.txt", "postStepCommand" : "echo Ending #{id} for day #{@scheduledStartTime} >> /tmp/stepCommand.txt", "step" : "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-in put,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://myOutputPath/word count/output/,-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate" },

API Version 2012-10-29142

AWS Data Pipeline Developer GuideLaunch an Amazon EMR Job Flow

{ "id" : "SuccessNotify", "type" : "SnsAlarm", "subject" : "Pipeline component succeeded", "message": "Success for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.", "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic" }, { "id" : "FailureNotify", "type" : "SnsAlarm", "subject" : "Failed to run pipeline component", "message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.", "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic" } ]}

Run a Script on a Schedule

This example pipeline definition runs an arbitrary command line script on a 'time-based' schedule. This is a 'time based' schedule and not a 'dependency based' schedule, in the sense that 'MyProcess' is scheduled to run based on clock time, not based on the availability of data sources that are inputs to 'MyProcess'. The schedule object 'Period' in this case defines a schedule used by activity 'MyProcess' such that 'MyProcess' will be scheduled to execute every hour, beginning at startDateTime. The interval could have been minutes, hours, days, weeks, or months, and can be changed by modifying the period field on object 'Period'.

Note
When a schedule's startDateTime is in the past, AWS Data Pipeline backfills your pipeline and begins scheduling runs immediately, beginning at startDateTime. For testing/development, use a relatively short interval for startDateTime..endDateTime. If not, AWS Data Pipeline attempts to queue up and schedule all runs of your pipeline for that interval.

Example Pipeline Definition

{ "objects" : [ { "id" : "Default", "onFail" : {"ref" : "FailureNotify"}, "maximumRetries" : "3", "workerGroup" : "myWorkerGroup" }, { "id" : "Period", "type" : "Schedule", "period" : "1 hour", "startDateTime" : "2012-01-13T20:00:00", "endDateTime" : "2012-01-13T21:00:00" }, { "id" : "MyProcess", "type" : "ShellCommandActivity", "onSuccess" : {"ref" : "SuccessNotify"},

API Version 2012-10-29143

AWS Data Pipeline Developer GuideRun a Script on a Schedule

"command" : "/home/myScriptPath/myScript.sh #{@scheduledStartTime} #{@scheduledEndTime}", "schedule": {"ref" : "Period"}, "stderr" : "/tmp/stderr:#{@scheduledStartTime}", "stdout" : "/tmp/stdout:#{@scheduledStartTime}" }, { "id" : "SuccessNotify", "type" : "SnsAlarm", "subject" : "Pipeline component succeeded", "message": "Success for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.", "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic" }, { "id" : "FailureNotify", "type" : "SnsAlarm", "subject" : "Failed to run pipeline component", "message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.", "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic" } ]}

Chain Multiple Activities and Roll Up Data

This example pipeline definition demonstrates the following:

• Chaining multiple activities in a graph of dependencies based on inputs and outputs.

• Rolling up data from smaller granularities (such as 15 minute buckets) into a larger granularity (such as 1 hour buckets).

This pipeline defines a schedule named 'CopyPeriod', which describes 15 minute time intervals originating at UTC time 2012-01-17T00:00:00, and a schedule named 'HourlyPeriod', which describes 1 hour time intervals originating at UTC time 2012-01-17T00:00:00.

'InputData' describes files of this form:

• s3://myBucket/demo/2012-01-17T00:00:00.csv

• s3://myBucket/demo/2012-01-17T00:15:00.csv

• s3://myBucket/demo/2012-01-17T00:30:00.csv

• s3://myBucket/demo/2012-01-17T00:45:00.csv

• s3://myBucket/demo/2012-01-17T01:00:00.csv

Every 15 minute interval (specified by @scheduledStartTime..@scheduledEndTime), activity 'CopyMinuteData' checks for Amazon S3 file s3://myBucket/demo/#{@scheduledStartTime}.csv and, when it is found, copies the file to s3://myBucket/demo/#{@scheduledEndTime}.csv, per the definition of output object 'OutputMinuteData'.

Similarly, for every hour's worth of 'OutputMinuteData' Amazon S3 files found to exist (four 15-minute files in this case), activity 'CopyHourlyData' runs and writes the output to an hourly file defined by the expression s3://myBucket/demo/hourly/#{@scheduledEndTime}.csv in 'HourlyData'.

API Version 2012-10-29144

AWS Data Pipeline Developer GuideChain Multiple Activities and Roll Up Data

Finally, when the Amazon S3 file described by s3://myBucket/demo/hourly/#{@scheduledEndTime}.csv in 'HourlyData' is found to exist, AWS Data Pipeline runs the script described by activity 'ShellOut'.

Example Pipeline Definition

{ "objects" " [ { "id" : "Default", "onFail" : {"ref" : "FailureNotify"}, "maximumRetries" : "3", "workerGroup" : "myWorkerGroup" }, { "id" : "CopyPeriod", "type" : "Schedule", "period" : "15 minutes", "startDateTime" : "2012-01-17T00:00:00", "endDateTime" : "2012-01-17T02:00:00" }, { "id" : "InputData", "type" : "S3DataNode", "schedule" : {"ref" : "CopyPeriod"}, "filePath" : "s3://myBucket/demo/#{@scheduledStartTime}.csv", "precondition" : {"ref" : "Ready"} }, { "id" : "OutputMinuteData", "type" : "S3DataNode", "schedule" : {"ref" : "CopyPeriod"}, "filePath" : "s3://myBucket/demo/#{@scheduledEndTime}.csv" }, { "id" : "Ready", "type" : "Exists", }, { "id" : "CopyMinuteData", "type" : "CopyActivity", "schedule" : {"ref" : "CopyPeriod"}, "input" : {"ref" : "InputData"}, "output" : {"ref" : "OutputMinuteData"} }, { "id" : "HourlyPeriod", "type" : "Schedule", "period" : "1 hour", "startDateTime" : "2012-01-17T00:00:00", "endDateTime" : "2012-01-17T02:00:00" }, { "id" : "CopyHourlyData", "type" : "CopyActivity", "schedule" : {"ref" : "HourlyPeriod"}, "input" : {"ref" : "OutputMinuteData"}, "output" : {"ref" : "HourlyData"}

API Version 2012-10-29145

AWS Data Pipeline Developer GuideChain Multiple Activities and Roll Up Data

}, { "id" : "HourlyData", "type" : "S3DataNode", "schedule" : {"ref" : "HourlyPeriod"}, "filePath" : "s3://myBucket/demo/hourly/#{@scheduledEndTime}.csv" }, { "id" : "ShellOut", "type" : "ShellCommandActivity", "input" : {"ref" : "HourlyData"}, "command" : "/home/userName/xxx.sh #{@scheduledStartTime} #{@scheduledEnd Time}", "schedule" : {"ref" : "HourlyPeriod"}, "stderr" : "/tmp/stderr:#{@scheduledStartTime}", "stdout" : "/tmp/stdout:#{@scheduledStartTime}", "onSuccess" : {"ref" : "SuccessNotify"} }, { "id" : "SuccessNotify", "type" : "SnsAlarm", "subject" : "Pipeline component succeeded", "message": "Success for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.", "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic" }, { "id" : "FailureNotify", "type" : "SnsAlarm", "subject" : "Failed to run pipeline component", "message": "Error for interval #{node.@scheduledStartTime}..#{node.@sched uledEndTime}.", "topicArn":"arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic" } ]}

Copy Data from Amazon S3 to MySQL

This example pipeline definition automatically creates an Amazon EC2 instance that copies the specified data from a CSV file in Amazon S3 into a MySQL database table. For simplicity, the structure of the example MySQL insert statement assumes that you have a CSV input file with two columns of data that you are writing into a MySQL database table that has two matching columns of the appropriate data type. If your data has a different structure, modify the MySQL statement to include additional data columns or data types.

Example Pipeline Definition

{ "objects": [ { "id": "Default", "logUri": "s3://testbucket/error_log", "schedule": { "ref": "MySchedule"

API Version 2012-10-29146

AWS Data Pipeline Developer GuideCopy Data from Amazon S3 to MySQL

} }, { "id": "MySchedule", "type": "Schedule", "startDateTime": "2012-11-26T00:00:00", "endDateTime": "2012-11-27T00:00:00", "period": "1 day" }, { "id": "MyS3Input", "filePath": "s3://testbucket/input_data_file.csv", "type": "S3DataNode" }, { "id": "MyCopyActivity", "input": { "ref": "MyS3Input" }, "output": { "ref": "MyDatabaseNode" }, "type": "CopyActivity", "runsOn": { "ref": "MyEC2Resource" } }, { "id": "MyEC2Resource", "type": "Ec2Resource", "actionOnTaskFailure": "terminate", "actionOnResourceFailure": "retryAll", "maximumRetries": "1", "role": "test-role", "resourceRole": "test-role", "instanceType": "m1.medium", "instanceCount": "1", "securityGroups": [ "test-group", "default" ], "keyPair": "test-pair" }, { "id": "MyDatabaseNode", "type": "MySqlDataNode", "table": "table_name", "username": "user_name", "*password": "my_password", "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name", "insertQuery": "insert into #{table} (column1_ name, column2_name) values (?, ?);"

} ]}


This example has the following fields defined in the MySqlDataNode:

id
User-defined identifier for the MySQL database, which is a label for your reference only.

type
MySqlDataNode type that matches the kind of location for our data, which is an Amazon RDS instance using MySQL in this example.

table
Name of the database table that contains the data to copy. Replace table_name with the name of your database table.

username
User name of the database account that has sufficient permission to retrieve data from the database table. Replace user_name with the name of your user account.

*password
Password for the database account, with the asterisk prefix to indicate that AWS Data Pipeline must encrypt the password value. Replace my_password with the correct password for your user account.

connectionString
JDBC connection string for CopyActivity to connect to the database.

insertQuery
A valid SQL INSERT statement that specifies how to write data into the database table. Note that #{table} is a variable that re-uses the table name provided by the "table" field in the preceding lines of the JSON file.
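For example, if your CSV input file had three columns rather than two, the insertQuery might be extended as follows; the column names shown here are hypothetical placeholders for illustration only.

"insertQuery": "insert into #{table} (column1_name, column2_name, column3_name) values (?, ?, ?);"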

Extract Apache Web Log Data from Amazon S3 using Hive

This example pipeline definition automatically creates an Amazon EMR cluster to extract data from Apache web logs in Amazon S3 to a CSV file in Amazon S3 using Hive.

Example Pipeline Definition

{ "objects": [ { "startDateTime": "2012-05-04T00:00:00", "id": "MyEmrResourcePeriod", "period": "1 day", "type": "Schedule", "endDateTime": "2012-05-05T00:00:00" }, { "id": "MyHiveActivity", "type": "HiveActivity", "schedule": { "ref": "MyEmrResourcePeriod" }, "runsOn": { "ref": "MyEmrResource" }, "input": { "ref": "MyInputData" }, "output": {

API Version 2012-10-29148

AWS Data Pipeline Developer GuideExtract Apache Web Log Data from Amazon S3 using

Hive

"ref": "MyOutputData" }, "hiveScript": "INSERT OVERWRITE TABLE ${output1} select host,user,time,request,status,size from ${input1};" }, { "schedule": { "ref": "MyEmrResourcePeriod" }, "masterInstanceType": "m1.small", "coreInstanceType": "m1.small", "enableDebugging": "true", "keyPair": "test-pair", "id": "MyEmrResource", "coreInstanceCount": "1", "actionOnTaskFailure": "continue", "maximumRetries": "1", "type": "EmrCluster", "actionOnResourceFailure": "retryAll", "terminateAfter": "10 hour" }, { "id": "MyInputData", "type": "S3DataNode", "schedule": { "ref": "MyEmrResourcePeriod" }, "directoryPath": "s3://test-hive/input-access-logs", "dataFormat": { "ref": "MyInputDataType" } }, { "id": "MyOutputData", "type": "S3DataNode", "schedule": { "ref": "MyEmrResourcePeriod" }, "directoryPath": "s3://test-hive/output-access-logs", "dataFormat": { "ref": "MyOutputDataType" } }, { "id": "MyOutputDataType", "type": "Custom", "columnSeparator": "\t", "recordSeparator": "\n", "column": [ "host STRING", "user STRING", "time STRING", "request STRING", "status STRING", "size STRING" ] }, {

API Version 2012-10-29149

AWS Data Pipeline Developer GuideExtract Apache Web Log Data from Amazon S3 using

Hive

"id": "MyInputDataType", "type": "RegEx", "inputRegEx": "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?", "outputFormat": "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s", "column": [ "host STRING", "identity STRING", "user STRING", "time STRING", "request STRING", "status STRING", "size STRING", "referer STRING", "agent STRING" ] } ]}

Extract Amazon S3 Data (CSV/TSV) to Amazon S3 using Hive

This example pipeline definition creates an Amazon EMR cluster to extract CSV data from Amazon S3 to a CSV file in Amazon S3 using Hive.

Note
You can accommodate tab-delimited (TSV) data files similarly to how this sample demonstrates using comma-delimited (CSV) files, if you change the MyInputDataType and MyOutputDataType type field to "TSV" instead of "CSV".
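For instance, a minimal sketch of the tab-delimited variant of the input format object might look like the following; the id and column list are taken from the example below, and this assumes the TSV data format type accepts the same column definitions.

{
  "id": "MyInputDataType",
  "type": "TSV",
  "column": [
    "Name STRING",
    "Age STRING",
    "Surname STRING"
  ]
}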

Example Pipeline Definition

{ "objects": [ { "startDateTime": "2012-05-04T00:00:00", "id": "MyEmrResourcePeriod", "period": "1 day", "type": "Schedule", "endDateTime": "2012-05-05T00:00:00" }, { "id": "MyHiveActivity", "maximumRetries": "10", "type": "HiveActivity", "schedule": { "ref": "MyEmrResourcePeriod" }, "runsOn": { "ref": "MyEmrResource" }, "input": { "ref": "MyInputData"

API Version 2012-10-29150

AWS Data Pipeline Developer GuideExtract Amazon S3 Data (CSV/TSV) to Amazon S3 using

Hive

}, "output": { "ref": "MyOutputData" }, "hiveScript": "INSERT OVERWRITE TABLE ${output1} select * from ${input1};"

}, { "schedule": { "ref": "MyEmrResourcePeriod" }, "masterInstanceType": "m1.small", "coreInstanceType": "m1.small", "enableDebugging": "true", "keyPair": "test-pair", "id": "MyEmrResource", "coreInstanceCount": "1", "actionOnTaskFailure": "continue", "maximumRetries": "2", "type": "EmrCluster", "actionOnResourceFailure": "retryAll", "terminateAfter": "10 hour" }, { "id": "MyInputData", "type": "S3DataNode", "schedule": { "ref": "MyEmrResourcePeriod" }, "directoryPath": "s3://test-hive/input", "dataFormat": { "ref": "MyInputDataType" } }, { "id": "MyOutputData", "type": "S3DataNode", "schedule": { "ref": "MyEmrResourcePeriod" }, "directoryPath": "s3://test-hive/output", "dataFormat": { "ref": "MyOutputDataType" } }, { "id": "MyOutputDataType", "type": "CSV", "column": [ "Name STRING", "Age STRING", "Surname STRING" ] }, { "id": "MyInputDataType", "type": "CSV", "column": [

API Version 2012-10-29151

AWS Data Pipeline Developer GuideExtract Amazon S3 Data (CSV/TSV) to Amazon S3 using

Hive

"Name STRING", "Age STRING", "Surname STRING" ] } ]}

Extract Amazon S3 Data (Custom Format) to Amazon S3 using Hive

This example pipeline definition creates an Amazon EMR cluster to extract data from Amazon S3 with Hive, using a custom file format specified by the columnSeparator and recordSeparator fields.

Example Pipeline Definition

{ "objects": [ { "startDateTime": "2012-05-04T00:00:00", "id": "MyEmrResourcePeriod", "period": "1 day", "type": "Schedule", "endDateTime": "2012-05-05T00:00:00" }, { "id": "MyHiveActivity", "type": "HiveActivity", "schedule": { "ref": "MyEmrResourcePeriod" }, "runsOn": { "ref": "MyEmrResource" }, "input": { "ref": "MyInputData" }, "output": { "ref": "MyOutputData" }, "hiveScript": "INSERT OVERWRITE TABLE ${output1} select * from ${input1};"

}, { "schedule": { "ref": "MyEmrResourcePeriod" }, "masterInstanceType": "m1.small", "coreInstanceType": "m1.small", "enableDebugging": "true", "keyPair": "test-pair", "id": "MyEmrResource", "coreInstanceCount": "1", "actionOnTaskFailure": "continue",

API Version 2012-10-29152

AWS Data Pipeline Developer GuideExtract Amazon S3 Data (Custom Format) to Amazon

S3 using Hive

"maximumRetries": "1", "type": "EmrCluster", "actionOnResourceFailure": "retryAll", "terminateAfter": "10 hour" }, { "id": "MyInputData", "type": "S3DataNode", "schedule": { "ref": "MyEmrResourcePeriod" }, "directoryPath": "s3://test-hive/input", "dataFormat": { "ref": "MyInputDataType" } }, { "id": "MyOutputData", "type": "S3DataNode", "schedule": { "ref": "MyEmrResourcePeriod" }, "directoryPath": "s3://test-hive/output-custom", "dataFormat": { "ref": "MyOutputDataType" } }, { "id": "MyOutputDataType", "type": "Custom", "columnSeparator": ",", "recordSeparator": "\n", "column": [ "Name STRING", "Age STRING", "Surname STRING" ] }, { "id": "MyInputDataType", "type": "Custom", "columnSeparator": ",", "recordSeparator": "\n", "column": [ "Name STRING", "Age STRING", "Surname STRING" ] } ]}

Simple Data Types

The following types of data can be set as field values.


Topics

• DateTime (p. 154)

• Numeric (p. 154)

• Expression Evaluation (p. 154)

• Object References (p. 154)

• Period (p. 154)

• String (p. 155)

DateTime

AWS Data Pipeline supports the date and time expressed in "YYYY-MM-DDTHH:MM:SS" format in UTC/GMT only. The following example sets the startDateTime field of a Schedule object to 1/15/2012, 11:59 p.m., in the UTC/GMT time zone.

"startDateTime" : "2012-01-15T23:59:00"

Numeric

AWS Data Pipeline supports both integers and floating-point values.

Expression Evaluation

AWS Data Pipeline provides a set of functions that you can use to calculate the value of a field. For more information about these functions, see Expression Evaluation (p. 155). The following example uses the makeDate function to set the startDateTime field of a Schedule object to "2011-05-24T0:00:00" GMT/UTC.

"startDateTime" : "makeDate(2011,5,24)"

Object References

An object in the pipeline definition. This can either be the current object, the name of an object defined elsewhere in the pipeline, or an object that lists the current object in a field, referenced by the node keyword. For more information about node, see Referencing Fields and Objects (p. 138). For more information about the pipeline object types, see Objects (p. 161).
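For example, the following field sets a reference to another object by its id; CopyPeriod here is a hypothetical Schedule object, consistent with the examples used elsewhere in this guide.

"schedule" : {"ref" : "CopyPeriod"}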

Period

Indicates how often a scheduled event should run. It's expressed in the format "N[years|months|weeks|days|hours|minutes]", where N is a positive integer value.

The minimum period is 15 minutes and the maximum period is 3 years.

The following example sets the period field of the Schedule object to 3 hours. This creates a schedule that runs every three hours.

"period" : "3 hours"


String

Standard string values. Strings must be surrounded by double quotes ("). You can use the backslash character (\) to escape characters in a string. Multiline strings are not supported.

The following are examples of valid string values for the id field.

"id" : "My Data Object"

"id" : "My \"Data\" Object"

Strings can also contain expressions that evaluate to string values. These are inserted into the string, and are delimited with "#{" and "}". The following example uses an expression to insert the name of the current object into a path.

"filePath" : "s3://myBucket/#{name}.csv"

For more information about using expressions, see Referencing Fields and Objects (p. 138) and Expression Evaluation (p. 155).

Expression Evaluation

The following functions are provided by AWS Data Pipeline. You can use them to evaluate field values.

Topics

• Mathematical Functions (p. 155)

• String Functions (p. 156)

• Date and Time Functions (p. 156)

Mathematical Functions

The following functions are available for working with numerical values.

Function: +
Description: Addition.
Example: #{1 + 2}
Result: 3

Function: -
Description: Subtraction.
Example: #{1 - 2}
Result: -1

Function: *
Description: Multiplication.
Example: #{1 * 2}
Result: 2

Function: /
Description: Division. If you divide two integers, the result is truncated.
Example: #{1 / 2}, Result: 0
Example: #{1.0 / 2}, Result: .5

Function: ^
Description: Exponent.
Example: #{2 ^ 2}
Result: 4.0

String Functions

The following functions are available for working with string values.

Function: +
Description: Concatenation. Non-string values are first converted to strings.
Example: #{"hel" + "lo"}
Result: "hello"

Date and Time Functions

The following functions are available for working with DateTime values. For the examples, the value of myDateTime is May 24, 2011 @ 5:10 pm GMT.

Function: int minute(DateTime myDateTime)
Description: Gets the minute of the DateTime value as an integer.
Example: #{minute(myDateTime)}
Result: 10

Function: int hour(DateTime myDateTime)
Description: Gets the hour of the DateTime value as an integer.
Example: #{hour(myDateTime)}
Result: 17

Function: int day(DateTime myDateTime)
Description: Gets the day of the DateTime value as an integer.
Example: #{day(myDateTime)}
Result: 24

Function: int dayOfYear(DateTime myDateTime)
Description: Gets the day of the year of the DateTime value as an integer.
Example: #{dayOfYear(myDateTime)}
Result: 144

Function: int month(DateTime myDateTime)
Description: Gets the month of the DateTime value as an integer.
Example: #{month(myDateTime)}
Result: 5

Function: int year(DateTime myDateTime)
Description: Gets the year of the DateTime value as an integer.
Example: #{year(myDateTime)}
Result: 2011

Function: String format(DateTime myDateTime, String format)
Description: Creates a String object that is the result of converting the specified DateTime using the specified format string.
Example: #{format(myDateTime,'YYYY-MM-dd hh:mm:ss z')}
Result: "2011-05-24T17:10:00 UTC"

Function: DateTime inTimeZone(DateTime myDateTime, String zone)
Description: Creates a DateTime object with the same date and time, but in the specified time zone, and taking daylight savings time into account. For more information about time zones, see http://joda-time.sourceforge.net/timezones.html.
Example: #{inTimeZone(myDateTime,'America/Los_Angeles')}
Result: "2011-05-24T10:10:00 America/Los_Angeles"

Function: DateTime makeDate(int year, int month, int day)
Description: Creates a DateTime object, in UTC, with the specified year, month, and day, at midnight.
Example: #{makeDate(2011,5,24)}
Result: "2011-05-24T0:00:00z"

Function: DateTime makeDateTime(int year, int month, int day, int hour, int minute)
Description: Creates a DateTime object, in UTC, with the specified year, month, day, hour, and minute.
Example: #{makeDateTime(2011,5,24,14,21)}
Result: "2011-05-24T14:21:00z"

Function: DateTime midnight(DateTime myDateTime)
Description: Creates a DateTime object for the next midnight, relative to the specified DateTime.
Example: #{midnight(myDateTime)}
Result: "2011-05-24T0:00:00z"

Function: DateTime yesterday(DateTime myDateTime)
Description: Creates a DateTime object for the previous day, relative to the specified DateTime. The result is the same as minusDays(1).
Example: #{yesterday(myDateTime)}
Result: "2011-05-23T17:10:00z"

Function: DateTime sunday(DateTime myDateTime)
Description: Creates a DateTime object for the previous Sunday, relative to the specified DateTime. If the specified DateTime is a Sunday, the result is the specified DateTime.
Example: #{sunday(myDateTime)}
Result: "2011-05-22 17:10:00 UTC"

Function: DateTime firstOfMonth(DateTime myDateTime)
Description: Creates a DateTime object for the start of the month in the specified DateTime.
Example: #{firstOfMonth(myDateTime)}
Result: "2011-05-01T17:10:00z"

Function: DateTime minusMinutes(DateTime myDateTime, int minutesToSub)
Description: Creates a DateTime object that is the result of subtracting the specified number of minutes from the specified DateTime.
Example: #{minusMinutes(myDateTime,1)}
Result: "2011-05-24T17:09:00z"

Function: DateTime minusHours(DateTime myDateTime, int hoursToSub)
Description: Creates a DateTime object that is the result of subtracting the specified number of hours from the specified DateTime.
Example: #{minusHours(myDateTime,1)}
Result: "2011-05-24T16:10:00z"

Function: DateTime minusDays(DateTime myDateTime, int daysToSub)
Description: Creates a DateTime object that is the result of subtracting the specified number of days from the specified DateTime.
Example: #{minusDays(myDateTime,1)}
Result: "2011-05-23T17:10:00z"

Function: DateTime minusWeeks(DateTime myDateTime, int weeksToSub)
Description: Creates a DateTime object that is the result of subtracting the specified number of weeks from the specified DateTime.
Example: #{minusWeeks(myDateTime,1)}
Result: "2011-05-17T17:10:00z"

Function: DateTime minusMonths(DateTime myDateTime, int monthsToSub)
Description: Creates a DateTime object that is the result of subtracting the specified number of months from the specified DateTime.
Example: #{minusMonths(myDateTime,1)}
Result: "2011-04-24T17:10:00z"

Function: DateTime minusYears(DateTime myDateTime, int yearsToSub)
Description: Creates a DateTime object that is the result of subtracting the specified number of years from the specified DateTime.
Example: #{minusYears(myDateTime,1)}
Result: "2010-05-24T17:10:00z"

Function: DateTime plusMinutes(DateTime myDateTime, int minutesToAdd)
Description: Creates a DateTime object that is the result of adding the specified number of minutes to the specified DateTime.
Example: #{plusMinutes(myDateTime,1)}
Result: "2011-05-24T17:11:00z"

Function: DateTime plusHours(DateTime myDateTime, int hoursToAdd)
Description: Creates a DateTime object that is the result of adding the specified number of hours to the specified DateTime.
Example: #{plusHours(myDateTime,1)}
Result: "2011-05-24T18:10:00z"

Function: DateTime plusDays(DateTime myDateTime, int daysToAdd)
Description: Creates a DateTime object that is the result of adding the specified number of days to the specified DateTime.
Example: #{plusDays(myDateTime,1)}
Result: "2011-05-25T17:10:00z"

Function: DateTime plusWeeks(DateTime myDateTime, int weeksToAdd)
Description: Creates a DateTime object that is the result of adding the specified number of weeks to the specified DateTime.
Example: #{plusWeeks(myDateTime,1)}
Result: "2011-05-31T17:10:00z"

Function: DateTime plusMonths(DateTime myDateTime, int monthsToAdd)
Description: Creates a DateTime object that is the result of adding the specified number of months to the specified DateTime.
Example: #{plusMonths(myDateTime,1)}
Result: "2011-06-24T17:10:00z"

Function: DateTime plusYears(DateTime myDateTime, int yearsToAdd)
Description: Creates a DateTime object that is the result of adding the specified number of years to the specified DateTime.
Example: #{plusYears(myDateTime,1)}
Result: "2012-05-24T17:10:00z"
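As an illustration of combining these functions in a field value, the following sketch names an output file after the day before the scheduled start time; the bucket and path are hypothetical, and it assumes the function-call form shown in the table above can be nested inside an expression.

"filePath" : "s3://myBucket/demo/#{format(minusDays(@scheduledStartTime,1),'YYYY-MM-dd')}.csv"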

Objects

This section describes the objects that you can use in your pipeline definition file.

Object Categories

The following is a list of AWS Data Pipeline objects by category.

Schedule

• Schedule (p. 163)

Data node

• S3DataNode (p. 165)

• MySqlDataNode (p. 169)

Activity

• ShellCommandActivity (p. 176)


• CopyActivity (p. 180)

• EmrActivity (p. 184)

Precondition

• ShellCommandPrecondition (p. 192)

• Exists (p. 194)

• RdsSqlPrecondition (p. 203)

• DynamoDBTableExists (p. 204)

• DynamoDBDataExists (p. 204)

Computational resource

• EmrCluster (p. 209)

Alarm

• SnsAlarm (p. 213)

Object Hierarchy

The following is the object hierarchy for AWS Data Pipeline.

Important
You can only create objects of the types that are listed in the previous section.


Schedule

Defines the timing of a scheduled event, such as when an activity runs.

Note
When a schedule's startDateTime is in the past, AWS Data Pipeline backfills your pipeline and begins scheduling runs immediately, starting at startDateTime. For testing and development, use a relatively short interval for startDateTime..endDateTime. Otherwise, AWS Data Pipeline attempts to queue up and schedule all runs of your pipeline for that interval.


Syntax

The following slots are included in all objects.

id (String, required): The ID of the object. IDs must be unique within a pipeline definition.
name (String, optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String, required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String, optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.

This object includes the following fields.

startDateTime (String in DateTime or DateTimeWithZone format, required): The date and time to start the scheduled runs.
endDateTime (String in DateTime or DateTimeWithZone format, optional): The date and time to end the scheduled runs. The default behavior is to schedule runs until the pipeline is shut down.
period (String, required): How often the pipeline should run. The format is "N[minutes|hours|days|weeks|months]", where N is a number followed by one of the time specifiers. For example, "15 minutes" runs the pipeline every 15 minutes. The minimum period is 15 minutes and the maximum period is 3 years.
@scheduledStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This value is added to the object by the schedule. By convention, activities treat the start as inclusive.
@scheduledEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This value is added to the object by the schedule. By convention, activities treat the end as exclusive.


Example

The following is an example of this object type. It defines a schedule of every hour starting at 00:00:00 hours on 2012-09-01 and ending at 00:00:00 hours on 2012-10-01. The first period ends at 01:00:00 on 2012-09-01.

{
  "id" : "Hourly",
  "type" : "Schedule",
  "period" : "1 hours",
  "startDateTime" : "2012-09-01T00:00:00",
  "endDateTime" : "2012-10-01T00:00:00"
}

S3DataNode

Defines a data node using Amazon S3.

Note
When you use an S3DataNode as input to a CopyActivity, only the CSV data format is supported.

Syntax

The following slots are included in all objects.

id (String, required): The ID of the object. IDs must be unique within a pipeline definition.
name (String, optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String, required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String, optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.

This object includes the following fields.

filePath (String, optional): The path to the object in Amazon S3 as a URI, for example: s3://my-bucket/my-key-for-file.
directoryPath (String, optional): Amazon S3 directory path as a URI: s3://my-bucket/my-key-for-directory.
compression (String, optional): The type of compression for the data described by the S3DataNode. none is no compression and gzip is compressed with the gzip algorithm. This field is only supported when you use S3DataNode with a CopyActivity.
dataFormat (String, required): The format of the data described by the S3DataNode. This field is only supported when you use S3DataNode with a HiveActivity.

This object includes the following slots from the DataNode object.

onFail (SnsAlarm (p. 213) object reference, optional): An action to run when the current instance fails.
onSuccess (SnsAlarm (p. 213) object reference, optional): An email alarm to use when the object's run succeeds.
precondition (Object reference, optional): A condition that must be met for the data node to be valid. To specify multiple conditions, add multiple precondition slots. A data node is not ready until all its conditions are met.
schedule (Schedule (p. 163) object reference, required): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional.
scheduleType (allowed values are "cron" or "timeseries", defaults to "timeseries", optional): Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

This object includes the following slots from RunnableObject.

workerGroup (String, optional): The worker group. This is used for routing tasks.
retryDelay (Period, required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer, optional): The maximum number of times to retry the action.
onLateNotify (SnsAlarm (p. 213) object reference, optional): The email alarm to use when the object's run is late.
onLateKill (Boolean, optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period, optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period, optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period, optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String, optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime, optional): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference, optional): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only): If the object failed, the error stack trace.
@scheduledStartTime (DateTime, optional): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime, optional): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only): The component from which this instance is created.
@headAttempt (Object reference, read-only): The latest attempt on the given instance.
@resource (Object reference, read-only): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String, optional): The status most recently reported from the activity.

This object includes the following slots from SchedulableObject.

schedule (Schedule (p. 163) object reference, optional): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object.
scheduleType (allowed values are "cron" or "timeseries", defaults to "timeseries", optional): Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
runsOn (Resource object reference, optional): The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster.

Example

The following is an example of this object type. This object references another object that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object.

{
  "id" : "OutputData",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "filePath" : "s3://myBucket/#{@scheduledStartTime}.csv"
}

See Also
• MySqlDataNode (p. 169)

MySqlDataNode

Defines a data node using MySQL.

Syntax

The following slots are included in all objects.

id (String, required): The ID of the object. IDs must be unique within a pipeline definition.
name (String, optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String, required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String, optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.

This object includes the following slots from the SqlDataNode object.

table (String, required): The name of the table in the MySQL database. To specify multiple tables, add multiple table slots.
connectionString (String, optional): The JDBC connection string to access the database.
selectQuery (String, optional): A SQL statement to fetch data from the table.
insertQuery (String, optional): A SQL statement to insert data into the table.

This object includes the following slots from the DataNode object.

onFail (SnsAlarm (p. 213) object reference, optional): An action to run when the current instance fails.
onSuccess (SnsAlarm (p. 213) object reference, optional): An email alarm to use when the object's run succeeds.
precondition (Object reference, optional): A condition that must be met for the data node to be valid. To specify multiple conditions, add multiple precondition slots. A data node is not ready until all its conditions are met.
schedule (Schedule (p. 163) object reference, required): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional.
scheduleType (allowed values are "cron" or "timeseries", defaults to "timeseries", optional): Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

This object includes the following slots from RunnableObject.

workerGroup (String, optional): The worker group. This is used for routing tasks.
retryDelay (Period, required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer, optional): The maximum number of times to retry the action.
onLateNotify (SnsAlarm (p. 213) object reference, optional): The email alarm to use when the object's run is late.
onLateKill (Boolean, optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period, optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period, optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period, optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String, optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime, optional): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference, optional): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only): If the object failed, the error stack trace.
@scheduledStartTime (DateTime, optional): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime, optional): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only): The component from which this instance is created.
@headAttempt (Object reference, read-only): The latest attempt on the given instance.
@resource (Object reference, read-only): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String, optional): The status most recently reported from the activity.

Example

The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object and Ready is a precondition object.

{
  "id" : "Sql Table",
  "type" : "MySqlDataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "table" : "adEvents",
  "username" : "user_name",
  "*password" : "my_password",
  "connectionString" : "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
  "selectQuery" : "select * from #{table} where eventTime >= '#{@scheduledStartTime.format('YYYY-MM-dd HH:mm:ss')}' and eventTime < '#{@scheduledEndTime.format('YYYY-MM-dd HH:mm:ss')}'",
  "precondition" : {"ref" : "Ready"}
}

See Also
• S3DataNode (p. 165)

DynamoDBDataNode

Defines a data node using Amazon DynamoDB, which is specified as an input to a HiveActivity or EMRActivity.

Note
The DynamoDBDataNode does not support the Exists precondition.

Syntax

The following slots are included in all objects.

id (String, required): The ID of the object. IDs must be unique within a pipeline definition.
name (String, optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String, required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String, optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.

This object includes the following fields.

tableName (String, required): The DynamoDB table.

This object includes the following slots from the DataNode object.

onFail (SnsAlarm (p. 213) object reference, optional): An action to run when the current instance fails.
onSuccess (SnsAlarm (p. 213) object reference, optional): An email alarm to use when the object's run succeeds.
precondition (Object reference, optional): A condition that must be met for the data node to be valid. To specify multiple conditions, add multiple precondition slots. A data node is not ready until all its conditions are met.
schedule (Schedule (p. 163) object reference, required): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional.
scheduleType (allowed values are "cron" or "timeseries", defaults to "timeseries", optional): Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.

This object includes the following slots from RunnableObject.

workerGroup (String, optional): The worker group. This is used for routing tasks.
retryDelay (Period, required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer, optional): The maximum number of times to retry the action.
onLateNotify (SnsAlarm (p. 213) object reference, optional): The email alarm to use when the object's run is late.
onLateKill (Boolean, optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period, optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period, optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period, optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String, optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime, optional): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference, optional): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only): If the object failed, the error stack trace.
@scheduledStartTime (DateTime, optional): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime, optional): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only): The component from which this instance is created.
@headAttempt (Object reference, read-only): The latest attempt on the given instance.
@resource (Object reference, read-only): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String, optional): The status most recently reported from the activity.

This object includes the following slots from SchedulableObject.

schedule (Schedule (p. 163) object reference, optional): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object.
scheduleType (allowed values are "cron" or "timeseries", defaults to "timeseries", optional): Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
runsOn (Resource object reference, optional): The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster.


Example

The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object and Ready is a precondition object.

{
  "id" : "MyDynamoDBTable",
  "type" : "DynamoDBDataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "tableName" : "adEvents",
  "precondition" : {"ref" : "Ready"}
}

ShellCommandActivity

Runs a command or script. You can use ShellCommandActivity to run time-series or cron-like scheduled tasks.

When the stage field is set to true and used with an S3DataNode, ShellCommandActivity supports the concept of staging data, which means that you can move data from Amazon S3 to a stage location, such as Amazon EC2 or your local environment, perform work on the data using scripts and the ShellCommandActivity, and move it back to Amazon S3. In this case, when your shell command is connected to an input S3DataNode, your shell scripts operate directly on the data using ${input1}, ${input2}, etc., referring to the ShellCommandActivity input fields. Similarly, output from the shell command can be staged in an output directory to be automatically pushed to Amazon S3, referred to by ${output1}, ${output2}, etc. These expressions can be passed as command-line arguments to the shell command for you to use in data transformation logic.
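For illustration, a minimal sketch of a staged ShellCommandActivity follows; the ids, the input and output S3 data nodes, the schedule, and the transform.sh script are hypothetical placeholders, and ${input1} and ${output1} are the staging expressions described above.

{
  "id" : "MyStagedShellCommand",
  "type" : "ShellCommandActivity",
  "schedule" : {"ref" : "CopyPeriod"},
  "input" : {"ref" : "MyS3Input"},
  "output" : {"ref" : "MyS3Output"},
  "stage" : "true",
  "command" : "/home/userName/transform.sh ${input1} ${output1}"
}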

Syntax

The following slots are included in all objects.

id (String, required): The ID of the object. IDs must be unique within a pipeline definition.
name (String, optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String, required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String, optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.

This object includes the following fields.

command (String, required): The command to run. This value and any associated parameters must function in the environment from which you are running the Task Runner.
stdout (String, optional): The file that receives redirected output from the command that is run.
stderr (String, optional): The file that receives redirected system error messages from the command that is run.
input (Data node object reference, optional): The input data source. To specify multiple data sources, add multiple input fields.
output (Data node object reference, optional): The location for the output. To specify multiple locations, add multiple output fields.
scriptUri (a valid S3 URI, optional): An Amazon S3 URI path for a file to download and run as a shell command. Only one scriptUri or command field should be present.
stage (Boolean, optional): Determines whether staging is enabled and allows your shell commands to have access to the staged-data variables, such as ${input1}, ${output1}, etc.

Note
You must specify either a command value or a scriptUri value; you do not need to specify both.

This object includes the following slots from the Activity object.

onFail (SnsAlarm (p. 213) object reference, optional): An action to run when the current instance fails.
onSuccess (SnsAlarm (p. 213) object reference, optional): An email alarm to use when the object's run succeeds.
precondition (Object reference, optional): A condition that must be met before the object can run. To specify multiple conditions, add multiple precondition slots. The activity cannot run until all its conditions are met.
schedule (Schedule (p. 163) object reference, required): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional.

This object includes the following slots from RunnableObject.

workerGroup (String, optional): The worker group. This is used for routing tasks.
retryDelay (Period, required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer, optional): The maximum number of times to retry the action.
onLateNotify (SnsAlarm (p. 213) object reference, optional): The email alarm to use when the object's run is late.
onLateKill (Boolean, optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period, optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period, optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period, optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String, optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime, optional): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference, optional): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only): If the object failed, the error stack trace.
@scheduledStartTime (DateTime, optional): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime, optional): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only): The component from which this instance is created.
@headAttempt (Object reference, read-only): The latest attempt on the given instance.
@resource (Object reference, read-only): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String, optional): The status most recently reported from the activity.

This object includes the following slots from SchedulableObject.

schedule (Schedule (p. 163) object reference, optional): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object.
scheduleType (allowed values are "cron" or "timeseries", defaults to "timeseries", optional): Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
runsOn (Resource object reference, optional): The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster.


Example

The following is an example of this object type.

{
  "id" : "CreateDirectory",
  "type" : "ShellCommandActivity",
  "command" : "mkdir new-directory"
}

See Also
• CopyActivity (p. 180)

• EmrActivity (p. 184)

CopyActivity

Copies data from one location to another. The copy operation is performed record by record.

Important
When you use an S3DataNode as input for CopyActivity, you can only use a Unix/Linux variant of the CSV data file format, which means that CopyActivity has specific limitations to its CSV support:

• The separator must be the "," (comma) character.

• The records will not be quoted.

• The default escape character will be ASCII value 92 (backslash).

• The end of record identifier will be ASCII value 10 (or "\n").

Warning
Windows-based systems typically use a different end-of-record character sequence: a carriage return and line feed together (ASCII value 13 and ASCII value 10). You must accommodate this difference using an additional mechanism, such as a pre-copy script to modify the input data, to ensure that CopyActivity can properly detect the end of a record; otherwise, the CopyActivity will fail repeatedly.

Warning
Additionally, you may encounter repeated CopyActivity failures if you supply compressed data files as input, but do not specify this using the compression field. In this case, CopyActivity will not properly detect the end of record character and the operation will fail.
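For example, one way to handle the Windows line-ending issue described in the first warning is to run a ShellCommandActivity before the copy that rewrites the input with Unix line endings. The following is a sketch only; the ids, data nodes, and schedule are hypothetical placeholders, it assumes staging is enabled so that ${input1} and ${output1} resolve to the staged data, and the exact shape of the staged paths depends on the data nodes involved.

{
  "id" : "FixLineEndings",
  "type" : "ShellCommandActivity",
  "schedule" : {"ref" : "CopyPeriod"},
  "input" : {"ref" : "WindowsInputData"},
  "output" : {"ref" : "UnixInputData"},
  "stage" : "true",
  "command" : "tr -d '\\r' < ${input1} > ${output1}"
}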

Syntax

The following slots are included in all objects.

id (String, required): The ID of the object. IDs must be unique within a pipeline definition.
name (String, optional): The optional, user-defined label of the object. If you do not provide a name for an object in a pipeline definition, AWS Data Pipeline automatically duplicates the value of id.
type (String, required): The type of object. Use one of the predefined AWS Data Pipeline object types.
parent (String, optional): The parent of the object.
@sphere (String, read-only): The sphere of an object denotes its place in the pipeline lifecycle, such as Pipeline, Component, Instance, or Attempt.

This object includes the following fields.

input (Data node object reference, required): The input data source. To specify multiple data sources, add multiple input fields.
output (Data node object reference, required): The location for the output. To specify multiple locations, add multiple output fields.

This object includes the following slots from the Activity object.

onFail (SnsAlarm (p. 213) object reference, optional): An action to run when the current instance fails.
onSuccess (SnsAlarm (p. 213) object reference, optional): An email alarm to use when the object's run succeeds.
precondition (Object reference, optional): A condition that must be met before the object can run. To specify multiple conditions, add multiple precondition slots. The activity cannot run until all its conditions are met.
schedule (Schedule (p. 163) object reference, required): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object. This slot overrides the schedule slot included from SchedulableObject, which is optional.

This object includes the following slots from RunnableObject.

workerGroup (String, optional): The worker group. This is used for routing tasks.
retryDelay (Period, required): The timeout duration between two retry attempts. The default is 10 minutes.
maximumRetries (Integer, optional): The maximum number of times to retry the action.
onLateNotify (SnsAlarm (p. 213) object reference, optional): The email alarm to use when the object's run is late.
onLateKill (Boolean, optional): Indicates whether all pending or unscheduled tasks should be killed if they are late.
lateAfterTimeout (Period, optional): The period in which the object run must start. If the activity does not start within the scheduled start time plus this time interval, it is considered late.
reportProgressTimeout (Period, optional): The period for successive calls from Task Runner to the ReportTaskProgress API. If Task Runner, or other code that is processing the tasks, does not report progress within the specified period, the activity can be retried.
attemptTimeout (Period, optional): The timeout for an activity. If an activity does not complete within the start time plus this time interval, AWS Data Pipeline marks the attempt as failed and your retry settings determine the next steps taken.
logUri (String, optional): The location in Amazon S3 to store log files generated by Task Runner when performing work for this object.
@reportProgressTime (DateTime, optional): The last time that Task Runner, or other code that is processing the tasks, called the ReportTaskProgress API.
@activeInstances (Schedulable object reference, optional): Record of the currently scheduled instance objects.
@lastRun (DateTime, read-only): The last run of the object. This is a runtime slot.
@scheduledPhysicalObjects (Object reference, read-only): The currently scheduled instance objects. This is a runtime slot.
@status (String, read-only): The status of this object. This is a runtime slot. Possible values are: pending, checking_preconditions, running, waiting_on_runner, successful, and failed.
@triesLeft (Integer, read-only): The number of attempted runs remaining before setting the status of this object to failed. This is a runtime slot.
@actualStartTime (DateTime, read-only): The date and time that the scheduled run actually started. This is a runtime slot.
@actualEndTime (DateTime, read-only): The date and time that the scheduled run actually ended. This is a runtime slot.
errorCode (String, read-only): If the object failed, the error code. This is a runtime slot.
errorMessage (String, read-only): If the object failed, the error message. This is a runtime slot.
errorStackTrace (String, read-only): If the object failed, the error stack trace.
@scheduledStartTime (DateTime, optional): The date and time that the run was scheduled to start.
@scheduledEndTime (DateTime, optional): The date and time that the run was scheduled to end.
@componentParent (Object reference, read-only): The component from which this instance is created.
@headAttempt (Object reference, read-only): The latest attempt on the given instance.
@resource (Object reference, read-only): The resource instance on which the given activity/precondition attempt is being run.
activityStatus (String, optional): The status most recently reported from the activity.

This object includes the following slots from SchedulableObject.

schedule (Schedule (p. 163) object reference, optional): A schedule of the object. A common use is to specify a time schedule that correlates to the schedule for the object.
scheduleType (allowed values are "cron" or "timeseries", defaults to "timeseries", optional): Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval.
runsOn (Resource object reference, optional): The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster.


Example
The following is an example of this object type. This object references three other objects that you would define in the same pipeline definition file. CopyPeriod is a Schedule object, and InputData and OutputData are data node objects.

{
  "id" : "S3ToS3Copy",
  "type" : "CopyActivity",
  "schedule" : {"ref" : "CopyPeriod"},
  "input" : {"ref" : "InputData"},
  "output" : {"ref" : "OutputData"}
}

See Also• ShellCommandActivity (p. 176)

• EmrActivity (p. 184)

EmrActivity
Runs an Amazon EMR job flow.

SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere

This object includes the following fields.

Name: runsOn
Description: Details about the Amazon EMR job flow.
Type: EmrCluster (p. 209) object reference
Required: Yes

Name: step
Description: One or more steps for the job flow to run. To specify multiple steps, up to 255, add multiple step fields.
Type: String
Required: Yes

Name: preStepCommand
Description: Shell scripts to be run before any steps are run. To specify multiple scripts, up to 255, add multiple preStepCommand fields.
Type: String
Required: No

Name: postStepCommand
Description: Shell scripts to be run after all steps are finished. To specify multiple scripts, up to 255, add multiple postStepCommand fields.
Type: String
Required: No

Name: input
Description: The input data source. To specify multiple data sources, add multiple input fields.
Type: Data node object reference
Required: No

Name: output
Description: The location for the output. To specify multiple locations, add multiple output fields.
Type: Data node object reference
Required: No

This object includes the following slots from the Activity object.

RequiredTypeDescriptionName

NoSnsAlarm (p. 213)object reference

An action to run when the current instancefails.

onFail

NoSnsAlarm (p. 213)object reference

An email alarm to use when the object's runsucceeds.

onSuccess

NoObject referenceA condition that must be met before theobject can run. To specify multipleconditions, add multiple preconditionslots. The activity cannot run until all itsconditions are met.

precondition

YesSchedule (p. 163)object reference

A schedule of the object. A common use isto specify a time schedule that correlates tothe schedule for the object.

This slot overrides the schedule slotincluded from SchedulableObject, whichis optional.

schedule

This object includes the following slots from RunnableObject.

RequiredTypeDescriptionName

NoStringThe worker group. This is used for routingtasks.

workerGroup

YesPeriodThe timeout duration between two retryattempts. The default is 10 minutes.

retryDelay

NoIntegerThe maximum number of times to retry theaction.

maximumRetries


NoSnsAlarm (p. 213)object reference

The email alarm to use when the object's runis late.

onLateNotify

NoBooleanIndicates whether all pending or unscheduledtasks should be killed if they are late.

onLateKill

NoPeriodThe period in which the object run must start.If the activity does not start within thescheduled start time plus this time interval,it is considered late.

lateAfterTimeout

NoPeriodThe period for successive calls from TaskRunner to the ReportTaskProgress API. IfTask Runner, or other code that isprocessing the tasks, does not reportprogress within the specified period, theactivity can be retried.

reportProgressTimeout

NoPeriodThe timeout for an activity. If an activity doesnot complete within the start time plus thistime interval, AWS Data Pipeline marks theattempt as failed and your retry settingsdetermine the next steps taken.

attemptTimeout

NoStringThe location in Amazon S3 to store log filesgenerated by Task Runner when performingwork for this object.

logUri

NoDateTimeThe last time that Task Runner, or other codethat is processing the tasks, called theReportTaskProgress API.

@reportProgressTime

NoSchedulable objectreference

Record of the currently scheduled instanceobjects

@activeInstances

NoDateTime (read-only)The last run of the object. This is a runtimeslot.

@lastRun

NoObject reference(read-only)

The currently scheduled instance objects.This is a runtime slot.

@scheduledPhysicalObjects

NoString (read-only)The status of this object. This is a runtimeslot. Possible values are: pending,checking_preconditions, running,waiting_on_runner, successful, andfailed.

@status

NoInteger (read-only)The number of attempted runs remainingbefore setting the status of this object tofailed. This is a runtime slot.

@triesLeft

NoDateTime (read-only)The date and time that the scheduled runactually started. This is a runtime slot.

@actualStartTime

NoDateTime (read-only)The date and time that the scheduled runactually ended. This is a runtime slot.

@actualEndTime

NoString (read-only)If the object failed, the error code. This is aruntime slot.

errorCode


NoString (read-only)If the object failed, the error message. Thisis a runtime slot.

errorMessage

NoString (read-only)If the object failed, the error stack trace.errorStackTrace

NoDateTimeThe date and time that the run wasscheduled to start.

@scheduledStartTime

NoDateTimeThe date and time that the run wasscheduled to end.

@scheduledEndTime

NoObject reference(read-only)

The component from which this instance iscreated.

@componentParent

NoObject reference(read-only)

The latest attempt on the given instance.@headAttempt

NoObject reference(read-only)

The resource instance on which the givenactivity/precondition attempt is being run.

@resource

NoStringThe status most recently reported from theactivity.

activityStatus

This object includes the following slots from SchedulableObject.

RequiredTypeDescriptionName

NoSchedule (p. 163)object reference

A schedule of the object. A common use isto specify a time schedule that correlates tothe schedule for the object.

schedule

NoAllowed values are“cron” or “timeseries”.Defaults to"timeseries".

Schedule type allows you to specify whetherthe objects in your pipeline definition shouldbe scheduled at the beginning of interval orend of the interval. Time Series StyleScheduling means instances are scheduledat the end of each interval and Cron StyleScheduling means instances are scheduledat the beginning of each interval.

scheduleType

NoResource objectreference

The computational resource to run theactivity or command. For example, anAmazon EC2 instance or Amazon EMRcluster.

runsOn

Example
The following is an example of this object type. This object references three other objects that you would define in the same pipeline definition file. MyEmrCluster is an EmrCluster object, and MyS3Input and MyS3Output are S3DataNode objects.

{
  "id" : "MyEmrActivity",
  "type" : "EmrActivity",
  "runsOn" : {"ref" : "MyEmrCluster"},
  "preStepCommand" : "scp remoteFiles localFiles",
  "step" : "s3://myBucket/myPath/myStep.jar,firstArg,secondArg",
  "step" : "s3://myBucket/myPath/myOtherStep.jar,anotherArg",
  "postStepCommand" : "scp localFiles remoteFiles",
  "input" : {"ref" : "MyS3Input"},
  "output" : {"ref" : "MyS3Output"}
}

See Also• ShellCommandActivity (p. 176)

• CopyActivity (p. 180)

• EmrCluster (p. 209)

HiveActivity
Runs a Hive query on an Amazon EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input slots in the HiveActivity object. For Amazon S3 inputs, the dataFormat field is used to create the Hive column names. For MySQL (Amazon RDS) inputs, the column names from the SQL query are used to create the Hive column names.

SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere

This object includes the following fields.

Name: scriptUri
Description: The location of the Hive script to run (for example, s3://scriptLocation).
Type: String
Required: No

Name: hiveScript
Description: The Hive script to run.
Type: String
Required: No

Note
You must specify either a hiveScript value or a scriptUri value; you do not need to specify both.

This object includes the following slots from the Activity object.

RequiredTypeDescriptionName

NoSnsAlarm (p. 213)object reference

An action to run when the current instancefails.

onFail

NoSnsAlarm (p. 213)object reference

An email alarm to use when the object's runsucceeds.

onSuccess

NoObject referenceA condition that must be met before theobject can run. To specify multipleconditions, add multiple preconditionslots. The activity cannot run until all itsconditions are met.

precondition

YesSchedule (p. 163)object reference

A schedule of the object. A common use isto specify a time schedule that correlates tothe schedule for the object.

This slot overrides the schedule slotincluded from SchedulableObject, whichis optional.

schedule

This object includes the following slots from RunnableObject.

RequiredTypeDescriptionName

NoStringThe worker group. This is used for routingtasks.

workerGroup

YesPeriodThe timeout duration between two retryattempts. The default is 10 minutes.

retryDelay

NoIntegerThe maximum number of times to retry theaction.

maximumRetries

NoSnsAlarm (p. 213)object reference

The email alarm to use when the object's runis late.

onLateNotify

NoBooleanIndicates whether all pending or unscheduledtasks should be killed if they are late.

onLateKill

NoPeriodThe period in which the object run must start.If the activity does not start within thescheduled start time plus this time interval,it is considered late.

lateAfterTimeout


NoPeriodThe period for successive calls from TaskRunner to the ReportTaskProgress API. IfTask Runner, or other code that isprocessing the tasks, does not reportprogress within the specified period, theactivity can be retried.

reportProgressTimeout

NoPeriodThe timeout for an activity. If an activity doesnot complete within the start time plus thistime interval, AWS Data Pipeline marks theattempt as failed and your retry settingsdetermine the next steps taken.

attemptTimeout

NoStringThe location in Amazon S3 to store log filesgenerated by Task Runner when performingwork for this object.

logUri

NoDateTimeThe last time that Task Runner, or other codethat is processing the tasks, called theReportTaskProgress API.

@reportProgressTime

NoSchedulable objectreference

Record of the currently scheduled instanceobjects

@activeInstances

NoDateTime (read-only)The last run of the object. This is a runtimeslot.

@lastRun

NoObject reference(read-only)

The currently scheduled instance objects.This is a runtime slot.

@scheduledPhysicalObjects

NoString (read-only)The status of this object. This is a runtimeslot. Possible values are: pending,checking_preconditions, running,waiting_on_runner, successful, andfailed.

@status

NoInteger (read-only)The number of attempted runs remainingbefore setting the status of this object tofailed. This is a runtime slot.

@triesLeft

NoDateTime (read-only)The date and time that the scheduled runactually started. This is a runtime slot.

@actualStartTime

NoDateTime (read-only)The date and time that the scheduled runactually ended. This is a runtime slot.

@actualEndTime

NoString (read-only)If the object failed, the error code. This is aruntime slot.

errorCode

NoString (read-only)If the object failed, the error message. Thisis a runtime slot.

errorMessage

NoString (read-only)If the object failed, the error stack trace.errorStackTrace

NoDateTimeThe date and time that the run wasscheduled to start.

@scheduledStartTime

NoDateTimeThe date and time that the run wasscheduled to end.

@scheduledEndTime


NoObject reference(read-only)

The component from which this instance iscreated.

@componentParent

NoObject reference(read-only)

The latest attempt on the given instance.@headAttempt

NoObject reference(read-only)

The resource instance on which the givenactivity/precondition attempt is being run.

@resource

NoStringThe status most recently reported from theactivity.

activityStatus

This object includes the following slots from SchedulableObject.

RequiredTypeDescriptionName

NoSchedule (p. 163)object reference

A schedule of the object. A common use isto specify a time schedule that correlates tothe schedule for the object.

schedule

NoAllowed values are“cron” or “timeseries”.Defaults to"timeseries".

Schedule type allows you to specify whetherthe objects in your pipeline definition shouldbe scheduled at the beginning of interval orend of the interval. Time Series StyleScheduling means instances are scheduledat the end of each interval and Cron StyleScheduling means instances are scheduledat the beginning of each interval.

scheduleType

NoResource objectreference

The computational resource to run theactivity or command. For example, anAmazon EC2 instance or Amazon EMRcluster.

runsOn

Example
The following is an example of this object type. This object references three other objects that you would define in the same pipeline definition file. CopyPeriod is a Schedule object, and InputData and OutputData are data node objects. The hiveScript value shown here is illustrative.

{
  "id" : "MyHiveActivity",
  "type" : "HiveActivity",
  "schedule" : {"ref" : "CopyPeriod"},
  "input" : {"ref" : "InputData"},
  "output" : {"ref" : "OutputData"},
  "hiveScript" : "INSERT OVERWRITE TABLE ${output1} select * from ${input1};"
}

See Also• ShellCommandActivity (p. 176)

• EmrActivity (p. 184)


ShellCommandPrecondition
A Unix/Linux shell command that can be executed as a precondition.

SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere

This object includes the following fields.

Name: command
Description: The command to run. This value and any associated parameters must function in the environment from which you are running the Task Runner.
Type: String
Required: Yes

Name: scriptUri
Description: An Amazon S3 URI path for a file to download and run as a shell command. Only one scriptUri or command field should be present.
Type: A valid Amazon S3 URI
Required: No

This object includes the following slots from the Precondition object.

RequiredTypeDescriptionName

NoIntegerSpecifies the maximum number of times thata precondition is retried.

preconditionMaximumRetries

NoObject reference(read-only)

The activity or data node for which thisprecondition is being checked. This is aruntime slot.

node


This object includes the following slots from RunnableObject.

RequiredTypeDescriptionName

NoStringThe worker group. This is used for routingtasks.

workerGroup

YesPeriodThe timeout duration between two retryattempts. The default is 10 minutes.

retryDelay

NoIntegerThe maximum number of times to retry theaction.

maximumRetries

NoSnsAlarm (p. 213)object reference

The email alarm to use when the object's runis late.

onLateNotify

NoBooleanIndicates whether all pending or unscheduledtasks should be killed if they are late.

onLateKill

NoPeriodThe period in which the object run must start.If the activity does not start within thescheduled start time plus this time interval,it is considered late.

lateAfterTimeout

NoPeriodThe period for successive calls from TaskRunner to the ReportTaskProgress API. IfTask Runner, or other code that isprocessing the tasks, does not reportprogress within the specified period, theactivity can be retried.

reportProgressTimeout

NoPeriodThe timeout for an activity. If an activity doesnot complete within the start time plus thistime interval, AWS Data Pipeline marks theattempt as failed and your retry settingsdetermine the next steps taken.

attemptTimeout

NoStringThe location in Amazon S3 to store log filesgenerated by Task Runner when performingwork for this object.

logUri

NoDateTimeThe last time that Task Runner, or other codethat is processing the tasks, called theReportTaskProgress API.

@reportProgressTime

NoSchedulable objectreference

Record of the currently scheduled instanceobjects

@activeInstances

NoDateTime (read-only)The last run of the object. This is a runtimeslot.

@lastRun

NoObject reference(read-only)

The currently scheduled instance objects.This is a runtime slot.

@scheduledPhysicalObjects

NoString (read-only)The status of this object. This is a runtimeslot. Possible values are: pending,checking_preconditions, running,waiting_on_runner, successful, andfailed.

@status


NoInteger (read-only)The number of attempted runs remainingbefore setting the status of this object tofailed. This is a runtime slot.

@triesLeft

NoDateTime (read-only)The date and time that the scheduled runactually started. This is a runtime slot.

@actualStartTime

NoDateTime (read-only)The date and time that the scheduled runactually ended. This is a runtime slot.

@actualEndTime

NoString (read-only)If the object failed, the error code. This is aruntime slot.

errorCode

NoString (read-only)If the object failed, the error message. Thisis a runtime slot.

errorMessage

NoString (read-only)If the object failed, the error stack trace.errorStackTrace

NoDateTimeThe date and time that the run wasscheduled to start.

@scheduledStartTime

NoDateTimeThe date and time that the run wasscheduled to end.

@scheduledEndTime

NoObject reference(read-only)

The component from which this instance iscreated.

@componentParent

NoObject reference(read-only)

The latest attempt on the given instance.@headAttempt

NoObject reference(read-only)

The resource instance on which the givenactivity/precondition attempt is being run.

@resource

NoStringThe status most recently reported from theactivity.

activityStatus

Example
The following is an example of this object type.

{
  "id" : "VerifyDataReadiness",
  "type" : "ShellCommandPrecondition",
  "command" : "perl check-data-ready.pl"
}

See Also• ShellCommandActivity (p. 176)

• Exists (p. 194)

Exists
Checks whether a data node object exists.


SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere

This object includes the following slots from the Precondition object.

RequiredTypeDescriptionName

NoIntegerSpecifies the maximum number of times thata precondition is retried.

preconditionMaximumRetries

NoObject reference(read-only)

The activity or data node for which thisprecondition is being checked. This is aruntime slot.

node

This object includes the following slots from RunnableObject.

RequiredTypeDescriptionName

NoStringThe worker group. This is used for routingtasks.

workerGroup

YesPeriodThe timeout duration between two retryattempts. The default is 10 minutes.

retryDelay

NoIntegerThe maximum number of times to retry theaction.

maximumRetries

NoSnsAlarm (p. 213)object reference

The email alarm to use when the object's runis late.

onLateNotify

NoBooleanIndicates whether all pending or unscheduledtasks should be killed if they are late.

onLateKill

NoPeriodThe period in which the object run must start.If the activity does not start within thescheduled start time plus this time interval,it is considered late.

lateAfterTimeout


NoPeriodThe period for successive calls from TaskRunner to the ReportTaskProgress API. IfTask Runner, or other code that isprocessing the tasks, does not reportprogress within the specified period, theactivity can be retried.

reportProgressTimeout

NoPeriodThe timeout for an activity. If an activity doesnot complete within the start time plus thistime interval, AWS Data Pipeline marks theattempt as failed and your retry settingsdetermine the next steps taken.

attemptTimeout

NoStringThe location in Amazon S3 to store log filesgenerated by Task Runner when performingwork for this object.

logUri

NoDateTimeThe last time that Task Runner, or other codethat is processing the tasks, called theReportTaskProgress API.

@reportProgressTime

NoSchedulable objectreference

Record of the currently scheduled instanceobjects

@activeInstances

NoDateTime (read-only)The last run of the object. This is a runtimeslot.

@lastRun

NoObject reference(read-only)

The currently scheduled instance objects.This is a runtime slot.

@scheduledPhysicalObjects

NoString (read-only)The status of this object. This is a runtimeslot. Possible values are: pending,checking_preconditions, running,waiting_on_runner, successful, andfailed.

@status

NoInteger (read-only)The number of attempted runs remainingbefore setting the status of this object tofailed. This is a runtime slot.

@triesLeft

NoDateTime (read-only)The date and time that the scheduled runactually started. This is a runtime slot.

@actualStartTime

NoDateTime (read-only)The date and time that the scheduled runactually ended. This is a runtime slot.

@actualEndTime

NoString (read-only)If the object failed, the error code. This is aruntime slot.

errorCode

NoString (read-only)If the object failed, the error message. Thisis a runtime slot.

errorMessage

NoString (read-only)If the object failed, the error stack trace.errorStackTrace

NoDateTimeThe date and time that the run wasscheduled to start.

@scheduledStartTime

NoDateTimeThe date and time that the run wasscheduled to end.

@scheduledEndTime


NoObject reference(read-only)

The component from which this instance iscreated.

@componentParent

NoObject reference(read-only)

The latest attempt on the given instance.@headAttempt

NoObject reference(read-only)

The resource instance on which the givenactivity/precondition attempt is being run.

@resource

NoStringThe status most recently reported from theactivity.

activityStatus

Example
The following is an example of this object type. The InputData object references this object, Ready, plus another object that you'd define in the same pipeline definition file. CopyPeriod is a Schedule object.

{
  "id" : "InputData",
  "type" : "S3DataNode",
  "schedule" : {"ref" : "CopyPeriod"},
  "filePath" : "s3://test/InputData/#{@scheduledStartTime.format('YYYY-MM-dd-hh:mm')}.csv",
  "precondition" : {"ref" : "Ready"}
},
{
  "id" : "Ready",
  "type" : "Exists"
}

See Also• ShellCommandPrecondition (p. 192)

S3KeyExists
Checks whether a key exists in an Amazon S3 data node.

SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name


YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere

This object includes the following slots from the Precondition object.

RequiredTypeDescriptionName

NoIntegerSpecifies the maximum number of times thata precondition is retried.

preconditionMaximumRetries

NoObject reference(read-only)

The activity or data node for which thisprecondition is being checked. This is aruntime slot.

node

This object includes the following fields.

Name: s3Key
Description: The Amazon S3 key to check for existence.
Type: String
Required: Yes

This object includes the following slots from RunnableObject.

RequiredTypeDescriptionName

NoStringThe worker group. This is used for routingtasks.

workerGroup

YesPeriodThe timeout duration between two retryattempts. The default is 10 minutes.

retryDelay

NoIntegerThe maximum number of times to retry theaction.

maximumRetries

NoSnsAlarm (p. 213)object reference

The email alarm to use when the object's runis late.

onLateNotify

NoBooleanIndicates whether all pending or unscheduledtasks should be killed if they are late.

onLateKill

NoPeriodThe period in which the object run must start.If the activity does not start within thescheduled start time plus this time interval,it is considered late.

lateAfterTimeout


NoPeriodThe period for successive calls from TaskRunner to the ReportTaskProgress API. IfTask Runner, or other code that isprocessing the tasks, does not reportprogress within the specified period, theactivity can be retried.

reportProgressTimeout

NoPeriodThe timeout for an activity. If an activity doesnot complete within the start time plus thistime interval, AWS Data Pipeline marks theattempt as failed and your retry settingsdetermine the next steps taken.

attemptTimeout

NoStringThe location in Amazon S3 to store log filesgenerated by Task Runner when performingwork for this object.

logUri

NoDateTimeThe last time that Task Runner, or other codethat is processing the tasks, called theReportTaskProgress API.

@reportProgressTime

NoSchedulable objectreference

Record of the currently scheduled instanceobjects

@activeInstances

NoDateTime (read-only)The last run of the object. This is a runtimeslot.

@lastRun

NoObject reference(read-only)

The currently scheduled instance objects.This is a runtime slot.

@scheduledPhysicalObjects

NoString (read-only)The status of this object. This is a runtimeslot. Possible values are: pending,checking_preconditions, running,waiting_on_runner, successful, andfailed.

@status

NoInteger (read-only)The number of attempted runs remainingbefore setting the status of this object tofailed. This is a runtime slot.

@triesLeft

NoDateTime (read-only)The date and time that the scheduled runactually started. This is a runtime slot.

@actualStartTime

NoDateTime (read-only)The date and time that the scheduled runactually ended. This is a runtime slot.

@actualEndTime

NoString (read-only)If the object failed, the error code. This is aruntime slot.

errorCode

NoString (read-only)If the object failed, the error message. Thisis a runtime slot.

errorMessage

NoString (read-only)If the object failed, the error stack trace.errorStackTrace

NoDateTimeThe date and time that the run wasscheduled to start.

@scheduledStartTime

NoDateTimeThe date and time that the run wasscheduled to end.

@scheduledEndTime


NoObject reference(read-only)

The component from which this instance iscreated.

@componentParent

NoObject reference(read-only)

The latest attempt on the given instance.@headAttempt

NoObject reference(read-only)

The resource instance on which the givenactivity/precondition attempt is being run.

@resource

NoStringThe status most recently reported from theactivity.

activityStatus
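Example
The following is a minimal sketch of this object type; the bucket, path, and key name are hypothetical placeholders, and you would reference the precondition from a data node or activity in the same pipeline definition.

{
  "id" : "ManifestReady",
  "type" : "S3KeyExists",
  "s3Key" : "s3://mybucket/mypath/manifest.csv"
}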

See Also• ShellCommandPrecondition (p. 192)

S3PrefixNotEmpty
A precondition to check that the Amazon S3 objects with the given prefix (represented as a URI) are present.

SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere

This object includes the following slots from the Precondition object.

RequiredTypeDescriptionName

NoIntegerSpecifies the maximum number of times thata precondition is retried.

preconditionMaximumRetries


NoObject reference(read-only)

The activity or data node for which thisprecondition is being checked. This is aruntime slot.

node

This object includes the following fields.

Name: s3Prefix
Description: The Amazon S3 prefix to check for the existence of objects.
Type: String
Required: Yes

This object includes the following slots from RunnableObject.

RequiredTypeDescriptionName

NoStringThe worker group. This is used for routingtasks.

workerGroup

YesPeriodThe timeout duration between two retryattempts. The default is 10 minutes.

retryDelay

NoIntegerThe maximum number of times to retry theaction.

maximumRetries

NoSnsAlarm (p. 213)object reference

The email alarm to use when the object's runis late.

onLateNotify

NoBooleanIndicates whether all pending or unscheduledtasks should be killed if they are late.

onLateKill

NoPeriodThe period in which the object run must start.If the activity does not start within thescheduled start time plus this time interval,it is considered late.

lateAfterTimeout

NoPeriodThe period for successive calls from TaskRunner to the ReportTaskProgress API. IfTask Runner, or other code that isprocessing the tasks, does not reportprogress within the specified period, theactivity can be retried.

reportProgressTimeout

NoPeriodThe timeout for an activity. If an activity doesnot complete within the start time plus thistime interval, AWS Data Pipeline marks theattempt as failed and your retry settingsdetermine the next steps taken.

attemptTimeout

NoStringThe location in Amazon S3 to store log filesgenerated by Task Runner when performingwork for this object.

logUri

NoDateTimeThe last time that Task Runner, or other codethat is processing the tasks, called theReportTaskProgress API.

@reportProgressTime


NoSchedulable objectreference

Record of the currently scheduled instanceobjects

@activeInstances

NoDateTime (read-only)The last run of the object. This is a runtimeslot.

@lastRun

NoObject reference(read-only)

The currently scheduled instance objects.This is a runtime slot.

@scheduledPhysicalObjects

NoString (read-only)The status of this object. This is a runtimeslot. Possible values are: pending,checking_preconditions, running,waiting_on_runner, successful, andfailed.

@status

NoInteger (read-only)The number of attempted runs remainingbefore setting the status of this object tofailed. This is a runtime slot.

@triesLeft

NoDateTime (read-only)The date and time that the scheduled runactually started. This is a runtime slot.

@actualStartTime

NoDateTime (read-only)The date and time that the scheduled runactually ended. This is a runtime slot.

@actualEndTime

NoString (read-only)If the object failed, the error code. This is aruntime slot.

errorCode

NoString (read-only)If the object failed, the error message. Thisis a runtime slot.

errorMessage

NoString (read-only)If the object failed, the error stack trace.errorStackTrace

NoDateTimeThe date and time that the run wasscheduled to start.

@scheduledStartTime

NoDateTimeThe date and time that the run wasscheduled to end.

@scheduledEndTime

NoObject reference(read-only)

The component from which this instance iscreated.

@componentParent

NoObject reference(read-only)

The latest attempt on the given instance.@headAttempt

NoObject reference(read-only)

The resource instance on which the givenactivity/precondition attempt is being run.

@resource

NoStringThe status most recently reported from theactivity.

activityStatus

Example
The following is an example of this object type using required, optional, and expression fields.

{
  "id": "InputReady",
  "type": "S3PrefixNotEmpty",
  "role": "test-role",
  "s3Prefix": "#{node.filePath}"
}

See Also• ShellCommandPrecondition (p. 192)

RdsSqlPrecondition
A precondition that executes a query to verify the readiness of data within Amazon RDS. Specified conditions are combined by a logical AND operation.

Important
You must grant permissions to Task Runner to access Amazon RDS when using an RdsSqlPrecondition, as described in Grant Amazon RDS Permissions to Task Runner (p. 23).

SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere

This object includes the following fields.

Name: query
Description: A valid SQL query that should return one row with one column (a scalar value).
Type: String
Required: Yes

Name: username
Description: The user name to connect with.
Type: String
Required: Yes

Name: *password
Description: The password to connect with. The asterisk instructs AWS Data Pipeline to encrypt the password.
Type: String
Required: Yes

Name: rdsInstanceId
Description: The instance ID of the Amazon RDS instance to connect to.
Type: String
Required: Yes

Name: database
Description: The logical database to connect to.
Type: String
Required: Yes

Name: equalTo
Description: This precondition is true if the value returned by the query is equal to this value.
Type: Integer
Required: No

Name: lessThan
Description: This precondition is true if the value returned by the query is less than this value.
Type: Integer
Required: No

Name: greaterThan
Description: This precondition is true if the value returned by the query is greater than this value.
Type: Integer
Required: No

Name: isTrue
Description: This precondition is true if the Boolean value returned by the query is true.
Type: Boolean
Required: No
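Example
The following is a minimal sketch of this object type; the instance identifier, database, credentials, and query shown here are hypothetical placeholders.

{
  "id" : "RowCountReady",
  "type" : "RdsSqlPrecondition",
  "rdsInstanceId" : "my-rds-instance",
  "database" : "mydatabase",
  "username" : "myusername",
  "*password" : "mypassword",
  "query" : "select count(*) from mytable",
  "greaterThan" : "0"
}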

DynamoDBTableExists
A precondition to check that the Amazon DynamoDB table exists.

SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere

This object includes the following fields.

Name: tableName
Description: The Amazon DynamoDB table to check.
Type: String
Required: Yes
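Example
The following is a minimal sketch of this object type; the table name is a hypothetical placeholder.

{
  "id" : "TableReady",
  "type" : "DynamoDBTableExists",
  "tableName" : "MyDynamoDBTable"
}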

DynamoDBDataExists
A precondition to check that data exists in an Amazon DynamoDB table.


SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere

This object includes the following fields.

Name: tableName
Description: The Amazon DynamoDB table to check.
Type: String
Required: Yes
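Example
The following is a minimal sketch of this object type; the table name is a hypothetical placeholder.

{
  "id" : "DataReady",
  "type" : "DynamoDBDataExists",
  "tableName" : "MyDynamoDBTable"
}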

Ec2Resource
Represents the configuration of an Amazon EC2 instance that performs the work defined by a pipeline activity.

SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere


This object includes the following fields.

Name: instanceType
Description: The type of EC2 instance to use for the resource pool. The default value is m1.small.
Type: String
Required: No

Name: instanceCount
Description: The number of instances to use for the resource pool. The default value is 1.
Type: Integer
Required: No

Name: minInstanceCount
Description: The minimum number of EC2 instances for the pool. The default value is 1.
Type: Integer
Required: No

Name: securityGroups
Description: The EC2 security group to use for the instances in the resource pool.
Type: String
Required: No

Name: imageId
Description: The AMI to use for the EC2 instances. The default value is ami-1624987f, which we recommend using. For more information, see Amazon Machine Images (AMIs).
Type: String
Required: No

Name: keyPair
Description: The Amazon EC2 key pair to use to log onto the EC2 instance. The default action is not to attach a key pair to the EC2 instance.
Type: String
Required: No

Name: role
Description: The IAM role to use to create the EC2 instance.
Type: String
Required: Yes

Name: resourceRole
Description: The IAM role to use to control the resources that the EC2 instance can access.
Type: String
Required: Yes

This object includes the following slots from the Resource object.

RequiredTypeDescriptionName

YesPeriodThe number of hours to wait beforeterminating the resource.

terminateAfter

YesPeriodThe unique identifier for the resource.@resourceId

NoStringThe current status of the resource, such aschecking_preconditions, creating,shutting_down, running, failed, timed_out,cancelled, or paused.

@resourceStatus

NoStringThe reason for the failure to create theresource.

@failureReason

NoDateTimeThe time when this resource was created.@resourceCreationTime

This object includes the following slots from RunnableObject.

RequiredTypeDescriptionName

NoStringThe worker group. This is used for routingtasks.

workerGroup


YesPeriodThe timeout duration between two retryattempts. The default is 10 minutes.

retryDelay

NoIntegerThe maximum number of times to retry theaction.

maximumRetries

NoSnsAlarm (p. 213)object reference

The email alarm to use when the object's runis late.

onLateNotify

NoBooleanIndicates whether all pending or unscheduledtasks should be killed if they are late.

onLateKill

NoPeriodThe period in which the object run must start.If the activity does not start within thescheduled start time plus this time interval,it is considered late.

lateAfterTimeout

NoPeriodThe period for successive calls from TaskRunner to the ReportTaskProgress API. IfTask Runner, or other code that isprocessing the tasks, does not reportprogress within the specified period, theactivity can be retried.

reportProgressTimeout

NoPeriodThe timeout for an activity. If an activity doesnot complete within the start time plus thistime interval, AWS Data Pipeline marks theattempt as failed and your retry settingsdetermine the next steps taken.

attemptTimeout

NoStringThe location in Amazon S3 to store log filesgenerated by Task Runner when performingwork for this object.

logUri

NoDateTimeThe last time that Task Runner, or other codethat is processing the tasks, called theReportTaskProgress API.

@reportProgressTime

NoSchedulable objectreference

Record of the currently scheduled instanceobjects

@activeInstances

NoDateTime (read-only)The last run of the object. This is a runtimeslot.

@lastRun

NoObject reference(read-only)

The currently scheduled instance objects.This is a runtime slot.

@scheduledPhysicalObjects

NoString (read-only)The status of this object. This is a runtimeslot. Possible values are: pending,checking_preconditions, running,waiting_on_runner, successful, andfailed.

@status

NoInteger (read-only)The number of attempted runs remainingbefore setting the status of this object tofailed. This is a runtime slot.

@triesLeft

NoDateTime (read-only)The date and time that the scheduled runactually started. This is a runtime slot.

@actualStartTime


NoDateTime (read-only)The date and time that the scheduled runactually ended. This is a runtime slot.

@actualEndTime

NoString (read-only)If the object failed, the error code. This is aruntime slot.

errorCode

NoString (read-only)If the object failed, the error message. Thisis a runtime slot.

errorMessage

NoString (read-only)If the object failed, the error stack trace.errorStackTrace

NoDateTimeThe date and time that the run wasscheduled to start.

@scheduledStartTime

NoDateTimeThe date and time that the run wasscheduled to end.

@scheduledEndTime

NoObject reference(read-only)

The component from which this instance iscreated.

@componentParent

NoObject reference(read-only)

The latest attempt on the given instance.@headAttempt

NoObject reference(read-only)

The resource instance on which the givenactivity/precondition attempt is being run.

@resource

NoStringThe status most recently reported from theactivity.

activityStatus

This object includes the following slots from SchedulableObject.

RequiredTypeDescriptionName

NoSchedule (p. 163)object reference

A schedule of the object. A common use isto specify a time schedule that correlates tothe schedule for the object.

schedule

NoAllowed values are“cron” or “timeseries”.Defaults to"timeseries".

Schedule type allows you to specify whetherthe objects in your pipeline definition shouldbe scheduled at the beginning of interval orend of the interval. Time Series StyleScheduling means instances are scheduledat the end of each interval and Cron StyleScheduling means instances are scheduledat the beginning of each interval.

scheduleType

NoResource objectreference

The computational resource to run theactivity or command. For example, anAmazon EC2 instance or Amazon EMRcluster.

runsOn

Example
The following is an example of this object type. It launches an EC2 instance and shows some optional fields set.

{
  "id": "MyEC2Resource",
  "type": "Ec2Resource",
  "actionOnTaskFailure": "terminate",
  "actionOnResourceFailure": "retryAll",
  "maximumRetries": "1",
  "role": "test-role",
  "resourceRole": "test-role",
  "instanceType": "m1.medium",
  "instanceCount": "1",
  "securityGroups": [
    "test-group",
    "default"
  ],
  "keyPair": "test-pair"
}

EmrCluster
Represents the configuration of an Amazon EMR job flow. This object is used by EmrActivity (p. 184) to launch a job flow.

SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere

This object includes the following fields.

Name: masterInstanceType
Description: The type of EC2 instance to use for the master node. The default value is m1.small.
Type: String
Required: No

Name: coreInstanceType
Description: The type of EC2 instance to use for core nodes. The default value is m1.small.
Type: String
Required: No

Name: coreInstanceCount
Description: The number of core nodes to use for the job flow. The default value is 1.
Type: String
Required: No

Name: taskInstanceType
Description: The type of EC2 instance to use for task nodes.
Type: String
Required: No

Name: taskInstanceCount
Description: The number of task nodes to use for the job flow. The default value is 1.
Type: String
Required: No

Name: keyPair
Description: The Amazon EC2 key pair to use to log onto the master node of the job flow. The default action is not to attach a key pair to the job flow.
Type: String
Required: No

Name: hadoopVersion
Description: The version of Hadoop to use in the job flow. The default value is 0.20. For more information about the Hadoop versions supported by Amazon EMR, see Supported Hadoop Versions.
Type: String
Required: No

Name: bootstrapAction
Description: An action to run when the job flow starts. You can specify comma-separated arguments. To specify multiple actions, up to 255, add multiple bootstrapAction fields. The default behavior is to start the job flow without any bootstrap actions.
Type: String array
Required: No

Name: enableDebugging
Description: Enables debugging on the job flow.
Type: String
Required: No

Name: logUri
Description: The location in Amazon S3 to store log files from the job flow.
Type: String
Required: No

This object includes the following slots from the Resource object.

RequiredTypeDescriptionName

YesPeriodThe number of hours to wait beforeterminating the resource.

terminateAfter

YesPeriodThe unique identifier for the resource.@resourceId

NoStringThe current status of the resource, such aschecking_preconditions, creating,shutting_down, running, failed, timed_out,cancelled, or paused.

@resourceStatus

NoStringThe reason for the failure to create theresource.

@failureReason

NoDateTimeThe time when this resource was created.@resourceCreationTime


This object includes the following slots from RunnableObject.

RequiredTypeDescriptionName

NoStringThe worker group. This is used for routingtasks.

workerGroup

YesPeriodThe timeout duration between two retryattempts. The default is 10 minutes.

retryDelay

NoIntegerThe maximum number of times to retry theaction.

maximumRetries

NoSnsAlarm (p. 213)object reference

The email alarm to use when the object's runis late.

onLateNotify

NoBooleanIndicates whether all pending or unscheduledtasks should be killed if they are late.

onLateKill

NoPeriodThe period in which the object run must start.If the activity does not start within thescheduled start time plus this time interval,it is considered late.

lateAfterTimeout

NoPeriodThe period for successive calls from TaskRunner to the ReportTaskProgress API. IfTask Runner, or other code that isprocessing the tasks, does not reportprogress within the specified period, theactivity can be retried.

reportProgressTimeout

NoPeriodThe timeout for an activity. If an activity doesnot complete within the start time plus thistime interval, AWS Data Pipeline marks theattempt as failed and your retry settingsdetermine the next steps taken.

attemptTimeout

NoStringThe location in Amazon S3 to store log filesgenerated by Task Runner when performingwork for this object.

logUri

NoDateTimeThe last time that Task Runner, or other codethat is processing the tasks, called theReportTaskProgress API.

@reportProgressTime

NoSchedulable objectreference

Record of the currently scheduled instanceobjects

@activeInstances

NoDateTime (read-only)The last run of the object. This is a runtimeslot.

@lastRun

NoObject reference(read-only)

The currently scheduled instance objects.This is a runtime slot.

@scheduledPhysicalObjects

NoString (read-only)The status of this object. This is a runtimeslot. Possible values are: pending,checking_preconditions, running,waiting_on_runner, successful, andfailed.

@status


NoInteger (read-only)The number of attempted runs remainingbefore setting the status of this object tofailed. This is a runtime slot.

@triesLeft

NoDateTime (read-only)The date and time that the scheduled runactually started. This is a runtime slot.

@actualStartTime

NoDateTime (read-only)The date and time that the scheduled runactually ended. This is a runtime slot.

@actualEndTime

NoString (read-only)If the object failed, the error code. This is aruntime slot.

errorCode

NoString (read-only)If the object failed, the error message. Thisis a runtime slot.

errorMessage

NoString (read-only)If the object failed, the error stack trace.errorStackTrace

NoDateTimeThe date and time that the run wasscheduled to start.

@scheduledStartTime

NoDateTimeThe date and time that the run wasscheduled to end.

@scheduledEndTime

NoObject reference(read-only)

The component from which this instance iscreated.

@componentParent

NoObject reference(read-only)

The latest attempt on the given instance.@headAttempt

NoObject reference(read-only)

The resource instance on which the givenactivity/precondition attempt is being run.

@resource

NoStringThe status most recently reported from theactivity.

activityStatus

This object includes the following slots from SchedulableObject.

RequiredTypeDescriptionName

NoSchedule (p. 163)object reference

A schedule of the object. A common use isto specify a time schedule that correlates tothe schedule for the object.

schedule

NoAllowed values are“cron” or “timeseries”.Defaults to"timeseries".

Schedule type allows you to specify whetherthe objects in your pipeline definition shouldbe scheduled at the beginning of interval orend of the interval. Time Series StyleScheduling means instances are scheduledat the end of each interval and Cron StyleScheduling means instances are scheduledat the beginning of each interval.

scheduleType

NoResource objectreference

The computational resource to run theactivity or command. For example, anAmazon EC2 instance or Amazon EMRcluster.

runsOn


Example
The following is an example of this object type. It launches an Amazon EMR job flow using AMI version 1.0 and Hadoop 0.20. The field names shown here match the slots documented above.

{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "hadoopVersion" : "0.20",
  "keyPair" : "myKeyPair",
  "masterInstanceType" : "m1.xlarge",
  "coreInstanceType" : "m1.small",
  "coreInstanceCount" : "10",
  "taskInstanceType" : "m1.small",
  "taskInstanceCount" : "10",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-hadoop,arg1,arg2,arg3",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-other-stuff,arg1,arg2"
}

See Also• EmrActivity (p. 184)

SnsAlarm
Sends an Amazon SNS notification message when an activity fails or finishes successfully.

SyntaxThe following slots are included in all objects.

RequiredTypeDescriptionName

YesStringThe ID of the object. IDs must be uniquewithin a pipeline definition.

id

NoStringThe optional, user-defined label of the object.If you do not provide a name for an object ina pipeline definition, AWS Data Pipelineautomatically duplicates the value of id.

name

YesStringThe type of object. Use one of the predefinedAWS Data Pipeline object types.

type

NoStringThe parent of the object.parent

NoString (read-only)The sphere of an object denotes its place inthe pipeline lifecycle, such as Pipeline,Component, Instance, or Attempt.

@sphere


This object includes the following fields.

Name: subject
Description: The subject line of the Amazon SNS notification message.
Type: String
Required: Yes

Name: message
Description: The body text of the Amazon SNS notification.
Type: String
Required: Yes

Name: topicArn
Description: The destination Amazon SNS topic ARN for the message.
Type: String
Required: Yes

This object includes the following slots from the Action object.

Name: node
Description: The node for which this action is being performed. This is a runtime slot.
Type: Object reference (read-only)
Required: No

Example
The following is an example of this object type. The values for node.input and node.output come from the data node or activity that references this object in its onSuccess field.

{
  "id" : "SuccessNotify",
  "type" : "SnsAlarm",
  "topicArn" : "arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic",
  "subject" : "COPY SUCCESS: #{node.@scheduledStartTime}",
  "message" : "Files were copied from #{node.input} to #{node.output}."
}


Command Line Reference

Before you read this section, you should be familiar with Using the Command Line Interface (p. 121).

This section is a detailed reference of the AWS Data Pipeline command line interface (CLI) commandsand parameters to interact with AWS Data Pipeline.

You can combine commands on a single command line. Commands are processed from left to right. You can use the --create and --id commands anywhere on the command line, but not together, and not more than once.
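For example, the following command line (which also appears in the --create examples later in this reference) is processed left to right: the pipeline is created first, and the definition file is then uploaded to it.

./datapipeline --create my-second-pipeline --put my-pipeline-file.json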

--cancel

Description

Cancels one or more specified objects from within a pipeline that is either currently running or ran previously.

To see the status of the canceled pipeline object, use --list-runs.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --cancel object_id --id pipeline_id [Common Options]

On Windows:

ruby datapipeline --cancel object_id --id pipeline_id [Common Options]


Options

object_id
  Required: Yes
  The identifier of the object to cancel. You can specify the name of a single object, or a comma-separated list of object identifiers.
  Example: o-06198791C436IEXAMPLE

--id pipeline_id
  Required: Yes
  The identifier of the pipeline.
  Example: --id df-00627471SOVYZEXAMPLE

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229).

Output

None.

Examples

The following example demonstrates how to list the objects of a previously run or currently running pipeline. Next, the example cancels an object of the pipeline. Finally, the example lists the results of the canceled object.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

./datapipeline --id df-00627471SOVYZEXAMPLE --cancel o-06198791C436IEXAMPLE

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

On Windows:

ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

ruby datapipeline --id df-00627471SOVYZEXAMPLE --cancel o-06198791C436IEXAMPLE

ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE
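Because object_id accepts a comma-separated list, several objects can be canceled in one command. The second object identifier below is hypothetical and is shown only to illustrate the list syntax.

./datapipeline --cancel o-06198791C436IEXAMPLE,o-06198791C437IEXAMPLE --id df-00627471SOVYZEXAMPLE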

Related Commands

• --delete (p. 218)

• --list-pipelines (p. 221)

• --list-runs (p. 222)


--create

Description

Creates a data pipeline with the specified name, but does not activate the pipeline.

There is a limit of 20 pipelines per AWS account.

To specify a pipeline definition file when you create the pipeline, use this command with the --put (p. 224) command.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --create name [Common Options]

On Windows:

ruby datapipeline --create name [Common Options]

Options

name
  Required: Yes
  The name of the pipeline.
  Example: my-pipeline

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229).

Output

Pipeline with name 'name' and id 'df-xxxxxxxxxxxxxxxxxxxx' created.
df-xxxxxxxxxxxxxxxxxxxx

The identifier of the newly created pipeline (df-xxxxxxxxxxxxxxxxxxxx). You must specify this identifier with the --id command whenever you issue a command that operates on the corresponding pipeline.

Examples

The following example creates the first pipeline without specifying a pipeline definition file, and creates the second pipeline with a pipeline definition file.

On Linux/Unix/Mac OS:


./datapipeline --create my-first-pipeline

./datapipeline --create my-second-pipeline --put my-pipeline-file.json

On Windows:

ruby datapipeline --create my-first-pipeline

ruby datapipeline --create my-second-pipeline --put my-pipeline-file.json

Related Commands

• --delete (p. 218)

• --list-pipelines (p. 221)

• --put (p. 224)

--delete

Description

Stops the specified data pipeline, and cancels its future runs.

This command removes the pipeline definition file and run history. This action is irreversible; you can't restart a deleted pipeline.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --delete --id pipeline_id [Common Options]

On Windows:

ruby datapipeline --delete --id pipeline_id [Common Options]

Options

--id pipeline_id
  Required: Yes
  The identifier of the pipeline.
  Example: --id df-00627471SOVYZEXAMPLE

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229).


Output

State of pipeline id 'df-xxxxxxxxxxxxxxxxxxxx' is currently 'state'
Deleted pipeline 'df-xxxxxxxxxxxxxxxxxxxx'

A message indicating that the pipeline was successfully deleted.

Examples

The following example deletes the pipeline with the identifier df-00627471SOVYZEXAMPLE.

On Linux/Unix/Mac OS:

./datapipeline --delete --id df-00627471SOVYZEXAMPLE

On Windows:

ruby datapipeline --delete --id df-00627471SOVYZEXAMPLE

Related Commands

• --create (p. 217)

• --list-pipelines (p. 221)

--get, --g

Description

Gets the pipeline definition file for the specified data pipeline and saves it to a file. If no file is specified, the file contents are written to standard output.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --get pipeline_definition_file --id pipeline_id --version pipeline_version [Common Options]

On Windows:

ruby datapipeline --get pipeline_definition_file --id pipeline_id --version pipeline_version [Common Options]


Options

--id pipeline_id
  Required: Yes
  The identifier of the pipeline.
  Example: --id df-00627471SOVYZEXAMPLE

pipeline_definition_file
  Required: No
  The full path to the output file that receives the pipeline definition.
  Default: standard output
  Example: my-pipeline.json

--version pipeline_version
  Required: No
  The version name of the pipeline.
  Example: --version active
  Example: --version latest

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229).

Output

If an output file is specified, the output is a pipeline definition file; otherwise, the contents of the pipeline definition are written to standard output.

Examples

The first command writes the definition to standard output (usually the terminal screen), and the second command writes the pipeline definition to the file my-pipeline.json.

On Linux/Unix/Mac OS:

./datapipeline --get --id df-00627471SOVYZEXAMPLE

./datapipeline --get my-pipeline.json --id df-00627471SOVYZEXAMPLE

On Windows:

ruby datapipeline --get --id df-00627471SOVYZEXAMPLE

ruby datapipeline --get my-pipeline.json --id df-00627471SOVYZEXAMPLE
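The --version option can be combined with --get to retrieve a particular version of the definition. The following sketch assumes you want the active version of the pipeline.

./datapipeline --get my-pipeline.json --version active --id df-00627471SOVYZEXAMPLE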

Related Commands

• --create (p. 217)

• --put (p. 224)


--help, --h

Description

Displays information about the commands provided by the CLI.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --help

On Windows:

ruby datapipeline --help

Options

None.

Output

A list of the commands used by the CLI, printed to standard output (typically the terminal window).

--list-pipelines

Description

Lists the pipelines that you have permission to access.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --list-pipelines

On Windows:

ruby datapipeline --list-pipelines

Options

None.


Related Commands

• --create (p. 217)

• --list-runs (p. 222)

--list-runs

Description

Lists the times the specified pipeline has run. You can optionally filter the complete list of results to include only the runs you are interested in.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id pipeline_id [filter] [Common Options]

On Windows:

ruby datapipeline --list-runs --id pipeline_id [filter] [Common Options]

Options

--id pipeline_id
  Required: Yes
  The identifier of the pipeline.

--status code
  Required: No
  Filters the list to include only runs in the specified statuses. The valid statuses are as follows: waiting, pending, cancelled, running, finished, failed, waiting_for_runner, and checking_preconditions.
  Example: --status running
  You can combine statuses as a comma-separated list.
  Example: --status pending,checking_preconditions

--failed
  Required: No
  Filters the list to include only runs in the failed state that started during the last 2 days and were scheduled to end within the last 15 days.

--running
  Required: No
  Filters the list to include only runs in the running state that started during the last 2 days and were scheduled to end within the last 15 days.

--start-interval date1,date2
  Required: No
  Filters the list to include only runs that started within the specified interval.

--schedule-interval date1,date2
  Required: No
  Filters the list to include only runs that are scheduled to start within the specified interval.

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229).

Output

A list of the times the specified pipeline has run and the status of each run. You can filter this list by the options you specify when you run the command.

Examples

The first command lists all the runs for the specified pipeline. The other commands show how to filter the complete list of runs using different options.

On Linux/Unix/Mac OS:

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --status PENDING

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --start-interval 2011-11-29T06:07:21,2011-12-06T06:07:21

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --schedule-interval 2011-11-29T06:07:21,2011-12-06T06:07:21

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --failed

./datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --running

On Windows:

ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE

ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --status PENDING

ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --start-interval 2011-11-29T06:07:21,2011-12-06T06:07:21

ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --schedule-interval 2011-11-29T06:07:21,2011-12-06T06:07:21

ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --failed

ruby datapipeline --list-runs --id df-00627471SOVYZEXAMPLE --running


Related Commands

• --list-pipelines (p. 221)

--put

Description

Uploads a pipeline definition file to AWS Data Pipeline for a new or existing pipeline, but does not activate the pipeline. Use the --activate parameter in a separate command when you want the pipeline to begin.

To specify a pipeline definition file at the time that you create the pipeline, use this command with the --create (p. 217) command.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --put pipeline_definition_file --id pipeline_id [Common Options]

On Windows:

ruby datapipeline --put pipeline_definition_file --id pipeline_id [Common Options]

Options

pipeline_definition_file
  Required: Yes
  The name of the pipeline definition file.
  Example: pipeline-definition-file.json

--id pipeline_id
  Required: Conditional
  The identifier of the pipeline. You must specify the identifier of the pipeline when updating an existing pipeline with a new pipeline definition file.
  Example: --id df-00627471SOVYZEXAMPLE

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229).

Output

A response indicating that the new definition was successfully loaded, or, in the case where you are also using the --create (p. 217) command, an indication that the new pipeline was successfully created.


Examples

The following examples show how to use --put to create a new pipeline (example one) and how to use --put and --id to add a definition file to a pipeline (example two) or update a preexisting pipeline definition file of a pipeline (example three).

On Linux/Unix/Mac OS:

./datapipeline --create my-pipeline --put my-pipeline-definition.json

./datapipeline --id df-00627471SOVYZEXAMPLE --put a-pipeline-definition.json

./datapipeline --id df-00627471SOVYZEXAMPLE --put my-updated-pipeline-definition.json

On Windows:

ruby datapipeline --create my-pipeline --put my-pipeline-definition.json

ruby datapipeline --id df-00627471SOVYZEXAMPLE --put a-pipeline-definition.json

ruby datapipeline --id df-00627471SOVYZEXAMPLE --put my-updated-pipeline-definition.json

Related Commands

• --create (p. 217)

• --get, --g (p. 219)

--activate

Description

Starts a new or existing pipeline.

To activate a pipeline at the time that you create it, use this command with the --create (p. 217) command.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --activate --id pipeline_id [Common Options]

On Windows:

ruby datapipeline --activate --id pipeline_id [Common Options]


Options

--id pipeline_id
  Required: Conditional
  The identifier of the pipeline to activate. You must specify the identifier of the pipeline unless you use this command together with the --create (p. 217) command.
  Example: --id df-00627471SOVYZEXAMPLE

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229).

Output

A response indicating that the pipeline was successfully activated, or, in the case where you are also using the --create (p. 217) command, an indication that the new pipeline was successfully created and activated.

Examples

The following example activates the pipeline with the identifier df-00627471SOVYZEXAMPLE.

On Linux/Unix/Mac OS:

./datapipeline --activate --id df-00627471SOVYZEXAMPLE

On Windows:

ruby datapipeline --activate --id df-00627471SOVYZEXAMPLE

Related Commands

• --create (p. 217)


• --get, --g (p. 219)

--rerun

Description

Reruns one or more specified objects from within a pipeline that is either currently running or has previously run. Resets the retry count of the object and then runs the object. It also tries to cancel the current attempt if an attempt is running.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --rerun object_id --id pipeline_id [Common Options]

On Windows:

ruby datapipeline --rerun object_id --id pipeline_id [Common Options]

Note
object_id can be a comma-separated list.

Options

object_id
  Required: Yes
  The identifier of the object.
  Example: o-06198791C436IEXAMPLE

--id pipeline_id
  Required: Yes
  The identifier of the pipeline.
  Example: --id df-00627471SOVYZEXAMPLE

Common Options

For more information, see Common Options for AWS Data Pipeline Commands (p. 229).

Output

None. To see the status of the object set to rerun, use --list-runs.

Examples

Reruns the specified object in the indicated pipeline.

On Linux/Unix/Mac OS:


./datapipeline --rerun o-06198791C436IEXAMPLE --id df-00627471SOVYZEXAMPLE

On Windows:

ruby datapipeline --rerun o-06198791C436IEXAMPLE --id df-00627471SOVYZEXAMPLE
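Because object_id can be a comma-separated list, several objects can be rerun at once. The second object identifier below is hypothetical and only illustrates the list syntax.

./datapipeline --rerun o-06198791C436IEXAMPLE,o-06198791C437IEXAMPLE --id df-00627471SOVYZEXAMPLE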

Related Commands

• --list-runs (p. 222)

• --list-pipelines (p. 221)

--validate

Description

Validates the pipeline definition for correct syntax. Also performs additional checks, such as a check for circular dependencies.

Syntax

On Linux/Unix/Mac OS:

./datapipeline --validate pipeline_definition_file

On Windows:

ruby datapipeline --validate pipeline_definition_file

Options

pipeline_definition_file
  Required: Yes
  The full path to the pipeline definition file to validate.
  Example: my-pipeline.json
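Examples

The following sketch validates a local pipeline definition file; the file name is illustrative.

On Linux/Unix/Mac OS:

./datapipeline --validate my-pipeline.json

On Windows:

ruby datapipeline --validate my-pipeline.json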


Common Options for AWS Data Pipeline Commands

The following set of options is accepted by most of the commands described in this guide.

--access-key aws_access_key
  Required: Conditional
  The access key ID associated with your AWS account. If you specify --access-key, you must also specify --secret-key.
  This option is required if you aren't using a JSON credentials file (see --credentials).
  Example: --access-key AKIAIOSFODNN7EXAMPLE
  For more information, see Setting Credentials for the AWS Data Pipeline Command Line Interface.

--credentials json_file
  Required: Conditional
  The location of the JSON file with your AWS credentials. You don't need to set this option if the JSON file is named credentials.json and it exists in either your user home directory or the directory where the AWS Data Pipeline CLI is installed. The CLI automatically finds the JSON file if it exists in either location.
  If you specify a credentials file (either using this option or by including credentials.json in one of its two supported locations), you don't need to use the --access-key and --secret-key options.
  Example: TBD
  For more information, see Setting Credentials for the AWS Data Pipeline Command Line Interface.

--endpoint url
  Required: Conditional
  The URL of the AWS Data Pipeline endpoint that the CLI should use to contact the web service. If you specify an endpoint both in a JSON file and with this command line option, the CLI ignores the endpoint set with this command line option.
  Example: TBD

--id pipeline_id
  Use the specified pipeline identifier.
  Example: --id df-00627471SOVYZEXAMPLE

--limit limit
  The field limit for the pagination of objects.
  Example: TBD

--secret-key aws_secret_key
  Required: Conditional
  The secret access key associated with your AWS account. If you specify --secret-key, you must also specify --access-key.
  This option is required if you aren't using a JSON credentials file (see --credentials).
  Example: --secret-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  For more information, see Setting Credentials for the AWS Data Pipeline Command Line Interface.

--timeout seconds
  Required: No
  The number of seconds for the AWS Data Pipeline client to wait before timing out the HTTP connection to the AWS Data Pipeline web service.
  Example: --timeout 120

--t, --trace
  Required: No
  Prints detailed debugging output.

--v, --verbose
  Required: No
  Prints verbose output. This is useful for debugging.


Program AWS Data Pipeline

Topics

• Make an HTTP Request to AWS Data Pipeline (p. 231)

• Actions in AWS Data Pipeline (p. 234)

Make an HTTP Request to AWS Data Pipeline

If you don't use one of the AWS SDKs, you can perform AWS Data Pipeline operations over HTTP using the POST request method. The POST method requires you to specify the operation in the header of the request and provide the data for the operation in JSON format in the body of the request.

HTTP Header Contents

AWS Data Pipeline requires the following information in the header of an HTTP request:

• host The AWS Data Pipeline endpoint. For information about endpoints, see Regions and Endpoints.

• x-amz-date You must provide the time stamp in either the HTTP Date header or the AWS x-amz-date header. (Some HTTP client libraries don't let you set the Date header.) When an x-amz-date header is present, the system ignores any Date header during the request authentication.

The date must be specified in one of the following three formats, as specified in the HTTP/1.1 RFC:

• Sun, 06 Nov 1994 08:49:37 GMT (RFC 822, updated by RFC 1123)

• Sunday, 06-Nov-94 08:49:37 GMT (RFC 850, obsoleted by RFC 1036)

• Sun Nov 6 08:49:37 1994 (ANSI C asctime() format)

• Authorization The set of authorization parameters that AWS uses to ensure the validity and authenticity of the request. For more information about constructing this header, go to Signature Version 4 Signing Process.

• x-amz-target The destination service of the request and the operation for the data, in the format: <<serviceName>>_<<API version>>.<<operationName>>

For example, DataPipeline_20121129.ActivatePipeline

• content-type Specifies JSON and the version. For example, Content-Type: application/x-amz-json-1.0

The following is an example header for an HTTP request to activate a pipeline.


POST / HTTP/1.1
host: datapipeline.us-east-1.amazonaws.com
x-amz-date: Mon, 12 Nov 2012 17:49:52 GMT
x-amz-target: DataPipeline_20121129.ActivatePipeline
Authorization: AuthParams
Content-Type: application/x-amz-json-1.1
Content-Length: 39
Connection: Keep-Alive

HTTP Body Content

The body of an HTTP request contains the data for the operation specified in the header of the HTTP request. The data must be formatted according to the JSON data schema for each AWS Data Pipeline API. The AWS Data Pipeline JSON data schema defines the types of data and parameters (such as comparison operators and enumeration constants) available for each operation.

Format the Body of an HTTP request

Use the JSON data format to convey data values and data structure, simultaneously. Elements can be nested within other elements by using bracket notation. The following example shows a request for putting a pipeline definition consisting of three objects and their corresponding slots.

{"pipelineId": "df-06372391ZG65EXAMPLE", "pipelineObjects": [ {"id": "Default", "name": "Default", "slots": [ {"key": "workerGroup", "stringValue": "MyWorkerGroup"} ] }, {"id": "Schedule", "name": "Schedule", "slots": [ {"key": "startDateTime", "stringValue": "2012-09-25T17:00:00"}, {"key": "type", "stringValue": "Schedule"}, {"key": "period", "stringValue": "1 hour"}, {"key": "endDateTime", "stringValue": "2012-09-25T18:00:00"} ] }, {"id": "SayHello", "name": "SayHello", "slots": [ {"key": "type",

API Version 2012-10-29232

AWS Data Pipeline Developer GuideHTTP Body Content

"stringValue": "ShellCommandActivity"}, {"key": "command", "stringValue": "echo hello"}, {"key": "parent", "refValue": "Default"}, {"key": "schedule", "refValue": "Schedule"}

] } ]}

Handle the HTTP Response

Here are some important headers in the HTTP response, and how you should handle them in your application:

• HTTP/1.1: This header is followed by a status code. A code value of 200 indicates a successful operation. Any other value indicates an error.

• x-amzn-RequestId: This header contains a request ID that you can use if you need to troubleshoot a request with AWS Data Pipeline. An example of a request ID is K2QH8DNOU907N97FNA2GDLL8OBVV4KQNSO5AEMVJF66Q9ASUAAJG.

• x-amz-crc32: AWS Data Pipeline calculates a CRC32 checksum of the HTTP payload and returns this checksum in the x-amz-crc32 header. We recommend that you compute your own CRC32 checksum on the client side and compare it with the x-amz-crc32 header; if the checksums do not match, it might indicate that the data was corrupted in transit. If this happens, you should retry your request.

AWS SDK users do not need to manually perform this verification, because the SDKs compute the checksum of each reply from AWS Data Pipeline and automatically retry if a mismatch is detected.

Sample AWS Data Pipeline JSON Request and Response

The following example shows a request for creating a new pipeline, followed by the AWS Data Pipeline response, including the pipeline identifier of the newly created pipeline.

HTTP POST Request

POST / HTTP/1.1
host: datapipeline.us-east-1.amazonaws.com
x-amz-date: Mon, 12 Nov 2012 17:49:52 GMT
x-amz-target: DataPipeline_20121129.CreatePipeline
Authorization: AuthParams
Content-Type: application/x-amz-json-1.1
Content-Length: 50
Connection: Keep-Alive

{"name": "MyPipeline", "uniqueId": "12345ABCDEFG"}


AWS Data Pipeline Response

HTTP/1.1 200
x-amzn-RequestId: b16911ce-0774-11e2-af6f-6bc7a6be60d9
x-amz-crc32: 2215946753
Content-Type: application/x-amz-json-1.0
Content-Length: 2
Date: Mon, 16 Jan 2012 17:50:53 GMT

{"pipelineId": "df-06372391ZG65EXAMPLE"}

Actions in AWS Data Pipeline

• ActivatePipeline

• CreatePipeline

• DeletePipeline

• DescribeObjects

• DescribePipelines

• GetPipelineDefinition

• ListPipelines

• PollForTask

• PutPipelineDefinition

• QueryObjects

• ReportTaskProgress

• SetStatus

• SetTaskStatus

• ValidatePipelineDefinition


AWS Task Runner Reference

Topics

• Install Task Runner (p. 235)

• Start Task Runner (p. 235)

• Verify Task Runner (p. 236)

• Setting Credentials for Task Runner (p. 236)

• Task Runner Threading (p. 236)

• Long Running Preconditions (p. 236)

• Task Runner Configuration Options (p. 237)

Task Runner is a task agent application that polls AWS Data Pipeline for scheduled tasks and executes them on Amazon EC2 instances, Amazon EMR clusters, or other computational resources, reporting status as it does so. Depending on your application, you may choose to:

• Have AWS Data Pipeline install and manage one or more Task Runner applications for you on computational resources managed by the web service. In this case, you do not need to install or configure Task Runner.

• Manually install and configure Task Runner on a computational resource such as a long-running EC2 instance or a physical server. To do so, use the following procedures.

• Manually install and configure a custom task agent instead of Task Runner. The procedures for doing so will depend on the implementation of the custom task agent.

Install Task Runner

To install Task Runner, download TaskRunner-1.0.jar from Task Runner download and copy it into a folder. Additionally, download mysql-connector-java-5.1.18-bin.jar from http://dev.mysql.com/usingmysql/java/ and copy it into the same folder where you install Task Runner.

Start Task Runner

In a new command prompt window that is set to the directory where you installed Task Runner, start Task Runner with the following command.


Warning
If you close the terminal window, or interrupt the command with CTRL+C, Task Runner stops, which halts the pipeline runs.

java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myWorkerGroup

The --config option points to your credentials file. The --workerGroup option specifies the name of your worker group, which must be the same value as specified in your pipeline for tasks to be processed. When Task Runner is active, it prints the path to where log files are written in the terminal window. The following is an example.

Logging to /myComputerName/.../dist/output/logs

Verify Task Runner

The easiest way to verify that Task Runner is working is to check whether it is writing log files. The log files are stored in the directory where you started Task Runner.

When you check the logs, make sure that you are checking logs for the current date and time. Task Runner creates a new log file each hour, where the hour from midnight to 1am is 00. So the format of the log file name is TaskRunner.log.YYYY-MM-DD-HH, where HH runs from 00 to 23, in UTC. To save storage space, any log files older than eight hours are compressed with GZip.

Setting Credentials for Task Runner

In order to connect to the AWS Data Pipeline web service to process your commands, configure Task Runner with an AWS account that has permissions to create and/or manage data pipelines.

To Set Your Credentials Implicitly with a JSON File

• Create a JSON file named credentials.json in the directory where you installed Task Runner. For information about what to include in the JSON file, see Create a Credentials File (p. 18).
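As a rough, unverified sketch, the file pairs your access key and secret key with the service endpoint, along the lines of the following. The key names shown here are assumptions and should be checked against Create a Credentials File (p. 18); only the format described there is authoritative.

{
  "access-id": "AKIAIOSFODNN7EXAMPLE",
  "private-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "endpoint": "https://datapipeline.us-east-1.amazonaws.com",
  "log-uri": "s3://mybucket/task-runner-logs"
}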

Task Runner Threading

Task Runner activities and preconditions are single threaded. This means it can handle one work item per thread. By default, it has one activity thread and two precondition threads. If you are installing Task Runner and believe it will need to handle more than that at a single time, you need to increase the --activities and --preconditions values.
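For example, a manually installed Task Runner might be started with larger thread pools along the following lines. The thread counts are arbitrary, and the exact option syntax (space versus equals sign) should be checked against the configuration options listed later in this section.

java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myWorkerGroup --activities 4 --preconditions 4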

Long Running Preconditions

For performance reasons, pipeline retry logic for preconditions happens in Task Runner, and AWS Data Pipeline supplies Task Runner with preconditions only once per 30 minute period. Task Runner will honor the retryDelay field that you define on preconditions. You can configure the preconditionTimeout slot to limit the precondition retry period.
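As a sketch, a precondition that should be retried less aggressively might set these slots as follows. The precondition type, the field values, and the period format are assumptions for illustration only; only retryDelay and preconditionTimeout are taken from the text above.

{
  "id" : "MyInputReady",
  "type" : "S3KeyExists",
  "s3Key" : "s3://mybucket/input/ready",
  "retryDelay" : "10 minutes",
  "preconditionTimeout" : "2 hours"
}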


Task Runner Configuration Options

These are the configuration options available from the command line when you launch Task Runner.

--help
  Displays command line help.

--config
  Path and file name of your credentials.json file.

--accessId
  The AWS access ID for Task Runner to use when making requests.

--secretKey
  The AWS secret key for Task Runner to use when making requests.

--endpoint
  The AWS Data Pipeline service endpoint to use.

--workerGroup
  The name of the worker group that Task Runner will retrieve work for.

--output
  The Task Runner directory for output files.

--log
  The Task Runner directory for local log files. If it is not absolute, it will be relative to output. Default is 'logs'.

--staging
  The Task Runner directory for staging files. If it is not absolute, it will be relative to output. Default is 'staging'.

--temp
  The Task Runner directory for temporary files. If it is not absolute, it will be relative to output. Default is 'tmp'.

--activities
  Number of activity threads to run simultaneously; defaults to 1.

--preconditions
  Number of precondition threads to run simultaneously; defaults to 2.

--pcSuffix
  The suffix to use for preconditions. Defaults to "precondition".


Web Service Limits

To ensure there is capacity for all users of the AWS Data Pipeline service, the web service imposes limits on the amount of resources you can allocate and the rate at which you can allocate them.

Account Limits

The following limits apply to a single AWS account. If you require additional capacity, you can contact Amazon Web Services to increase your capacity.

Number of pipelines: 20 (adjustable)
Number of pipeline components per pipeline: 50 (adjustable)
Number of fields per pipeline component: 50 (adjustable)
Number of UTF8 bytes per field name or identifier: 256 (adjustable)
Number of UTF8 bytes per field: 10,240 (not adjustable)
Number of UTF8 bytes per pipeline component: 15,360, including the names of fields (not adjustable)
Rate of creation of an instance from a pipeline component: 1 per 5 minutes (not adjustable)
Number of running instances of a pipeline component: 5 (adjustable)
Retries of a pipeline activity: 5 per task (not adjustable)
Minimum delay between retry attempts: 2 minutes (not adjustable)
Minimum scheduling interval: 15 minutes (not adjustable)
Maximum number of rollups into a single object: 32 (not adjustable)

Web Service Call Limits

AWS Data Pipeline limits the rate at which you can call the web service API. These limits also apply to AWS Data Pipeline agents that call the web service API on your behalf, such as the console, CLI, and Task Runner.

The following limits apply to a single AWS account. This means the total usage on the account, including that by IAM users, cannot exceed these limits.

The burst rate lets you save up web service calls during periods of inactivity and expend them all in a short amount of time. For example, CreatePipeline has a regular rate of 1 call each 5 seconds. If you don't call the service for 30 seconds, you will have 6 calls saved up. You could then call the web service 6 times in a second. Because this is below the burst limit and keeps your average calls at the regular rate limit, your calls are not throttled.

If you exceed the rate limit and the burst limit, your web service call fails and returns a throttling exception. The default implementation of a worker, Task Runner, automatically retries API calls that fail with a throttling exception, with a backoff so that subsequent attempts to call the API occur at increasingly longer intervals. If you write a worker, we recommend that you implement similar retry logic.

These limits are applied against an individual AWS account.

ActivatePipeline: 1 call per 5 seconds (burst limit: 10 calls)
CreatePipeline: 1 call per 5 seconds (burst limit: 10 calls)
DeletePipeline: 1 call per 5 seconds (burst limit: 10 calls)
DescribeObjects: 1 call per 2 seconds (burst limit: 10 calls)
DescribePipelines: 1 call per 5 seconds (burst limit: 10 calls)
GetPipelineDefinition: 1 call per 5 seconds (burst limit: 10 calls)
PollForTask: 1 call per 2 seconds (burst limit: 10 calls)
ListPipelines: 1 call per 5 seconds (burst limit: 10 calls)
PutPipelineDefinition: 1 call per 5 seconds (burst limit: 10 calls)
QueryObjects: 1 call per 2 seconds (burst limit: 10 calls)
ReportProgress: 1 call per 2 seconds (burst limit: 10 calls)
SetTaskStatus: 2 calls per second (burst limit: 10 calls)
SetStatus: 1 call per 5 seconds (burst limit: 10 calls)
ReportTaskRunnerHeartbeat: 1 call per 5 seconds (burst limit: 10 calls)
ValidatePipelineDefinition: 1 call per 5 seconds (burst limit: 10 calls)

Scaling Considerations

AWS Data Pipeline scales to accommodate a huge number of concurrent tasks and you can configure it to automatically create the resources necessary to handle large workloads. These automatically-created resources are under your control and count against your AWS account resource limits. For example, if you configure AWS Data Pipeline to automatically create a 20-node EMR cluster to process data and your AWS account has an EC2 instance limit set to 20, you may inadvertently exhaust your available backfill resources. As a result, consider these resource restrictions in your design or increase your account limits accordingly.

If you require additional capacity, you can contact Amazon Web Services to increase your capacity.


AWS Data Pipeline Resources


The following table lists related resources to help you use AWS Data Pipeline.

AWS Data Pipeline API Reference: Describes AWS Data Pipeline operations, errors, and data structures.

AWS Data Pipeline Technical FAQ: Covers the top 20 questions developers ask about this product.

Release Notes: Provide a high-level overview of the current release. They specifically note any new features, corrections, and known issues.

AWS Developer Resource Center: A central starting point to find documentation, code samples, release notes, and other information to help you build innovative applications with AWS.

AWS Management Console: The AWS Data Pipeline console.

Discussion Forums: A community-based forum for developers to discuss technical questions related to Amazon Web Services.

AWS Support Center: The home page for AWS Technical Support, including access to our Developer Forums, Technical FAQs, Service Status page, and Premium Support.

AWS Premium Support: The primary web page for information about AWS Premium Support, a one-on-one, fast-response support channel to help you build and run applications on AWS Infrastructure Services.

AWS Data Pipeline Product Information: The primary web page for information about AWS Data Pipeline.

Contact Us: A form for questions about your AWS account, including billing.

Terms of Use: Detailed information about the copyright and trademark usage at Amazon.com and other topics.


Document History

This documentation is associated with the 2012-10-29 version of AWS Data Pipeline.

Change: Guide revision
Description: This release is the initial release of the AWS Data Pipeline Developer Guide.
Release Date: 20 December 2012
