AWS Data Pipeline
AWS Data Pipeline is a web service that automates and schedules data movement and data processing activities in AWS.
AWS Data Pipeline helps define data-driven workflows.
AWS Data Pipeline is not itself a storage system: it works with data held in cloud-based and on-premises storage, so developers can access their data where it is stored whenever they need it and export the results in the format they require.
AWS Data Pipeline lets you quickly define a pipeline: a dependent chain of data sources, destinations, and predefined or custom data processing activities.
On a set schedule, the pipeline regularly performs processing activities such as distributed data copies, SQL transforms, and EMR applications, or runs custom scripts against destinations such as S3, RDS, or DynamoDB.
Data Pipeline handles the scheduling, retry, and failure logic for the workflows as a highly scalable, fully managed service.
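To make the retry behaviour concrete, here is a minimal Python sketch of the kind of retry logic the service takes off your hands. It is an illustration only, not how Data Pipeline is implemented; `run_with_retries`, `max_retries`, `retry_delay_s`, and `flaky_copy` are hypothetical names chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(action: Callable[[], T],
                     max_retries: int = 3,
                     retry_delay_s: float = 0.0) -> T:
    """Run `action`; after a failure, retry up to `max_retries` more times."""
    retries_used = 0
    while True:
        try:
            return action()
        except Exception:
            if retries_used >= max_retries:
                raise  # retries exhausted: surface the failure to the caller
            retries_used += 1
            time.sleep(retry_delay_s)  # wait between attempts (the "frequency")


# Hypothetical activity that fails twice before succeeding.
calls = {"n": 0}


def flaky_copy() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "copied"


result = run_with_retries(flaky_copy, max_retries=3)
print(result, "after", calls["n"], "attempts")
```

With a managed service, the number of retries and the delay between them become configuration rather than code you maintain yourself.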
Distributed, fault-tolerant, and highly available
Managed workflow orchestration for data-driven workflows
Infrastructure management service that provisions and terminates resources as required
Provides dependency resolution
Can be scheduled
Gives control over retries (including frequency and number).
Native integration with S3, DynamoDB, RDS, EMR, EC2, and Redshift
Support for both AWS-based and external on-premises resources
AWS Data Pipeline Concepts
A pipeline definition is how you communicate your business logic to AWS Data Pipeline.
A pipeline definition specifies the location of the data (data nodes), the activities to be performed, the schedule, the resources that run the activities, the preconditions that must be met before the activities run, and the retry attempts.
Pipeline components represent the business logic of the pipeline and correspond to the different sections of a pipeline definition.
Pipeline components define the data sources, activities, and schedules for the workflow.
When AWS Data Pipeline runs a pipeline, it compiles the pipeline components to create a set of actionable instances.
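As a rough illustration of how these components fit together, the sketch below builds a small pipeline definition as a Python dictionary and serializes it to JSON. The object types and field names (`Schedule`, `Ec2Resource`, `S3DataNode`, `CopyActivity`, `ref`, `runsOn`, `maximumRetries`) follow my understanding of the JSON pipeline definition syntax and should be checked against the AWS object reference. The ids, bucket name, dates, and the `unresolved_refs` helper are hypothetical.

```python
import json
from typing import List

# Hypothetical ids, bucket, and schedule values, for illustration only.
pipeline_definition = {
    "objects": [
        {   # Schedule: when the activity runs
            "id": "DailySchedule",
            "type": "Schedule",
            "period": "1 day",
            "startDateTime": "2024-01-01T00:00:00",
        },
        {   # Resource: the compute that performs the work
            "id": "CopyInstance",
            "type": "Ec2Resource",
            "terminateAfter": "1 hour",
            "schedule": {"ref": "DailySchedule"},
        },
        {   # Data node: input location
            "id": "InputData",
            "type": "S3DataNode",
            "directoryPath": "s3://example-bucket/input/",
            "schedule": {"ref": "DailySchedule"},
        },
        {   # Data node: output location
            "id": "OutputData",
            "type": "S3DataNode",
            "directoryPath": "s3://example-bucket/output/",
            "schedule": {"ref": "DailySchedule"},
        },
        {   # Activity: the work to be done, wired to the objects above
            "id": "CopyData",
            "type": "CopyActivity",
            "input": {"ref": "InputData"},
            "output": {"ref": "OutputData"},
            "runsOn": {"ref": "CopyInstance"},
            "schedule": {"ref": "DailySchedule"},
            "maximumRetries": "3",
        },
    ]
}


def unresolved_refs(definition: dict) -> List[str]:
    """Return every {"ref": ...} that does not point at a defined object id."""
    ids = {obj["id"] for obj in definition["objects"]}
    missing = []
    for obj in definition["objects"]:
        for field, value in obj.items():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                missing.append(f'{obj["id"]}.{field} -> {value["ref"]}')
    return missing


print(json.dumps(pipeline_definition, indent=2))
print("unresolved references:", unresolved_refs(pipeline_definition))
```

The reference check is a local sanity test only; the service performs its own validation when a definition is submitted.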
To provide robust and durable data management, Data Pipeline retries failed operations at a configured frequency, up to a defined number of attempts.
Task Runners
A task runner is an application that polls AWS Data Pipeline for tasks and then performs those tasks.
Once Task Runner has been installed and configured, it polls AWS Data Pipeline for tasks associated with activated pipelines.
When a task is assigned to Task Runner, it performs that task and reports its status back to AWS Data Pipeline.
A task is a discrete unit of work that Data Pipeline shares with a task runner. It differs from a pipeline, which is a general definition of activities and resources that typically yields many tasks.
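The poll, perform, report cycle described above can be sketched with a toy in-memory model. This is not the real Task Runner and makes no AWS calls: `ToyPipelineService`, `run_task_runner`, and the task fields are hypothetical, with method names only loosely modelled on the service's PollForTask and SetTaskStatus API actions.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional


class ToyPipelineService:
    """In-memory stand-in for the service side: a task queue plus a status map."""

    def __init__(self) -> None:
        self._queue: Deque[Dict[str, str]] = deque()
        self.statuses: Dict[str, str] = {}

    def add_task(self, task_id: str, command: str) -> None:
        self._queue.append({"id": task_id, "command": command})

    def poll_for_task(self) -> Optional[Dict[str, str]]:
        # Hand out the next waiting task, or None when nothing is queued.
        return self._queue.popleft() if self._queue else None

    def set_task_status(self, task_id: str, status: str) -> None:
        self.statuses[task_id] = status


def run_task_runner(service: ToyPipelineService,
                    handlers: Dict[str, Callable[[], None]]) -> None:
    """Poll for tasks, perform each one, and report its outcome."""
    while True:
        task = service.poll_for_task()
        if task is None:
            break  # a long-running runner would sleep and poll again instead
        try:
            handlers[task["command"]]()
            status = "FINISHED"
        except Exception:
            status = "FAILED"
        service.set_task_status(task["id"], status)


def failing_transform() -> None:
    raise RuntimeError("transform failed")


service = ToyPipelineService()
service.add_task("task-1", "copy")
service.add_task("task-2", "transform")
run_task_runner(service, {"copy": lambda: None, "transform": failing_transform})
print(service.statuses)
```

Keeping the runner as a simple polling client is what lets the same model work on service-managed and user-managed resources alike.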
Tasks can be performed either on resources managed by AWS Data Pipeline or on user-managed resources.
Data Nodes
A data node defines the location and type of data that a pipeline activity uses as its source (input) or destination (output).
Data Pipeline supports S3, Redshift, and DynamoDB data nodes.
Databases
Data Pipeline supports Redshift, RDS, and JDBC databases
Activities
An activity is a component of a pipeline that defines the work to be done.
Data Pipeline provides predefined activities for common scenarios such as SQL transformations, data movement, and Hive queries.
Activities are extensible and can run your own custom scripts to support endless combinations.
Preconditions
A precondition is a component of a pipeline that contains conditional statements. These statements must be met (evaluated as true) before an activity can proceed.
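A minimal sketch of this gating idea in Python, under stated assumptions: preconditions are modelled as plain functions returning a boolean, and an activity that is not cleared to run is simply skipped. How the service itself waits on or retries a precondition is not modelled here. `run_if_preconditions_met`, `key_exists`, and `existing_keys` are hypothetical names.

```python
from typing import Callable, Iterable, Optional

# Stand-in for an object store listing; a real check would query the data source.
existing_keys = {"input/data.csv"}


def key_exists(key: str) -> Callable[[], bool]:
    """Build a precondition that is true when `key` is present."""
    return lambda: key in existing_keys


def run_if_preconditions_met(preconditions: Iterable[Callable[[], bool]],
                             activity: Callable[[], str]) -> Optional[str]:
    """Run `activity` only when every precondition evaluates to True."""
    if all(check() for check in preconditions):
        return activity()
    return None  # activity does not proceed


ran = run_if_preconditions_met([key_exists("input/data.csv")], lambda: "copied")
skipped = run_if_preconditions_met([key_exists("input/missing.csv")], lambda: "copied")
print(ran, skipped)
```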
A pipeline supports system-managed preconditions, which are run by the AWS Data Pipeline web service on your behalf and do not require a computational resource.
These include checks on source data and keys, e.g. DynamoDB data exists, DynamoDB table exists, S3 key exists, or prefix is not e