redefining etl pipelines with apache technologies to accelerate decision-making and execution for...
TRANSCRIPT
![Page 1: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/1.jpg)
Redefining ETL Pipelines with Apache
Technologies to Accelerate Decision
Making for Clinical Trials
Eran Withana
![Page 2: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/2.jpg)
www.comprehend.com
Overview
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
![Page 3: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/3.jpg)
About me …
Open Source
• Member, PMC member, and committer of the ASF
• Apache Axis2, Web Services, Synapse, Airavata
Education
• PhD in Computer Science from Indiana University
Software engineer at Comprehend Systems
![Page 5: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/5.jpg)
Clinical Trials – Lay of the land
Number of Drugs in Development Worldwide (Source: CenterWatch Drugs in Clinical Trial Database 2014)
Source: http://www.phrma.org/innovation/clinical-trials
![Page 6: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/6.jpg)
Clinical Trials – Lay of the Land
Multiple Stakeholders
• Study Managers
• Program Managers
• Monitors
• Data Managers
• Bio-statisticians
• Executives
• Medical Affairs
• Regulatory
• Vendors
• CROs
• CRAs
[Diagram labels: Sites, Labs, Patients, Safety, EDC, Reports, PV Data, Excel; Data: Latent, Fragmented; Sponsor, Contract Research Organization (CRO), Sites and Investigators]
![Page 7: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/7.jpg)
For decades, clinical development was primarily paper-based.
![Page 8: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/8.jpg)
Various Software and Practices Used in Each Layer
medidata
CROs and SIs
Technologies
![Page 9: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/9.jpg)
Clinical Trials with Centralized Monitoring
[Diagram labels: Clinical Operations; Sites, Labs, Patients; Data; Safety, EDC, PV Data, Excel; Clinical Analytics & Collaboration]
● Consolidated
● Real-time
● Self-Service
● Mobile
![Page 10: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/10.jpg)
Providing up-to-date answers
[Diagram labels: Executives, Medical Review, CRAs, Data Management, Clinical Operations; EDC, CTMS, Safety, ePro, Other; Web, Ad-Hoc, Mobile, Collab]
![Page 12: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/12.jpg)
Business Requirements
• FDA, HIPAA compliance
• Metadata/database structure synchronization
  • Less frequent (once a day)
• Data synchronization
  • More frequent (multiple times a day)
• Ability to plug in various data sources
  • RAVE, MERGE, BioClinica, file imports, DB-to-DB syncs
• Real-time event propagation
  • Adverse events (AEs): the need for early identification
![Page 13: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/13.jpg)
Technical Requirements
• Hardware agnostic for resiliency and better utilization
• Repeatable deployments
• Real-time processing and real-time events
• Fault tolerance
• In-flight and end-state metrics for alerting and monitoring
• Flexible and pluggable adapter architecture
• Time travel
  • Audit trails
  • Report generation
![Page 14: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/14.jpg)
Core Design Principles
• Events all the way
• Shared event bus for multiple consumers
• Language-agnostic data representations (via protobuf)
• Automatic datacenter resource management (Mesos/Marathon/Docker)
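The "shared event bus for multiple consumers" principle can be illustrated with a minimal in-memory sketch. This is not Kafka's API (Kafka is the event queue named in the evaluation); the `EventBus` class, the consumer names, and the event fields are all hypothetical. It only shows the idea that every consumer reads the same ordered log at its own pace.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for a shared, append-only event log (hypothetical)."""

    def __init__(self):
        self._log = []                    # every published event, in order
        self._offsets = defaultdict(int)  # next unread position per consumer

    def publish(self, event):
        self._log.append(event)

    def poll(self, consumer):
        """Return the events this consumer has not seen yet and advance its offset."""
        start = self._offsets[consumer]
        events = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return events


bus = EventBus()
bus.publish({"type": "adverse_event", "study": "S-001"})
bus.publish({"type": "row_synced", "study": "S-001"})

alerts = bus.poll("alerting")   # this consumer sees both events
metrics = bus.poll("metrics")   # a second consumer reads the same events independently
```

Because each consumer keeps its own offset, adding a new consumer (for example, a reporting job) does not require any change to the producers.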
![Page 16: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/16.jpg)
Technologies Evaluated
• Data processing: Apache Storm and Trident, Apache Spark and Spark Streaming, Samza, Summingbird, Scalding, Apache Falcon, Azkaban
• Coordination and configuration management: Apache ZooKeeper, Redis, Apache Curator
• Event queue: Apache Kafka
• Scheduling: Chronos, Apache Mesos, Marathon, Apache Aurora
• Database synchronization: Liquibase, Flyway
• Data representations: Apache Thrift, protobuf, Avro
• Deployments: Ansible
• File management: Apache HDFS
• Monitoring and alerting: Graphite, StatsD
• Database: PostgreSQL, Apache Spark
• Resource isolation: LXC, Docker
![Page 17: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/17.jpg)
Data Processing Technology Evaluation
Columns, in order: Storm + Trident, Spark + Streaming, Samza, Summingbird, Scalding, Falcon, Chronos, Aurora, Azkaban
DAG Support: Y, DAGScheduler, Y, Y, Y, Y, Y, N, Y
DAG Nodes Resiliency: Y, Y, Y, Y, Y, Y, Y, N, Y
Event Driven: Y, Y, Y, Y, N, N, N, N, N
Timed Execution: Y, Y, Y, Y, Y, Y, Y, Y
DAG Extension: Y, Y, Y, Y, Y, Y, Y, Y, Y
Inflight and end state metrics: Y, Y, Y, Y, Y, Y, Y, Y, Y
Hardware Agnostic: Y, Y, Y, Y, Y, Y, Y, Y, Y
![Page 19: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/19.jpg)
High Level Architecture
![Page 21: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/21.jpg)
Managing Hardware
• Bare-metal boxes
• Partitioned using LXC containers
• Mesos allocates resources to jobs as needed
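As a rough illustration of handing a job's resource needs to the scheduler, the sketch below builds a Marathon-style app definition as a Python dictionary. The slides name Mesos, Marathon, and Docker but do not show any job definitions, so the app id, image name, and resource numbers here are assumptions made up for the example.

```python
import json


def marathon_app(app_id, image, cpus, mem_mb, instances=1):
    """Build a Marathon-style app definition for a Docker-packaged job."""
    return {
        "id": app_id,            # hypothetical app path
        "cpus": cpus,            # CPU shares requested per instance
        "mem": mem_mb,           # memory in MB requested per instance
        "instances": instances,  # how many copies to keep running
        "container": {"type": "DOCKER", "docker": {"image": image}},
    }


# Hypothetical seeder job: two instances, half a CPU and 512 MB each.
app = marathon_app("/etl/seeder", "example/seeder:latest", cpus=0.5, mem_mb=512, instances=2)
payload = json.dumps(app, indent=2)  # JSON body that would be submitted to the scheduler
```

The job states only what it needs; which bare-metal box actually runs it is left to the resource manager, which is the point of being hardware agnostic.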
![Page 23: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/23.jpg)
Deployments
Ansible
• Repeatable deployments
• Password management
• Inventory management (nodes, dev/staging/production)
![Page 25: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/25.jpg)
Adapters – High Level
![Page 26: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/26.jpg)
Data Adapters
• Syncher handles database structural changes
  • Creates a database schema from the source information
  • Runs a generic database diff and applies the differences to the target database
• Seeder handles data synchronization
  • Uses the database schema created by Syncher
• Seeders get jobs from Syncher or from a timed scheduler
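The Syncher's "generic database diff" step can be sketched as a comparison of two schema descriptions that yields the DDL needed on the target. This is a minimal sketch under stated assumptions, not the Syncher's actual implementation and not the Liquibase or Flyway API: the `diff_schema` function, the dictionary layout, and the table names are hypothetical, and type changes are not handled.

```python
def diff_schema(source, target):
    """Return DDL statements that bring `target` in line with `source`.

    Both arguments map table name -> {column name: SQL type}.
    """
    statements = []
    for table, columns in sorted(source.items()):
        if table not in target:
            cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols})")
            continue
        # Columns present in the source but missing from the target.
        for name, sqltype in columns.items():
            if name not in target[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {sqltype}")
        # Columns the source no longer has.
        for name in target[table]:
            if name not in columns:
                statements.append(f"ALTER TABLE {table} DROP COLUMN {name}")
    return statements


# Hypothetical schemas: the source gained two columns and a table, and lost one column.
source = {
    "subject": {"id": "INTEGER", "site_id": "INTEGER", "enrolled_on": "DATE"},
    "visit": {"id": "INTEGER", "subject_id": "INTEGER"},
}
target = {"subject": {"id": "INTEGER", "site": "TEXT"}}

ddl = diff_schema(source, target)
```

Applying the resulting statements inside a single transaction lets a failed schema change roll back cleanly, which matches the rollback behavior listed under the failure modes.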
![Page 27: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/27.jpg)
Data Adapters – Coordination and Configuration
• Coordination and configuration through ZooKeeper
  • Job configuration
  • Connection information
  • Distributed locking and counters
  • Metric maintenance
  • Last successful run
![Page 28: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/28.jpg)
Data Adapters - Implementation
![Page 29: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/29.jpg)
Failure Modes
Syncher
• Connectivity to source/sink systems fails
  • Retry the job N times and alert, if needed
• Schema changes to the database fail midway
  • Transaction rollback
Seeder
• Connectivity to source/sink systems fails
  • Retry the job N times and alert, if needed
• Seeding fails midway
  • Storm retries tuples
  • Failing tuples are moved to an error queue
• Table- and row-level failures
  • Option to skip the tables/rows but send a report at the end
• Effect on “live” tables during data synchronization
  • Option to use transactions, or
  • Use temporary tables and swap with the original upon completion
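Two of the strategies above, "retry N times and alert" and "use temporary tables and swap with the original upon completion", can be sketched together. This is an illustration under stated assumptions rather than the Seeder's code: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the `retry` and `seed_with_swap` helpers, the `labs` table, and its two-column layout are hypothetical.

```python
import sqlite3
import time


def retry(job, attempts, alert, delay_s=0.0):
    """Run `job` up to `attempts` times; call `alert` and re-raise on the final failure."""
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == attempts:
                alert(f"job failed after {attempts} attempts: {exc}")
                raise
            time.sleep(delay_s)


def seed_with_swap(conn, table, rows):
    """Load `rows` into a staging table, then swap it in for the live table."""
    staging = f"{table}_staging"
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP TABLE IF EXISTS {staging}")
        conn.execute(f"CREATE TABLE {staging} (id INTEGER PRIMARY KEY, value TEXT)")
        conn.executemany(f"INSERT INTO {staging} VALUES (?, ?)", rows)
        # The swap: readers see the old table until the transaction commits.
        conn.execute(f"DROP TABLE {table}")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # live table is left untouched
        raise


# isolation_level=None puts sqlite3 in autocommit mode so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE labs (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO labs VALUES (1, 'old')")

alerts = []
retry(lambda: seed_with_swap(conn, "labs", [(1, "new"), (2, "new")]),
      attempts=3, alert=alerts.append)
live_rows = conn.execute("SELECT id, value FROM labs ORDER BY id").fetchall()
```

The swap relies on DDL being transactional, which holds for both SQLite and PostgreSQL. If the load fails partway, the rollback discards the staging table and the live table keeps its previous contents.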
![Page 30: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/30.jpg)
What Have We Gained
• Can bring in data from more data sources and more studies effectively
• Run real-time reports on studies and configure alerts (future)
• Can configure refreshes as needed by each use case
• Can throttle input and output sources at study/customer level
• Ability to onboard new customers and deploy new studies with minimal human intervention
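The slides do not say how throttling at the study/customer level is implemented. One common technique is a token bucket per study; the sketch below assumes that approach purely for illustration. The `TokenBucket` and `StudyThrottle` classes and the study ids are hypothetical, and the clock is passed in explicitly so the behavior is deterministic (in practice it would come from something like `time.monotonic()`).

```python
class TokenBucket:
    """Allow `rate` operations per second, with bursts of at most `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now, cost=1):
        # Refill for the elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class StudyThrottle:
    """Keeps one independent bucket per study id."""

    def __init__(self, rate, capacity):
        self._rate = rate
        self._capacity = capacity
        self._buckets = {}

    def allow(self, study, now):
        if study not in self._buckets:
            self._buckets[study] = TokenBucket(self._rate, self._capacity, now)
        return self._buckets[study].allow(now)


throttle = StudyThrottle(rate=1.0, capacity=2)
burst = [throttle.allow("S-001", now=0.0) for _ in range(3)]  # third call exceeds the burst
other = throttle.allow("S-002", now=0.0)                      # a different study is unaffected
later = throttle.allow("S-001", now=1.0)                      # one token refilled after 1 s
```

Keying the buckets by study (or by customer) means a heavy refresh for one study cannot starve the source systems of another.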
![Page 31: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/31.jpg)
What Have We Gained
A generic framework which
• eases integration with new data sources
  • For each new source, implement a method to create a virtual schema and a method to get data for a given table
• can scale and is fault tolerant
• has generic monitoring and alerting
• eases maintenance, since it's mostly generic code
• sends notifications of important events through messages
• runs on any hardware
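The two methods each new source has to implement, one to create a virtual schema and one to get data for a given table, suggest an interface along the following lines. This is a sketch of the idea, not the framework's real API: `SourceAdapter`, `CsvFileAdapter`, `sync_all`, and the sample data are hypothetical names introduced for the example.

```python
import csv
import io
from abc import ABC, abstractmethod


class SourceAdapter(ABC):
    """What a new data source would have to provide (hypothetical interface)."""

    @abstractmethod
    def virtual_schema(self):
        """Return {table name: [column names]} describing the source."""

    @abstractmethod
    def read_table(self, table):
        """Yield rows (as dicts) for one table of the virtual schema."""


class CsvFileAdapter(SourceAdapter):
    """Treats each CSV document as one table; the header row gives the columns."""

    def __init__(self, files):
        self._files = files  # {table name: CSV text}

    def virtual_schema(self):
        return {name: next(csv.reader(io.StringIO(text)))
                for name, text in self._files.items()}

    def read_table(self, table):
        yield from csv.DictReader(io.StringIO(self._files[table]))


def sync_all(adapter):
    """Generic driver: the same code works for any SourceAdapter."""
    return {table: list(adapter.read_table(table))
            for table in adapter.virtual_schema()}


adapter = CsvFileAdapter({"labs": "subject,result\nS1,4.2\nS2,5.0\n"})
schema = adapter.virtual_schema()
data = sync_all(adapter)
```

Everything outside the adapter (scheduling, retries, monitoring) can stay generic because it only depends on these two methods.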
![Page 33: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/33.jpg)
Distributed File System - Requirements
Accessibility
• Customers must be able to drop files securely (SFTP-like functionality)
• Ability to access resources through URLs
Data storage
Scalability and Redundancy
• Scale-out by adding nodes
• Resilience against loss of nodes, data centers and replication
Miscellaneous
• Access control over read/write
• Performance/usage/resource utilization monitoring
![Page 34: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/34.jpg)
HDFS with High Availability Mode
• Two name nodes running in HA mode, co-located with two journal nodes
• Third journal node on a separate node
• Data nodes on all bare-metal nodes
• HDFS mounted with FUSE, with SFTP enabled through OS-level features
• Automatic failover through DNS and HAProxy
![Page 36: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/36.jpg)
Challenges
Regulatory requirements
• Data encryption requirements for clinical data
• Audit trails
Data quality
Source system constraints
Coordination between Synchers and Seeders
• Distributed locks and counters
Automatic failover when a name node fails in HDFS
• HDFS HA mode stores the active name node in ZK as a Java serialized object, yikes !!
![Page 38: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/38.jpg)
Future Work
• Time travel
  • Ability to go back in time and run reports as of any given point in time
  • Trail of data
• Containerization
• In-memory query execution with Apache Spark
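Time travel is listed as future work, so the slides do not describe a design. One common way to support "as of" reports is to keep every version of a row with a validity interval; the sketch below assumes that approach. It uses Python's built-in `sqlite3` as a stand-in database, and the `lab_result_history` table, its columns, and the `record` and `as_of` helpers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lab_result_history ("
    " subject TEXT, value REAL,"
    " valid_from TEXT NOT NULL,"  # ISO-8601 timestamp when this version became current
    " valid_to TEXT)"             # NULL while this version is still current
)


def record(conn, subject, value, at):
    """Close the current version (if any) and append the new one."""
    conn.execute(
        "UPDATE lab_result_history SET valid_to = ?"
        " WHERE subject = ? AND valid_to IS NULL",
        (at, subject),
    )
    conn.execute(
        "INSERT INTO lab_result_history VALUES (?, ?, ?, NULL)",
        (subject, value, at),
    )


def as_of(conn, at):
    """Return the rows as they looked at timestamp `at`."""
    return conn.execute(
        "SELECT subject, value FROM lab_result_history"
        " WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)"
        " ORDER BY subject",
        (at, at),
    ).fetchall()


record(conn, "S1", 4.2, "2015-01-01T00:00:00")
record(conn, "S1", 5.0, "2015-02-01T00:00:00")  # a later correction
january = as_of(conn, "2015-01-15T00:00:00")
march = as_of(conn, "2015-03-01T00:00:00")
```

Because old versions are closed rather than overwritten, the same table also serves as a trail of data: every past value remains queryable along with the interval during which it was current.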
![Page 39: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/39.jpg)
Team
![Page 40: Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials](https://reader030.vdocuments.us/reader030/viewer/2022020307/55a697d41a28ab7c2d8b4787/html5/thumbnails/40.jpg)
Thank You !!
Questions …