This Guidance demonstrates how to set up an end-to-end framework for analyzing multimodal healthcare and life sciences (HCLS) data. It analyzes this data using purpose-built HCLS services (such as AWS HealthOmics, AWS HealthLake, and AWS HealthImaging) and machine learning (ML) and analytics services (such as Amazon SageMaker, Amazon Athena, and Amazon QuickSight). It ingests raw HCLS data formats such as variant call format (VCF), Fast Healthcare Interoperability Resources (FHIR), and Digital Imaging and Communications in Medicine (DICOM), and provides a zero-ETL (extract, transform, load) architecture for customers who want to run their data analysis at scale on AWS.
The architecture shows how to store, transform, and analyze linked genomic, clinical, and medical imaging data for patients. The effectiveness of the Guidance is demonstrated on a coherent synthetic patient dataset with multiple disease scenarios, released by MITRE and available in the Registry of Open Data on AWS. The Guidance then trains an ML model for predicting patient outcomes, and it includes an interactive dashboard for visualizing summary statistics of the data and ML model reports that can be customized based on the user persona.
Please note: See the Disclaimer section at the end of this Guidance.
Architecture Diagram
Step 1
Ingest genomic data from Amazon Simple Storage Service (Amazon S3) or Registry of Open Data on AWS (RODA) to AWS HealthOmics.
Use HealthOmics Reference store for reference genome data, such as Fast-All (FASTA), and HealthOmics Sequence store for sequence data, such as FASTQ, Binary Alignment Map (BAM), and Compressed Reference-oriented Alignment Map (CRAM).
Use HealthOmics Variant store for variant call format (VCF) files and HealthOmics Annotation store for annotation files. To run private or Ready2Run workflows, use HealthOmics Workflows.
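For illustration, the following minimal sketch (Python, boto3) shows how a VCF file in Amazon S3 might be imported into an existing HealthOmics Variant store; the store name, IAM role ARN, and S3 URI are hypothetical placeholders.

import boto3

omics = boto3.client("omics")

# Start an import of a VCF file from Amazon S3 into an existing Variant store.
# The store name, role ARN, and S3 URI below are placeholders.
response = omics.start_variant_import_job(
    destinationName="my-variant-store",
    roleArn="arn:aws:iam::111122223333:role/OmicsImportRole",
    items=[{"source": "s3://my-bucket/vcf/sample.vcf.gz"}],
)
print(response["jobId"])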
Step 2
Ingest Fast Healthcare Interoperability Resources (FHIR) data to AWS HealthLake.
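As a hedged sketch of this step (Python, boto3), the call below starts a FHIR import job from newline-delimited JSON files in Amazon S3; the data store ID, bucket paths, IAM role ARN, and KMS key ARN are hypothetical placeholders.

import boto3

healthlake = boto3.client("healthlake")

# Import FHIR R4 resources (NDJSON) from Amazon S3 into a HealthLake data store.
response = healthlake.start_fhir_import_job(
    JobName="synthea-fhir-import",
    DatastoreId="<datastore-id>",
    DataAccessRoleArn="arn:aws:iam::111122223333:role/HealthLakeImportRole",
    InputDataConfig={"S3Uri": "s3://my-bucket/fhir/"},
    JobOutputDataConfig={
        "S3Configuration": {
            "S3Uri": "s3://my-bucket/fhir-import-output/",
            "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/<key-id>",
        }
    },
)
print(response["JobId"])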
Step 3
Ingest Digital Imaging and Communications in Medicine (DICOM) images into AWS HealthImaging and read them into Insight Toolkit (ITK) image objects in memory through API calls.
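For illustration, the sketch below (Python, boto3) starts a DICOM import job; subsequent in-memory reads typically retrieve image frames through the HealthImaging GetImageFrame API before constructing ITK image objects. The data store ID, bucket paths, and IAM role ARN are hypothetical placeholders.

import uuid
import boto3

medical_imaging = boto3.client("medical-imaging")

# Import DICOM files from Amazon S3 into a HealthImaging data store.
response = medical_imaging.start_dicom_import_job(
    jobName="dicom-import",
    datastoreId="<datastore-id>",
    dataAccessRoleArn="arn:aws:iam::111122223333:role/HealthImagingImportRole",
    inputS3Uri="s3://my-bucket/dicom/",
    outputS3Uri="s3://my-bucket/dicom-import-results/",
    clientToken=str(uuid.uuid4()),
)
print(response["jobId"])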
Step 4
View tables from HealthOmics and HealthLake as resources in AWS Lake Formation.
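As an illustrative sketch (Python, boto3), the call below grants a principal SELECT access on one of those tables through Lake Formation; the role ARN, database name, and table name are hypothetical placeholders.

import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on a catalog table backed by HealthOmics or HealthLake data.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "multimodal_db", "Name": "variants"}},
    Permissions=["SELECT"],
)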
Step 5
Query the tables with Amazon Athena.
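For illustration, a minimal sketch (Python, boto3) that submits a query joining clinical and genomic tables; the database, table, and column names and the S3 output location are hypothetical placeholders.

import boto3

athena = boto3.client("athena")

# Join clinical and genomic tables registered in the data catalog.
query = """
SELECT c.patient_id, c.diagnosis, v.gene, v.alt
FROM clinical_conditions c
JOIN variants v ON c.patient_id = v.sample_id
LIMIT 10
"""
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "multimodal_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])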
Step 6
Generate brain masks with the Medical Open Network for AI (MONAI) segmentation model. Use Amazon SageMaker Processing to parallelize radiomic feature computation for each image representation.
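As a hedged sketch of the parallelization step (Python, SageMaker Python SDK), the Processing job below distributes a radiomics script across several instances; the container image, script name, IAM role ARN, and S3 paths are hypothetical placeholders.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Run a radiomic feature extraction script in parallel across four instances.
processor = ScriptProcessor(
    image_uri="<ecr-image-with-monai-and-pyradiomics>",
    command=["python3"],
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=4,
    instance_type="ml.m5.2xlarge",
)
processor.run(
    code="compute_radiomics.py",  # hypothetical feature-extraction script
    inputs=[ProcessingInput(source="s3://my-bucket/images/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://my-bucket/radiomics/")],
)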
Step 7
Build visualization dashboards with Amazon QuickSight.
Step 8
Store the multimodal feature set in Amazon SageMaker Feature Store.
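For illustration, a minimal sketch (Python, SageMaker Python SDK) that creates a feature group from a per-patient feature table and ingests it; the file name, feature group name, column names, and IAM role ARN are hypothetical placeholders.

import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Hypothetical table of per-patient clinical, genomic, and imaging features.
features = pd.read_csv("multimodal_features.csv")
features["patient_id"] = features["patient_id"].astype("string")  # Feature Store expects the pandas "string" dtype
features["event_time"] = time.time()

feature_group = FeatureGroup(name="multimodal-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=features)
feature_group.create(
    s3_uri=f"s3://{session.default_bucket()}/feature-store",
    record_identifier_name="patient_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    enable_online_store=True,
)
feature_group.ingest(data_frame=features, max_workers=4, wait=True)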
Step 9
Build and train ML models on multimodal features with SageMaker AutoGluon-Tabular.
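To illustrate the underlying training call, the sketch below uses the open-source AutoGluon-Tabular API directly rather than a managed SageMaker training job; the file names and the "outcome" label column are hypothetical placeholders.

from autogluon.tabular import TabularDataset, TabularPredictor

# Train a tabular model on the multimodal feature set and report held-out performance.
train_data = TabularDataset("train.csv")
predictor = TabularPredictor(label="outcome", eval_metric="roc_auc").fit(train_data)

test_data = TabularDataset("test.csv")
print(predictor.evaluate(test_data))
print(predictor.leaderboard(test_data))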
Step 10
Deploy the model as an endpoint for real-time inference.
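As a hedged sketch (Python, SageMaker Python SDK and boto3), the code below deploys a trained model artifact to a real-time endpoint and invokes it; the inference image URI, model artifact path, IAM role ARN, endpoint name, and feature payload are hypothetical placeholders.

import boto3
import sagemaker
from sagemaker.model import Model

# Deploy a trained model artifact behind a real-time SageMaker endpoint.
model = Model(
    image_uri="<inference-image-uri>",
    model_data="s3://my-bucket/model/model.tar.gz",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    sagemaker_session=sagemaker.Session(),
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="multimodal-outcome-endpoint",
)

# Invoke the endpoint for real-time inference.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="multimodal-outcome-endpoint",
    ContentType="text/csv",
    Body="0.12,3,64.0",  # hypothetical feature row matching the model's expected input
)
print(response["Body"].read())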
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
HealthOmics integrates with Amazon EventBridge and provides notifications for actions such as Variant or Annotation store creation and deletion, in addition to the start and completion of data import jobs. You can overlay rules and handling targets onto this Guidance to monitor and respond to any incidents that may occur, such as repeated import failures.
Security
HealthImaging enforces the use of AWS Key Management Service (AWS KMS) encryption because it does not allow the creation of an unencrypted data store. In addition, encryption at rest and in transit is supported by HealthOmics, HealthLake, Amazon SageMaker, Athena, QuickSight, Lake Formation, and Amazon S3. This Guidance uses AWS owned keys, but customers can bring their own keys if needed.
Reliability
When deploying this Guidance in an environment with pre-existing HealthOmics resources, be aware of the HealthOmics Analytics quotas. This Guidance creates 1 Variant store and 1 Annotation store; by default, HealthOmics allows 10 Variant stores and 10 Annotation stores. There are also default limits on the number of import jobs to HealthOmics Analytics stores and on the file sizes they can handle. The default limit is 5 concurrent Variant or Annotation store import jobs, and this Guidance runs 1 Variant import job and 1 Annotation import job. Variant import jobs have a default limit of 1,000 sources, each up to 20 GB; the example variant data used by this Guidance consists of about 800 variant files, each about 1 GB. Annotation import jobs have a default limit of 1 source of up to 20 GB; the example annotation data in this Guidance is a single file of about 10 GB.
Performance Efficiency
The data in HealthLake is automatically available through Lake Formation. This allows customers to create organizational units (OUs) of users and then grant row- and column-level access to those users depending on their data access requirements.
Cost Optimization
HealthLake automatically transforms the clinical data and registers it in your data catalog so that you can run SQL queries on it directly. This eliminates the need to export HealthLake data and pay data transfer costs.
Sustainability
By establishing a centralized data lake for all modalities, this Guidance removes the need to create redundant data. Data stores provided by HealthLake, HealthOmics, and HealthImaging become the single source of truth for each of their respective data types. Lake Formation can govern and filter each data type to provide users with the appropriate access to data without duplication. Similarly, you can create common database constructs, such as “views” in Athena to support multiple analysis use cases without data replication.
Implementation Resources
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Guidance for Multi-Omics and Multi-Modal Data Integration and Analysis on AWS
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.