This Guidance helps researchers run a diverse catalog of protein folding and design algorithms on AWS Batch. Knowing the three-dimensional structure of proteins is an important part of the drug discovery process, and machine learning (ML) algorithms significantly reduce the cost and time needed to generate usable protein structures.
These ML systems have also inspired the development of artificial intelligence (AI)-driven algorithms for de novo protein design and protein-ligand interaction analysis. This Guidance allows researchers to quickly add support for new protein analysis algorithms while optimizing cost and maintaining performance.
Architecture Diagram
[Architecture diagram: protein folding and design workflow on AWS Batch, following the eight steps described below]
Step 1
AWS CloudFormation deploys the infrastructure in your AWS account.
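If you prefer to script the deployment, the stack can also be launched with the AWS SDK for Python (Boto3). The following is a minimal sketch; the stack name, template URL, and capability list are illustrative placeholders, not the Guidance's actual values.

```python
import boto3

cfn = boto3.client("cloudformation")

# Launch the Guidance template (placeholder stack name and URL).
cfn.create_stack(
    StackName="batch-protein-folding",
    TemplateURL="https://example-bucket.s3.amazonaws.com/batch-protein-folding.yaml",
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)

# Block until CloudFormation finishes creating every resource.
cfn.get_waiter("stack_create_complete").wait(StackName="batch-protein-folding")
```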
Step 2
AWS CodeBuild builds the containers needed to run the analysis algorithms, such as AlphaFold and OpenFold.
All of the analysis algorithms are packaged as Docker containers and stored in Amazon Elastic Container Registry (Amazon ECR) in the deployment account. This helps ensure that all usage information remains private.
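Once the builds finish, you can confirm the images landed in your private registry. This sketch lists the tags in one repository; the repository name is a hypothetical example.

```python
import boto3

ecr = boto3.client("ecr")

# List the image tags CodeBuild pushed to a private repository
# (repository name is a placeholder).
for image in ecr.describe_images(repositoryName="alphafold")["imageDetails"]:
    print(image.get("imageTags"), image["imagePushedAt"])
```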
Step 3
AWS Lambda triggers the download of model artifacts and reference data to an Amazon FSx for Lustre file system.
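The deployed infrastructure wires this trigger up for you, but the idea can be sketched as a Lambda handler that submits a data-preparation job to AWS Batch; the job's container then writes the model weights and reference databases to the FSx for Lustre mount. The queue and job definition names below are hypothetical.

```python
import boto3

batch = boto3.client("batch")

def handler(event, context):
    # Submit a Batch job whose container downloads model artifacts and
    # reference data onto the shared FSx for Lustre file system.
    # Queue and job definition names are placeholders.
    response = batch.submit_job(
        jobName="download-reference-data",
        jobQueue="protein-folding-cpu-queue",
        jobDefinition="download-job-definition",
    )
    return {"jobId": response["jobId"]}
```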
Step 4
Define and submit analysis jobs from an Amazon SageMaker notebook instance or another Python environment.
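The Guidance ships a Python SDK for this step; whatever the interface, a submission ultimately issues an AWS Batch submit_job call along these lines. The queue, job definition, and command arguments shown are illustrative assumptions rather than the SDK's actual names.

```python
import boto3

batch = boto3.client("batch")

# Submit a folding job for a single sequence (all names are placeholders).
job = batch.submit_job(
    jobName="alphafold-P12345",
    jobQueue="protein-folding-gpu-queue",
    jobDefinition="alphafold-job-definition",
    containerOverrides={
        "command": [
            "--fasta_paths=s3://my-input-bucket/P12345.fasta",
            "--output_dir=/fsx/outputs",
        ]
    },
)
print("Submitted job:", job["jobId"])
```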
Step 5
AWS Batch manages job scheduling and orchestration.
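You can track a submitted job with the same client. This minimal helper polls until AWS Batch reports a terminal state.

```python
import time
import boto3

batch = boto3.client("batch")

def wait_for_job(job_id: str, poll_seconds: int = 30) -> str:
    """Poll AWS Batch until the job reaches SUCCEEDED or FAILED."""
    while True:
        job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
        if job["status"] in ("SUCCEEDED", "FAILED"):
            return job["status"]
        time.sleep(poll_seconds)
```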
Step 6
Jobs run in general-purpose or accelerated compute environments based on their vCPU, memory, and GPU requirements.
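Placement is driven by the resource requirements declared on the job. For example, requesting a GPU at submission time routes the job to the accelerated environment; the sketch below uses hypothetical names and illustrative values.

```python
import boto3

batch = boto3.client("batch")

batch.submit_job(
    jobName="openfold-gpu-job",
    jobQueue="protein-folding-gpu-queue",     # placeholder queue
    jobDefinition="openfold-job-definition",  # placeholder definition
    containerOverrides={
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32768"},  # MiB
            {"type": "GPU", "value": "1"},         # steers the job to GPU hosts
        ]
    },
)
```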
Step 7
Jobs write outputs and results to an encrypted Amazon Simple Storage Service (Amazon S3) bucket.
Step 8
Users download job outputs to visualize the results or use them in downstream analysis.
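A download might look like the following sketch, which copies every object under a result prefix into the local working directory; the bucket and prefix are placeholders.

```python
import boto3

s3 = boto3.client("s3")

bucket = "my-results-bucket"            # placeholder bucket
prefix = "outputs/alphafold-P12345/"    # placeholder prefix

# Page through the result objects and download each file.
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):
            continue  # skip folder markers
        s3.download_file(bucket, obj["Key"], obj["Key"].split("/")[-1])
```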
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
Customers deploy architecture components using CloudFormation. Solution changes are tested and deployed using GitLab pipelines. Customers can submit jobs and process the results through a Python software development kit (SDK), including from Jupyter notebooks. Jobs write all results and metrics to Amazon S3.
Security
All analysis jobs run within private subnets and use minimal AWS Identity and Access Management (IAM) policies to manage access to AWS services. All data is encrypted at rest and in transit. Amazon S3 data transfer occurs through a VPC endpoint.
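For illustration, an S3 gateway endpoint of the kind the template provisions could be created as follows; the Region, VPC, and route table IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Route S3 traffic from the private subnets through a gateway endpoint
# instead of the public internet (IDs are placeholders).
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```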
Reliability
Analysis algorithms are split into independent containers and Python classes for modular execution and updates. AWS Batch provides automatic job retry logic, and job inputs and outputs are stored in Amazon S3. Additionally, the CloudFormation template provisions an attached data repository for the FSx for Lustre file system to rapidly restore reference data.
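Retry behavior is declared on the job definition. The sketch below registers a hypothetical definition that lets AWS Batch rerun transient failures up to three times; the image URI and resource values are placeholders.

```python
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="alphafold-job-definition",  # placeholder name
    type="container",
    containerProperties={
        # Placeholder image URI in the deployment account's ECR registry.
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/alphafold:latest",
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32768"},
        ],
    },
    retryStrategy={"attempts": 3},  # Batch retries failed attempts automatically
)
```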
Performance Efficiency
Protein folding algorithms require large sequence databases for data preparation and can take minutes to hours to finish. AWS Batch supports FSx for Lustre mounts and extended run times, and both services are well suited to HPC use cases, such as protein folding workloads with high input/output (I/O) requirements.
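One common pattern, assumed in this sketch, is to mount the file system on the container hosts (for example at /fsx through the compute environment's launch template) and expose it to each job as a host volume; every name and path below is a placeholder.

```python
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="folding-with-fsx",  # placeholder name
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/openfold:latest",  # placeholder
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32768"},
        ],
        # Surface the host's FSx for Lustre mount inside the container so
        # jobs read reference databases at local-disk-like speeds.
        "volumes": [{"name": "fsx", "host": {"sourcePath": "/fsx"}}],
        "mountPoints": [
            {"sourceVolume": "fsx", "containerPath": "/fsx", "readOnly": False}
        ],
    },
)
```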
Cost Optimization
AWS Batch automatically deprovisions compute resources when jobs finish. Customers can use Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances, which offer up to a 90% discount compared to On-Demand Instances, and AWS Graviton-based instance types for some jobs. Graviton instances are optimized for cloud workloads and can deliver up to 40% better price performance than comparable current-generation x86-based instances.
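A Spot-backed managed compute environment can be sketched as follows; the subnet, security group, and IAM role ARNs are placeholders. Setting minvCpus to 0 lets the environment scale to zero between jobs.

```python
import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="folding-spot-ce",  # placeholder name
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,      # deprovision completely when the queue is empty
        "maxvCpus": 256,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-0123456789abcdef0"],       # placeholder
        "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",  # placeholder
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",  # placeholder
)
```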
Sustainability
AWS Batch automatically scales compute resources to match the jobs in a managed queue. This architecture also includes benchmarking results and default parameters to minimize the hardware resources each job consumes.
Implementation Resources
The sample code is a starting point. It is industry-validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Predicting protein structures at scale using AWS Batch
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.