Overview

We developed a freely available, easy to run implementation of bcbio-nextgen on Amazon Web Services (AWS) using
Docker. bcbio is a community developed tool providing validated and
scalable variant calling and RNA-seq analysis. The AWS
implementation automates all of the steps of building a cluster,
attaching high performance shared filesystems, and running an
analysis. This makes bcbio readily available to the research community without the need to install and configure a local copy. The entire installation bootstraps from standard Linux AMIs, enabling adjustment of the tools, genome data and code without needing to prepare custom AMIs. The
implementation uses Elasticluster to provision and configure the
cluster. We automate the process with the boto Python interface to
AWS and Ansible scripts. bcbio-vm isolates code and tools inside a
Docker container allowing runs on any remote machine with a
download of the Docker image and access to the shared filesystem.
Analyses run directly from S3 buckets, with automatic streaming
download of input data and upload of final processed data. We
provide timing benchmarks for running a full variant calling
analysis using bcbio on AWS. The benchmark dataset was a cancer
tumor/normal evaluation, from the ICGC-TCGA DREAM challenge, with
100x coverage in exome regions. We compared the results of running
this dataset on 2 different networked filesystems: Lustre and NFS.
We also show benchmarks for an RNA-seq dataset using inputs from
the Sequencing Quality Control (SEQC) project. We developed bcbio
on AWS and ran these timing benchmarks thanks to work with great
partners. A collaboration with Biogen and Rudy Tanzi's group at MGH
funded the development of bcbio on AWS. A second collaboration with
Intel Health and Life Sciences and AstraZeneca funded the somatic
variant calling benchmarking work. We're thankful for all the relationships that make this work possible:

- John Morrissey automated the process of starting a bcbio cluster on AWS and attaching a Lustre filesystem. He also automated the approach to generating graphs of resource usage from collectl stats and provided critical front-line testing and improvements to all the components of the bcbio AWS interaction.
- Kristina Kermanshahche and Robert Read at Intel provided great support helping us get the Lustre ICEL CloudFormation templates running.
- Ronen Artzi, Michael Heimlich, and Justin Johnson at AstraZeneca set up Lustre, Gluster and NFS benchmarks using a bcbio StarCluster instance. This initial validation was essential for convincing us of the value of moving to a shared filesystem on AWS.
- Jason Tetrault, Karl Gutwin and Hank Wu at Biogen provided valuable feedback, suggestions and resources for developing bcbio on AWS.
- Glen Otero parsed the collectl data and provided graphs, which gave us a detailed look into the potential causes of bottlenecks we found in the timings.
- James Cuff, Paul Edmon and the team at Harvard FAS research computing built and administered the Regal Lustre setup used for local testing.
- John Kern and other members of the bcbio community tested, debugged and helped identify issues with the implementation.

Community feedback and contributions are essential to bcbio
development.

Architecture

The implementation provides both a practical way to run large-scale variant calling and RNA-seq analysis, as well as a flexible backend architecture suitable for production-quality runs. This writeup might feel a bit like a black triangle moment since I also wrote about running bcbio on AWS three years ago. That implementation was a demonstration for small-scale usage rather than a production-ready system. We now have a setup we can support and run on large-scale projects thanks to numerous changes in the backend architecture:

- Amazon, and cloud-based providers in general, now provide high-end filesystems and networking. Our AWS runs are fast because they use SSD-backed storage, fast networking connectivity and high-end processors that would be difficult to invest in for a local cluster. Renting these is economically feasible now that we have an approach to provision resources, run the analysis, and tear everything down. The gap between local cluster hardware and cloud hardware will continue to widen with upcoming improvements in compute (Haswell processors) and storage (16TB EBS SSD volumes).

- Isolating all of the software and code inside Docker containers enables rapid deployment of fixes and improvements. From an open source support perspective, Amazon provides a consistent cluster environment we have full control over, limiting the space of potential system-specific issues. From a researcher's perspective, this will allow use of bcbio without needing to spend time installing and testing locally.

- The setup runs from standard Linux base images using Ansible scripts and Elasticluster. This means we no longer need to build and update AMIs for changes in the architecture or code, which simplifies testing and pushing fixes, letting us spend less time on support and more on development. Since we avoid having a pre-built AMI, the process of building and running bcbio on AWS is fully auditable for both security and scientific accuracy. Finally, it provides a path to support bcbio on container-specific management services like Amazon's EC2 Container Service.

- All long-term data storage happens in Amazon's S3 object store, including both analysis-specific data and general reference genome data. Downloading reference data for an analysis on demand removes the requirement to maintain large shared EBS volumes. On the analysis side, you maintain only the input files and high-value output files in S3, removing the intermediates upon completion of the analysis. Removing the need to manage EBS volumes also provides a cost savings ($0.03/GB/month for S3 versus $0.10+/GB/month for EBS) and allows the option of archiving in Glacier for long-term storage.

All of these architectural changes provide a setup that is easier to maintain and scale over time. Our goal moving ahead is to provide a researcher-friendly interface for setting up and running analyses. We hope to achieve that through the in-development Common Workflow Language from Galaxy, Arvados, Seven Bridges, Taverna and the open bioinformatics community.

Variant calling – benchmarking AWS versus local

We benchmarked somatic variant
calling in two environments: on the Elasticluster Docker AWS
implementation and on local Harvard FAS machines. AWS processing
was twice as fast as a local run. The gains occur in disk IO
intensive steps like alignment post-processing. AWS offers the
opportunity to rent SSD backed storage and obtain a 10GigE
connected cluster without contention for network resources. Our
local test machines have an in-production Lustre filesystem
attached to a large highly utilized cluster provided by Harvard FAS
research computing. At this scale Lustre and NFS have similar
throughput, with Lustre outperforming NFS during IO intensive steps
like alignment, post-processing and large BAM file merging. Based on previous benchmarking work, we'll need to process additional samples
in parallel to fully stress the shared filesystem and differentiate
Lustre versus NFS performance. However, the resource plots at this
scale show potential future bottlenecks during alignment,
post-processing and other IO intensive steps. Generally, having
Lustre scaled across 4 LUNs per object storage server (OSS) enables
better distribution of disk and network resources. AWS runs use two
c3.8xlarge instances clustered in a single placement group, providing 32 cores and 60GB of memory per machine. Our local run used two comparable compute machines, each with 32 cores and 128GB of memory, connected to a Lustre filesystem. The benchmark is
a cancer tumor/normal evaluation consisting of alignment,
recalibration, realignment and variant detection with four
different callers. The input is a tumor/normal pair from the
ICGC-TCGA DREAM challenge with 100x coverage in exome regions. Here
are the times, in hours, for each benchmark:

    Step                       AWS (Lustre)  AWS (NFS)  Local (Lustre)
    Total                      4:42          5:05       10:30
    genome data preparation    0:04          0:10
    alignment preparation      0:12          0:15
    alignment                  0:29          0:52       0:53
    callable regions           0:44          0:44       1:25
    alignment post-processing  0:13          0:21       4:36
    variant calling            2:35          2:05       2:36
    variant post-processing    0:05          0:03       0:22
    prepped BAM merging        0:03          0:18       0:06
    validation                 0:05          0:05       0:09
    population database        0:06          0:07       0:09

To provide more insight into the timing differences between
these benchmarks, we automated collection and plotting of resource
usage on AWS runs.

Variant calling – resource usage plots

bcbio retrieves collectl usage statistics from the server and
prepares graphs of CPU, memory, disk and network usage. These plots
allow in-depth insight into limiting factors during individual
steps in the workflow. We'll highlight some interesting comparisons
between NFS and Lustre during the variant calling benchmarking.
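For reference, these resource graphs can be regenerated from a finished run's log and the collected collectl data. The sketch below assumes the bcbio_vm.py graph subcommand; the exact name, arguments and log path may differ between bcbio-vm versions.

    # Sketch: rebuild CPU, memory, disk and network usage graphs from the
    # collectl stats gathered during a run. The "graph" subcommand and the log
    # path are assumptions that may vary by bcbio-vm version.
    bcbio_vm.py graph log/bcbio-nextgen.log
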
During this benchmark, the two critical resources were CPU usage
and disk IO on the shared filesystems. We also measured memory usage but that was not a limiting factor with these analyses. In addition to the comparisons highlighted below, we have the full set of resource usage graphs available for each run:

- Variant calling with NFS on AWS
- Variant calling with Lustre on AWS
- RNA-seq on a single machine on AWS

CPU

These plots compare CPU usage during processing
steps for Lustre and NFS. The largest differences between the two
runs are in the alignment, alignment post-processing and variant
calling steps:

[CPU usage plots: NFS and Lustre]

For alignment and alignment
post-processing the Lustre runs show more stable CPU usage. NFS
specifically spends more time in the CPU wait state (red line)
during IO intensive steps. On larger scale projects this may become
a limiting factor in processing throughput. The variant calling
step was slower on Lustre than NFS, with inconsistent CPU usage.
We'll have to investigate this slowdown further, since no other
metrics point to an obvious bottleneck.

Shared filesystem network usage and IO

These plots compare network usage during processing
for Lustre and NFS. We use this as a consistent proxy for the
performance of the shared filesystem and disk IO (the NFS plots do
have directly measured disk IO for comparison purposes).

[Network usage and disk IO plots: NFS and Lustre]

The biggest difference in the IO intensive steps is that Lustre
network usage is smoother compared to the spiky NFS input/output,
due to spreading out read/writes over multiple disks. Including
more processes with additional read/writes will help determine how
these differences translate to scaling on larger numbers of
simultaneous samples.

RNA-seq benchmarking

We also ran an RNA-seq
analysis using 4 samples from the Sequencing Quality Control (SEQC)
project. Each sample has 15 million 100bp paired reads. bcbio
handled trimming, alignment with STAR, and quantitation with DEXSeq
and Cufflinks. We ran on a single AWS c3.8xlarge machine with 32 cores, 60GB of memory, and attached SSD storage. RNA-seq
optimization in bcbio is at an earlier stage than variant calling.
We've done work to speed up trimming and aligning, but haven't yet
optimized the expression and count steps. The analysis runs quickly, in 6 1/2 hours, but there is still room for further optimization, and this is a nice example of how we can use benchmarking plots to identify targets for additional work:

    Total                      6:25
    genome data preparation    0:32
    adapter trimming           0:32
    alignment                  0:24
    estimate expression        3:41
    quality control            1:16

The RNA-seq collectl plots
show the cause of the slower steps during expression estimation and
quality control. Here is CPU usage over the run: The low CPU usage
during the first 2 hours of expression estimation corresponds to
DEXSeq running serially over the 4 samples. In contrast with Cufflinks, which parallelizes over all 32 cores, DEXSeq runs on a single core. We could run these steps in parallel by using multiprocessing to launch the jobs, split by sample, as sketched below.
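To make the per-sample fan-out concrete, here is a minimal sketch written with shell job control standing in for Python multiprocessing; run_dexseq_counts and the sample names are hypothetical placeholders, not bcbio commands.

    # Hypothetical illustration: launch the per-sample DEXSeq counting step for
    # all four samples concurrently instead of serially. run_dexseq_counts and
    # the sample names are placeholders.
    for sample in SEQC_A SEQC_B SEQC_C SEQC_D; do
        run_dexseq_counts "$sample" &
    done
    wait    # continue once all samples have finished
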
Similarly, the QC steps could benefit from parallel processing. Alternatively,
we're looking at validating other approaches for doing
quantification like eXpress. These are the type of benchmarking and
validation steps that are continually ongoing in the development of
bcbio pipelines.

Reproducing the analysis

The process to launch the cluster and an NFS or optional Lustre shared filesystem is fully automated and documented. It sets up permissions, VPCs, clusters and shared filesystems from a basic AWS account, so requires minimal manual work. bcbio_vm.py has commands to:

- Add an IAM user, a VPC and create the Elasticluster config.
- Launch a cluster and bootstrap with the latest bcbio code and data.
- Create and mount a Lustre filesystem attached to the cluster.
- Terminate the cluster and Lustre stack upon completion.
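A minimal sketch of that lifecycle is below; the exact subcommand names and options are assumptions based on the bcbio-vm documentation and may differ between versions.

    # Sketch of the cluster lifecycle; subcommand names and flags are
    # assumptions and may differ between bcbio-vm versions.
    bcbio_vm.py aws iam                 # create an IAM user for bcbio runs
    bcbio_vm.py aws vpc                 # set up a VPC and Elasticluster config
    bcbio_vm.py aws cluster start       # launch the cluster
    bcbio_vm.py aws cluster bootstrap   # install the latest bcbio code and data
    bcbio_vm.py aws icel create         # create a Lustre (ICEL) stack (optional)
    bcbio_vm.py aws icel mount          # mount Lustre on the cluster nodes
    # ... run the analysis ...
    bcbio_vm.py aws icel stop           # tear down the Lustre stack
    bcbio_vm.py aws cluster stop        # terminate the cluster
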
The processing handles download of input data from S3 and upload back to S3 on finalization. We store data encrypted on S3 and manage access using IAM instance profiles. The examples below show how to run both a somatic variant calling evaluation and an RNA-seq evaluation.

Running the somatic variant calling evaluation

This analysis performs an evaluation of variant calling using a tumor/normal somatic sample from the DREAM challenge. To run, prepare an S3 bucket to run the analysis from.
Copy the configuration file to your own personal bucket and add a
GATK jar. You can use the AWS console or any available S3 client to
do this. For example, using the AWS command line client:
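The bucket, configuration file and jar paths below are placeholders to adapt to your own setup; the commands are standard aws s3 operations.

    # Placeholder names; substitute your own bucket, configuration file and jar.
    aws s3 mb s3://YOUR-BUCKET                              # create the bucket
    aws s3 cp your-analysis-config.yaml s3://YOUR-BUCKET/   # copy the bcbio configuration
    aws s3 cp GenomeAnalysisTK.jar s3://YOUR-BUCKET/jars/   # add the GATK jar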