Overview

We developed a freely available, easy to run implementation of bcbio-nextgen on Amazon Web Services (AWS) using
Docker. bcbio is a community developed tool providing validated and
scalable variant calling and RNA-seq analysis. The AWS
implementation automates all of the steps of building a cluster,
attaching high performance shared filesystems, and running an
analysis. This makes bcbio readily available to the research community without the need to install and configure a local copy. The entire installation bootstraps from standard Linux AMIs, enabling adjustment of the tools, genome data and code without needing to prepare custom AMIs. The
implementation uses Elasticluster to provision and configure the
cluster. We automate the process with the boto Python interface to
AWS and Ansible scripts. bcbio-vm isolates code and tools inside a
Docker container allowing runs on any remote machine with a
download of the Docker image and access to the shared filesystem.
Analyses run directly from S3 buckets, with automatic streaming
download of input data and upload of final processed data. We
provide timing benchmarks for running a full variant calling
analysis using bcbio on AWS. The benchmark dataset was a cancer
tumor/normal evaluation, from the ICGC-TCGA DREAM challenge, with
100x coverage in exome regions. We compared the results of running
this dataset on 2 different networked filesystems: Lustre and NFS.
We also show benchmarks for an RNA-seq dataset using inputs from
the Sequencing Quality Control (SEQC) project. We developed bcbio
on AWS and ran these timing benchmarks thanks to work with great
partners. A collaboration with Biogen and Rudy Tanzi's group at MGH
funded the development of bcbio on AWS. A second collaboration with
Intel Health and Life Sciences and AstraZeneca funded the somatic
variant calling benchmarking work. We're thankful for all the relationships that make this work possible:

- John Morrissey automated the process of starting a bcbio cluster on AWS and attaching a Lustre filesystem. He also automated the approach to generating graphs of resource usage from collectl stats and provided critical front-line testing and improvements to all the components of the bcbio AWS interaction.
- Kristina Kermanshahche and Robert Read at Intel provided great support helping us get the Lustre ICEL CloudFormation templates running.
- Ronen Artzi, Michael Heimlich, and Justin Johnson at AstraZeneca set up Lustre, Gluster and NFS benchmarks using a bcbio StarCluster instance. This initial validation was essential for convincing us of the value of moving to a shared filesystem on AWS.
- Jason Tetrault, Karl Gutwin and Hank Wu at Biogen provided valuable feedback, suggestions and resources for developing bcbio on AWS.
- Glen Otero parsed the collectl data and provided graphs, which gave us a detailed look into the potential causes of bottlenecks we found in the timings.
- James Cuff, Paul Edmon and the team at Harvard FAS research computing built and administered the Regal Lustre setup used for local testing.
- John Kern and other members of the bcbio community tested, debugged and helped identify issues with the implementation.

Community feedback and contributions are essential to bcbio
development.

Architecture

The implementation provides both a practical way to run large-scale variant calling and RNA-seq analysis, as well as a flexible backend architecture suitable for production-quality runs. This writeup might feel a bit like a black triangle moment since I also wrote about running bcbio on AWS three years ago. That implementation was a demonstration for small-scale usage rather than a production-ready system. We now have a setup we can support and run on large-scale projects thanks to numerous changes in the backend architecture:

- Amazon, and cloud-based providers in general, now provide high-end filesystems and networking. Our AWS runs are fast because they use SSD-backed storage, fast networking connectivity and high-end processors that would be difficult to invest in for a local cluster. Renting these is economically feasible now that we have an approach to provision resources, run the analysis, and tear everything down. The gap between local cluster hardware and cloud hardware will continue to widen with upcoming improvements in compute (Haswell processors) and storage (16TB EBS SSD volumes).

- Isolating all of the software and code inside Docker containers enables rapid deployment of fixes and improvements. From an open source support perspective, Amazon provides a consistent cluster environment we have full control over, limiting the space of potential system-specific issues. From a researcher's perspective, this will allow use of bcbio without needing to spend time installing and testing locally.

- The setup runs from standard Linux base images using Ansible scripts and Elasticluster. This means we no longer need to build and update AMIs for changes in the architecture or code, which simplifies testing and pushing fixes, letting us spend less time on support and more on development. Since we avoid having a pre-built AMI, the process of building and running bcbio on AWS is fully auditable for both security and scientific accuracy. Finally, it provides a path to support bcbio on container-specific management services like Amazon's EC2 Container Service.

- All long-term data storage happens in Amazon's S3 object store, including both analysis-specific data and general reference genome data. Downloading reference data for an analysis on demand removes the requirement to maintain large shared EBS volumes. On the analysis side, you maintain only the input files and high-value output files in S3, removing the intermediates upon completion of the analysis. Removing the need to manage EBS volumes also provides a cost savings ($0.03/GB/month for S3 versus $0.10+/GB/month for EBS) and allows the option of archiving in Glacier for long-term storage.

All of these architectural changes provide a setup that is easier to maintain and scale over time. Our goal moving ahead is to provide a researcher-friendly interface for setting up and running analyses. We hope to achieve that through the in-development Common Workflow Language from Galaxy, Arvados, Seven Bridges, Taverna and the open bioinformatics community.

Variant calling – benchmarking AWS versus local

We benchmarked somatic variant
calling in two environments: on the Elasticluster Docker AWS
implementation and on local Harvard FAS machines. AWS processing
was twice as fast as a local run. The gains occur in disk IO
intensive steps like alignment post-processing. AWS offers the
opportunity to rent SSD backed storage and obtain a 10GigE
connected cluster without contention for network resources. Our
local test machines have an in-production Lustre filesystem
attached to a large highly utilized cluster provided by Harvard FAS
research computing. At this scale Lustre and NFS have similar
throughput, with Lustre outperforming NFS during IO intensive steps
like alignment, post-processing and large BAM file merging. Based on previous benchmarking work, we'll need to process additional samples
in parallel to fully stress the shared filesystem and differentiate
Lustre versus NFS performance. However, the resource plots at this
scale show potential future bottlenecks during alignment,
post-processing and other IO intensive steps. Generally, having
Lustre scaled across 4 LUNs per object storage server (OSS) enables
better distribution of disk and network resources. AWS runs use two
c3.8xlarge instances clustered in a single placement group, providing 32 cores and 60GB of memory per machine. Our local run used two comparable compute machines, each with 32 cores and 128GB of memory, connected to a Lustre filesystem. The benchmark is
a cancer tumor/normal evaluation consisting of alignment,
recalibration, realignment and variant detection with four
different callers. The input is a tumor/normal pair from the
ICGC-TCGA DREAM challenge with 100x coverage in exome regions. Here
are the times, in hours, for each benchmark:

    Step                       AWS (Lustre)  AWS (NFS)  Local (Lustre)
    Total                      4:42          5:05       10:30
    genome data preparation    0:04          0:10
    alignment preparation      0:12          0:15
    alignment                  0:29          0:52       0:53
    callable regions           0:44          0:44       1:25
    alignment post-processing  0:13          0:21       4:36
    variant calling            2:35          2:05       2:36
    variant post-processing    0:05          0:03       0:22
    prepped BAM merging        0:03          0:18       0:06
    validation                 0:05          0:05       0:09
    population database        0:06          0:07       0:09

To provide more insight into the timing differences between
these benchmarks, we automated collection and plotting of resource
usage on AWS runs.

Variant calling – resource usage plots

bcbio retrieves collectl usage statistics from the server and
prepares graphs of CPU, memory, disk and network usage. These plots
allow in-depth insight into limiting factors during individual
steps in the workflow. We'll highlight some interesting comparisons
between NFS and Lustre during the variant calling benchmarking.
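For reference, these resource graphs can be regenerated from a finished run's log and the collected collectl data. The sketch below assumes the bcbio_vm.py graph subcommand; the exact name, arguments and log path may differ between bcbio-vm versions.

    # Sketch: rebuild CPU, memory, disk and network usage graphs from the
    # collectl stats gathered during a run. The "graph" subcommand and the log
    # path are assumptions that may vary by bcbio-vm version.
    bcbio_vm.py graph log/bcbio-nextgen.log
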
During this benchmark, the two critical resources were CPU usage
and disk IO on the shared filesystems. We also measured memory usage but that was not a limiting factor with these analyses. In addition to the comparisons highlighted below, we have the full set of resource usage graphs available for each run:

- Variant calling with NFS on AWS
- Variant calling with Lustre on AWS
- RNA-seq on a single machine on AWS

CPU

These plots compare CPU usage during processing
steps for Lustre and NFS. The largest differences between the two
runs are in the alignment, alignment post-processing and variant
calling steps:

[CPU usage plots: NFS and Lustre]

For alignment and alignment
post-processing the Lustre runs show more stable CPU usage. NFS
specifically spends more time in the CPU wait state (red line)
during IO intensive steps. On larger scale projects this may become
a limiting factor in processing throughput. The variant calling
step was slower on Lustre than NFS, with inconsistent CPU usage.
We'll have to investigate this slowdown further, since no other
metrics point to an obvious bottleneck.

Shared filesystem network usage and IO

These plots compare network usage during processing
for Lustre and NFS. We use this as a consistent proxy for the
performance of the shared filesystem and disk IO (the NFS plots do
have directly measured disk IO for comparison purposes).

[Network usage and disk IO plots: NFS and Lustre]

The biggest difference in the IO intensive steps is that Lustre
network usage is smoother compared to the spiky NFS input/output,
due to spreading out read/writes over multiple disks. Including
more processes with additional read/writes will help determine how
these differences translate to scaling on larger numbers of
simultaneous samples.

RNA-seq benchmarking

We also ran an RNA-seq
analysis using 4 samples from the Sequencing Quality Control (SEQC)
project. Each sample has 15 million 100bp paired reads. bcbio
handled trimming, alignment with STAR, and quantitation with DEXSeq
and Cufflinks. We ran on a single AWS c3.8xlarge machine with 32 cores, 60GB of memory, and attached SSD storage. RNA-seq
optimization in bcbio is at an earlier stage than variant calling.
We've done work to speed up trimming and aligning, but haven't yet
optimized the expression and count steps. The analysis runs quickly, in 6 1/2 hours, but there is still room for further optimization, and this is a nice example of how we can use benchmarking plots to identify targets for additional work:

    Total                      6:25
    genome data preparation    0:32
    adapter trimming           0:32
    alignment                  0:24
    estimate expression        3:41
    quality control            1:16

The RNA-seq collectl plots
show the cause of the slower steps during expression estimation and
quality control. Here is CPU usage over the run: The low CPU usage
during the first 2 hours of expression estimation corresponds to
DEXSeq running serially over the 4 samples. In contrast with Cufflinks, which parallelizes over all 32 cores, DEXSeq runs on a single core. We could run these steps in parallel by using multiprocessing to launch the jobs, split by sample, as sketched below.
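To make the per-sample fan-out concrete, here is a minimal sketch written with shell job control standing in for Python multiprocessing; run_dexseq_counts and the sample names are hypothetical placeholders, not bcbio commands.

    # Hypothetical illustration: launch the per-sample DEXSeq counting step for
    # all four samples concurrently instead of serially. run_dexseq_counts and
    # the sample names are placeholders.
    for sample in SEQC_A SEQC_B SEQC_C SEQC_D; do
        run_dexseq_counts "$sample" &
    done
    wait    # continue once all samples have finished
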
Similarly, the QC steps could benefit from parallel processing. Alternatively,
we're looking at validating other approaches for doing
quantification like eXpress. These are the type of benchmarking and
validation steps that are continually ongoing in the development of
bcbio pipelines.

Reproducing the analysis

The process to launch the cluster and an NFS or optional Lustre shared filesystem is fully automated and documented. It sets up permissions, VPCs, clusters and shared filesystems from a basic AWS account, so requires minimal manual work. bcbio_vm.py has commands to:

- Add an IAM user, a VPC and create the Elasticluster config.
- Launch a cluster and bootstrap with the latest bcbio code and data.
- Create and mount a Lustre filesystem attached to the cluster.
- Terminate the cluster and Lustre stack upon completion.
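A minimal sketch of that lifecycle is below; the exact subcommand names and options are assumptions based on the bcbio-vm documentation and may differ between versions.

    # Sketch of the cluster lifecycle; subcommand names and flags are
    # assumptions and may differ between bcbio-vm versions.
    bcbio_vm.py aws iam                 # create an IAM user for bcbio runs
    bcbio_vm.py aws vpc                 # set up a VPC and Elasticluster config
    bcbio_vm.py aws cluster start       # launch the cluster
    bcbio_vm.py aws cluster bootstrap   # install the latest bcbio code and data
    bcbio_vm.py aws icel create         # create a Lustre (ICEL) stack (optional)
    bcbio_vm.py aws icel mount          # mount Lustre on the cluster nodes
    # ... run the analysis ...
    bcbio_vm.py aws icel stop           # tear down the Lustre stack
    bcbio_vm.py aws cluster stop        # terminate the cluster
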
The processing handles download of input data from S3 and upload back to S3 on finalization. We store data encrypted on S3 and manage access using IAM instance profiles. The examples below show how to run both a somatic variant calling evaluation and an RNA-seq evaluation.

Running the somatic variant calling evaluation

This analysis performs an evaluation of variant calling using a tumor/normal somatic sample from the DREAM challenge. To run, prepare an S3 bucket to run the analysis from.
Copy the configuration file to your own personal bucket and add a
GATK jar. You can use the AWS console or any available S3 client to
do this. For example, using the AWS command line client:
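The bucket, configuration file and jar paths below are placeholders to adapt to your own setup; the commands are standard aws s3 operations.

    # Placeholder names; substitute your own bucket, configuration file and jar.
    aws s3 mb s3://YOUR-BUCKET                              # create the bucket
    aws s3 cp your-analysis-config.yaml s3://YOUR-BUCKET/   # copy the bcbio configuration
    aws s3 cp GenomeAnalysisTK.jar s3://YOUR-BUCKET/jars/   # add the GATK jar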