We recently posted a preprint entitled "Exploring neighborhoods
in large metagenome assembly graphs reveals hidden sequence
diversity" (biorxiv
link). This work is the first but hopefully by no
means the last product of the spacegraphcats
project, a collaboration between my lab, Blair Sullivan's
lab at NC State, and Jeff Heer's lab at UW Seattle. It is the first
fruit of two and a half years of intense work by one of the most
interdisciplinary teams I've ever had the pleasure of working with,
and I want to share some of the story!
The background
In January and February of 2016, I read two papers that had a
tremendous impact on my thinking - the MetaPalette
paper and the
mash paper (see my
blog post on k-mers and taxonomy for more background here). I
immediately jumped on these concepts, reimplementing
mash in Python and started playing around with it - this was
the start of sourmash.
I then spent a few months playing around with applying MinHash
to De Bruijn graphs of metagenomes. Since MinHash works with
k-mers, and De Bruijn graphs are just collections of k-mers, I
thought, hey, maybe we can do something cool with MinHash and De
Bruijn graph structure. So I spent a lot of time hacking on khmer
and sourmash and playing with various concepts.
(I still think the ideas I generated during this phase
are really promising, but after two years I have many, many, many
detailed reasons as to why most of them are naive and not only
don't work but can't work. Science sucks, man. But it did lead in
several excellent directions so the work is not wasted.)
The conclusion I reached from all this work was that we needed a
good way to select k-mers from neighborhoods within a
graph. Collections of shotgun reads were not so useful for this -
close bits of the graph are not local within the reads. We'd done
some work on partitioning graphs as part of Pell et al., 2014
but I knew that partitioning wouldn't work here. (Turns out
assembly graphs are generally completely connected, for
Reasons.)
Enter the Barnraising for Data-Intensive Discovery
In May 2016, I joined about 20 people at an event organized by
Blair Sullivan and Casey Greene, entitled the Barnraising for
Data-Intensive Discovery. The idea was to gather a bunch of
researchers in an isolated location for a week, define some
projects of mutual interest, and hack on them together in a spirit
of mutual assistance - hence "Barnraising". The invitees included
anyone funded by the Moore Data Driven Discovery program, and we
ended up with a nice mixture of computer scientists, biologists,
mathematicians, and other researchers.
The Barnraising was held at the Mount Desert Island Biological
Laboratory in Maine, and while the weather was absolutely miserable
(cold and rainy!) the environment inside was
pretty great! At the opening meeting I pitched my "hey I need to
cut graphs up into neighborhoods" idea and it intrigued several
people, including Blair and two people from her lab, Felix Reidl
and Michael O'Brien, a postdoc and grad student respectively. We
also recruited Dominik Moritz, a graduate student from Jeff Heer's
lab at UW (Jeff was also an Investigator, and the UW eScience
Institute was a Moore DDD Data Science Environment awardee).
(These four researchers ended up being four of the other five
authors on this paper!)
My initial (bad) idea, and the origin of the name
"spacegraphcats"
The idea I'd pitched to the larger group was that metagenome
assembly graphs should, by all rights, consist of neighborhoods of
more densely connected k-mers (representing individual species)
that were sparsely connected to other (unrelated) k-mers; and that
this sparse graph could, potentially, be cut into
species neighborhoods. I was interested in looking directly at the
assembly graph structure to make these cuts, without going through
the filter of outputting contigs (which do cut the graph,
but... not in a way that I trust :).
During the course of the project, the ideas got significantly
modified as it became clear that I didn't know that much about
graphs in general, or metagenome graphs, and that my intuition was
(in places) flatly wrong.
Now, you may wonder why the software mentioned in the paper is
called "spacegraphcats". Well, if you pass "sparse graph cuts"
through the mind of an idle German mathematician (coff FELIX coff)
then said mathematician may come up with the name "spacegraphcats".
And of course when looking for logos and related memes online,
well, "space" and "cats" make things easy. And thus
"spacegraphcats" was born.
Despite the changes in the project focus, the name stayed at
"spacegraphcats", 'cause it was cool... and why not?
What did we actually end up doing?
The whole project has a pretty broad scope, but for this first
paper, we ended up focusing in on a fairly targeted question: can
we improve the completeness of metagenome-assembled genomes, or
MAGs?
Briefly, MAGs are genomes computationally extracted from
metagenomes. The usual process of inferring MAGs is to assemble a
metagenomic sample, then group contigs together into putative
genomes based on sequence composition, estimated contig abundance,
and phylogenetic marker genes. The completeness of these "genome
bins" could then be estimated by the fraction of single copy genes
thought to be essential for life.
The process seems to work pretty well, and MAGs have become
incredibly popular, with well over 10,000 MAGs extracted in the
last few years (see the first paragraph of the spacegraphcats paper
for references).
But I suspected that there was more to be done. There are
several shortcomings of MAGs and the binning progress: the main one
(for me) is that generally the binning depends on contigs that come
out of DNA sequence assembly, which is a challenging and fragile
process. I believed that metagenomes probably contained more
sequence and more sequence variation than was represented
in the MAGs, and I thought we were probably underestimating the
content of MAGs by ignoring accessory genome content. But, while I
have reasons for these opinions, I don't (didn't) have specifics -
it is surprisingly difficult to nail down missing content in
complex communities!
So I was really, really interested in taking a look at what was
in the graph neighborhood of these bins. And after some significant
rephrasing of the original sparse graph cuts question by Felix,
Mike, Dominik, and Blair, we realized that a graph
neighborhood search function would be directly useful. The
idea would be to reach into the assembly graph with a set of query
k-mers (or reads) and extract the full set of k-mers proximal (in
the graph) to those queries.
What this led to was an implementation of an r-dominating set
algorithm, where a subset of nodes in the compact De Bruijn graph
are chosen such that every node in the cDBG is within
distance r of one of these chosen nodes. This forms what is known
as a dominating set, and it provides a kind of clustering over the
entire graph. These dominating nodes could then be used to define a
local neighborhood: we could go from a k-mer to a cDBG node to its
dominator to the dominated nodes, and retrieve everything that
belonged to a dominated nodes as part of that neighborhood.
Now, defining an r dominating set is NP-hard on large graphs,
but conveniently, Blair's group was interested in and knowledgeable
about efficiently calculating certain properties for graphs of
bounded
expansion. Graphs of bounded expansion permit certain
optimizations of otherwise difficult algorithms, and (mirabilis
visu) compact De Bruijn graphs turn out to be just this kind of
graph, because each node has at most 8 edges.
In the end, we were able to calculate r-dominating sets in
linear time for cDBGs because of this property. (Again,
see the paper for details.)
Implementation, implementation, implementation ...and real
data.
We actually walked out of the barnraising with a functioning
dominating set calculator that we could apply to real data. But we
had to build a lot of code infrastructure around this,
including -
- code to build a cDBG (since thankfully replaced with
BCALM);
- graph indexing that let us go from a k-mer to a cDBG node (we
ended up using the bbhash minimal perfect hash function for
this);
- graph indexing that let us go from a cDBG node to the set of
reads that had contributed k-mers to that node;
...all of which now resides in the spacegraphcats codebase.
The process of building the spacegraphcats program was heavily
iterative. I think we went through at least two rewrites of the
dominating set algorithm, which is now the least computationally
intensive part of the whole package. I spent more than 18 months
trying to get MinHash based search to work, only to essentially
prove that due to the rampant strain variation in real data sets,
MinHash based methods were too insensitive to find graph regions at
the necessary resolution. (More on this at some other time.)
Overall, I think we wrote 5-10x as much code as we ended up using
in the final paper.
So what did we find?
The two key biological findings of the paper, in my mind, are
these:
1) The graph neighborhoods of metagenome-assembled genomes
contain real content that almost certainly belongs in the bin, but
that was not included in the bin.
2) One of the reasons for this content to be missing is
that there is a lot of strain variation present in the
neighborhood, and this prevents assembly (and hence binning). It's
hard to estimate the magnitude of this effect but it's at least
10-20% in the data set we chose to benchmark with.
We do lots of other stuff in the paper, and it's chock full of
interesting ideas, but those are the two main takeaways for
biologists.
(If you're a computer scientist, you should be
salivating at the dominating set algorithm and implementation. But
this blog post is for biologists :)
What is particularly neat about the paper?
I think the underlying spacegraphcats algorithm, code, and way
of thinking is powerful and will be enabling in all sorts of
unexpected ways. But of course I'm not yet sure which parts are
going to actually be useful to others. We chose the binning work
for this paper because it was the part we could most concretely
discuss; it is definitely not the most interesting thing
about the spacegraphcats project overall.
Probably the coolest unexpected thing about this paper
was how well the Plass amino acid assembler worked on the
neighborhoods. And therein lies a bit of a story --
You see, we got to the point where everything was
working in ~February or March of 2018. We could do the
genome queries, we could get the reads from the neighborhood,
and... we were kinda stuck there. If (as we suspected) the reads
didn't assemble well due to strain variation, then what do you do
with them? We were thinking of running different nucleotide
assemblers, but it is rare that one assembler unambiguously
outperforms another assembler. And there wasn't much we could do
without an assembler here.
This point bears repeated emphasis. The strong underlying
hypothesis was that one of the major reasons sequence wasn't
being binned was because it wasn't assembling. But you can't
really do all that much with 100 bp reads other than try
to assemble them! And (as we suspected) the nucleotide assemblers
didn't perform all that well on the reads, when we measured it.
What could we do??
Assembling in amino acid space
So we started hunting around for ways to look at the reads that
would not involve running them through a nucleotide assembler. I'd
recently seen the Plass
paper, which talked about assembling in amino acid space, and
so I decided to give it a try; in theory, at least, amino acids
should be less affected by nucleotide strain variation, because all
of the synonymous codons would be collapsed.
Somewhat to my surprise, the Plass assembler worked wonderfully
well out of the box, and - in at least a few cases - dramatically
increased the estimated completeness of the bins. When we looked,
we confirmed that Plass was (as hoped) assembling many more protein
sequences. And we also saw a surprising amount of amino acid
variability in what was being assembled.
But what did it mean? WHAT DID IT MEAN??
Enter the final author!
Here we asked Taylor Reiter to dig in. Taylor is a graduate
student in my lab who joined the project to help with the
biological data analysis. She'd been working on assessing the
functional content of the bins since March 2018 or so, and when I
threw the Plass assemblies her way she tore into them.
What she discovered was that even with ridiculously
stringent error trimming on the underlying data (a hard k-mer trim
of abundance=5 on all the reads prior to assembly), we were seeing
many minor variants in really highly conserved genes like gyrA. I
think we still don't have a handle on the number and importance of
these variants, but at the nucleotide level there seem to be dozens
of variants. And, at least in a few neighborhoods, there seem to be
two or more high abundance amino acid contigs, suggesting two or
more significant strains.
And that's where we left the paper.
The paper concludes with the observation that there's a lot of
unseen variability in metagenomes, and that they are unseen in part
because of challenges with the tools that we use. It's not clear
how much biological impact this might have on MAG analysis, e.g. on
attempts to reconstruct metabolic potential, but indications are
that it has at least some. We're thinking about how to follow up on
this concretely.
In the meantime, with this first paper, we have produced a
reasonably robust tool, based on a strong and efficient algorithm,
that lets us systematically look at metagenomes in a way that we
basically haven't been able to do before. I'm pretty stoked about
this, in part because (as has been my experience with other papers
like this) I suspect that there will be many interesting
computational extensions that can be built on this foundation, by
people that are smarter than me about assembly graphs. I'm looking
forward to it!
--titus