Adina Howe was a postdoctoral fellow at Michigan State University with C. Titus Brown (now Associate Professor of Genetics at the University of California, Davis) when she found herself trying to assemble bacterial genomes from metagenomics data sets so large—nearly 400 billion bases worth—that the assembler software couldn’t keep up. “They would require hundreds of gigs of memory that we didn’t have,” she explains.
Typically, a raw next-generation sequencing data set is ~30 GB per sample, an order of magnitude smaller than Howe’s. Yet even at that size, a computing neophyte would encounter significant practical difficulties: moving the data from place to place or even figuring out how to open the files takes savvy. “They can’t open it [in] any Microsoft product,” Howe notes, as the software would likely crash. And of course, processing the data (filtering out low-quality sequences, for instance) would be even more difficult.
Fortunately, Brown had been developing tools to handle such problems for years. His lab’s flagship software is “khmer,” a tool that breaks sequences into a series of short, overlapping “words” of a chosen length k (that is, k-mers), simplifying tasks such as genome assembly.
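To make the idea of k-mers concrete, here is a minimal Python sketch of how a sequence can be cut into overlapping words of length k. The function name `kmers` and the example read are illustrative assumptions; this is not khmer's own code or API.

```python
def kmers(sequence, k):
    """Return every overlapping substring of length k, in order.

    Illustrative helper only (hypothetical name, not part of khmer).
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# A made-up 7-base read split into 4-mers.
words = kmers("ATGGCGT", 4)
print(words)  # ['ATGG', 'TGGC', 'GGCG', 'GCGT']
```

Each word overlaps the next by k - 1 bases, and that overlap is what lets an assembler chain words back into longer sequences.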
To tackle Howe’s metagenomics problem, Brown’s team implemented a probabilistic data structure called a “Bloom filter,” which reduces the amount of memory required for sequences some 40-fold. As Howe explains it, this structure is like a faculty mailroom in which each box is shared by several instructors. By creating multiple “rooms” in which those faculty pairings are shuffled, it becomes statistically possible to determine how likely it is that any given individual has mail (in this case, an overlapping sequence to fit into a growing assembly) by checking to see if those different boxes are full.
“I go to my mailbox and I ask, ‘Hey, are any potential connecting sequences stored in my data structure?’ If not, I know that they don’t exist, and I don’t have to pursue that path any more; otherwise, I’m going to check my other ‘mailbox rooms’.”
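The mailroom analogy can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in khmer: the class name, the number of rooms, the boxes per room, and the use of SHA-256 to pick a box are all choices made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter modeled on the 'mailroom' analogy (illustrative only)."""

    def __init__(self, num_rooms=4, boxes_per_room=1000):
        self.boxes_per_room = boxes_per_room
        # One byte per mailbox: 0 means empty, 1 means full.
        self.rooms = [bytearray(boxes_per_room) for _ in range(num_rooms)]

    def _box(self, room_index, item):
        # Mixing the room index into the hash "shuffles" which items
        # share a box in each room.
        digest = hashlib.sha256(f"{room_index}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.boxes_per_room

    def add(self, item):
        # Put "mail" in this item's box in every room.
        for i, room in enumerate(self.rooms):
            room[self._box(i, item)] = 1

    def __contains__(self, item):
        # One empty box proves the item was never added; if every box is
        # full, the item is probably (not certainly) present.
        return all(room[self._box(i, item)] for i, room in enumerate(self.rooms))


seen = BloomFilter()
for word in ["ATGG", "TGGC", "GGCG", "GCGT"]:
    seen.add(word)

print("TGGC" in seen)  # True: an added k-mer is never reported missing
print("TTTT" in seen)  # almost certainly False; false positives are possible but rare
```

The memory saving comes from storing only the boxes rather than the k-mers themselves. The price is an occasional false positive when unrelated k-mers happen to fill all of an absent k-mer's boxes.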
In the end, Howe used this approach to produce an assembly of nearly 5.5 million protein-coding genes from her metagenomics data set.
“It took us a good frustrating year,” recalls Howe, who is now an Assistant Professor of Agricultural and Biosystems Engineering at Iowa State University.
The original story was published by Biotechniques.