In previous posts I covered the basics of next-generation sequencing: library preparation, template preparation, and the sequencing methodology itself, whether by pyrophosphate detection, single base extension with reversible terminators, or probe addition by ligation. The attractiveness of single molecule sequencing as a technology has been covered here; in this post I’ll detail how the startup Pacific Biosciences does its magic. (For some additional commentary on the company and its prospects check here.)
Going (way) back to 2008’s Advances in Genome Biology and Technology meeting in Marco Island, Florida (an annual technology conference that has been described by a friend as ‘the super bowl of genomics’), Stephen Turner made a remarkable announcement in his presentation: 5,000 to 25,000 base-pair reads, sequenced in real time at 10 bases per second, on equipment with no moving parts, at a throughput of 100Gb per hour (or a full draft human genome in 15 minutes), and all at the low price of $100.
To put this in the 2008 context, Illumina at that same meeting announced a $100,000 human genome, and the Genome Analyzer II had just been announced at a market-leading 3 Gb in 5 days, a rate of 25 Mb/hour; PacBio was claiming 100 Gb/hour, an increase of some four thousand-fold. Again for context, in early 2008 the Illumina readlengths were only 35 bases, so 25,000 base pair readlengths were unheard of. With such massive claimed throughput (four thousand-fold is a remarkable number), that PacBio talk garnered a huge amount of attention for a small startup company looking to raise more money and set expectations around a ‘2010 or 2011’ product launch.
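As a quick sanity check on that four-thousand-fold figure, here is a minimal sketch of the arithmetic. It uses only the numbers quoted above; the function and variable names are my own, chosen for illustration.

```python
# Figures quoted in the post (2008 announcements).
GA2_YIELD_GB = 3.0           # Genome Analyzer II yield per run, in Gb
GA2_RUN_DAYS = 5.0           # Genome Analyzer II run duration, in days
PACBIO_GB_PER_HOUR = 100.0   # PacBio's claimed throughput, in Gb/hour


def gb_per_hour(yield_gb: float, run_days: float) -> float:
    """Average throughput in Gb/hour over a run of the given length."""
    return yield_gb / (run_days * 24.0)


ga2_rate = gb_per_hour(GA2_YIELD_GB, GA2_RUN_DAYS)  # about 0.025 Gb/hour
fold = PACBIO_GB_PER_HOUR / ga2_rate                # about 4,000

print(f"Genome Analyzer II: {ga2_rate * 1000:.0f} Mb/hour")
print(f"Claimed PacBio advantage: {fold:,.0f}-fold")
```

Note that this compares a claimed peak rate against an average over a five-day run, so it is best read as an order-of-magnitude comparison.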
And raise money they did, to the tune of a total of $370M by mid 2010, and the CEO Hugh Martin went on the record that he could “poach any talent he wanted” as they scaled up their development effort. The company went public in late October of 2010 and raised $211M through the IPO to fuel its commercialization effort. The stock reached a high of $16.97 within days and never rose above that; it is trading close to $2 as of this writing. (Its market capitalization is currently less than its cash position, which is an indication of the stock market’s dim view of its future.) In 2008-2009 many industry ‘veterans’ from Affymetrix and Illumina moved to PacBio, and started to sell aggressively into the next-generation sequencing market.
In the meantime, the Genome Analyzer IIx upped the throughput on the order of 10-fold, going from 3Gb in 5 days to >30Gb (now at >60Gb with density and 125 base-pair readlength improvements), and the HiSeq 2000 was released to the market in 2010 (throughput per run going from 200Gb initially to 300Gb to now 600Gb). SOLiD went through four iterations, from 2 to 3 to 3 Plus to 4 to the 5500xl (and is now on the verge of another iteration, called Wildfire, an upgrade to the existing 5500xl systems). The per-Gb cost has plummeted concurrently, and on the HiSeq a human genome’s sequencing cost is on the order of $3,300 (this is only reagents, not including the ‘burdened’ cost of labor, amortization of the instrument, and other overhead). The 5500xl-w (Wildfire) will lower the cost of a human genome to $2,200 (again only reagents), and the Ion Torrent Proton (with the Proton II chip set to release in the Spring of 2013) will lower that human genome cost again to about $1,500 or even less. (To be clear, Ion Torrent has promised the ‘$1,000 genome’ but not stated explicitly what that coverage will be, so in the interest of consistency I’ve assumed a 90Gb, 30x coverage human genome in the above calculations.)
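To make the per-Gb comparison concrete, the reagent-only genome costs above can be converted to a cost per gigabase. This is a rough sketch under the same 90Gb-per-genome assumption stated above; the dictionary and function names are hypothetical.

```python
# Assumption carried over from the post: one human genome = 90 Gb (30x coverage).
GENOME_GB = 90.0

# Reagent-only cost per genome quoted in the post, in USD.
reagent_cost_per_genome = {
    "HiSeq 2000": 3300.0,
    "5500xl-w (Wildfire)": 2200.0,
    "Ion Torrent Proton (Proton II chip)": 1500.0,
}


def cost_per_gb(cost_per_genome: float, genome_gb: float = GENOME_GB) -> float:
    """Reagent cost per gigabase, given the cost of one genome."""
    return cost_per_genome / genome_gb


for platform, cost in reagent_cost_per_genome.items():
    print(f"{platform}: ${cost_per_gb(cost):.2f} per Gb")
```

All three platforms land in the tens of dollars per Gb, which is the yardstick worth keeping in mind for the PacBio numbers later on.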
Getting back to PacBio, it is in this context that their instrument was launched, first with an early access program that started in 2010 with 10 customers, and then in 2011 with their commercial system. The system uses a consumable flow cell, and the basic premise of the single molecule sequencing technology is to anchor a single polymerase molecule at the bottom of a tiny micromachined well, called a Zero Mode Waveguide, or ZMW. These wells are a ‘nanophotonic confinement’ structure, whose dimensions are on the order of ~70nm by ~100nm, and whose volume is measured in zeptoliters (the SI prefix zepto- denotes 10^-21, so a zeptoliter is 10^-21 liters); as such they are smaller than the wavelength of light, which is the basis of how the technology works. A strand of DNA is threaded to the polymerase (more about that later), the polymerase synthesizes its complement with fluorescently labeled nucleotides, and as the nucleotides are incorporated an individual color is detected. As each base is added to the growing strand, its fluor is cleaved off along with the pyrophosphate group, and the fluorescence is detected as it occurs, “in real time”, in contrast to the ‘extend and take an image’ process of the existing second-generation sequencing methods.
Instead of a CCD camera that takes pictures, here there is a CCD camera that takes video as the DNA gets polymerized. Each feature will incorporate nucleotides at its own independent rate, and of course data collection speed is a problem that needs to be addressed. Using principles such as evanescent wave proximity effects, these single-molecule events are observed readily.
From a technical perspective what PacBio has achieved is remarkable. Taking a modified polymerase (to handle the fluorescently-labeled nucleotide substrate), attaching it covalently to the bottom of thousands of wells (the early access ZMWs had 3,000, the current commercial ones have 75,000), imaging and collecting the data, and getting discrete signals to discriminate individual bases is nothing short of technological magic.
Yet there have been several limitations. The instrument is expensive (on the order of $700,000) and very large (1800lbs, requiring in some cases modifications to a customer’s building, and I heard one story of an elevator needing to be changed to a larger one in a new building just so it could be transported to an upper floor). The throughput is limited, on the order of 90 Mb per run, and newer chemistry reportedly has higher yields of about double that. One of their early-access customers is a friend who tells me (somewhat sadly, I might add, due to higher expectations) that their system is operational only about 50% of the time. (‘Down-time’ is a fact of life with complex optical systems, which Ion Torrent nicely avoids due to its simplicity, and 10% to 20% downtime is not unusual; however, 50% isn’t good by any measure.) On a cost per Gb measurement, it is very expensive: even with the relatively low running cost of about $100 per run, at 90 Mb per run a 90Gb genome takes on the order of 1,000 runs, which equates to a $100K genome (compared to the $3.3K, $2.2K and $1.5K numbers above).
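The $100K figure follows directly from the run yield and run cost. Here is a small sketch of that calculation, assuming the same 90Gb genome as before; the doubled-yield case further assumes the per-run cost stays at about $100, which is my assumption and not a quoted number.

```python
GENOME_GB = 90.0          # 30x human genome, same assumption as earlier in the post
RUN_YIELD_MB = 90.0       # approximate PacBio yield per run, in Mb
COST_PER_RUN_USD = 100.0  # approximate running cost per run


def genome_cost(run_yield_mb: float, cost_per_run: float,
                genome_gb: float = GENOME_GB) -> tuple[float, float]:
    """Return (number of runs, total run cost in USD) to reach genome_gb of data."""
    runs = genome_gb * 1000.0 / run_yield_mb  # convert Gb to Mb, then divide by yield
    return runs, runs * cost_per_run


runs, cost = genome_cost(RUN_YIELD_MB, COST_PER_RUN_USD)
print(f"Current chemistry: {runs:,.0f} runs, ${cost:,.0f} per genome")

# Hypothetical: newer chemistry with double the yield, per-run cost assumed unchanged.
runs2, cost2 = genome_cost(2 * RUN_YIELD_MB, COST_PER_RUN_USD)
print(f"Doubled yield:     {runs2:,.0f} runs, ${cost2:,.0f} per genome")
```

Even with the yield doubled, the total stays more than an order of magnitude above the reagent costs of the second-generation platforms.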
Another limitation is the input amount (500ng of genomic DNA), which for a single-molecule technology seems incongruent, except for the fact that adapters need to be ligated to the ends of the individual single molecules, and such processes just don’t work very well with very small sample inputs. Perhaps the most important limitation of the Pacific Biosciences technology is its error rate, on the order of 15%.
When discussing accuracy, the details can get arcane, but suffice it for now to state that current next generation sequencing platforms have an error rate on the order of 1%, and each platform will have its unique biases and error profile. When SOLiD came out with its two-base encoding, promising much higher accuracy (on the order of 0.06% error), the market rewarded other attributes, mainly lower cost per gigabase, and higher throughput per day or per run. (Ease of use and ease of informatics also played important roles in the decisions made by customers.)
So on the dimensions of lower cost and higher throughput, PacBio comes out behind, and on ease of use and ease of informatics it would, for the sake of discussion, be on par with the other platforms on the market. But with an error rate over ten-fold higher, the improvement in readlength is its only strength.
The research community has recently shown the use of long readlengths (the PacBio system now has an average readlength on the order of 2.5kb, compared to 50-125bp for the next-gen systems) in de novo assembly, using the more accurate short reads from the existing next-gen systems to correct errors in the long PacBio reads. So researchers will find utility there. But for the majority of applications, the throughput and cost are prohibitive, in addition to the accuracy problem. That holds for clinical targeted sequencing (where accuracy will trump the long readlengths), whole-exome sequencing (where the throughput and cost are simply prohibitive), and any other ‘typical’ next-gen application (such as RNA-Seq or ChIP-Seq).
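To see why the short reads are needed for correction, it helps to estimate how many errors a single read carries. The sketch below simply multiplies read length by the per-base error rates quoted earlier (about 15% and about 1%); treating errors as uniform and independent along the read is a simplifying assumption of mine, and the names are illustrative.

```python
def expected_errors(read_length_bp: int, error_rate: float) -> float:
    """Expected number of erroneous bases in one read, assuming a uniform per-base rate."""
    return read_length_bp * error_rate


# (read length in bp, approximate per-base error rate) from the figures in the post.
reads = {
    "PacBio long read (2.5 kb, ~15% error)": (2500, 0.15),
    "Next-gen short read (125 bp, ~1% error)": (125, 0.01),
}

for label, (length, rate) in reads.items():
    print(f"{label}: ~{expected_errors(length, rate):.2f} expected errors per read")
```

A long read with hundreds of expected errors is still useful as a scaffold, but the base calls themselves need support from the more accurate short reads.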
Now with a new CEO and new commercial leadership, they are hard at work repositioning the company to sell novel applications. A recent release of software for the direct detection of 5meCytosine (called the ‘5th base’ for its role in epigenetic inheritance) points to one application that could fill an important research need: looking at CpG islands and shores in native genomic DNA, capturing the methylation status of promoter and enhancer regions directly, instead of relying on the current methods of bisulfite treatment and sequencing, which pose their own informatics challenges. Other novel applications can rightly be described as niche applications. Investigation into modified bases is currently a small if not tiny market, not because these modified bases are unimportant (how important they are to genome biology is currently not clear), but rather because of the lack of tools to adequately measure them.
At one time the Pacific Biosciences third-generation system was described in the press as a ‘Hubble Telescope’ to view unseen vistas of the genome, but now it is something more like a specialized tool for specialized applications. So instead of an amazing comprehensive tool to examine the mysterious depths, it is a niche tool to look at a narrow set of applications. Meanwhile the second-generation systems continue their progress, and existing PacBio customers patiently wait for new uses to emerge.