DNA, Nature’s Best Storage Medium, Headed for the Data Center
Inside Microsoft’s effort to solve the world’s data storage capacity problem.
The continued growth in information we’re trying to store (from IoT sensor data to log files and photos) has already outpaced the capacity of some systems. CERN keeps on disk only a tenth of the 15PB of data it gets from the Large Hadron Collider each year.
For many organizations, capacity may not be such a large problem; hard drive technology continues to improve, and much of the world’s data is still stored on tape. The storage issue we haven’t yet tackled is longevity – and that’s where storing data on artificial DNA may really shine.
A smear of DNA small enough to scrape up with your fingernail could store exabytes of information, but it’s the fact that it remains readable after thousands of years that really makes it interesting. Paper and microfilm can last 500 years or more, but digital media are hard to keep for even a few decades. Accelerated testing at higher temperatures shows that DNA will stay readable for 2,000 years if it’s stored at ten degrees centigrade (and for up to 2 million years if it’s frozen); encapsulating it in spheres of silica means that humidity doesn’t affect it.
The format won’t get out of date like digital storage either. “We'll always be interested in reading DNA so we can be sure we'll always have the capability of reading it in the future -- because if we don't we'll have a real problem,” Karin Strauss, senior researcher in computer architecture at Microsoft Research and associate professor at the Department of Computer Science and Engineering at University of Washington, told Data Center Knowledge.
In the lab, researchers have been able to write and read text, photos, videos, and other files with 100 percent accuracy, and last year Microsoft bought ten million DNA molecules from Twist Bioscience to experiment with. But what does it take to turn that research into a real storage system, and when might you think about putting one in your data center?
Storing data in DNA means turning the bits in a file into the four bases in DNA -- mapping 00 to A, 01 to C, 10 to G, and 11 to T every time -- then synthesizing DNA molecules with those bases in the right order. Reading it means putting those molecules in a DNA sequencer, reading out the sequence of bases, and turning that back into bits. Today, there are some manual steps in that process, Strauss explained.
“There's software to do the first step of translating the bits to what bases we want; the next step is manufacturing the molecules. There’s a manual interface there, because we send Twist the file, and we get back the molecules; internally they have an automated process but they still need somebody to remove the DNA from the machine and ship us the molecules. The sequencers are all automated; you throw the molecules in, and it spits out the data. And then we have the rest of the data pipeline to decode the data.”
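As a rough illustration of the software step, the two-bit mapping described above (00 to A, 01 to C, 10 to G, 11 to T) can be sketched in a few lines of Python. The function names `encode` and `decode` are made up for this example, and this is not Microsoft's actual pipeline; a production encoder would presumably layer error correction and other constraints on top of the plain mapping.

```python
# Two-bit-per-base mapping as described in the article: 00->A, 01->C, 10->G, 11->T.
# Illustrative sketch only; names and structure are assumptions, not the real pipeline.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of bases, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(bases: str) -> bytes:
    """Turn a string of bases back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in bases)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)          # CAGACGGC
    print(decode(strand))  # b'Hi'
```

Each byte becomes four bases, so the two bytes of "Hi" turn into an eight-base strand, and decoding simply reverses the lookup.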
Microsoft and Twist are working with the University of Washington to turn that into a fully automated system. Strauss predicted the end result would be something that looked like a tape library, complete with the same kind of robotic arm (and maybe with cartridges of reagents you change like toner in a laser printer). Depending on how much parallelism you need – which comes down to how much data you want to write or read at the same time – “that’s likely to look like a few racks in the data center,” she said.
Small as the DNA itself is, you can save more space by encapsulating more than one file in the same silica shell, which means chemically separating the DNA to get the file you want. Because sequencing is a batch process, you’re going to be reading back multiple files on the same sequencer anyway. Files are also encoded on multiple sequences of DNA, so the sequences are clustered together to get the full result. There’s a sequence number on each molecule; think of it like numbering the different sections that make up a large ZIP archive.
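The sequence-number idea can be sketched as follows. This is a simplified model under stated assumptions: each "molecule" is just an (index, chunk) pair, `CHUNK_SIZE` is an arbitrary demo value, and the helper names are hypothetical. Reads come back in no particular order and with duplicates, so reassembly keeps one copy per index and sorts by index.

```python
import random

CHUNK_SIZE = 4  # hypothetical payload bytes per molecule, chosen only for the demo


def split_with_index(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[tuple[int, bytes]]:
    """Cut a file into sections and tag each one with its sequence number."""
    return [
        (i // chunk_size, data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def reassemble(reads: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the file from unordered, possibly duplicated reads."""
    by_index: dict[int, bytes] = {}
    for index, chunk in reads:
        by_index.setdefault(index, chunk)  # keep the first copy seen of each section
    return b"".join(by_index[i] for i in sorted(by_index))


if __name__ == "__main__":
    original = b"archival data kept for decades"
    molecules = split_with_index(original)
    reads = molecules * 3      # sequencing returns many copies of each molecule
    random.shuffle(reads)      # and in no particular order
    print(reassemble(reads) == original)
```

The sketch ignores read errors and missing sections; it only shows why a per-molecule index is enough to restore the original order.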
Reading DNA destroys it, but that’s only because that’s what the medical and biotech applications need. “When you sequence DNA, you don't want to reuse it, you don't want contamination; you just throw the whole thing away including all the reagents.” It would be possible to recover the DNA instead, but it's probably easier just to make more copies with the standard polymerase chain reaction, which is already used in the process to make sure you have enough copies of the different sequences to read; picking which sequences to copy gives you random access to the right section of a large file.
Those copies can introduce errors, so the system has error correction built in; in fact, that’s how it’s going to scale up from the megabytes that have been stored and decoded so far to the petabytes it needs to be able to deal with. “We are engineering the system, which allows us to cut some corners; we can tolerate more errors, which is what we're counting on to be able to improve these processes. We’ll make the processes more parallel, and they may become more imperfect, both the reading and the writing, but we can tolerate and compensate for that in other ways. We have control over the sequences, so we can encode the data in a way that can make it easier for us to decode them on the way out.”
The overhead of error correction is currently around 15 percent; “That's pretty manageable; ECC in servers is 12.5 percent, so this isn’t that far off.”
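To put those percentages side by side, here is a small back-of-the-envelope calculation. It assumes the 12.5 percent figure refers to the common ECC memory layout of 8 check bits for every 64 data bits, and that overhead is measured relative to the payload; both are assumptions for illustration, and the function name is made up.

```python
def raw_capacity(payload: float, overhead: float) -> float:
    """Raw storage needed when redundancy is `overhead` times the payload size."""
    return payload * (1 + overhead)


ECC_MEMORY_OVERHEAD = 8 / 64  # assumed layout: 8 check bits per 64 data bits = 12.5%
DNA_OVERHEAD = 0.15           # "around 15 percent", per the article

if __name__ == "__main__":
    payload_pb = 1.0  # one petabyte of user data
    print(f"DNA, 15% overhead:   {raw_capacity(payload_pb, DNA_OVERHEAD):.3f} PB")
    print(f"ECC, 12.5% overhead: {raw_capacity(payload_pb, ECC_MEMORY_OVERHEAD):.3f} PB")
```

On these assumptions, a petabyte of payload needs about 1.15PB of raw DNA capacity, against 1.125PB at the server-memory ratio.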
How Big and How Soon?
The costs of DNA sequencing and synthesis are dropping faster than the price of digital media, especially when you factor in needing to rewrite tapes every five to ten years, but DNA storage is still going to make sense only when you need to store data for a long time rather than a few years. Cloud providers will be interested, but so will some organizations that run their own data centers.
“The type of workload is definitely archival, at least at first,” Strauss said. “The type of users we've been seeing where this would make sense are where you need to keep the data by mandate, like hospitals and clinics, or there's legal data, pension data. They’re applications where you want to keep the data for a long time and put it away and not read it very repetitively. In the end, it’s bits you’re storing, and we can store any kind of bits.”
Video archiving is also a good fit, and even the way people look at photos on Facebook fits the model pretty well; every Halloween enough people look back at photos from the previous year for Facebook to spot the pattern. “That’s a perfect workload, because you could read them in advance, and by the time a user wants to look at it, it's already there.”
Currently the throughput of reading DNA isn’t that high. Of the two systems Strauss has worked on, one produces around 2 million reads in 24 hours (with most of the reads done in the first few hours), while the other, more parallel system delivers around 400 million reads in 24 hours. But the density means you could get excellent bandwidth at a very low cost if you need to send the data a long distance, because you could fit an exabyte on something the size of a postcard.
“People ship hard drives today; in the future it might be DNA. You have trucks and planes moving hard drives around; it’s done that way because you get better throughput. With DNA you can expect it to be even better, because it’s a lot more compact, and you can easily make copies for distribution.”
If customers are interested, Strauss suggested we could see DNA storage in action relatively soon. “We think there is a good roadmap to getting this into operation, and we see no fundamental reasons why we wouldn’t be able to put it together. It's not going to be next year, but it's not going to be in ten years either; I think it will be somewhere in between.”