DNA-based data storage appears to offer solutions to some of the problems created by humanity’s ever-growing capacity to create data we want to hang on to. Compared to most other media, DNA offers phenomenal data densities. If stored in the right conditions, it doesn’t require any energy to maintain the data for centuries. And due to DNA’s centrality to biology, we’re always likely to maintain the ability to read it.
But DNA is not without its downsides. Right now, there’s no standard method of encoding bits in the pattern of bases of a DNA strand. Synthesizing specific sequences remains expensive. And accessing the data using current methods is slow and depletes the DNA being used for storage. Try to access the data too many times and you have to restore it in some way—a process that risks introducing errors.
A team based at MIT and the Broad Institute has decided to tackle some of these issues. In the process, the researchers have created a DNA-based image-storage system that is somewhere between a file system and a metadata-based database.
Recent systems for storing data in DNA (such as one we’ve covered) involve adding specific sequence tags to the stretches of DNA that contain data. To get the data you want, you simply add bits of DNA that can base-pair with the right tags and use them to amplify the full sequence. Think of it like tagging every image in a collection with an ID, then setting things up so that only one specific ID gets amplified.
This method is effective, but it’s limited in two ways. For one, the amplification step, done using a process called PCR, has limits on the size of the sequence that can be amplified. And each tag takes up some of that limited space, so adding more detailed tags (as might be needed for a complicated file system) cuts into the amount of space for data.
The other limit is that the PCR reaction that amplifies specific pieces of data-containing DNA consumes some of the original library of DNA. In other words, each time you pull out some data, you destroy piles of unrelated data. Access data often enough and you’ll end up burning through the entire repository. While there are ways to re-amplify everything, each time this is done, it increases the chance of introducing an error.
The new research separates the tag information from the data storage. In addition, the researchers created a system where it’s possible to access just the DNA data you’re interested in and leave the rest of the data untouched, giving the stored data greater longevity.
The technology relies on the fact that DNA will stick to silicon-dioxide glass beads. This attraction is independent of the size of the DNA, so you can store arbitrarily large chunks of data using this system (in this case, the fragments were over 10 times the size of the typical chunk of DNA data storage used in the past). Just as importantly, no tags were stored in the DNA alongside the data, so there was no competition between data storage and file system information.
Once the DNA was on the surface of these beads, the researchers polymerized some additional silicon dioxide on top of it. This process coated the DNA and protected it from the environment. Using a fluorescent tag, the researchers confirmed that the system was efficient; essentially, all of the particles created this way contained DNA.
Only once this shell was in place did the researchers add tags, which were chemically linked to the outer shell. The tags were made of single-stranded DNA, and it was possible to have several distinct tags attached to a single glass shell.
While the researchers handled the process separately for each block of data, once everything was in place, the tagged glass spheres could be mixed into a single data library. While not as compact as the storage of pure DNA, the library still has the advantages of being stable for the long term and requiring no energy for maintenance.
But the fun part is accessing data. The researchers stored a keyword-associated collection of images in the DNA, with each keyword encoded in the DNA attached to the exterior of the glass shell. To use their example, an image of an orange pet cat would be associated with the keywords “orange,” “cat,” and “domestic,” while an image of a tiger would just have “orange” and “cat.”
Because these tags were single-stranded, it was possible to design a matching sequence for each one that would base-pair with it to form a double helix. The matching sequences were linked to differently colored fluorescent molecules so that any glass shells carrying the right tags would start glowing specific colors. We already have machines that use lasers to separate things based on what color they glow (normally, the machines are used to sort fluorescently tagged cells). In this machine, a bead tagged “orange,” “cat,” and “domestic” would glow at different wavelengths than a bead tagged only “orange” and “cat,” so the house cat could be pulled out of the library.
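The matching-and-sorting idea can be sketched in a few lines of Python. This is only an illustration: the tag sequences, the color assignments, and the function names are all made up for the example, and a real sorter works on fluorescence intensities rather than exact set matches. The one piece of real biology is that a sequence pairs with its reverse complement.

```python
# Toy model of probe design and color-based sorting.
# All sequences, colors, and names below are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def probe_for(tag: str) -> str:
    """Return the reverse complement, i.e. the strand that base-pairs with `tag`."""
    return tag.translate(COMPLEMENT)[::-1]

# Made-up single-stranded tags for each keyword.
TAG_SEQS = {"orange": "ATGGCA", "cat": "GGTTAC", "domestic": "CATTGG"}

# One fluorescent color per probe (an assumption for illustration).
PROBES = {
    "red": probe_for(TAG_SEQS["orange"]),
    "green": probe_for(TAG_SEQS["cat"]),
    "blue": probe_for(TAG_SEQS["domestic"]),
}

def glow_colors(bead_tags: set) -> set:
    """Colors a bead shows once probes have paired with its surface tags."""
    return {color for color, probe in PROBES.items()
            if probe_for(probe) in bead_tags}

def sort_out(library: dict, wanted: set) -> list:
    """Keep only beads whose color signature matches `wanted` exactly."""
    return [name for name, tags in library.items()
            if glow_colors(tags) == wanted]

library = {
    "house_cat": {TAG_SEQS["orange"], TAG_SEQS["cat"], TAG_SEQS["domestic"]},
    "tiger": {TAG_SEQS["orange"], TAG_SEQS["cat"]},
}

print(sort_out(library, {"red", "green", "blue"}))  # the house cat bead
print(sort_out(library, {"red", "green"}))          # the tiger bead
```

In this toy model, the extra “domestic” tag is what gives the house cat bead a third color and lets it be separated from the tiger.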
The rest of the library would remain untouched, and so there’s no significant loss of data each time this process occurs. And, because the beads are denser than water, it’s easy to concentrate the data storage again simply by using a centrifuge to spin the unused portion of the library down to the bottom of a test tube.
Once the beads were isolated, a glass-etching solution was used to liberate the DNA, which could then be inserted into bacteria. The DNA used for storage was set up to allow bacteria to make lots of copies of it for reading the data.
DNA database—no, not that kind
One of the neat aspects of all of this is that it allows Boolean searches with multiple terms. By selecting for or against different tags one after the other, you can build up fairly complicated conditions: true for cat, false for domestic, true for black, and so on. Labeling two tags with the same fluorescent color would give you the equivalent of a logical OR if you grab anything with that color.
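A rough way to picture these searches is as a series of sorting rounds, each of which either keeps or discards the beads that glow. The Python sketch below assumes that model; the library contents and the `sort_round` and `query` helpers are hypothetical and not taken from the research.

```python
# Toy model of Boolean searches built from successive sorting rounds.
# Library contents and helper names are hypothetical.
from typing import Dict, List, Set, Tuple

Library = Dict[str, Set[str]]

def sort_round(library: Library, tags: Set[str], keep_glowing: bool) -> Library:
    """One pass through the sorter.

    Probes for every tag in `tags` share one color, so a bead glows if it
    carries ANY of them (a logical OR). `keep_glowing=False` selects
    against the tags instead (a logical NOT).
    """
    return {name: bead_tags for name, bead_tags in library.items()
            if bool(bead_tags & tags) == keep_glowing}

def query(library: Library, rounds: List[Tuple[Set[str], bool]]) -> List[str]:
    """Chain rounds one after the other; chaining acts as a logical AND."""
    for tags, keep_glowing in rounds:
        library = sort_round(library, tags, keep_glowing)
    return sorted(library)

library: Library = {
    "orange_house_cat": {"orange", "cat", "domestic"},
    "black_house_cat": {"black", "cat", "domestic"},
    "tiger": {"orange", "cat"},
    "panther": {"black", "cat"},
    "black_dog": {"black", "dog", "domestic"},
}

# true for cat, false for domestic, true for black
print(query(library, [({"cat"}, True), ({"domestic"}, False), ({"black"}, True)]))

# (orange OR black) AND cat AND domestic
print(query(library, [({"orange", "black"}, True), ({"cat"}, True), ({"domestic"}, True)]))
```

The first query leaves only the panther; the second leaves both house cats.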
Because each of these tags can be viewed as a piece of metadata about the image stored by the DNA, the collection of beads ends up acting as a metadata-driven image database.
While all of this represents a significant leap in complexity for DNA-based storage, it’s still, well, DNA-based storage. Which means it’s slow on a scale that makes even tape drives seem quick. The researchers calculate that, even if they crammed far more data into each glass bead, searches would start topping out at going through about 1GB of data a second. At that rate, searching a petabyte of data would take roughly 12 days.
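As a back-of-the-envelope check on the petabyte estimate, here is the arithmetic in Python. It assumes a steady 1GB per second with no overhead and decimal units (1PB = 1,000,000GB); both are simplifying assumptions, and the function name is just for illustration. Under them, the search comes to a little under 12 days, so on the order of two weeks.

```python
# Rough search-time estimate at the quoted rate.
# Assumptions: constant throughput, no overhead, decimal units.
SECONDS_PER_DAY = 24 * 60 * 60

def search_days(data_gb: float, rate_gb_per_s: float = 1.0) -> float:
    """Days needed to scan `data_gb` gigabytes at `rate_gb_per_s`."""
    return data_gb / rate_gb_per_s / SECONDS_PER_DAY

GB_PER_PB_DECIMAL = 1_000_000   # 10**6 GB in a petabyte
GB_PER_PB_BINARY = 1_048_576    # 2**20 GB, if binary units are meant

print(f"decimal petabyte: {search_days(GB_PER_PB_DECIMAL):.1f} days")
print(f"binary petabyte:  {search_days(GB_PER_PB_BINARY):.1f} days")
```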
And that’s just finding the right glass beads. Cracking them open and getting the DNA into bacteria, then doing the sequencing needed to actually determine what’s stored in the bead, would likely add a couple of days to the process.
But of course, nobody is suggesting that we use DNA storage because it’s quick; its good properties, as we mentioned up top, are in terms of energy use and data stability. We’d only store something in DNA if we’re convinced we won’t want to access it very often. Given that, any methods of making that access more functional and flexible are potentially valuable.