DNA has stored ‘data’ for millions of years in the form of its four bases: A, T, G, and C. Synthetic attempts to recreate this enormous data-storing capability have promised comparable storage density and stability, but have yet to compete with current optical and magnetic storage devices because of high error rates and high costs. A novel approach has been found that stores the data in the backbone of native DNA rather than in the bases themselves; this greatly reduces costs while cutting the error rate to zero. The new technique is also capable of bitwise random-access memory, opening the door to the world of molecular computing.
In nature, DNA is found as a double-helix ladder comprising billions of molecular building blocks known as bases (A, T, G, and C) that encode the instructions for proteins. The closely packed nature of these nucleotides enables DNA to encode huge quantities of data; one gram of DNA is capable of holding 455 exabytes of data [1] (1 exabyte = 10¹⁸ bytes), enough storage for all of the data produced by every major tech company…with room to spare.
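As a rough sanity check on the quoted density, the short sketch below reproduces a figure close to 455 exabytes per gram; the ~330 g/mol nucleotide mass and 2 bits per base are assumptions for illustration, not values taken from the cited work.

```python
# Back-of-envelope check of the quoted storage density (a sketch, not the
# calculation from the cited work). Assumes single-stranded DNA, an average
# nucleotide mass of ~330 g/mol, and 2 bits encoded per base.
AVOGADRO = 6.022e23      # molecules per mole
MASS_PER_NT = 330.0      # g/mol, approximate average for one nucleotide
BITS_PER_BASE = 2        # four bases -> log2(4) = 2 bits

nucleotides_per_gram = AVOGADRO / MASS_PER_NT
bytes_per_gram = nucleotides_per_gram * BITS_PER_BASE / 8
print(f"~{bytes_per_gram / 1e18:.0f} exabytes per gram")   # ~456 EB
```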
Traditional data storage methods are reaching their physical limits, with hard drives restricted to around 1 terabyte (10¹² bytes) per square inch [2]. With the exponential growth of big data, alternative storage media such as DNA are starting to be explored, made possible by the huge advances in DNA sequencing technology.
Current approaches to DNA synthesis-based storage [2–4] assign segments of binary code to specific nucleotide sequences; these are then synthesised in vitro, using enzymes to bind the nucleotides together and iterating through many cycles to form the string of DNA. The information encoded in the DNA can then be retrieved via next-generation sequencing (NGS) or third-generation nanopore sequencing. Nanopore sequencing entails feeding the string of DNA through a tiny hole, much like a thread going through a needle, reading each nucleotide as it passes through. Advances in sequencing technologies allow large sequences to be read at relatively low cost; the first human genome cost ≈ £2.07 billion to sequence [2], whereas sequencing a full genome now costs around £1000 [2].
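To make the binary-to-nucleotide mapping concrete, the toy sketch below pairs two bits with each base (00→A, 01→C, 10→G, 11→T); this particular mapping is invented for illustration and is not the encoding used in the cited studies.

```python
# Toy binary-to-nucleotide mapping (invented for illustration only).
ENCODE = {"00": "A", "01": "C", "10": "G", "11": "T"}
DECODE = {base: bits for bits, base in ENCODE.items()}

def to_dna(data: bytes) -> str:
    """Convert raw bytes into a nucleotide string, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(ENCODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def from_dna(seq: str) -> bytes:
    """Recover the original bytes from the nucleotide string."""
    bits = "".join(DECODE[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

payload = b"Hi"
strand = to_dna(payload)               # 'CAGACGGC'
assert from_dna(strand) == payload
```

In a synthesis-based system, the string returned by a function like `to_dna` would be what the synthesiser builds base by base, and sequencing would recover it for decoding.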
However, despite the promise of synthesis-based DNA storage, issues arise during the synthesis step. The first is that adding one nucleotide per cycle means a full sequence takes hours to complete, making the process very slow and expensive compared with traditional optical and magnetic writing techniques. Secondly, DNA synthesis is naturally a very error-prone process, with 1% of bases containing substitution (incorrect base attached) or indel (insertion or deletion of a base) errors [3]. This is a serious issue when dealing with large volumes of DNA, as errors become inevitable, and it can have dire repercussions for compressed file formats. Finally, DNA synthesis machinery is limited to creating relatively short sequences, reducing the amount of readout data per fragment [5].
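The sketch below gives a feel for why a 1% per-base error rate matters. The error model is a deliberately crude assumption, but it shows how a single indel shifts every downstream position, which is why indels are far more damaging to a decoded file than isolated substitutions.

```python
# Crude simulation of per-base synthesis errors: substitutions, insertions,
# and deletions at a combined rate of ~1% (the error model is an assumption).
import random

BASES = "ATGC"

def corrupt(seq: str, error_rate: float = 0.01) -> str:
    out = []
    for base in seq:
        if random.random() < error_rate:
            kind = random.choice(["sub", "ins", "del"])
            if kind == "sub":
                out.append(random.choice([b for b in BASES if b != base]))
            elif kind == "ins":
                out.extend([base, random.choice(BASES)])
            # "del": drop the base entirely
        else:
            out.append(base)
    return "".join(out)

strand = "".join(random.choice(BASES) for _ in range(10_000))
mismatches = sum(a != b for a, b in zip(strand, corrupt(strand)))
print(mismatches)   # one indel shifts the frame, so mismatches pile up downstream
```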
Tabatabaei et al. [5] propose a novel method of storing data in DNA via a topological ‘nicking’ approach instead of direct synthesis. Here, the data are stored not in the nucleotides themselves but in the sugar-phosphate backbone of the DNA. This allows a known sequence of native DNA to be used and mapped to a reference genome, bypassing the aforementioned issues associated with synthesis-based storage.
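As a purely conceptual illustration of the idea (the positions, spacing, and function names below are hypothetical, not the registration or enzymatic scheme used by Tabatabaei et al.), data could be recorded as the presence or absence of a nick at predetermined sites along a known reference sequence.

```python
# Conceptual sketch only: bits stored as the presence (1) or absence (0) of a
# nick at fixed positions along a known reference sequence. Positions, spacing,
# and function names are invented, not the scheme from the paper.
REFERENCE = "ATGCATTGCAGCTTAGCCGATAC" * 10        # any known native sequence
NICK_SITES = list(range(10, len(REFERENCE), 25))  # predetermined candidate sites

def write_nicks(bits: str) -> set:
    """Return the backbone positions that should carry a nick."""
    return {pos for pos, bit in zip(NICK_SITES, bits) if bit == "1"}

def read_nicks(nicked: set, n_bits: int) -> str:
    """Recover the bit-string by checking each candidate site for a nick."""
    return "".join("1" if NICK_SITES[i] in nicked else "0" for i in range(n_bits))

message = "10110010"
assert read_nicks(write_nicks(message), len(message)) == message
```

Because each bit occupies its own physical site on a known backbone, a single bit can in principle be read or rewritten without disturbing the rest, which is the basis of the bitwise random access mentioned above.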