Artificial Intelligence in Genome Assembly
Written by: Mark Titleman
Genomic sequencing has long been an expensive, time-consuming and labor-intensive endeavor
that has stalled related developments in science and medicine. Major advances have brought the
cost of sequencing a single whole genome down to around $1000. This is only for certain
instances of personalized medicine, however; the pursuit of theoretical development in genetics
requires a vast amount of data to be collected and analyzed – a feat for which artificial intelligence is
thankfully well suited. The benefits of artificial intelligence to science, medicine and computing itself are
not only aspects of market and technological self-correction; most interestingly, generative AI could in the
most ideal sense make all such self-correction automated. The connection between the genomic sequencing
process and developments in genetics parallels this, since the amount of collected and analyzed data
sufficient for self-correction in genetics is truly enormous. The process of
genomic sequencing experiences a substantial bottleneck at the assembly phase, when DNA fragments
comprising different sequences with myriad variants are aligned to a reference genome for
reconstruction. Faster genome assembly would thus benefit AI-based sequencing most directly,
lowering overall labor and expenses, contributing to theoretical developments in genetics,
and perhaps providing sufficient data for self-correction in the form of improved accuracy in variant
calling and phenotypic expression studies.
While artificial intelligence is used for identifying genetic variants, predicting gene function and
interaction, and genome-based medical classification, in the pursuit of increased efficiency it serves a
vital role at every stage of the genomic sequencing underpinning these advances in analysis. AI can be
trained to correct errors in the sequencing process, scan enormous datasets in search of specific
variations, and perform genome assembly via computation-intensive pattern recognition, the stage of
sequencing most susceptible to improvement via artificial intelligence. Genome assembly entails
aligning many thousands of DNA sequences for reconstruction of the genome, using algorithms to
identify and link overlapping pairs of sequences. New types of processors used in artificial intelligence,
called intelligence processing units (IPUs) or neural processing units (NPUs), have already been developed
for the express purpose of accelerating AI, with architectures designed for machine learning computation.
These types of processors are capable of handling the enormous data and irregular computation associated
with both perfectly aligned and mismatched sequences, expediting the assembly and sequencing
processes toward potentially attaining self-correction in genetics, while new types of artificial intelligence
offer similar promise.
Genome assembly is needed in sequencing because of the short DNA read lengths generated by the
sequencing platforms, the amount of data generated, the need to rectify errors made during sequencing,
and, importantly, the need to identify and localize repetitive sequences. Single reads forming ‘contigs’
via overlapping regions and paired reads from both ends of a DNA fragment can provide sequencing data
for assembly. A consensus sequence is assembled heuristically by algorithm, either via comparison to a
reference genome or in some cases de novo. This process is error-prone, particularly in de novo
assembly, where errors in the order of the sequence reads as well as in the reads themselves can create
profound inaccuracy.
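The way overlapping reads are linked into a contig can be illustrated with a small sketch. It is only a toy greedy heuristic under stated assumptions: the reads are error-free, all come from the same strand, and the function names and the `min_len` cutoff are hypothetical choices made for illustration. Production assemblers rely on overlap or de Bruijn graphs and must tolerate sequencing errors and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` (a hypothetical cutoff) are ignored.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Returns the list of contigs left once no overlap of at least
    `min_len` bases remains.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        # Join the two sequences, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
    print(greedy_assemble(reads))
```

Because the greedy choice always takes the largest overlap first, repetitive regions can mislead it into joining reads that do not belong together, which is one reason heuristic assembly is error-prone.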
The main AI solution to the use of heuristics in assembly has been machine learning. Reads are
obtained from DNA sequencers and submitted to an ML-reliant error detection or correction process.
For example, base frequency analysis calculates a custom frequency weight for every
nucleotide from the frequency of each base at each position across all reads in the group being
analyzed. Bases at positions whose weighted frequencies fall below a certain threshold are
considered potentially erroneous and are further assessed as either normal variation or error. A
post-assembly step verifies the entire genome sequence. Additionally, there is promise for improvements in
the entire process through the construction of cohesive assemblers reliant solely on machine learning
techniques.
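The thresholding idea behind base frequency analysis can be sketched in a few lines. This is a simplified illustration rather than the published method: it assumes the reads in a group are already aligned and of equal length, it uses the plain per-position frequency in place of a custom weight, and the function name and the 0.2 threshold are hypothetical.

```python
from collections import Counter


def flag_low_frequency_bases(reads, threshold=0.2):
    """Flag bases that are rare at their position within a group of reads.

    Returns (read_index, position, base, frequency) tuples for every base
    whose frequency at that position is below `threshold`. A flagged base
    is only a candidate: it may be a true variant or a sequencing error.
    """
    if not reads:
        return []
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("this sketch assumes equal-length, pre-aligned reads")

    flagged = []
    for pos in range(length):
        column = [r[pos] for r in reads]      # all bases seen at this position
        counts = Counter(column)
        for read_idx, base in enumerate(column):
            freq = counts[base] / len(column)
            if freq < threshold:
                flagged.append((read_idx, pos, base, freq))
    return flagged


if __name__ == "__main__":
    group = ["ACGTAC"] * 5 + ["ACCTAC"]       # last read differs at position 2
    for read_idx, pos, base, freq in flag_low_frequency_bases(group):
        print(f"read {read_idx}, position {pos}: {base} (frequency {freq:.2f})")
```

A machine learning step would then take over where this sketch stops, deciding whether each flagged base is normal variation or an error.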
Generative artificial intelligence promises the next major leap forward in genome assembly
strategies, particularly for cases of de novo sequencing, in that it can generate new data in response to
prompts. Enormous data collection and analysis combined with generative AI could not only lead to automated
self-correction in all fields, but in the field of genetics specifically promises self-correction, eventual
complete accuracy, and application towards variant calling and phenotypic expression studies. Generative AI can
also in specific instances be used to design custom DNA probes to target regions of interest in a genome
or even to allow researchers to virtually explore the effects of mutations. While machine learning
analyzes sequence patterns and identifies repetitive elements to construct a complete genome
sequence, generative AI may achieve similar accuracy with far less input owing to its use of the transformer
architecture: input is split into tokens, each token is mapped to a numerical vector, and that vector is
compared, within the context window, with the vectors of other unmasked tokens to filter the signal from the
noise. This form of highly efficient computation relies not on pattern recognition alone but on actively seeking
patterns, and is of utmost use in de novo sequencing where access to data is limited.
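The token-to-vector comparison described above can be made concrete with a minimal sketch of masked attention. Several assumptions are made purely for illustration: the sequence is tokenized into non-overlapping 3-mers, the vectors are random rather than learned, and there is a single attention step with no learned projections. The function names are hypothetical and do not correspond to any particular genomic model.

```python
import math
import random


def kmer_tokens(seq, k=3):
    """Split a DNA string into non-overlapping k-mers (one possible tokenization)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def embed(tokens, dim=4, seed=0):
    """Map each distinct token to a random vector; a real model learns these."""
    rng = random.Random(seed)
    table = {}
    for t in tokens:
        if t not in table:
            table[t] = [rng.uniform(-1, 1) for _ in range(dim)]
    return [table[t] for t in tokens]


def attention_weights(vectors, mask):
    """Scaled dot-product attention weights over the unmasked tokens.

    `mask[j]` is True when token j is visible. Masked tokens get weight 0.
    """
    if not any(mask):
        raise ValueError("at least one token must be unmasked")
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if visible else -math.inf
            for key, visible in zip(vectors, mask)
        ]
        peak = max(scores)                     # subtract for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    tokens = kmer_tokens("ATGGCGTAC")          # three 3-mer tokens
    vectors = embed(tokens)
    mask = [True, True, False]                 # hide the last token
    for token, row in zip(tokens, attention_weights(vectors, mask)):
        print(token, [round(w, 3) for w in row])
```

Each row of weights shows how strongly one token attends to every visible token in the context window, and the masked token always receives a weight of zero.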
These advances in artificial intelligence applied to genome assembly will shorten the time
needed for genomic sequencing, improving data collection and analysis for scientific and medical
development. Generative AI in particular will allow sufficient data to be collected for self-correction in
all fields, including genetics through sequencing and expedient assembly, with eventual improvements
in medicine and biology in the form of variant calling and phenotypic expression studies. Advances in
artificial intelligence and efficiency in the stages of genetic sequencing go hand in hand due to the
enormous data they require, with tremendous forthcoming benefits to science and medicine.
Bibliography
“Ai in Genome Sequencing - Transforming Life Sciences.” Kodexo Labs, 6 May 2024, kodexolabs.com/ai-in-genome-sequencing/.
Padovani de Souza, Kleber, et al. “Machine learning meets genome assembly.” Briefings in Bioinformatics, vol. 20, no. 6, 17 Aug. 2018, pp. 2116–2129, https://doi.org/10.1093/bib/bby072.
Vrček, Lovro, et al. “Geometric Deep Learning Framework for de Novo Genome Assembly.” 13 Mar. 2024, https://doi.org/10.1101/2024.03.11.584353.
Waldron, Patricia, et al. “Processor Made for AI Speeds up Genome Assembly.” Cornell Chronicle, 1 Nov. 2023, news.cornell.edu/stories/2023/11/processor-made-ai-speeds-genome-assembly.