Artificial Intelligence in Genome Assembly

Written by: Mark Titleman

Genomic sequencing has long been an expensive, time-consuming and labor-intensive endeavor, and this has slowed related developments in science and medicine. Major advances have brought the cost of sequencing a single whole genome down to around $1000. That figure applies only to certain instances of personalized medicine, however; the pursuit of theoretical development in genetics requires a vast amount of data to be collected and analyzed, a feat for which artificial intelligence is well suited. The benefits of artificial intelligence to science, medicine and computing itself are not only aspects of market and technological self-correction; most interestingly, generative AI could in the ideal case make such self-correction automated. The relationship between the genomic sequencing process and developments in genetics parallels this, since the amount of collected and analyzed data needed for self-correction in genetics is truly enormous. Genomic sequencing experiences a substantial bottleneck at the assembly phase, when DNA fragments comprising different sequences with myriad variants are aligned to a reference genome for reconstruction. Faster genome assembly would thus benefit AI-based sequencing considerably, lowering overall labor and expenses, contributing to theoretical developments in genetics, and perhaps providing sufficient data for self-correction in the form of improved accuracy in variant calling and phenotypic expression studies.


Artificial intelligence is used for identifying genetic variants, predicting gene function and interaction, and genome-based medical classification, but in the pursuit of increased efficiency it also serves a vital role at every stage of the genomic sequencing that underpins these advances in analysis. AI can be trained to correct errors in the sequencing process, to scan enormous datasets in search of specific variations, and to perform genome assembly via computation-intensive pattern recognition, the stage of sequencing most susceptible to improvement via artificial intelligence. Genome assembly entails aligning many thousands of DNA sequences to reconstruct the genome, using algorithms to identify and link overlapping pairs of sequences. New types of processors called intelligence processing units (IPUs) or neural processing units (NPUs) have already been developed for the express purpose of accelerating AI by tailoring their architecture to machine learning workloads. These processors are capable of handling the enormous data and irregular computation associated with both perfectly aligning and mismatching sequences, expediting the assembly and sequencing processes on the way to potential self-correction in genetics, while new types of artificial intelligence offer similar promise.
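
The step of identifying and linking overlapping pairs of sequences, mentioned above, can be illustrated with a minimal Python sketch. It is a toy greedy merge over exact suffix-prefix overlaps, not the method of any particular assembler, and it ignores sequencing errors, reverse complements and repeats. The function names `overlap_length` and `greedy_assemble` and the `min_overlap` parameter are assumptions made for illustration.

```python
from itertools import permutations


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0  # no overlap of at least `min_overlap` bases


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_pair = 0, None
        # Compare every ordered pair; this is quadratic and only suits toy inputs.
        for left, right in permutations(contigs, 2):
            size = overlap_length(left, right, min_overlap)
            if size > best_size:
                best_size, best_pair = size, (left, right)
        if best_pair is None:
            break  # nothing left to link; remaining contigs stay separate
        left, right = best_pair
        contigs.remove(left)
        contigs.remove(right)
        contigs.append(left + right[best_size:])
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

The all-against-all comparison in the inner loop is the kind of irregular, data-heavy computation that grows quickly with the number of reads, which is why the overlap stage is a natural target for acceleration.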

Genome assembly is needed in sequencing because of the short DNA read lengths generated by sequencing platforms, the amount of data generated, the need to rectify errors introduced during sequencing, and, importantly, the need to identify and localize repetitive sequences. Sequencing data for assembly can come from single reads, which form ‘contigs’ via their overlapping regions, and from paired reads taken from both ends of a DNA fragment. A consensus sequence is assembled heuristically by algorithm, either by comparison to a reference genome or in some cases de novo. This process is error prone, particularly in de novo assembly, where errors in the order of the sequence reads as well as in the reads themselves can create profound inaccuracy.

Souza et al., 2019

The main AI alternative to heuristics in assembly has been machine learning. Reads are obtained from DNA sequencers and submitted to an ML-reliant error detection or correction process. For example, base frequency analysis calculates a custom frequency weight for every nucleotide, taking into account the frequency of each base at every position across all reads in the group being analyzed. Bases at positions whose weighted frequencies fall below a certain threshold are considered potentially erroneous and are further assessed as either normal variation or error. A post-assembly step verifies the entire genome sequence. Additionally, there is promise for improving the entire process by constructing cohesive assemblers that rely solely on machine learning techniques.
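
The base frequency idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reads in a group are taken to be already aligned and of equal length, a plain relative frequency per position stands in for the custom weight (whose exact form is not given here), and the function name `flag_low_frequency_bases` and the `threshold` value of 0.2 are hypothetical.

```python
from collections import Counter


def flag_low_frequency_bases(reads: list[str], threshold: float = 0.2):
    """Return (read_index, position, base, frequency) for each rare base."""
    if not reads:
        return []
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch assumes pre-aligned reads of equal length")

    flagged = []
    for position in range(length):
        # All bases observed at this position across the group of reads.
        column = [read[position] for read in reads]
        counts = Counter(column)
        for read_index, base in enumerate(column):
            frequency = counts[base] / len(column)
            if frequency < threshold:
                # Candidate only: could be a sequencing error or a true variant.
                flagged.append((read_index, position, base, frequency))
    return flagged


if __name__ == "__main__":
    group = ["ACGTAC"] * 5 + ["ACGAAC"]
    for read_index, position, base, frequency in flag_low_frequency_bases(group):
        print(read_index, position, base, round(frequency, 3))
```

A flagged base is only a candidate. Deciding whether it is normal variation or an error is the separate assessment step described in the text, and that is where a trained model would come in.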

Generative artificial intelligence promises the next major leap forward in genome assembly strategies, particularly for de novo assembly, in that it can generate new data in response to prompts. Enormous data collection and analysis combined with generative AI could not only lead to automated self-correction in all fields but, in genetics specifically, promises self-correction, the prospect of eventual complete accuracy, and application to variant calling and phenotypic expression studies. Generative AI can also in specific instances be used to design custom DNA probes that target regions of interest in a genome, or even to let researchers virtually explore the effects of mutations. While machine learning analyzes sequence patterns and identifies repetitive elements to construct a complete genome sequence, generative AI may achieve similar accuracy with far less input owing to its use of the transformer architecture: input is split into tokens, each token is converted to a numerical vector, and each vector is compared, within the context window, with the other unmasked tokens to separate the signal from the noise. This highly efficient form of computation relies not on passive pattern recognition but on actively seeking patterns, and is of utmost use in de novo assembly, where access to data is limited.
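
The token, vector and comparison steps described above can be made concrete with a small Python sketch. It rests on several assumptions: DNA is tokenized into overlapping k-mers (one common choice, not something the text specifies), fixed random vectors stand in for learned embeddings, and the comparison is a single scaled dot-product softmax without the learned query, key and value projections a real transformer uses. The names `kmer_tokens`, `embed` and `attention_weights` are hypothetical.

```python
import math
import random


def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA string into overlapping k-mers, one token per start position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def embed(tokens: list[str], dim: int = 4, seed: int = 0) -> list[list[float]]:
    """Map each distinct token to a fixed random vector (stand-in for learned embeddings)."""
    rng = random.Random(seed)
    table: dict[str, list[float]] = {}
    for token in tokens:
        if token not in table:
            table[token] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return [table[token] for token in tokens]


def attention_weights(vectors, query_index, masked=frozenset()):
    """Softmax of scaled dot products between one token and all unmasked tokens."""
    query = vectors[query_index]
    scale = math.sqrt(len(query))
    scores = {
        i: sum(q * v for q, v in zip(query, vec)) / scale
        for i, vec in enumerate(vectors)
        if i not in masked  # masked tokens are excluded from the comparison
    }
    if not scores:
        return {}
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {i: math.exp(s - peak) for i, s in scores.items()}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


if __name__ == "__main__":
    tokens = kmer_tokens("ACGTACGT", k=3)
    weights = attention_weights(embed(tokens), query_index=0, masked={5})
    for index, weight in weights.items():
        print(index, tokens[index], round(weight, 3))
```

Identical k-mers receive identical vectors and therefore identical weights, which hints at how repeated sequence content becomes visible to such a model within its context window.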

These advances in artificial intelligence applied to genome assembly will shorten the time needed for genomic sequencing, improving data collection and analysis for scientific and medical development. Generative AI in particular will allow sufficient data to be collected for self-correction in all fields, including genetics through sequencing and expedient assembly, with eventual improvements in medicine and biology in the form of variant calling and phenotypic expression studies. Advances in artificial intelligence and efficiency in the stages of genetic sequencing go hand in hand because of the enormous data they require, with tremendous forthcoming benefits to science and medicine.

Bibliography

  1. “Ai in Genome Sequencing - Transforming Life Sciences.” Kodexo Labs, 6 May 2024, kodexolabs.com/ai-in-genome-sequencing/.

  2. Padovani de Souza, Kleber, et al. “Machine learning meets genome assembly.” Briefings in Bioinformatics, vol. 20, no. 6, 17 Aug. 2018, pp. 2116–2129, https://doi.org/10.1093/bib/bby072.

  3. Vrček, Lovro, et al. Geometric Deep Learning Framework for de Novo Genome Assembly, 13 Mar. 2024, https://doi.org/10.1101/2024.03.11.584353.

  4. Waldron, Patricia, et al. “Processor Made for AI Speeds up Genome Assembly.” Cornell Chronicle, 1 Nov. 2023, news.cornell.edu/stories/2023/11/processor-made-ai-speeds-genome-assembly.
