Understanding the Fundamentals of Whole-Genome Assembly

With the recent acquisition of two advanced long-read sequencing platforms, the ONT PromethION and PacBio Revio, at my workplace, many researchers are now eager to dive into whole-genome sequencing for their species of interest. To help guide these efforts, I’ve compiled insights from my readings and research surveys on the essentials of whole-genome assembly. Here are the key points:

Reads and Coverage
- Aim for at least 15x accurate HiFi long-read coverage per haplotype. This provides enough depth for a high-confidence assembly.
- Adding approximately 5x ultra-long-read (ONT) coverage per haplotype can significantly improve assembly contiguity, as longer reads help span repetitive regions.
- Adding 30x Hi-C long-range data aids scaffolding and phasing.
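
To turn the per-haplotype targets above into a sequencing yield, multiply the haploid genome size by the coverage and by the ploidy. Here is a minimal sketch; the function name, the 3 Gb genome size, and the diploid default are illustrative assumptions of mine, not values from any particular project.

```python
def required_bases(haploid_genome_size_bp, coverage_per_haplotype, ploidy=2):
    """Total bases to sequence for a given per-haplotype coverage target.

    Assumes every haplotype is roughly haploid_genome_size_bp long.
    """
    return haploid_genome_size_bp * coverage_per_haplotype * ploidy


# Hypothetical example: a diploid species with a 3 Gb haploid genome.
genome_size = 3_000_000_000

# Per-haplotype coverage targets taken from the list above.
targets = {"HiFi": 15, "ONT ultra-long": 5}

for read_type, coverage in targets.items():
    gigabases = required_bases(genome_size, coverage) / 1e9
    print(f"{read_type}: about {gigabases:.0f} Gb of sequence")
```

For a haploid or highly inbred sample, passing ploidy=1 halves these numbers.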

Assemblers
- hifiasm: easy to set up and run, with a short runtime.
- Verkko / LJA: easy to set up and run, but errors were observed on the first attempt.
- Canu: simple to use, and it produced a complete bacterial genome. For larger genomes such as mouse, however, runtime was a challenge.

Evaluation Metrics
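- Assembly size: the total assembled length, compared against the expected genome size.
- N50: the contig (or scaffold) length at which contigs of that length or longer make up at least half of the assembly.
- BUSCO: gene-content completeness, based on conserved single-copy orthologs.
- K-mer-based evaluation: Merqury or KAT.
- Alignment-based evaluation: QUAST.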
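
Of these metrics, N50 is the one most often misread, so a small worked example may help. The sketch below is my own illustration rather than the code of any of the tools listed: it sorts contig lengths from longest to shortest and returns the length at which the running total first reaches half of the assembly size. The contig lengths are made up.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths; the total is 290, so half is 145.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))  # 80 + 70 = 150 >= 145, so this prints 70
```

Because N50 only describes contiguity, a misassembled genome can still score well on it, which is why it is usually read alongside the other metrics in the list.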