Genome assembly
Reconstructing a genome from many short fragments of that genome is a complex process that requires a lot of computation. In long-read assembly, all reads are compared with each other to find overlaps. A layout is then constructed from the overlaps that are found, and overlapping reads are condensed into a consensus sequence. Because each read has to be compared with every other read, the number of comparisons grows quadratically with genome size: at the same sequencing coverage, a genome ten times larger requires roughly a hundred times more calculations. Some genomes are tens of gigabases long, and their assembly jobs have to run for weeks on large compute clusters.
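The quadratic growth can be checked with a little arithmetic. The sketch below simply counts read pairs, assuming every read is compared once with every other read; the function name and the read counts are made up for illustration and do not describe any particular assembler.

```python
def pairwise_comparisons(n_reads: int) -> int:
    """Number of distinct read pairs when every read is compared with every other read."""
    return n_reads * (n_reads - 1) // 2


# Hypothetical read counts: the second data set has ten times as many reads.
small = pairwise_comparisons(1_000)
large = pairwise_comparisons(10_000)

# The ratio is close to 100, not 10.
ratio = large / small
print(f"{small} vs {large} comparisons, ratio {ratio:.1f}")
```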
The Uncorrected Long-read Integration Process
In collaboration with Leiden University we came up with a solution that scales linearly with genome size. Instead of comparing reads with each other, we compare them only with a limited number of unique sequences from the genome. Far fewer comparisons have to be made, and the assembly process becomes simpler as a result. We have used this software, TULIP, to assemble the ~1.5 Gbp king cobra genome in a few hours on a laptop computer.
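To illustrate why comparing reads with a fixed set of unique sequences scales better, here is a minimal sketch. It is not TULIP's actual implementation: it assumes that all unique sequences have the same length `k` and that they occur in the reads without errors, which real long reads do not guarantee. All names and sequences are hypothetical.

```python
def index_seeds(seeds: dict[str, str]) -> dict[str, str]:
    """Map each unique sequence to its name (assumes all have the same length)."""
    return {seq: name for name, seq in seeds.items()}


def seeds_in_read(read: str, seed_index: dict[str, str], k: int) -> list[tuple[str, int]]:
    """Scan the read once and report (seed name, position) for every exact match."""
    hits = []
    for pos in range(len(read) - k + 1):
        name = seed_index.get(read[pos:pos + k])
        if name is not None:
            hits.append((name, pos))
    return hits


# Toy data: three unique sequences of length 4 and two reads.
seeds = {"s1": "ACGT", "s2": "TTGA", "s3": "GGCC"}
reads = {"r1": "AAACGTCCTTGAC", "r2": "CTTGACAGGCCT"}

seed_index = index_seeds(seeds)
read_hits = {name: seeds_in_read(seq, seed_index, k=4) for name, seq in reads.items()}
print(read_hits)
```

Each read is scanned once against the index, so the work grows with the total amount of read sequence rather than with the number of read pairs. Reads `r1` and `r2` are never compared directly; they are connected only because both contain `s2`.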
An assembly pre-processor or an assembler?
TULIP can be used both as an assembler and as a pre-processor for other assembly tools. It builds a layout from all overlaps between the reads and the unique sequences from the genome, and can then use that layout to stitch reads together into contigs. It doesn’t create a consensus from those reads, so the contigs have the same accuracy as the reads themselves. TULIP also bundles all the reads that belong to each contig. These read bundles can then be used as input for other assembly tools, so a large genome assembly can be split up into many smaller assemblies.
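The bundling idea can be sketched as follows. This is a simplified illustration, not TULIP's algorithm: it assumes we already know, for each read, which unique sequences it contains and in what order. It ignores read orientation and conflicting links, and it treats every connected group of unique sequences as one contig. The function names and data are hypothetical.

```python
from collections import defaultdict


def build_links(read_hits: dict[str, list[str]]) -> dict[tuple[str, str], set[str]]:
    """Link unique sequences that are adjacent on a read; remember the supporting reads."""
    links = defaultdict(set)
    for read, seeds in read_hits.items():
        for a, b in zip(seeds, seeds[1:]):
            links[(a, b)].add(read)
    return links


def bundle_reads(links: dict[tuple[str, str], set[str]]) -> list[set[str]]:
    """Group linked unique sequences with union-find and collect the reads per group."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    bundles = defaultdict(set)
    for (a, _b), reads in links.items():
        bundles[find(a)] |= reads
    return list(bundles.values())


# Toy input: r1 and r2 share s2, r3 belongs to a different region.
read_hits = {"r1": ["s1", "s2"], "r2": ["s2", "s3"], "r3": ["s7", "s8"]}
bundles = bundle_reads(build_links(read_hits))
print(sorted(sorted(bundle) for bundle in bundles))
```

In this toy example each bundle could be written to its own file and handed to a separate assembler run, which is the sense in which one large assembly becomes many smaller ones.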
TULIP is free for academic use. Commercial users are required to obtain a license.