0.2 Basecalling of ONT reads
For Illumina and PacBio, the sequences arrive already basecalled by the machines, i.e., fluorescence information is already translated into DNA. However, for ONT data this is not always the case, especially when sequencing yourself. In this case, we need to translate the electrical current information, or squiggles, that arrive in FAST5 files, into DNA/FASTQ. For newer flowcells and kits we can use either the older software guppy or the newer (and better) dorado. For older flowcells and kits we have to resort to guppy.
To run either we will have to know what flowcell we used, what kit we used, and what settings for the sequencer we used, in order to select the correct model for basecalling, e.g., dna_r9.5_450bps.cfg -> DNA sequences with the r9.5 flowcell.
Guppy:
guppy_basecaller -i PATH/TO/FAST5_FILES/ -s PATH/TO/OUTPUT/FASTQ/ -c dna_r9.5_450bps.cfg --device cuda:0 --compress_fastq -r
Using dorado is a bit more elaborate, we first have to put all the FAST5 files together and create a POD5 file from the grouped FAST5 files.
single_to_multi_fast5 -i PATH/TO/FAST5_FILES/ -s PATH/TO/CLUSTERED_FAST5/ -t 10 --recursive
pod5 convert fast5 PATH/TO/CLUSTERED_FAST5/ --output PATH/TO/POD5/ -t 10
Next, we can start up an analysis server for dorado, selecting the correct model, e.g., dna_r9.4.1_450bps_hac.cfg-> DNA, on a R9.4 flowcell:
dorado_basecall_server --log_path PATH/TO/dorado_server_logs/ --config dna_r9.4.1_450bps_hac.cfg -p auto --device cuda:0
This will give us a temporary URL for the port we created which we need for the final step, e.g., ipc:///tmp/704d-df70-c964-8689.
And finally, we can do the basecalling of the data:
# check port address
ont_basecall_client --config dna_r9.4.1_450bps_hac.cfg --input_path PATH/TO/POD5/ --save_path PATH/TO/DORADO_OUTPUR/ --port ipc:///tmp/704d-df70-c964-8689