5. Polishing the assembly

One way of improving the genome assembly is by comparing the obtained contigs to the reads used to create them, which we call polishing. Once again, depending on the sequencing technique, we will have to use different software to do this.

Polishing Illumina and/or hybrid data

For this we can use nextPolish (https://nextpolish.readthedocs.io/en/latest/index.html). To use nextPolish we will have to create a list of the read files:

ls PATH/TO/reads1_R1.fq PATH/TO/reads1_R2.fq > PATH/TO/WORKDIR/sgs.fofn
ls PATH/TO/ont_reads.fq > PATH/TO/WORKDIR/lgs.fofn

We will also need to create a config file which will tell the program where to find the files, where to put the output, and what options to use:

nano PATH/TO/WORKDIR/run.cfg

This will open a new file with the name run.cfg. There we will fill out the following information:

[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = PATH/TO/ASSEMBLY_FILE
genome_size = auto
workdir = PATH/TO/WORKDIR/
polish_options = -p {multithread_jobs}

[sgs_option] # Short-read settings

sgs_fofn = PATH/TO/sgs.fofn
sgs_options = -max_depth 100 -bwa

[lgs_option] # Long-read settings
lgs_fofn = PATH/TO/lgs.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont # in case of PacBio use map-pb

Now we have set up the config file and the lists with reads and it is time to start the analysis!

nextPolish run.cfg

Polishing Oxford Nanopore data

Because the error rate in ONT reads is relatively high (compared to other sequencing methods) polishing the assembly needs a bit more work. What is generally recommended is perform Racon 4x and then Medaka. In some cases it is possible to run the software Homopolish after that as well, but this software relies heavily on already available genomes from closely related species (i.e., in my case, working with Leucocoprineae this didn’t work).

Polishing four rounds with Racon

To run Racon, we will prepare our data first with minimap2 followed by polishing the assembly with Racon:

minimap2 -a -t XX [ASSEMBLY FILE] [READS FILE] > OUTPUT.sam
racon -m 8 -x -6 -g -8 -w 500 -t XX [READS FILE] OUTPUT.sam [ASSEMBLY FILE] > OUTPUT.fasta

For Racon settings we have to look at the help output:

racon {options} [sequences]  [overlaps]  [target sequences] # default output is stdout
  [sequences] # input file in FASTA/FASTQ format (can be compressed with gzip) containing sequences used for correction
   [overlaps] # input file in MHAP/PAF/SAM format (can be compressed with gzip) containing overlaps between sequences and target sequences
   [target sequences] # input file in FASTA/FASTQ format (can be compressed with gzip) containing sequences which will be corrected

options:
   -u, --include-unpolished # output unpolished target sequences
   -f, --fragment-correction # perform fragment correction instead of contig polishing (overlaps file should contain dual/self overlaps!)
   -w, --window-length [int] # default: 500, size of window on which POA is performed
   -q, --quality-threshold [float] # default: 10.0, threshold for average base quality of windows used in POA
   -e, --error-threshold [float] # default: 0.3, maximum allowed error rate used for filtering overlaps
   --no-trimming # disables consensus trimming at window ends
   -m, --match [int] # default: 3, score for matching bases
   -x, --mismatch [int] # default: -5, score for mismatching bases
   -g, --gap [int] # default: -4, gap penalty (must be negative)
   -t, --threads [int] # default: 1, number of threads
   --version # prints the version number
   -h, --help # prints the usage

only available when built with CUDA:
   -c, --cudapoa-batches [int] # default: 0, number of batches for CUDA accelerated polishing per GPU
   -b, --cuda-banded-alignment # use banding approximation for polishing on GPU. Only applicable when -c is used.
   --cudaaligner-batches [int] # default: 0, number of batches for CUDA accelerated alignment per GPU
   --cudaaligner-band-width [int] # default: 0, Band width for cuda alignment. Must be >= 0. Non-zero allows user defined band width, whereas 0 implies auto band width determination.

NOTE: Sometimes the assembly file has a different file extension that minimap2is not able to recognize, e.g., in the case of SMARTdenovo, where the assembly file ends in .cns. In that case we will have to add .fasta as the file extension.
cp PATH/TO/ASSEMBLY.zmo.cns PATH/TO/WORKDIR/
mv PATH/TO/WORKDIR/ASSEMBLY.zmo.cns PATH/TO/WORKDIR/ASSEMBLY.zmo.cns.fasta

Because it is recommended to perform four rounds of Racon polishing we will have to redo the analysis 3x each with the newly produced assembly produced in the previous round:

minimap2 -a -t XX [RACON1 ASSEMBLY FILE] [READS FILE] > OUTPUT.sam
racon -m 8 -x -6 -g -8 -w 500 -t XX [READS FILE] OUTPUT.sam [RACON1 ASSEMBLY FILE] > OUTPUT.fasta

Polishing with Medaka

To run Medaka, it is important to know what Medaka model to use. To figure this out we have to look at what we used to obtain the data.

The model are named with the following scheme:

{pore}_{device}_{caller variant}_{caller version}

For example, if the pore/flowcell is r941, the device is MinION (min), you used the high-accuracy model (high), and the guppy version was 4.15. Choose the model, that is closest to that basecaller version.

With de medaka_consenus help we find:

-m  medaka model, (default: r1041_e82_400bps_sup_v5.0.0).
        Choices: r103_fast_g507 r103_hac_g507 r103_sup_g507 r1041_e82_260bps_fast_g632 r1041_e82_260bps_hac_g632 r1041_e82_260bps_hac_v4.0.0 r1041_e82_260bps_hac_v4.1.0 r1041_e82_260bps_joint_apk_ulk_v5.0.0 r1041_e82_260bps_sup_g632 r1041_e82_260bps_sup_v4.0.0 r1041_e82_260bps_sup_v4.1.0 r1041_e82_400bps_bacterial_methylation r1041_e82_400bps_fast_g615 r1041_e82_400bps_fast_g632 r1041_e82_400bps_hac_g615 r1041_e82_400bps_hac_g632 r1041_e82_400bps_hac_v4.0.0 r1041_e82_400bps_hac_v4.1.0 r1041_e82_400bps_hac_v4.2.0 r1041_e82_400bps_hac_v4.3.0 r1041_e82_400bps_hac_v5.0.0 r1041_e82_400bps_sup_g615 r1041_e82_400bps_sup_v4.0.0 r1041_e82_400bps_sup_v4.1.0 r1041_e82_400bps_sup_v4.2.0 r1041_e82_400bps_sup_v4.3.0 r1041_e82_400bps_sup_v5.0.0 r104_e81_fast_g5015 r104_e81_hac_g5015 r104_e81_sup_g5015 r104_e81_sup_g610 r941_e81_fast_g514 r941_e81_hac_g514 r941_e81_sup_g514 r941_min_fast_g507 r941_min_hac_g507 r941_min_sup_g507 r941_prom_fast_g507 r941_prom_hac_g507 r941_prom_sup_g507 r941_sup_plant_g610

So in this case we need to choose r941_min_hac_g507.

Now we are ready to run Medaka:

medaka_consensus -i PATH/TO/ONT.fastq.gz -d PATH/TO/RACON4_ASSEMBLY.fasta -o PATH/TO/OUTPUT -t XX -m MODEL

Polishing with Homopolish

INFO TO COME