4. Genome Assembly QC

We can check our assemblies in a variety of ways, e.g., compare it to reference genomes, check the overall statistics, see if there is any contamination that needs to be removed, etc

Assembly statistics

We will start with some simple statistics such as N50, number of contigs, contig size, etc. For this we use a simple perl script developed by Dr. Heath O’Brien (https://github.com/hobrien)

ContigStats -i ASSEMBLY_FILE -f FILE_FORMAT > output.log

When we then open the output.log file with a text editor (e.g., nano, gedit, vim), we will see something similar to this:

N50: 126,880  
Number contigs: 693  
Max_contig: 571,286  
Min_contig: 869  
Total_length: 37,632,407  
Average_contig: 54,304  
Num_gaps: 80  
Total_gap_length: 323

These are some basic statistics, but if we want to get more details or compare it to a reference genome of the same or closely related species, we will have to use the tool called QUAST - QUality ASsessment Tool:

quast [SEQUENCE FILE] -r [REFERENCE ASSEMBLY optional] -o [OUTPUT_NAME] -t XX --nanopore [ONT READS FILE] -1 [ILLUMINA 1 fastq.gz] -2 [ILLUMINA 2 fastq.gz]

This gives a HTML output that we can open in a browser with various statistics.

Completeness check (before polishing)

Statistics on size and number of fragments can give a good idea of the quality, but don’t tell everything. You might have beautiful statistics, but if the DNA of the fragments is of low quality, it is still a bad assembly! So how do we check this? But comparing the contents to fixed databases.

For this we use BUSCO, a tool that looks in a database of single copy genes for a particular group of organisms, and checks if it can find those genes in your assembly and if they are presented one time or are duplicated or fragmented

busco -i [ASSEMBLY FILE] -l [DATASET NAME] -o [OUTPUT_NAME] -m geno -c 10

List of fungal datasets available at the moment:

  • fungi_odb10

  • ascomycota_odb10

  • dothideomycetes_odb10

  • capnodiales_odb10

  • pleosporales_odb10

  • eurotiomycetes_odb10

  • chaetothyriales_odb10

  • eurotiales_odb10

  • onygenales_odb10

  • leotiomycetes_odb10

  • helotiales_odb10

  • saccharomycetes_odb10

  • sordariomycetes_odb10

  • glomerellales_odb10

  • hypocreales_odb10

  • basidiomycota_odb10

  • agaricomycetes_odb10

  • agaricales_odb10

  • boletales_odb10

  • polyporales_odb10

  • tremellomycetes_odb10

  • microsporidia_odb10

  • mucoromycota_odb10

  • mucorales_odb10

Example of an output:

C:95.3% (S:94.7%, D:0.6%), F:0.7%, M:4.0%, n:3870, E:5.7%       
3689    Complete BUSCOs (C)     (of which 209 contain internal stop codons)
3666    Complete and single-copy BUSCOs (S)       
23      Complete and duplicated BUSCOs (D)        
28      Fragmented BUSCOs (F)                     
153     Missing BUSCOs (M)                        
3870    Total BUSCO groups searched

Contamination and overall statistics check

Even if we try to work as clean as possible we sometimes still pick up contaminations that end up in our sequence results. Luckily there are tools that can help identify contaminated reads and select to remove them. Blobtools is one such tool: it uses BLAST, Diamond, minimap2, and BUSCO to give overall statistics and matches to the closest organisms for each of the reads. The software is quite extensive, so we might skip this but I will leave the codes here for you to use (if there is time or in the future).

Also, I have never run this on a server and it seems they have some instructions for this: https://blobtoolkit.genomehubs.org/pipeline/pipeline-tutorials/running-the-pipeline-on-a-cluster/

The full package of blobtools, know as blobtoolkit, exists of several steps that can be performed in whichever order you prefer with the exception of setting up the results folder and creating the base file for the analyses:

Create meta data yaml file

First we will need to create a file called meta.yamlwhich contains some basic information about your sample, e.g.;

record_type: scaffold 
taxon: 
	name: Leucoagaricus gongylophorus 
	taxid: 79220 
	phylum: Basidiomycota

As previously, you can use your favourite text editor for this (e.g., nano, gedit, vim, etc)

Create a BlobDir

Next we need to create a results folder called the BlobDir and add the information from the meta.yamlfile as well as some information regarding the taxonomy of the sample. For this you will need to know the NCBI taxid (https://www.ncbi.nlm.nih.gov/taxonomy):

blobtools create --fasta PATH/TO/ASSEMBLY_FILE --meta meta.yaml --taxid XXXX --taxdump PATH/TO/taxdump PATH/TO/OUTPUT/DIRECTORY

And we will merge all information together into a single file:

cat PATH/TO/BlobDir/meta.json

Use BLAST and UNIPROT to identify contaminations

This part works best when you have a local copy of the BLAST and UNIPROT databases, however, this takes up almost a 1TB…

BLAST

BLAST is a fantastic tool but it isn’t very efficient in using its resources, i.e., blasting my assemblies generally takes about 3 days with 10 threads:

blastn -db nt -query PATH/TO/ASSEMBLY_FILE -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 10 -max_hsps 1 -evalue 1e-25 -num_threads 10 -out PATH/TO/BlobDir/blast.out

UNIPROT

UNIPROT is faster than BLAST but looks at proteins only, and, therefore, tends to have less hits:

diamond blastx --query PATH/TO/ASSEMBLY_FILE --db PATH/TO/uniprot/reference_proteomes.dmnd --outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore --sensitive --max-target-seqs 1 --evalue 1e-25 --threads 10 > PATH/TO/BlobDir/diamond.out

Adding BLAST and UNIPROT results to BlobDir

Now we have generated the BLAST and UNIPROT results we need to add them to the overall BlobDir results:

blobtools add --hits PATH/TO/BlobDir/blast.out --hits PATH/TO/BlobDir/diamond.out --taxrule bestsumorder --taxdump PATH/TO/taxdump PATH/TO/BlobDir

Adding mapping coverage

Next we will mapping coverage information for our assembly by mapping our reads back on to the assembly. This will show use how much coverage we will have for each contig.

Using minimap2 to map reads

# SHORT READS
minimap2 -ax sr -t 10 PATH/TO/ASSEMBLY_FILE PATH/TO/READS/illumina_1.fastq.gz PATH/TO/READS/illumina_2.fastq.gz | samtools sort -@10 -O BAM -o PATH/TO/BlobDir/NAME.shortreads.bam -

# LONG READS -> in case of PacBio use -ax map-pb
minimap2 -ax map-ont -t 10 PATH/TO/ASSEMBLY_FILE PATH/TO/READS/ont.fastq | samtools sort -@10 -O BAM -o PATH/TO/BlobDir/NAME.longreads.bam -

Adding mapping coverage of short and/or long reads to BlobDir

And again we will add the results to the BlobDir:

# SHORT READS
blobtools add --cov PATH/TO/BlobDir/NAME.shortreads.bam PATH/TO/BlobDir

# LONG READS
blobtools add --cov PATH/TO/BlobDir/NAME.longreads.bam PATH/TO/BlobDir

Add BUSCO scores

Run BUSCO if necessary

If BUSCO wasn’t done yet, now is the time to do that, otherwise you can skip this step and immediately add your BUSCO results to the BlobDir:

busco -i PATH/TO/ASSEMBLY_FILE -o PATH/TO/OUTPUT -l YOUR_LINEAGE -m geno -c XX

Add to BUSCO results to BlobDir

blobtools add --busco PATH/TO/OUTPUT/run_YOUR_LINEAGE/full_table.tsv PATH/TO/BlobDir

Open dataset in BlobToolKit Viewer

Now we have collected all information and it is time to visualize them! This will create a local URL you can open in a browser to look at Blobtools beautiful graphs:

blobtools view --remote PATH/TO/BlobDir

ITS extraction to identify with BLAST

Another quick way to see if you sequenced the correct thing is to blast the rDNA regions commonly used for fungal barcoding, i.e., ITS (internal transcribed spacers 1 & 2 and 5.8S), LSU (large subunit rDNA), and SSU (small subunit rDNA). To make this more easy I adapted an existing python script (ITS only: https://github.com/pwkooij/Extract-ITS-sequences-from-a-fungal-genome/tree/master) and made it possible to extract all rDNA regions: https://github.com/pwkooij/Extract-ITS-sequences-from-a-fungal-genome/blob/patch-1/README.md

python extractITS.py -which ITS2 -i genome.fasta -o ./output/ -name mySpecies

This python script relies on ITSx (Bengtsson-Palme et al., 2013) and Barrnap (https://github.com/tseemann/barrnap), as well as biopython and pandas, which means that these need to be installed (in the same environment) in order to make it work.