4. Genome Assembly QC
We can check our assemblies in a variety of ways, e.g., compare it to reference genomes, check the overall statistics, see if there is any contamination that needs to be removed, etc
Assembly statistics
We will start with some simple statistics such as N50, number of contigs, contig size, etc. For this we use a simple perl script developed by Dr. Heath O’Brien (https://github.com/hobrien)
ContigStats -i ASSEMBLY_FILE -f FILE_FORMAT > output.log
When we then open the output.log file with a text editor (e.g., nano, gedit, vim), we will see something similar to this:
N50: 126,880
Number contigs: 693
Max_contig: 571,286
Min_contig: 869
Total_length: 37,632,407
Average_contig: 54,304
Num_gaps: 80
Total_gap_length: 323
These are some basic statistics, but if we want to get more details or compare it to a reference genome of the same or closely related species, we will have to use the tool called QUAST - QUality ASsessment Tool:
quast [SEQUENCE FILE] -r [REFERENCE ASSEMBLY optional] -o [OUTPUT_NAME] -t XX --nanopore [ONT READS FILE] -1 [ILLUMINA 1 fastq.gz] -2 [ILLUMINA 2 fastq.gz]
This gives a HTML output that we can open in a browser with various statistics.
Completeness check (before polishing)
Statistics on size and number of fragments can give a good idea of the quality, but don’t tell everything. You might have beautiful statistics, but if the DNA of the fragments is of low quality, it is still a bad assembly! So how do we check this? But comparing the contents to fixed databases.
For this we use BUSCO, a tool that looks in a database of single copy genes for a particular group of organisms, and checks if it can find those genes in your assembly and if they are presented one time or are duplicated or fragmented
busco -i [ASSEMBLY FILE] -l [DATASET NAME] -o [OUTPUT_NAME] -m geno -c 10
List of fungal datasets available at the moment:
fungi_odb10
ascomycota_odb10
dothideomycetes_odb10
capnodiales_odb10
pleosporales_odb10
eurotiomycetes_odb10
chaetothyriales_odb10
eurotiales_odb10
onygenales_odb10
leotiomycetes_odb10
helotiales_odb10
saccharomycetes_odb10
sordariomycetes_odb10
glomerellales_odb10
hypocreales_odb10
basidiomycota_odb10
agaricomycetes_odb10
agaricales_odb10
boletales_odb10
polyporales_odb10
tremellomycetes_odb10
microsporidia_odb10
mucoromycota_odb10
mucorales_odb10
Example of an output:
C:95.3% (S:94.7%, D:0.6%), F:0.7%, M:4.0%, n:3870, E:5.7%
3689 Complete BUSCOs (C) (of which 209 contain internal stop codons)
3666 Complete and single-copy BUSCOs (S)
23 Complete and duplicated BUSCOs (D)
28 Fragmented BUSCOs (F)
153 Missing BUSCOs (M)
3870 Total BUSCO groups searched
Contamination and overall statistics check
Even if we try to work as clean as possible we sometimes still pick up contaminations that end up in our sequence results. Luckily there are tools that can help identify contaminated reads and select to remove them. Blobtools is one such tool: it uses BLAST, Diamond, minimap2, and BUSCO to give overall statistics and matches to the closest organisms for each of the reads. The software is quite extensive, so we might skip this but I will leave the codes here for you to use (if there is time or in the future).
Also, I have never run this on a server and it seems they have some instructions for this: https://blobtoolkit.genomehubs.org/pipeline/pipeline-tutorials/running-the-pipeline-on-a-cluster/
The full package of blobtools, know as blobtoolkit, exists of several steps that can be performed in whichever order you prefer with the exception of setting up the results folder and creating the base file for the analyses:
Create meta data yaml file
First we will need to create a file called meta.yamlwhich contains some basic information about your sample, e.g.;
record_type: scaffold
taxon:
name: Leucoagaricus gongylophorus
taxid: 79220
phylum: Basidiomycota
As previously, you can use your favourite text editor for this (e.g., nano, gedit, vim, etc)
Create a BlobDir
Next we need to create a results folder called the BlobDir and add the information from the meta.yamlfile as well as some information regarding the taxonomy of the sample. For this you will need to know the NCBI taxid (https://www.ncbi.nlm.nih.gov/taxonomy):
blobtools create --fasta PATH/TO/ASSEMBLY_FILE --meta meta.yaml --taxid XXXX --taxdump PATH/TO/taxdump PATH/TO/OUTPUT/DIRECTORY
And we will merge all information together into a single file:
cat PATH/TO/BlobDir/meta.json
Use BLAST and UNIPROT to identify contaminations
This part works best when you have a local copy of the BLAST and UNIPROT databases, however, this takes up almost a 1TB…
BLAST
BLAST is a fantastic tool but it isn’t very efficient in using its resources, i.e., blasting my assemblies generally takes about 3 days with 10 threads:
blastn -db nt -query PATH/TO/ASSEMBLY_FILE -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 10 -max_hsps 1 -evalue 1e-25 -num_threads 10 -out PATH/TO/BlobDir/blast.out
UNIPROT
UNIPROT is faster than BLAST but looks at proteins only, and, therefore, tends to have less hits:
diamond blastx --query PATH/TO/ASSEMBLY_FILE --db PATH/TO/uniprot/reference_proteomes.dmnd --outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore --sensitive --max-target-seqs 1 --evalue 1e-25 --threads 10 > PATH/TO/BlobDir/diamond.out
Adding BLAST and UNIPROT results to BlobDir
Now we have generated the BLAST and UNIPROT results we need to add them to the overall BlobDir results:
blobtools add --hits PATH/TO/BlobDir/blast.out --hits PATH/TO/BlobDir/diamond.out --taxrule bestsumorder --taxdump PATH/TO/taxdump PATH/TO/BlobDir
Adding mapping coverage
Next we will mapping coverage information for our assembly by mapping our reads back on to the assembly. This will show use how much coverage we will have for each contig.
Using minimap2 to map reads
# SHORT READS
minimap2 -ax sr -t 10 PATH/TO/ASSEMBLY_FILE PATH/TO/READS/illumina_1.fastq.gz PATH/TO/READS/illumina_2.fastq.gz | samtools sort -@10 -O BAM -o PATH/TO/BlobDir/NAME.shortreads.bam -
# LONG READS -> in case of PacBio use -ax map-pb
minimap2 -ax map-ont -t 10 PATH/TO/ASSEMBLY_FILE PATH/TO/READS/ont.fastq | samtools sort -@10 -O BAM -o PATH/TO/BlobDir/NAME.longreads.bam -
Adding mapping coverage of short and/or long reads to BlobDir
And again we will add the results to the BlobDir:
# SHORT READS
blobtools add --cov PATH/TO/BlobDir/NAME.shortreads.bam PATH/TO/BlobDir
# LONG READS
blobtools add --cov PATH/TO/BlobDir/NAME.longreads.bam PATH/TO/BlobDir
Add BUSCO scores
Run BUSCO if necessary
If BUSCO wasn’t done yet, now is the time to do that, otherwise you can skip this step and immediately add your BUSCO results to the BlobDir:
busco -i PATH/TO/ASSEMBLY_FILE -o PATH/TO/OUTPUT -l YOUR_LINEAGE -m geno -c XX
Add to BUSCO results to BlobDir
blobtools add --busco PATH/TO/OUTPUT/run_YOUR_LINEAGE/full_table.tsv PATH/TO/BlobDir
Add links to BlobDir
The next step will just add more background information to the results output and is optional:
blobtools add --link taxon.taxid.ENA="https://www.ebi.ac.uk/ena/ EBI ENA LINK OF YOUR SPECIES" --link taxon.name.Wikipedia="WIKIPEDIA LINK OF YOUR SPECIES" PATH/TO/BlobDir
Open dataset in BlobToolKit Viewer
Now we have collected all information and it is time to visualize them! This will create a local URL you can open in a browser to look at Blobtools beautiful graphs:
blobtools view --remote PATH/TO/BlobDir
ITS extraction to identify with BLAST
Another quick way to see if you sequenced the correct thing is to blast the rDNA regions commonly used for fungal barcoding, i.e., ITS (internal transcribed spacers 1 & 2 and 5.8S), LSU (large subunit rDNA), and SSU (small subunit rDNA). To make this more easy I adapted an existing python script (ITS only: https://github.com/pwkooij/Extract-ITS-sequences-from-a-fungal-genome/tree/master) and made it possible to extract all rDNA regions: https://github.com/pwkooij/Extract-ITS-sequences-from-a-fungal-genome/blob/patch-1/README.md
python extractITS.py -which ITS2 -i genome.fasta -o ./output/ -name mySpecies
This python script relies on ITSx (Bengtsson-Palme et al., 2013) and Barrnap (https://github.com/tseemann/barrnap), as well as biopython and pandas, which means that these need to be installed (in the same environment) in order to make it work.