Home
Overview¶
Sketchy
is a nanopore lineage calling and genotyping tool based on the heuristic principle of genomic neighbor typing (Břinda et al. 2020). Sketchy
queries species-wide (hypothesis-agnostic) reference sketches using MinHash methods to infer genotypes based on the closest reference match. Reference databases and genotypes, such has multi-locus sequence types, susceptibility profiles, or virulence factors, are configurable by users.
Install¶
Rust client.
$ cargo install sketchy
$ conda install -c bioconda sketchy
Release binaries for Linux and MacOS are available.
Subcommands¶
Show subcommands available in the Rust client.
$ sketchy --help
Predictions¶
Sketchy
predicts the genotype of the genome with the highest shared hashes in the reference sketch. Two modes are available:
- Offline (read sets): k-mers are hashed and shared hashes computed from a set of reads. Output are the
--top
genome matches and their genotypes. - Online (streaming): k-mers are hashed for each incoming read, the sum of shared hashes for each genome in the reference sketch is updated and the
--top
genome matches and their genotypes based on the sorted sum of shared hashes at the current read are printed.
Sum of shared hashes based on streaming are slightly less informative than shared hashes based on read sets. Streaming requires less memory, but can be slow for large read sets, especially when deploying large reference sketches.
Predictions require the input sequences (-i
), a reference sketch (-r
) and a matching genotype file (-g
) as described below (genotype files).
Read sets¶
Output the best 5 matches against the reference sketch with header.
$ sketchy predict -i seq.fq -r saureus.msh -g saureus.tsv -t 5 -H
Streaming¶
Output the updated best match against the reference sketch from a stream of reads.
$ cat seq.fq | sketchy predict -r saureus.msh -g saureus.tsv -s
Sketches¶
Species sketches¶
Species sketches are available in the data repository and can be built from local reference assembly collections.
Local sketches¶
Reference sketches can be built from any collection of assembled genomes for which associated genotype or phenotype data are available.
Build a default-resolution (s = 1000
) reference database from any collection of fasta
.
$ sketchy sketch -i *.fa -k 16 -s 1000 -o ref.msh
You can pipe assemblies using find
into the sketch construction, as wildcard expansions are limited to ~30,000 files.
$ find assemblies/ -name "*.fa" | sketchy sketch -k 16 -s 1000 -o ref.msh
List sketch parameters.
$ sketchy info -i ref.msh -p
Genotype files¶
Prediction requires a tab-delimited genotype index in the same order and of the same length as the reference sketch. Names in the genotype index (first column) are the file name of the input files in the sketch
name mlst tetracyline penicillin methicilin
ERR129347.fa ST82 R R S
ERR121347.fa ST93 S S S
List the order of genomes in the sketch, their length (bp) and an estimate of cardinality (bp)
$ sketchy info -i ref.msh
Check if the genotype file contains the correct number and order of genomes as the sketch.
$ sketchy check -r ref.msh -g ref.tsv
This will outpout ok
to stdout
if the check is completed successfully and fail with an error message otherwise.
Sketch validation¶
We conducted simulations using badread
of the following species:
- Neisseria spp.
- Streptococcus pneumoniae
- Klebsiella pneumoniae
- Staphylococcus aureus
- Pseudomonas aeruginosa
Cross-validation sketches (using subsampling of reference assemblies) and read data are available in the data repository.
Other¶
Shared hashes¶
Given two assembled genome sequences, create a sketch at default k-mer size of k = 16
and sketch size of s = 1000
. Mash
configuration can be replicated by setting --seed 42
.
Sketch two genome assemblies with identical settings.
$ sketchy sketch -i genome1.fa -o genome1.msh
$ sketchy sketch -i genome2.fa -o genome2.msh
Compute shared hashes between the reference and query genomes.
$ sketchy shared -r genome1.msh -q genome2.msh
> genome1.fa genome2.fa 360
If multiple sketches are available compute pairwise shared hashes.
$ sketchy sketch -i genome1.fa genome2.fa -o multi.msh
$ sketchy shared -r multi.msh -q genome2.msh
> genome1.fa genome2.fa 360
> genome2.fa genome2.fa 1000