Please cite this article if you use our databases:
Bigot T, Temmam S, Pérot P and Eloit M. RVDB-prot, a reference viral protein database and its HMM profiles [version 2; peer review: 2 approved]. F1000Research 2020, 8:530 (https://doi.org/10.12688/f1000research.18776.2)
https://f1000research.com/articles/8-530
If you have any questions or comments regarding these protein databases, please contact Marc Eloit (marc.eloit@pasteur.fr) or Thomas Bigot (thomas.bigot@pasteur.fr) respectively for virological and bioinformatic topics.
Reference Viral Databases (RVDB-prot and RVDB-prot-HMM) were developed by Thomas Bigot in Marc Eloit’s Pathogen Discovery group in collaboration with Center of Bioinformatics, Biostatistics and Integrative Biology (C3BI) at Institut Pasteur, for enhancing virus detection using next-generation sequencing (NGS) technologies. They are based on the reference Viral DataBase, courtesy of Arifa Khan’s group at CBER, FDA:
https://rvdb.dbi.udel.edu/.
They are updated after each new release of the nucleotidic database. The version number of the protein databases follows the one of the original nucleic database.
Please note protein version is based on the unclustered (prefix “U-”) version of RVDB.
For each version, several files are available:
-prot.fasta
;-prot.hmm
, the HMM profiles, generated with and for hmmer 3.2.1 (from 2019, 3.1b2 before);-prot-hmm-annot.sqlite
, SQLite db containing annotations (please find a documentation below);-annot.txt/
, a directory with plain text files (one per family)You can find above a SQLite file containing all annotations. Here is its schema and some example usage.
# list top 10 keywords of family ID 5:
sqlite> SELECT freq,str FROM fam_kw JOIN keyword ON fam_kw.kwId = keyword.id WHERE fam_kw.famID = 5 ORDER BY freq DESC LIMIT 10;
2782|virus
2782|Influenza
2772|neuraminidase
806|H1N1
477|2009
444|H3N2
279|swine
272|chicken
251|duck
246|India
# how many families are related to keyword Influenza:
sqlite> SELECT COUNT(freq) FROM fam_kw JOIN keyword ON fam_kw.kwId = keyword.id WHERE keyword.str="Influenza";
65
# how many sequences in all hmm profiles are related to keyword Influenza:
sqlite> SELECT SUM(freq) FROM fam_kw JOIN keyword ON fam_kw.kwId = keyword.id WHERE keyword.str="Influenza";
14140
The steps are performed with internal scripts, except for Golden, which is a tool to access locally a database entry with no delay.
[Reference: https://projets.pasteur.fr/projects/golden/wiki
]
The workflow is based on vFAM one (see Profile Hidden Markov Models for the Detection of Viruses within Metagenomic Sequence Data, Peter Skewes-Cox et al.). Many steps of this workfow are performed with original script, courtesy of Peter Skewes-Cox. Nevertheless, changes were made: