RVDB database, protein version

Questions / Comments

If you have any questions or comments regarding these protein databases, please contact Marc Eloit (marc.eloit@pasteur.fr) or Thomas Bigot (thomas.bigot@pasteur.fr) respectively for virological and bioinformatic topics.

Description

Reference Viral Databases (RVDB-prot and RVDB-prot-HMM) were developed by Thomas Bigot in Marc Eloit’s Pathogen Discovery group in collaboration with Center of Bioinformatics, Biostatistics and Integrative Biology (C3BI) at Institut Pasteur, for enhancing virus detection using next-generation sequencing (NGS) technologies. They are based on the reference Viral DataBase, courtesy of Arifa Khan’s group at CBER, FDA:
https://hive.biochemistry.gwu.edu/rvdb/.
They are updated after each new release of the nucleotidic database. The version number of the protein databases follows the one of the original nucleic database.

Please note protein version is based on the unclustered (prefix “U-”) version of RVDB.

Download

Description

For each version, several files are available:

Files

VersionDateProteic sequencesHMM profilesSQLite AnnotationsPlain Text Annotations
16.02019-06 -prot.fasta.bz2 -prot.hmm.bz2 -prot-hmm.sqlite.bz2 -prot-hmm-txt.zip
15.12019-02 -prot.fasta.bz2 -prot.hmm.bz2 -prot-hmm.sqlite.bz2 -prot-hmm-txt.zip
14.02018-09 -prot.fasta.bz2 -prot.hmm.bz2 -prot-hmm.sqlite.bz2 -prot-hmm-txt.zip
13.02018-06 -prot.fasta.bz2 -prot-hmm.zip -prot-hmm-annot-sqlite.zip -prot-hmm-annot-txt.zip
12.22018-03 -prot.zip -prot-hmm.zip -prot-hmm-annot-sqlite.zip -prot-hmm-annot-txt.zip
11.52017-10 -prot.zip -prot-hmm.zip -prot-hmm-annot-sqlite.zip -prot-hmm-annot-txt.zip
10.22017-04 -prot.zip -prot-hmm.zip -prot-hmm-annot-sqlite.zip -prot-hmm-annot-txt.zip

Annotations

You can find above a SQLite file containing all annotations. Here is its schema and some example usage.


# list top 10 keywords of family ID 5:
sqlite> SELECT freq,str FROM fam_kw JOIN keyword ON fam_kw.kwId = keyword.id  WHERE fam_kw.famID = 5 ORDER BY freq DESC LIMIT 10;
2782|virus
2782|Influenza
2772|neuraminidase
806|H1N1
477|2009
444|H3N2
279|swine
272|chicken
251|duck
246|India

# how many families are related to keyword Influenza:
sqlite> SELECT COUNT(freq) FROM fam_kw JOIN keyword ON fam_kw.kwId = keyword.id WHERE keyword.str="Influenza";
65

# how many sequences in all hmm profiles are related to keyword Influenza:
sqlite> SELECT SUM(freq) FROM fam_kw JOIN keyword ON fam_kw.kwId = keyword.id WHERE keyword.str="Influenza";
14140


Method

Proteic flat file

The steps are performed with internal scripts, except for Golden, which is a tool to access locally a database entry with no delay.
[Reference: https://projets.pasteur.fr/projects/golden/wiki]

HMM Profiles

Workflow

The workflow is based on vFAM one (see Profile Hidden Markov Models for the Detection of Viruses within Metagenomic Sequence Data, Peter Skewes-Cox et al.). Many steps of this workfow are performed with original script, courtesy of Peter Skewes-Cox. Nevertheless, changes were made:

  • due to a significantly bigger amount of data to proceed (more than 800k sequence instead of about 50k for vFAM), original script was splitted and some steps received more efficient parallelization (computation cluster usage);
  • sequences were clustered with silix;
  • no minimum coverage so far;
  • no polyprotein removed.

© Institut Pasteur 2018