Human genomics suffers from little standardization, a steady stream of new scientific findings published almost every day, and an explosion of sequencing data generated for various purposes. As a result, the landscape is fragmented, siloed, and inconsistent. We all know how frustrating it can be to gather comprehensive information about a single genomic variant. The way forward is data integration, harmonization, and cross-referencing.
However, integration of large data sets (Big Data), especially in the field of genomics, is a challenging endeavor which can be successfully tackled only by a multidisciplinary team, bringing together strong skills in both life sciences and software engineering. The data integration process can be seen as constructing a skyscraper. One can't build a skyscraper by stacking up small houses on top of each other. Building a skyscraper requires a whole new approach, and so does the integration and harmonization of genomics Big Data. In our approach, VarSome is the skyscraper, while MolecularDB is the architectural plan for it.
MolecularDB is VarSome’s integration and harmonization engine for genomics Big Data. It is a purpose-built data storage system, designed by our engineers from the first line of code to meet the high demands of clinical-grade genomics applications, such as the annotation of whole genomes, exomes, and gene panels.
Currently, VarSome provides access through MolecularDB to over 35 public genomics-related data sets, which represent over 33 billion data points, plus contributions from a 200,000-strong global community. But there is more to it: whenever a public database is updated, MolecularDB processes the new release and makes it available on VarSome with a very short turnaround time. Apart from public data resources, MolecularDB can facilitate access to proprietary databases (such as your own private variant database) as well as licensed databases (such as HGMD), and cross-reference their content with the data sets already available on VarSome (either publicly or privately).
Data quality is of paramount importance: MolecularDB ensures that genomics data are meticulously integrated and cross-referenced, and that insertions and deletions are matched consistently across all the data resources available on VarSome. MolecularDB also runs comprehensive data integrity checks every day.
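To illustrate why insertions and deletions need consistent matching: the same indel can be written at several positions inside a repeat, so two databases may describe one variant in two ways. The sketch below shows the widely used "trim and left-align" normalization. It is not MolecularDB's actual implementation, which is not described here; the function name `normalize` and the toy reference sequence are assumptions made for illustration.

```python
def normalize(pos, ref, alt, genome):
    """Left-align and trim a variant (illustrative sketch only).

    pos    -- 1-based position of the first base of `ref`
    ref    -- reference allele
    alt    -- alternate allele
    genome -- reference sequence of the chromosome, as a string
    """
    if ref == alt:
        raise ValueError("ref and alt are identical: not a variant")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot left-extend past the sequence start")
            pos -= 1
            base = genome[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt


# Toy reference with a CA repeat; both calls describe the same 2-base deletion.
GENOME = "GGGCACACACAGGG"
first = normalize(8, "CAC", "C", GENOME)
second = normalize(5, "ACA", "A", GENOME)
print(first, second)  # both are (3, 'GCA', 'G')
```

Once every resource is normalized the same way, the tuple (chromosome, position, ref, alt) can serve as a common key for cross-referencing.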
35+ Integrated Resources in VarSome
ClinVar, dbSNP, gnomAD, HPO, MONDO, Ensembl, RefSeq, GWAS, CGD, HGNC, UniGene, Orphanet, CIViC genes, GERP, dbNSFP, COSMIC, IARC TP53, ICGC, Kaviar, DANN scores, CIViC mutations, UniProt variants, UniProt domains, GHR, CPIC, DGV, DECIPHER, ExAC CNVs, ExAC genes, PanelApp, Mondo, PMKB, BRAVO, REVEL, scSNV. And we keep adding new ones!
VarSome’s integrated database is leveraged in VarSome Clinical, a clinically certified platform that allows fast and accurate variant discovery, annotation, and interpretation of NGS data for whole genomes, exomes, and gene panels. VarSome Clinical helps molecular geneticists and clinicians reach faster and more accurate diagnoses and treatment decisions for genetic conditions.
One of the benefits of having such a massive aggregated and harmonized database is that it can be used in downstream processes, such as automated variant classification according to the guidelines of the American College of Medical Genetics and Genomics (ACMG). VarSome’s robust implementation of the ACMG guidelines explains each ACMG rule, along with why it has or has not been triggered. If you have additional evidence, you can manually turn individual ACMG rules on or off, review the resulting verdict for your variant, and save it as a manual classification for your future samples. In addition, VarSome’s ACMG implementation is scrutinized by 200k+ users worldwide, which helps ensure its quality and comprehensiveness. Indeed, in our recent survey, a very large number of users said that the ACMG implementation is one of their main reasons for using VarSome!
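As a rough illustration of how triggered rules combine into a verdict, the sketch below encodes the combining table published with the 2015 ACMG/AMP guidelines and lets a user switch rules on or off. It is not VarSome's implementation, which may differ in its details; the function names `classify` and `apply_overrides` are hypothetical, and the rule counts use "at least" comparisons as a simplification.

```python
from collections import Counter

# Evidence strength is encoded in the rule name prefix (PVS1, PS1, PM2, BP4...).
PREFIXES = ("PVS", "PS", "PM", "PP", "BA", "BS", "BP")


def strength(rule):
    for prefix in PREFIXES:
        if rule.startswith(prefix):
            return prefix
    raise ValueError(f"unknown ACMG rule: {rule}")


def apply_overrides(triggered, turn_on=(), turn_off=()):
    """Return the rule set after manual changes (hypothetical helper)."""
    return (set(triggered) | set(turn_on)) - set(turn_off)


def classify(rules):
    c = Counter(strength(r) for r in rules)
    pvs, ps, pm, pp = c["PVS"], c["PS"], c["PM"], c["PP"]
    ba, bs, bp = c["BA"], c["BS"], c["BP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Conflicting pathogenic and benign evidence stays uncertain.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain significance"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely benign"
    return "Uncertain significance"


automatic = {"PVS1", "PM2"}
print(classify(automatic))                                      # Likely pathogenic
print(classify(apply_overrides(automatic, turn_on={"PP3"})))    # Pathogenic
```

The second call shows the effect of a manual change: adding one supporting rule on top of the automatically triggered ones moves the verdict from "Likely pathogenic" to "Pathogenic".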
Apart from aggregating and harmonizing genomics data resources, VarSome’s MolecularDB ensures extremely fast data retrieval for sample annotation. Performance matters a lot when annotating large data sets such as whole genomes and exomes, which can easily contain millions of variants. Full results and functional annotations are typically generated in a few tenths of a second.
Application Programming Interface
VarSome comes with an Application Programming Interface (API) which, like MolecularDB, has been designed with performance in mind: in practice it can fully annotate over 1,000 variants per second. This is made possible by batch requests, where a single API call can contain several thousand variants. A user-configurable allele frequency filter can further increase throughput by up to 4x.
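To make the batching idea concrete, here is a minimal client-side sketch that splits a variant list into batches and estimates the annotation time from the figures quoted above. It makes no network calls. The payload shape, the batch size of 1,000, and the helper names are assumptions for illustration only and do not describe the actual VarSome API; real throughput will depend on the data and the filter settings.

```python
import json


def batches(variants, batch_size=1000):
    """Yield consecutive slices of at most `batch_size` variants."""
    for start in range(0, len(variants), batch_size):
        yield variants[start:start + batch_size]


def build_payload(batch):
    """Serialize one batch as JSON (hypothetical payload shape)."""
    return json.dumps({"variants": batch})


def estimated_seconds(n_variants, variants_per_second=1000.0, speedup=1.0):
    """Rough time estimate; the rate and speedup are the figures quoted above."""
    return n_variants / (variants_per_second * speedup)


variants = [f"chr1:{100000 + i}:A:G" for i in range(2500)]  # made-up variants
sizes = [len(b) for b in batches(variants)]
print(sizes)                                   # [1000, 1000, 500]
print(estimated_seconds(4_000_000))            # 4000.0 seconds, about 67 minutes
print(estimated_seconds(4_000_000, speedup=4)) # 1000.0 seconds, about 17 minutes
```

Under these assumptions, a sample with four million variants would take roughly an hour without the filter and about a quarter of that in the best case with it.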
Another consequence of MolecularDB's specific architecture is that VarSome offers very versatile variant lookup mechanisms. You can search VarSome by HGVS nomenclature (at both the DNA and the protein level), rsID, gene name, transcript symbol, or genomic location. VarSome can also parse a single line from a VCF file to look up the variant it describes. Moreover, the results are not limited to known variants: you can query any possible variant, including ‘abstract variants’, i.e. variants defined by a range of coordinates or by specific attributes. Thanks to this powerful search mechanism, VarSome can annotate variants that no one has seen before. See VarSome’s variant query examples.
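The query types listed above can be told apart with a few simple patterns. The sketch below is a simplified heuristic, not VarSome's parser: the regular expressions, the category names, and both function names are assumptions, and real HGVS parsing is considerably more involved. The VCF handling relies only on the standard tab-separated column order (CHROM, POS, ID, REF, ALT).

```python
import re

RSID = re.compile(r"^rs\d+$", re.IGNORECASE)
# Reference sequence or gene, a colon, then a coordinate-type prefix such as c. or p.
HGVS = re.compile(r"^[^\s:]+:[cgmnpr]\.\S+$")
LOCATION = re.compile(r"^(chr)?([0-9]{1,2}|[XYM]|MT):\d+(-\d+)?$", re.IGNORECASE)


def parse_vcf_line(line):
    """Return one (chrom, pos, ref, alt) tuple per ALT allele of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("a VCF data line needs at least CHROM, POS, ID, REF, ALT")
    chrom, pos, _id, ref, alt = fields[:5]
    return [(chrom, int(pos), ref, allele) for allele in alt.split(",")]


def classify_query(query):
    """Guess the kind of query from its shape (heuristic only)."""
    q = query.strip()
    if "\t" in q:
        return "vcf_line"
    if RSID.match(q):
        return "rsid"
    if HGVS.match(q):
        return "hgvs"
    if LOCATION.match(q):
        return "location"
    return "gene_or_text"


for example in ("rs113488022", "NM_004333.4:c.1799T>A", "chr7:140453136", "BRAF"):
    print(example, "->", classify_query(example))

print(parse_vcf_line("7\t140453136\t.\tA\tT\t.\t.\t."))
# [('7', 140453136, 'A', 'T')]
```

Anything that matches none of the patterns falls through to a gene-name or free-text lookup, which mirrors the idea that a single search box can accept all of these forms.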
VarSome's full-text search works like other Internet search engines, with one important difference: a query returns entries only from the VarSome aggregated knowledge base, so every result is relevant to genomics. It lets you run targeted searches not just for variants but over the entire contents of VarSome, such as articles, diseases, phenotypes, and genes. Importantly, this includes content contributed by the entire VarSome global user community. See VarSome's full-text search examples.