ABSD

Antibodies are a key element of the immune response. Despite the almost infinite combination of possible antibodies, only a few thousand human and other species sequences are publicly available today. Moreover, these sequences are spread across several different resources, making it difficult to compile them into a single dataset (redundancy, missing "pairing" information between heavy and light chains, errors in metadata, etc.).

With the generalization of deep learning, having a single, standardized dataset containing as many antibodies as possible, selected according to specific criteria, has become important.

This website aims to address this issue by allowing users to easily build a set of antibodies suited to their own needs.

The antibodies database

The data used by ABSD are stored in a MongoDB database on the Kubernetes cluster. The database is automatically recreated when its current content does not correspond to the FASTA files in the data/ folder.

Data folder

The data folder must contain FASTA files, compressed TAR archives, statistics JSON files, and a sources.json file.

The FASTA files in the data folder must follow the naming and structure conventions explained in the next sections. The general rules are:

  • Files must respect the FASTA format.
  • Files must contain only polypeptide sequences of the variable regions, for both heavy and light chains.
  • Only one file per species.

At any point, do not hesitate to look at the existing FASTA files in the data folder to guide you.

FASTA naming convention

Each file must follow the format: Genus_species.fasta.

e.g. Homo_sapiens.fasta, Mus_musculus.fasta, Bos_taurus.fasta...

The taxonomic genus and species of the organism must be separated by an underscore, and the file name must end with the .fasta extension. The convention of biological nomenclature is to capitalize the genus and leave the species in lower case.

This convention allows ABSD to seamlessly add new species to the database and interface.
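
For illustration, a file name can be checked against this convention with a simple pattern. This is only a sketch, not the validation used by ABSD itself (see the npm run validate command below):

// Sketch: check the Genus_species.fasta naming convention.
const FASTA_NAME = /^[A-Z][a-z]+_[a-z]+\.fasta$/

console.log(FASTA_NAME.test('Homo_sapiens.fasta')) // true
console.log(FASTA_NAME.test('homo sapiens.fasta')) // false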

FASTA entries

The files must contain FASTA entries for both light chain and heavy chain sequences. For each antibody, the light chain entry must come first and be immediately followed by the heavy chain entry.

For instance, in the excerpt below, the 6CE0 antibody light and heavy chains are immediately followed by those of the next antibody (the 4R4N/4R4H entries):

>6CE0|||6CE0_6|Chain F[auth L]|...
LSVALGETARISCGRQALGSRAVQWYQHKPGQAP...
>6CE0|||6CE0_5|Chain E[auth H]|PGT124...
QVQLQESGPGLVRPSETLSVTCIVSGGSISNYYWTWIRQSPGK...
>4R4N|||4R4N_3|Chains AA[auth D],...
DIQMTQSPSFVSASVGDRVTITCRASQGISSYLAW...
>4R4H|||4R4H_4|Chain D[auth H]|Antib...
QVQLQQWGAGLLKPSETLSLTCGVYGESLSGHYWS...
...
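
To illustrate the pairing convention, a reader could walk a species file two entries at a time, as in the following sketch (not the actual ABSD parser; the function name is hypothetical):

// Sketch: group the entries of a species FASTA file into (light, heavy) pairs.
// Assumes the file strictly follows the convention described above.
const fs = require('node:fs')

function readPairs (path) {
  const entries = fs.readFileSync(path, 'utf8')
    .split('>')
    .filter(Boolean)
    .map(block => {
      const [header, ...lines] = block.split('\n')
      return { header, sequence: lines.join('').trim() }
    })

  const pairs = []
  for (let i = 0; i + 1 < entries.length; i += 2) {
    pairs.push({ light: entries[i], heavy: entries[i + 1] })
  }
  return pairs
}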

The FASTA headers must start with a main ID (like 6CE0), followed by an aggregation of other FASTA headers gathered from external databases, such as PDB or UniProt, separated by triple pipes |||.

Each of these sub-headers must follow the format header;id;vGeneSegment;source where:

  • header is the FASTA header itself.
  • id is the identifier of the chain found in the source. It is used to build URL links in the interface.
  • vGeneSegment is the name of the V gene segment from which the sequence has been produced, according to BLAST.
  • source is the name of the source where this header was found.

Example: 5UM8_3|Chain C[auth L]|Fab PGT124 light chain|Homo sapiens (9606);5um8;IGLV3;PDB

This header ends with ;5um8;IGLV3;PDB, which indicates that it was found in the PDB with the 5um8 identifier and that the sequence is derived from the IGLV3 gene.

It is thus accessible at https://www.rcsb.org/structure/5um8.
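
As a sketch (not the actual ABSD code), a full header can be split into its main ID and its sub-header fields. Splitting on ||| first matters because the header field itself may contain single pipes:

// Sketch: split an ABSD FASTA header into its main ID and sub-headers,
// then each sub-header into the header;id;vGeneSegment;source fields.
// Assumes the individual fields contain no semicolons.
function parseHeader (fullHeader) {
  const [mainId, ...subHeaders] = fullHeader.replace(/^>/, '').split('|||')
  return {
    mainId,
    subHeaders: subHeaders.map(sub => {
      const [header, id, vGeneSegment, source] = sub.split(';')
      return { header, id, vGeneSegment, source }
    })
  }
}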

Why is there an ID for each FASTA header?

The sequences may be stored in multiple sources and thus have different identifiers. Sometimes, they can even be duplicated within the same database.

This is why it's important to keep the id for each fasta header in order to ensure proper linking.

sources.json

The sources.json file must be a JSON file and contain metadata describing the sources present in the fasta data. It is imported by the interface to create links and documentation.

Each key must be the name of the source and the value must be an object with the optional link and description properties:

  • link: contains the URL of the source, with an optional {id} token. This token is replaced by the FASTA id in order to generate a link in the interface.
  • description: a short, textual description of the source. It is mostly used for documentation in the interface.

For instance:

"IMGT": {
  "link": "https://www.imgt.org/3Dstructure-DB/cgi/details.cgi?pdbcode={id}"
},
"PDB": {
  "link": "https://www.rcsb.org/structure/{id}"
},
...
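
For illustration, generating a link from this file boils down to a token replacement, as in this sketch (the actual interface code may differ):

// Sketch: build the external link for a chain, given its source name and id.
// `sources` is the parsed content of sources.json.
function buildLink (sources, sourceName, id) {
  const source = sources[sourceName]
  if (!source || !source.link) return null
  return source.link.replace('{id}', id)
}

// buildLink(sources, 'PDB', '5um8') -> 'https://www.rcsb.org/structure/5um8'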

Compressed TAR archives

The tar.gz archives are generated by hand at each update of the data, using the POSIX tar tool. They contain the FASTA files for each species, along with a statistics file about the antibodies.

They must be named with only the date in ISO format (YYYY-MM-DD) and must end with the .tar.gz extension.

This allows easy sorting of the archives in chronological order, since with ISO-formatted dates, alphabetical and chronological orders are the same.
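
For example, with ISO-dated names (the dates below are only illustrative), a plain lexicographic sort already yields chronological order:

// ISO-dated archive names sort chronologically with a plain lexicographic sort.
const archives = ['2023-11-30.tar.gz', '2022-05-17.tar.gz', '2024-01-08.tar.gz']
console.log(archives.sort())
// [ '2022-05-17.tar.gz', '2023-11-30.tar.gz', '2024-01-08.tar.gz' ]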

Database collections

In MongoDB, the database is created with the name provided by the ABSD_DB_NAME environment variable. Two collections are then produced:

  • antibodies: contains the antibodies themselves
  • statistics: contains the statistics of the database

Those collections follow the JSON schemas stored in the src/server/schemas directory.
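
As a sketch of how these collections might be opened with the official MongoDB Node.js driver (not the actual ABSD server code; the connection values mirror the defaults documented below):

// Sketch: open the ABSD database and its two collections.
const { MongoClient } = require('mongodb')

async function openDb () {
  const client = new MongoClient('mongodb://127.0.0.1:27017')
  await client.connect()
  const db = client.db(process.env.ABSD_DB_NAME || 'absd')
  return {
    antibodies: db.collection('antibodies'),
    statistics: db.collection('statistics')
  }
}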

Antibodies identifiers

Since antibodies are gathered from multiple sources and databases, they need their own ID in ABSD.

This ID is actually a SHA256 hash, computed and stored during database creation. The seed of the hash is the concatenation of:

  • the species name, including the space between genus and species.
  • the sequence of the heavy chain.
  • the sequence of the light chain.

Example:

  • Mus musculus
  • EIQLQQSGPELVKPGTSVKVSCKASGYALTSYTMYWVKQSHGKSLEWIGYIDPYNGGTSYNQKFKGKATLTVDKSSSTAYMHLNSLTSEDSAVYYCAGWNRYDEDWGQGTTLTVSSA
  • DIVLTQSPASLAVSLGQRATISCRTSETIDSYGNSFMHWYQQKPGQPPKLLIYRASNLKSGIPARFSGSGSRTDFTLTINPVEADDVATYYCQQTNEVMYTFGGGTKLEIK

The hash seed would be:

Mus musculusEIQLQQSGPELVKPGTSVKVSCKASGYALTSYTMYWVKQSHGKSLEWIGYIDPYNGGTSYNQKFKGKATLTVDKSSSTAYMHLNSLTSEDSAVYYCAGWNRYDEDWGQGTTLTVSSADIVLTQSPASLAVSLGQRATISCRTSETIDSYGNSFMHWYQQKPGQPPKLLIYRASNLKSGIPARFSGSGSRTDFTLTINPVEADDVATYYCQQTNEVMYTFGGGTKLEIK

and the result would be 5548f0d1bcbbd127ae566a6f61110764972c6498715baf6525302bf77af4b948, which is an actual antibody identifier in ABSD.
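
As a sketch, assuming the seed is hashed as a plain UTF-8 string with Node's built-in crypto module (the actual computation happens during the database build):

// Sketch: compute the ABSD identifier of an antibody from its seed.
const { createHash } = require('node:crypto')

function antibodyId (species, heavyChain, lightChain) {
  return createHash('sha256')
    .update(species + heavyChain + lightChain)
    .digest('hex')
}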

Since hash collisions are very unlikely with SHA256, this approach has a few benefits:

  • There is a permanent ID for each antibody, removing the need to maintain any kind of incremental ID system.
  • The ID can be recomputed when creating a new database.
  • The ID allows users to check data integrity.

Development

To develop with ABSD, you first need to make sure that Node.js is installed on your machine, with the proper version indicated in the .node-version file.

You will also need MongoDB, installed either with Docker or directly on your machine.

Configuration

ABSD is configured via environment variables, mostly used in the Docker and Kubernetes files.

You will most likely not need to set any of them, since the default values work well for basic development.

Variable            Type     Default          Description
NODE_ENV            string   'development'    Type of environment. Can be 'development' or 'production'.
ABSD_SERVER_HOST    string   '127.0.0.1'      The IP address to listen on.
ABSD_SERVER_PORT    number   3000             Port of the server.
ABSD_SERVER_PROXY   boolean  false            Whether the application is behind a proxy.
ABSD_DB_HOST        string   '127.0.0.1'      IP address of the database.
ABSD_DB_PORT        number   27017            Port of the database.
ABSD_DB_NAME        string   'absd'           Name of the database.
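
A sketch of how these variables might be read with their defaults (the actual configuration code lives in the ABSD sources and may differ):

// Sketch: read the ABSD environment variables, falling back to the defaults above.
const config = {
  env: process.env.NODE_ENV || 'development',
  serverHost: process.env.ABSD_SERVER_HOST || '127.0.0.1',
  serverPort: Number(process.env.ABSD_SERVER_PORT || 3000),
  serverProxy: process.env.ABSD_SERVER_PROXY === 'true',
  dbHost: process.env.ABSD_DB_HOST || '127.0.0.1',
  dbPort: Number(process.env.ABSD_DB_PORT || 27017),
  dbName: process.env.ABSD_DB_NAME || 'absd'
}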

Set up and start

In the root of the project, simply enter a few npm commands to get started:

# Install all dependencies in the node_modules directory
npm install

# Build the database from the FASTA files
npm run build:db

# Start the development server
npm run dev

Test for production

In production, ABSD runs in a Docker container. If you would like to test your changes, you can simply build the interface and then run the server in production mode:

# Build the interface (src/client) with the Vite bundler.
# The code will be generated in the dist/ folder.
npm run build

# Start the server to use the dist/ directory code
# instead of the Vite dev server.
npm start

# OR

# Start the server in a production environment
# NOTE: this will recreate the database automatically
export NODE_ENV=production; npm start

Validate the FASTA files

It is possible to quickly check that the FASTA files roughly follow the expected format. To do that, simply enter:

npm run validate

This will validate every .fasta file found in the data folder.

Compute statistics

If you ever wish to see the current statistics of your database, simply enter:

npm run stats

This will compute and output the statistics in JSON format. This command is also used to produce the statistics JSON files in the data folder when the FASTA files are updated.

Credits

This project is conducted by the Bioinformatics and Biostatistics HUB of the Institut Pasteur.

  • Project leader: Nicolas Maillet
  • Web developer: Simon Malesys
  • Scientific supervisor: Bertrand Saunier
  • UI and UX design: Rachel Torchet
  • Logos and identity: Richard Bosseau

Reproducibility

Raw data can be found on recherche.data.gouv.fr.

Software Heritage

This application is archived in the Software Heritage project.

Article

ABSD has been published in NAR Genomics and Bioinformatics and can be cited with the DOI of the associated research article: https://doi.org/10.1093/nargab/lqae171.