
ABSD
Antibodies are a key element of the immune response. Despite the almost infinite combination of possible antibodies, only a few thousand sequences from humans and other species are publicly available as of today. Moreover, these sequences are spread across several different resources, making it difficult to compile them into a single dataset (redundancy, missing "pairing" information between heavy and light chains, errors in metadata, etc.).
With the generalization of deep learning, having a single, standardized dataset containing as many antibodies as possible, selected according to certain criteria, is of importance.
This website aims to address this issue by allowing users to easily build a set of antibodies suited to their own needs.
The antibodies database
The data used by ABSD are stored in a MongoDB database on the Kubernetes cluster. The database is automatically recreated when its current content does not correspond to the fasta files in the data/ folder.
Data folder
The data folder must contain FASTA files, compressed TAR archives, statistics JSON files, and a sources.json file.
The fasta files in the data folder must follow some naming and structure conventions explained in the next sections. The general rules are:
- Files must respect the FASTA format.
- Files must contain only polypeptide sequences of the variable regions for both heavy and light chains.
- Only one file per species.
At any point, do not hesitate to look at the existing fasta files in the data folder to guide you.
FASTA naming convention
Each file must follow the format: Genus_species.fasta.
ex. : Homo_sapiens.fasta, Mus_musculus.fasta, Bos_taurus.fasta...
The taxonomical genus and species of the organism must be separated by an underscore, and the file name must end with the .fasta extension. The convention of biological nomenclature is to capitalize the genus and leave the species in lower case.
This convention allows ABSD to seamlessly add new species in the database and interface.
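As a minimal sketch, the naming convention above can be checked and parsed like this (speciesFromFilename is a hypothetical helper for illustration, not part of ABSD's code):

```javascript
// Parse a species name out of a data-folder file name, assuming the
// Genus_species.fasta convention: capitalized genus, lower-case species,
// separated by an underscore, with the .fasta extension.
function speciesFromFilename(filename) {
  const match = /^([A-Z][a-z]+)_([a-z]+)\.fasta$/.exec(filename);
  if (match === null) {
    throw new Error(`File name does not follow the convention: ${filename}`);
  }
  const [, genus, species] = match;
  return `${genus} ${species}`;
}

console.log(speciesFromFilename('Homo_sapiens.fasta')); // "Homo sapiens"
```

The regular expression is deliberately strict, so a file like homo_sapiens.fasta (lower-case genus) would be rejected.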
FASTA entries
The files must contain fasta entries for both light chain and heavy chain sequences. For each antibody, the light chain entry must come first and be immediately followed by the heavy chain entry.
For instance here, the 6CE0 antibody light and heavy chains are followed by 4R4H chains:
>6CE0|||6CE0_6|Chain F[auth L]|...
LSVALGETARISCGRQALGSRAVQWYQHKPGQAP...
>6CE0|||6CE0_5|Chain E[auth H]|PGT124...
QVQLQESGPGLVRPSETLSVTCIVSGGSISNYYWTWIRQSPGK...
>4R4N|||4R4N_3|Chains AA[auth D],...
DIQMTQSPSFVSASVGDRVTITCRASQGISSYLAW...
>4R4H|||4R4H_4|Chain D[auth H]|Antib...
QVQLQQWGAGLLKPSETLSLTCGVYGESLSGHYWS...
...
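This light-then-heavy alternation means that consecutive pairs of entries form one antibody. A minimal sketch of that pairing rule (pairAntibodies is a hypothetical helper, and `entries` stands for already-parsed FASTA records in file order):

```javascript
// Pair FASTA entries two by two: the light chain comes first,
// immediately followed by the heavy chain of the same antibody.
function pairAntibodies(entries) {
  if (entries.length % 2 !== 0) {
    throw new Error('Expected an even number of entries (light/heavy pairs)');
  }
  const antibodies = [];
  for (let i = 0; i < entries.length; i += 2) {
    antibodies.push({ light: entries[i], heavy: entries[i + 1] });
  }
  return antibodies;
}
```

A file with an odd number of entries is necessarily malformed under this convention, hence the error.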
The fasta headers must start with a main ID (like 6CE0), followed by an aggregation of other fasta headers gathered from external databases such as PDB or UNIPROT, separated by triple pipes (|||).
Each of these sub-headers must follow the format header;id;vGeneSegment;source where:
- header is the fasta header itself.
- id is the identifier of the chain found in the source. It is used to build URL links in the interface.
- vGeneSegment is the name of V gene segment from which the sequence has been produced, according to blast.
- source is the name of the source where this header was found.
Example: 5UM8_3|Chain C[auth L]|Fab PGT124 light chain|Homo sapiens (9606);5um8;IGLV3;PDB
This header ends with ;5um8;IGLV3;PDB, which indicates that it was found in the PDB with the 5um8 identifier and that the sequence is derived from the IGLV3 gene.
It is thus accessible at https://www.rcsb.org/structure/5um8.
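A minimal sketch of splitting such a sub-header into its four fields (parseSubHeader is a hypothetical helper, not part of ABSD's code). Since the original header part may itself contain semicolons, the last three ';'-separated fields are taken as id, vGeneSegment and source, and everything before them is the header:

```javascript
// Split a sub-header of the form header;id;vGeneSegment;source,
// where only the last three fields are guaranteed to be ';'-free.
function parseSubHeader(subHeader) {
  const parts = subHeader.split(';');
  if (parts.length < 4) {
    throw new Error(`Malformed sub-header: ${subHeader}`);
  }
  const source = parts.pop();
  const vGeneSegment = parts.pop();
  const id = parts.pop();
  return { header: parts.join(';'), id, vGeneSegment, source };
}

const example =
  '5UM8_3|Chain C[auth L]|Fab PGT124 light chain|Homo sapiens (9606);5um8;IGLV3;PDB';
console.log(parseSubHeader(example).id); // "5um8"
```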
Why is there an ID for each fasta header?
The sequences may be stored in multiple sources and thus have different identifiers. Sometimes, they can even be duplicated within the same database.
This is why it's important to keep the id for each fasta header in order to ensure proper linking.
sources.json
The sources.json file must be a JSON file and contain metadata describing the sources present in the fasta data. It is imported by the interface to create links and documentation.
Each key must be the name of the source and the value must be an object with the optional link and description properties:
- link: contains the URL of the source with an optional {id} token. This token is replaced by the fasta id to generate a link in the interface.
- description: a short, textual description of the source. It is mostly used for documentation in the interface.
For instance:
"IMGT": {
"link": "https://www.imgt.org/3Dstructure-DB/cgi/details.cgi?pdbcode={id}"
},
"PDB": {
"link": "https://www.rcsb.org/structure/{id}"
},
...
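The {id} substitution described above can be sketched in one line (buildLink is a hypothetical helper, not ABSD's actual implementation):

```javascript
// Turn a sources.json link template into a concrete URL by
// substituting the {id} token with a fasta id.
function buildLink(template, id) {
  return template.replace('{id}', id);
}

console.log(buildLink('https://www.rcsb.org/structure/{id}', '5um8'));
// https://www.rcsb.org/structure/5um8
```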
Compressed TAR archives
The tar.gz archives are generated by hand at each update of the data, using the POSIX tar tool. They contain the FASTA files of each species, along with a statistics file about the antibodies.
They must be named with only the date in ISO format (YYYY-MM-DD) and always end with the .tar.gz extension.
This allows easy sorting of the archives in chronological order, since with ISO-formatted dates, alphabetical and chronological orders are the same.
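This property is easy to verify: a plain lexicographic sort of hypothetical archive names is already chronological.

```javascript
// With ISO (YYYY-MM-DD) dates, alphabetical order == chronological order,
// so a plain string sort is enough. The file names below are made up.
const archives = ['2023-11-02.tar.gz', '2022-01-15.tar.gz', '2023-05-30.tar.gz'];
archives.sort();
console.log(archives[0]); // "2022-01-15.tar.gz" (the oldest archive)
```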
Database collections
In MongoDB, the database is created with the name provided by the ABSD_DB_NAME environment variable.
Two collections are then produced:
- antibodies: contains the antibodies themselves
- statistics: contains the statistics of the database
Those collections follow the JSON schemas stored in the src/server/schemas directory.
Antibodies identifiers
Since antibodies are gathered from multiple sources and databases, they need their own ID in ABSD.
This ID is a SHA256 hash, computed and stored during database creation. The seed of the hash is the concatenation of:
- the species name, including the space between genus and species.
- the sequence of the heavy chain.
- the sequence of the light chain.
Example:
- Mus musculus
- EIQLQQSGPELVKPGTSVKVSCKASGYALTSYTMYWVKQSHGKSLEWIGYIDPYNGGTSYNQKFKGKATLTVDKSSSTAYMHLNSLTSEDSAVYYCAGWNRYDEDWGQGTTLTVSSA
- DIVLTQSPASLAVSLGQRATISCRTSETIDSYGNSFMHWYQQKPGQPPKLLIYRASNLKSGIPARFSGSGSRTDFTLTINPVEADDVATYYCQQTNEVMYTFGGGTKLEIK
The hash seed would be:
Mus musculusEIQLQQSGPELVKPGTSVKVSCKASGYALTSYTMYWVKQSHGKSLEWIGYIDPYNGGTSYNQKFKGKATLTVDKSSSTAYMHLNSLTSEDSAVYYCAGWNRYDEDWGQGTTLTVSSADIVLTQSPASLAVSLGQRATISCRTSETIDSYGNSFMHWYQQKPGQPPKLLIYRASNLKSGIPARFSGSGSRTDFTLTINPVEADDVATYYCQQTNEVMYTFGGGTKLEIK
and the result would be 5548f0d1bcbbd127ae566a6f61110764972c6498715baf6525302bf77af4b948
which actually exists in ABSD here.
Since hash collisions are very unlikely with SHA256, this has a few benefits:
- There is a permanent ID for each antibody, removing the need to maintain any kind of incremental system.
- The ID can be re-computed again when creating a new database.
- The ID allows users to check data integrity.
Development
To develop with ABSD, you first need to make sure that node.js is installed on your machine, with the proper version indicated in the .node-version file.
You will also need MongoDB, installed either with Docker or directly on your machine.
Configuration
ABSD is configured via environment variables, mostly used in the docker and k8s files.
You will most likely not need to set any of them, since the default values work well for simple development.
Variable | Type | Default | Description |
---|---|---|---|
NODE_ENV | string | 'development' | Type of environment. Can be 'development' or 'production'. |
ABSD_SERVER_HOST | string | '127.0.0.1' | The IP address to listen to. |
ABSD_SERVER_PORT | number | 3000 | Port of the server. |
ABSD_SERVER_PROXY | boolean | false | If the application is behind a proxy. |
ABSD_DB_HOST | string | '127.0.0.1' | IP address of the database. |
ABSD_DB_PORT | number | 27017 | Port of the database. |
ABSD_DB_NAME | string | 'absd' | Name of the database. |
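Assuming these variables are consumed through process.env (the names come from the table above, the fallback pattern is an illustrative sketch, not ABSD's actual code), reading them with their defaults could look like:

```javascript
// Read ABSD configuration from environment variables,
// falling back to the documented defaults when unset.
const config = {
  host: process.env.ABSD_SERVER_HOST || '127.0.0.1',
  port: Number(process.env.ABSD_SERVER_PORT || 3000),
  dbHost: process.env.ABSD_DB_HOST || '127.0.0.1',
  dbPort: Number(process.env.ABSD_DB_PORT || 27017),
  dbName: process.env.ABSD_DB_NAME || 'absd',
};
console.log(config.port); // 3000 unless ABSD_SERVER_PORT is set
```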
Set up and start
At the root of the project, simply enter a few npm commands to get started:
# Install all dependencies in the node_modules directory
npm install
# Build the database from the FASTA files
npm run build:db
# Start the development server
npm run dev
Test for production
In production, ABSD runs in a docker container. If you would like to test your changes, you can simply build the interface and then run the server in production mode:
# Build the interface (src/client) with the Vite bundler.
# The code will be generated in the dist/ folder.
npm run build
# Start the server to use the dist/ directory code
# instead of the Vite dev server.
npm start
# OR
# Start the server in a production environment
# NOTE: this will recreate the database automatically
export NODE_ENV=production; npm start
Validate the FASTA files
It is possible to quickly validate that the fasta files roughly follow the proper format. To do that, simply enter:
npm run validate
This will validate every .fasta file found in the data folder.
Compute statistics
If you ever wish to see the current statistics of your database, simply enter:
npm run stats
This will compute and output the statistics in JSON format. This command is also used to produce the statistics JSON files in the data folder when the fasta files are updated.
Credits
This project is conducted by the Bioinformatics and Biostatistics HUB of the Institut Pasteur.
- Project leader: Nicolas Maillet
- Web developer: Simon Malesys
- Scientific supervisor: Bertrand Saunier
- UI and UX design: Rachel Torchet
- Logos and identity: Richard Bosseau
Reproducibility
Raw data can be found on recherche.data.gouv.fr.
Software Heritage
This application is archived in the Software Heritage project.
Article
ABSD has been published in NAR Genomics and Bioinformatics and can be cited with the DOI of the associated research article: https://doi.org/10.1093/nargab/lqae171.