Usage: Nextflow (License: GPL-3.0)
Requirements: Nextflow, Apptainer, Graphviz



  • Annotation of mRNA sequences of the immunoglobulin heavy or light variable region (sequences provided as fasta files).
  • Clustering of the annotated sequences into clonal groups (same germline origin).
  • Tree visualization of the clonal groups.


  • Currently dedicated only to the analysis of VDJ repertoires (corresponding to the germlines/imgt/<SPECIES>/vdj folder of the IMGT database).
  • To build the repertoire contingency tables and heatmaps, the script currently takes the first annotation of the imgt annotation if several are present in the v_call or j_call column of the all_passed_seq.tsv file.
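
The "first annotation" rule above can be illustrated with a minimal shell sketch. The helper name first_call is hypothetical and is not part of the pipeline; it only shows what keeping the first allele of a comma-separated v_call or j_call value means.

```shell
# first_call: keep only the first allele of a comma-separated call.
# Hypothetical helper for illustration, not a function of the pipeline.
first_call() {
    # ${1%%,*} removes everything from the first comma onwards
    printf '%s\n' "${1%%,*}"
}

first_call 'IGKV1-39*01,IGKV1D-39*01'   # prints IGKV1-39*01
```

A value without a comma is returned unchanged.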


Files and folder Description
File that can be executed using a Linux terminal, a MacOS terminal or Windows 10 WSL2.
nextflow.config Parameter settings for the file. Users have to open this file, set the desired settings and save these modifications before execution.
bin folder Contains files required by the file.
xlsx2fasta.R Accessory file that creates all the fasta files required from a .xlsx file. To use it, 1) open the file, 2) complete the "Parameters that need to be set by the user" section, 3) save the modifications and 4) run the file in R.
Licence.txt Licence of the release.


Required files
A folder (zipped or not) containing nucleotide fasta files, each containing a single sequence. Use xlsx2fasta.R if the sequences are in a .xlsx file.
A metadata file (optional) for adding information to the results.

The dataset used in the nextflow.config file, as example, is available at

Use this code to split a multi sequence fasta file into fasta files made of a single sequence:

FASTA_FILE="./test.fasta" # add path and name of the fasta file here
# writes cut_1.fasta, cut_2.fasta, ... in the current directory (slice_size sequences per file)
awk -v slice_size=1 -v prefix="cut" '$1 ~ /^>/{nbSeq++; currSlice=int((nbSeq-1)/slice_size)+1; myOutFile=prefix"_"currSlice".fasta"}{print $0 > myOutFile}' "${FASTA_FILE}"
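
Since each input fasta file must contain a single sequence, it may help to check a folder before running. The sketch below is only an illustration: the function names are hypothetical, and it assumes that the files use the .fasta extension and that each sequence starts with a ">" header line.

```shell
# count_fasta_records: print the number of header lines (">") in a fasta file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# check_single_seq_dir: list (file<TAB>count) the .fasta files of a folder
# that do not contain exactly one sequence; returns non-zero if any is found.
check_single_seq_dir() {
    cs_status=0
    for cs_file in "$1"/*.fasta; do
        [ -e "$cs_file" ] || continue   # folder without any .fasta file
        cs_n=$(count_fasta_records "$cs_file")
        if [ "$cs_n" -ne 1 ]; then
            printf '%s\t%s\n' "$cs_file" "$cs_n"
            cs_status=1
        fi
    done
    return "$cs_status"
}
```

For example, check_single_seq_dir ./my_fasta_folder prints nothing when every file holds a single sequence (the folder name is a placeholder).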


1. Prerequisite

Installation of:
nextflow DSL2
Graphviz (on Ubuntu Linux: sudo apt install graphviz)
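
A quick way to confirm that the prerequisites are reachable from the terminal is sketched below. The function name missing_tools is hypothetical. Graphviz is tested through its dot executable, and apptainer is included because it is listed among the dependencies.

```shell
# missing_tools: print each command name that is not found on the PATH.
missing_tools() {
    for mt_tool in "$@"; do
        command -v "$mt_tool" >/dev/null 2>&1 || printf '%s\n' "$mt_tool"
    done
}

# Report what is missing, without stopping the shell session.
missing=$(missing_tools nextflow apptainer dot)
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing" >&2
fi
```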

2. Local running (personal computer)

2.1. file in the personal computer

  • Mount a server if required:
DRIVE="Z" # change the letter to fit the correct drive
sudo mkdir /mnt/share
sudo mount -t drvfs $DRIVE: /mnt/share

Warning: if the server is not mounted, nextflow may do nothing or display a message like:

Launching `` [loving_morse] - revision: d5aabe528b
  • Run the following command from where the and nextflow.config files are (example: \wsl$\Ubuntu-20.04\home\gael):
nextflow run -c nextflow.config

with -c to specify the name of the config file used.

2.2. file in the public gitlab repository

Run the following command from where you want the results:

nextflow run -hub pasteur gmillot/repertoire_profiler -r v1.0.0

3. Distant running (example with the Pasteur cluster)

3.1. Pre-execution

Copy-paste this after having modified the EXEC_PATH variable:

EXEC_PATH="/pasteur/helix/projects/BioIT/gmillot/repertoire_profiler" # where the bin folder of the script is located
export CONF_BEFORE=/opt/gensoft/exe # on maestro

export JAVA_CONF=java/13.0.2
export JAVA_CONF_AFTER=bin/java # on maestro
export APP_CONF=apptainer/1.3.5
export APP_CONF_AFTER=bin/apptainer # on maestro
export GIT_CONF=git/2.39.1
export GIT_CONF_AFTER=bin/git # on maestro
export GRAPHVIZ_CONF=graphviz/2.42.3
export GRAPHVIZ_CONF_AFTER=bin/graphviz # on maestro

chmod 755 ${EXEC_PATH}/bin/*.* # not required if no bin folder
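
Because the commands of the next sections expand several shell variables, an unset variable can silently produce a wrong command line. The following sketch shows one possible guard; require_vars is a hypothetical helper, not part of the pipeline.

```shell
# require_vars: return non-zero and name every listed variable that is unset or empty.
# The names passed as arguments must be plain variable names (eval is used to read them).
require_vars() {
    rv_status=0
    for rv_name in "$@"; do
        eval "rv_value=\${$rv_name:-}"
        if [ -z "$rv_value" ]; then
            printf 'Variable not set: %s\n' "$rv_name" >&2
            rv_status=1
        fi
    done
    return "$rv_status"
}
```

For example, require_vars EXEC_PATH CONF_BEFORE JAVA_CONF APP_CONF GIT_CONF GRAPHVIZ_CONF can be run after the export lines, and require_vars MODULES HELIXHOME before calling nextflow.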

3.2. file in a cluster folder

Modify the second line of the code below, and run from where the and nextflow.config files are (which has been set thanks to the EXEC_PATH variable above):

HOME="${HELIXHOME}/repertoire_profiler/" # $HOME changed to allow the creation of .nextflow into /$HELIXHOME/repertoire_profiler/, for instance. See NXF_HOME in the nextflow software script
nextflow run --modules ${MODULES} -c nextflow.config

3.3. file in the public gitlab repository

Modify the first and third lines of the code below, and run (results will be where the EXEC_PATH variable has been set above):

HOME="${HELIXHOME}/repertoire_profiler/" # $HOME changed to allow the creation of .nextflow into /$HELIXHOME/repertoire_profiler/, for instance. See NXF_HOME in the nextflow software script
nextflow run --modules ${MODULES} -hub pasteur gmillot/repertoire_profiler -r $VERSION -c $HOME/nextflow.config

4. Error messages and solutions

Message 1

Unknown error accessing project `gmillot/repertoire_profiler` -- Repository may be corrupted: /pasteur/sonic/homes/gmillot/.nextflow/assets/gmillot/repertoire_profiler

Purge using:

rm -rf /pasteur/sonic/homes/gmillot/.nextflow/assets/gmillot*

Message 2

WARN: Cannot read project manifest -- Cause: Remote resource not found:

Contact Gael Millot (distant repository is not public).

Message 3

permission denied

Use chmod to change the user rights. Example linked to files in the bin folder:

chmod 755 bin/*.*
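
To see which files would trigger this message before running, the files lacking the execute permission can be listed. This is a minimal sketch with a hypothetical function name; it assumes a find implementation that supports -maxdepth (GNU and BSD find do).

```shell
# list_non_executable: print the regular files directly inside a folder
# whose owner-execute bit is not set (-perm -100 matches files with bit 0100 set).
list_non_executable() {
    find "$1" -maxdepth 1 -type f ! -perm -100
}
```

For example, list_non_executable bin prints nothing once chmod 755 has been applied to every file of the bin folder.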


An example of results obtained with the dataset is present at this address:

Complete information is in the Protocol 144-rev0 Ig clustering - Immcantation.docx (contact Gael Millot).

Mandatory elements:

repertoire_profiler_<UNIQUE_ID> folder Description
reports Folder containing all the reports of the different processes, including the nextflow.config file used.
repertoires Folder containing the repertoires, i.e., contingency tables of the VDJ allele usage from the all_passed_seq.tsv file (see below). Warning: the script currently takes the first annotation of the imgt annotation if several are present in the v_call or j_call column of the all_passed_seq.tsv file (e.g., v_call with IGKV1-39*01,IGKV1D-39*01), so that contingencies are identical to those from the donut frequencies, which use the germline_v_call and germline_j_call columns (allele reassignment by the immcantation tool).
png Folder containing the graphs in png format.
svg Folder containing the graphs in svg vectorial format.
RData Folder containing, for each clonal group, objects that can be used in R to further analyze or plot the data:
  • db: tibble data frame resulting from the import by the alakazam::readChangeoDb() function
  • clones: db in the airrClone format
  • trees: output of the dowser::getTrees() function using the clones object as input (igphyml tree)

  • Also contains the all_trees.RData file that combines the trees R objects of the different files in a single trees object.
seq_distance.pdf Distribution of the distances between the two nearest sequences (see the nearest_distance column in the all_passed_seq.tsv file).
donuts.pdf Donut plots showing the frequency of sequences per clonal group, among:
  • all: all the passed sequences (all_passed_seq.tsv output file).
  • annotated: as the "all" donut but using all the passed sequences that have been annotated using the meta_name_replacement parameter of the nextflow.config file if not "NULL".
  • trees: all the sequences used for germline trees (germ_tree_seq.tsv output file).
repertoire.pdf Heatmaps of the files from the repertoires folder (see above), showing the frequency of alleles used among all the passed sequences ("all"), non-empty cells ("non-zero") and "annotated" sequences (if metadata are provided). Non-zero means that unused alleles are removed from the heatmap (empty row or column). Warning: to build the repertoire contingencies, the script currently takes the first annotation of the imgt annotation if several are present in the v_call or j_call column of the all_passed_seq.tsv file (see the all_passed_seq_several_annot_igmt.tsv file below).
germ_tree.pdf Phylogenetic trees of the sequences that belong to a clonal (supposedly germline) group made of at least n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file (one page per clonal group). Warning: clonal group full names are those given by dowser::formatClones, i.e., those from the germline_v_call and germline_j_call columns of the all_passed_seq.tsv file.
germ_no_tree.pdf All the clonal groups with fewer than n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file (one page per clonal group). Clonal group information is recapitulated on each page.
donut_stat.tsv stats associated to the donuts.pdf file.
igblast_unaligned_seq.tsv Names of sequences that failed to be annotated by igblast (empty file if all the sequences are annotated).
igblast_aligned_seq.tsv Names of sequences annotated by igblast. If this file is empty, a subsequent nextflow failure occurs. The total number of lines in igblast_unaligned_seq.tsv and igblast_aligned_seq.tsv is equal to the number of submitted .fasta files.
productive_seq.tsv Productive sequences annotated by igblast (see the unproductive_seq.tsv file for sequences that failed to be productively annotated by igblast). Productive means: (1) coding region has an open reading frame, (2) no defect in the start codon, splicing sites or regulatory elements, (3) no internal stop codons, (4) an in-frame junction region.
all_passed_seq.tsv Sequences from the productive_seq.tsv file with germline clustering (clone ID), allele reannotation (germline_v_call and germline_j_call columns), mutation load, distance and sequence nickname (annotation from the metadata file) added. Warning: the number of sequences (i.e., rows) can be lower than in the productive_seq.tsv file because some sequences may fail to be assigned to a clone (see the non_clone_assigned_sequence.tsv file).
Column description:
  • sequence_id: Unique query sequence identifier for the Rearrangement. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. When downloaded from an AIRR Data Commons repository, this will usually be a universally unique record locator for linking with other objects in the AIRR Data Model.
  • sequence: The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment.
  • rev_comp: True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of 'sequence'.
  • productive: True if the V(D)J sequence is predicted to be productive.
  • v_call: V gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHV4-59*01 if using IMGT/GENE-DB).
  • d_call: First or only D gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01 if using IMGT/GENE-DB).
  • j_call: J gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHJ4*02 if using IMGT/GENE-DB).
  • sequence_alignment: Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement.
  • germline_alignment: Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any). If understood correctly, this sequence is built from the VDJ sequences in databases that match the sequence in sequence_alignment, with only gaps included. Nucleotides can differ from those in sequence_alignment. Warning: sequences with the same clone_id can have different germline_alignment values.
  • junction: Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons.
  • junction_aa: Amino acid translation of the junction.
  • v_cigar: CIGAR string for the V gene alignment. See protocol 50
  • d_cigar: CIGAR string for the first or only D gene alignment. See protocol 50
  • j_cigar: CIGAR string for the J gene alignment. See protocol 50
  • stop_codon: True if the aligned sequence contains a stop codon.
  • vj_in_frame: True if the V and J gene alignments are in-frame.
  • locus: Gene locus (chain type). Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature.
  • junction_length: Number of nucleotides in the junction sequence.
  • np1_length: Number of nucleotides between the V gene and first D gene alignments or between the V gene and J gene alignments.
  • np2_length: Number of nucleotides between either the first D gene and J gene alignments or the first D gene and second D gene alignments.
  • v_sequence_start: Start position of the V gene in the query sequence (1-based closed interval).
  • v_sequence_end: End position of the V gene in the query sequence (1-based closed interval).
  • v_germline_start: Alignment start position in the V gene reference sequence (1-based closed interval).
  • v_germline_end: Alignment end position in the V gene reference sequence (1-based closed interval).
  • d_sequence_start: Start position of the first or only D gene in the query sequence. (1-based closed interval).
  • d_sequence_end: End position of the first or only D gene in the query sequence. (1-based closed interval).
  • d_germline_start: Alignment start position in the D gene reference sequence for the first or only D gene (1-based closed interval).
  • d_germline_end: Alignment end position in the D gene reference sequence for the first or only D gene (1-based closed interval).
  • j_sequence_start: Start position of the J gene in the query sequence (1-based closed interval).
  • j_sequence_end: End position of the J gene in the query sequence (1-based closed interval).
  • j_germline_start: Alignment start position in the J gene reference sequence (1-based closed interval).
  • j_germline_end: Alignment end position in the J gene reference sequence (1-based closed interval).
  • v_score: Alignment score for the V gene. See raw score
  • v_identity: Fractional identity for the V gene alignment (proportion)
  • v_support: V gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the V gene assignment as defined by the alignment tool.
  • d_score: Alignment score for the first or only D gene alignment.
  • d_identity: Fractional identity for the first or only D gene alignment.
  • d_support: D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the first or only D gene as defined by the alignment tool.
  • j_score: Alignment score for the J gene alignment.
  • j_identity: Fractional identity for the J gene alignment.
  • j_support: J gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the J gene assignment as defined by the alignment tool.
  • fwr1: Nucleotide sequence of the aligned FWR1 region of the query sequence (i.e., sequence_alignment field).
  • fwr2: Nucleotide sequence of the aligned FWR2 region of the query sequence (i.e., sequence_alignment field).
  • fwr3: Nucleotide sequence of the aligned FWR3 region of the query sequence (i.e., sequence_alignment field).
  • fwr4: Nucleotide sequence of the aligned FWR4 region of the query sequence (i.e., sequence_alignment field).
  • cdr1: Nucleotide sequence of the aligned CDR1 region of the query sequence (i.e., sequence_alignment field).
  • cdr2: Nucleotide sequence of the aligned CDR2 region of the query sequence (i.e., sequence_alignment field).
  • cdr3: Nucleotide sequence of the aligned CDR3 region of the query sequence (i.e., sequence_alignment field).
  • sequence_alignment_aa: Amino acid translation of the sequence_alignment column.
  • clone_id: Clone number. A given clone_id groups all the sequences that putatively derive from the same germline cell.
  • germline_alignment_d_mask: as germline_alignment but with D masked (i.e., replaced by N, in the middle of the CDR3). Because the D-segment call for B cell receptor alignments is often low confidence, the default germline format (-g dmask) places Ns in the N/P and D-segments of the junction region rather than using the D-segment assigned during reference alignment; this can be modified to generate a complete germline (-g full) or a V-segment only germline (-g vonly)
  • germline_v_call: V germline cassette
  • germline_d_call: D germline cassette (usually NA)
  • germline_j_call: J germline cassette
  • mu_count_cdr_r: number of replacement mutations in CDR1 and CDR2 of the V-segment (comparing the sequence_alignment and germline_alignment_d_mask columns).
  • mu_count_cdr_s: number of silent mutations in CDR1 and CDR2 of the V-segment.
  • mu_count_fwr_r: number of replacement mutations in FWR1, FWR2 and FWR3 of the V-segment.
  • mu_count_fwr_s: number of silent mutations in FWR1, FWR2 and FWR3 of the V-segment.
  • mu_count: number of replacement and silent mutations (sum of the previous columns)
  • mu_freq_cdr_r: frequency of replacement mutations in CDR1 and CDR2 of the V-segment (if frequency=TRUE, R and S mutation frequencies are calculated over the number of non-N positions in the specified regions).
  • mu_freq_cdr_s: frequency of silent mutations in CDR1 and CDR2 of the V-segment (idem).
  • mu_freq_fwr_r: frequency of replacement mutations in FWR1, FWR2 and FWR3 of the V-segment (idem).
  • mu_freq_fwr_s: frequency of silent mutations in FWR1, FWR2 and FWR3 of the V-segment (idem).
  • mu_freq: frequency of replacement and silent mutations (sum of the previous columns)
  • dist_nearest: minimal distance from the nearest sequence using the model from the clone_model parameter (Hamming by default). NA if no other sequence has the same V, J and junction length, or if another sequence is strictly identical (should be 0 but NA is returned).
all_passed_seq_several_annot_igmt.tsv Sequences from the all_passed_seq.tsv file with several annotations in the v_call and j_call columns, because several alleles had the best alignment using imgt blast (same )
unproductive_seq.tsv Sequences that failed productive annotations by igblast (empty file if all the sequences are productively annotated).
non_clone_assigned_sequence.tsv Productive sequences that failed to be assigned to a clone ID by the function (empty file if all the sequences are assigned).
germ_tree_clone_id.tsv Clonal group IDs used in the germline tree analysis (clonal group with at least n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file).
germ_tree_dismissed_clone_id.tsv Clonal group IDs not used in the germline tree analysis (clonal group with fewer than n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file).
germ_tree_seq.tsv Sequences of the all_passed_seq.tsv file used in the germline tree analysis (clonal group with at least n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file).
germ_tree_dismissed_seq.tsv Sequences of the all_passed_seq.tsv file not used in the germline tree analysis (clonal group with fewer than n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file).
germ_tree_dup_seq_not_displayed.tsv Sequences used in the germline tree analysis but not displayed in the graph, (1) because they are strictly identical to another sequence already in the tree and (2) because the tree_duplicate_seq parameter of the nextflow.config file has been set to "FALSE".
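
The split between the germ_tree and germ_tree_dismissed files described above depends on the number of sequences per clone_id. As an illustration only, the sketch below tallies clonal group sizes from a tab-separated file with a header row containing a clone_id column (as all_passed_seq.tsv is described above). The function names are hypothetical and this is not the code used by the pipeline.

```shell
# clone_sizes: print "clone_id<TAB>count" for each clonal group of a TSV file,
# locating the clone_id column by its header name.
clone_sizes() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "clone_id") col = i; next }
        col && $col != "" { n[$col]++ }
        END { for (id in n) print id "\t" n[id] }
    ' "$1" | sort
}

# clones_with_at_least: print the clone IDs having at least N sequences
# (N plays the role of the nb_seq_per_clone parameter).
clones_with_at_least() {
    clone_sizes "$1" | awk -F '\t' -v min="$2" '$2 >= min { print $1 }'
}
```

For example, clones_with_at_least all_passed_seq.tsv 3 lists the clonal groups that would reach a threshold of 3 sequences.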

Optional elements, only returned if the igblast_aa parameter is 'false' and if the input fasta files contain nucleotide sequences:

repertoire_profiler_<UNIQUE_ID> folder Description
aa Folder containing the translation of the sequence_alignment column of the productive_seq.tsv file, as fasta files.
aligned_seq Folder containing the sequence_alignment column of the productive_seq.tsv file, as fasta files.
aa.tsv File containing all the translations of the sequence_alignment column of the productive_seq.tsv file.


The different releases are tagged here.


This package of scripts can be redistributed and/or modified under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. It is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License in the attached Licence.txt file for more details.


Version V10.3:

Dejoux A, Zhu Q, Ganneau C, Goff OR, Godon O, Lemaitre J, Relouzat F, Huetz F, Sokal A, Vandenberghe A, Pecalvel C, Hunault L, Derenne T, Gillis CM, Iannascoli B, Wang Y, Rose T, Mertens C, Nicaise-Roland P; NASA Study Group; England P, Mahévas M, de Chaisemartin L, Le Grand R, Letscher H, Saul F, Pissis C, Haouz A, Reber LL, Chappert P, Jönsson F, Ebo DG, Millot GA, Bay S, Chollet-Martin S, Gouel-Chéron A, Bruhns P. Rocuronium-specific antibodies drive perioperative anaphylaxis but can also function as reversal agents in preclinical models. Sci Transl Med. 2024 Sep 11;16(764):eado4463. doi: 10.1126/scitranslmed.ado4463. Epub 2024 Sep 11. PMID: 39259810.


Pascal Chappert, INSERM U1151 Institut Necker Enfants Malades, Paris, France

Frédéric Lemoine, Bioinformatics and Biostatistics Hub, Institut Pasteur, Paris, France

Gael A. Millot, Bioinformatics and Biostatistics Hub, Institut Pasteur, Paris, France


The developers & maintainers of the mentioned software and packages, including:

Special acknowledgement to Kenneth Hoehn, Yale School of Medicine, New Haven, CT, USA



