TaggingBackends

This library helps to implement automatic tagging backends for the Nyx larva tagger.

Template project for tagging backends

A tagging backend, called e.g. TaggingBackend, is a Python project with the following directory structure:

├── LICENSE                          <- Default is MIT.
├── README.md                        <- Project description.
├── data
│   ├── raw                          <- The input data are accessible from this
│   │                                   directory, with their original file structure.
│   ├── interim                      <- Preprocessed data and extracted features
│   │                                   can be stored in this directory.
│   └── processed                    <- Predicted labels from predict_model.py are
│                                       expected in this directory.
├── models                           <- Hyperparameters and weights of trained
│                                       classifiers can be stored here.
├── pyproject.toml                   <- Project definition file for Poetry.
├── src
│   └── taggingbackend               <- Python package name.
│       │                               Same as project name, with lowercase letters,
│       │                               hyphens converted into underscores.
│       │                               For example, `My-Tagger` becomes `my_tagger`.
│       ├── __init__.py              <- Defines variable `__version__`.
│       ├── data
│       │   └── make_dataset.py      <- Picks and converts raw data files; if files
│       │                               are to be written, they go into data/interim;
│       │                               optional.
│       ├── features
│       │   └── build_features.py    <- Extracts and saves features to file in
│       │                               data/interim; optional.
│       └── models
│           ├── train_model.py       <- Trains the behavior tagging algorithm and
│           │                           stores the trained model in models/;
│           │                           optional.
│           └── predict_model.py     <- Loads the trained model and features from
│                                       data/interim, and saves the resulting
│                                       labels to data/processed.
└── test
    ├── __init__.py                  <- Empty file.
    └── test_taggingbackend.py       <- Automated tests; optional.
                                        Filename is `test_<package_name>.py`.

The above structure borrows elements from the Cookiecutter Data Science project template, adapted for use with Poetry.
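The naming rule given in the tree above (project name in lowercase, hyphens converted into underscores) can be sketched in a few lines of Python. The function name `package_name` is hypothetical and is not part of taggingbackends; it only illustrates the rule.

```python
def package_name(project_name: str) -> str:
    """Derive a Python package name from a project name.

    Hypothetical helper: lowercases the name and converts hyphens
    into underscores, as described in the directory tree.
    """
    return project_name.lower().replace("-", "_")


print(package_name("My-Tagger"))       # my_tagger
print(package_name("TaggingBackend"))  # taggingbackend
```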

The src/<package_name>/{data,features,models} directories can accommodate Python modules (in subpackages <package_name>.{data,features,models} respectively). For example, the model can be implemented as a Python class in an additional file in src/<package_name>/models, e.g. mymodel.py. In this case, an empty __init__.py file should be created in the same directory.

Once the Python package is installed, this custom module can be loaded from anywhere with import <package_name>.models.mymodel.

In contrast, make_dataset.py, build_features.py, predict_model.py and train_model.py are Python scripts, each with a main program. These scripts are run with Poetry, from the project root.

See example scripts in the examplebackend directory.

Only predict_model.py is required by the Nyx tagging UI.

The simplest working directory structure for a tagging backend is:

├── models/
│   └── trained_model/
├── pyproject.toml
└── scripts/
    └── predict_model.py

where trained_model is the name of the trained model. Backends that do not need to store trained models should still have an empty subdirectory there, because the Nyx tagger UI looks for these subdirectories in models.

The data directory is automatically created by the BackendExplorer object, together with its raw and processed subdirectories; there is therefore no need to include these directories in the backend.
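The minimal layout above can be checked with a short script. This is only a sketch: the function name `missing_items` is hypothetical and not part of taggingbackends, and the check covers only the three items shown in the tree (pyproject.toml, scripts/predict_model.py, and at least one subdirectory in models).

```python
from pathlib import Path


def missing_items(project_root):
    """Return the items of the minimal backend layout that are missing.

    Hypothetical helper; it mirrors the minimal tree shown above and
    nothing more.
    """
    root = Path(project_root)
    missing = []
    if not (root / "pyproject.toml").is_file():
        missing.append("pyproject.toml")
    if not (root / "scripts" / "predict_model.py").is_file():
        missing.append("scripts/predict_model.py")
    models = root / "models"
    # At least one subdirectory is expected, even if it is empty.
    if not (models.is_dir() and any(p.is_dir() for p in models.iterdir())):
        missing.append("models/<trained_model>/")
    return missing
```

Calling `missing_items(".")` from the project root returns an empty list when all three items are present.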

Although the Nyx tagger UI does not expect the project to include a Python package, a Poetry-managed virtual environment should be set up with the taggingbackends package installed, so that the command poetry run tagging-backend is available at the project root directory.

The tests directory is renamed test for compatibility with Julia projects. Python/Poetry do not need additional configuration to properly handle the tests.

Input and output data

By default, the input data are copied into data/raw. Input data files can be in any format, and the backend is responsible for handling them. By default, training labels are provided as JSON files that can be loaded using the taggingbackends.data.labels.Labels class.

Predicted labels are expected in data/processed.

A backend can make use of data/interim or ignore it. Similarly, the trained models can be stored in models or not. However, a subdirectory by the name of the model instance should be created for model discovery (see below).

Full paths to data and model directories, and files, are made available in the training and prediction procedures with a taggingbackends.explorer.BackendExplorer object.

Labels

The internal representation is as follows:

  • dictionary of run identifiers (str, typically date_time) as keys and, as values:
    • dictionary of larva identifiers (int) as keys and, as values:
      • dictionary of timestamps (float) as keys and discrete behavioral states (str) as values.

Labels are encapsulated in a dedicated datatype that also stores metadata and information about labels (names, colors).
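The nested structure described above can be pictured with plain Python dictionaries. This is a sketch of the layout only, not of the Labels class itself; the run identifier, larva identifiers, timestamps and state names (`run`, `cast`, `stop`) are made-up values, and `state_at` is a hypothetical helper.

```python
# run identifier (str) -> larva identifier (int) -> timestamp (float) -> state (str)
labels = {
    "20230111_120000": {  # hypothetical run identifier (date_time)
        1: {0.0: "run", 0.1: "run", 0.2: "cast"},
        2: {0.0: "stop", 0.1: "stop"},
    },
}


def state_at(labels, run, larva, timestamp):
    """Look up the behavioral state of a larva at a given timestamp."""
    return labels[run][larva][timestamp]


print(state_at(labels, "20230111_120000", 1, 0.2))  # cast
```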

See taggingbackends.data.labels.Labels, and an example json labels file.

Model specification

A backend can train and use multiple model instances. Each instance is assigned an identifier and the related files, including data files, are actually stored in corresponding subdirectories in the data/raw, data/interim, data/processed and models directories.

By default, a new model is identified by a timestamp in the YYYYMMDD_HHMMSS format. Files are then looked up per instance: for example, the taggingbackends.explorer.BackendExplorer.list_input_files method looks for files in the data/raw/<instance_identifier> directory only.
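An identifier in the YYYYMMDD_HHMMSS format can be produced with the standard library alone. The sketch below assumes nothing about how taggingbackends generates its identifiers internally; the function name `new_instance_identifier` is hypothetical.

```python
from datetime import datetime


def new_instance_identifier(now=None):
    """Format a datetime as YYYYMMDD_HHMMSS (current time by default)."""
    now = datetime.now() if now is None else now
    return now.strftime("%Y%m%d_%H%M%S")


print(new_instance_identifier(datetime(2023, 1, 11, 9, 5, 30)))  # 20230111_090530
```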

A backend may or may not store the model in the models/<instance_identifier> directory (call the taggingbackends.explorer.BackendExplorer.model_dir method to get the exact location). In either case, the directory should be created to make the trained model discoverable, because the Nyx tagging UI looks for subdirectories in the models directory to list the available trained models.
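The discovery mechanism can be illustrated with pathlib. This sketch is not the UI's actual implementation, and in a real backend the location should come from BackendExplorer.model_dir; the functions `register_model` and `list_models` are hypothetical and assume the models directory sits directly under the project root.

```python
from pathlib import Path


def register_model(project_root, instance_identifier):
    """Create models/<instance_identifier>, even if it stays empty."""
    model_dir = Path(project_root) / "models" / instance_identifier
    model_dir.mkdir(parents=True, exist_ok=True)
    return model_dir


def list_models(project_root):
    """List the names of the subdirectories found in models/."""
    models = Path(project_root) / "models"
    if not models.is_dir():
        return []
    return sorted(p.name for p in models.iterdir() if p.is_dir())
```

A model instance whose directory was never created would not appear in the result of `list_models`, which is why the directory should exist even when no file is stored in it.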

Recommended installation

TaggingBackends uses both Python and Julia. In particular, Python-side taggingbackends may call Julia-side TaggingBackends to compile the data from a data repository, prior to training a tagger.

Communication between the two language stacks requires the Julia package TaggingBackends to be installed manually. Depending on where TaggingBackends is installed, a pointer to that location may be needed.

The simplest approach is to install it in the main Julia environment. In this case, you will have to explicitly install PlanarLarvae as well:

julia -e 'using Pkg; Pkg.add(url="https://gitlab.pasteur.fr/nyx/planarlarvae.jl"); Pkg.add(url="https://gitlab.pasteur.fr/nyx/TaggingBackends")'

This is enough for Python-side taggingbackends to find its Julia counterpart.

Another approach, which we recommend if you have LarvaTagger.jl installed, is to install TaggingBackends in the Julia environment associated with LarvaTagger and to set the JULIA_PROJECT environment variable so that it points to the directory of that environment. This directory is typically the root directory of your local copy of LarvaTagger.jl, if you installed LarvaTagger using the julia --project=. option.

To do so, on Unix-like OSes:

cd /path/to/LarvaTagger.jl/
export JULIA_PROJECT=$(pwd)
julia --project=. -e 'using Pkg; Pkg.add(url="https://gitlab.pasteur.fr/nyx/TaggingBackends")'

The export statement above sets the JULIA_PROJECT environment variable for the current shell session.

To install a backend, taking MaggotUBA-adapter as an example:

git clone --depth 1 --single-branch -b 20230111 https://gitlab.pasteur.fr/nyx/MaggotUBA-adapter MaggotUBA
cd MaggotUBA
poetry install

You can check for the message "PyCall is already installed" in the output of:

poetry run python -c 'import julia; julia.install()'

Note that, if PyCall is not found, the above command will install it. However, TaggingBackends still needs to be installed for Python-side taggingbackends to successfully call Julia-side TaggingBackends.

A major drawback of this second approach is that the JULIA_PROJECT environment variable must be set accordingly every time the train command is called. For example, from the backend directory tree, on Unix-like OSes:

JULIA_PROJECT=<path> poetry run tagging-backend train

or, from LarvaTagger.jl root directory:

JULIA_PROJECT=<path> scripts/larvatagger.jl train

where <path> is the path of the Julia project with TaggingBackends installed.

Note, however, that the last command above will not work if Julia was installed using juliaup; prefer jill instead.
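When the train command is launched from a Python script rather than from a shell, the same per-call JULIA_PROJECT setting can be passed through the environment of the child process. The helper below is a sketch; `env_with_julia_project` is a hypothetical name, and the path in the commented usage is the placeholder used earlier in this section.

```python
import os


def env_with_julia_project(julia_project, base_env=None):
    """Return a copy of the environment with JULIA_PROJECT set.

    The current process environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["JULIA_PROJECT"] = str(julia_project)
    return env


# Usage (requires Poetry and an installed backend; not executed here):
# import subprocess
# subprocess.run(
#     ["poetry", "run", "tagging-backend", "train"],
#     env=env_with_julia_project("/path/to/LarvaTagger.jl"),
#     check=True,
# )
```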