TaggingBackends
This library helps to implement automatic tagging backends for the Nyx larva tagger.
Template project for tagging backends
A tagging backend, called e.g. `TaggingBackend`, is a Python project with the following directory structure:
├── LICENSE            <- Default is MIT.
├── README.md          <- Project description.
├── data
│   ├── raw            <- The input data are accessible from this
│   │                     directory, with their original file structure.
│   ├── interim        <- Preprocessed data and extracted features
│   │                     can be stored in this directory.
│   └── processed      <- Predicted labels from predict_model.py are
│                         expected in this directory.
├── models             <- Hyperparameters and weights of trained
│                         classifiers can be stored here.
│
├── pyproject.toml     <- Project definition file for Poetry.
├── src
│   └── taggingbackend <- Python package name.
│       │                 Same as project name, with lowercase letters,
│       │                 hyphens converted into underscores.
│       │                 For example, `My-Tagger` becomes `my_tagger`.
│       ├── __init__.py <- Defines variable `__version__`.
│       ├── data
│       │   └── make_dataset.py <- Picks and converts raw data files; if files
│       │                          are to be written, they go into data/interim;
│       │                          optional.
│       ├── features
│       │   └── build_features.py <- Extracts and saves features to file in
│       │                            data/interim; optional.
│       └── models
│           ├── train_model.py   <- Trains the behavior tagging algorithm and
│           │                       stores the trained model in models/;
│           │                       optional.
│           └── predict_model.py <- Loads the trained model and features from
│                                   data/interim, and stores the resulting
│                                   labels in data/processed.
└── test
    ├── __init__.py    <- Empty file.
    └── test_taggingbackend.py <- Automated tests; optional.
                                  Filename is `test_<package_name>.py`.
The above structure borrows elements from the Cookiecutter Data Science project template, adapted for use with Poetry.
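The naming rule for the Python package (lowercase letters, hyphens converted into underscores) can be sketched as a small helper. The function name `package_name` is hypothetical and not part of `taggingbackends`; it only covers the two transformations stated in the directory listing.

```python
def package_name(project_name: str) -> str:
    """Derive a package name from a project name (illustrative helper only)."""
    # Lowercase everything, then turn hyphens into underscores.
    return project_name.lower().replace("-", "_")


print(package_name("My-Tagger"))  # my_tagger
```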
The `src/<package_name>/{data,features,models}` directories can accommodate Python modules (in subpackages `<package_name>.{data,features,models}` respectively).
For example, the model can be implemented as a Python class in an additional file in `src/<package_name>/models`, e.g. `mymodel.py`.
In this case, an empty `__init__.py` file should be created in the same directory.
As the Python package is installed, this custom module will be loadable from anywhere with `import <package_name>.models.mymodel`.
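As a rough illustration of this layout, the sketch below builds a throwaway package in a temporary directory and imports the custom module. The package name `my_tagger`, the class `MyModel` and its `predict` method are all hypothetical, and adding `src` to `sys.path` merely stands in for installing the package with Poetry.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_demo_package(root: Path, package: str = "my_tagger") -> Path:
    """Create src/<package>/models/mymodel.py and the required __init__.py files."""
    src = root / "src"
    models = src / package / "models"
    models.mkdir(parents=True)
    (src / package / "__init__.py").write_text('__version__ = "0.1.0"\n')
    (models / "__init__.py").write_text("")  # empty file marking the subpackage
    (models / "mymodel.py").write_text(
        "class MyModel:\n"
        "    def predict(self, features):\n"
        "        # Placeholder logic: one dummy label per input item.\n"
        "        return ['unknown' for _ in features]\n"
    )
    return src


with tempfile.TemporaryDirectory() as tmp:
    src = make_demo_package(Path(tmp))
    sys.path.insert(0, str(src))  # stands in for installing the package
    importlib.invalidate_caches()
    try:
        mymodel = importlib.import_module("my_tagger.models.mymodel")
        predictions = mymodel.MyModel().predict([0.1, 0.2])
    finally:
        sys.path.remove(str(src))

print(predictions)
```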
On the other hand, `make_dataset.py`, `build_features.py`, `predict_model.py` and `train_model.py` are Python scripts, with a main program.
These scripts will be run using Poetry, from the project root.
See example scripts in the `examplebackend` directory.
Only `predict_model.py` is required by the Nyx tagging UI.
The simplest working directory structure for a tagging backend is:
├── models/
│   └── trained_model/
├── pyproject.toml
└── scripts/
    └── predict_model.py
where `trained_model` is the name of the trained model.
Backends that do not need to store trained models should still have an empty subdirectory there, because the Nyx tagger UI looks for these subdirectories in `models`.
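The minimal layout can be scaffolded with a few lines of standard-library Python. This is only a sketch: the helper name `create_minimal_backend` is hypothetical, and the `pyproject.toml` and `predict_model.py` files it creates are empty placeholders that still have to be filled in.

```python
import tempfile
from pathlib import Path


def create_minimal_backend(root: Path, model_name: str = "trained_model") -> None:
    """Create the smallest directory structure described above."""
    # The (possibly empty) model subdirectory makes the model discoverable.
    (root / "models" / model_name).mkdir(parents=True, exist_ok=True)
    (root / "scripts").mkdir(exist_ok=True)
    # Empty placeholders; real content is up to the backend author.
    (root / "pyproject.toml").touch()
    (root / "scripts" / "predict_model.py").touch()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my-tagger"
    root.mkdir()
    create_minimal_backend(root)
    created = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

print(created)
```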
The `data` directory is automatically created by the `BackendExplorer` object, together with its `raw` and `processed` subdirectories, so there is no need to include these directories in the backend.
Although the Nyx tagger UI does not expect the project to include a Python package, a Poetry-managed virtual environment should be set up with the `taggingbackends` package installed, so that the command `poetry run tagging-backend` is available at the project root directory.
The `tests` directory is renamed `test` for compatibility with Julia projects.
Python/Poetry do not need additional configuration to properly handle the tests.
Input and output data
By default, the input data are copied into `data/raw`.
Input data files can be in any format, and a backend is responsible for handling these
various files.
By default, training labels are provided as JSON files that can be loaded using the `taggingbackends.data.labels.Labels` class.
Predicted labels are expected in `data/processed`.
A backend can make use of `data/interim` or ignore it.
Similarly, the trained models can be stored in `models` or not.
However, a subdirectory by the name of the model instance should be created for model
discovery (see below).
Full paths to data and model directories, and files, are made available in the training and prediction procedures with a `taggingbackends.explorer.BackendExplorer` object.
Labels
The internal representation is as follows:
- dictionary of run identifiers (`str`, typically `date_time`) as keys and, as values:
  - dictionary of larva identifiers (`int`) as keys and, as values:
    - dictionary of timestamps (`float`) as keys and discrete behavioral states (`str`) as values.
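The nested structure can be pictured with plain Python dictionaries. The run identifier, larva identifiers, timestamps and state names below are made up for illustration, and the sketch does not use the `Labels` class itself; `count_states` is a hypothetical helper that shows how the three levels are traversed.

```python
from collections import Counter

# run identifier (str) -> larva identifier (int) -> timestamp (float) -> state (str)
labels = {
    "20240101_120000": {
        1: {0.0: "crawl", 0.1: "crawl", 0.2: "bend"},
        2: {0.0: "stop", 0.1: "crawl"},
    },
}


def count_states(labels: dict) -> Counter:
    """Count how many time steps are assigned to each behavioral state."""
    counts = Counter()
    for run in labels.values():          # one entry per run
        for larva in run.values():       # one entry per larva
            counts.update(larva.values())  # states at each timestamp
    return counts


counts = count_states(labels)
print(counts["crawl"], counts["bend"], counts["stop"])  # 3 1 1
```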
Labels are encapsulated in a dedicated datatype that also stores metadata and information about labels (names, colors).
See `taggingbackends.data.labels.Labels`, and an example JSON labels file.
Model specification
A backend can train and use multiple model instances.
Each instance is assigned an identifier, and the related files, including data files, are stored in corresponding subdirectories of the `data/raw`, `data/interim`, `data/processed` and `models` directories.
By default, a new model is identified with a timestamp in the `YYYYMMDD_HHMMSS` format.
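An identifier in this format can be produced with `datetime.strftime`. The helper name `new_instance_id` is hypothetical, and the use of local time is an assumption, since the text does not say which clock the library relies on.

```python
from datetime import datetime
from typing import Optional


def new_instance_id(now: Optional[datetime] = None) -> str:
    """Return a YYYYMMDD_HHMMSS identifier (local time assumed)."""
    if now is None:
        now = datetime.now()
    return now.strftime("%Y%m%d_%H%M%S")


print(new_instance_id(datetime(2024, 3, 5, 14, 7, 9)))  # 20240305_140709
```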
For example, the `taggingbackends.explorer.BackendExplorer.list_input_files` method will look for files in the `data/raw/<instance_identifier>` directory only.
A backend can store the model in the `models/<instance_identifier>` directory or not (call the `taggingbackends.explorer.BackendExplorer.model_dir` method to get the exact location).
However, the directory should be created to make the trained models discoverable.
Indeed, the Nyx tagging UI looks for subdirectories in the `models` directory to list the available trained models.
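The discovery rule can be approximated with the standard library alone. This sketch is not the UI's actual implementation; `list_model_instances` is a hypothetical helper that simply reports every subdirectory of `models`, whether or not it contains files.

```python
import tempfile
from pathlib import Path


def list_model_instances(project_root: Path) -> list:
    """List the names of the subdirectories found in <project_root>/models."""
    models_dir = project_root / "models"
    if not models_dir.is_dir():
        return []
    # Only directories count as model instances; plain files are skipped.
    return sorted(p.name for p in models_dir.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "models" / "20240305_140709").mkdir(parents=True)
    (root / "models" / "trained_model").mkdir()  # empty, yet discoverable
    (root / "models" / "notes.txt").touch()      # not a directory, ignored
    instances = list_model_instances(root)

print(instances)  # ['20240305_140709', 'trained_model']
```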