diff --git a/README.md b/README.md
index d2b77e4397d4fc698c8ab00093ed157850ce2bc8..0ee7e9c9980df8b07bf2f82f7e3bc626c41432fd 100644
--- a/README.md
+++ b/README.md
@@ -3,150 +3,15 @@
 [](https://gitlab.pasteur.fr/metagenomics/metagenedb/commits/dev)
 [](https://gitlab.pasteur.fr/metagenomics/metagenedb/commits/dev)
 
-Django based project to build genes catalog and tools
-to play with it and contact external services.
+## The project
 
-----
+The main motivation behind MetageneDB is to provide a support for all the analysis that are based on gene catalogs.
+It is composed of both an API and a client side for visualization and interaction with the DB.
 
-## Setup the services on your local machine
+* Graphical interface to browse through the catalog
+* REST API to programmatically query and retrieve information from the database
+* (not implemented) Interface to perform analysis from gene counts present on the catalog
 
-### Dependencies
+## Wiki & Documentation
 
-The application depends on different services that run independently on docker images and all of this is
-orchestrated by `docker-compose`.
-
-Therefore to run the application you need:
-
-* `Docker` : [Install instructions](https://docs.docker.com/install/)
-* `Docker Compose` : [Install instructions](https://docs.docker.com/compose/install/)
-
-### Configuration
-
-For `docker-compose`, you need to create a `.env` file: `touch .env`. An example is available: `.env.sample`.
-
-The settings of the Django server is based on the `backend/.env` file. You can copy the sample file
-(`cp backend/.env.sample backend/.env`) and fill in the variables.
-
-You can of course customize more of the Django server settings in the `settings` module of metagenedb.
-
-Now we will go through the different parts
-
-#### Secret key
-
-This is the Django `SECRET_KEY` and you need to specify your own. For instance you can use the command
-`openssl rand -base64 32` to generate one by command line.
-
-#### Create your own DB on postgresql
-
-The following variables have the default value:
-
-```bash
-DATABASE_HOST=postgresql
-DATABASE_USER=postgres
-DATABASE_NAME=postgres
-DATABASE_PASSWORD=""
-DATABASE_PORT=5432
-```
-
-It will work if you leave it as it is but you might face security issues having a by default database
-without credentials.
-
-What we recommand is to create your own database. Here is described one way to do it. To do that you need to
-first run the db image and identify its running ID:
-
-```bash
-khillion:~/metagenedb $ docker-compose up postgresql -d # This runs only the postgresql service of your docker-compose in detached mode. You can also detached from you running screen using Ctrl+Z
-Creating postgresql ... done
-khillion:~/metagenedb $ docker ps # List your running docker images
-CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
-5002f210f9d8 postgres:11.4-alpine "docker-entrypoint.s…" 1 minute ago Up 1 minute 0.0.0.0:5433->5432/tcp postgresql
-```
-
-Now that you have the `CONTAINER ID`, here `5002f210f9d8` you can run a `bash` terminal in this container and
-create your own database:
-
-```bash
-khillion:~/metagenedb $ docker exec -it 5002f210f9d8 bash
-bash-5.0# psql --user=postgres
-````
-
-This will open the `SQL` console where you can do what you need:
-
-```psql
-CREATE ROLE metagenedb WITH PASSWORD 'yourawesomepassword';
-ALTER ROLE metagenedb WITH CREATEDB;
-CREATE DATABASE metagenedb WITH OWNER metagenedb;
-exit
-```
-
-Now you have you own database, protected by a password and you need to update your `.env`:
-
-```bash
-DATABASE_HOST=postgresql
-DATABASE_USER=metagenedb
-DATABASE_NAME=metagenedb
-DATABASE_PASSWORD=yourawesomepassword
-DATABASE_PORT=5432
-```
-
-> **Note**: The by default port for postgres is `5432`. In the `docker-compose.yaml` you will notice that this
-port is redirected to `5433` on the `localhost`. This is done in order to not interfere with your local
-postgres if you have one. This means you need to change `DATABASE_HOST` to `localhost` and `DATABASE_POS
-
-### Pre-computed statistics
-
-Some statistics about genes are pre-computed and can be accessed through the `/api/catalog/v1/statistics` endpoint.
-The ID is constructed with the following format: `<statisctics-type>-<gene_source>-<method>-<options>`.
-
-----
-
-## Run the application
-
-For the moment, only the `docker-compose.dev.yaml` is used. To run the application simply run the command:
-
-```bash
-docker-compose up --build
-```
-
-The `--build` option is only necessary during the first usage or when you make changes that need the docker
-container to be built again.
-
-Since directories with source codes are mounted in the containers, changes you make locally should be
-directly reflected on the application.
-
-### Populate the database
-
-You have a set of scripts available within the `backend/scripts` directory that you can execute directly
-from within the container. First identify the container ID corresponding to the backend with `docker ps` command. Then you can execute a bash terminal within the container and execute the scripts you want:
-
-```bash
-docker exec -it YOURCONTAINER_ID bash
-root@YOURCONTAINER_ID:/code# python scripts/script.py
-```
-
-For the moment you can:
-
-* Import all kegg orthologies with `load_kegg.py`: It directly fetch all KEGGs KO from the KEGG REST API.
-* Import genes from IGC catalog from the [annotation file](ftp://ftp.cngb.org/pub/SciRAID/Microbiome/humanGut_9.9M/GeneAnnotation/IGC.annotation_OF.summary.gz). You can a small part of this annotation file in the `dev_data` folder.
-
-> **Note**: You can also execute the scripts locally from a `pipenv shell` for instance. You need to make
-sure that you change the way to log to postgres since the access is different from your machine compared to
-from a container.
-
------
-
-## Dev tips
-
-#### Profiling code
-
-```python
-from metagenedb.common.utils.profiling import profile
-
-@profile("/my/file/path")
-def my_function(a, b, c):
-    ...
-```
-
-```bash
-snakeviz /my/file/path
-```
\ No newline at end of file
+For more information, please have a look at our [Wiki](https://gitlab.pasteur.fr/metagenomics/metagenedb/-/wikis/MetageneDB)
diff --git a/backend/.env.sample b/backend/.env.sample
index e1d292e5133f06b9d66b0a30516fc80c7fb1ded0..840b355f234d78db80b6a340606c1590b45bc4ef 100644
--- a/backend/.env.sample
+++ b/backend/.env.sample
@@ -6,3 +6,4 @@ DATABASE_NAME=postgres
 DATABASE_PASSWORD=""
 DATABASE_PORT=5432
 SECRET_KEY=YOUR_KEY
+DB_LOG_LEVEL=INFO
diff --git a/backend/metagenedb/apps/catalog/factory/function.py b/backend/metagenedb/apps/catalog/factory/function.py
index a8b077ebc828c9b5bd0894b8fa3b3bff182cdf07..4d0a275f9576bb2e094fba0e33772503b9516038 100644
--- a/backend/metagenedb/apps/catalog/factory/function.py
+++ b/backend/metagenedb/apps/catalog/factory/function.py
@@ -1,36 +1,33 @@
-from factory import DjangoModelFactory, fuzzy
-from faker import Factory
+from factory import DjangoModelFactory, Faker, fuzzy
 
 from metagenedb.apps.catalog import models
 
-from .fuzzy_base import FuzzyLowerText
-
-faker = Factory.create()
 
 SELECTED_SOURCE = [i[0] for i in models.Function.SOURCE_CHOICES]
 EGGNOG_VERSIONS = [i[0] for i in models.EggNOG.VERSION_CHOICES]
 
 
-class BaseFunctionFactory(DjangoModelFactory):
-    function_id = FuzzyLowerText(prefix='function-', length=15)
-
-
-class FunctionFactory(BaseFunctionFactory):
+class FunctionFactory(DjangoModelFactory):
     class Meta:
         model = models.Function
 
+    function_id = Faker('bothify', text='function-####')
     source = fuzzy.FuzzyChoice(SELECTED_SOURCE)
-    function_id = FuzzyLowerText(prefix='function-', length=15)
 
 
-class EggNOGFactory(BaseFunctionFactory):
+class EggNOGFactory(DjangoModelFactory):
+    function_id = Faker('bothify', text='COG####')
+    name = Faker('bothify', text='COG-????????')
+
     class Meta:
         model = models.EggNOG
 
     version = fuzzy.FuzzyChoice(EGGNOG_VERSIONS)
 
 
-class KeggOrthologyFactory(BaseFunctionFactory):
+class KeggOrthologyFactory(DjangoModelFactory):
+    function_id = Faker('bothify', text='K0####')
+
     class Meta:
         model = models.KeggOrthology
diff --git a/backend/metagenedb/apps/catalog/factory/gene.py b/backend/metagenedb/apps/catalog/factory/gene.py
index 304d3edc8a12a90ee906b804da47c1933cc2151b..3637655ab605e2c8b6e0a3d59d61379869bba9ea 100644
--- a/backend/metagenedb/apps/catalog/factory/gene.py
+++ b/backend/metagenedb/apps/catalog/factory/gene.py
@@ -1,16 +1,12 @@
 from factory import (
-    DjangoModelFactory, RelatedFactory, SubFactory, fuzzy
+    DjangoModelFactory, Faker, RelatedFactory, SubFactory, fuzzy
 )
-from faker import Factory
 
 from metagenedb.apps.catalog import models
 
-from .fuzzy_base import FuzzyLowerText
-from .function import FunctionFactory
+from .function import FunctionFactory, KeggOrthologyFactory, EggNOGFactory
 from .taxonomy import TaxonomyFactory
 
-faker = Factory.create()
-
 GENE_SOURCES = [i[0] for i in models.Gene.SOURCE_CHOICES]
 
 
@@ -18,9 +14,9 @@ class GeneFactory(DjangoModelFactory):
     class Meta:
         model = models.Gene
 
-    gene_id = FuzzyLowerText(prefix='gene-', length=15)
-    name = fuzzy.FuzzyText(prefix='name-', length=15)
-    length = fuzzy.FuzzyInteger(200, 10000)
+    gene_id = Faker('bothify', text='gene-?#?#??-#??#?#')
+    name = Faker('bothify', text='Gene_name-##-????')
+    length = Faker('pyint', min_value=200, max_value=4200)
     source = fuzzy.FuzzyChoice(GENE_SOURCES)
 
 
@@ -36,9 +32,17 @@ class GeneFunctionFactory(DjangoModelFactory):
     function = SubFactory(FunctionFactory)
 
 
+class GeneKeggFactory(GeneFunctionFactory):
+    function = SubFactory(KeggOrthologyFactory)
+
+
+class GeneEggNOGFactory(GeneFunctionFactory):
+    function = SubFactory(EggNOGFactory)
+
+
 class GeneWithKeggFactory(GeneFactory):
-    kegg = RelatedFactory(GeneFunctionFactory, 'gene', function__source='kegg')
+    kegg = RelatedFactory(GeneKeggFactory, 'gene')
 
 
 class GeneWithEggNOGFactory(GeneFactory):
-    eggnog = RelatedFactory(GeneFunctionFactory, 'gene', function__source='eggnog')
+    eggnog = RelatedFactory(GeneEggNOGFactory, 'gene')
diff --git a/backend/metagenedb/apps/catalog/factory/taxonomy.py b/backend/metagenedb/apps/catalog/factory/taxonomy.py
index 2504f40847b6254c9f113b397c01cd20d05e018e..0f770047d2c1fd6524b172d5a29b145fee9c8ed3 100644
--- a/backend/metagenedb/apps/catalog/factory/taxonomy.py
+++ b/backend/metagenedb/apps/catalog/factory/taxonomy.py
@@ -1,3 +1,5 @@
+from collections import OrderedDict
+
 from factory import DjangoModelFactory, fuzzy
 from faker import Factory
 
@@ -25,44 +27,48 @@ class DbGenerator:
         self.created_ids = set() # store already created IDs to skip them
 
     def generate_db_from_tree(self, tree):
+        """
+        Tree need to be an OrderedDict from higher to lower level
+        """
+        self.last_tax = None
         for rank, desc in tree.items():
             if desc['tax_id'] not in self.created_ids:
-                TaxonomyFactory.create(
+                self.last_tax = TaxonomyFactory.create(
                     tax_id=desc['tax_id'],
                     name=desc['name'],
                     rank=rank,
+                    parent=getattr(self, "last_tax", None)
                 )
                 self.created_ids.add(desc['tax_id'])
+        self.last_tax.build_hierarchy()
 
 
 def _generate_lactobacillus_db(db_generator):
     """
     Generate db with few ranks corresponding to Lactobacillus genus
     """
-    tree = {
-        "class": {"name": "Bacilli", "tax_id": "91061"},
-        "genus": {"name": "Lactobacillus", "tax_id": "1578"},
-        "order": {"name": "Lactobacillales", "tax_id": "186826"},
-        "family": {"name": "Lactobacillaceae", "tax_id": "33958"},
-        "phylum": {"name": "Firmicutes", "tax_id": "1239"},
-        "no_rank": {"name": "cellular organisms", "tax_id": "131567"},
-        "superkingdom": {"name": "Bacteria", "tax_id": "2"},
-        "species_group": {"name": "Lactobacillus casei group", "tax_id": "655183"}
-    }
+    tree = OrderedDict()
+    tree['no_rank'] = {"name": "root", "tax_id": "1"}
+    tree["superkingdom"] = {"name": "Bacteria", "tax_id": "2"}
+    tree["phylum"] = {"name": "Firmicutes", "tax_id": "1239"}
+    tree["class"] = {"name": "Bacilli", "tax_id": "91061"}
+    tree["order"] = {"name": "Lactobacillales", "tax_id": "186826"}
+    tree["family"] = {"name": "Lactobacillaceae", "tax_id": "33958"}
+    tree["genus"] = {"name": "Lactobacillus", "tax_id": "1578"}
+    tree["species_group"] = {"name": "Lactobacillus casei group", "tax_id": "655183"}
     db_generator.generate_db_from_tree(tree)
 
 
 def _generate_escherichia_db(db_generator):
-    tree = {
-        "class": {"name": "Gammaproteobacteria", "tax_id": "1236"},
-        "genus": {"name": "Escherichia", "tax_id": "561"},
-        "order": {"name": "Enterobacterales", "tax_id": "91347"},
-        "family": {"name": "Enterobacteriaceae", "tax_id": "543"},
-        "phylum": {"name": "Proteobacteria", "tax_id": "1224"},
-        "no_rank": {"name": "cellular organisms", "tax_id": "131567"},
-        "species": {"name": "Escherichia coli", "tax_id": "562"},
-        "superkingdom": {"name": "Bacteria", "tax_id": "2"}
-    }
+    tree = OrderedDict()
+    tree["no_rank"] = {"name": "root", "tax_id": "1"}
+    tree["superkingdom"] = {"name": "Bacteria", "tax_id": "2"}
+    tree["phylum"] = {"name": "Proteobacteria", "tax_id": "1224"}
+    tree["class"] = {"name": "Gammaproteobacteria", "tax_id": "1236"}
+    tree["order"] = {"name": "Enterobacterales", "tax_id": "91347"}
+    tree["family"] = {"name": "Enterobacteriaceae", "tax_id": "543"}
+    tree["genus"] = {"name": "Escherichia", "tax_id": "561"}
+    tree["species"] = {"name": "Escherichia coli", "tax_id": "562"}
     db_generator.generate_db_from_tree(tree)
 
 
diff --git a/backend/metagenedb/apps/catalog/management/commands/create_light_db.py b/backend/metagenedb/apps/catalog/management/commands/create_light_db.py
new file mode 100644
index 0000000000000000000000000000000000000000..16bf6fa96ac8cd41645f469fd6813ba44c0f64b7
--- /dev/null
+++ b/backend/metagenedb/apps/catalog/management/commands/create_light_db.py
@@ -0,0 +1,68 @@
+import logging
+from random import randint
+
+from django.core.management.base import BaseCommand
+
+from metagenedb.apps.catalog.factory import GeneFactory, GeneWithEggNOGFactory, GeneWithKeggFactory
+from metagenedb.apps.catalog.factory.taxonomy import generate_simple_db as gen_tax_db
+from metagenedb.apps.catalog.models import (
+    Gene, Function, Taxonomy
+)
+from metagenedb.apps.catalog.management.commands.compute_stats import (
+    ComputeStatistics, ComputeCounts, ComputeGeneLength, ComputeTaxonomyRepartition, ComputeTaxonomyPresence
+)
+
+logging.basicConfig(format='[%(asctime)s] %(levelname)s:%(name)s:%(message)s')
+logger = logging.getLogger()
+
+
+def empty_db():
+    Gene.objects.all().delete()
+    Taxonomy.objects.all().delete()
+    Function.objects.all().delete()
+
+
+def create_taxonomy_db():
+    Taxonomy.objects.all().delete()
+    gen_tax_db()
+
+
+def create_genes_db():
+    Gene.objects.all().delete()
+    GeneFactory.create_batch(50)
+    GeneWithEggNOGFactory.create_batch(15)
+    GeneWithKeggFactory.create_batch(12)
+    for tax in Taxonomy.objects.all():
+        GeneFactory.create_batch(randint(1, 10), taxonomy=tax)
+        GeneWithEggNOGFactory.create(taxonomy=tax)
+        GeneWithKeggFactory.create(taxonomy=tax)
+
+
+def compute_stats():
+    ComputeStatistics('all').clean_db()
+    for gene_source in ['all', 'virgo', 'igc']:
+        ComputeCounts(gene_source).all()
+        ComputeGeneLength(gene_source).all()
+        ComputeTaxonomyRepartition(gene_source).all()
+        ComputeTaxonomyPresence(gene_source).all()
+
+
+def create_small_db():
+    empty_db()
+    create_taxonomy_db()
+    create_genes_db()
+    compute_stats()
+
+
+class Command(BaseCommand):
+    help = 'Create a light DB with random items to illustrate functionnalities of the application.'
+
+    def set_logger_level(self, verbosity):
+        if verbosity > 2:
+            logger.setLevel(logging.DEBUG)
+        elif verbosity > 1:
+            logger.setLevel(logging.INFO)
+
+    def handle(self, *args, **options):
+        self.set_logger_level(int(options['verbosity']))
+        create_small_db()
diff --git a/backend/metagenedb/apps/catalog/models/taxonomy.py b/backend/metagenedb/apps/catalog/models/taxonomy.py
index bd2bfe48466c6438ab1f20229e91dbaf2292f003..ffb2a12fd9c11b8b9d1cf7238be73fde4893484b 100644
--- a/backend/metagenedb/apps/catalog/models/taxonomy.py
+++ b/backend/metagenedb/apps/catalog/models/taxonomy.py
@@ -62,12 +62,13 @@ class Taxonomy(models.Model):
         Build and save parental hierarchy for an entry
         """
         hierarchy = {}
-        if self.name != 'root' and self.parent is not None:
+        if self.name != 'root':
             hierarchy[self.rank] = {
                 'tax_id': self.tax_id,
                 'name': self.name
             }
-            hierarchy = {**hierarchy, **getattr(self.parent, 'hierarchy', self.parent.build_hierarchy())}
+            if self.parent is not None:
+                hierarchy = {**hierarchy, **getattr(self.parent, 'hierarchy', self.parent.build_hierarchy())}
         self.hierarchy = hierarchy
         self.save()
         return hierarchy
diff --git a/backend/scripts/populate_db/test_import_igc_data.py b/backend/scripts/populate_db/test_import_igc_data.py
index d43344693b235cc1430cc14de64fd806221ece72..1eeb0ef92c9b78f8ba72c7fd2b4f9b0ec7a55254 100644
--- a/backend/scripts/populate_db/test_import_igc_data.py
+++ b/backend/scripts/populate_db/test_import_igc_data.py
@@ -297,12 +297,12 @@ class TestBuildTaxoMapping(APITestCase):
         self.assertDictEqual(self.import_igc_genes.phylum_mapping, expected_phylum_dict)
 
 
-class TestBuildBuildFunctionCatalog(APITestCase):
+class TestBuildFunctionCatalog(APITestCase):
 
     @classmethod
     def setUpTestData(cls):
-        cls.keggs = KeggOrthologyFactory.create_batch(100)
-        cls.eggnogs = EggNOGFactory.create_batch(100)
+        cls.keggs = KeggOrthologyFactory.create_batch(10)
+        cls.eggnogs = EggNOGFactory.create_batch(10)
 
     def setUp(self):
         self.import_igc_genes = ImportIGCGenes('test', 'test_url', 'test_token')