diff --git a/simulation/README.md b/simulation/README.md
index 1959af1caeb06fb3a03a9b224ec23e140ff2b34b..6cd4492c5ddf2f7ac238332618337962a5c546bd 100644
--- a/simulation/README.md
+++ b/simulation/README.md
@@ -1,145 +1,43 @@
 # Simulation
 
-[CAMISIM](https://github.com/CAMI-challenge/CAMISIM) can model different microbial abundance
-profiles (from multi-sample time series to differential abundance studies) and was used to
-generate the benchmark data sets of the first CAMI challenge.
+[InSilicoSeq](https://github.com/HadrienG/InSilicoSeq) is a sequencing simulator producing
+realistic Illumina reads. Primarily intended for simulating metagenomic samples, it can also
+be used to produce sequencing data from a single genome.
 
-We describe here only the usage of de novo metagenomes simulation. For a more exhaustive
-documentation, please refer to [CASISIM wiki](https://github.com/CAMI-challenge/CAMISIM/wiki).
+We describe here only the usage for a simple simulation. For more exhaustive documentation,
+please refer to the [InSilicoSeq documentation](https://insilicoseq.readthedocs.io/en/latest/?badge=latest)
+or look at the help section of the tool with `iss generate --help`.
 
 ## Install
 
-CAMISIM contains a lot of dependencies and the list can be found [Here](https://github.com/CAMI-challenge/CAMISIM/wiki/User-manual#installation).
+There are several ways to install InSilicoSeq; they are listed [here](https://insilicoseq.readthedocs.io/en/latest/iss/install.html).
 
-However, we recommand the use of the docker image `cami/camisim:latest` or the singularity one ([here](https://gitlab.pasteur.fr/metagenomics/singularity/tree/master/tools/camisim), only from docker for the moment).
+Here, we will use the docker image `hadrieng/insilicoseq:1.4.2` or the singularity one built from the docker image.
 
 ## Run de novo metagenomes simulation
 
-The repository of the tool ([Here](https://github.com/CAMI-challenge/CAMISIM)) comes with a
-serie of sample data which makes it possible to directly test the tool without downloading
-anything.
+### Build your reference metagenome
 
-For this example, we are going to start from the files given and modify a bit the parameters to
-make the process a bit faster. Files can be found in the `example/` directory.
-All the path are preceded by the `/input` directory since we are going to mount our config
-files into this directory.
+The first step is to select the genomes that will make up your metagenome. Choose the genomes of
+interest and put them in a single `fasta` file.
 
-> **Warning** Generated paired-end reads are in one unique `fastq` file that need to be
-splitted in `_R1` and `_R2` files.
+> **Note** InSilicoSeq can [download and build a metagenome for you](https://github.com/HadrienG/InSilicoSeq#generate-reads-without-input-genomes), but this requires internet access.
 
 ### With docker
 
 ```bash
-docker pull cami/camisim:latest
+docker pull hadrieng/insilicoseq:1.4.2
 mkdir output
-docker run -it -v "/path/to/folder/example/:/input:rw" -v "/path/to/folder/output/:/output:rw" cami/camisim:latest metagenomesimulation.py /input/config.ini
+docker run -it -v "/metagenome/folder:/input:rw" -v "/path/to/folder/output/:/output:rw" hadrieng/insilicoseq:1.4.2 iss generate --genomes /input/metagenome.fasta --model HiSeq --output /output/simulation --compress
 ```
 
 ### With singularity on TARS
 
-You can build the singularity image from the docker one. A recipe doing that is
-available [here](https://gitlab.pasteur.fr/metagenomics/singularity/tree/master/tools/camisim).
+You can easily build the singularity image from the docker one with the `singularity pull NAME.simg docker://hadrieng/insilicoseq:1.4.2` command.
 
-You will probably need to mount also a tmp directory since the one by default from
-singularity seems to be too small for the tool.
 
 ```bash
 module load singularity
-mkdir output tmp
-singularity run --bind /path/to/folder/example:/input,/path/to/folder/tmp:/tmp,/path/to/folder/output:/output /path/to/singularity.simg /input/config.ini
-```
-
-> **WARNING**: This process can require at least 12Gb of RAM
-
-### Configuration file
-
-You can here set the different parameters for your simulation. The customed file is the
-`config.ini` file.
-
-We will quickly go through the different part of this config file. You can find the
-complete description on the [Documentation](https://github.com/CAMI-challenge/CAMISIM/wiki/Configuration-File-Options).
-
-By default path can be kept by default. However, it is easier to use absolute path
-instead of relative one (by default) for singularity.
-
-#### Main
-
-```ini
-[Main]
-seed=632741178 # if None is used, random seed is chosen
-phase=0 # 0: Full run; 1: Only community design; 2: Start with read simulation
-max_processors=8
-dataset_id=RL # name of the created sample
-output_directory=out
-temp_directory=/tmp
-gsa=True # whether a gold standard assembly should be created
-pooled_gsa=True # whether a pooled gold standard over all samples is created
-anonymous=True # whether the output is anonymized (reads from genomes are shuffled)
-compress=1 # 0 is for no compression, 9 is maximum compression
-```
-
-#### Read Simulator
-
-```ini
-[ReadSimulator]
-readsim=/usr/local/bin/tools/art_illumina-2.3.6/art_illumina # leave by default since we are in a container
-error_profiles=/usr/local/bin/tools/art_illumina-2.3.6/profiles # leave by default
-samtools=/usr/local/bin/tools/samtools-1.3/samtools # leave by default
-profile=mbarc # choose for ART: mi/hi/hi150/mbarc
-size=0.1 # size of a single sample in Gigabasepairs (Gbp)
-type=art # simulation tool
-fragments_size_mean=270
-fragment_size_standard_deviation=27
-```
-
-All the path for the tools are kept by default since we are using the tool from a container.
-For the different profile, this corresponds to errors profiles and the documentation mention
-that `mbarc` is recommended for bacterial communities.
-
-#### Community Design
-
-```ini
-[CommunityDesign]
-#distribution_file_paths='out/abundance0.tsv', 'out/abundance1.tsv', 'out/abundance2.tsv', 'out/abundance3.tsv', 'out/abundance4.tsv', 'out/abundance5.tsv', 'out/abundance6.tsv', 'out/abundance7.tsv', 'out/abundance8.tsv', 'out/abundance9.tsv'
-ncbi_taxdump=/usr/local/bin/tools/ncbi-taxonomy_20170222.tar.gz
-strain_simulation_template=scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
-number_of_samples=10 # Number of samples to be created
-```
-
-Again, different path can be kept by default since everything is running in the container.
-
-#### Community
-
-```ini
-[community0]
-metadata=metadata.tsv # tab separated data table It maps genome ids with additional information of their classification
-id_to_genome_file=genome_to_id.tsv # format is: "genome_ID\tgenome_path" without header
-id_to_gff_file=
-genomes_total=24 # Total number of simulated genomes
-genomes_real=24 # Number of genomes used from the input genomes
-max_strains_per_otu=1 # Maximum number of strains drawn from genomes belonging to a single OTU
-ratio=1 # ratio between different communities
-mode=differential # mode for changing the abundances in different samples (replicates/timeseries_lognormal/timeseries_normal/differential)
-log_mu=1 # mean of the used log-normal distribution
-log_sigma=2 # standard deviation of the used log-normal distribution
-gauss_mu=1 # mean of the used normal distribution
-gauss_sigma=1 # standard deviation of the used normal distribution
-view=False # show the used distribution of genomes before simulating
+mkdir output
+singularity run /path/to/singularity.simg iss generate --genomes metagenome.fasta --model HiSeq --output /output/simulation --compress
 ```
-
-##### Metadata `tsv` file
-
-The first row of the file is the header for the following columns:
-
-genome_ID | OTU | NCBI_ID | novelty_category
---------- | --- | ------- | ----------------
-
-__novelty category__: if a genome is not in the database, how "new" is it in comparison to genomes in the NCBI (`new_strain`, `new_species`, `new_genus`, ...).
-Also see more information on this file here: [Genome selection](https://github.com/CAMI-challenge/CAMISIM/wiki/Genome-Selection)
-
-A by default file is provided but we are going to use copy of them in order to easily give the
-possibility to use customized ones.
-
-##### Number of genomes
-
-Difference between `genomes_total` and `genomes_real` are simulated by sgEvolver