Kenzo-Hugo Hillion · 1e42999a
--- a/simulation/README.md

+ 17

− 119
+++ b/simulation/README.md

 # Simulation

-[CAMISIM](https://github.com/CAMI-challenge/CAMISIM) can model different microbial abundance
-profiles (from multi-sample time series to differential abundance studies) and was used to
-generate the benchmark data sets of the first CAMI challenge.
+[InSilicoSeq](https://github.com/HadrienG/InSilicoSeq) is a sequencing simulator producing
+realistic Illumina reads. Primarily intended for simulating metagenomic samples, it can also
+be used to produce sequencing data from a single genome.

-We describe here only the usage of de novo metagenomes simulation. For a more exhaustive
-documentation, please refer to [CASISIM wiki](https://github.com/CAMI-challenge/CAMISIM/wiki).
+We describe here only the usage for a simple simulation. For more exhaustive documentation,
+please refer to the [InSilicoSeq documentation](https://insilicoseq.readthedocs.io/en/latest/?badge=latest)
+or look at the help section of the tool with `iss generate --help`.

 ## Install

-CAMISIM contains a lot of dependencies and the list can be found [Here](https://github.com/CAMI-challenge/CAMISIM/wiki/User-manual#installation).
+InSilicoSeq can be installed in several ways, which are listed [here](https://insilicoseq.readthedocs.io/en/latest/iss/install.html).

-However, we recommand the use of the docker image `cami/camisim:latest` or the singularity one ([here](https://gitlab.pasteur.fr/metagenomics/singularity/tree/master/tools/camisim), only from docker for the moment).
+Here, we will use the docker image `hadrieng/insilicoseq:1.4.2` or a singularity image built from it.

 ## Run de novo metagenomes simulation

-The repository of the tool ([Here](https://github.com/CAMI-challenge/CAMISIM)) comes with a
-serie of sample data which makes it possible to directly test the tool without downloading
-anything.
+### Build your reference metagenome

-For this example, we are going to start from the files given and modify a bit the parameters to
-make the process a bit faster. Files can be found in the `example/` directory.
-All the path are preceded by the `/input` directory since we are going to mount our config
-files into this directory.
+The first step is to select the genomes of interest and put them in a single `fasta` file,
+which serves as your reference metagenome.

-> **Warning** Generated paired-end reads are in one unique `fastq` file that need to be
-splitted in `_R1` and `_R2` files.
+> **Note** InSilicoSeq can [download and build a metagenome for you](https://github.com/HadrienG/InSilicoSeq#generate-reads-without-input-genomes) but you need internet access.
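
As a minimal sketch of the step above, assuming each genome of interest is stored as its own FASTA file in a hypothetical `example_genomes/` directory, the files can be concatenated with `cat`. The two tiny records below are placeholders that only make the example self-contained; the directory and record names are illustrative and not part of InSilicoSeq.

```bash
# Placeholder genomes so the sketch runs on its own; use your own FASTA files instead.
mkdir -p example_genomes
printf '>genome_A\nACGTACGTACGT\n' > example_genomes/genome_A.fasta
printf '>genome_B\nTTGGCCAATTGG\n' > example_genomes/genome_B.fasta

# Concatenate every genome into the single file passed to `iss generate --genomes`.
cat example_genomes/*.fasta > metagenome.fasta

# Count the sequence records as a quick sanity check.
grep -c '^>' metagenome.fasta
```

The record count printed at the end should match the number of sequences you intended to include.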

 ### With docker

 ```bash
-docker pull cami/camisim:latest
+docker pull hadrieng/insilicoseq:1.4.2
 mkdir output
-docker run -it -v "/path/to/folder/example/:/input:rw" -v "/path/to/folder/output/:/output:rw" cami/camisim:latest metagenomesimulation.py /input/config.ini
+docker run -it -v "/metagenome/folder:/input:rw" -v "/path/to/folder/output/:/output:rw" hadrieng/insilicoseq:1.4.2 iss generate --genomes /input/metagenome.fasta --model HiSeq --output /output/simulation --compress
 ```

 ### With singularity on TARS

-You can build the singularity image from the docker one. A recipe doing that is
-available [here](https://gitlab.pasteur.fr/metagenomics/singularity/tree/master/tools/camisim).
+You can build the singularity image from the docker one with the `singularity pull NAME.simg docker://hadrieng/insilicoseq:1.4.2` command.

-You will probably need to mount also a tmp directory since the one by default from
-singularity seems to be too small for the tool.

 ```bash
 module load singularity
-mkdir output tmp
-singularity run --bind /path/to/folder/example:/input,/path/to/folder/tmp:/tmp,/path/to/folder/output:/output /path/to/singularity.simg /input/config.ini
-```
-
-> **WARNING**: This process can require at least 12Gb of RAM
-
-### Configuration file
-
-You can here set the different parameters for your simulation. The customed file is the
-`config.ini` file.
-
-We will quickly go through the different part of this config file. You can find the
-complete description on the [Documentation](https://github.com/CAMI-challenge/CAMISIM/wiki/Configuration-File-Options).
-
-By default path can be kept by default. However, it is easier to use absolute path
-instead of relative one (by default) for singularity.
-
-#### Main
-
-```ini
-[Main]
-seed=632741178          # if None is used, random seed is chosen
-phase=0                 # 0: Full run; 1: Only community design; 2: Start with read simulation
-max_processors=8
-dataset_id=RL           # name of the created sample
-output_directory=out
-temp_directory=/tmp
-gsa=True                # whether a gold standard assembly should be created
-pooled_gsa=True         # whether a pooled gold standard over all samples is created
-anonymous=True          # whether the output is anonymized (reads from genomes are shuffled)
-compress=1              # 0 is for no compression, 9 is maximum compression
-```
-
-#### Read Simulator
-
-```ini
-[ReadSimulator]
-readsim=/usr/local/bin/tools/art_illumina-2.3.6/art_illumina       # leave by default since we are in a container
-error_profiles=/usr/local/bin/tools/art_illumina-2.3.6/profiles    # leave by default
-samtools=/usr/local/bin/tools/samtools-1.3/samtools                # leave by default
-profile=mbarc                                       # choose for ART: mi/hi/hi150/mbarc
-size=0.1                                            # size of a single sample in Gigabasepairs (Gbp)
-type=art                                            # simulation tool
-fragments_size_mean=270
-fragment_size_standard_deviation=27
-```
-
-All the path for the tools are kept by default since we are using the tool from a container.
-For the different profile, this corresponds to errors profiles and the documentation mention
-that `mbarc` is recommended for bacterial communities.
-
-#### Community Design
-
-```ini
-[CommunityDesign]
-#distribution_file_paths='out/abundance0.tsv', 'out/abundance1.tsv', 'out/abundance2.tsv', 'out/abundance3.tsv', 'out/abundance4.tsv', 'out/abundance5.tsv', 'out/abundance6.tsv', 'out/abundance7.tsv', 'out/abundance8.tsv', 'out/abundance9.tsv'
-ncbi_taxdump=/usr/local/bin/tools/ncbi-taxonomy_20170222.tar.gz
-strain_simulation_template=scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
-number_of_samples=10  # Number of samples to be created
-```
-
-Again, different path can be kept by default since everything is running in the container.
-
-#### Community
-
-```ini
-[community0]
-metadata=metadata.tsv                   # tab separated data table It maps genome ids with additional information of their classification
-id_to_genome_file=genome_to_id.tsv      # format is: "genome_ID\tgenome_path" without header
-id_to_gff_file=
-genomes_total=24                        # Total number of simulated genomes
-genomes_real=24                         # Number of genomes used from the input genomes
-max_strains_per_otu=1                   # Maximum number of strains drawn from genomes belonging to a single OTU
-ratio=1                                 # ratio between different communities
-mode=differential                       # mode for changing the abundances in different samples (replicates/timeseries_lognormal/timeseries_normal/differential)
-log_mu=1                                # mean of the used log-normal distribution
-log_sigma=2                             # standard deviation of the used log-normal distribution
-gauss_mu=1                              # mean of the used normal distribution
-gauss_sigma=1                           # standard deviation of the used normal distribution
-view=False                              # show the used distribution of genomes before simulating
+mkdir output
+singularity run /path/to/singularity.simg iss generate --genomes metagenome.fasta --model HiSeq --output output/simulation --compress
 ```
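
Once a simulation has finished, a quick read count helps confirm that both mates were written. The sketch below assumes the run produced gzip-compressed paired files named `simulation_R1.fastq.gz` and `simulation_R2.fastq.gz` (an assumption derived from the `--output` prefix and `--compress` option used above; check your own output directory). It works on placeholder files in a hypothetical `example_output/` directory so that it does not touch real results; point `outdir` at your actual output folder instead.

```bash
# Placeholder paired reads in a separate directory so real results are not overwritten.
outdir=example_output
mkdir -p "$outdir"
printf '@read_1/1\nACGT\n+\nIIII\n' | gzip -c > "$outdir/simulation_R1.fastq.gz"
printf '@read_1/2\nTGCA\n+\nIIII\n' | gzip -c > "$outdir/simulation_R2.fastq.gz"

# A FASTQ record spans four lines, so reads = lines / 4.
for mate in R1 R2; do
    lines=$(gzip -dc "$outdir/simulation_${mate}.fastq.gz" | wc -l)
    echo "${mate}: $((lines / 4)) reads"
done
```

Both mates should report the same number of reads; a mismatch suggests a truncated or incomplete run.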
-
-##### Metadata `tsv` file
-
-The first row of the file is the header for the following columns:
-
-genome_ID | OTU | NCBI_ID | novelty_category
--------- | --- | ------- | ----------------
-
-__novelty category__: if a genome is not in the database, how "new" is it in comparison to genomes in the NCBI (`new_strain`, `new_species`, `new_genus`, ...).
-Also see more information on this file here: [Genome selection](https://github.com/CAMI-challenge/CAMISIM/wiki/Genome-Selection)
-
-A by default file is provided but we are going to use copy of them in order to easily give the
-possibility to use customized ones.
-
-##### Number of genomes
-
-Difference between `genomes_total` and `genomes_real` are simulated by sgEvolver