Blaise Li authored

Getting the data

# for V. cholerae
./scripts/download_genome.py configs/download_Vcholerae.yaml
# for E. coli
./scripts/download_genome.py configs/download_Ecoli.yaml

Computing codon usage

Counting codons

On an interactive compute node with 2 CPUs requested (with slurm, use option -c 2):

# for V. cholerae
snakemake --snakefile workflow/codon_usage.snakefile --configfile configs/codon_usage_Vcholerae.yaml -j 2 1>log 2>err
snakemake --snakefile workflow/codon_usage.snakefile --configfile configs/codon_usage_Vcholerae_first30.yaml -j 2 1>log 2>err
# for E. coli
snakemake --snakefile workflow/codon_usage.snakefile --configfile configs/codon_usage_Ecoli.yaml -j 2 1>log 2>err
snakemake --snakefile workflow/codon_usage.snakefile --configfile configs/codon_usage_Ecoli_first30.yaml -j 2 1>log 2>err

As of 05/07/2021, this workflow actually only counts codons.
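
As a rough illustration of what counting codons means, the following minimal Python sketch tallies the in-frame triplets of a coding sequence. It is not the code of the workflow: the function name `count_codons` and the example sequence are hypothetical.

```python
from collections import Counter


def count_codons(cds):
    """Count in-frame codons of a coding sequence (hypothetical helper)."""
    cds = cds.upper()
    # Ignore trailing bases that do not form a complete codon.
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


# Made-up sequence: ATG GCT GCT AAA TAA
counts = count_codons("ATGGCTGCTAAATAA")
print(counts["GCT"])
```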

Codon counts are available here:

Similar files are also available in the codon_usage_first30 folder.

Filtering genes

Gene filtering is performed in the following Jupyter notebooks:

This consists in pre-processing the counts to exclude genes that have no valid start codon and to group codons by their corresponding amino-acid.
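
The two pre-processing operations can be sketched as follows. This is only an illustration under stated assumptions, not the content of the notebooks: the set of start codons considered valid (`START_CODONS`), the input layout and the function name `filter_and_group` are all assumptions, and the codon table is deliberately partial.

```python
from collections import defaultdict

# Assumed set of valid start codons; the notebooks may use a different one.
START_CODONS = {"ATG", "GTG", "TTG"}

# Deliberately partial codon table (lysine and alanine only), for illustration.
CODON_TO_AA = {
    "AAA": "K", "AAG": "K",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
}


def filter_and_group(genes):
    """Drop genes without a valid start codon and group counts by amino-acid.

    genes maps a gene identifier to (start_codon, {codon: count}).
    """
    grouped = {}
    for gene_id, (start, counts) in genes.items():
        if start not in START_CODONS:
            continue  # gene excluded: no valid start codon
        by_aa = defaultdict(dict)
        for codon, n in counts.items():
            aa = CODON_TO_AA.get(codon)
            if aa is not None:
                by_aa[aa][codon] = n
        grouped[gene_id] = dict(by_aa)
    return grouped


# Made-up example data.
genes = {
    "geneA": ("ATG", {"AAA": 3, "AAG": 1, "GCT": 2}),
    "geneB": ("CCC", {"AAA": 5}),
}
result = filter_and_group(genes)
print(result)
```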

The notebooks can be executed on the command-line as follows:

jupyter nbconvert --execute --to html Select_genes_Vcholerae.ipynb
jupyter nbconvert --execute --to html Select_genes_Vcholerae_first30.ipynb
jupyter nbconvert --execute --to html Select_genes_Ecoli.ipynb
jupyter nbconvert --execute --to html Select_genes_Ecoli_first30.ipynb

This results in the following "filtered" counts tables.

For V. cholerae:

For E. coli:

Similar files are also available in the codon_usage_first30 folder.

Computing various usage biases

The actual computation of codon usage biases is performed in the following Jupyter notebooks.

The notebooks can be executed on the command-line as follows:

jupyter nbconvert --execute --to html Explore_usage_biases_Vcholerae.ipynb
jupyter nbconvert --execute --to html Explore_usage_biases_Vcholerae_first30.ipynb
jupyter nbconvert --execute --to html Explore_usage_biases_Ecoli.ipynb
jupyter nbconvert --execute --to html Explore_usage_biases_Ecoli_first30.ipynb

This results in the following standardized codon biases tables:

For V. cholerae:

For E. coli:

Similar files are also available in the codon_usage_first30 folder.

Clustering genes based on their codon usage biases

The above notebooks (Explore_usage_biases_Vcholerae.ipynb and Explore_usage_biases_Ecoli.ipynb, and their _first30 versions) are also used to try clustering methods. Each method is applied separately for each amino-acid, on the "by amino-acid codon biases", with as many clusters as there are codons for this amino-acid:

  • KMeans (previously tried for V. cholerae only, not used any more).

  • A method inspired by KMeans (i.e. using the same distance type, the squared euclidean distance), but with fixed "centroids" defined by computing standardized usage biases assuming exclusive use of one of the possible codons ("full bias"). Each gene is simply assigned to the cluster whose centroid it is closest to.

  • Assigning a gene to a cluster labelled by the codon with the highest bias for this gene ("highest bias").

The second and third methods make it easier to associate each cluster to a preferred codon.
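
The "full bias" and "highest bias" assignment rules described above can be sketched in a few lines of Python. This is a minimal sketch, not the code of the notebooks: the function names, the centroid values and the gene biases are hypothetical, and how the standardized biases and the centroids are actually computed is not shown here.

```python
def squared_euclidean(u, v):
    """Squared euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign_full_bias(gene_biases, centroids):
    """Label a gene with the codon whose fixed centroid is closest."""
    return min(
        centroids,
        key=lambda codon: squared_euclidean(gene_biases, centroids[codon]),
    )


def assign_highest_bias(gene_biases, codons):
    """Label a gene with the codon for which its bias is highest."""
    return max(zip(codons, gene_biases), key=lambda pair: pair[1])[0]


# Hypothetical example for an amino-acid with two codons.
codons = ["AAA", "AAG"]
# Made-up centroids standing in for the "exclusive use of one codon" biases.
centroids = {"AAA": [2.0, -2.0], "AAG": [-1.0, 1.0]}
gene = [0.2, 0.1]  # made-up biases of one gene for AAA and AAG

print(assign_full_bias(gene, centroids))
print(assign_highest_bias(gene, codons))
```

With these made-up numbers the two rules disagree for the example gene, which illustrates how the methods can produce different cluster compositions.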

Comparing cluster compositions on one example showed that the methods give different results. All methods seem to yield reasonable results, judging by the differentiated codon usage bias distributions that are obtained when drawing violin plots for each cluster.

For the sake of clarity, the KMeans method was not used for E. coli.

The cluster assignments are available in the following tables:

For V. cholerae:

For E. coli:

For each method and each amino-acid there is one column containing the cluster assignments. The column names follow the pattern cluster_<aa>_<method> where <aa> is the one-letter code of an amino-acid, and <method> is either full_bias or highest_bias.
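
When reading these tables programmatically, the column naming pattern can be decoded with a regular expression. The helper below is a hypothetical convenience, not part of the repository; it only assumes the naming pattern stated above.

```python
import re

# cluster_<aa>_<method>, with <aa> a one-letter amino-acid code.
CLUSTER_COLUMN = re.compile(
    r"^cluster_(?P<aa>[A-Z])_(?P<method>full_bias|highest_bias)$"
)


def parse_cluster_column(name):
    """Return (amino_acid, method) for a cluster column, else None."""
    match = CLUSTER_COLUMN.match(name)
    if match is None:
        return None
    return match["aa"], match["method"]


print(parse_cluster_column("cluster_K_full_bias"))
```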

Similar files are also available in the codon_usage_first30 folder.

Extracting lists of genes for each cluster

This is done in the following notebooks:

The notebooks can be executed on the command-line as follows:

jupyter nbconvert --execute --to html Extract_gene_lists_Vcholerae.ipynb
jupyter nbconvert --execute --to html Extract_gene_lists_Vcholerae_first30.ipynb
jupyter nbconvert --execute --to html Extract_gene_lists_Ecoli.ipynb
jupyter nbconvert --execute --to html Extract_gene_lists_Ecoli_first30.ipynb

These notebooks contain links to gene lists as well as violin plots representing the distribution of codon usage biases within each cluster.

Extracting most positively biased genes from each cluster

A further processing step was added to the above notebooks. It consists in defining a threshold that cuts the distribution of codon usage biases for a given gene cluster at a "valley" of low density in this distribution. Genes above this threshold are the most positively biased genes for the codon around which the cluster is built.
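
One possible way to locate such a "valley" is sketched below. This is an assumption-laden illustration, not the method used in the notebooks: it estimates the density with a Gaussian kernel of arbitrary bandwidth, then takes the point of lowest density between the two highest density peaks. The function names, the bandwidth and the bias values are all hypothetical.

```python
import math


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the grid points."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm
        for x in grid
    ]


def valley_threshold(values, bandwidth=0.4, n_points=200):
    """Return the lowest-density point between the two highest density peaks.

    Returns None when fewer than two peaks are found.
    """
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    density = gaussian_kde(values, grid, bandwidth)
    # Interior local maxima of the estimated density.
    peaks = [
        i for i in range(1, n_points - 1)
        if density[i - 1] < density[i] >= density[i + 1]
    ]
    if len(peaks) < 2:
        return None
    # Keep the two highest peaks, ordered from left to right.
    left, right = sorted(sorted(peaks, key=lambda i: density[i])[-2:])
    valley = min(range(left, right + 1), key=lambda i: density[i])
    return grid[valley]


# Made-up biases for the genes of one cluster.
biases = [-0.5, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.4, 2.8, 3.0, 3.1, 3.3]
threshold = valley_threshold(biases)
top_genes = [b for b in biases if b > threshold]
print(threshold, top_genes)
```

The bandwidth strongly influences how many peaks and valleys are detected, so any threshold obtained this way should be checked against the plotted distribution.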

Material and Methods

See doc/Material_and_methods.md.