Commit 45c23c03 authored by Blaise Li

Fix yaml format for libtype_info.yaml

parent bfa5e8c8
@@ -6,12 +6,12 @@ GRO-seq:
rule: make_normalized_bigwig
snakefile: /pasteur/homes/bli/src/bioinfo_utils/GRO-seq/GRO-seq.snakefile
processing_steps:
"The 3' adaptor (TGGAATTCTCGGGTGCCAAGG) was trimmed from the raw reads using cutadapt (version 1.15). The reads where the adaptor was not found were nonetheless kept (\"untrimmed\")"
"The 5' and 3' 4 nt UMIs were removed from the trimmed reads using cutadapt (version 1.15) with options -u 4 and -u -4"
"The 5' 4 nt UMIs as well as 7 nt in 3' (to be sure to eliminate possibly unrecognized first adaptor bases, as well as preceding 3' 4 nt UMIs) were removed from the \"untrimmed\" reads using cutadapt (version 1.15) options -u 4 and -u -7"
"Both types of reads were mapped on the C. elegans genome (WBcel235) using bowtie2 (version 2.3.4.1) with default parameters, and the resulting alignment files were merged using samtools (version 1.3.1)"
"Mapped reads were used to estimate the abundance of protein coding genes using featureCounts (version 1.5.2) with options -O -M --primary -s 1 --fracOverlap 0 and annotations corresponding to protein coding genes (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)"
"The merged alignment was used to generate the normalized bigwig file using millions of summed forward reads per kilobase in protein coding genes as normalizer. This was done with a custom bash script using bedtools (version 2.27.1), bedops (version 2.4.26) and bedGraphToBigWig (version 4)"
"The 3' adaptor (TGGAATTCTCGGGTGCCAAGG) was trimmed from the raw reads using cutadapt (version 1.15). The reads where the adaptor was not found were nonetheless kept (\"untrimmed\")
The 5' and 3' 4 nt UMIs were removed from the trimmed reads using cutadapt (version 1.15) with options -u 4 and -u -4
The 5' 4 nt UMIs as well as 7 nt in 3' (to be sure to eliminate possibly unrecognized first adaptor bases, as well as preceding 3' 4 nt UMIs) were removed from the \"untrimmed\" reads using cutadapt (version 1.15) options -u 4 and -u -7
Both types of reads were mapped on the C. elegans genome (WBcel235) using bowtie2 (version 2.3.4.1) with default parameters, and the resulting alignment files were merged using samtools (version 1.3.1)
Mapped reads were used to estimate the abundance of protein coding genes using featureCounts (version 1.5.2) with options -O -M --primary -s 1 --fracOverlap 0 and annotations corresponding to protein coding genes (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)
The merged alignment was used to generate the normalized bigwig file using millions of summed forward reads per kilobase in protein coding genes as normalizer. This was done with a custom bash script using bedtools (version 2.27.1), bedops (version 2.4.26) and bedGraphToBigWig (version 4)"
genome_build:
"C. elegans ce11 (WBcel235)"
processed_files:
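
The GRO-seq steps in this hunk rely on cutadapt's `-u` option, where a positive value removes bases from the start of a read and a negative value removes them from the end. The sketch below only illustrates that clipping arithmetic for the `-u 4 -u -4` and `-u 4 -u -7` cases; `clip_read` is a hypothetical helper, not part of the pipeline or of cutadapt, and the handling of reads shorter than the clipped length is an assumption.

```python
def clip_read(seq, qual, five_prime, three_prime):
    """Drop `five_prime` bases from the start and `three_prime` from the end.

    Hypothetical helper mirroring `cutadapt -u five_prime -u -three_prime`.
    Reads too short to survive clipping are returned empty (an assumption).
    """
    start = min(five_prime, len(seq))
    end = max(len(seq) - three_prime, start)
    return seq[start:end], qual[start:end]


# Made-up read: 4 nt 5' UMI + 10 nt insert + 4 nt 3' UMI
read = "ACGT" + "TTGCAAGGCC" + "TGCA"
quals = "I" * len(read)

# Adaptor found and trimmed: remove both 4 nt UMIs (-u 4 -u -4)
trimmed_seq, trimmed_qual = clip_read(read, quals, 4, 4)

# "Untrimmed" read: remove 4 nt in 5' and 7 nt in 3' (-u 4 -u -7)
untrimmed_seq, untrimmed_qual = clip_read(read, quals, 4, 7)
```

The quality string is sliced with the same bounds as the sequence so that the two stay aligned, as they must in a FASTQ record.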
@@ -24,9 +24,9 @@ RNA-seq:
rule: make_normalized_bigwig
snakefile: /pasteur/homes/bli/src/bioinfo_utils/RNA-seq/RNA-seq.snakefile
processing_steps:
"Reads were mapped on the C. elegans genome (WBcel235) using hisat2 (version 2.0.4) with default parameters"
"Mapped reads were used to estimate the abundance of protein coding genes using featureCounts (version 1.5.2) with options -O -M --primary -s 2 --fracOverlap 0 and annotations corresponding to protein coding genes (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)"
"The alignment was used to generate the normalized bigwig file using millions of summed forward reads per kilobase in protein coding genes as normalizer. This was done with a custom bash script using bedtools (version 2.27.1), bedops (version 2.4.26) and bedGraphToBigWig (version 4)"
"Reads were mapped on the C. elegans genome (WBcel235) using hisat2 (version 2.0.4) with default parameters
Mapped reads were used to estimate the abundance of protein coding genes using featureCounts (version 1.5.2) with options -O -M --primary -s 2 --fracOverlap 0 and annotations corresponding to protein coding genes (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)
The alignment was used to generate the normalized bigwig file using millions of summed forward reads per kilobase in protein coding genes as normalizer. This was done with a custom bash script using bedtools (version 2.27.1), bedops (version 2.4.26) and bedGraphToBigWig (version 4)"
genome_build:
"C. elegans ce11 (WBcel235)"
processed_files:
@@ -39,18 +39,18 @@ sRNA-IP-seq:
rule: make_normalized_bigwig
snakefile: /pasteur/homes/bli/src/bioinfo_utils/sRNA-seq/sRNA-seq.snakefile
processing_steps:
"The 3' adaptor (TGGAATTCTCGGGTGCCAAGG) was trimmed from the raw reads using cutadapt (version 1.15)"
"The trimmed reads were sorted by sequence using fastq-sort (from fastq-tools version 0.8) with option -s and deduplicated using a custom haskell program, keeping the highest quality among duplicates, at any given position"
"The 5' and 3' 4 nt UMIs were removed from the deduplicated reads using cutadapt (version 1.15) with options -u 4 and -u -4"
"After removing UMIs, the reads from 18 to 24 nt were selected using bioawk version 20110810"
"The size-selected reads were mapped on the C. elegans genome (WBcel235) using bowtie2 (version 2.3.4.1) with options -L 6 -i S,1,0.8 -N 0"
"The reads that failed to map were inspected using grep -E -B 1 -A 2 \"^G[ACGTN]{20,25}T+$\" to detect possible reads starting with G with 20 to 25 nt followed by a poly-U tail that might have prevented the mapping, and this tail was removed from such reads using a custom haskell program before re-mapping them."
"Mapped and remapped reads were used to estimate the abundance of small RNAs derived from structural RNAs using featureCounts (version 1.5.2) with options -O -s 1 --fracOverlap 1 and annotations corresponding to tRNA, snRNA, snoRNA, rRNA or RNA (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)"
"The abundance of non-structural RNAs was estimated by subtracting the above counts from the number of mapped and remapped reads."
"Initially mapped reads were classified using a custom python program according to their length, composition and on the annotations on which they mapped. Reads that didn't match miRNA and piRNA annotations were considered as potential endo-siRNAs."
"The potential endo-siRNAs of size 21 to 23 nt that started with G were classified as \"si_22G\" if they mapped antisense to annotation belonging to the following categories: DNA transposons, RNA transposons, satellites, simple repeats (as annotated in http://hgdownload.cse.ucsc.edu/goldenPath/ce11/database/rmsk.txt.gz) or pseudogene or protein-coding genes (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)"
"The \"si_22G\" reads were re-mapped on the C. elegans genome (WBcel235) using bowtie2 (version 2.3.4.1) with options -L 6 -i S,1,0.8 -N 0"
"The resulting alignment was used to generate the normalized bigwig file using millions of non-structural RNAs as normalizer. This was done with a custom bash script using bedtools (version 2.27.1), bedops (version 2.4.26) and bedGraphToBigWig (version 4)"
"The 3' adaptor (TGGAATTCTCGGGTGCCAAGG) was trimmed from the raw reads using cutadapt (version 1.15)
The trimmed reads were sorted by sequence using fastq-sort (from fastq-tools version 0.8) with option -s and deduplicated using a custom haskell program, keeping the highest quality among duplicates, at any given position
The 5' and 3' 4 nt UMIs were removed from the deduplicated reads using cutadapt (version 1.15) with options -u 4 and -u -4
After removing UMIs, the reads from 18 to 24 nt were selected using bioawk version 20110810
The size-selected reads were mapped on the C. elegans genome (WBcel235) using bowtie2 (version 2.3.4.1) with options -L 6 -i S,1,0.8 -N 0
The reads that failed to map were inspected using grep -E -B 1 -A 2 \"^G[ACGTN]{20,25}T+$\" to detect possible reads starting with G with 20 to 25 nt followed by a poly-U tail that might have prevented the mapping, and this tail was removed from such reads using a custom haskell program before re-mapping them.
Mapped and remapped reads were used to estimate the abundance of small RNAs derived from structural RNAs using featureCounts (version 1.5.2) with options -O -s 1 --fracOverlap 1 and annotations corresponding to tRNA, snRNA, snoRNA, rRNA or RNA (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)
The abundance of non-structural RNAs was estimated by subtracting the above counts from the number of mapped and remapped reads.
Initially mapped reads were classified using a custom python program according to their length, composition and on the annotations on which they mapped. Reads that didn't match miRNA and piRNA annotations were considered as potential endo-siRNAs.
The potential endo-siRNAs of size 21 to 23 nt that started with G were classified as \"si_22G\" if they mapped antisense to annotation belonging to the following categories: DNA transposons, RNA transposons, satellites, simple repeats (as annotated in http://hgdownload.cse.ucsc.edu/goldenPath/ce11/database/rmsk.txt.gz) or pseudogene or protein-coding genes (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)
The \"si_22G\" reads were re-mapped on the C. elegans genome (WBcel235) using bowtie2 (version 2.3.4.1) with options -L 6 -i S,1,0.8 -N 0
The resulting alignment was used to generate the normalized bigwig file using millions of non-structural RNAs as normalizer. This was done with a custom bash script using bedtools (version 2.27.1), bedops (version 2.4.26) and bedGraphToBigWig (version 4)"
genome_build:
"C. elegans ce11 (WBcel235)"
processed_files:
@@ -63,18 +63,18 @@ sRNA-seq:
rule: make_normalized_bigwig
snakefile: /pasteur/homes/bli/src/bioinfo_utils/sRNA-seq/sRNA-seq.snakefile
processing_steps:
"The 3' adaptor (TGGAATTCTCGGGTGCCAAGG) was trimmed from the raw reads using cutadapt (version 1.15)"
"The trimmed reads were sorted by sequence using fastq-sort (from fastq-tools version 0.8) with option -s and deduplicated using a custom haskell program, keeping the highest quality among duplicates, at any given position"
"The 5' and 3' 4 nt UMIs were removed from the deduplicated reads using cutadapt (version 1.15) with options -u 4 and -u -4"
"After removing UMIs, the reads from 18 to 24 nt were selected using bioawk version 20110810"
"The size-selected reads were mapped on the C. elegans genome (WBcel235) using bowtie2 (version 2.3.4.1) with options -L 6 -i S,1,0.8 -N 0"
"The reads that failed to map were inspected using grep -E -B 1 -A 2 \"^G[ACGTN]{20,25}T+$\" to detect possible reads starting with G with 20 to 25 nt followed by a poly-U tail that might have prevented the mapping, and this tail was removed from such reads using a custom haskell program before re-mapping them."
"Mapped and remapped reads were used to estimate the abundance of structural RNAs using featureCounts (version 1.5.2) with options -O -s 1 --fracOverlap 1 and annotations corresponding to tRNA, snRNA, snoRNA, rRNA or RNA (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)"
"The abundance of non-structural RNAs was estimated by subtracting the above counts from the number of mapped and remapped reads."
"Initially mapped reads were classified using a custom python program according to their length, composition and on the annotations on which they mapped. Reads that didn't match miRNA and piRNA annotations were considered as potential endo-siRNAs."
"The potential endo-siRNAs of size 21 to 23 nt that started with G were classified as \"si_22G\" if they mapped antisense to annotation belonging to the following categories: DNA transposons, RNA transposons, satellites, simple repeats (as annotated in http://hgdownload.cse.ucsc.edu/goldenPath/ce11/database/rmsk.txt.gz) or pseudogene or protein-coding genes (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)"
"The \"si_22G\" reads were re-mapped on the C. elegans genome (WBcel235) using bowtie2 (version 2.3.4.1) with options -L 6 -i S,1,0.8 -N 0"
"The resulting alignment was used to generate the normalized bigwig file using millions of non-structural RNAs as normalizer. This was done with a custom bash script using bedtools (version 2.27.1), bedops (version 2.4.26) and bedGraphToBigWig (version 4)"
"The 3' adaptor (TGGAATTCTCGGGTGCCAAGG) was trimmed from the raw reads using cutadapt (version 1.15)
The trimmed reads were sorted by sequence using fastq-sort (from fastq-tools version 0.8) with option -s and deduplicated using a custom haskell program, keeping the highest quality among duplicates, at any given position
The 5' and 3' 4 nt UMIs were removed from the deduplicated reads using cutadapt (version 1.15) with options -u 4 and -u -4
After removing UMIs, the reads from 18 to 24 nt were selected using bioawk version 20110810
The size-selected reads were mapped on the C. elegans genome (WBcel235) using bowtie2 (version 2.3.4.1) with options -L 6 -i S,1,0.8 -N 0
The reads that failed to map were inspected using grep -E -B 1 -A 2 \"^G[ACGTN]{20,25}T+$\" to detect possible reads starting with G with 20 to 25 nt followed by a poly-U tail that might have prevented the mapping, and this tail was removed from such reads using a custom haskell program before re-mapping them.
Mapped and remapped reads were used to estimate the abundance of structural RNAs using featureCounts (version 1.5.2) with options -O -s 1 --fracOverlap 1 and annotations corresponding to tRNA, snRNA, snoRNA, rRNA or RNA (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)
The abundance of non-structural RNAs was estimated by subtracting the above counts from the number of mapped and remapped reads.
Initially mapped reads were classified using a custom python program according to their length, composition and on the annotations on which they mapped. Reads that didn't match miRNA and piRNA annotations were considered as potential endo-siRNAs.
The potential endo-siRNAs of size 21 to 23 nt that started with G were classified as \"si_22G\" if they mapped antisense to annotation belonging to the following categories: DNA transposons, RNA transposons, satellites, simple repeats (as annotated in http://hgdownload.cse.ucsc.edu/goldenPath/ce11/database/rmsk.txt.gz) or pseudogene or protein-coding genes (as annotated in the iGenome distribution of WBcel235 obtained at ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/Ensembl/WBcel235/Caenorhabditis_elegans_Ensembl_WBcel235.tar.gz)
The \"si_22G\" reads were re-mapped on the C. elegans genome (WBcel235) using bowtie2 (version 2.3.4.1) with options -L 6 -i S,1,0.8 -N 0
The resulting alignment was used to generate the normalized bigwig file using millions of non-structural RNAs as normalizer. This was done with a custom bash script using bedtools (version 2.27.1), bedops (version 2.4.26) and bedGraphToBigWig (version 4)"
genome_build:
"C. elegans ce11 (WBcel235)"
processed_files:
......
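
The sRNA-seq steps above select unmapped reads with the pattern `^G[ACGTN]{20,25}T+$` and then remove the poly-U tail (read as T in the sequence) before re-mapping. The actual trimming is done by a custom Haskell program that is not shown in this commit, so the following is only a minimal sketch of what such a step could look like. It assumes that the whole trailing run of T is removed; `strip_poly_t` is a hypothetical name.

```python
import re

# Same pattern as the grep call quoted in the processing steps
POLY_U_CANDIDATE = re.compile(r"^G[ACGTN]{20,25}T+$")


def strip_poly_t(seq, qual):
    """Remove the trailing run of T from reads matching the pattern.

    Assumption: the entire trailing T run is trimmed. The real tool may
    apply a different rule. Non-matching reads are returned unchanged.
    """
    if not POLY_U_CANDIDATE.match(seq):
        return seq, qual
    trimmed = seq.rstrip("T")
    # Keep the quality string the same length as the sequence
    return trimmed, qual[:len(trimmed)]


# Made-up examples: a G-initial read with a TTTT tail, and an A-initial one
body = "ACGAC" * 4
tailed = "G" + body + "TTTT"
other = "A" + body + "TTTT"

tailed_result = strip_poly_t(tailed, "I" * len(tailed))
other_result = strip_poly_t(other, "I" * len(other))
```

In the grep call, `-B 1 -A 2` prints one line before and two lines after each matching sequence line, which recovers the full four-line FASTQ record (header, sequence, separator, qualities).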