diff --git a/doc/source/index.rst b/doc/source/index.rst
index a13ff110ae135fe107ea73948474cc9cd68275fb..c21c5b3b87f078b80aeaaeaed8286b25fe1a3306 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -22,7 +22,7 @@ Overview
 ========
 
 The QC and preprocessing step goes as follow:
-* map column from of a specific GWAS to standardize names
+* Map column from of a specific GWAS to standardize names
 * Select GWAS SNPs that are in the input reference panel
 * Align coded allele of the GWAS with the reference panel
 * Infer Number of sample by SNPs if not present in input data
@@ -58,14 +58,13 @@ In a terminal, execute the following lines:
 Input
 ======
 
-* A reference panel (1000 genome format). The user is expected to provide a reference panel
-  in tsv format with the following columns in the following order, without header:
+* **A reference panel** to the format below. The user is expected to provide a reference panel
+  in tsv format with the following columns in the following order (chromosome, rsID, Minor Allele
+  Frequency, Position, reference, Alternative allele), **without header**.
 
 +-----+------------+---------+-------+-----+-----+
-| chr | snp_id     | MAF     | pos   | ref | alt |
-+=====+============+=========+=======+=====+=====+
 | 1   | rs62635286 |0.0970447| 13116 | T   | G   |
-+-----+------------+---------+-------+-----+-----+
++=====+============+=========+=======+=====+=====+
 | 1   | rs63125786 |0.0970447| 15116 | T   | A   |
 +-----+------------+---------+-------+-----+-----+
 | 1   | rs5686     |0.1970447| 17116 | A   | G   |
@@ -74,31 +73,47 @@ Input
 +-----+------------+---------+-------+-----+-----+
 
 
-* Folder containing all raw gwas data : (all chromosomes in one file) (minimal conditions?? tab separated?)
-* a list containing the name of GWAS file to the string format.
-* A descriptor csv files that will described each GWAS summary statistic files:
-
+* The **GWAS Folder** containing all raw gwas data (correspond to the --gwas-info command line parameter): all chromosomes in one file, compressed or uncompressed
+* A descriptor csv files (see example below and `here <https://gitlab.pasteur.fr/statistical-genetics/jass_suite_pipeline/-/blob/master/input_files/Data_test_EAS.csv>`_)that will described each GWAS summary statistic files (correspond to the --input-folder command line parameter):
   * a header
   * 1 line per study
-  * the fields are:
-
+  * the fields categories are:
 
 +-------------------------------------------+---------------------------------------------------------------+
 | category                                  | field name                                                    |
 +===========================================+===============================================================+
 | path to the data                          | filename                                                      |
 +-------------------------------------------+---------------------------------------------------------------+
-| study info fields                         | Consortium,Outcome,fullName,type,Nsample,Ncase,Ncontrol,Nsnp  |
+| study info fields                         | Consortium,Outcome,fullName,Nsample,Ncase,Ncontrol,Nsnp       |
 +-------------------------------------------+---------------------------------------------------------------+
 | names of the header in the GWAS file      | snpid,a1,a2,freq,pval,n,z,OR,se,code,imp,ncas,ncont           |
 +-------------------------------------------+---------------------------------------------------------------+
 
-.. Give an example
-.. | I don't know | altNcas,altNcont|
-
-Note that the combination of Consortium and outcome must be unique because it will be used as an index in the cleaning process.
-
-Here is an example of descriptor field, the field irrelevant (for example odd ratio for continuous trait) for the study must be filled with na.
+**Study field definition**:
+* filename: gwas summary statistic name as it appear in the **GWAS folder**
+* Consortium : the Consortium of the study (can also be the category of the trait) in upper case and without _ characters
+* Outcome: a short tag for the Outcome of the study in upper case and without _ characters
+* FullName: full description of the trait (for your own information not used in the cleaning process)
+* Nsample: Number of sample in the study
+* Ncase: Number of cases in the study (left empty if trait is continuous)
+* Ncontrol: Number of control in the study (left empty if trait is continuous)
+**Field corresping to column names in the summary statistic**
+* snpid: name of the column storing rsid in the gwas file
+* POS: name of the column storing the position in the gwas file
+* CHR: name of the column storing the chromosome in the gwas file
+* a1: effect allele
+* a2: Other allele
+* freq: name of the column storing the minor allele frequence in the gwas file
+* pval: name of the column storing the pvalue in the gwas file
+* n: name of the column storing the sample size by variants (optional, will be inferred from the MAF, genetic effect and standard deviation if absent)
+* z: name of the column storing the genetic effect (beta) in the gwas file
+* index-type: precise the type of index
+
+
+.. warning::
+    Note that the combination of Consortium and outcome must be unique because as it will be used as an index in the cleaning process.
+
+Here is an example of descriptor field, the field irrelevant (for example odd ratio for continuous trait) for the study must be filled with na or left empty.
 Some fields are optional like the imputation_quality. If not used they can be filled with na.
 
 .. csv-table:: GWAS information table
@@ -108,8 +123,6 @@ Some fields are optional like the imputation_quality. If not used they can be fi
     "GIANT_HEIGHT_Wood_et_al.txt","GIANT","HEIGHT","Height","Anthropometry",253288, na, na, 2550858, "MarkerName", "Allele1", "Allele2", "Freq.Allele1.HapMapCEU","p","N","b",na,"SE",na,na,na,na, "imputationInfo","rs-number"
 
 
-
-
 
 Command line usage example:
 ============================