.. jass_preprocessing documentation master file, created by
   sphinx-quickstart on Wed Nov  7 11:03:55 2018.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to jass_preprocessing's documentation!
==============================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

What is jass preprocessing?
===========================
JASS preprocessing is a tool that takes heterogeneous GWAS summary statistics as input, standardizes them, and runs quality checks. It outputs standardized summary statistic files that can be used as input to the JASS Python package and the RAISS imputation package.


Overview
========
The QC and preprocessing steps proceed as follows:

* Map the columns of each GWAS to standardized names
* Select the GWAS SNPs that are in the input reference panel
* Align the coded allele of the GWAS with the reference panel
* Infer the sample size per SNP if it is not present in the input data
* Filter out SNPs with a small sample size
* Normalize the effect size by sample size to obtain Z-scores
* Save the output by chromosome, as in the following example:

+----------+-------+------+-----+--------+
| rsID     | pos   | A0   | A1  |  Z     |
+==========+=======+======+=====+========+
| rs6548219| 30762 | A    | G   | -1.133 |
+----------+-------+------+-----+--------+

* (Optional) Save the output to one file with a chromosome column
  (the input format needed to perform LD-score regression):

+-------+-----------+---------+----+----+------+------+
| chrom | rsID      | pos     | A0 | A1 | Z    | P    |
+=======+===========+=========+====+====+======+======+
| 1     | rs4075116 | 1003629 | C  | T  | 0.30 | 0.76 |
+-------+-----------+---------+----+----+------+------+
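
The core of the steps listed above can be sketched in plain Python. This is a minimal illustration and not the package's implementation: the function names ``p_to_z`` and ``preprocess``, the ``min_n_fraction`` threshold, the toy rows, and the convention that ``a1`` is the coded allele and Z is oriented toward the panel's ``alt`` allele are all assumptions. The sketch also derives Z from a two-sided p-value and the sign of the effect, which is one common approach and may differ from what the package does.

.. code-block:: python

  from statistics import NormalDist


  def p_to_z(pval, sign):
      """Convert a two-sided p-value (0 < pval <= 1) and a sign into a Z-score."""
      return sign * NormalDist().inv_cdf(1.0 - pval / 2.0)


  def preprocess(gwas_rows, ref_panel, min_n_fraction=0.7):
      """Return {snp_id: (ref, alt, z)} for the SNPs that pass every step."""
      # Keep only SNPs that are present in the reference panel.
      rows = [r for r in gwas_rows if r["snpid"] in ref_panel]
      if not rows:
          return {}
      # Hypothetical rule: drop SNPs whose N is small relative to the largest N.
      n_max = max(r["n"] for r in rows)
      result = {}
      for r in rows:
          if r["n"] < min_n_fraction * n_max:
              continue
          ref, alt = ref_panel[r["snpid"]]
          if (r["a1"], r["a2"]) == (alt, ref):
              flip = 1.0   # coded allele already matches the panel's alt allele
          elif (r["a1"], r["a2"]) == (ref, alt):
              flip = -1.0  # alleles are swapped, so the effect sign is flipped
          else:
              continue     # alleles disagree with the panel: discard the SNP
          effect_sign = 1.0 if r["beta"] >= 0 else -1.0
          result[r["snpid"]] = (ref, alt, p_to_z(r["pval"], flip * effect_sign))
      return result


  # Toy data: two SNP IDs come from the tables above, all other values are made up.
  ref_panel = {
      "rs4075116": ("C", "T"),
      "rs6548219": ("A", "G"),
      "rs_low_n": ("G", "T"),
  }
  gwas_rows = [
      {"snpid": "rs4075116", "a1": "T", "a2": "C", "beta": 0.01, "pval": 0.76, "n": 1000},
      {"snpid": "rs6548219", "a1": "A", "a2": "G", "beta": 0.02, "pval": 0.05, "n": 950},
      {"snpid": "rs_absent", "a1": "A", "a2": "C", "beta": 0.03, "pval": 0.01, "n": 1000},
      {"snpid": "rs_low_n", "a1": "T", "a2": "G", "beta": 0.05, "pval": 0.20, "n": 100},
  ]

  zscores = preprocess(gwas_rows, ref_panel)
  for snp_id, (ref, alt, z) in sorted(zscores.items()):
      print(snp_id, ref, alt, round(z, 3))

In this sketch, ``rs_absent`` is dropped because it is missing from the panel, and ``rs_low_n`` is dropped by the sample size filter. The remaining SNPs are reported with the panel's allele order.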


Installation
============

In a terminal, execute the following lines:

.. code-block:: shell

  pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/JASS_Pre-processing

Input
======

* A reference panel (1000 Genomes format). The user is expected to provide the reference panel as a TSV file with the following columns, in this order and without a header:

+-----+-----+------------+-----+-----+---------+
| chr | pos |   snp_id   | ref | alt |   MAF   |
+=====+=====+============+=====+=====+=========+
|  1  |13116| rs62635286 |  T  |  G  |0.0970447|
+-----+-----+------------+-----+-----+---------+
|  1  |13118| rs200579949|  A  |  G  |0.0970447|
+-----+-----+------------+-----+-----+---------+
|  1  |14604| rs541940975|  A  |  G  | 0.147564|
+-----+-----+------------+-----+-----+---------+
|  1  |14930| rs75454623 |  A  |  G  | 0.482228|
+-----+-----+------------+-----+-----+---------+
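
As an illustration of the panel layout above, here is a minimal sketch that parses such a header-less TSV with the standard library. The helper name ``read_reference_panel`` and the choice to key the result by ``snp_id`` are assumptions made for this example and are not part of the package.

.. code-block:: python

  import csv
  import io

  # Column order taken from the table above (the file itself has no header).
  PANEL_COLUMNS = ["chr", "pos", "snp_id", "ref", "alt", "MAF"]


  def read_reference_panel(handle):
      """Parse a header-less, tab-separated panel into a dict keyed by snp_id."""
      panel = {}
      for fields in csv.reader(handle, delimiter="\t"):
          if not fields:
              continue  # skip blank lines
          row = dict(zip(PANEL_COLUMNS, fields))
          panel[row["snp_id"]] = {
              "chr": row["chr"],
              "pos": int(row["pos"]),
              "ref": row["ref"],
              "alt": row["alt"],
              "MAF": float(row["MAF"]),
          }
      return panel


  # The first two rows of the example table, written as TSV text.
  example = (
      "1\t13116\trs62635286\tT\tG\t0.0970447\n"
      "1\t13118\trs200579949\tA\tG\t0.0970447\n"
  )
  panel = read_reference_panel(io.StringIO(example))
  print(panel["rs62635286"])

For a real file, the same function can be given an open file handle instead of the ``io.StringIO`` object.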

* A folder containing all raw GWAS data (all chromosomes in one file). The minimal format requirements, for example whether the files must be tab-separated, are not specified here.
* A list of the GWAS file names, given as strings.
* A descriptor CSV file that describes each GWAS summary statistics file. It contains:

  * a header
  * 1 line per study
  * the fields are:


+-------------------------------------------+------------------------------------------------------------+
|                     category              |                         field name                         |
+===========================================+============================================================+
|             path to the data              |                            filename                        |
+-------------------------------------------+------------------------------------------------------------+
|            study info fields              | consortia,outcome,fullName,type,Nsample,Ncase,Ncontrol,Nsnp|
+-------------------------------------------+------------------------------------------------------------+
|    names of the header in the GWAS file   |      snpid,a1,a2,freq,pval,n,z,OR,se,code,imp,ncas,ncont   |
+-------------------------------------------+------------------------------------------------------------+

.. Give an example
.. |               I don't know                 |                          altNcas,altNcont|


Here is an example of a descriptor entry. Fields that are irrelevant for the study (for example, the odds ratio for a continuous trait) must be filled with na.

.. csv-table:: GWAS information table
   :header: "filename","consortia","outcome","fullName","type","Nsample","Ncase","Ncontrol","Nsnp","snpid","a1","a2","freq","pval","n","z","OR","se","code","imp","ncas","ncont"

   "GIANT_HEIGHT_Wood_et_al.txt","GIANT","HEIGHT","Height","Anthropometry",253288,na,na,2550858,"MarkerName","Allele1","Allele2","Freq.Allele1.HapMapCEU","p","N","b",na,"SE",na,na,na,na
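
To make the role of the descriptor concrete, the sketch below reads the example entry above and builds the mapping from the study's own column names to the standardized names, skipping fields set to ``na``. The helper name ``column_mapping`` and the assumption that the descriptor is comma-delimited are illustrative only, and the package may handle this differently.

.. code-block:: python

  import csv
  import io

  # Descriptor fields that hold the column names used in the GWAS file.
  HEADER_FIELDS = ["snpid", "a1", "a2", "freq", "pval", "n", "z", "OR", "se",
                   "code", "imp", "ncas", "ncont"]


  def column_mapping(descriptor_row):
      """Map raw GWAS column names to standardized names, skipping 'na' fields."""
      mapping = {}
      for std_name in HEADER_FIELDS:
          raw_name = (descriptor_row.get(std_name) or "na").strip()
          if raw_name and raw_name.lower() != "na":
              mapping[raw_name] = std_name
      return mapping


  # The example descriptor entry from the table above, as CSV text.
  DESCRIPTOR = (
      "filename,consortia,outcome,fullName,type,Nsample,Ncase,Ncontrol,Nsnp,"
      "snpid,a1,a2,freq,pval,n,z,OR,se,code,imp,ncas,ncont\n"
      "GIANT_HEIGHT_Wood_et_al.txt,GIANT,HEIGHT,Height,Anthropometry,253288,"
      "na,na,2550858,MarkerName,Allele1,Allele2,Freq.Allele1.HapMapCEU,"
      "p,N,b,na,SE,na,na,na,na\n"
  )

  studies = list(csv.DictReader(io.StringIO(DESCRIPTOR), skipinitialspace=True))
  mapping = column_mapping(studies[0])
  print(studies[0]["filename"], mapping)

The resulting dictionary could then be used to rename the columns of the raw GWAS file before the other preprocessing steps are applied.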


Command line usage example:
============================

The complete preprocessing (all steps described in the Overview section) can be launched from the command line:

.. argparse::
  :ref: jass_preprocessing.__main__.add_preprocessing_argument
  :prog: jass_preprocessing

Indices and tables
==================


* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

.. automodule:: impute_jass
   :members: