Update input_formats authored by Apolline GALLOIS
# Input format
- **Genotypes bim file**
This is a PLINK file with tab-separated columns and no header line. It contains one line per variant, with the following six fields: chromosome, variant identifier, position in morgans or centimorgans, base-pair coordinate, allele 1 and allele 2.
Example:
*(chromosome)* | *(variant identifier)* | *(position)* | *(base-pair coordinate)* | *(A1)* | *(A2)*
:---: | :-------: | :----: | :-----: | :---: | :---:
1 | rs123456 | 7568 | 15411 | A | T
5 | rs6715 | 89863 | 41347 | G | A
21 | rs75354 | 148962 | 305716 | C | A
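
As an illustration only, a bim line could be parsed in Python along the following lines. This is a minimal sketch based on the six fields listed above; the names `BimRecord` and `parse_bim_line` are hypothetical and are not part of PLINK or CMS.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    """One variant line of a bim file (hypothetical helper type)."""
    chromosome: str
    variant_id: str
    genetic_position: float  # morgans or centimorgans
    bp_coordinate: int
    allele1: str
    allele2: str


def parse_bim_line(line: str) -> BimRecord:
    """Split one tab-separated bim line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    chromosome, variant_id, position, coordinate, allele1, allele2 = fields
    return BimRecord(chromosome, variant_id, float(position),
                     int(coordinate), allele1, allele2)


# First line of the example table, written with tab separators.
record = parse_bim_line("1\trs123456\t7568\t15411\tA\tT\n")
```

A line with the wrong number of fields (for instance one separated by spaces instead of tabs) raises a `ValueError`, which makes formatting problems visible before the file is used.
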
- **Genotypes raw file**
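This is a PLINK file with space-separated columns and a header line. It contains one line per sample, with V+6 fields, where V is the number of variants: the six leading fields (FID, IID, PAT, MAT, SEX, PHENOTYPE) are followed by one column per variant.
To recode bed/bim/fam files into a raw file, run the following PLINK command: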
```bash
plink --bfile "$inputFile" --recodeA --out "$outputFile"
```
Example:
FID | IID | PAT | MAT | SEX | PHENOTYPE | SNP1 | SNP2 | SNP3 | ..........
:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
1 | 1 | 0 | 0 | 2 | 0 | 0 | 1 | 2 | ..........
2 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 2 | ..........
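
The V+6 layout described above can be checked with a short Python sketch such as the one below. It is only an illustration: the function name `read_raw` is hypothetical, samples are keyed by the (FID, IID) pair as an assumption, and missing genotypes are assumed to be written as `NA`.

```python
import io

N_LEADING_FIELDS = 6  # FID, IID, PAT, MAT, SEX, PHENOTYPE


def read_raw(handle):
    """Return (variant names, {(FID, IID): genotype list}) from a raw file."""
    header = handle.readline().split()
    if len(header) < N_LEADING_FIELDS:
        raise ValueError("header must contain at least 6 fields")
    variant_names = header[N_LEADING_FIELDS:]
    genotypes = {}
    for line_number, line in enumerate(handle, start=2):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != len(header):
            raise ValueError(
                f"line {line_number}: expected {len(header)} fields, "
                f"got {len(fields)}")
        # Assumption: a missing genotype is encoded as "NA".
        genotypes[(fields[0], fields[1])] = [
            None if value == "NA" else int(value)
            for value in fields[N_LEADING_FIELDS:]
        ]
    return variant_names, genotypes


# The two example samples above, restricted to three variants.
example = io.StringIO(
    "FID IID PAT MAT SEX PHENOTYPE SNP1 SNP2 SNP3\n"
    "1 1 0 0 2 0 0 1 2\n"
    "2 2 0 0 1 0 1 0 2\n"
)
variant_names, genotypes = read_raw(example)
```
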
- **Phenotypes file**
This is a text file with tab-separated columns and a header line. It contains one line per individual. The first column must be the individual ID.
Example:
ID | Sex | Age | LDL-C | HDL-C | HDL-D | HDL-TG | ..........
:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
1 | 1 | 45 | 0.1 | 0.48 | 0.85 | 0.89 | ..........
2 | 1 | 32 | 0.2 | 0.65 | 0.1 | 0.41 | ..........
3 | 2 | 47 | 0.8 | 0.21 | 0.5 | 0.3 | ..........
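
A phenotypes file in this layout could be loaded with the Python standard library as sketched below. The function name `read_phenotypes` is hypothetical, and two assumptions are made that the description above does not state: individual IDs are unique, and every value after the ID is numeric.

```python
import csv
import io


def read_phenotypes(handle):
    """Return (variable names, {individual ID: {variable: value}})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    variables = header[1:]  # the first column is the individual ID
    table = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            raise ValueError(f"individual {row[0]}: wrong number of columns")
        individual_id = row[0]
        # Assumption: each individual appears only once.
        if individual_id in table:
            raise ValueError(f"duplicated individual ID: {individual_id}")
        # Assumption: all phenotype values are numeric.
        table[individual_id] = dict(zip(variables, map(float, row[1:])))
    return variables, table


# The first columns of the example above, written with tab separators.
example = io.StringIO(
    "ID\tSex\tAge\tLDL-C\n"
    "1\t1\t45\t0.1\n"
    "2\t1\t32\t0.2\n"
    "3\t2\t47\t0.8\n"
)
variables, phenotypes = read_phenotypes(example)
```
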
- **Summary file**
This is a CSV file with comma-separated columns and a header line. It describes the role of each variable contained in the phenotypes file. For each selected variable, the user must provide a label and a binary indicator for each of three roles: confounding factor (i.e. a variable systematically included as a covariate), outcome (i.e. a variable that will be treated as a primary outcome) and candidate covariate (i.e. a variable that will be assessed by CMS for inclusion as a covariate).
`Note that a variable classified as a confounding factor cannot also be used as an outcome or a covariate; such a combination will be flagged as an error.`
By default, all variables flagged in the "Covariate" column will be included as covariates in each outcome analysis. The "Excluded" column makes it possible to exclude specific variables from the covariates for a given outcome. These variables must be separated by ";" without any spaces. If no variable needs to be excluded, simply leave the column empty. In the example, when one of the "HDL" variables is analysed as an outcome, the other "HDL" variables are excluded from its covariates.
Example:
Label | Conf | Outcome | Covariate | Excluded
:---: | :---: | :---: | :---: | :---:
Sex | 1 | 0 | 0 |
Age | 1 | 0 | 0 |
LDL-C | 0 | 1 | 1 |
HDL-C | 0 | 1 | 1 | HDL-D;HDL-TG
HDL-D | 0 | 1 | 1 | HDL-C;HDL-TG
HDL-TG | 0 | 1 | 1 | HDL-C;HDL-D
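
The rules above can be sketched in Python as follows. This is not the CMS implementation, only an illustration of the described behaviour; the names `read_summary` and `candidate_covariates` are hypothetical, and removing the outcome itself from its own covariate list is an assumption.

```python
import csv
import io

FLAG_COLUMNS = ("Conf", "Outcome", "Covariate")


def read_summary(handle):
    """Return {label: {"Conf", "Outcome", "Covariate", "Excluded"}}."""
    variables = {}
    for row in csv.DictReader(handle):
        label = row["Label"].strip()
        info = {name: row[name].strip() == "1" for name in FLAG_COLUMNS}
        # A confounding factor cannot also be an outcome or a covariate.
        if info["Conf"] and (info["Outcome"] or info["Covariate"]):
            raise ValueError(
                f"{label}: confounding factor also flagged as "
                "outcome or covariate")
        # Excluded variables are separated by ";"; empty means none.
        excluded = (row.get("Excluded") or "").strip()
        info["Excluded"] = excluded.split(";") if excluded else []
        variables[label] = info
    return variables


def candidate_covariates(variables, outcome):
    """List the covariates considered when analysing one outcome."""
    excluded = set(variables[outcome]["Excluded"])
    return [
        label for label, info in variables.items()
        # Assumption: an outcome is never its own covariate.
        if info["Covariate"] and label != outcome and label not in excluded
    ]


# The example summary table above, written as CSV.
example = io.StringIO(
    "Label,Conf,Outcome,Covariate,Excluded\n"
    "Sex,1,0,0,\n"
    "Age,1,0,0,\n"
    "LDL-C,0,1,1,\n"
    "HDL-C,0,1,1,HDL-D;HDL-TG\n"
    "HDL-D,0,1,1,HDL-C;HDL-TG\n"
    "HDL-TG,0,1,1,HDL-C;HDL-D\n"
)
summary = read_summary(example)
confounders = [label for label, info in summary.items() if info["Conf"]]
hdl_c_covariates = candidate_covariates(summary, "HDL-C")
```

With the example table, `Sex` and `Age` are the confounding factors, and the only candidate covariate left for `HDL-C` is `LDL-C`, since the two other HDL variables are listed in its "Excluded" column.
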