stuart.Rmd 15.6 KB
Newer Older
Marie  BOURDON's avatar
Marie BOURDON committed
1
---
Marie Bourdon's avatar
Marie Bourdon committed
2
title: "stuart"
Marie  BOURDON's avatar
Marie BOURDON committed
3
4
output: rmarkdown::html_vignette
vignette: >
Marie Bourdon's avatar
Marie Bourdon committed
5
  %\VignetteIndexEntry{stuart}
Marie  BOURDON's avatar
Marie BOURDON committed
6
7
8
9
10
11
12
13
14
15
16
17
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Marie Bourdon's avatar
Marie Bourdon committed
18
19
# stuart

Marie  BOURDON's avatar
Marie BOURDON committed
20
21
Marie Bourdon

Marie Bourdon's avatar
Marie Bourdon committed
22
June 2021
Marie  BOURDON's avatar
Marie BOURDON committed
23
24
25

## Goal

Marie Bourdon's avatar
Marie Bourdon committed
26
stuart is a R package which formats results of genotyping. It was developed to analyse data from MUGA arrays (Neogen) for Rqtl analysis, for backcross or F2 crosses, but can be used to analyze data of other laboratory animal strains with other arrays. It allows to filter the markers in arrays, from a genetic point of view. Indeed, markers will be selected depending on their proportion of each genotype, correspondance between F2 or N2 individuals alleles and parental strains alleles, etc.
Marie  BOURDON's avatar
Marie BOURDON committed
27
28
29
30
31
32

The examples shown here require the use of dplyr package.


```{r setup}
library(dplyr)
Marie Bourdon's avatar
Marie Bourdon committed
33
library(stuart)
Marie  BOURDON's avatar
Marie BOURDON committed
34
35
36
37
38
```


## Annotation files

Marie Bourdon's avatar
Marie Bourdon committed
39
40
In order to map the markers on the genome of the individuals, you need to load a table with the position of all markers in the array. The data frame must contain the following columns: `marker` with the markers names, `chr` with the chromosome of each marker, and a column with the position of the marker on the chromosome. For Rqtl analysis, you need to provide positions in cM. The data frame can contain other columns that you judge helpful.

Marie  BOURDON's avatar
Marie BOURDON committed
41
The developer of Rqtl and Rqtl2 packages, Karl Broman, realised that the annotation of the MUGA arrays was not correct for some markers. Thus, he produced new annotation files for MUGA, miniMUGA, megaMUGA and gigaMUGA arrays. These files contain some informations about the markers including the chromosome and position where the probe of the marker matchs on the genome, wether the marker maps uniquely or not, etc. These files also contains the genetic position of the markers calculated with two methods : "cM_cox" and "cM_g2f1" (see https://kbroman.org/MUGAarrays/mini_revisited.html for more informations).
Marie  BOURDON's avatar
Marie BOURDON committed
42

Marie  BOURDON's avatar
Marie BOURDON committed
43
We recommand to use these annotation files to reconstruct the file use for Rqtl analysis. You can load the datasets with these annotations from GitHub (https://github.com/kbroman/MUGAarrays/tree/master/UWisc). Choose the file corresponding to the MUGA array that you used and use the URL to load the dataset in R.
Marie  BOURDON's avatar
Marie BOURDON committed
44
45


Marie  BOURDON's avatar
Marie BOURDON committed
46
47
48
49
50
```{r annot}
annot_mini <- read.csv(url("https://raw.githubusercontent.com/kbroman/MUGAarrays/master/UWisc/mini_uwisc_v2.csv"))
```


Marie Bourdon's avatar
Marie Bourdon committed
51
52
53
54
Here, we will present an example of the use of stuart with results of a F2 cross genotyped with miniMUGA. Examples of genotypes and phenotypes dataframe are available in stuart package. 

The genotype data frame must contain a first column with marker names, a second column with sample IDs, a third column with the first allele and a fourth column with the second allele. This format corresponds to the MUGA results. If your data differ, make sure to have these columns in this order.

mariefbourdon's avatar
mariefbourdon committed
55
56
57
58
59
60
61
62
63
64
65
66
67
You can load your genotype data with base R function `read.table()` or with `readr::read_tsv()` function. You have to skip the first lines with the header.

* `read.table("path/to/genotype_data.txt",sep = "\t",skip=9,header=TRUE)`

* `readr::read_tsv("path/to/genotype_data.txt",skip=9)`

The phenotype data produced by the lab must have the following format: one line per individual, one column per phenotype. The first column must be the identification number of each individual.

You can load your phenotype data depending on its format with `read.table()` or `readr::read_tsv()` for txt files, `read.table()` or `readr::read_csv()` for csv files, `read_excel::read_xls` or `read_excel::read_xlsx` for excel files. 

Here, we load the result of Neogen genotyping: `genos` (only useful columns with marker name, sample ID and alleles were kept) and the phenotype dataset produced by the lab: `phenos`, directly available in the package.


Marie Bourdon's avatar
Marie Bourdon committed
68

Marie  BOURDON's avatar
Marie BOURDON committed
69
70
71
72
73
74
75

```{r load}
data(genos)
summary(genos)
data(phenos)
summary(phenos)
```
Marie Bourdon's avatar
Marie Bourdon committed
76

Marie  BOURDON's avatar
Marie BOURDON committed
77
78
79
80
81
82
### Genotyping of parental strains

To use genotyping result for Rqtl analysis, we need to recode the genotypes of the individuals (originally encoded in A, T, G, C) depending on the genotype of the parental strains: homozygous for the first parental strain (0), heterozygous (1) or homozygous for the second parental strain (2).

We recommend to always genotype the parental strains of the cross. Here, their genotypes are in the `genos` file and correspond to the Sample.ID "StrainsA_1", "StrainsA_2", "StrainsB_1" and "StrainsB_2". Two individuals were genotyped for each parental strain. The first step will be to create a consensus genotype for each strain from the two genotyped individuals. The consensus genotype will be added to the annotation dataset in order to obtain a dataset with both annotation and reference genotype of the parental strains that will be used for recoding the genotypes or the F2 individuals. 

mariefbourdon's avatar
mariefbourdon committed
83
This is done with the `geno_strains()` function. If parental genotypes was in another dataset than the one with second generation individuals' geotypes, from a previous genotyping result for example, this dataset would have been used in this function, indicated with the `geno` argument. If you use reference genotypes data for your parental strain, you must create a table with genotypes of parental strains and marker informations by merging the `annot_mini` table with reference genotypes table. This can be done with `merge()` or `dplyr::full_join()` functions.
Marie  BOURDON's avatar
Marie BOURDON committed
84
85

```{r strains}
86
87
88
89
90
91
parental_strains <- tibble::tibble(parent1 = c("StrainsA_1","StrainsA_2"),
                                   parent2 = c("StrainsB_1","StrainsB_2"))


strains <- geno_strains(annot=annot_mini,geno=genos,
                        strn=parental_strains,cols=c("chr","cM_cox"))
92
head(strains) %>% print.data.frame()
Marie  BOURDON's avatar
Marie BOURDON committed
93
94
95
```

After this step, we need to remove the genotyping result for these individuals from the `genos` dataset.
mariefbourdon's avatar
mariefbourdon committed
96

Marie  BOURDON's avatar
Marie BOURDON committed
97
```{r no_parent}
98
99
genos <- genos %>% filter(!Sample.ID %in% c("StrainsA_1", "StrainsA_2", 
                                            "StrainsB_1","StrainsB_2"))
Marie  BOURDON's avatar
Marie BOURDON committed
100
101
102
103
104
105
106
```


## Markers sorting

### Marker tab

107
The first step of the markers sorting is to create the marker dataframe with the tab_mark() function. This dataframe contains for each marker its chromosome (`chr`) and position (`cM_cox` was chosen from annot_mini, but you can choose the column that you want provided that it is in the annotation file), the two alleles that can be found in the F2/N2 population (`Allele_1` and `Allele_2`), the number of individuals for each genotype (homozygous for each allele (`n_HM1` and `n_HM2`) and heterozygous (`n_HT`)), and the number of non genotyped individuals (`n_NA`) This step can take several minutes, so you can also load the example output of this function.
Marie  BOURDON's avatar
Marie BOURDON committed
108
109


Marie Bourdon's avatar
Marie Bourdon committed
110
```{r tab_mark}
Marie Bourdon's avatar
Marie Bourdon committed
111
# how to use the function:
112
# stuart_tab <- tab_mark(genos,annot_mini,"cM_cox")
Marie Bourdon's avatar
Marie Bourdon committed
113
114
data(stuart_tab)
summary(stuart_tab)
Marie  BOURDON's avatar
Marie BOURDON committed
115
116
```

Marie Bourdon's avatar
Marie Bourdon committed
117
118
Then we will use the different mark_* functions in order to filter the markers. 

mariefbourdon's avatar
mark_na    
mariefbourdon committed
119
120
121
122
123
124
125
126
127
### mark_na

First, we can use `mark_na()` function in order to remove markers with high proportion of missing genotypes in our F2 individuals.

```{r mark_na}
tab2 <- mark_na(stuart_tab)
```


Marie Bourdon's avatar
Marie Bourdon committed
128
129
130

### mark_match

mariefbourdon's avatar
mark_na    
mariefbourdon committed
131
Then, we can use `mark_match()` function. Here, the parental strains were genotyped with the F2 individuals, but it can happen that you use previous genotyping results for the parental strains. `mark_match()` function excludes markers that are in your genotype file but not in the reference genotype dataset. We recomend using this function as the chip used for genotyping may change.
Marie  BOURDON's avatar
Marie BOURDON committed
132

133
```{r mark_match}
mariefbourdon's avatar
mariefbourdon committed
134
tab2 <- mark_match(tab2,ref=strains)
135
tab2 %>% filter(exclude_match==1) %>% print.data.frame()
Marie  BOURDON's avatar
Marie BOURDON committed
136
137
138
139
```

Here the reference strains were genotyped with the same version of the chip as the F2 individuals so no marker was excluded.

Marie Bourdon's avatar
Marie Bourdon committed
140
141
### mark_poly

142
Then, we can use the `mark_poly()` function, which will exclude the markers that are not polymorphic.
Marie  BOURDON's avatar
Marie BOURDON committed
143

144
```{r mark_poly ex}
Marie  BOURDON's avatar
Marie BOURDON committed
145
tab2 <- mark_poly(tab2)
146
head(tab2) %>% print.data.frame()
Marie  BOURDON's avatar
Marie BOURDON committed
147
148
```

Marie Bourdon's avatar
Marie Bourdon committed
149
150
### mark_prop

151
152
153
154
155
156
157
158
159
The `mark_prop()` function can be used to filter markers depending on the proportion of each genotype. Here, we have a F2 and we use `homo=0.1, hetero=0.1`  so the function will exclude all markers with less than 10% of each genotype. For chromosome X, we use the `homo1X`, `homo2X` and `heteroX` arguments. The expected proportion of each genotype for markers on X chromosome depend on the directions of the cross performed in order to obtain F1 individuals.

For example, in an F2 between a strain A (a/a for all loci) and B (b/b for all loci), if we cross a female A (i.e. a/a for markers on X chromosome) with a male B (i.e. b/Y for markers on X chromosome), we obtain F1 females with a/b genotype and F1 males with a/Y for markers on X chromosome. Crossed together, they give birth to F2 females that can be either be a/a or a/b for X chromosome markers.

On the other hand, if we cross a female B (i.e. b/b for markers on X chromosome) with a male A (i.e. a/Y for markers on X chromosome), we still obtain F1 females with a/b genotype but F1 males are b/Y for these markers. The obtained F2 females can be either a/b or b/b for X chromosome markers. 

This example shows that depending on the directions of the crosses, genotypic proportions for markers on X chromosome can vary. In some cases, it is possible that a genotype cannot be present in the second generation individuals. You must determine these expected genotypic proportions. Moreover, it is possible that all second generation individuals were not obtained thanks to crosses with the same direction; in such cases, it can be quite complex to estimate the expected genotypic proportions.

However, it is still important to clean markers on X chromosome. The `homo1X`, `homo2X` and `heteroX` arguments are vectors of 2 numbers which are the lower and upper limits of the estimated genotypic proportion (`homo1X` for the homozygosity for the allele of parent 1, `homo2X` for the homozygosity for the allele of parent 2 and `heteroX` for the heterozygosity). As we saw previously, estimate these limits can be quite tough. Thus, if all genotypes are possible, we recommand to use the minimal limits `c(0.1,1)` for the 3 argumetns in order to eliminate most of the unwanted markers.
Marie  BOURDON's avatar
Marie BOURDON committed
160

Marie Bourdon's avatar
Marie Bourdon committed
161
```{r mark_prop_ex_homo}
162
tab2 <- mark_prop(tab2,cross="F2",homo=0.1,hetero=0.1,homo1X=c(0.1,1),homo2X=c(0.1,1),heteroX=c(0.1,1))
163
head(tab2) %>% print.data.frame()
Marie  BOURDON's avatar
Marie BOURDON committed
164
```
165
We could also use the `pval` argument which allows to exclude autosomal markers by performing a Chi2 test comparing observed distribution with Mendelian proportions. By using `pval=0.5` we would exclude all markers with a Chi2 p-value inferior to 0.05. However, for some markers, Chi2 approximation may be incorrect.
Marie Bourdon's avatar
Marie Bourdon committed
166
167

```{r mark_prop_ex_pval,warning=FALSE}
mariefbourdon's avatar
mariefbourdon committed
168
mark_prop(tab2,cross="F2",pval=0.05,homo1X=c(0.1,1),homo2X=c(0.1,1),heteroX=c(0.1,1)) %>% head() %>% print.data.frame()
Marie Bourdon's avatar
Marie Bourdon committed
169
170
171
```

### mark_allele
Marie  BOURDON's avatar
Marie BOURDON committed
172

Marie Bourdon's avatar
Marie Bourdon committed
173
Last, we can use the `mark_allele()` function. This function excludes markers for which the alleles found in the F2/N2 individuals do not correspond to the alleles found in the parental strains. For example, if a marker is not polymorphic in the parental strains but we found two alleles in the F2/N2 individuals, it will be excluded. Moreover, for a backcross, this function will verify that the homozygous N2 individuals are homozygous for the allele of the strain used to backcross F1 individuals. This strain must be indicated as par1.
Marie  BOURDON's avatar
Marie BOURDON committed
174

Marie Bourdon's avatar
Marie Bourdon committed
175
```{r mark_allele}
Marie Bourdon's avatar
Marie Bourdon committed
176
tab2 <- mark_allele(tab=tab2,ref=strains,cross="F2",par1="parent1",par2="parent2")
177
tab2 %>% arrange(desc(exclude_allele)) %>% head() %>% print.data.frame()
Marie  BOURDON's avatar
Marie BOURDON committed
178
179
```

180
Indeed, we can see that the markers excluded with `mark_allele()` have different alleles in the parental strains.
Marie  BOURDON's avatar
Marie BOURDON committed
181

Marie Bourdon's avatar
Marie Bourdon committed
182
```{r mark_allele-strains}
183
strains %>% filter(marker %in% c("gJAX00001099","gUNC83675","gUNC66010","mUNC010010509","gUNC30322","mUNC010515443")) %>% 
184
  select(marker,parent1,parent2) %>% print.data.frame()
Marie  BOURDON's avatar
Marie BOURDON committed
185
186
```

Marie Bourdon's avatar
Marie Bourdon committed
187
## Creation of the R/qtl file
Marie  BOURDON's avatar
Marie BOURDON committed
188

189
After excluding the problematic markers, we can create the R/qtl file. The individuals must have the same ID in the geno and in the pheno file. If there is a prefix in the geno file that must be removed in order to acheive this, you can use the "prefix" argument. The "path" argument can be used in order to create a CSV file that you can laod with `qtl::read.cross`. 
Marie  BOURDON's avatar
Marie BOURDON committed
190

191
192
```{r write_rqtl}
stuart_output <- write_rqtl(geno=genos,pheno=phenos,tab=tab2,
Marie Bourdon's avatar
Marie Bourdon committed
193
                        ref=strains,par1="parent1",par2="parent2",prefix="ind_",pos="cM_cox")
Marie  BOURDON's avatar
Marie BOURDON committed
194

195
stuart_output[1:10,1:7] %>% print.data.frame()
Marie  BOURDON's avatar
Marie BOURDON committed
196
197
```

Marie Bourdon's avatar
Marie Bourdon committed
198
199
200
201
## Check markers with est.map()

The best way to verify that the markers are correctly sorted is to calculate the estimated map with R/qtl. Indeed, if there is a problematic marker, it will present high recombination fraction with adjacent markers.

202
203
204
The CSV file produced with stuart must be loaded in R as a cross with `qtl::read.cross`. The dataframe produced by `write_rqtl` cannot be used with R/qtl as it does not have the "cross" class.

The cross object was saved in stuart. Here we can load it as well as the newmap that was produced with the function `est.map(cross=stuart_cross,error.prob=0.01)`.
Marie Bourdon's avatar
Marie Bourdon committed
205
206
207

```{r load_cross}
library(qtl)
208

Marie Bourdon's avatar
Marie Bourdon committed
209
210
211
data(stuart_cross)
summary(stuart_cross)

Marie Bourdon's avatar
Marie Bourdon committed
212
data(stuart_newmap)
Marie Bourdon's avatar
Marie Bourdon committed
213
214
215
216
217
218
```


Then we can plot the newmap to compare it with the positions found in the annotation file.

```{r plot_map}
Marie Bourdon's avatar
Marie Bourdon committed
219
plotMap(stuart_cross,stuart_newmap,shift=TRUE)
Marie Bourdon's avatar
Marie Bourdon committed
220
```
221

Marie Bourdon's avatar
Marie Bourdon committed
222
### mark_estmap
223

Marie Bourdon's avatar
Marie Bourdon committed
224
225
226
It seems to be still 2 problematic markers. We can identify them with the `mark_estmap()`function.

```{r mark_estmap}
mariefbourdon's avatar
mariefbourdon committed
227
tab2 <- mark_estmap(tab=tab2,map=stuart_newmap,annot=annot_mini)
Marie Bourdon's avatar
Marie Bourdon committed
228
tab2 %>% filter(exclude_estmap == 1) %>% print.data.frame()
Marie Bourdon's avatar
Marie Bourdon committed
229
```
Marie Bourdon's avatar
Marie Bourdon committed
230
We can see that these two markers have correct proportions of each genotypes so they were not excluded. Let's see the alleles of the parental strains for these markers.
Marie Bourdon's avatar
Marie Bourdon committed
231

Marie Bourdon's avatar
Marie Bourdon committed
232

Marie Bourdon's avatar
Marie Bourdon committed
233
```{r bad_markers_annot}
mariefbourdon's avatar
mariefbourdon committed
234
strains %>% filter(marker %in% c("S6J010381992","SS6071602326","S6J205609960"))
Marie Bourdon's avatar
Marie Bourdon committed
235
236
```

mariefbourdon's avatar
mariefbourdon committed
237
We can see that for 2 of these markers, one of the two parents has a missing genotype. In the cases where one parent has a missing of a heterozygous genotype, the `mark_allele()` function does not by default exclude the marker if the observed allele of the genotyped parent correspond to one of the alleles observed in the crossed individuals.
Marie Bourdon's avatar
Marie Bourdon committed
238

Marie Bourdon's avatar
Marie Bourdon committed
239
240
### Exclusion of last problematic markers

Marie Bourdon's avatar
Marie Bourdon committed
241
242
To avoid the issues with these situations, you can use the `parNH=FALSE` argument in the `mark_allele()` function. This will exclude all the markers with one parent with a missing or a heterozygous genotype. However, this will exclude a very large number of markers. More than 1000 are in this situation in this cross, but about 700 were excluded with other filters. If we exclude all these markers with `parNH=FALSE` we would lose about 300 markers for only 2 problematic ones.

mariefbourdon's avatar
mariefbourdon committed
243
We rather advise you to keep these markers and identify possible problematic markers with `mark_estmap()` and exlude them. 
Marie Bourdon's avatar
Marie Bourdon committed
244
245


mariefbourdon's avatar
mariefbourdon committed
246
247
248
249
250
251
252
If there are still some markers that need to be excluded after all these steps, you have two possibilities to exclude them. First, you have to identify the name of these markers by looking at the `newmap` object as follow : in RStudio, click on this objectin the Environment to view it, then click on the arrow on the left of the desired chromosome. Each chromosome is a named vector that is visible here in two columns: one with the name of the marker, and one with the calculated position of the marker. This is how you can detect markers that present a huge gap with the previous and the following marker.

You can see an example for the stuart_newmap object in the following screen shot:

![screenchot_newmap](screenshot_newmap.png)

You can drop these markers in the cross object with `qtl::drop.markers()`, or if you prefere to have a correct CSV object that can always be used for QTL analysis, you can delete these markers in the marker tab and create a new cross file with `write_rqtl()`.
Marie Bourdon's avatar
Marie Bourdon committed
253

Marie Bourdon's avatar
Marie Bourdon committed
254