96AB3736-contents 7.9 KB
Newer Older
Marie Bourdon's avatar
Marie Bourdon committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
---
title: "stuart"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{stuart}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Marie Bourdon

Marie Bourdon's avatar
Marie Bourdon committed
20
June 2021
Marie Bourdon's avatar
Marie Bourdon committed
21
22
23

## Goal

Marie Bourdon's avatar
Marie Bourdon committed
24
stuart is a R package which formats results of genotyping. It was developed to analyse data from MUGA arrays (Neogen) for Rqtl analysis, for backcross or F2 crosses, but can be used to analyze data of other laboratory animal strains with other arrays. It allows to filter the markers in arrays, from a genetic point of view. Indeed, markers will be selected depending on their proportion of each genotype, correspondance between F2 or N2 individuals alleles and parental strains alleles, etc.
Marie Bourdon's avatar
Marie Bourdon committed
25
26
27
28
29
30
31
32
33
34
35
36

The examples shown here require the use of dplyr package.


```{r setup}
library(dplyr)
library(stuart)
```


## Annotation files

Marie Bourdon's avatar
Marie Bourdon committed
37
38
In order to map the markers on the genome of the individuals, you need to load a table with the position of all markers in the array. The data frame must contain the following columns: `marker` with the markers names, `chr` with the chromosome of each marker, and a column with the position of the marker on the chromosome. For Rqtl analysis, you need to provide positions in cM. The data frame can contain other columns that you judge helpful.

Marie Bourdon's avatar
Marie Bourdon committed
39
40
41
42
43
44
45
46
47
48
The developer of Rqtl and Rqtl2 packages, Karl Broman, realised that the annotation of the MUGA arrays was not correct for some markers. Thus, he produced new annotation files for MUGA, miniMUGA, megaMUGA and gigaMUGA arrays. These files contain some informations about the markers including the chromosome and position where the probe of the marker matchs on the genome, wether the marker maps uniquely or not, etc. These files also contains the genetic position of the markers calculated with two methods : "cM_cox" and "cM_g2f1" (see https://kbroman.org/MUGAarrays/mini_revisited.html for more informations).

We recommand to use these annotation files to reconstruct the file use for Rqtl analysis. You can load the datasets with these annotations from GitHub (https://github.com/kbroman/MUGAarrays/tree/master/UWisc). Choose the file corresponding to the MUGA array that you used and use the URL to load the dataset in R.


```{r annot}
annot_mini <- read.csv(url("https://raw.githubusercontent.com/kbroman/MUGAarrays/master/UWisc/mini_uwisc_v2.csv"))
```


Marie Bourdon's avatar
Marie Bourdon committed
49
50
51
52
53
54
Here, we will present an example of the use of stuart with results of a F2 cross genotyped with miniMUGA. Examples of genotypes and phenotypes dataframe are available in stuart package. 

The genotype data frame must contain a first column with marker names, a second column with sample IDs, a third column with the first allele and a fourth column with the second allele. This format corresponds to the MUGA results. If your data differ, make sure to have these columns in this order.

We load the result of Neogen genotyping: `genos` (only useful columns with marker name, sample ID and alleles were kept) and the phenotype dataset produced by the lab: `phenos`.

Marie Bourdon's avatar
Marie Bourdon committed
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88

```{r load}
data(genos)
summary(genos)
data(phenos)
summary(phenos)
```

### Genotyping of parental strains

To use genotyping result for Rqtl analysis, we need to recode the genotypes of the individuals (originally encoded in A, T, G, C) depending on the genotype of the parental strains: homozygous for the first parental strain (0), heterozygous (1) or homozygous for the second parental strain (2).

We recommend to always genotype the parental strains of the cross. Here, their genotypes are in the `genos` file and correspond to the Sample.ID "StrainsA_1", "StrainsA_2", "StrainsB_1" and "StrainsB_2". Two individuals were genotyped for each parental strain. The first step will be to create a consensus genotype for each strain from the two genotyped individuals. The consensus genotype will be added to the annotation dataset in order to obtain a dataset with both annotation and reference genotype of the parental strains that will be used for recoding the genotypes or the F2 individuals. 

This is done with the `geno_strains` function. 

```{r strains}
strains <- geno_strains(ref=annot_mini,geno=genos,par1=c("StrainsA_1","StrainsA_2"),par2=c("StrainsB_1","StrainsB_2"),name1="parent1",name2="parent2")
head(strains)
```

After this step, we need to remove the genotyping result for these individuals from the `genos` dataset.
```{r no_parent}
genos <- genos %>% filter(!Sample.ID %in% c("StrainsA_1", "StrainsA_2", "StrainsB_1","StrainsB_2"))
```


## Markers sorting

### Marker tab

The first step of the markers sorting is to create the marker dataframe with the tab_mark() function. This dataframe contains for each marker the two alleles that can be found in the F2/N2 population (`Allele_1` and `Allele_2`), the number of individuals for each genotype (homozygous for each allele (`n_HM1` and `n_HM2`) and heterozygous (`n_HT`)), and the number of non genotyped individuals (`n_NA`) This step can take several minutes. You can also load the output of this function.


Marie Bourdon's avatar
Marie Bourdon committed
89
```{r tab_mark}
Marie Bourdon's avatar
Marie Bourdon committed
90
91
92
93
data(stuart_tab)
summary(stuart_tab)
```

94
Then we will use the different mark_* functions in order to filter the markers. First, we can use `mark_match()`` function. Here, the parental strains were genotyped with the F2 individuals, but it can happen that you use previous genotyping results for the parental strains. `mark_match()` function excludes markers that are in your genotype file but not in the reference genotype dataset. We recomend using this function as the chip used for genotyping may change.
Marie Bourdon's avatar
Marie Bourdon committed
95

96
```{r mark_match}
Marie Bourdon's avatar
Marie Bourdon committed
97
98
99
100
101
102
103
104
tab2 <- mark_match(stuart_tab,ref=strains)


tab2 %>% filter(exclude_match==1)
```

Here the reference strains were genotyped with the same version of the chip as the F2 individuals so no marker was excluded.

105
Then, we can use the `mark_poly()` function, which will exclude the markers that are not polymorphic.
Marie Bourdon's avatar
Marie Bourdon committed
106

107
```{r mark_poly ex}
Marie Bourdon's avatar
Marie Bourdon committed
108
109
110
111
tab2 <- mark_poly(tab2)
head(tab2)
```

112
The `mark_prop()` function can be used to filter markers depending on the proportion of each genotype. Here, we have a F2 so we can use the "homo" argument in order to filter depending on the proportion of both homozygous genotype. If we have a N2, we can filter with the proportion of homozygous individuals with the "homo" argument and of heterozygous individuals with the hetero" argument. Moreover, this function allows to filter marker depending on the proportion on non genotyped animals. By defaults, markers for which more than 50% of individuals were not genotyped.
Marie Bourdon's avatar
Marie Bourdon committed
113

Marie Bourdon's avatar
Marie Bourdon committed
114
```{r mark_prop ex,eval=F}
Marie Bourdon's avatar
Marie Bourdon committed
115
116
117
118
tab2 <- mark_prop(tab2,cross="F2",homo=0.1,hetero=0.1)
head(tab2)
```

119
Last, we can use the `mark_allele()` function. This very helpful function excludes markers for which the alleles found in the F2/N2 individuals do not correspond to the alleles found in the parental strains. For example, if for a marker is not polymorphic in the parental strains but we found two alleles in the F2/N2 individuals, it will be excluded.
Marie Bourdon's avatar
Marie Bourdon committed
120

Marie Bourdon's avatar
Marie Bourdon committed
121
```{r mark_allele,eval=F}
Marie Bourdon's avatar
Marie Bourdon committed
122
123
124
125
tab2 <- mark_allele(tab=tab2,ref=strains,par1="parent1",par2="parent2")
tab2 %>% arrange(desc(exclude_allele)) %>% head()
```

126
Indeed, we can see that the markers excluded with `mark_allele()` have different alleles in the parental strains.
Marie Bourdon's avatar
Marie Bourdon committed
127

Marie Bourdon's avatar
Marie Bourdon committed
128
```{r mark_allele-strains,eval=F}
Marie Bourdon's avatar
Marie Bourdon committed
129
130
131
132
133
strains %>% filter(marker %in% c("gJAX00038569","gJAX00425031","gUNC12245354","gUNC15530876","gUNC21555204","gUNC21596600")) %>% arrange(marker) %>% select(marker,parent1,parent2)
```

# Creation of the R/qtl file

134
After excluding the problematic markers, we can create the R/qtl file. The individuals must have the same ID in the geno and in the pheno file. If there is a prefix in the geno file that must be removed in order to acheive this, you can use the "prefix" argument. The "path" argument can be used in order to create a CSV file that you can laod with `qtl::read.cross`. 
Marie Bourdon's avatar
Marie Bourdon committed
135

Marie Bourdon's avatar
Marie Bourdon committed
136
```{r write_qtl,eval=F}
Marie Bourdon's avatar
Marie Bourdon committed
137
138
139
140
141
rqtl_file <- write_rqtl(geno=genos,pheno=phenos,tab=tab2,ref=strains,par1="parent1",par2="parent2",prefix="ind_",pos="cM_cox")

rqtl_file[1:10,1:7]
```