MTAR package has functions to 1) compute the summary statistics with different types of data input; 2) to adjust summary statistics for sample overlap if necessary; and 3) to perform multi-trait rare-variant association tests using the summary statistics.
The package development version is tested on the following systems:
Mac OSX: Mojave version 10.14.6 (R version 3.6.0)
Windows 10 (R version 3.4.0)
The CRAN package should be compatible with Windows and Mac operating systems.
Before setting up the MTAR
package, users should have
R
version 3.4.0 or higher.
MTAR
package requires R with version 3.4.0 or higher,
which can be downloaded and installed from CRAN.
install.packages("MTAR")
MTAR
package depends on several R packages
(MASS
, CompQuadForm
, Matrix
,
stats
, utils
), which will be downloaded when
installing MTAR
. MTAR
also uses non-exported R
code from R packages ACAT
, SKAT
and
SeqMeta
. The MTAR
package functions with all
packages in their latest versions as they appear on CRAN
on
September 9, 2019. The versions of software are, specifically:
CompQuadForm_1.4.3
MASS_7.3-51.4
Matrix_1.2-17
seqMeta_1.6.7
SKAT_1.3.2.1
The score statistics U and their covariance matrix V are the summary statistics required by MTAR tests. In this section, we will demonstrate how to use various utility functions to prepare for these summary statistics. The output object of these utility functions can be directly taken by the MTAR function as input.
Assume we are interested in effects of m rare variants (MAF < 5%) in a gene or SNP-set on K traits. For the kth trait, when the individual-level genotype Gik, covariate Xik and trat data Yik are available, the score statistics Uk and their covariance estimate Vk can be obtained from the generalized linear model with the likelihood function $$ \exp\left\{\frac{Y_{ik} \left(\mathbf{\beta}^{\rm T}_{k} \mathbf{G}_{ik}+ \mathbf{\gamma}^{\rm T}_{k} \mathbf{X}_{ik}\right)- b(\mathbf{\beta}^{\rm T}_{k} \mathbf{G}_{ik}+ \mathbf{\gamma}^{\rm T}_{k} \mathbf{X}_{ik})}{a(\phi_k)}+c(Y_{ik},\phi_k)\right\}, $$ where βk and γk are regression parameters, ϕk is a dispersion parameter, and a, b, and c are specific functions. The first and second derivatives of b are denoted by b′ and b″. Specifically, we have
where $\widehat{\mathbf{\gamma}_k}$ and ϕ̂k are the restricted maximum likelihood estimators of γk and ϕk under H0 : βk = 0. For the linear regression model, we have $a(\widehat{\phi}_k) = n^{-1}\sum_{i = 1}^n(Y_{ik} - \widehat{\mathbf{\gamma}}_{k}^T\mathbf{X}_{ik})^2$, b′(z) = z and b″(z) = 1. For the logistic regression model, we have a(ϕ) = 1, b′(z) = ez/(1 + ez), b″(z) = ez/(1 + ez)2 (Tang and Lin (2015); Hu et al. (2013)).
To calculate the summary statistics from individual-level data, Get_UV_from_data can be used with required inputs of traits, covariates and genotype. Specifically, traits, covariates and genotype are 3 lists and each sublist contains the information from L study. For ℓth study, traits, covariates and genotype information should be a n × K, n × D and n × m matrix, respectively. The number of traits in each study can be different, but the name of traits must be provided. Same as the name of SNPs in genotype. The missing value in trait or in genotype should be labeled as NA. The order of subject IDs in traits, covariates and genotype should be the same.
Here, we use a simulated dataset rawdata to show how to apply Get_UV_from_data function. In rawdata, there are 3 studies, 3 continuous traits and 10 rare variants. Fig. 2 shows the diagram for sample overlap among traits in three studies. Specifically, there are 1500 subjects in study1, but each subject only has one trait measurement. In study2 and study3, the sample size is 500 and each subject has two (LDL and HDL) or three traits (LDL, HDL and TG) measurements.
## LDL HDL TG
## SUBJ1 -2.6166166 NA NA
## SUBJ2 0.8519045 NA NA
## SUBJ3 1.2438145 NA NA
## SUBJ4 -0.5327296 NA NA
## SUBJ5 0.8006589 NA NA
## SUBJ6 0.1560920 NA NA
## cov1 cov2
## SUBJ1 1 -0.29713414
## SUBJ2 0 -0.08074092
## SUBJ3 0 0.55975760
## SUBJ4 0 0.49851956
## SUBJ5 1 -0.07097778
## SUBJ6 1 -0.28825662
## SNP2148 SNP2149 SNP2150 SNP2151 SNP2152 SNP2153 SNP2154 SNP2155 SNP2156
## SUBJ1 0 0 0 0 0 0 0 0 0
## SUBJ2 0 0 0 0 0 0 0 0 0
## SUBJ3 0 0 0 0 0 0 0 0 0
## SUBJ4 0 0 0 0 0 0 0 0 0
## SUBJ5 0 0 0 0 0 0 0 0 0
## SUBJ6 0 0 0 0 0 0 0 0 0
## SNP2157
## SUBJ1 0
## SUBJ2 0
## SUBJ3 0
## SUBJ4 0
## SUBJ5 0
## SUBJ6 0
obs.stat <- Get_UV_from_data(traits = traits.dat,
covariates = cov.dat,
genotype = geno.dat,
covariance = TRUE)
obs.stat$U
## $LDL
## SNP2148 SNP2149 SNP2150 SNP2151 SNP2152 SNP2153 SNP2154
## 0.3438038 1.5488022 1.3252516 0.0000000 2.0216781 -1.8258464 0.6500477
## SNP2155 SNP2156 SNP2157
## 0.0000000 0.1757972 1.5709795
##
## $HDL
## SNP2148 SNP2149 SNP2150 SNP2151 SNP2152 SNP2153
## 0.00000000 0.70525844 1.25507643 0.00000000 -0.03924912 -0.28306623
## SNP2154 SNP2155 SNP2156 SNP2157
## 2.06966868 0.00000000 -12.48457057 0.00000000
##
## $TG
## SNP2148 SNP2149 SNP2150 SNP2151 SNP2152 SNP2153 SNP2154
## 0.0000000 1.3061250 2.2535334 0.0000000 -2.6934948 -0.2769778 1.5616663
## SNP2155 SNP2156 SNP2157
## 0.0000000 -0.3394679 0.0000000
For each trait, if the number of unique values is no more than 2, then this trait will be viewes as a binary trait and a logistic regression will be conducted to calculate the summary statistics. If covariance = TRUE, then a m × m covariance matrix Vk is returned, otherwise only m diagonal elements ${\rm diag}(\mathbf{V}_k)$ is returned.
When the score statistics Uk and their variance (${\rm diag}(\mathbf{V}_k)$) are available, the covariance matrix Vk can be approximated as $$ \mathbf{V}_k \approx {\rm diag}(\mathbf{V}_k)^{1/2} \mathbf{R} {\rm diag}(\mathbf{V}_k)^{1/2}, $$ where R = {Rjℓ}j, ℓ = 1m is the SNP correlation matrix estimated either from the working genotypes or the external reference (Hu et al. (2013)). This transformation has been implemented in function Get_UV_from_varU.
data("varU.example")
attach(varU.example)
obs.stat <- Get_UV_from_varU(U = U, varU = varU, R= R)
obs.stat$U
## $LDL
## 1:150551327 1:150551576 1:150551649 1:150550965 1:150551447 1:150550761
## -268.5912258 31.2170161 -12.7574302 4.2199502 0.8814878 3.2751963
##
## $HDL
## 1:150551327 1:150551576 1:150551649 1:150550965 1:150551447 1:150550761
## 229.0655499 21.4970739 -18.7336436 -0.1519353 -3.7007466 -12.8477062
##
## $TG
## 1:150551327 1:150551576 1:150551649 1:150550965 1:150551447 1:150550761
## -363.297279 -11.531591 -1.618765 -2.521649 1.127816 6.706268
When the genetic effect estimates $\widehat{\mathbf{\beta}}_{k} = \{\widehat{\beta}_{kj} \}_{j=1}^m$ and their standard error sek = {sekj}j = 1m are available, we can approximate Uk = {Ukj}j = 1m and Vk = {Vkjℓ}j, ℓ = 1m as Ukj ≈ β̂kj/sekj2 and Vkjℓ ≈ Rjℓ/(sekjsekℓ) via function Get_UV_from_beta.
data("beta.example")
attach(beta.example)
obs.stat <- Get_UV_from_beta(Beta = Beta, Beta.se = Beta.se, R = R)
detach(beta.example)
It took less than 0.05 second to calculate the summary statistics for a gene with 10 rare variants.
If all the traits are from one study or from multiple studies with overlapping subjects, the genetic effect estimates among traits are correlated. To calculate the covariances of these genetic effects, we need to first compute zeta – the among-trait covariances of Z-scores for the SNPs that are not associated with the traits (details in Luo et al. (2019)). In function Get_zeta, we ask users to input a vector of Z-scores from genome-wide association tests between common variants and each trait. The function will conduct LD pruning (by using the reference independent SNP list in 1000 Genome), remove SNPs with p-value < 0.05 (default p-value cutoff threshold), and compute zeta.
Here we showed a simplified example of estimating matrix zeta, we extracted a subset of 737 null and common SNPs from chromosome 1 in the Global Lipids Genetics Consortium (GLGC, Liu et al. (2017)). After filtering out those dependent SNPs, there are only 137 SNPs remaining. The computation time of estimating zeta with 737 null and common SNPs is less than 4 seconds.
data("zeta.example")
attach(zeta.example)
# Downloading independent common SNPs from 1000Genome data set.
githubURL <- "https://github.com/lan/MTAR/blob/master/indp_snps.1KG.rda?raw=true"
utils::download.file(githubURL,"1kgfile")
load("1kgfile")
zeta1 <- Get_zeta(Zscore = Zscore, Indp_common_snp = indp_snps.1KG)
detach(zeta.example)
To run MTAR function, the score statistics U, their covariance matrix V and minor allele frequency MAF are required. Other inputs are optional. Here an example dataset (MTAR.example) offers all the information needed in MTAR function.
## [1] "annotation" "U" "V"
## [4] "MAF" "genetic_cor.trait" "zeta"
Specifically, genetic_cor.trait represents the genetic correlation information (B. Bulik-Sullivan et al. (2015)). The genetic correlation can be estimated using ldsc software (B. K. Bulik-Sullivan et al. (2015) and URLs). Some websites also archive the genetic correlation among many complex traits (URLs). If no genetic correlation is available, MTAR will use an exchangeable correlation structure among traits. The zeta can be estimated in Step 2 and via Get_zeta. If no zeta is provided, MTAR assumes there is no sample overlap among traits. The p-values of MTAR-O, cMTAR, iMTAR and cctP and ancillary information will be returned. In this example, it took less 5 seconds to calculate the MTAR test p-values for gene PNPLA2.
attach(MTAR.example)
pval <- MTAR(U = U, V = V, MAF = MAF, genetic_cor.trait = genetic_cor.trait, zeta = zeta)
pval
## $MTARO
## [1] 5.226324e-08
##
## $cMTAR
## $cMTAR$p
## [1] 5.368409e-08
##
## $cMTAR$rho1.min
## [1] 0
##
## $cMTAR$rho2.min
## [1] 1
##
##
## $iMTAR
## $iMTAR$p
## [1] 2.599168e-08
##
## $iMTAR$rho1.min
## [1] 0
##
## $iMTAR$rho2.min
## [1] 1
##
##
## $cctP
## [1] 3.329077e-06
ldsc website: https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation Genetic correlation: https://media.nature.com/original/nature-assets/ng/journal/v47/n11/extref/ng.3406-S2.csv