A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank
Oct 23, 2020

Abstract
The UK Biobank is a very large, prospective population-based cohort study across the
United Kingdom. It provides unprecedented opportunities for researchers to investigate
the relationship between genotypic information and phenotypes of interest. Multiple
regression methods, compared with genome-wide association studies (GWAS), have already
been shown to greatly improve the prediction performance for a variety of phenotypes.
In high-dimensional settings, the lasso, since its first proposal in statistics, has
proven to be an effective method for simultaneous variable selection and
estimation. However, the large scale and ultrahigh dimension of the UK Biobank pose
new challenges for applying the lasso, as many existing algorithms and their
implementations do not scale to applications of this size. In this paper, we propose a
computational framework called batch screening iterative lasso (BASIL) that can take
advantage of any existing lasso solver and easily build a scalable solution for very
large data, including those that are larger than the memory size. We introduce snpnet,
an R package that implements the proposed algorithm on top of glmnet and optimizes for
single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized
linear model, logistic regression, and the Cox model, and also extends to the elastic net
with an ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive
predictive performance for all four phenotypes considered (height, body mass index,
asthma, high cholesterol) using only a small fraction of the variants compared with
other established polygenic risk score methods.
Type
Publication
Published in PLOS Genetics, 2020
In this project, led by Junyang Qian, we developed BASIL, a novel algorithm that fits large-scale L1-penalized (lasso) regression models through an iterative procedure, and implemented snpnet, an R package designed specifically for genetic data.
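
The iterative procedure can be illustrated with a minimal sketch. It is not the snpnet implementation. It is a simplified, pure-Python rendering of the screen, fit, and check idea, under several assumptions: a linear model without intercept, one λ value processed at a time, and a small coordinate-descent routine standing in for an existing lasso solver such as glmnet. The names `basil`, `lasso_cd`, `toy_data`, `batch`, and `kkt_tol` are hypothetical and chosen only for illustration.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    return z - t if z > t else z + t if z < -t else 0.0


def residual(X, y, beta):
    """Return y - X @ beta; X is a list of columns, beta a {column: coef} dict."""
    r = list(y)
    for j, b in beta.items():
        if b != 0.0:
            for i, x in enumerate(X[j]):
                r[i] -= x * b
    return r


def gradient(X, r, j):
    """Per-sample inner product of column j with the residual r."""
    return sum(x * ri for x, ri in zip(X[j], r)) / len(r)


def lasso_cd(X, y, cols, lam, warm, sweeps=1000, tol=1e-10):
    """Coordinate descent on columns `cols` only (stand-in for any lasso solver).

    Minimizes (1 / 2n) * ||y - X b||^2 + lam * ||b||_1 over the chosen columns.
    """
    n = len(y)
    beta = {j: warm.get(j, 0.0) for j in cols}
    r = residual(X, y, beta)
    for _ in range(sweeps):
        biggest = 0.0
        for j in cols:
            norm = sum(x * x for x in X[j]) / n
            if norm == 0.0:
                continue
            new = soft_threshold(gradient(X, r, j) + norm * beta[j], lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                for i, x in enumerate(X[j]):
                    r[i] -= x * delta
                beta[j] = new
                biggest = max(biggest, abs(delta))
        if biggest < tol:
            break
    return beta


def basil(X, y, lambdas, batch=5, kkt_tol=1e-6):
    """Simplified batch-screening loop over a decreasing sequence `lambdas`."""
    p = len(X)
    beta, keep, path = {}, set(), []
    for lam in lambdas:
        r = residual(X, y, beta)
        while True:
            # Screening: add the `batch` excluded columns with the largest |gradient|.
            ranked = sorted((j for j in range(p) if j not in keep),
                            key=lambda j: -abs(gradient(X, r, j)))
            keep.update(ranked[:batch])
            # Fitting: solve the lasso on the screened columns only.
            beta = lasso_cd(X, y, sorted(keep), lam, beta)
            # Checking: KKT requires |gradient| <= lam for every excluded column.
            r = residual(X, y, beta)
            if all(abs(gradient(X, r, j)) <= lam + kkt_tol
                   for j in range(p) if j not in keep):
                break
        keep = {j for j, b in beta.items() if b != 0.0}  # carry the active set forward
        path.append((lam, {j: beta[j] for j in keep}))
    return path


def toy_data(n=40, p=15, seed=0):
    """Hypothetical toy data: Gaussian columns, three true signals plus noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [2 * X[0][i] - 1.5 * X[3][i] + X[7][i] + 0.1 * rng.gauss(0, 1) for i in range(n)]
    return X, y


X, y = toy_data()
lam_max = max(abs(gradient(X, y, j)) for j in range(len(X)))
for lam, coefs in basil(X, y, [lam_max * 0.6 ** k for k in range(1, 6)]):
    print(f"lambda={lam:.4f}  selected={sorted(coefs)}")
```

In this sketch the solver only ever sees the screened columns. The full set of variables is touched only when computing gradients for screening and for the KKT check. That pattern is what would let such a scheme handle data larger than memory, since those passes can be streamed.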