grplasso • grplasso

Introduction:

grplasso is a specialized R package designed for fitting penalized regression models in settings involving high-dimensional grouped (or clustered) effects. Unlike traditional lasso approaches that treat all covariates equally, grplasso distinguishes and efficiently handles group structures to improve the computational performance.

This tutorial demonstrates how to apply grplasso in practice using illustrative example datasets.

Installation:

require("devtools")
require("remotes")
remotes::install_github("UM-KevinHe/grplasso", ref = "main")

Quick Start:

In this section, we will explore the fundamental usage of the functions integrated into the current R package, providing a detailed interpretation of the resulting values obtained from these functions. To enhance users’ understanding of the R package, we will employ example datasets, enabling a comprehensive grasp of its functionalities.

library(grplasso)
#> Error in get(paste0(generic, ".", class), envir = get_method_env()) : 
#>   object 'type_sum.accel' not found

1. Generalized Linear Models (GLMs)

To exemplify the process of fitting a generalized linear model, we will utilize the “BinaryData” dataset included in the package. This dataset consists of five predictors, an indicator for provider information, and a binary outcome variable.

data(BinaryData)
data <- BinaryData$data
Y.char <- BinaryData$Y.char # variable name of outcome variable
prov.char <- BinaryData$prov.char # variable name of provider indicator
Z.char <- BinaryData$Z.char # variable names of predictors
head(data)
#>   Y Prov.ID     Z1     Z2     Z3     Z4     Z5
#> 1 1      10 -0.006 -0.159  0.588  0.378 -1.430
#> 2 0      11  0.862 -0.162  0.243  0.503  0.549
#> 3 0      11  0.453  0.961  0.665  0.522  0.859
#> 4 1       8 -0.286  0.001  0.903  0.886 -0.324
#> 5 0      13  0.564 -0.194 -0.062 -0.792 -0.783
#> 6 1      18  0.676  1.095  2.223  1.367  1.060

Without grouped covariate

The pp.lasso() function is employed to fit a generalized linear model when the covariate does not include any group information. When the user does not specify the regularization coefficient $\lambda$ , our function automatically generates a sequence of $\lambda$ values by default. The sequence starts with the largest $\lambda$ , which penalizes all covariates to zero, and then gradually decreases $\lambda$ to allow for variable selection and modeling flexibility.

fit <- pp.lasso(data, Y.char, Z.char, prov.char)

The coef() function serves to provide estimates of the coefficients in the fitted model. The resulting coefficient matrix is structured such that the column names correspond to the $\lambda$ values used in the modeling process.

coef(fit)$beta[, 1:5]
#>    0.132    0.1202    0.1096    0.0998     0.091
#> Z1     0 0.0000000 0.0000000 0.0000000 0.0000000
#> Z2     0 0.0000000 0.0000000 0.0000000 0.0000000
#> Z3     0 0.1120347 0.2148847 0.3098777 0.3980465
#> Z4     0 0.0000000 0.0000000 0.0000000 0.0000000
#> Z5     0 0.0000000 0.0000000 0.0000000 0.0000000
coef(fit)$gamma[1:5, 1:5]
#>        0.132     0.1202     0.1096     0.0998      0.091
#> 1 -0.2411977 -0.2939898 -0.3431091 -0.3889330 -0.4318211
#> 2 -1.9635362 -1.8742433 -1.7984115 -1.7336457 -1.6779754
#> 3 -1.2089403 -1.1883078 -1.1712888 -1.1572109 -1.1455229
#> 4 -1.9600386 -1.8922127 -1.8332183 -1.7815619 -1.7360041
#> 5 -0.5500569 -0.5698456 -0.5894538 -0.6087582 -0.6276779

The plot() function is designed to generate a figure depicting the regularization path. This path illustrates the behavior of the coefficients for each predictor variable as the regularization parameter $\lambda$ varies. By visualizing the regularization path, users can gain valuable insights into the impact of different regularization strengths on the coefficients, aiding in model interpretation and selection.

plot(fit, label = TRUE)

Coefficient selection path for the generalized linear model without grouped covariates

The predict() function is utilized to generate model predictions for a given dataset based on the coefficient estimates obtained from the fitted model. Once the model has been trained using the pp.lasso() function and the coefficients have been estimated, the predict function can be applied to new data to obtain predictions for the outcome variable.

This function offers various types of outputs to suit different analysis needs. For instance, when fitting a penalized logistic regression model, using type = “response” provides the probabilities of “Y = 1” for each observation, while type = “class” provides the predicted class.

predict(fit, data, Z.char, prov.char, lambda = fit$lambda, type = "response")[1:5, 1:5]
#>          0.132    0.1202    0.1096    0.0998     0.091
#> [1,] 0.2874999 0.2996927 0.3106971 0.3206762 0.3297643
#> [2,] 0.2051302 0.2112109 0.2163924 0.2208163 0.2246015
#> [3,] 0.2051302 0.2191951 0.2321638 0.2441339 0.2551989
#> [4,] 0.3947391 0.4208778 0.4449423 0.4671087 0.4875485
#> [5,] 0.2345694 0.2340619 0.2330311 0.2316163 0.2299295
predict(fit, data, Z.char, prov.char, lambda = 0.001, type = "class")[1:5]
#> [1] 1 0 0 1 0

With grouped covariate

The utilization of the grp.lasso() function is similar to the previously mentioned methods, with the added requirement of providing group information for the covariates. When calling the function, users should provide the necessary group information to ensure proper grouping of variables for regularization. However, if the user does not explicitly provide the group information, the grp.lasso() function will automatically assume that each variable is treated as an individual group on its own. This default behavior simplifies the process for users who do not wish to specify explicit groups, ensuring that the function can still be applied effectively without the need for additional input.

group <- BinaryData$group
fit2 <- grp.lasso(data, Y.char, Z.char, prov.char, group = group)
plot(fit2, label = T)

Coefficient selection path for the generalized linear model with grouped covariates

Please note that for both pp.lasso() and grp.lasso() functions, the parameter “prov.char” is optional. In the event that the user does not specify the provider information for the observations, the program will automatically assume that all observations originate from the same health provider, resulting in the generation of one common intercept.

fit.no_prov <- pp.lasso(data, Y.char, Z.char)
#> Warning: Provider information not provided. All data is assumed to originate
#> from a single provider!
coef(fit.no_prov)$beta[, 1:5]
#>    0.2513    0.2289    0.2086    0.1901    0.1732
#> Z1      0 0.0000000 0.0000000 0.0000000 0.0000000
#> Z2      0 0.0000000 0.0000000 0.0000000 0.0000000
#> Z3      0 0.1006904 0.1935559 0.2800125 0.3610722
#> Z4      0 0.0000000 0.0000000 0.0000000 0.0000000
#> Z5      0 0.0000000 0.0000000 0.0000000 0.0000000
coef(fit.no_prov)$gamma[1:5] #"gamma" is treated as the common intercept
#>     0.2513     0.2289     0.2086     0.1901     0.1732 
#> -0.5838804 -0.5896897 -0.5974013 -0.6065479 -0.6167685

Regularization parameter selection for GLM problems

The optimal regularization parameter ( $\lambda$ ) is determined through cross-validation. To find the best $\lambda$ , users can employ either the cv.pp.lasso() or cv.grp.lasso() function, depending on the specific type of model they are working with. These cross-validation functions inherit the parameters required for the model fitting process, providing a seamless and straightforward experience for users. By default, both functions utilize 10-fold cross-validation.

fit <- cv.pp.lasso(data, Y.char, Z.char, prov.char, nfolds = 10)

The plot() function, applied to a cv.pp.lasso or cv.grp.lasso object, generates a figure that enables users to assess how the cross-entropy loss changes with varying values of $\lambda$ . By observing the plot, users can easily identify the point at which the cross-entropy loss is minimized.

plot(fit)

Regularization parameter selection for GLM problems

Indeed, users can directly use the fit$lambda.min command to obtain the optimal value of $\lambda$ .

fit$lambda.min
#> [1] 0.001259591

2. Discrete Survival Models

The pp.DiscSurv() function is utilized for fitting a penalized discrete survival model. In contrast to the current R package, this function does not necessitate data expansion based on discrete time points, resulting in a significant reduction in memory usage and convergence time required for operation.

The DiscTime dataset, included in this package, comprises of 5 covariates, provider information, observation time, and event indicator. We will be using this dataset as an example to illustrate how to utilize it.

data(DiscTime)
data <- DiscTime$data
Event.char <- DiscTime$Event.char
Time.char <- DiscTime$Time.char
head(data)
#>   Prov.ID     Z1     Z2     Z3     Z4     Z5  time status
#> 1       4 -1.298 -1.471 -1.346 -1.775 -1.241  1.03      1
#> 2       5  0.192  0.163  0.953  1.083  0.371 12.50      0
#> 3       4 -2.315 -2.038 -1.629 -2.577 -2.008  0.53      1
#> 4       1 -0.198  1.231  0.633  0.739  0.214  6.74      1
#> 5       2  0.469  0.640 -0.583  0.161  0.927 12.50      0
#> 6       4 -1.445 -1.848 -3.024 -2.117 -1.894  0.53      1

fit <- pp.DiscSurv(data, Event.char, prov.char, Z.char, Time.char)

The pp.DiscSurv() function yields three main sets of coefficients as its primary output. These coefficients pertain to the covariate estimates, log-transformed baseline hazard for various time points, and provider effects. To avoid multicollinearity problems, we designate the first provider as the reference group.

Similar to the GLM fitting functions mentioned earlier, pp.DiscSurv() is also furnished with a coef() function. This function facilitates the provision of coefficient estimates within the fitted penalized discrete survival model across the entire sequence of $\lambda$ values employed in the modeling procedure.

coef(fit)$beta[, 1:5]
#>    0.1601     0.1458     0.1329     0.1211     0.1103
#> Z1      0 -0.1735757 -0.3325774 -0.4793679 -0.6155647
#> Z2      0  0.0000000  0.0000000  0.0000000  0.0000000
#> Z3      0  0.0000000  0.0000000  0.0000000  0.0000000
#> Z4      0  0.0000000  0.0000000  0.0000000  0.0000000
#> Z5      0  0.0000000  0.0000000  0.0000000  0.0000000
coef(fit)$gamma[, 1:5]
#>       0.1601     0.1458    0.1329     0.1211     0.1103
#> 1  0.0000000  0.0000000  0.000000  0.0000000  0.0000000
#> 2 -4.5665147 -4.3516946 -4.162992 -3.9945779 -3.8425705
#> 3 -0.7478311 -0.7116142 -0.682481 -0.6580980 -0.6367834
#> 4  1.2412090  1.1456446  1.059286  0.9815198  0.9121089
#> 5 -2.3399410 -2.1966372 -2.070536 -1.9577270 -1.8555876
coef(fit)$alpha[, 1:5]
#>                 0.1601    0.1458    0.1329    0.1211    0.1103
#> [Time: 0.53] -1.668065 -1.747753 -1.824209 -1.898663 -1.971909
#> [Time: 1.03] -1.480994 -1.526815 -1.571328 -1.615488 -1.659886
#> [Time: 3.92] -1.622145 -1.641556 -1.661848 -1.683563 -1.706991
#> [Time: 6.74] -1.251585 -1.257340 -1.264124 -1.272503 -1.282795
#> [Time: 12.5] -1.781635 -1.780095 -1.780526 -1.783340 -1.788771

Users have the option to utilize the plot() function, which generates a figure illustrating the regularization path. This visual representation showcases the behavior of the coefficients for each predictor variable as $\lambda$ varies.

plot(fit, label = T)

Coefficient selection path for the discrete survival model

The predict() function is employed to generate model predictions for a given dataset using the coefficient estimates obtained from the fitted model. It is essential to note that the discrete time points within the prediction data set must align with the discrete time points used during model fitting. If they do not match, the baseline hazard of mismatches cannot be estimated accurately.

predict(fit, data, Event.char, prov.char, Z.char, Time.char, lambda = fit$lambda, type = "response", which.lambda = fit$lambda[1])[1:5,]
#>   Individual [Time: 0.53] [Time: 1.03] [Time: 3.92] [Time: 6.74] [Time: 12.5]
#> 1          1    0.1586823    0.1852773           NA           NA           NA
#> 2          2    0.1586823    0.1852773    0.1649092    0.2224258    0.1441013
#> 3          3    0.1586823           NA           NA           NA           NA
#> 4          4    0.1586823    0.1852773    0.1649092    0.2224258           NA
#> 5          5    0.1586823    0.1852773    0.1649092    0.2224258    0.1441013

Regularization parameter selection for discrete survival models

The optimal regularization parameter ( $\lambda$ ) is determined through cross-validation, utilizing the cross-validation error as the guiding metric. Users can identify the best $\lambda$ value by employing the cv.pp.DiscSurv() function.

fit <- cv.pp.DiscSurv(data, Event.char, prov.char, Z.char, Time.char, nfolds = 10, trace.cv = T)
#> Starting CV fold #1...
#> Starting CV fold #2...
#> Starting CV fold #3...
#> Starting CV fold #4...
#> Starting CV fold #5...
#> Starting CV fold #6...
#> Starting CV fold #7...
#> Starting CV fold #8...
#> Starting CV fold #9...
#> Starting CV fold #10...

Users can either utilize the plot() function or directly access the fit$lambda.min command to identify the optimal value of $\lambda$ at which the cross-entropy loss is minimized.

plot(fit)

Regularization parameter selection for discrete survival models

fit$lambda.min
#> [1] 0.00243264

Discrete survival model with no grouped (or clusted) information provided

Similar with our GLM-related functions, we present a solution for fitting a discrete survival model without requiring provider information. This approach tackles the issue commonly encountered with existing statistical tools used to fit discrete survival models, which often necessitate data expansion, leading to significantly slow convergence.

For ease of use, we have introduced a new function called DiscSurv(), designed to facilitate a similar user experience to pp.DiscSurv(). However, the key difference lies in the fact that DiscSurv() no longer demands provider information from the user, meaning that all observations are now treated as originating from the same healthcare provider.

Additionally, we have provided the cv.DiscSurv() function to aid users in selecting optimal $\lambda$ . Moreover, the coef(), predict(), and plot() functions have also been thoughtfully incorporated.

fit <- DiscSurv(data, Event.char, Z.char, Time.char) # no provider information required
coef(fit, lambda = fit$lambda)$alpha[, 1:10] #time effect
#>                 0.4261    0.3882    0.3537    0.3223    0.2937    0.2676
#> [Time: 0.53] -1.979501 -1.970055 -1.977003 -1.996580 -2.025956 -2.063289
#> [Time: 1.03] -2.110213 -2.069229 -2.043936 -2.030857 -2.027098 -2.024884
#> [Time: 3.92] -2.484907 -2.418504 -2.367851 -2.329504 -2.300539 -2.271883
#> [Time: 6.74] -2.355695 -2.272311 -2.204520 -2.148985 -2.102832 -2.057118
#> [Time: 12.5] -2.958691 -2.863140 -2.784495 -2.719193 -2.664086 -2.607955
#>                 0.2438    0.2221    0.2024    0.1844
#> [Time: 0.53] -2.108335 -2.159552 -2.215497 -2.275267
#> [Time: 1.03] -2.029997 -2.041139 -2.057146 -2.077164
#> [Time: 3.92] -2.250817 -2.236107 -2.226728 -2.221906
#> [Time: 6.74] -2.019042 -1.987381 -1.961174 -1.939721
#> [Time: 12.5] -2.560084 -2.519189 -2.484282 -2.454649
coef(fit, lambda = fit$lambda)$beta[, 1:10] #covariate coefficient
#>    0.4261     0.3882     0.3537     0.3223       0.2937      0.2676     0.2438
#> Z1      0 -0.1556065 -0.2988513 -0.4321738 -0.555131105 -0.62728969 -0.6971121
#> Z2      0  0.0000000  0.0000000  0.0000000  0.000000000  0.00000000  0.0000000
#> Z3      0  0.0000000  0.0000000  0.0000000  0.000000000  0.00000000  0.0000000
#> Z4      0  0.0000000  0.0000000  0.0000000  0.000000000  0.00000000  0.0000000
#> Z5      0  0.0000000  0.0000000  0.0000000 -0.003068977 -0.06697318 -0.1271421
#>        0.2221     0.2024       0.1844
#> Z1 -0.7648862 -0.8306053 -0.891723581
#> Z2  0.0000000  0.0000000  0.000000000
#> Z3  0.0000000  0.0000000 -0.005878473
#> Z4  0.0000000  0.0000000  0.000000000
#> Z5 -0.1840291 -0.2383354 -0.287531205

plot(fit, label = T)

Coefficient selection path for the discrete survival model with no center information provided

cv.fit <- cv.DiscSurv(data, Event.char, Z.char, Time.char, nfolds = 10, trace.cv = T)
#> Starting CV fold #1...
#> Starting CV fold #2...
#> Starting CV fold #3...
#> Starting CV fold #4...
#> Starting CV fold #5...
#> Starting CV fold #6...
#> Starting CV fold #7...
#> Starting CV fold #8...
#> Starting CV fold #9...
#> Starting CV fold #10...

plot(cv.fit)

Regularization parameter selection for discrete survival models with no center information provided

predict(fit, data, Event.char, Z.char, Time.char, lambda = fit$lambda, type = "response", which.lambda = cv.fit$lambda.min)[1:5,]
#>   Individual [Time: 0.53] [Time: 1.03] [Time: 3.92] [Time: 6.74] [Time: 12.5]
#> 1          1  0.491864462   0.74764296           NA           NA           NA
#> 2          2  0.009860994   0.02957994  0.041247655   0.07816691   0.05422556
#> 3          3  0.952556420           NA           NA           NA           NA
#> 4          4  0.020209741   0.05938188  0.081813820   0.14938494           NA
#> 5          5  0.001765611   0.00538432  0.007582723   0.01483601   0.01007986

3. Stratified Cox Models

Our R package offers the Strat.cox() and cv.strat_cox() functions designed for fitting penalized stratified Cox models. In the context of our intended scenario, each healthcare provider is considered a distinct stratum. The functionality of both Strat.cox() and cv.strat_cox() extends to the incorporation of group information among variables, achieved by configuring the “group” parameter.

We employ the ContTime simulation dataset, which is included within this package, to illustrate the utilization of these two functions. This dataset encompasses five covariates, a provider indicator (which serves as stratum information), as well as follow-up time and event indicators.

data(ContTime)
data <- ContTime$data
head(data)
#>      Prov.ID       Z1       Z2       Z3       Z4       Z5 status      time
#> [1,]      11 3.431888 4.397010 3.800101 4.809899 4.049038      1 0.4984520
#> [2,]       4 2.789097 2.616714 3.401736 3.510431 3.650074      0 3.0000000
#> [3,]       9 5.451523 4.009896 5.473275 4.772835 4.708293      1 0.5987062
#> [4,]      15 2.421634 2.284766 1.747758 2.182607 1.072065      1 0.6406964
#> [5,]       9 5.827436 5.238138 5.536604 5.265083 4.735679      0 3.0000000
#> [6,]      18 2.140707 1.773304 1.792366 1.774502 3.209669      0 3.0000000

fit <- Strat.cox(data, Event.char, Z.char, Time.char, prov.char, group = c(1, 2, 2, 3, 3))

Users can utilize the coef() function to obtain coefficient estimates for the covariates across the complete sequence of $\lambda$ values.

coef(fit)[, 1:5]
#>    0.0394      0.0368     0.0343       0.032      0.0298
#> Z1      0 -0.01983922 -0.0390155 -0.05686517 -0.07358008
#> Z2      0  0.00000000  0.0000000  0.00000000  0.00000000
#> Z3      0  0.00000000  0.0000000  0.00000000  0.00000000
#> Z4      0  0.00000000  0.0000000  0.00000000  0.00000000
#> Z5      0  0.00000000  0.0000000  0.00000000  0.00000000

The plot() function facilitates users in visualizing the regularization path.

plot(fit, label = T)

Coefficient selection path for the stratified Cox model with grouped covariates

It’s important to highlight that the prov.char parameter here is also discretionary. In the absence of user input, the program will automatically assign all observations to a single stratum by default. Consequently, the conventional Cox model will be employed for fitting.

fit.no_stratum <- Strat.cox(data, Event.char, Z.char, Time.char)
#> Warning: Provider information not provided. All data is assumed to originate
#> from a single provider!
coef(fit.no_stratum)[, 1:5]
#>    0.1745    0.1627      0.1517     0.1415        0.132
#> Z1      0 0.0000000 0.000000000 0.00000000 0.0000000000
#> Z2      0 0.0000000 0.000000000 0.00000000 0.0005808148
#> Z3      0 0.0155369 0.025755673 0.03313948 0.0398429622
#> Z4      0 0.0000000 0.004892787 0.01173321 0.0177767495
#> Z5      0 0.0000000 0.000000000 0.00000000 0.0000000000

Likewise, the cv.strat_cox() function can be employed for cross-validation, and the plot() function can be utilized to visualize cross-validation error.

cv.fit <- cv.strat_cox(data, Event.char, Z.char, Time.char, prov.char, group = c(1, 2, 2, 3, 3), nfolds = 10, se = "quick")
cv.fit$lambda.min
#> [1] 0.004228925

plot(cv.fit)

Regularization parameter selection for stratified Cox models