NCC KL Divergence-Based Transfer Learning
Matched-Set Notation and Model Setup
The nested case–control (NCC) design provides an efficient alternative for fitting Cox proportional hazards models when covariate measurement for the full cohort is costly. At each observed failure time, the failing subject (the case) is retained, while a number of controls are randomly sampled from the corresponding risk set. The resulting sampled matched sets are then analyzed using conditional likelihood.
We first introduce the matched-set notation. Let $R_s(t)$ denote the at-risk set at time $t$ in stratum $s$. For the failure at $t_{sj}$, we randomly sample $m$ controls from $R_s(t_{sj})$, excluding the failing subject and matched by stratum. The failing subject together with the sampled controls forms a matched set of size $m + 1$, which we denote by $\tilde{R}_s(t_{sj})$. Since each stratum $s$ contains $D_s$ unique failure times, the NCC design yields a total of $K = \sum_s D_s$ matched sets across all strata.
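The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the package's implementation: the function name `sample_matched_sets`, its arguments, and the toy data are all hypothetical, and failure times are assumed to be unique within each stratum.

```python
import random

def sample_matched_sets(times, events, strata, m, seed=0):
    """Return one matched set per observed failure: [case, control, ...].

    Hypothetical helper: controls are drawn without replacement from the
    subjects in the same stratum who are still at risk at the failure time.
    """
    rng = random.Random(seed)
    matched_sets = []
    for case, (t, d, s) in enumerate(zip(times, events, strata)):
        if not d:
            continue  # censored subjects never define a matched set
        # Candidate controls: same stratum, still at risk at time t, not the case.
        pool = [i for i in range(len(times))
                if i != case and strata[i] == s and times[i] >= t]
        controls = rng.sample(pool, min(m, len(pool)))
        matched_sets.append([case] + controls)
    return matched_sets

# Toy cohort: six subjects in two strata (illustrative values only).
times = [2.0, 3.5, 1.0, 4.2, 2.8, 5.0]
events = [1, 0, 1, 1, 0, 0]
strata = [0, 0, 0, 1, 1, 1]
sets = sample_matched_sets(times, events, strata, m=2)
```

When fewer than `m` eligible controls remain, this sketch simply keeps all of them, so late failures can produce matched sets smaller than `m + 1`.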
Under this construction, the matched set $\tilde{R}_k$ forms a subset of the full at-risk set $R_k$, where $k = 1, \dots, K$ indexes the matched sets and $t_k$ is the corresponding failure time. Following the probabilistic framework introduced in the Cox KL divergence vignette, we similarly define a sequence of conditional experiments $(Y_k \mid \mathcal{F}_k)$, $k = 1, \dots, K$. Here $\{Y_k = i\}$ denotes the event that subject $i$ is the failing subject in the $k$-th matched set, and $\mathcal{F}_k$ collects all failure and censoring information up to time $t_k$, together with the information that the matched set $\tilde{R}_k$ has been formed by sampling controls from $R_k$ and that exactly one failure occurs in $\tilde{R}_k$.
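Consequently, within the $k$-th matched set, the NCC working model specifies the conditional distribution of the failing subject as a multinomial distribution with a single trial, i.e.,

$$Y_k \mid \mathcal{F}_k \sim \text{Multinomial}\bigl(1;\ \{p_{ki}(\beta)\}_{i \in \tilde{R}_k}\bigr),$$

where the probability mass assigned to subject $i \in \tilde{R}_k$ is

$$p_{ki}(\beta) = \frac{\exp(x_i^\top \beta)}{\sum_{l \in \tilde{R}_k} \exp(x_l^\top \beta)}.$$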
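As a small numerical illustration, the within-set probability mass is a softmax of the linear predictors restricted to the members of one matched set. The sketch below assumes a list-of-lists covariate layout; the function name `matched_set_probs` and the example values are hypothetical.

```python
import math

def matched_set_probs(x, beta, matched_set):
    """Probability that each member of the matched set is the failing subject.

    x: list of covariate vectors, indexed by subject id (assumed layout).
    beta: coefficient vector.
    matched_set: subject ids in the matched set.
    """
    scores = [sum(b * v for b, v in zip(beta, x[i])) for i in matched_set]
    shift = max(scores)  # subtract the max so exp() cannot overflow
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative covariates for four subjects; subject 2 is not in the matched set.
x = [[0.5, 1.0], [-0.2, 0.0], [1.5, 1.0], [0.0, 0.0]]
beta = [0.8, -0.3]
p = matched_set_probs(x, beta, matched_set=[0, 1, 3])
```

Only the three sampled subjects receive a probability; subject 2 is at risk in the cohort but plays no role in this conditional experiment.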
KL Divergence Formulation
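Because the full at-risk set $R_k$ is not used, the conditional probability $p_{ki}(\beta)$ generally differs from the corresponding conditional probability $p^{\mathrm{Cox}}_{ki}(\beta)$ in the internal Cox working model, whose denominator sums over all of $R_k$. Nevertheless, the probabilities $p_{ki}(\beta)$ remain valid conditional probabilities within the matched set, since

$$\sum_{i \in \tilde{R}_k} p_{ki}(\beta) = 1.$$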
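When the matched set is formed by random sampling from $R_k$, the denominator satisfies

$$\sum_{l \in \tilde{R}_k} \exp(x_l^\top \beta) \approx \frac{|\tilde{R}_k|}{|R_k|} \sum_{l \in R_k} \exp(x_l^\top \beta)$$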
when the risk-set size is large. Note, however, that the two probabilities are defined under different conditioning events: the Cox probability $p^{\mathrm{Cox}}_{ki}(\beta)$ conditions on the full at-risk set $R_k$, whereas the NCC probability $p_{ki}(\beta)$ conditions on the sampled matched set $\tilde{R}_k$. Consequently, the two probabilities are not directly comparable and do not satisfy a simple proportional relationship.

Numerically, for a subject in the matched set and a common value of $\beta$, $p_{ki}(\beta)$ is at least as large as $p^{\mathrm{Cox}}_{ki}(\beta)$, and typically larger, because the matched set is only a subsample of the full risk set and the denominator therefore sums over fewer positive terms. Such a comparison should be interpreted with caution, since the two probabilities correspond to different conditional experiments. In particular, under the NCC construction $p_{ki}(\beta)$ is defined only for subjects included in the sampled matched set; subjects in the full risk set who are not sampled have no corresponding probability in this conditional experiment.
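The numerical relationship between the two probabilities can be checked on simulated risk scores. The following sketch uses made-up scores and a hypothetical helper `softmax_over`; it only illustrates that restricting the denominator to a subsample cannot decrease the probability of a sampled subject.

```python
import math
import random

def softmax_over(scores, members):
    """Normalize exp(score) over the given members only."""
    total = sum(math.exp(scores[i]) for i in members)
    return {i: math.exp(scores[i]) / total for i in members}

rng = random.Random(1)
risk_set = list(range(200))                          # full at-risk set (toy size)
scores = [rng.gauss(0.0, 0.5) for _ in risk_set]     # simulated linear predictors
case = 0
matched = [case] + rng.sample(risk_set[1:], 4)       # case plus 4 sampled controls

p_cox = softmax_over(scores, risk_set)   # denominator over the full risk set
p_ncc = softmax_over(scores, matched)    # denominator over the matched set only

for i in matched:
    print(i, round(p_cox[i], 4), round(p_ncc[i], 4))
```

The dictionary `p_ncc` has entries only for the five sampled subjects, which mirrors the point that unsampled members of the risk set have no probability in the NCC experiment.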
To extract information from the external model under the NCC design, we similarly replace the internal risk score $x_i^\top \beta$ by the external risk score $x_i^\top \tilde{\beta}$ obtained from the external coefficient estimates $\tilde{\beta}$. Analogous to the construction of $p_{ki}(\beta)$, we define the corresponding probability mass under the external model within the matched set as

$$\tilde{p}_{ki} = \frac{\exp(x_i^\top \tilde{\beta})}{\sum_{l \in \tilde{R}_k} \exp(x_l^\top \tilde{\beta})},$$

which takes the same multinomial form as in the Cox construction but with the denominator restricted to the matched set $\tilde{R}_k$.
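To quantify the discrepancy between the external and internal conditional probabilities for the $k$-th matched set, we define the KL divergence:

$$d_k(\beta) = \sum_{i \in \tilde{R}_k} \tilde{p}_{ki} \log \frac{\tilde{p}_{ki}}{p_{ki}(\beta)}.$$

Accumulating over all matched sets gives

$$D(\beta) = \sum_{k=1}^{K} d_k(\beta) = C - \sum_{k=1}^{K} \sum_{i \in \tilde{R}_k} \tilde{p}_{ki} \log p_{ki}(\beta),$$

where $C = \sum_{k=1}^{K} \sum_{i \in \tilde{R}_k} \tilde{p}_{ki} \log \tilde{p}_{ki}$ does not involve $\beta$.

Let $\delta_{ki}$ denote the indicator that subject $i$ is the observed failure in the $k$-th matched set, so that the internal NCC conditional log-likelihood is

$$\ell(\beta) = \sum_{k=1}^{K} \sum_{i \in \tilde{R}_k} \delta_{ki} \log p_{ki}(\beta).$$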
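To make these quantities concrete, the sketch below evaluates the KL divergence from the external to the internal within-set probabilities, its split into a constant term and a cross-entropy term, and the conditional log-likelihood. The probabilities are illustrative numbers, and the function names are hypothetical.

```python
import math

def kl_divergence(p_ext, p_int):
    """KL(external || internal) over the members of one matched set."""
    return sum(a * math.log(a / b) for a, b in zip(p_ext, p_int) if a > 0)

def cond_loglik(p_int_sets, case_pos):
    """Sum of log-probabilities assigned to the observed case in each set."""
    return sum(math.log(p[c]) for p, c in zip(p_int_sets, case_pos))

# Two matched sets with made-up external and internal probabilities.
p_ext_sets = [[0.5, 0.3, 0.2], [0.6, 0.4]]
p_int_sets = [[0.4, 0.4, 0.2], [0.7, 0.3]]
case_pos = [0, 0]  # position of the observed case within each set

total_kl = sum(kl_divergence(a, b) for a, b in zip(p_ext_sets, p_int_sets))
# Constant part: depends only on the external probabilities.
const = sum(a * math.log(a) for ps in p_ext_sets for a in ps)
# Cross term: the only part that changes with the internal coefficients.
cross = sum(a * math.log(b)
            for pa, pb in zip(p_ext_sets, p_int_sets)
            for a, b in zip(pa, pb))
loglik = cond_loglik(p_int_sets, case_pos)
```

Here `total_kl` equals `const - cross`, so minimizing the accumulated divergence over the internal coefficients amounts to maximizing the cross term.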
Integrated Objective Function
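Proposition. Under the NCC construction, the integrated objective function satisfies

$$\ell_\eta(\beta) = \ell(\beta) - \eta D(\beta) = \sum_{k=1}^{K} \sum_{i \in \tilde{R}_k} \bigl(\delta_{ki} + \eta \tilde{p}_{ki}\bigr) \log p_{ki}(\beta) - \eta C,$$

where $\ell(\beta)$ is the internal NCC conditional log-likelihood, $D(\beta)$ is the KL divergence accumulated over all matched sets, $C$ is a constant that does not involve $\beta$, $\tilde{p}_{ki}$ denotes the externally induced pseudo-event weight within the $k$-th matched set that can be fully precomputed before optimization, and $\eta \ge 0$ is the integration weight.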
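The proposition can be checked numerically. The sketch below assumes the integrated objective is the conditional log-likelihood minus the integration weight times the accumulated KL divergence, and compares it with a weighted log-likelihood whose weights are the failure indicator plus the integration weight times the external probability. The scalar covariates, the external coefficient `beta_ext`, and the value of `eta` are made up for illustration.

```python
import math

def softmax(scores):
    shift = max(scores)  # stabilize exp()
    w = [math.exp(s - shift) for s in scores]
    total = sum(w)
    return [v / total for v in w]

# Three matched sets with one scalar covariate per member; the case is listed first.
matched_x = [[0.2, -1.0, 0.7], [1.1, 0.3, -0.4], [-0.5, 0.9, 0.1]]
beta_ext, eta = 0.6, 0.5

# Pseudo-event weights: computed once, before any optimization over beta.
p_ext = [softmax([beta_ext * v for v in xs]) for xs in matched_x]

def objective_direct(beta):
    """Log-likelihood minus eta times the accumulated KL divergence."""
    ll, kl = 0.0, 0.0
    for xs, pe in zip(matched_x, p_ext):
        p = softmax([beta * v for v in xs])
        ll += math.log(p[0])
        kl += sum(a * math.log(a / b) for a, b in zip(pe, p))
    return ll - eta * kl

def objective_weighted(beta):
    """Weighted log-likelihood with weights (indicator + eta * external prob)."""
    total = 0.0
    for xs, pe in zip(matched_x, p_ext):
        p = softmax([beta * v for v in xs])
        for i, (a, b) in enumerate(zip(pe, p)):
            delta = 1.0 if i == 0 else 0.0
            total += (delta + eta * a) * math.log(b)
    return total

# The two forms differ only by a constant that does not depend on beta.
const = eta * sum(a * math.log(a) for pe in p_ext for a in pe)
gaps = [objective_direct(b) + const - objective_weighted(b)
        for b in (-1.0, 0.0, 0.3, 2.0)]
```

Because the gap is the same constant for every `beta`, both forms share the same maximizer, which is why the weighted form can be handed to any routine that fits a weighted conditional likelihood.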
It is worth emphasizing that, although the NCC design is commonly used as a computationally efficient surrogate for the Cox model and yields consistent estimators of the regression coefficients, the KL divergence defined under the NCC construction is not a direct surrogate for the Cox-model KL divergence. This is because the two KL quantities are defined with respect to different conditional experiments. Consequently, the corresponding probability models, and hence the associated KL divergences, are defined on different probability spaces.