General background

Figure 1: Survival analyses at-a-glance

What are the main characteristics of survival data?

Survival data are data with three main characteristics:

  1. the dependent variable or response is the waiting time (\(T\)) until an event of interest takes place,

  2. observations are censored, meaning that for some units the event of interest has not taken place at the time the data are analyzed, and

  3. there may be covariates/predictors/explanatory variables (\(X\)) whose effect on the waiting time (\(T\)) we wish to assess or control.

What are various types of survival analyses?

Based on each of these three characteristics, survival analysis can be classified into different types.

  • Types of survival analysis based on the waiting time (\(T\)): If the waiting time is a continuous random variable (which is perhaps the case considered most often, e.g., waiting time until death from lung cancer or heart attack), the corresponding analysis is continuous time survival analysis. If the waiting time is discrete, taking values only on a discrete grid (say, waiting time until the change of ruling party in the government), one considers discrete time survival analysis.

  • Types of survival analysis based on censoring: Ordinary/conventional survival analysis (such as the standard Kaplan–Meier method) assumes non-informative censoring, which means the censoring happens independently of the individual subject’s risk of the event. In other words, at any time-point, the future risk of the participants still being followed is the same as the future risk of the participants no longer being followed (censored because of end-of-study or drop-out); that is, there is no difference between them in terms of event risk, so losses to follow-up are essentially random and provide no information about the event risk. In contrast, when the future risk of the participants still being followed is not the same as the future risk of the participants no longer being followed, informative censoring is said to have taken place. This happens in the presence of one or more competing events, whose occurrence rules out the occurrence of the main event and vice versa (for example, in a study with cardiovascular death as the main outcome, death from non-cardiovascular causes is a competing event). In the presence of informative censoring, standard Kaplan–Meier analysis leads to bias. In such cases, one may consider competing risks analysis. Further, in settings where the competing events rule out the occurrence of the main event but not vice versa (for example, when the main event is nonterminal but the competing event is terminal), one may consider semi-competing risks analysis.

  • Types of survival analysis based on the use of covariates: Consider
    Waiting time till event = \(T\),
    Survival function, \(S(t)\) = probability that waiting time is more than \(t\) = \(P(T>t)\),
    Incidence function, \(F(t)\) = probability that event occurs by time \(t\) = \(P(T \leq t)\) = \(1 - S(t)\),
    Event density, \(f(t)\) = rate of events per unit time = \(F'(t)\) [if \(F\) is differentiable, i.e., \(T\) is absolutely continuous],
    Hazard function/rate, \(h(t)\) = instantaneous rate of event occurrence at time \(t\) = \(f(t)/S(t)\).
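
    For example, under the purely illustrative assumption that \(T\) is exponential with rate \(h\), these quantities are related as
    \[S(t) = e^{-ht}, \qquad F(t) = 1 - e^{-ht}, \qquad f(t) = F'(t) = h e^{-ht}, \qquad h(t) = \frac{f(t)}{S(t)} = h,\]
    which is exactly the constant-hazard (exponential) model that appears below.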

    Figure 2: Parametric survival models

    Different types of analysis:

    • Consider the case where \(S(t)\) or \(h(t)\) is simply a function of \(t\) (no covariate effect; homogeneous population).

      • assume no specific distribution for \(T\) (that is, nonparametric analysis) –> standard Kaplan–Meier analysis/curve for \(S(t)\) and the log-rank test for comparing \(S(t)\) among different groups (see the code sketch after this list).

      • assume some specific distribution for \(T\) (that is, parametric analysis) –> \(T\) follows exponential distribution (i.e., \(h(t)=h\)), or \(T\) follows Weibull distribution (i.e., \(h(t)=hk(ht)^{k-1}\)) etc.

    • Consider the effects of covariates on \(S(t)\) or \(h(t)\) (heterogeneous population). Account for heterogeneity in the model.

      • assume no specific distribution for \(T|X\) (that is, nonparametric analysis) –> survival tree, survival forest analysis.

      • assume some specific distribution for \(T|X\) (that is, parametric regression model) –> \(h(t;x)=h\exp(x^T\beta)\) for exponential regression analysis, \(h(t;x)=hk(ht)^{k-1}\exp(x^T\beta)\) for Weibull regression analysis, etc.

      • assume \(h(t;x)=h_0(t)\exp(x^T\beta)\) (that is, semi-parametric regression model) –> Relative risk or Cox regression analysis (Cox proportional hazards model). The proportional hazards model assumes that the effect of a covariate is to multiply the hazard by some constant. [If you generalize to \(h_j(t;x)=h_{0j}(t)\exp(x^T\beta)\) (stratified baseline hazards) or \(h(t;x(t))=h_0(t)\exp({x(t)}^T\beta)\) (time-varying covariates), the hazards are no longer proportional, but the model is still a Cox regression.]

      • assume \(h(t;x)=h_0(t\exp(-x^T\beta)) \exp(-x^T\beta)\) –> Accelerated failure time (AFT) model. AFT model assumes that the effect of a covariate is to accelerate or decelerate the life course of a disease by some constant.

      • in addition to the observed heterogeneity (observed X’s), assume a random effect (frailty term) in the model in order to account for unobserved heterogeneity (unobserved X’s that affect the outcome). \(h(t;x | Z)=Zh_0(t)\exp(x^T\beta)\) –> Frailty models. Unlike the Cox regression and AFT models, under frailty models, individuals with the same value of the covariates may have different distributions.

      • assume \(h(t;x)=h_0(t) + x^T\beta\) –> Additive hazards model.

      • neural network models / deep learning models.
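
To make a couple of the rows above concrete, here is a minimal Python sketch (not part of the original note), assuming the lifelines package is installed; it uses the Rossi recidivism data bundled with lifelines purely as a stand-in for real data, and fits a Kaplan–Meier curve, a log-rank test, and a Cox proportional hazards model.

```python
# Minimal sketch, assuming the Python package `lifelines` is installed; the Rossi
# recidivism data bundled with lifelines ("week" = waiting time, "arrest" = event
# indicator, "fin" = financial aid) is used only as a stand-in for your own data.
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.datasets import load_rossi
from lifelines.statistics import logrank_test

df = load_rossi()

# Nonparametric: Kaplan–Meier estimate of S(t)
kmf = KaplanMeierFitter()
kmf.fit(df["week"], event_observed=df["arrest"])
print(kmf.survival_function_.head())

# Log-rank test comparing S(t) between two groups (financial aid: no vs yes)
g0, g1 = df[df["fin"] == 0], df[df["fin"] == 1]
print(logrank_test(g0["week"], g1["week"], g0["arrest"], g1["arrest"]).p_value)

# Semi-parametric: Cox proportional hazards model, h(t;x) = h0(t) exp(x'beta)
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary()
```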

Competing risks (CR) analysis measures

Incidence function:

For standard survival analysis, one considers
Incidence function = probability of the event happening by time \(t\):
\(F(t)=P(T \leq t)\). (Survival function is \(S(t)=1-F(t)\).)

For CR analysis, one considers
Cumulative incidence function of cause \(k\) = probability of the event happening by time \(t\) due to cause \(k\):
\(F_k(t)=P(T \leq t,\epsilon=k)\).
Of course, the sum of cumulative incidences of all competing causes is the total incidence, that is, \(F(t) = \sum_k F_k(t)\).

Further, cause-specific mortality of cause \(k\) = expected number of life years lost because of cause \(k\):
\(M_k(t) = \int_0^t F_k(s)ds\). (This is what a competing risks forest uses as its predicted value, as we will see below.)
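
As a minimal illustration (numpy only; the helper name, durations, and event codes below are invented toy values), the standard nonparametric estimate of \(F_k(t)\) in Aalen–Johansen form, and the corresponding \(M_k(t)\), can be sketched as follows.

```python
# Minimal numpy-only sketch; the durations and event codes are toy values.
# events: 0 = censored, 1 = event of interest (cause 1), 2 = competing event.
import numpy as np

durations = np.array([2.0, 3.0, 5.0, 7.0, 8.0, 10.0, 12.0, 15.0])
events    = np.array([1,   2,   1,   0,   2,   1,    0,    1])

def cumulative_incidence(durations, events, cause):
    """Aalen–Johansen-type estimate of F_cause(t) at each observed event time."""
    event_times = np.unique(durations[events > 0])    # event times, any cause
    S_minus, F_k, out = 1.0, 0.0, []
    for t in event_times:
        at_risk = np.sum(durations >= t)              # n_j: event-free just before t
        d_all   = np.sum((durations == t) & (events > 0))
        d_k     = np.sum((durations == t) & (events == cause))
        F_k    += S_minus * d_k / at_risk             # increment of F_k at t
        S_minus *= 1.0 - d_all / at_risk              # Kaplan–Meier update of S(t)
        out.append((t, F_k))
    return np.array(out)                              # columns: time, F_k(time)

cif1 = cumulative_incidence(durations, events, cause=1)
print(cif1)

# Cause-specific mortality M_1 at the last event time: exact integral of the
# right-continuous step function F_1(s) from 0 up to that time.
t, F1 = cif1[:, 0], cif1[:, 1]
print(np.sum(F1[:-1] * np.diff(t)))
```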

Hazard function:

For standard survival analysis, one considers
Hazard function = instantaneous rate of event occurrence at time \(t\):
\(h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T \leq t + \Delta t \mid T \geq t )}{\Delta t} = f(t)/S(t)\),
where \(f(t)\) is the event density at time \(t\).

For CR analysis, one considers

Cause-specific hazard (CSH) function of cause \(k\) = instantaneous rate of event occurrence at time \(t\) due to cause \(k\):
\(h_k^{CSH}(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T \leq t + \Delta t, \epsilon = k \mid T \geq t )}{\Delta t} = f_k(t)/S(t)\),
where \(f_k(t)\) is the cause-specific event density at time \(t\). Of course, the sum of the cause-specific hazards of all competing causes is the overall hazard, that is, \(h(t) = \sum_k h_k^{CSH}(t)\).

Also, the cumulative hazard function (or integrated hazard function) for cause \(k\) is \(H_k^{CSH}(t) = \int_0^t h_k^{CSH}(s)\,ds\).
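
Similarly, a minimal numpy-only sketch (same toy data and coding as above; the function name is made up) of the cause-specific Nelson–Aalen estimate \(\hat H_k(t) = \sum_{t_j \leq t} d_{kj}/n_j\):

```python
import numpy as np

durations = np.array([2.0, 3.0, 5.0, 7.0, 8.0, 10.0, 12.0, 15.0])
events    = np.array([1,   2,   1,   0,   2,   1,    0,    1])   # 0 = censored

def cause_specific_cum_hazard(durations, events, cause):
    H, out = 0.0, []
    for t in np.unique(durations[events == cause]):
        at_risk = np.sum(durations >= t)                        # n_j
        d_k     = np.sum((durations == t) & (events == cause))
        H      += d_k / at_risk                                 # Nelson–Aalen increment
        out.append((t, H))
    return np.array(out)                                        # columns: time, H_k(time)

print(cause_specific_cum_hazard(durations, events, cause=1))
```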

Note: The competing risks forest uses the cause-specific hazard. The Fine–Gray model uses the subdistribution hazard (below).

Subdistribution hazard (SDH) function of cause \(k\) = instantaneous rate of occurrence of a cause-\(k\) event at time \(t\) among subjects who have not yet experienced a cause-\(k\) event:
\(h_k^{SDH}(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T \leq t + \Delta t, \epsilon = k \mid T \geq t \; \cup \; (T \leq t \cap \epsilon \neq k))}{\Delta t}\).

The difference between CSH and SDH is that CSH removes from the risk set subjects who experienced the event (from any cause) by time \(t\), whereas SDH removes from the risk set only subjects who experienced the event by time \(t\) through cause \(k\) (not through other causes). For SDH, if a subject has experienced the event through any other cause (that is, a competing event has occurred), the subject remains in the risk set; in other words, subjects experiencing a competing event stay in the risk set forever. This leads to the SDH being smaller than the CSH, since the risk set is artificially inflated. The higher the competing hazard, the more subjects experience a competing event and the more the SDH is lowered in comparison to the CSH.
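
A tiny numpy sketch (same toy data and coding as above, not from the original text) that makes the risk-set difference visible: at each cause-1 event time, the SDH risk set additionally keeps the subjects whose competing event has already occurred, so it is never smaller than the CSH risk set.

```python
import numpy as np

durations = np.array([2.0, 3.0, 5.0, 7.0, 8.0, 10.0, 12.0, 15.0])
events    = np.array([1,   2,   1,   0,   2,   1,    0,    1])   # 0 = censored

for t in np.unique(durations[events == 1]):
    csh_risk = np.sum(durations >= t)                               # still event-free
    sdh_risk = csh_risk + np.sum((durations < t) & (events == 2))   # plus prior competing events
    print(f"t = {t}: CSH risk set = {csh_risk}, SDH risk set = {sdh_risk}")
```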

Competing risk modeling approaches:

  • One approach characterizes the joint distribution of the observed data in terms of the cause-specific hazard functions (Prentice et al., 1978). The likelihood is expressed as

    \[L = \prod_{i=1}^n \left\{ \bigg(\lambda_{j_i} (t_i;z_i^{\ast})\bigg)^{\delta_i} \; S(t_i;z_i^{\ast}) \right\},\]

    where \(\lambda_{j_i}\) is the cause-specific hazard of the cause \(j_i\) from which subject \(i\) failed, \(\delta_i\) is the censoring indicator, and \(z^{\ast}=z^{\ast}(t) = \{ z(u): u\leq t \}\) denotes the covariate history up to time \(t\) (see the numerical sketch after this list).

  • Another approach postulates a mixture model that expresses the distribution of \((D, T)\) (failure type and failure time) in terms of the marginal distribution of the failure type and the conditional distribution of the time to failure given the type of failure. The likelihood is expressed as

    \[L = \prod_{i=1}^n \left\{ \prod_{j=1}^J \bigg(\pi_j\bar{f}_j(t_i)\bigg)^{I(d_i=j)} \; \left( \sum_{j=1}^J \pi_j \bar{S}_j(t_i) \right)^{I(d_i=0)} \right\},\]

    where \(\pi_j=P(D=j)\) is the marginal event-type probability, \(\bar{f}_j(t)=f(t|D=j)\) and \(\bar{S}_j(t)=S(t|D=j)\) are respectively the density function and the survivor function of the conditional event-time distribution given that an individual failed from an event of type \(j\), and \(d_i=0\) indicates a censored observation.
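
As a purely numerical illustration of the first (cause-specific hazard) likelihood, here is a minimal sketch for the special case of two causes with constant hazards \(\lambda_1,\lambda_2\) and no covariates, so that \(S(t)=\exp\{-(\lambda_1+\lambda_2)t\}\); the data and the hazard values are toy choices.

```python
# Purely numerical sketch of the cause-specific-hazard likelihood above, in the
# special case of two causes with constant hazards lambda_1, lambda_2 and no
# covariates, so that S(t) = exp(-(lambda_1 + lambda_2) t). Toy data only.
import numpy as np

durations = np.array([2.0, 3.0, 5.0, 7.0, 8.0, 10.0, 12.0, 15.0])
events    = np.array([1,   2,   1,   0,   2,   1,    0,    1])   # 0 = censored

def log_likelihood(lam1, lam2):
    # sum_i [ delta_i * log(lambda_{j_i}) + log S(t_i) ]
    ll = -(lam1 + lam2) * durations.sum()           # every subject contributes log S(t_i)
    ll += np.sum(events == 1) * np.log(lam1)        # cause-1 events contribute log lambda_1
    ll += np.sum(events == 2) * np.log(lam2)        # cause-2 events contribute log lambda_2
    return ll

print(log_likelihood(0.06, 0.03))
```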

 


Random forest

What is decision tree learning?

A predictive modeling approach that uses a decision tree as the predictive model. Decision tree models, or simply decision trees, are built by recursive binary partitioning of the data. The basic principle is that, at every step, among all possible partitions of the data, the partition is chosen which maximizes homogeneity within the resulting two groups, or in other words, which creates the best-separated groups. Different algorithms use different metrics for measuring “best.” The response/target variable (\(Y\)) and the feature/predictor variables (\(X\)) can each be discrete or continuous. If the target variable is continuous, the tree is called a regression tree, and if the target variable is discrete, the tree is called a classification tree. Together, they are called Classification And Regression Trees (CART) (after Breiman 1984).

Figure 3: Decision tree building by recursive binary partitioning

How to build a decision tree?

Consider building a regression tree. Suppose \(Y\) is a continuous response and \(X\) is a predictor. Find the \(X\)-threshold that divides the data into two groups such that the residual sum of squares, RSS = \(\sum (y - \text{group mean})^2\) over all observations, is minimized. This best threshold is called the root node, or simply the root, of the tree. Then we have two groups. Within each group, do the same. In this way we could keep dividing the data into groups until each group contains a single observation, but that would overfit. So, divide a group further only if it has at least \(m\) observations, where \(m\) is often set to 20. A group which can be divided further is called an internal node, or simply a node, and a group which cannot be divided further is called a terminal node, or leaf. So, all groups with fewer than 20 observations become leaves. For a leaf, the predicted/modeled value is the mean of the \(Y\)-values of the group members.
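
A minimal Python sketch of the single step described above, scanning candidate thresholds of one predictor and choosing the one that minimizes the combined RSS of the two resulting groups (the toy data and the helper name are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # predictor (sorted)
y = np.array([1.1, 0.9, 1.2, 1.0, 3.9, 4.2, 4.0, 4.1])   # continuous response

def best_split(x, y):
    best_threshold, best_rss = None, np.inf
    for threshold in (x[:-1] + x[1:]) / 2.0:              # midpoints as candidate cuts
        left, right = y[x <= threshold], y[x > threshold]
        rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if rss < best_rss:
            best_threshold, best_rss = threshold, rss
    return best_threshold, best_rss

print(best_split(x, y))                                    # expect a cut near x = 4.5
```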

For building classification trees, instead of the RSS you consider the Gini impurity (different from the Gini coefficient) as the measure of homogeneity within the groups, as sketched below.
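
For reference, a small sketch (toy class labels; the helper name is made up) of the Gini impurity of a node and of the weighted daughter impurity used to score a candidate split:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2, where p_k is the class-k proportion in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:3], parent[3:]                       # one candidate split

weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(gini(parent), weighted)                              # the best split minimizes `weighted`
```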

What are splitting rule, size, depth and minimal depth for a decision tree?

A splitting rule is a rule that is used to split a node into two daughter nodes. For regression trees, the rule can be determined in terms of mean squared error (MSE); for classification trees, the rule can be determined in terms of Gini impurity. The size of a decision tree is the number of nodes in it. The depth of a decision tree is the length of the longest path from the root to a leaf. Minimal depth of a variable is the depth of the first split using that variable relative to the root node of a decision tree. Predictively more important variables are expected to be selected earlier in the tree for splitting. So, minimal depth is a measure of predictiveness of a variable. The smaller the minimal depth, the greater the impact the variable has on prediction. Minimal depth = 0 implies that this variable has been selected for splitting at the root node.

What is the difference between linear regression and regression trees?

Linear regression (parametric analysis) assumes a model of the form

\(f(X) = \beta_0 + \sum_{j=1}^p X_j\beta_j,\) (linear combination of the covariates)

whereas regression trees (nonparametric analysis) assume a model of the form

\(f(X) = \sum_{m=1}^M c_m 1_{(X \in R_m)},\) (linear combination of the indicator functions of the sets that constitute a partition of the covariate space)

where \(R_1,\ldots,R_M\) represent a partition of the feature/covariate space and \(c_m\) is the constant prediction in region \(R_m\).
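
A tiny sketch of what such a piecewise-constant fit looks like on one covariate (the cut points, constants, and the helper name `tree_predict` are all made up for illustration):

```python
import numpy as np

# Regions R_1 = (-inf, 2], R_2 = (2, 5], R_3 = (5, inf) with constants c = (1.0, 4.0, 2.5).
cuts = np.array([2.0, 5.0])
c = np.array([1.0, 4.0, 2.5])

def tree_predict(x):
    return c[np.searchsorted(cuts, x)]            # region index from the cut points

print(tree_predict(np.array([1.0, 3.0, 7.0])))    # -> [1.0, 4.0, 2.5]
```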

If the relationship between the features and the response is well approximated by a linear model, then an approach such as linear regression will likely work well, and will outperform a method such as a regression tree that does not exploit this linear structure. If instead there is a highly non-linear and complex relationship between the features and the response, then decision trees may outperform classical approaches.

What is bootstrap aggregation or bagging?

  • First create bootstrapped datasets from the original dataset (of size \(n\)). To do this, repeatedly draw random samples of the same size (\(n\)) from the original dataset with replacement.

  • Now create a decision tree for each bootstrapped dataset.

  • For prediction, predict using all the trees, and then average (regression) or vote (classification) for the final prediction (see the sketch after this list).
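
A minimal bagging sketch in Python (toy data and toy response; assumes scikit-learn for the individual trees):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # toy features
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)   # toy nonlinear response

B, n = 25, len(y)
trees = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)                 # bootstrap sample (with replacement)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

x_new = np.array([[0.5, -1.0, 2.0]])
# Aggregate: average the B tree predictions (for classification, take a majority vote).
print(np.mean([t.predict(x_new)[0] for t in trees]))
```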

What is the standard random forest building procedure?

Figure 4: Random forest

A random forest is an ensemble of decision trees: it involves building multiple decision trees from the data. Random forests are used to overcome an individual decision tree’s tendency to overfit.

  • First create bootstrapped datasets from the original dataset (of size \(n\)). To do this, repeatedly draw random samples of the same size (\(n\)) from the original dataset with replacement.

  • Now create a decision tree for each bootstrapped dataset, but for each tree, use only a random subset of \(m\) variables at each split. So, for a random forest, you randomize over the samples and randomize over the X variables considered when determining nodes; you can also try different values of the number \(m\) of X variables to consider, and later determine the optimal number (for example, using the out-of-bag error).

  • For prediction based on a random forest, predict using all the trees in the forest, and then average (regression) or vote (classification) for the final prediction.

If a random forest is built using \(m = p\), then this amounts simply to bagging.
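
In scikit-learn terms, for example (one possible implementation, not the only one), the per-split subset size \(m\) corresponds to the max_features argument; the toy data come from make_regression:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=8, noise=0.5, random_state=0)

# max_features=3 (< p = 8) gives a random forest; max_features=None (all p variables)
# would reduce the procedure to plain bagging.
rf = RandomForestRegressor(n_estimators=100, max_features=3, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)   # out-of-bag R^2 (out-of-bag samples are described in the next question)
```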

What are in-bag and out-of-bag (oob) samples for a random forest?

According to bootstrap theory, if you draw a sample of size \(n\) from \(n\) elements with replacement, the sample will contain on average approximately \((1-e^{-1})n \approx 0.632n\) unique elements, because each element is left out of the sample with probability \((1-1/n)^n \approx e^{-1}\).

In a random forest, for each bootstrap subsample, the observations which are present in that subsample are called in-bag observations, or in-bag sample for that subsample; and the observations which are absent in that subsample are called out-of-bag observations, or out-of-bag sample for that subsample. A decision tree is constructed based on the in-bag sample only. But, after the tree is built, prediction can be done for both in-bag and out-of-bag samples using that tree.
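
A quick numpy simulation confirming the 0.632 figure:

```python
# Fraction of unique elements in a with-replacement sample of size n from n elements.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 200
fracs = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(reps)]
print(np.mean(fracs))   # about 0.632 = 1 - 1/e; the remaining ~36.8% are out-of-bag
```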

What is variable importance (VIMP) for a random forest?

Variable importance measures the increase (or decrease) in prediction error for the forest when a variable is randomly “noised-up” (Breiman 2001). Noising-up can be done by (a) randomly permuting the values in that variable’s column, (b) randomly assigning the observations to daughter nodes whenever that variable is selected to split a node, or (c) assigning the observations to the opposite daughter node whenever that variable is selected to split a node.

VIMP for variable \(j\) \(=\) prediction error from tree with noised-up \(j\) \(-\) prediction error from the original tree

A large positive VIMP shows that the prediction accuracy of the forest is substantially degraded when a variable is noised-up; thus a large VIMP indicates a potentially predictive variable. (Note: Generally, if a variable is not important, there will not be any significant change in prediction error, and so VIMP will be close to 0, positive or negative. But if a variable is important, noising it up will cause an increase in prediction error, and so VIMP will be a large positive number.)
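
A minimal sketch of permutation-based VIMP (option (a) above); for brevity it is computed on a held-out test set rather than on the per-tree out-of-bag samples that forest implementations typically use. It assumes scikit-learn, and the data are toy values in which only the first variable is predictive.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=300)     # only variable 0 is predictive

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline = mean_squared_error(y_te, rf.predict(X_te))

for j in range(X.shape[1]):
    X_noised = X_te.copy()
    X_noised[:, j] = rng.permutation(X_noised[:, j])  # "noise up" variable j
    vimp = mean_squared_error(y_te, rf.predict(X_noised)) - baseline
    print(f"variable {j}: VIMP = {vimp:.3f}")          # large and positive only for variable 0
```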

What is the competing risks forest building procedure?

Competing risks forests are built basically in the same way as standard random forests; the only differences are in the splitting rule and in the prediction measures computed in the leaves.

  • Draw \(B\) bootstrap samples from the learning data.

  • Grow a competing risks tree for each bootstrap sample. At each node of the tree, randomly select \(M (\leq p)\) candidate variables. The node is split using the candidate variable that maximizes a competing risks splitting rule (given below).

  • Grow each tree fully under the constraint that a terminal node should have no fewer than \(n_0 (>0)\) unique cases (that is, do not split a node any further if doing so would lead to a daughter node with fewer than \(n_0\) cases).

  • Then, based on each tree, compute the estimates of cause-specific cumulative incidence function, cause-specific cumulative hazard function, and cause-specific mortality.

  • Take the average of each estimate over the \(B\) trees to obtain its ensemble estimate (a toy sketch of this averaging step follows the list).
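
Competing risks forests are implemented, for example, in the R package randomForestSRC (Ishwaran et al.). As a language-neutral illustration of just the final ensembling step, here is a toy numpy sketch that averages per-tree CIF curves evaluated on a common time grid; the per-tree curves are random monotone placeholders, not output of real competing risks trees.

```python
import numpy as np

rng = np.random.default_rng(0)
time_grid = np.linspace(0.0, 15.0, 16)
B = 25
# Placeholder per-tree estimates of F_1(t): B monotone curves in [0, 1] on the grid.
per_tree_cif = np.sort(rng.uniform(0.0, 0.6, size=(B, time_grid.size)), axis=1)

ensemble_cif = per_tree_cif.mean(axis=0)                             # ensemble estimate of F_1(t)
ensemble_mortality = np.sum(ensemble_cif[:-1] * np.diff(time_grid))  # step approximation of M_1 at t_max
print(ensemble_cif[-1], ensemble_mortality)
```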

What are the splitting rules for competing risks forest?

Two splitting rules are available for growing a competing risks forest (Ishwaran et al. 2014).

    1. Log-rank splitting rule: compares the estimates of the cause-specific hazard function between the two groups.
    2. Gray’s splitting rule: compares the estimates of the cause-specific cumulative incidence function between the two groups.

References and other resources