Survival analysis

Authors: Rehan Ali Ansari, Prabhat Singhal

Time-to-event models are a valuable tool for organizations seeking to understand how long it takes for specific events to occur, such as equipment failure, customer churn, or employee turnover. These models allow organizations to analyze time-to-event data, identify the factors that influence the likelihood of the event, and estimate the risk of it occurring at any given time, so that strategies can be developed to mitigate that risk.

The primary goal of time-to-event models, also known as survival analysis, is to estimate the probability of an event occurring at a specific time, given the characteristics of the study population. In these models, the outcome of interest is often called “survival time,” which is the time from a specific starting point until an event’s occurrence or the end of follow-up.

These models focus on time-to-event data and account for censoring, which occurs when the event has not yet happened for some individuals by the end of the study. Unlike traditional machine learning models, which make point predictions or classifications from input data and typically ignore censoring, survival analysis models the probability of the event occurring at any given time while making use of censored observations.

Data

A typical survival analysis model expects the following pieces of information to estimate the likelihood of an event:

  • Time-to-event data: The duration of time from the start of the study until the event of interest occurs (such as death, employee churn, or machine failure).
  • Censored data: Data for individuals whose time-to-event is unknown, i.e., those who have not yet experienced the event by the end of the study period.
  • Additional covariates: Other variables that are thought to be related to the event of interest, such as demographic information (age, gender, etc.), behavioral information (recency, frequency, product usage, engagement), financial information and external factors (economic conditions, industry trends, and competition).

The selection of additional covariates depends on the characteristics of the data and the goals of the analysis. Some variables may matter more than others, and the choice should be guided by the context of the problem and data availability.
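As a minimal sketch of what such a dataset might look like, the snippet below builds a small pandas DataFrame with a duration column, an event indicator (0 for censored observations), and a couple of covariates. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical time-to-event dataset: column names are illustrative only.
df = pd.DataFrame({
    "tenure_months": [3, 12, 7, 24, 18],          # time from start of observation
    "churned":       [1,  0, 1,  0,  1],          # 1 = event observed, 0 = censored
    "age":           [34, 45, 29, 52, 41],        # demographic covariate
    "monthly_usage": [4.2, 9.1, 2.5, 7.8, 3.3],   # behavioral covariate
})

# Censored rows are the customers who had not churned by the end of the study.
print(df[df["churned"] == 0])
```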

Modeling survival

Survival models are used to model the probability that an event of interest will occur at a given time and can be used to predict future events. Two important concepts in survival analysis are the survival function and the hazard function.

The survival function, denoted as S(t), represents an individual’s probability of survival beyond a certain time t. It is defined as:

S(t) = P(T > t)

where T is the time of the event of interest.

The hazard function, denoted as h(t), represents the instantaneous risk of the event occurring at time t, given that the individual has survived up to that time. It is defined as:

h(t) = lim (Δt → 0) P(t ≤ T < t + Δt | T ≥ t) / Δt

The hazard function gives us the rate at which the event occurs at time t, given that the individual has not experienced the event yet.
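As an illustrative sketch (assuming the toy DataFrame above and the open-source lifelines library), the survival function can be estimated non-parametrically with the Kaplan-Meier estimator, and the cumulative hazard with the Nelson-Aalen estimator:

```python
from lifelines import KaplanMeierFitter, NelsonAalenFitter

# Kaplan-Meier estimate of the survival function S(t)
kmf = KaplanMeierFitter()
kmf.fit(durations=df["tenure_months"], event_observed=df["churned"])
print(kmf.survival_function_)   # estimated S(t) at each observed event time

# Nelson-Aalen estimate of the cumulative hazard H(t), the integral of h(t)
naf = NelsonAalenFitter()
naf.fit(durations=df["tenure_months"], event_observed=df["churned"])
print(naf.cumulative_hazard_)   # estimated H(t) at each observed event time
```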

The Cox proportional hazard (CPH) model, also known as the Cox model or the Cox regression model, is a popular statistical method for survival analysis. In the CPH model, the hazard function is assumed to have a particular form:

h(t|X) = h0(t) * exp(X’ * beta)

Where:

  • h0(t) is the baseline hazard function, representing the risk of the event occurring at time t when all covariates are zero (the baseline).
  • X is a vector of covariates (i.e., explanatory variables) that are thought to be related to the event of interest.
  • beta is a vector of parameters that represents the effect of the covariates on the hazard function.

The CPH model determines how a unit change in a covariate affects an individual’s hazard, and hence their survival probability. CPH is a semi-parametric model made up of two parts. The first is the baseline hazard function, the non-parametric part of the model; recall that the hazard function describes the risk of the event occurring over future periods. The parametric part is the covariates’ relationship with the baseline hazard function: each covariate enters through a time-invariant coefficient that scales the baseline hazard multiplicatively.
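A minimal sketch of fitting a CPH model with constant covariates, again assuming the toy DataFrame and the lifelines library (column names are illustrative):

```python
from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure_months", event_col="churned")

# Estimated coefficients (beta) and hazard ratios exp(beta) per covariate
cph.print_summary()

# Predicted survival curves S(t | X) for the first two individuals
print(cph.predict_survival_function(df.iloc[:2]))
```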

In a CPH model, a covariate can be either constant or time-varying, depending on how the data is collected. For example, if data is updated monthly, age would be treated as a time-varying covariate; if data is updated only yearly, it would be treated as constant. Both constant and time-varying covariates can be used to train a CPH model. Models with constant covariates can predict future risk scores, while time-varying models provide instantaneous risk scores. If an organization is interested in month-on-month risk scores and individual attributes remain constant until updated, the CPH model with time-invariant covariates is more practical. On the other hand, if the features change every month, the time-varying model is more effective.
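For the time-varying case, lifelines provides a separate fitter that expects data in a long (start/stop) format, with one row per individual per interval. The layout below is a hypothetical sketch with illustrative column names:

```python
import pandas as pd
from lifelines import CoxTimeVaryingFitter

# Long-format data: one row per customer per month, with interval boundaries.
tv_df = pd.DataFrame({
    "customer_id":   [1, 1, 1, 2, 2],
    "start":         [0, 1, 2, 0, 1],
    "stop":          [1, 2, 3, 1, 2],
    "monthly_usage": [5.0, 3.2, 1.1, 8.4, 8.9],  # covariate updated each month
    "churned":       [0, 0, 1, 0, 0],            # event flag on the final interval
})

ctv = CoxTimeVaryingFitter()
ctv.fit(tv_df, id_col="customer_id", event_col="churned",
        start_col="start", stop_col="stop")
ctv.print_summary()
```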

Application areas / How can we help?

Survival analysis has traditionally been used in clinical studies where the event of interest is the death of a patient, and the duration of time until this event occurs is the length of time the patient is under the observation of a doctor. For censored patients, survival analysis can be used to estimate expected survival over future months, helping doctors adjust treatments accordingly.

However, the applications of survival analysis extend well beyond clinical studies. It can also be applied in various organizational contexts, such as predicting an employee’s departure from a company, the failure of a machine, customers closing their bank accounts, or customers reducing activity on social or e-commerce platforms. By analyzing time-to-event data, organizations can identify the factors that influence the likelihood of these events and estimate the risk of each event occurring at any given time.

At Fractal, we employ various approaches to address the specific needs of each use case. Despite the variety of methods used, the fundamental principles of modeling remain consistent.

Conclusion

By utilizing time-to-event models, organizations can gain valuable insights into the factors influencing the likelihood of specific events. These insights can help organizations develop effective risk management strategies, optimize operations, better understand customer behavior, and enhance decision-making.
