A Crash Course in Survival Analysis: Customer Churn (Part III)
Joshua Cortez, a member of our Data Science Team, has put together a series of blogs on using survival analysis to predict customer churn. This is the third and final blog of this series.
Knowing how long customers stay is all well and good, but what if we want to know the factors that influence churn? What if we want to predict how long it will take a given customer to churn? Survival regression helps us do just that.
Here we’ll focus specifically on Cox regression, which uses Cox’s proportional hazards model to perform survival regression. The main insight the Cox regression model gives us comes from its coefficients. The exponential of each coefficient is a hazard ratio. What does the hazard ratio mean? It is a relative measure of the instantaneous rate of failure (here, churn). Don’t worry if that sounds confusing; it’s easier to understand with an example.
Let’s say we’ve fitted a Cox regression model to our example telco data set, and one of the variables is gender. This variable takes on two values: 1 for male and 0 for female. What does it mean if the hazard ratio of gender is 1.10? It means that at any time, whether it is 6 months or 12 months since signing up, males who are still subscribed churn at a rate 10% higher than females.
The hazard ratio is a relative measure, not an absolute one. So it should be read as how males fare in relation to females (or vice versa).
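To make the link between a coefficient and its hazard ratio concrete, here is a minimal Python sketch. The helper names and the coefficient value are made up for illustration; the coefficient is simply chosen so that it reproduces the 1.10 hazard ratio from the gender example.

```python
import math

def hazard_ratio(coef):
    """Convert a Cox regression coefficient into a hazard ratio."""
    return math.exp(coef)

def percent_change_in_hazard(coef):
    """Express the hazard ratio as a percent change relative to the reference group."""
    return (hazard_ratio(coef) - 1.0) * 100.0

# Hypothetical coefficient for gender (1 = male, 0 = female), chosen so that
# the hazard ratio comes out at 1.10 as in the example above.
gender_coef = math.log(1.10)

hr_male_vs_female = hazard_ratio(gender_coef)
# Flipping the reference group inverts the ratio.
hr_female_vs_male = 1.0 / hr_male_vs_female

print(f"male vs. female hazard ratio: {hr_male_vs_female:.2f}")
print(f"percent change in hazard: {percent_change_in_hazard(gender_coef):.0f}%")
print(f"female vs. male hazard ratio: {hr_female_vs_male:.2f}")
```

The last line shows why the ratio can be read in either direction: the female-versus-male ratio is just the reciprocal of the male-versus-female one.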
Proportional hazards assumption
Cox regression has a very important assumption, the proportional hazards assumption. The variables in this model should first be tested on whether or not they follow this assumption.
It means that the hazard ratio of each variable should be constant over time. For example, take the variable “dependents”, which is 1 if the customer has dependents (e.g. children) and 0 otherwise, and suppose its hazard ratio is 1.30. If the proportional hazards assumption holds, then at any time those with dependents are 30% more likely to churn than those without. It shouldn’t be the case that those with dependents are 30% more likely to churn from the first month to the third month, and then only 15% more likely from the third month to the sixth. The hazard ratio should be more or less the same across time.
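In symbols, this is what the assumption amounts to for a single 0/1 variable. The notation is the standard textbook form of the Cox model and does not come from the original post: h_0(t) is the baseline hazard, x is the variable, and β is its coefficient.

```latex
h(t \mid x) = h_0(t)\, e^{\beta x}
\quad\Longrightarrow\quad
\frac{h(t \mid x = 1)}{h(t \mid x = 0)}
  = \frac{h_0(t)\, e^{\beta}}{h_0(t)}
  = e^{\beta}
```

The baseline hazard cancels, so the ratio does not depend on t. In the dependents example above, e^β would be 1.30 at every point in time.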
It’s possible to test this assumption using a statistical test in the survival package in R.
The test says that only the following variables satisfy the proportional hazards assumption: PaperlessBilling, SeniorCitizen, Dependents, and Gender. Let’s now use these variables to fit a model.
Side note: It’s unfortunate that we’ll have to leave out the other variables, but there are other methods (stratified Cox regression, Cox regression with time-dependent covariates, pseudo-observations) that can incorporate variables that don’t follow the proportional hazards assumption. These, however, are outside the scope of these blog posts.
R’s Cox regression results
After filtering the variables, we can (finally) fit a model and interpret its results. Here are the results from calling the coxph function in R.
Here’s what the Cox regression model tells us:
- Gender isn’t a good indicator of churn. This confirms what we saw earlier when we compared the survival curves between female and male customers. They’re equally likely to churn.
- Senior citizenship, having dependents, and having paperless billing are indicative of churn. We can also quantify their effects.
  - Senior citizens are 30% more likely to churn than non-senior citizens.
  - Customers without dependents are twice as likely to churn as those with dependents. This also validates the survival curves earlier. The difference is that now we have a number to compare both groups.
  - Customers with paperless billing are 1.8 times more likely to churn than those without paperless billing.
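The effects in the bullets above can be compared on a common scale with a short Python sketch. The hazard ratios below are rounded values read off those bullets, not the actual coxph output, and the 0/1 coding of each variable is an assumption. In particular, Dependents is assumed to be coded 1 = has dependents, so "without dependents are twice as likely" becomes a ratio of about 0.5.

```python
import math

# Approximate hazard ratios read off the bullets above. These are rounded,
# illustrative values, not the actual coxph output. Dependents is assumed to
# be coded 1 = has dependents, so its ratio is roughly 1 / 2 = 0.5.
approx_hazard_ratios = {
    "SeniorCitizen": 1.3,
    "Dependents": 0.5,
    "PaperlessBilling": 1.8,
}

def describe(variable, hr):
    """Phrase a hazard ratio so that the higher-risk group comes first."""
    if hr >= 1.0:
        return f"{variable}=1 churns at {hr:.2f}x the rate of {variable}=0"
    # A ratio below 1 is easier to read after flipping the reference group.
    return f"{variable}=0 churns at {1.0 / hr:.2f}x the rate of {variable}=1"

def strength(hr):
    """Absolute log hazard ratio: puts ratios above and below 1 on one scale."""
    return abs(math.log(hr))

# Rank variables from strongest to weakest effect.
ranked = sorted(
    approx_hazard_ratios,
    key=lambda name: strength(approx_hazard_ratios[name]),
    reverse=True,
)
for name in ranked:
    print(describe(name, approx_hazard_ratios[name]))
```

Ranking by the absolute log ratio matters because a ratio of 0.5 is as strong an effect as a ratio of 2.0. Only the direction differs.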
Taking these insights to action
We saw three significant factors in churn. We can set aside the senior citizen factor, which leaves us with two to examine. Senior citizens might be more at risk of churning not because they’re willingly opting out of the subscription service, but because they’re passing away. Furthermore, the effect of the senior citizen factor (a 30% higher risk) is smaller than that of the other two anyway.
The paperless billing factor is surprising, since you’d expect customers to prefer the convenience and speed of receiving their bills online. There are various possible explanations, so the business should investigate whether its paperless billing processes are properly implemented. As the majority of customers are on paperless billing, this is an important issue.
We see from the graph below that there are more than twice as many customers without dependents as with dependents. At the same time, those without dependents are twice as likely to churn. This means that more effort should go into retaining this segment of customers.
These are just some of the ways that survival analysis can be used to address business problems. Barry Leventhal has recommended other use cases:
- Business Planning
  - Forecast monthly number of lapses and use to monitor current lapse rates.
- Lifetime Value (LTV) prediction
  - Derive LTV predictions by combining expected survival times with monthly revenues.
- Active customers
  - Predict each customer’s time to next purchase, and use to identify “active” vs. “inactive” customers.
- Campaign evaluation
  - Monitor effects of campaigns on survival rates.
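As a rough illustration of the LTV idea in the list above, the sketch below combines a survival curve with a flat monthly revenue. The survival probabilities, the revenue figure, and the function names are all hypothetical. It also assumes revenue is collected at the start of each month from customers who are still subscribed, which is only one of several reasonable conventions.

```python
def expected_months(survival_probs):
    """Restricted expected tenure in months.

    survival_probs[t] is the probability that a customer is still subscribed
    at the start of month t (so survival_probs[0] should be 1.0). Summing the
    curve gives the expected number of paid months within the horizon.
    """
    return sum(survival_probs)

def lifetime_value(survival_probs, monthly_revenue):
    """Expected revenue over the horizon: expected paid months times monthly revenue."""
    return expected_months(survival_probs) * monthly_revenue

# Hypothetical 6-month survival curve and monthly revenue, for illustration only.
survival_curve = [1.0, 0.9, 0.8, 0.75, 0.7, 0.65]
monthly_revenue = 50.0

print(f"expected paid months: {expected_months(survival_curve):.2f}")
print(f"6-month LTV: {lifetime_value(survival_curve, monthly_revenue):.2f}")
```

In practice the survival curve would come from a fitted model (for example, a per-segment or per-customer curve), and the horizon would be long enough to capture most of the expected tenure.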
With that, I hope this branch of statistics proves useful for a problem that you’re solving.
The following resources were extremely helpful in making these posts. Check them out if you want to learn more about survival analysis.
- Customer Churn Dataset
- Dayne Batten’s blog posts on survival analysis in the business setting
- Here’s a good resource on stratified Cox regression, and this course on survival analysis in general has great content.
- Barry Leventhal’s presentation with business insights
- The Survival Package in R
- The Lifelines Package in Python