What is Predictive Modeling?
Predictive modeling is one of the most sought-after skills today. It is used in almost every domain, from finance and retail to manufacturing, as a method of solving complex business problems and helping businesses grow. Examples include a predictive acquisition model and an optimization engine that solves a network problem.
What are the essential steps in a predictive modeling project?
It consists of the following steps –
- Establish business objective of a predictive model
- Pull Historical Data – Internal and External
- Select Observation and Performance Window
- Create newly derived variables
- Split Data into Training, Validation and Test Samples
- Clean Data – Treatment of Missing Values and Outliers
- Variable Reduction / Selection
- Variable Transformation
- Develop Model
- Validate Model
- Check Model Performance
- Deploy Model
- Monitor Model
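The data-splitting step in the list above can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `split_data`, the 60/20/20 proportions and the fixed seed are all assumptions made for the example.

```python
import random

def split_data(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test samples.

    The fractions and the seed are illustrative defaults, not recommendations.
    """
    if not 0 < train_frac + valid_frac < 1:
        raise ValueError("train_frac + valid_frac must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever is left becomes the test sample
    return train, valid, test

train, valid, test = split_data(range(100))
print(len(train), len(valid), len(test))
```

The model is fitted on the training sample, tuned on the validation sample, and the test sample is held back for the final performance check.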
Explain the problem statement of your project. What are the financial impacts of it?
Cover the objective or main goal of your predictive model. Compare the monetary benefits of using the predictive model against using no model. Also highlight the non-monetary benefits, if any.
Difference between Linear and Logistic Regression?
Two main differences are as follows –
- Linear regression requires the dependent variable to be continuous, i.e. numeric values (no categories or groups). Binary logistic regression requires the dependent variable to be binary, with two categories only (0/1). Multinomial or ordinal logistic regression can have a dependent variable with more than two categories (unordered and ordered, respectively).
- Linear regression is based on least squares estimation, which says the regression coefficients should be chosen so that they minimize the sum of the squared distances between each observed response and its fitted value. Logistic regression is based on maximum likelihood estimation, which says the coefficients should be chosen so that they maximize the likelihood, i.e. the probability of the observed Y given X.
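To make the two estimation methods concrete, here is a minimal single-variable sketch. The function names, the toy data, the learning rate and the iteration count are assumptions for illustration. Real libraries typically fit logistic regression with faster optimizers than the plain gradient ascent used here.

```python
import math

def fit_linear(x, y):
    """Least squares for y = a + b*x: minimizes the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # closed-form slope
    a = mean_y - b * mean_x  # closed-form intercept
    return a, b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, lr=0.1, n_iter=5000):
    """Maximum likelihood for P(y=1|x) = sigmoid(a + b*x) via gradient ascent."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        # Gradient of the average log-likelihood with respect to a and b.
        errors = [yi - sigmoid(a + b * xi) for xi, yi in zip(x, y)]
        a += lr * sum(errors) / n
        b += lr * sum(e * xi for e, xi in zip(errors, x)) / n
    return a, b

# Continuous target: closed-form least squares.
lin_a, lin_b = fit_linear([0, 1, 2, 3, 4, 5], [1, 3, 5, 7, 9, 11])
print("linear intercept, slope:", lin_a, lin_b)

# Binary target (0/1): iterative maximum likelihood.
log_a, log_b = fit_logistic([0, 1, 2, 3, 4, 5, 6, 7], [0, 0, 0, 1, 0, 1, 1, 1])
print("P(y=1 | x=7):", sigmoid(log_a + log_b * 7))
```

Least squares has a closed-form solution, while the logistic log-likelihood has none and must be maximized iteratively.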
How to treat outliers?
There are several methods to treat outliers –
- Percentile Capping
- Box-Plot Method
- Mean plus minus 3 Standard Deviation
- Weight of Evidence
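The first three methods in the list can be sketched as capping functions. This is a rough illustration under stated assumptions: the 1st/99th percentile cut-offs, the 1.5 × IQR fence and the helper names are choices made for the example, and Weight of Evidence is left out because it is a binning-based transformation rather than a simple cap.

```python
import statistics

def percentile(values, p):
    """Percentile with linear interpolation between closest ranks (p in 0..100)."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)

def cap(values, lower, upper):
    """Clip every value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def percentile_cap(values, low_p=1, high_p=99):
    """Percentile capping: clip at the chosen lower and upper percentiles."""
    return cap(values, percentile(values, low_p), percentile(values, high_p))

def boxplot_cap(values, k=1.5):
    """Box-plot method: clip at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    return cap(values, q1 - k * iqr, q3 + k * iqr)

def three_sigma_cap(values):
    """Clip at mean minus/plus 3 sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return cap(values, mean - 3 * sd, mean + 3 * sd)

data = list(range(1, 21)) + [500]  # 500 is an obvious outlier
print(max(percentile_cap(data)), max(boxplot_cap(data)), max(three_sigma_cap(data)))
```

Note that the mean and standard deviation are themselves pulled upward by the outlier, so the 3-standard-deviation rule caps less aggressively than the box-plot rule on this data.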
What is multicollinearity and how to deal with it?
Multicollinearity means high correlation between independent variables. The absence of multicollinearity is one of the assumptions of linear and logistic regression. It can be identified by looking at the VIF (variance inflation factor) score of each variable. As a rule of thumb, VIF > 2.5 implies a moderate collinearity issue and VIF > 5 is considered high collinearity.
It can be handled by an iterative process: remove the variable with the highest VIF, then recompute the VIF of the remaining variables. If any remaining variable has VIF > 2.5, repeat the step until every VIF is <= 2.5.
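The iterative procedure can be sketched with NumPy as below. This is a minimal illustration, assuming numeric columns: the function names and the simulated data are hypothetical, and the VIFs are computed as the diagonal of the inverse correlation matrix, which is equivalent to 1 / (1 - R²) from regressing each variable on the others. It will fail on perfectly collinear columns because the correlation matrix is then singular.

```python
import numpy as np

def vif_scores(X):
    """VIF of each column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=2.5):
    """Repeatedly drop the column with the highest VIF until all are <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = vif_scores(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)  # remove the worst offender, then re-check
        names.pop(worst)
    return X, names

# Simulated data: column "c" is almost a copy of column "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([a, b, c])

print("VIF before:", vif_scores(X))
X_kept, kept = drop_high_vif(X, ["a", "b", "c"])
print("kept:", kept, "VIF after:", vif_scores(X_kept))
```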
Explain collinearity between continuous and categorical variables?
Collinearity between categorical and continuous variables is very common. The choice of reference category for the dummy variables affects multicollinearity, so changing the reference category can reduce it. Pick the reference category with the highest proportion of cases.
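Choosing the reference category can be sketched as follows. The function name `dummy_encode` and the `is_` column prefix are made up for this example: the most frequent category becomes the reference (it gets no dummy column), and every other category gets its own 0/1 column.

```python
from collections import Counter

def dummy_encode(categories, reference=None):
    """Dummy-code a categorical variable, defaulting the reference
    category to the one with the highest proportion of cases."""
    counts = Counter(categories)
    if reference is None:
        reference = counts.most_common(1)[0][0]
    # One 0/1 column per non-reference level.
    levels = sorted(level for level in counts if level != reference)
    rows = [{f"is_{level}": int(c == level) for level in levels} for c in categories]
    return reference, rows

reference, rows = dummy_encode(["A", "B", "A", "C", "A", "B"])
print("reference:", reference)
print(rows[:2])
```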
What are the applications of predictive modeling?
Predictive modeling is mostly used in the following areas –
- Acquisition – Cross Sell / Up Sell
- Retention – Predictive Attrition Model
- Customer Lifetime Value Model
- Next Best Offer
- Market Mix Model
- Pricing Model
- Campaign Response Model
- Probability of customers defaulting on a loan
- Segment customers based on their homogeneous attributes
- Demand Forecasting
- Usage Simulation
- Underwriting
- Optimization – Optimize Network
Is VIF a correct method to compute collinearity between continuous and categorical variables?
VIF is not a correct method in this case. VIFs should only be run for continuous variables. A t-test can be used to check collinearity between a continuous variable and a dummy variable.
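A t-test of this kind compares the mean of the continuous variable across the two groups defined by the dummy. The sketch below uses Welch's version of the t statistic, which is an assumption since the text does not say which variant is meant. The function name and the data are hypothetical, and only the statistic is computed, not the p-value.

```python
import math
import statistics

def welch_t(continuous, dummy):
    """Welch's t statistic for the continuous variable split by a 0/1 dummy."""
    group1 = [v for v, d in zip(continuous, dummy) if d == 1]
    group0 = [v for v, d in zip(continuous, dummy) if d == 0]
    var1 = statistics.variance(group1)  # sample variances
    var0 = statistics.variance(group0)
    se = math.sqrt(var1 / len(group1) + var0 / len(group0))
    return (statistics.mean(group1) - statistics.mean(group0)) / se

income = [20, 22, 25, 24, 21, 40, 42, 45, 44, 41]
is_urban = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print("t statistic:", welch_t(income, is_urban))
```

A large absolute t statistic indicates that the continuous variable differs strongly between the two groups, i.e. the two variables carry overlapping information.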
Difference between Factor Analysis and PCA?
The three main differences between these two techniques are as follows –
- In Principal Components Analysis, the components are calculated as linear combinations of the original variables. In Factor Analysis, the original variables are defined as linear combinations of the factors.
- Principal Components Analysis is used as a variable reduction technique whereas Factor Analysis is used to understand what constructs underlie the data.
- In Principal Components Analysis, the goal is to explain as much of the total variance in the variables as possible. The goal in Factor Analysis is to explain the covariances or correlations between the variables.
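The PCA side of this comparison can be sketched with NumPy: the components are linear combinations of the centered original variables, ordered by the share of total variance they explain. The function name and the simulated data are assumptions for illustration. Factor Analysis is not shown, since it requires fitting a latent-factor model rather than a single eigendecomposition.

```python
import numpy as np

def pca(X):
    """PCA via eigendecomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]       # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs             # components = linear combinations of the variables
    explained = eigvals / eigvals.sum()     # share of total variance per component
    return scores, eigvecs, explained

# Simulated data: the second variable is roughly twice the first.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = 2 * x1 + 0.1 * rng.normal(size=300)
scores, loadings, explained = pca(np.column_stack([x1, x2]))
print("explained variance ratio:", explained)
```

When one component explains almost all of the variance, as it does here, the remaining components can be dropped, which is how PCA serves as a variable reduction technique.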