SAS Predictive Modelling Interview Questions and Answers

1. What is Predictive Modelling?

Predictive modeling uses historical data and statistical or machine learning techniques to estimate the likelihood of future outcomes. It is one of the most sought-after skills today and is used in almost every domain, from finance and retail to manufacturing. Businesses treat it as a method for solving complex problems and driving growth, e.g. a predictive acquisition model or an optimization engine for network problems.

2. What are the essential steps in a predictive modeling project?

It consists of the following steps –

  • Establish business objective of a predictive model
  • Pull Historical Data – Internal and External
  • Select Observation and Performance Window
  • Create newly derived variables
  • Split Data into Training, Validation and Test Samples
  • Clean Data – Treatment of Missing Values and Outliers
  • Variable Reduction / Selection
  • Variable Transformation
  • Develop Model
  • Validate Model
  • Check Model Performance
  • Deploy Model
  • Monitor Model
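
The data-splitting step above can be sketched in a few lines of Python. This is only an illustration: the function name `split_data`, the 60/20/20 proportions, and the fixed seed are assumptions, not part of any prescribed process.

```python
import random

def split_data(rows, train=0.6, valid=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test samples.

    The 60/20/20 default is an assumed convention; the remainder after
    the training and validation shares goes to the test sample.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )

# Hypothetical usage: 100 record ids stand in for the historical data.
train_set, valid_set, test_set = split_data(range(100))
```

Fixing the seed keeps the three samples stable across runs, which makes later validation results comparable.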

3. Explain the problem statement of your project. What are the financial impacts of it?

Cover the objective or main goal of your predictive model. Compare the monetary benefits of using the predictive model against using no model. Also highlight the non-monetary benefits (if any).

4. Difference between Linear and Logistic Regression?

Two main differences are as follows –

Linear regression requires the dependent variable to be continuous, i.e. numeric values (no categories or groups). Binary logistic regression requires the dependent variable to be binary, with two categories only (0/1). Multinomial or ordinal logistic regression can have dependent variables with more than two categories.

Linear regression is based on least squares estimation: the regression coefficients are chosen to minimize the sum of the squared distances between each observed response and its fitted value. Logistic regression is based on Maximum Likelihood Estimation: the coefficients are chosen to maximize the likelihood, i.e. the probability of the observed Y given X.
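
The two estimation principles can be contrasted with a small NumPy sketch on synthetic data. Everything here is an assumption made for illustration: the simulated coefficients, the variable names, and the use of Newton's method to maximize the log-likelihood (statistical packages use their own optimizers).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])  # intercept plus one predictor

# Linear regression: coefficients that minimize the sum of squared residuals.
y_cont = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=200)
beta_ols, *_ = np.linalg.lstsq(X, y_cont, rcond=None)

# Logistic regression: coefficients that maximize the Bernoulli log-likelihood.
p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y_bin = (rng.uniform(size=200) < p_true).astype(float)

def log_likelihood(beta):
    """Log-likelihood of the binary responses under a logit link."""
    z = X @ beta
    return np.sum(y_bin * z - np.log1p(np.exp(z)))

beta_mle = np.zeros(2)
for _ in range(25):  # Newton's method; a handful of steps is usually enough
    p = 1 / (1 + np.exp(-(X @ beta_mle)))
    gradient = X.T @ (y_bin - p)
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    beta_mle += np.linalg.solve(hessian, gradient)
```

The least squares fit has a closed-form solution, while the likelihood has to be maximized iteratively, which is why the logistic fit needs a loop.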

5. How to treat outliers?

There are several methods to treat outliers –

  • Percentile Capping
  • Box-Plot Method
  • Mean plus minus 3 Standard Deviation
  • Weight of Evidence
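
The first three methods can be sketched with Python's standard library. The function names, the 1st/99th percentile cut-offs, and the 1.5 x IQR multiplier are common conventions assumed here for illustration; Weight of Evidence binning is not shown.

```python
import statistics

def percentile_cap(values, lower_pct=1, upper_pct=99):
    """Cap values below/above the chosen percentiles (assumed 1st and 99th)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]

def iqr_bounds(values, k=1.5):
    """Box-plot rule: points outside Q1 - k*IQR and Q3 + k*IQR are outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def three_sigma_bounds(values):
    """Mean plus/minus three sample standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return mean - 3 * sd, mean + 3 * sd

# Hypothetical data: 1..99 plus one extreme value.
data = list(range(1, 100)) + [1000]
capped = percentile_cap(data)
box_lo, box_hi = iqr_bounds(data)
sd_lo, sd_hi = three_sigma_bounds(data)
```

Note that the extreme value inflates the standard deviation itself, so the three-sigma rule tends to be less sensitive than the box-plot rule on heavily skewed data.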

6. What is multicollinearity and how to deal with it?

Multicollinearity means a high correlation between independent variables. The absence of multicollinearity is one of the assumptions of linear and logistic regression. It can be identified by looking at the VIF (Variance Inflation Factor) score of each variable. VIF > 2.5 implies moderate collinearity; VIF > 5 is considered high collinearity.

It can be handled by an iterative process: remove the variable with the highest VIF, then recompute the VIF of the remaining variables. If any remaining variable still has VIF > 2.5, repeat the step until every VIF is <= 2.5.
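
A minimal NumPy sketch of this iterative removal follows. It relies on the identity that the VIFs are the diagonal of the inverse correlation matrix; the function names and the synthetic variables `a`, `b`, `c` are hypothetical.

```python
import numpy as np

def vif_scores(X):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=2.5):
    """Drop the variable with the highest VIF until all VIFs are <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vif = vif_scores(X)
        worst = int(np.argmax(vif))
        if vif[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)  # remove one variable, then recompute
        names.pop(worst)
    return X, names

# Synthetic example: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
kept_X, kept = drop_high_vif(np.column_stack([a, b, c]), ["a", "b", "c"])
```

Only one variable is removed per pass because dropping a single collinear variable usually lowers the VIF of the others it was correlated with.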

7. Explain collinearity between continuous and categorical variables?

Collinearity between categorical and continuous variables is very common. The choice of reference category for dummy variables affects multicollinearity, so changing the reference category can reduce it. Pick the reference category with the highest proportion of cases.
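
As a rough illustration of that advice, the sketch below builds dummy variables with the most frequent category as the reference. The function name `dummy_encode` and the `region` data are made up for the example.

```python
from collections import Counter

def dummy_encode(values):
    """Create 0/1 dummy columns, leaving out the most frequent category as reference."""
    counts = Counter(values)
    reference = counts.most_common(1)[0][0]  # category with the highest share of cases
    levels = sorted(level for level in counts if level != reference)
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return reference, levels, rows

# Hypothetical categorical variable.
region = ["north", "south", "north", "east", "north", "south"]
reference, levels, rows = dummy_encode(region)
```

Here "north" becomes the reference, so each row has one column for "east" and one for "south", and a "north" record is encoded as all zeros.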

8. What are the applications of predictive modeling?

Predictive modeling is mostly used in the following areas –

  • Acquisition – Cross Sell / Up Sell
  • Retention – Predictive Attrition Model
  • Customer Lifetime Value Model
  • Next Best Offer
  • Market Mix Model
  • Pricing Model
  • Campaign Response Model
  • Probability of customers defaulting on a loan
  • Segmentation of customers based on their homogeneous attributes
  • Demand Forecasting
  • Usage Simulation
  • Underwriting
  • Optimization – Optimize Network

9. Is VIF a correct method to compute collinearity between continuous and categorical variables?

VIF is not a correct method in this case. It should only be run for continuous variables. A t-test can be used to check collinearity between continuous and dummy variables.
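
One way to sketch the t-test check is to compare the mean of the continuous variable across the two dummy groups. The Welch (unequal variance) form of the statistic, the function name, and the `income`/`owner` data are all assumptions for illustration; a large absolute t value suggests the two variables are related.

```python
import math
import statistics

def welch_t(values, dummy):
    """Welch t statistic for a continuous variable split by a 0/1 dummy."""
    g1 = [v for v, d in zip(values, dummy) if d == 1]
    g0 = [v for v, d in zip(values, dummy) if d == 0]
    se = math.sqrt(
        statistics.variance(g1) / len(g1) + statistics.variance(g0) / len(g0)
    )
    return (statistics.fmean(g1) - statistics.fmean(g0)) / se

# Hypothetical data: income differs sharply between the two dummy groups.
income = [30, 32, 31, 29, 33, 60, 62, 61, 59, 63]
owner = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
t_stat = welch_t(income, owner)
```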

10. Difference between Factor Analysis and PCA?

The three main differences between these two techniques are as follows –

  • In Principal Components Analysis, the components are calculated as linear combinations of the original variables. In Factor Analysis, the original variables are defined as linear combinations of the factors.
  • Principal Components Analysis is used as a variable reduction technique whereas Factor Analysis is used to understand what constructs underlie the data.
  • In Principal Components Analysis, the goal is to explain as much of the total variance in the variables as possible. The goal in Factor Analysis is to explain the covariances or correlations between the variables.
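
The PCA side of this comparison can be sketched with NumPy: components are linear combinations of the standardized variables, ordered by the share of total variance they explain. The synthetic data and variable names are assumptions, and Factor Analysis is not shown.

```python
import numpy as np

# Synthetic data: two variables driven by one latent signal, plus one unrelated variable.
rng = np.random.default_rng(2)
latent = rng.normal(size=300)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=300),
    latent + 0.1 * rng.normal(size=300),
    rng.normal(size=300),
])

# Standardize, then eigen-decompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]  # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                # each component is a linear combination of the variables
explained = eigvals / eigvals.sum() # share of total variance per component
```

Keeping only the first few columns of `scores` is what makes PCA a variable reduction technique.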