Identify Target Variable

Based on the data exploration, our target variable appears to be loan_status. A quick look at its unique values and their proportions confirms this.

Based on domain knowledge, we will classify loans with the following loan_status values as being in default (or 0):

Status:Charged Off

All the other values will be classified as good (or 1).

Let us now split our data into the following sets: training (80%) and test (20%). We will perform Repeated Stratified k-Fold testing on the training set for a preliminary evaluation of our model, while the test set will remain untouched until the final model evaluation. This approach follows best practice for model evaluation.

Image 1 above shows us that our data, as expected, is heavily skewed towards good loans. Accordingly, in addition to random shuffled sampling, we will also stratify the train/test split so that the distribution of good and bad loans in the test set is the same as that in the pre-split data. This is achieved through the train_test_split function's stratify parameter.

Splitting our data before any data cleaning or missing value imputation prevents data leakage from the test set to the training set and results in more accurate model evaluation. Refer to my previous article for further details.

A code snippet for the work performed so far follows:

Next, we will create dummy variables of the four final categorical variables and update the test dataset through all the functions applied so far to the training dataset.

Note a couple of points regarding the way we create dummy variables:

- We will use a particular naming convention for all variables: original variable name, colon, category name.
- Generally speaking, in order to avoid multicollinearity, one of the dummy variables is dropped through the drop_first parameter of pd.get_dummies.
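The target encoding, the stratified 80/20 split and the dummy-variable naming convention can be sketched roughly as follows. This is not the original snippet from the article: the toy data, the column names (grade, home_ownership, annual_inc, good_bad), the default_statuses list and the add_dummies helper are all assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the loan data; every value here is made up.
rng = np.random.default_rng(0)
n = 200
loan_data = pd.DataFrame({
    "loan_status": rng.choice(
        ["Fully Paid", "Current", "Charged Off", "Default"],
        size=n, p=[0.5, 0.35, 0.1, 0.05],
    ),
    "grade": rng.choice(list("ABC"), size=n),
    "home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], size=n),
    "annual_inc": rng.normal(60_000, 15_000, size=n).round(),
})

# Assumed list of "bad" statuses, for illustration only.
default_statuses = ["Charged Off", "Default"]

# Target: 0 = default (bad), 1 = good.
loan_data["good_bad"] = np.where(
    loan_data["loan_status"].isin(default_statuses), 0, 1
)

X = loan_data.drop(columns=["loan_status", "good_bad"])
y = loan_data["good_bad"]

# Shuffled 80/20 split, stratified on the target so that both sets
# keep the same proportion of good and bad loans.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


def add_dummies(df, columns):
    """Append dummy columns named '<variable>:<category>' (hypothetical helper)."""
    dummies = [
        pd.get_dummies(df[col], prefix=col, prefix_sep=":") for col in columns
    ]
    return pd.concat([df] + dummies, axis=1)


categorical_cols = ["grade", "home_ownership"]  # hypothetical columns
X_train = add_dummies(X_train, categorical_cols)
X_test = add_dummies(X_test, categorical_cols)

# Give the test set exactly the training columns, filling any dummy
# category that is absent from the test data with 0.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```

Passing drop_first=True to pd.get_dummies would drop the first category of each variable. The sketch leaves the parameter at its default of False, so every category keeps its own column.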
Since our objective here is to predict the future probability of default, having such features in our model will be counterintuitive, as these will not be observed until the default event has occurred.
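As a minimal sketch of this idea, features that only become known after a default can be removed before modelling. The column names below (recoveries, collection_recovery_fee) are hypothetical examples of such post-event features, not a list taken from the article.

```python
import pandas as pd

# Toy data; the values and column names are illustrative only.
loans = pd.DataFrame({
    "loan_amnt": [10000, 5000, 12000],
    "int_rate": [11.5, 7.9, 15.2],
    "recoveries": [0.0, 0.0, 1350.0],              # known only after default
    "collection_recovery_fee": [0.0, 0.0, 13.5],   # known only after default
})

# Hypothetical list of features observed only after the default event.
post_event_cols = ["recoveries", "collection_recovery_fee"]

# Keep only the features available at the time of prediction.
features = loans.drop(columns=post_event_cols)
```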