Logistic Regression
By Chi Kit Yeung in Statistics Python Machine Learning
August 17, 2024
Introduction
Logistic regression is a form of supervised machine learning in which we predict a categorical dependent variable using one or more independent variables. In its basic form, the model predicts the probability of a binary outcome (1 or 0, True or False).
Brainstorming some problems that may be answered using logistic regression:
- Will the customer buy anything?
- Will it rain?
- Will this student get admitted?
- Will this person land a data scientist job?
I realized that the example problems all expect a ‘Yes’ or ‘No’ answer but the prediction is more of a ‘True’ or ‘False’ categorization.
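To make the link between a predicted probability and a True/False answer concrete, here is a minimal sketch of the logistic (sigmoid) function, which turns a linear combination of the inputs into a probability. The coefficient values and the 0.5 cut-off are made up for illustration and do not come from any fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration
intercept = -1.5
slope = 0.8

def predict_probability(x):
    """Probability that the outcome is 1 for a single input value x."""
    return sigmoid(intercept + slope * x)

def predict_label(x, threshold=0.5):
    """Turn the probability into a True/False answer."""
    return predict_probability(x) >= threshold
```

With these made-up numbers, an input of 3.0 gives a probability above 0.5 (True), while an input of 1.0 gives a probability below 0.5 (False).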
Assumptions
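Logistic regression shares some of the linear regression assumptions, such as independent observations and no multicollinearity among the independent variables. It drops the first 'linearity' assumption between the dependent and independent variables; instead, it assumes that the log-odds of the outcome are linear in the independent variables. It also does not require normally distributed errors with constant variance.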
Maximum Likelihood Estimation (MLE)
MLE is the method used to estimate the parameter values of the logistic regression model. In layman's terms, it is the method used to find the logistic function that best explains the data. The method iteratively adjusts the model's coefficients, evaluating a likelihood function at each step, until the likelihood of the observed data can no longer be improved.
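The iterative search can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: it uses plain gradient ascent rather than the Newton-type optimizer that statsmodels uses by default, and the data, learning rate, and iteration count are all made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Sum of log P(observed y | x) over all observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logistic(xs, ys, learning_rate=0.05, n_iter=10000):
    """Estimate intercept b0 and slope b1 by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Residuals (y - p) drive the gradient of the log-likelihood
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(residuals)
        b1 += learning_rate * sum(r * x for r, x in zip(residuals, xs))
    return b0, b1

# Toy data (made up): larger x makes y = 1 more likely, with some overlap
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
ll_start = log_likelihood(0.0, 0.0, xs, ys)   # all coefficients at zero
ll_fitted = log_likelihood(b0, b1, xs, ys)    # after the search
```

The log-likelihood after fitting is higher (closer to zero) than at the starting point, which is what "maximum likelihood" refers to.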
Log-likelihood
- The log-likelihood value is almost always negative, since it is a sum of logarithms of probabilities.
- The higher (closer to zero) the value, the better the fit.
LL-Null
This is the Log-likelihood value of a model with no independent variables. This value is useful as a comparison tool against the calculated Log-likelihood value to see if the model has any explanatory power.
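As a rough sketch of where LL-Null comes from: with no independent variables, the best the model can do is predict the same probability for every observation, namely the observed share of 1s. The outcomes below are made up for illustration.

```python
import math

def null_log_likelihood(ys):
    """Log-likelihood of an intercept-only model for 0/1 outcomes.

    Every observation gets the same predicted probability: the share of 1s.
    Assumes both classes are present, otherwise log(0) would be needed.
    """
    n = len(ys)
    n_ones = sum(ys)
    p = n_ones / n
    return n_ones * math.log(p) + (n - n_ones) * math.log(1.0 - p)

ys = [1, 1, 1, 0]   # made-up outcomes: three 1s and one 0
ll_null = null_log_likelihood(ys)
```

A fitted model with independent variables is then judged by how far its log-likelihood rises above this baseline.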
LLR p-value (Log Likelihood Ratio)
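This value is based on the log-likelihood of the model and the LL-Null. It tests whether the model fits the data significantly better than a model with no independent variables. The lower the p-value, the stronger the evidence that the model has explanatory power; a value below 0.05 is a common threshold.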
Pseudo R-squared
Unlike linear regression, logistic regression has no statistic that is directly equivalent to R-squared. Several pseudo R-squared measures exist as substitutes; the one reported in the statsmodels summary is McFadden's. As a rule of thumb, a good pseudo R-squared value is between 0.2 and 0.4.
Performing the Logistic Regression
Using the statsmodels Package
Importing the packages
import pandas as pd
import statsmodels.api as sm
import numpy as np
Load the data
data = pd.read_csv('so_and_so.csv')
Declare the dependent and independent variables
y = data['y']
x1 = data['indep_A']
The Regression Itself
x = sm.add_constant(x1)
model = sm.Logit(y, x)
results_log = model.fit()
Interpreting and evaluating the model by generating the summary table.
>>> results_log.summary()
"""
                           Logit Regression Results
==============================================================================
Dep. Variable:                       y   No. Observations:                 518
Model:                           Logit   Df Residuals:                     516
Method:                            MLE   Df Model:                           1
Date:                 Sat, 17 Aug 2024   Pseudo R-squ.:                 0.2121
Time:                         18:44:43   Log-Likelihood:               -282.89
converged:                        True   LL-Null:                      -359.05
Covariance Type:             nonrobust   LLR p-value:                5.387e-35
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.7001      0.192     -8.863      0.000      -2.076      -1.324
indep_A        0.0051      0.001      9.159      0.000       0.004       0.006
==============================================================================
"""
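The summary statistics in the top-right block are tied together by simple formulas, which can be checked by hand. The sketch below recomputes the pseudo R-squared and the LLR p-value from the Log-Likelihood and LL-Null values printed above. It assumes McFadden's definition of the pseudo R-squared, and it uses a shortcut for the chi-squared tail probability that only holds for one degree of freedom (for other cases, something like `scipy.stats.chi2.sf` would be the usual choice).

```python
import math

# Values copied from the summary table above
log_likelihood = -282.89
ll_null = -359.05
df_model = 1

# McFadden's pseudo R-squared: 1 - LL / LL-Null
pseudo_r2 = 1.0 - log_likelihood / ll_null

# Likelihood-ratio statistic: twice the gain in log-likelihood over the null model
llr = 2.0 * (log_likelihood - ll_null)

# p-value from a chi-squared distribution with df_model degrees of freedom.
# For exactly 1 degree of freedom the tail probability reduces to erfc(sqrt(x / 2)).
if df_model != 1:
    raise ValueError("the erfc shortcut only holds for 1 degree of freedom")
llr_p_value = math.erfc(math.sqrt(llr / 2.0))

print(round(pseudo_r2, 4))
print(llr_p_value)
```

The recomputed pseudo R-squared rounds to the 0.2121 shown in the table, and the p-value lands on the same order of magnitude as the reported 5.387e-35 (small differences come from the rounded inputs).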
Create a model with multiple independent variables.
x1 = data[['indep_A', 'indep_B', 'indep_C', 'indep_D', 'indep_E']]
x = sm.add_constant(x1)
multi_model = sm.Logit(y, x)
multi_results = multi_model.fit()
>>> multi_results.summary()
"""
                           Logit Regression Results
==============================================================================
Dep. Variable:                       y   No. Observations:                 518
Model:                           Logit   Df Residuals:                     512
Method:                            MLE   Df Model:                           5
Date:                 Sat, 17 Aug 2024   Pseudo R-squ.:                 0.5143
Time:                         18:46:53   Log-Likelihood:               -174.39
converged:                        True   LL-Null:                      -359.05
Covariance Type:             nonrobust   LLR p-value:                1.211e-77
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.0211      0.311     -0.068      0.946      -0.631       0.589
indep_A           0.0070      0.001      9.381      0.000       0.006       0.008
indep_B          -0.8001      0.089     -8.943      0.000      -0.975      -0.625
indep_C          -1.8322      0.330     -5.556      0.000      -2.478      -1.186
indep_D           2.3585      1.088      2.169      0.030       0.227       4.490
indep_E           1.5363      0.501      3.067      0.002       0.554       2.518
=================================================================================
"""
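The coefficients in the table are on the log-odds scale, which makes them hard to read directly. One common way to interpret them is to exponentiate each one into an odds ratio. The sketch below does this for the coefficients printed above; it is an interpretation aid, not part of the statsmodels output.

```python
import math

# Coefficients copied from the multi-variable summary table above
coefficients = {
    "indep_A": 0.0070,
    "indep_B": -0.8001,
    "indep_C": -1.8322,
    "indep_D": 2.3585,
    "indep_E": 1.5363,
}

# exp(coef) is the factor by which the odds of y = 1 change when that
# variable increases by one unit, holding the other variables fixed
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

An odds ratio above 1 (positive coefficient) means the variable raises the odds of the outcome, and a ratio below 1 (negative coefficient) means it lowers them.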
Generating Predictions
Predictions can be generated using the predict method.
multi_results.predict(x)
Here, x is a DataFrame with the same features used to train the model, in the same order. Be sure to also add a constant column to the beginning of the DataFrame using statsmodels' add_constant function. The predictions are returned as a Series of float values that represent probabilities (e.g. something like 0.787412). The results can be rounded to obtain the predictions as 0 or 1.
round(multi_results.predict(x))
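Rounding works as a quick cut-off at 0.5, but an explicit threshold makes the decision rule visible and easy to change. The sketch below uses made-up probabilities in place of the real predict output, and `to_labels` is a hypothetical helper name.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities to 0/1 class labels."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Made-up probabilities standing in for the output of predict
probabilities = [0.787412, 0.5, 0.231]

labels = to_labels(probabilities)             # explicit cut-off at 0.5
rounded = [round(p) for p in probabilities]   # Python's built-in rounding
```

The two approaches agree except at exactly 0.5: Python's built-in round sends 0.5 to 0, while the explicit `>=` rule sends it to 1.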
Using scikit-learn
This was actually not included in the online course so I searched for the method independently.
# Importing the packages
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Define the dependent and independent variables
y = data['dependent']
x = data[['indep_A', 'indep_B', 'indep_C', 'indep_D', 'indep_E']]
# Scale and Perform the Regression
log = LogisticRegression()
scaler = StandardScaler()
scaler.fit(x)
x_scaled = scaler.transform(x)
log.fit(x_scaled, y)
# Generate Predictions
log.predict(x_scaled)
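For intuition about the scaling step: with its default settings, StandardScaler centres each column on its mean and divides by its standard deviation. The pure Python sketch below shows the same arithmetic for a single made-up column; `standardize` is a hypothetical helper, not a scikit-learn function.

```python
import statistics

def standardize(values):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]

column = [10.0, 20.0, 30.0, 40.0]   # made-up feature values
scaled = standardize(column)
```

Scaling matters here because LogisticRegression applies regularization by default, and the penalty is sensitive to the scale of each feature.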
Checking the Accuracy
Confusion Matrix
def confusion_matrix(data, actual_values, model):
    """
    Confusion matrix
    Parameters
    ----------
    data: data frame or array
        data is a data frame formatted in the same way as your input data (without the actual values)
        e.g. const, var1, var2, etc. Order is very important!
    actual_values: data frame or array
        These are the actual values from the test_data
        In the case of a logistic regression, it should be a single column with 0s and 1s
    model: a LogitResults object
        this is the variable where you have the fitted model
        e.g. results_log in this course
    ----------
    """
    # Predict the values using the Logit model
    pred_values = model.predict(data)
    # Specify the bins
    bins = np.array([0, 0.5, 1])
    # Create a 2D histogram: values between 0 and 0.5 are counted as 0,
    # values between 0.5 and 1 are counted as 1
    # Rows are the actual values, columns are the predicted values
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Calculate the accuracy
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy
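To see what the function above is counting, here is a small pure Python sketch of the same idea on made-up data. It does not use the fitted model; `confusion_counts` is a hypothetical helper that takes the probabilities directly.

```python
def confusion_counts(actual, predicted_probabilities, threshold=0.5):
    """Return a 2x2 table: rows are actual 0/1, columns are predicted 0/1."""
    table = [[0, 0], [0, 0]]
    for y, p in zip(actual, predicted_probabilities):
        predicted = 1 if p >= threshold else 0
        table[y][predicted] += 1
    return table

# Made-up actual outcomes and predicted probabilities
actual = [0, 0, 1, 1, 1, 0]
probs = [0.1, 0.6, 0.8, 0.4, 0.9, 0.3]

table = confusion_counts(actual, probs)
# Correct predictions sit on the diagonal of the table
accuracy = (table[0][0] + table[1][1]) / len(actual)
```

Here four of the six observations fall on the diagonal, so the accuracy is 4/6.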
[TBA]