Logistic Regression

By Chi Kit Yeung in Statistics, Python, Machine Learning

August 17, 2024

Introduction

Logistic regression is a form of supervised machine learning that predicts a categorical dependent variable from one or more independent variables. In its basic (binary) form, it is a model that estimates the probability of one of two outcomes (1 or 0, True or False).

Brainstorming some problems that may be answered using logistic regression:

Will the customer buy anything?
Will it rain?
Will this student get admitted?
Will this person land a data scientist job?

I realized that the example problems all expect a ‘Yes’ or ‘No’ answer but the prediction is more of a ‘True’ or ‘False’ categorization.
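
To make the "probability of a binary outcome" idea concrete, here is a minimal sketch of the logistic (sigmoid) function that turns a linear combination of the inputs into a probability. The coefficients b0 and b1 and the input values are made up for illustration and do not come from any fitted model.

```python
import numpy as np

def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and slope, chosen only for this illustration
b0, b1 = -1.5, 0.8

x = np.array([-2.0, 0.0, 2.0, 5.0])
probs = logistic(b0 + b1 * x)          # probability that the outcome is 1
labels = (probs >= 0.5).astype(int)    # True/False categorization as 1/0

print(probs)
print(labels)
```

The probability is the model's real output; the 'Yes' or 'No' answer only appears once a cutoff such as 0.5 is applied.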

Assumptions

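Logistic regression shares most of the linear regression assumptions, such as independent observations and no strong multicollinearity among the independent variables. The main exception is the first one, ‘linearity’: the outcome itself is not assumed to be a linear function of the independent variables. Instead, the log-odds of the outcome are assumed to be linear in them. Normally distributed, constant-variance errors are also not required.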

Maximum Likelihood Estimation (MLE)

MLE is the method used to estimate the parameter values of the logistic regression model. In layman's terms, it is the method used to find the logistic function that best explains the data. The likelihood function measures how probable the observed data are under a given set of parameter values. The optimizer iteratively adjusts those values until the likelihood (in practice, the log-likelihood) stops improving, and the values at that maximum are the maximum likelihood estimates.
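
The iterative search can be sketched in a few lines of NumPy. The example below uses Newton-Raphson, one common way to maximize the log-likelihood, on synthetic data. The function names, the data and the "true" coefficients are all assumptions made for this illustration, not the course's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of the data for a given parameter vector."""
    p = sigmoid(X @ beta)
    eps = 1e-12  # guards against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_logit_newton(X, y, max_iter=25, tol=1e-8):
    """Maximize the log-likelihood with Newton-Raphson steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)                 # slope of the log-likelihood
        hessian = -(X.T * (p * (1 - p))) @ X     # curvature (negative definite)
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step
        if np.max(np.abs(step)) < tol:           # stop once the update is tiny
            break
    return beta

# Synthetic data: a constant column plus one feature (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([np.ones_like(x1), x1])
true_beta = np.array([-0.5, 1.5])
y = (rng.random(500) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_logit_newton(X, y)
print("estimates:", beta_hat)
print("log-likelihood at start:", log_likelihood(np.zeros(2), X, y))
print("log-likelihood at optimum:", log_likelihood(beta_hat, X, y))
```

Each iteration moves the parameters uphill on the log-likelihood surface, and the loop stops when the step becomes negligible.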

Log-likelihood

The value of the Log-likelihood is almost always negative.
The higher the value (the closer to zero), the better the model fits the data.

LL-Null

This is the Log-likelihood value of a model with no independent variables (only the constant). It is useful as a baseline to compare the fitted model's Log-likelihood against, to see if the model has any explanatory power.

LLR p-value (Log Likelihood Ratio)

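This is the p-value of the likelihood ratio test, which is based on the Log-likelihood value of the model and the LL-Null. It measures whether the fitted model is statistically different from the model with no independent variables. The lower the better: a value below the chosen significance level (commonly 0.05) indicates that the model as a whole has explanatory power.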
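
As a rough check of how this number comes about, the sketch below recomputes it from the Log-Likelihood and LL-Null values of the single-variable summary table shown later in this post. It relies on the fact that, for one degree of freedom, the chi-squared tail probability can be written with the complementary error function. The variable names are my own.

```python
import math

# Values copied from the single-variable summary table (rounded there)
ll_model = -282.89
ll_null = -359.05

# Likelihood ratio statistic: twice the gain in log-likelihood
lr_stat = 2 * (ll_model - ll_null)

# With 1 extra parameter (Df Model = 1) the statistic is compared to a
# chi-squared distribution with 1 degree of freedom, whose upper tail
# equals erfc(sqrt(x / 2)). This shortcut only holds for 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(lr_stat)
print(p_value)
```

The result should land close to the reported LLR p-value, with small differences because the inputs are rounded in the table.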

Pseudo R-squared

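Unlike linear regression, logistic regression does not have a direct equivalent of the R-squared. Several Pseudo R-squared measures exist as substitutes; the one reported by statsmodels is McFadden's, computed as 1 - (Log-likelihood / LL-Null). As a rule of thumb, a good McFadden's Pseudo R-squared value is between 0.2 and 0.4.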
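
A quick sketch of McFadden's formula, using the Log-Likelihood and LL-Null values from the two summary tables later in this post. The helper name mcfadden_r2 is made up for this example.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1 - ll_model / ll_null

# Log-likelihood values copied from the summary tables
single_var = mcfadden_r2(-282.89, -359.05)   # model with indep_A only
multi_var = mcfadden_r2(-174.39, -359.05)    # model with five variables

print(round(single_var, 4))
print(round(multi_var, 4))
```

Both results can be compared against the "Pseudo R-squ." entries in the corresponding tables.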

Performing the Logistic Regression

Using the statsmodels Package

Importing the packages

import pandas as pd
import statsmodels.api as sm
import numpy as np

Load the data

data = pd.read_csv('so_and_so.csv')

Declare the dependent and independent variables

y = data['y']
x1 = data['indep_A']

The Regression Itself

x = sm.add_constant(x1)

model = sm.Logit(y, x)
results_log = model.fit()

Interpreting and evaluating the model by generating the summary table.

>>> results_log.summary()
"""
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  518
Model:                          Logit   Df Residuals:                      516
Method:                           MLE   Df Model:                            1
Date:                Sat, 17 Aug 2024   Pseudo R-squ.:                  0.2121
Time:                        18:44:43   Log-Likelihood:                -282.89
converged:                       True   LL-Null:                       -359.05
Covariance Type:            nonrobust   LLR p-value:                 5.387e-35
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.7001      0.192     -8.863      0.000      -2.076      -1.324
indep_A        0.0051      0.001      9.159      0.000       0.004       0.006
==============================================================================
"""

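One way to read the coefficients in the table above: each coefficient is the change in the log-odds of the outcome per unit of the variable, so exponentiating it gives an odds ratio. The sketch below plugs in the rounded const and indep_A values from the table; the function name and the example input values are assumptions for illustration.

```python
import math

# Rounded coefficients copied from the single-variable summary table
const = -1.7001
coef_a = 0.0051

# Multiplicative change in the odds for a one-unit increase in indep_A
odds_ratio = math.exp(coef_a)

def predicted_probability(indep_a):
    """Probability of y = 1 implied by the fitted coefficients."""
    log_odds = const + coef_a * indep_a
    return 1.0 / (1.0 + math.exp(-log_odds))

print(odds_ratio)
for value in (0, 200, 400, 600):   # hypothetical indep_A values
    print(value, predicted_probability(value))
```

Because the coefficient is positive, the predicted probability rises as indep_A grows.
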
Create a model with multiple independent variables.

x1 = data[['indep_A', 'indep_B', 'indep_C', 'indep_D', 'indep_E']]
x = sm.add_constant(x1)

multi_model = sm.Logit(y, x)
multi_results = multi_model.fit()

>>> multi_results.summary()
"""
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  518
Model:                          Logit   Df Residuals:                      512
Method:                           MLE   Df Model:                            5
Date:                Sat, 17 Aug 2024   Pseudo R-squ.:                  0.5143
Time:                        18:46:53   Log-Likelihood:                -174.39
converged:                       True   LL-Null:                       -359.05
Covariance Type:            nonrobust   LLR p-value:                 1.211e-77
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.0211      0.311     -0.068      0.946      -0.631       0.589
indep_A           0.0070      0.001      9.381      0.000       0.006       0.008
indep_B          -0.8001      0.089     -8.943      0.000      -0.975      -0.625
indep_C          -1.8322      0.330     -5.556      0.000      -2.478      -1.186
indep_D           2.3585      1.088      2.169      0.030       0.227       4.490
indep_E           1.5363      0.501      3.067      0.002       0.554       2.518
=================================================================================
"""

Generating Predictions

Predictions can be generated using the predict method.

multi_results.predict(x)

x is a dataframe with the same features used to train the model, in the same order. Be sure to also add a constant column to the beginning of the dataframe using statsmodels’ add_constant function. The predictions are returned as a Series of float values that represent the probability of the outcome being 1 (e.g. something like 0.787412). The results can be rounded to obtain the predictions as 0 or 1 values.

round(multi_results.predict(x))
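
Rounding is equivalent to a 0.5 cutoff, with one wrinkle: NumPy rounds exact halves to the nearest even number, so a probability of exactly 0.5 becomes 0. An explicit comparison avoids that and makes it easy to try other cutoffs. The probabilities below are made-up stand-ins for the output of predict.

```python
import numpy as np

# Hypothetical predicted probabilities standing in for multi_results.predict(x)
probs = np.array([0.787412, 0.5, 0.231, 0.64])

rounded = np.round(probs)                # exact 0.5 rounds down to 0.0
labels = (probs >= 0.5).astype(int)      # explicit 0.5 cutoff, 0.5 counts as 1
strict = (probs >= 0.7).astype(int)      # a stricter, arbitrary cutoff

print(rounded)
print(labels)
print(strict)
```

Which cutoff is appropriate depends on the cost of false positives versus false negatives for the problem at hand.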

Using Scikit Learn

This was actually not included in the online course so I searched for the method independently.

# Importing the packages
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Define the dependent and independent variables
y = data['dependent']
x = data[['indep_A', 'indep_B', 'indep_C', 'indep_D', 'indep_E']]

# Scale and Perform the Regression
log = LogisticRegression()
scaler = StandardScaler()

scaler.fit(x)
x_scaled = scaler.transform(x)

log.fit(x_scaled, y)

# Generate Predictions

log.predict(x_scaled)
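
A slightly fuller sketch of the same idea, assuming a held-out test set is wanted: the scaler and the model are chained in a pipeline so the scaling is learned from the training rows only. The data here is generated with make_classification purely for illustration. Note that scikit-learn's LogisticRegression applies L2 regularization by default, so its coefficients will not exactly match the statsmodels ones.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 500 rows, 5 features
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling is fitted on the training data only, then reused on the test data
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]   # probability of class 1
preds = clf.predict(X_test)               # class labels (0.5 cutoff)

acc = accuracy_score(y_test, preds)
cm = confusion_matrix(y_test, preds)      # rows: actual, columns: predicted
print(acc)
print(cm)
```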

Checking the Accuracy

Confusion Matrix

def confusion_matrix(data,actual_values,model):
    """
    Confusion matrix 
    
    Parameters
    ----------
    data: data frame or array
        data is a data frame formatted in the same way as your input data (without the actual values)
        e.g. const, var1, var2, etc. Order is very important!
    actual_values: data frame or array
        These are the actual values from the test_data
        In the case of a logistic regression, it should be a single column with 0s and 1s
        
    model: a LogitResults object
        this is the variable where you have the fitted model 
        e.g. results_log in this course
    ----------
    """
    # Predict the values using the Logit model
    pred_values = model.predict(data)
    # Specify the bins
    bins = np.array([0, 0.5, 1])
    # Create a 2D histogram: rows are the actual values, columns are the predictions.
    # Predicted values between 0 and 0.5 will be considered 0;
    # if they are between 0.5 and 1, they will be considered 1
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Calculate the accuracy: correct predictions (the diagonal) over all observations
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy
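
To see how the histogram trick lays out the matrix, here is a small self-contained sketch with made-up actual values and probabilities. The helper confusion_from_probs is a hypothetical, stripped-down version of the function above that takes probabilities directly instead of a fitted model.

```python
import numpy as np

def confusion_from_probs(actual, probs):
    """2x2 confusion matrix: rows are actual 0/1, columns are predicted 0/1."""
    bins = np.array([0, 0.5, 1])
    return np.histogram2d(actual, probs, bins=bins)[0]

# Made-up actual outcomes and predicted probabilities
actual = np.array([0, 0, 1, 1, 1, 0])
probs = np.array([0.1, 0.7, 0.8, 0.4, 0.9, 0.2])

cm = confusion_from_probs(actual, probs)
tn, fp = cm[0]   # actual 0: correctly predicted 0, wrongly predicted 1
fn, tp = cm[1]   # actual 1: wrongly predicted 0, correctly predicted 1

accuracy = (tn + tp) / cm.sum()
precision = tp / (tp + fp)   # share of predicted 1s that were right
recall = tp / (tp + fn)      # share of actual 1s that were found

print(cm)
print(accuracy, precision, recall)
```

Accuracy alone can be misleading when one class is much more common than the other, which is why precision and recall are worth reading alongside it.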

[TBA]