Logistic Regression Analysis¶
Logistic Regression is a classification algorithm used to predict a variable that takes one of two values, such as 0 or 1. The algorithm is available out of the box on the Add Analysis panel, so data scientists can run it on their datasets without writing code. Compared with non-linear classifiers such as neural networks, Logistic Regression is less computationally expensive, and its results are easier for non-data scientists to interpret.
Cases where Logistic Regression provides a good option include:
- Analyzing the sentiment of a given statement (Positive/Negative)
- Predicting who Bob is going to vote for (Democrat/Republican)
- Determining the probability that a student will enroll in a master’s program based on academic record, extracurricular activities, enrollment in online courses, projects, work experience, and so on (probability values for each class instead of a direct yes/no)
- Determining the probability of an employee staying in the current job (Loyal/Not Loyal)
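The use cases above all come down to estimating the probability of one of two classes. As a minimal sketch of that idea, and not a description of how the panel is implemented, the following Python passes a weighted sum of numeric features through the logistic (sigmoid) function. The function names, the weights, and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a probability in (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Probability of class 1 for one row of numeric features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a 0/1 label (threshold is an assumption)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0
```

Returning the probability itself, rather than only the label, corresponds to the master’s program use case, where the probability of each class is the result of interest.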
Logistic Regression Examples¶
An example where Logistic Regression might be useful is illustrated in the following table. In the data below, the target column is Income > $100K. Given details about an individual, such as education level, industry, city of employment, and job role, we can predict whether that person earns more than $100K annually.
Education | Industry | City | Role | Income > $100K |
---|---|---|---|---|
Bachelors | Retail | Charlotte | Sales Executive | No |
Bachelors | IT | New York | Software Engineer | No |
Masters | IT | San Francisco | Data Scientist | Yes |
Bachelors | Transportation | Durham | NA | No |
PhD | Education | Chicago | Professor | Yes |
PhD | IT | New York | AI Research Scientist | Yes |
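Logistic Regression works on numeric inputs, so categorical columns like the ones in this table have to be converted to numbers before a model can be fitted. The sketch below shows one common approach, one-hot encoding, applied to the rows above. Whether the panel encodes categories this way is an assumption; the helper names `one_hot_encode` and `encode_target` are hypothetical.

```python
TABLE_ROWS = [
    {"Education": "Bachelors", "Industry": "Retail", "City": "Charlotte",
     "Role": "Sales Executive", "Income > $100K": "No"},
    {"Education": "Bachelors", "Industry": "IT", "City": "New York",
     "Role": "Software Engineer", "Income > $100K": "No"},
    {"Education": "Masters", "Industry": "IT", "City": "San Francisco",
     "Role": "Data Scientist", "Income > $100K": "Yes"},
    {"Education": "Bachelors", "Industry": "Transportation", "City": "Durham",
     "Role": "NA", "Income > $100K": "No"},
    {"Education": "PhD", "Industry": "Education", "City": "Chicago",
     "Role": "Professor", "Income > $100K": "Yes"},
    {"Education": "PhD", "Industry": "IT", "City": "New York",
     "Role": "AI Research Scientist", "Income > $100K": "Yes"},
]


def one_hot_encode(rows, feature_columns):
    """Return (feature_names, matrix) with one 0/1 column per category value."""
    # Sort the distinct values so the column order is stable between runs.
    categories = {col: sorted({row[col] for row in rows}) for col in feature_columns}
    names = [f"{col}={val}" for col in feature_columns for val in categories[col]]
    matrix = [
        [1.0 if row[col] == val else 0.0
         for col in feature_columns for val in categories[col]]
        for row in rows
    ]
    return names, matrix


def encode_target(rows, target_column, positive="Yes"):
    """Map the target column to 1 (positive class) or 0."""
    return [1 if row[target_column] == positive else 0 for row in rows]
```

Each encoded row contains exactly one 1 per original column, so a model fitted on this matrix learns one coefficient per category value.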
A visual representation of the results of a Logistic Regression analysis is shown below. In this example, the data is almost linearly separable. While this analysis could be run using a non-linear classifier, such as a neural network, that alternative is far more computationally expensive.
Logistic Regression Analysis Panel Illustration¶
The following illustrates the Fill in Parameters tab on the Logistic Regression Analysis panel.
Definitions of Logistic Regression Input Parameters¶
The following table defines the input parameters for the Logistic Regression algorithm.
Parameter | Definition |
---|---|
Prediction Column | The dataset column whose value you want the logistic regression algorithm to predict. |
Input Columns | Columns in the dataset that you want to use as supporting data points to calculate the prediction. |
Split | The fraction of the data used for training versus prediction. Minimum value is 0.1. An entry of 0.8 means 80% of the data is used for training and 20% for prediction. |
Max Iteration | Logistic Regression uses an iterative gradient descent algorithm to search for the optimal model parameters. This value is the upper limit on the number of iterations. If the optimal parameters are found before the limit is reached, the algorithm stops iterating. Enter a number between 10 and 1000. |
Seed | Seed for the random number generator. Running the analysis again with the same seed and the same settings produces the same results. |
Alpha | Alpha is a regularization parameter that linearly combines the L1 and L2 penalties of the Lasso and Ridge methods. A value of 0 applies only the Ridge (L2) penalty, and a value of 1 applies only the Lasso (L1) penalty. |
Lambda | Lambda is a regularization parameter that controls the strength of the penalty and is used to avoid overfitting. The larger the lambda, the more the regression coefficients are shrunk toward zero. When the value is 0, regularization is disabled and an ordinary, unpenalized model is fitted. |
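To make the Split, Seed, and Max Iteration parameters concrete, here is a minimal sketch of a seeded train/prediction split followed by plain batch gradient descent. It is an illustration under stated assumptions, not the panel's implementation: the `learning_rate` and `tolerance` values, the stopping rule, and the function names are all hypothetical, and regularization is left out for brevity.

```python
import math
import random


def train_test_split(rows, labels, split=0.8, seed=42):
    """Shuffle with a fixed seed, then keep the first `split` fraction for training."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle
    cut = int(len(indices) * split)
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def _sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, max_iter=100, learning_rate=0.1, tolerance=1e-6):
    """Batch gradient descent on the log loss; returns (weights, bias, iterations)."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for iteration in range(1, max_iter + 1):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = _sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        step_w = [learning_rate * g / n for g in grad_w]
        step_b = learning_rate * grad_b / n
        weights = [w - s for w, s in zip(weights, step_w)]
        bias -= step_b
        # Stop early once the update is negligible (assumed convergence rule).
        if max(abs(s) for s in step_w + [step_b]) < tolerance:
            break
    return weights, bias, iteration
```

The returned iteration count shows whether the fit stopped early or ran into the `max_iter` limit, which is the situation the Max Iteration parameter is meant to bound.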
Regularization Parameters¶
The following table describes the results of the various values for the Regularization parameters described above.
Lambda Value | Alpha Value | Results |
---|---|---|
0 | Any value | No regularization - Alpha is ignored |
> 0 | = 0 | Ridge Regression |
> 0 | = 1 | LASSO |
> 0 | 0 < Alpha < 1 | Elastic Net Penalty |
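The table can be expressed as a small penalty function. The sketch below assumes the widely used elastic net form, lambda * (alpha * L1 + (1 - alpha) / 2 * L2); the exact scaling used by the panel is not stated here, so treat the formula and the function names as assumptions.

```python
def elastic_net_penalty(weights, lam, alpha):
    """Penalty added to the loss: lam * (alpha * L1 + (1 - alpha) / 2 * L2)."""
    if lam == 0:
        return 0.0  # no regularization, alpha is ignored
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)


def regularization_kind(lam, alpha):
    """Name the row of the table that a (lambda, alpha) pair falls into."""
    if lam == 0:
        return "No regularization"
    if alpha == 0:
        return "Ridge Regression"
    if alpha == 1:
        return "LASSO"
    return "Elastic Net Penalty"
```

For example, `regularization_kind(0.1, 0.5)` falls in the last row of the table, where both penalties contribute.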
How to Access the Logistic Regression Panel¶
Follow these steps to access the Logistic Regression panel:
- Navigate to the Manage Analytics Workspaces page. The Workspaces window opens.
- Start a cluster, if one is not already running.
- Open a Workspace or create a new one.
- Click Add Analysis Panel. The Select an Algorithm dialog box opens.
- Select Logistic Regression and click OK. The Logistic Regression panel opens.
How to Run a Logistic Regression Analysis¶
Follow these steps to run a Logistic Regression analysis:
- On the Logistic Regression panel, select a dataset from the Dataset drop-down list.
- Select the column you want to predict from the Prediction Column dialog.
- Select the columns to use as inputs to the prediction from the Input Columns dialog.
- Enter a value between 0.1 and 1.0 in the Split field. This value determines what fraction of the data is used for training; the remainder is used for prediction.
- Enter a numeric value in the Seed field. Using the same seed again with the same settings reproduces the same results.
- Enter a numeric value between 0 and 1 in the Alpha field to set the mix of the L1 and L2 penalties.
- Enter a numeric value of 0 or greater in the Lambda field to set the regularization strength. A value of 0 disables regularization.
- Enter a number between 10 and 1000 in the Max Iterations field to set the maximum number of iterations the algorithm performs while fitting the model.
- Click the Run button at the top of the panel. The status changes to “Running” and your analysis runs. Once the job has completed, you can view the results on the Metrics tab.
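The value ranges mentioned in the steps and parameter tables can be collected into a quick pre-flight check. The following is a hypothetical helper, not part of the product: the function name and messages are invented for illustration, and the Alpha and Lambda ranges are inferred from the regularization table rather than stated as hard limits.

```python
def validate_parameters(split, max_iterations, alpha, lam):
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    if not 0.1 <= split <= 1.0:
        problems.append("Split must be between 0.1 and 1.0")
    if not 10 <= max_iterations <= 1000:
        problems.append("Max Iterations must be between 10 and 1000")
    if not 0.0 <= alpha <= 1.0:
        # Assumed range, based on the Ridge (0) to LASSO (1) table rows.
        problems.append("Alpha must be between 0 and 1")
    if lam < 0:
        # Assumed: a negative penalty strength is not meaningful.
        problems.append("Lambda must be 0 or greater")
    return problems
```

Calling `validate_parameters(0.8, 100, 0.5, 0.01)` returns an empty list, while out-of-range values produce one message each.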
Training Metrics Parameter Definitions¶
The following table defines the parameters used in the training metrics.
Parameter | Definition |
---|---|
AUC | Area under the ROC curve |
ROC | The point on the ROC curve that gives the best trade-off between the True Positive Rate and the False Positive Rate |
FMeasure | F1 score |
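As a rough sketch of how the first two training metrics relate, the code below builds ROC points from true labels and predicted probabilities, computes the AUC with the trapezoidal rule, and picks a "best" point. Choosing the point that maximizes True Positive Rate minus False Positive Rate (Youden's J) is an assumption; the criterion the product uses is not specified. The function names are hypothetical.

```python
def roc_points(labels, scores):
    """ROC curve as (false_positive_rate, true_positive_rate) pairs.

    Assumes `labels` contains at least one 1 and at least one 0.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each distinct score is one threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def best_roc_point(points):
    """Point maximizing TPR - FPR (Youden's J, an assumed criterion)."""
    return max(points, key=lambda p: p[1] - p[0])
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.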
Evaluation Metrics Parameter Definitions¶
The following table defines the parameters used in the evaluation metrics.
Parameter | Definition |
---|---|
Accuracy | Fraction of all predictions that are correct |
True Positive Rate | Fraction of actual positives that are correctly predicted as positive |
False Positive Rate | Fraction of actual negatives that are incorrectly predicted as positive |
True Negative Rate | Fraction of actual negatives that are correctly predicted as negative |
False Negative Rate | Fraction of actual positives that are incorrectly predicted as negative |
FMeasure | F1 Score |
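The evaluation metrics can all be derived from the four cells of a confusion matrix. The sketch below uses the standard textbook definitions of each rate; whether the panel reports rates or raw counts under these names is not confirmed, and the function and key names are hypothetical.

```python
def evaluation_metrics(labels, predictions):
    """Compute accuracy, the four rates, and F1 from 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "true_positive_rate": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "true_negative_rate": ratio(tn, tn + fp),
        "false_negative_rate": ratio(fn, fn + tp),
        # F1 is the harmonic mean of precision and recall.
        "f_measure": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

The True Positive Rate and False Negative Rate always sum to 1, as do the True Negative Rate and False Positive Rate, because each pair splits the same group of actual positives or actual negatives.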
Except where otherwise noted, content on this site is licensed under the Development License Agreement.