Linear Regression Analysis
Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables, usually denoted X. In practice, linear regression is used in two ways: prediction and feature engineering.
For prediction, linear regression fits a predictive model to an observed dataset of y and X values. Once the model is fitted, it can predict the value of y for a new value of X.
For feature engineering, given a variable y and a number of variables X1, X2, ..., Xn (collectively denoted X in matrix representation) that may be related to y, linear regression analysis can quantify the strength of the relationship between y and X, help determine which X variables may have no relationship with y, and identify subsets of X that contain redundant information about y.
Linear regression models are often fitted using the least squares approach, but they can also be fitted using a penalized version of the least squares loss function, as in Ridge Regression (L2) and LASSO (L1). In our system, this is controlled by setting the second regularization parameter, alpha, to a value from 0 to 1.
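To make the penalty concrete, here is a minimal NumPy sketch of one common (glmnet-style) parameterization of the elastic net penalty, where alpha mixes the L1 and L2 terms; the exact constant factors used by any particular system may differ, and the function name is illustrative.

```python
import numpy as np

def elastic_net_penalty(coef, lam, alpha):
    """Elastic net penalty term added to the least squares loss.

    lam   -- overall regularization strength (lambda >= 0)
    alpha -- mixing parameter: 0 gives pure L2 (Ridge), 1 gives pure L1 (LASSO)
    """
    w = np.asarray(coef, dtype=float)
    l1 = np.sum(np.abs(w))        # LASSO component
    l2 = 0.5 * np.sum(w ** 2)     # Ridge component
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

w = [1.0, -2.0]
print(elastic_net_penalty(w, lam=1.0, alpha=1.0))  # pure L1: 3.0
print(elastic_net_penalty(w, lam=1.0, alpha=0.0))  # pure L2: 2.5
print(elastic_net_penalty(w, lam=1.0, alpha=0.5))  # mixture: 2.75
```

Setting lam to 0 removes the penalty entirely, which recovers ordinary least squares regardless of alpha.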
Lambda and Alpha Values and Results
This table describes the type of penalized model that results, based on the values specified for the alpha and lambda options.
| Lambda Value | Alpha Value | Result |
|---|---|---|
| 0 | Any value | No regularization. Alpha is ignored. |
| > 0 | == 0 | Ridge Regression |
| > 0 | == 1 | LASSO |
| > 0 | 0 < Alpha < 1 | Elastic Net Penalty |
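The table above can be expressed as a small selection function. This is an illustrative sketch only (the function name and return labels are not part of the product):

```python
def penalty_type(lam, alpha):
    """Return the model type implied by the lambda/alpha combination."""
    if lam == 0:
        return "No regularization"   # alpha is ignored in this case
    if alpha == 0:
        return "Ridge Regression"
    if alpha == 1:
        return "LASSO"
    if 0 < alpha < 1:
        return "Elastic Net Penalty"
    raise ValueError("alpha must be in [0, 1]")

print(penalty_type(0, 0.7))    # No regularization
print(penalty_type(0.5, 0))    # Ridge Regression
print(penalty_type(0.5, 1))    # LASSO
print(penalty_type(0.5, 0.3))  # Elastic Net Penalty
```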
Defining Input Parameters
This table describes the input parameters on the Fill in Parameters tab, which are required to run a linear regression on a dataset.
| Field | Input Description |
|---|---|
| Prediction Column | Column in the dataset on which you want to make a prediction using the linear regression algorithm. |
| Input Columns | Columns in the dataset you want to use as supporting data points to calculate the prediction. |
| Split | The percentage of data used for training versus prediction. The minimum value is 0.1. An entry of 0.8 means 80% of the data is used for training and 20% for prediction. |
| Max Iteration | Linear regression uses an iterative gradient descent algorithm to search the solution space for the optimum. This variable is the upper limit on the number of iterations; if optimal parameters are found before reaching Max Iteration, the algorithm stops iterating. Enter a number between 10 and 1000. |
| Seed | Random seed. Using the same seed number produces the same results across multiple executions. |
| Alpha | A regularization parameter that linearly combines the L1 and L2 penalties of the LASSO and Ridge methods. For alpha = 0, the penalty is an L2 penalty; for alpha = 1, it is an L1 penalty; for alpha in (0, 1), the penalty is a combination of L1 and L2. |
| Lambda | A regularization parameter used to avoid overfitting. The larger the lambda, the more the regression coefficients are shrunk toward zero. When lambda == 0, regularization is disabled and ordinary linear models are fitted. |
| Metrics Tab | Displays the metric values for your analysis. |
| Training Metrics | The Metrics tab shows the values for the following training metrics: RootMeanSquareError, R-Square, MeanAbsoluteError, and Co-efficients. |
| Evaluation Metrics | The Metrics tab shows the values for the following evaluation metrics: RootMeanSquareError, R-Square, and MeanAbsoluteError. |
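To see how Split, Max Iteration, and Seed interact, here is a minimal, self-contained gradient descent fit in plain NumPy. This is an illustration, not the product's actual implementation; the learning rate, convergence tolerance, and function name are assumptions.

```python
import numpy as np

def fit_linear_gd(x, y, split=0.8, max_iteration=1000, seed=42, lr=0.5, tol=1e-6):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)      # Seed: same seed -> same shuffle -> same result
    idx = rng.permutation(len(x))
    n_train = int(split * len(x))          # Split: fraction of rows used for training
    xt, yt = x[idx[:n_train]], y[idx[:n_train]]

    w, b = 0.0, 0.0
    for _ in range(max_iteration):         # Max Iteration: upper bound on iterations
        resid = (w * xt + b) - yt
        grad_w = 2.0 * np.mean(resid * xt)
        grad_b = 2.0 * np.mean(resid)
        if np.hypot(grad_w, grad_b) < tol: # stop early once optimal parameters are found
            break
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 2.0                          # noiseless line, for illustration
w, b = fit_linear_gd(x, y)
print(round(w, 3), round(b, 3))            # recovers slope ~3 and intercept ~2
```

Because the data are noiseless, the fit converges to the true slope and intercept well before the iteration cap, illustrating the early-stop behavior described for Max Iteration.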
Analysis Panel with Linear Regression Illustration
This image shows an example of the Fill in Parameters tab on the analysis panel, when using the Linear Regression algorithm.
Understanding Output Metrics
Once a job is complete, the Metrics tab displays the results of the analysis.
Linear Regression algorithm metrics include:
- Training Metrics: The name of each feature column you selected as a data point and its numeric value.
  - RootMeanSquareError: Deviation of the residuals (prediction errors), used to verify experimental results.
  - R-Square: Statistical measure of how close the data are to the fitted regression line.
  - MeanAbsoluteError: A measure of the difference between two continuous variables.
  - Co-efficients: Each coefficient represents the rate of change of the dependent variable (y) as a function of changes in an explanatory variable (x); for a single explanatory variable, it is the slope of the regression line.
- Evaluation Metrics: The metrics used in the Linear Regression algorithm and their calculated values, which include:
  - R-Square: Statistical measure of how close the data are to the fitted regression line.
  - MeanAbsoluteError: A measure of the difference between two continuous variables.
  - RootMeanSquareError: Deviation of the residuals (prediction errors), used to verify experimental results.
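The three shared metrics can be computed directly from actual and predicted values. The following NumPy sketch shows the standard formulas these names usually refer to (the product may use slightly different conventions; the function name is illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R-Square from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    rmse = np.sqrt(np.mean(resid ** 2))             # RootMeanSquareError
    mae = np.mean(np.abs(resid))                    # MeanAbsoluteError
    ss_res = np.sum(resid ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                      # R-Square
    return rmse, mae, r2

rmse, mae, r2 = regression_metrics([1, 2, 3, 4], [1, 2, 3, 5])
print(rmse, mae, r2)  # 0.5 0.25 0.8
```

An R-Square of 1.0 means the predictions fall exactly on the fitted line; values closer to 0 indicate the model explains little of the variance in y.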
Analysis Panel Metrics Tab Illustration
This image shows the Metrics tab from a linear regression analysis:
How to Run a Linear Regression Analysis
Follow these steps to run an analysis using linear regression:
- Start a cluster.
- Open or create a workspace.
- Click the Add Analysis Panel button and select Linear Regression from the dialog box. The Linear Regression analysis panel opens.
- Select the dataset you want to use from the Select a Dataset list.
- Enter parameters in all required input fields on the Fill in Parameters tab. See the Defining Input Parameters section for details.
- Click Run. A Job Submitted success message displays, the panel updates as the job progresses, and the Metrics tab displays the results.
Except where otherwise noted, content on this site is licensed under the Development License Agreement.