What is Regression?
Regression is a statistical process for predicting (or estimating) the values of one variable, called the “Dependent Variable”, from known values of one or more other variables, called “Independent Variable(s)”.
For example, if we know that the effort and size of a software project are correlated, we can estimate the expected effort for a project of a given size.
When there is more than one independent variable, it is called “Multiple Regression”.
What is Regression Analysis?
Regression Analysis is a method of finding the line (or curve) of best fit for a set of data.
It is a mathematical procedure that produces two results.
First, it produces an equation that fits the data gathered. Several forms of equation are possible (linear, quadratic, cubic, exponential, etc.), so it is worth trying more than one to see which matches the collected data most closely.
Second, regression analysis (or multiple regression analysis, if more than one independent variable is involved) produces numbers that indicate how closely the fitted equation matches the data.
For example, the dependent variable might be overall satisfaction and the independent variables might be price, quality, value for money, delivery time and staff knowledge.
The multiple regression analysis would then identify the relationship between the dependent variable and the independent variables. This is presented as an equation or model (formula) that might look like this (here using only three of the ratings):
Overall satisfaction = 1.37 * price rating + 0.91 * quality rating + 0.64 * delivery time rating + 2.42 (a constant)
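As a minimal sketch of how such a model is used, the equation above can be written as a small function. The function name and the input ratings below are hypothetical and chosen only to illustrate the arithmetic; the coefficients are the ones from the example equation.

```python
def predict_satisfaction(price_rating, quality_rating, delivery_time_rating):
    """Evaluate the example model from the text for one respondent."""
    return (
        1.37 * price_rating
        + 0.91 * quality_rating
        + 0.64 * delivery_time_rating
        + 2.42  # the constant term
    )


# Hypothetical ratings, not taken from any real survey.
example = predict_satisfaction(
    price_rating=3, quality_rating=4, delivery_time_rating=2
)
print(round(example, 2))
```

Each coefficient says how much the predicted overall satisfaction changes when that rating goes up by one unit while the other ratings stay the same.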
Approach to Regression Analysis
The standard approach in regression analysis is:
Gather data – Collect past data for the independent variable(s) and the corresponding values of the dependent variable.
Determine the form of equation to fit – Plot the dependent variable against the independent variable (with multiple independent variables, take one at a time) on a special graph called a scatter plot, which shows whether a statistical relationship exists between the variables. Examine the pattern the points form.
Fit an equation – Depending on the number of independent variables, a simple (Y = a + bX) or multiple (Y = a + b1*X1 + b2*X2 + … + bp*Xp) regression equation is selected, and its coefficients are estimated from the data.
Evaluate the fit using statistics – such as the Coefficient of Determination (R²), the Standard Error of Estimate (SE), etc.
A related statistic is the linear correlation coefficient, r, which indicates how closely the data fit a straight line. The closer r is to 1 (for a positive correlation) or -1 (for a negative correlation), the better the fit. A value of 0 indicates no linear relationship at all.
R² (equal to r² in simple linear regression) is the coefficient of determination, which indicates how closely the fitted curve matches the data. Its values range from 0 to 1, with 1 being a perfect fit.
The standard error of estimate measures the reliability and accuracy of the regression equation when predicting the value of the dependent variable for given value(s) of the independent variable(s). It measures the variability of the observed values of the dependent variable (Y) around the regression line.
Use the equation – Predict the value of the dependent variable for given value(s) of the independent variable(s).
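The steps above can be sketched for the simple case Y = a + bX. This is an illustrative sketch, not a definitive implementation. It assumes the coefficients are estimated by ordinary least squares, and that the standard error of estimate uses n - 2 in the denominator (some texts divide by n instead). The function name and the size/effort figures are hypothetical, made up only to show the calculation.

```python
import math


def fit_simple_regression(x, y):
    """Fit Y = a + bX by least squares and return the fit statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept

    r = sxy / math.sqrt(sxx * syy)  # linear correlation coefficient

    # Unexplained variation: squared distances from the fitted line.
    residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - residual_ss / syy     # coefficient of determination
    se = math.sqrt(residual_ss / (n - 2))  # standard error of estimate

    return {"a": a, "b": b, "r": r, "r_squared": r_squared, "se": se}


# Step 1: gather data (hypothetical project sizes and efforts).
size = [1, 2, 3, 4, 5]
effort = [2, 4, 5, 4, 5]

# Steps 3 and 4: fit the equation and evaluate the fit.
model = fit_simple_regression(size, effort)
print(model)

# Step 5: use the equation to predict effort for a new size.
predicted_effort = model["a"] + model["b"] * 6
print(predicted_effort)
```

The scatter-plot step is left out because it needs a plotting library. In practice the data would be plotted first to check that a straight line is a reasonable form before fitting it.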