Assignment 5: Checking Statistical Assumptions

The mtcars dataset included in R was extracted from the 1974 Motor Trend US magazine. It contains 32 observations of different vehicles (1973-74 models), recording their fuel consumption and 10 other aspects of design and performance.
For this assignment, a multiple linear regression model was built to predict mpg (miles per gallon of fuel) from cyl (number of cylinders), hp (gross horsepower), wt (vehicle weight in units of 1000 lb), qsec (1/4 mile time) and am (transmission: automatic or manual). The predictors were chosen based on an educated guess.
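
The code used for the model is not part of this report, so the following is only a sketch of how such a model might be fitted. It assumes that cyl and am were entered as factors (the GVIF values discussed later suggest at least one multi-level factor); the object names cars and fit are illustrative.

```r
# Sketch only: the factor coding of cyl and am is an assumption.
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Multiple linear regression of mpg on the five chosen predictors.
fit <- lm(mpg ~ cyl + hp + wt + qsec + am, data = cars)

summary(fit)  # coefficients, standard errors, R-squared
```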

To diagnose this model, several plots were produced:
The residuals vs. fitted plot shows the residuals scattered fairly randomly around the zero line, which might indicate homoscedasticity. However, the smoothed trend line is not straight, which already suggests that a linear model might not be the best approach in this case. The homoscedasticity assumption is better explored in the scale-location plot, which is derived from the first plot but shows the square root of the absolute standardized residuals on the y-axis. Here the trend line slopes slightly upward, which suggests that the residual variance is in fact not constant. Under homoscedasticity, the line would be approximately straight and horizontal.
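
As a rough illustration of these two plots, the sketch below draws them with base R and rebuilds the scale-location y-axis by hand. The model specification (cyl and am as factors) and the names cars, fit and sl are assumptions, and the slope check at the end is only an informal summary, not a formal test.

```r
# Assumed model specification; the original code is not shown.
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))
fit <- lm(mpg ~ cyl + hp + wt + qsec + am, data = cars)

# Built-in versions: 1 = residuals vs. fitted, 3 = scale-location.
par(mfrow = c(1, 2))
plot(fit, which = c(1, 3))

# The scale-location y-axis by hand: sqrt of |standardized residual|.
sl <- sqrt(abs(rstandard(fit)))

# Informal numeric check of the trend: a clearly positive slope against
# the fitted values points towards increasing residual spread.
coef(lm(sl ~ fitted(fit)))
```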
Furthermore, the Q-Q plot shows whether the standardized residuals follow a normal distribution. In this case, they do not seem to. Moreover, there might be some outliers at the extremes (such as the Chrysler Imperial or the Toyota Corolla). Finally, the last plot shows Cook’s distance, which is a function of the leverage and the standardized residual associated with each data point, and is used to estimate the influence of a data point on an OLS regression.
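
To make the link between Cook's distance, leverage and standardized residuals concrete, the sketch below rebuilds the distance from those two ingredients and compares it with R's own cooks.distance(). The model specification and the names r, h, p and d_manual are assumptions made for illustration.

```r
# Assumed model specification; the original code is not shown.
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))
fit <- lm(mpg ~ cyl + hp + wt + qsec + am, data = cars)

# 2 = normal Q-Q plot, 4 = Cook's distance per observation.
par(mfrow = c(1, 2))
plot(fit, which = c(2, 4))

# Cook's distance rebuilt from its two ingredients.
r <- rstandard(fit)       # standardized residuals
h <- hatvalues(fit)       # leverages
p <- length(coef(fit))    # number of estimated coefficients
d_manual <- r^2 / p * h / (1 - h)

all.equal(d_manual, cooks.distance(fit))  # the two should agree

# The three most influential observations by Cook's distance.
head(sort(cooks.distance(fit), decreasing = TRUE), 3)
```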
Additionally, to test for outliers, it is possible to compute the Bonferroni p-value for the most extreme observation (Chrysler Imperial). The test returns a studentized residual (t statistic) of 2.138038 and an unadjusted p-value of 0.042901. Multiplying by the 32 observations gives a Bonferroni-adjusted value above 1 (32 × 0.042901 ≈ 1.37), which is effectively a p-value of 1. There is therefore not enough evidence to reject the null hypothesis that there are no outliers, and no statistical basis for treating the Chrysler Imperial as one.
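
The Bonferroni calculation can be sketched in base R as follows. This mirrors, as far as can be told, what a routine such as outlierTest() in the car package reports, but the model specification and all object names are assumptions, so the numbers need not match those quoted above.

```r
# Assumed model specification; the original code is not shown.
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))
fit <- lm(mpg ~ cyl + hp + wt + qsec + am, data = cars)

t_i   <- rstudent(fit)           # externally studentized residuals
worst <- which.max(abs(t_i))     # most extreme observation
df    <- df.residual(fit) - 1    # one df is lost to the deleted case

# Two-sided p-value for the most extreme residual, before adjustment.
p_unadj <- unname(2 * pt(abs(t_i[worst]), df = df, lower.tail = FALSE))

# Bonferroni: multiply by the number of residuals tested, cap at 1.
n <- length(t_i)
p_bonf <- min(1, n * p_unadj)

names(worst)
c(unadjusted = p_unadj, bonferroni = p_bonf)

# Same arithmetic with the figures quoted in the text:
32 * 0.042901   # 1.372832, above 1, so the adjusted p-value is 1
```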

Multicollinearity was assessed with (generalized) variance inflation factors. Although there is an intense debate on how the VIF should be interpreted, a good rule of thumb is that the GVIF should not exceed 5, or that the GVIF^(1/(2*Df)) should not exceed 2. For a term with a single coefficient, the square root of the GVIF indicates how many times larger the standard error is, compared with what it would be if that variable were uncorrelated with the other predictor variables in the model; GVIF^(1/(2*Df)) extends this to terms with several coefficients. Therefore, the larger the GVIF or the GVIF^(1/(2*Df)), the higher the multicollinearity. According to this rule of thumb, in this case hp, wt, and qsec show multicollinearity.
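
A possible way to obtain these values is sketched below, assuming the car package is installed and that cyl and am were entered as factors (which is what makes vif() report GVIF and Df columns). The base-R cross-check for hp and the names v and r2_hp are added for illustration only.

```r
library(car)  # assumed to be installed; provides vif()

# Assumed model specification; the original code is not shown.
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))
fit <- lm(mpg ~ cyl + hp + wt + qsec + am, data = cars)

# With a multi-level factor in the model, vif() returns a matrix with
# three columns: GVIF, Df and GVIF^(1/(2*Df)).
v <- vif(fit)
v

# Terms above the rule-of-thumb threshold of 2 on the adjusted scale
# (third column).
rownames(v)[v[, 3] > 2]

# Base-R cross-check for a 1-df term: VIF = 1 / (1 - R^2), where R^2
# comes from regressing that predictor on all the others.
r2_hp <- summary(lm(hp ~ cyl + wt + qsec + am, data = cars))$r.squared
1 / (1 - r2_hp)   # should match the GVIF reported for hp
```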
Finally, to evaluate non-linearity, component + residual plots and CERES plots can be drawn:
Both plots show strong non-linearity (the green smoothed line traces the relationship the data suggest, against the dashed red straight line expected under an ideal linear model). These plots show once more that this model should not be analyzed as a linear model. In conclusion, this model does not fulfill the assumptions for OLS regression.
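
A minimal sketch of how such plots might be produced is given below. It assumes the car package is installed and restricts the plots to the numeric predictors; the model specification and the name pr are assumptions, and line colours may differ between versions of car.

```r
library(car)  # assumed to be installed; provides crPlots(), ceresPlots()

# Assumed model specification; the original code is not shown.
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))
fit <- lm(mpg ~ cyl + hp + wt + qsec + am, data = cars)

# Component + residual (partial residual) plots for the numeric terms.
crPlots(fit, terms = ~ hp + wt + qsec)

# CERES plots for the same terms.
ceresPlots(fit, terms = ~ hp + wt + qsec)

# The partial residuals behind the first set of plots, from base R.
pr <- residuals(fit, type = "partial")
head(pr[, c("hp", "wt", "qsec")])
```

A smooth that bends away from the straight line in any panel would point to a predictor whose effect is not well described by a single linear term.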

