I gathered a small data set on alcohol consumption (wine and liquor) and the death rate due to cirrhosis. I analyzed the correlation between the variables and ran a multiple regression to see how well the cirrhosis death rate could be predicted. I tried several different combinations of variables to find the regression with the most accuracy and significance. The best output I got was with cirrhosis death rate as the dependent variable, and wine consumption, liquor consumption, and population as the independent variables. It produced an R square of .78 and an F statistic of 50.84, with all P-values close to zero. Something I did not expect was that wine consumption had a stronger correlation with the cirrhosis death rate (in the correlation output and scatter plot) and a more significant t-stat than liquor consumption did. The regression equation produced was y = 10.6829 + .3489x1 + 2.0171x2 + .1765x3, where 10.6829 is the y-intercept.
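For anyone who wants to see where R square and the F statistic come from outside of Excel, here is a minimal sketch in Python with NumPy. The data are randomly generated stand-ins, not the cirrhosis data set, and the function name fit_ols and the made-up coefficients are my own assumptions for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y = b0 + b1*x1 + ... + bk*xk by least squares.

    Returns the coefficients (intercept first), R square, and the F statistic.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    # Add a column of ones so the first coefficient is the y-intercept.
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)                # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())  # total variation
    r2 = 1.0 - ss_res / ss_tot
    # F compares explained to unexplained variation, each per degree of freedom.
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    return coef, r2, f_stat

# Made-up data: three independent variables and a noisy linear response.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 10.0 + 0.5 * X[:, 0] + 2.0 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(scale=0.5, size=40)

coef, r2, f_stat = fit_ols(X, y)
print("coefficients:", coef)
print("R square:", r2)
print("F statistic:", f_stat)
```

The coefficients play the same role as the numbers in the regression equation above: the first is the y-intercept and the rest multiply x1, x2, and x3.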
Here is the Excel spreadsheet for this assignment:
Multicollinearity: when two or more independent variables in a multiple regression model are highly correlated.
In the Pueblo real estate data, improvement value and total value are highly correlated (0.93637), and land value and total value are also correlated (0.46456). This is because the total value is the land and improvement values added together. If you took out either total value or the land and improvement values, the model would be more reliable, because the remaining variables would no longer carry the same information twice.
Other correlations: total rooms with bedrooms (0.7940127), total sqft with first floor area and with other finished area (both around .7), and total sqft with above first floor area (0.63174).
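The total value problem can be sketched in Python with NumPy. The numbers below are randomly generated, not the Pueblo data, and the helper high_corr_pairs and its 0.6 cutoff are assumptions chosen only for illustration. The sketch lists pairs of variables whose correlation is at or above the cutoff, then checks the rank of the data matrix to show that total value adds no new information when land and improvement values are already included.

```python
import numpy as np

def high_corr_pairs(data, names, threshold=0.6):
    """Return (name_a, name_b, r) for each pair with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are the variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Made-up property values; total is built as land + improvement.
rng = np.random.default_rng(1)
land = rng.normal(50, 5, size=200)
improvement = rng.normal(150, 30, size=200)
total = land + improvement

names = ["land", "improvement", "total"]
data = np.column_stack([land, improvement, total])

pairs = high_corr_pairs(data, names)
for a, b, r in pairs:
    print(a, "vs", b, "r =", round(r, 3))

# Three columns but only two independent pieces of information.
rank = np.linalg.matrix_rank(data)
print("columns:", data.shape[1], "rank:", rank)
```

A rank lower than the number of columns means one variable is an exact combination of the others, which is the extreme case of multicollinearity.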
Pg. 598 #3
My final answer came out with an r value of .18 and an F stat of 7.04.
I took out Total Assets, Total Revenue, and Return on Equity. They had higher correlations with the other variables and were not very significant in the model.
Excel portion of homework – Copy of correlation homework 2-6-14