- Log in, find the course Web page, run S-PLUS
- Use the Commands Window to execute commands
- Load data sets

- use S-PLUS to plot bivariate data with a scatterplot
- use S-PLUS to fit a multiple regression model
- use S-PLUS to make residual plots

- to interpret residual plots to determine if fitting a linear model is justified
- to understand the relationship between correlation and causation

- Which variable do you expect will be more closely related with life expectancy?
Predict whether the association between each explanatory variable and the response variable
is positive or negative.
- Load in the Life Expectancy data set.
Attach the data set by typing

> attach(television)

in the Commands Window.
- Make a plot with tv on the x axis and life on the y axis.
> plot(tv,life)

Is the association positive or negative? Is this what you expected? Does a linear relationship look adequate, or would a nonlinear relationship be better?

If a linear relationship is inadequate, try both a reciprocal and a log transformation to see which is better. Since tv is the number of people per television, its reciprocal is televisions per person.

> plot(1/tv,life)

> plot(log(tv),life)

Do the same for the physician variable.

> plot(phys,life)

Is there a negative or positive association?

> plot(1/phys,life)

> plot(log(phys),life)
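
As a rough numerical companion to the plots, the sample correlation with life expectancy can be compared on the original, reciprocal, and log scales. This is only a sketch: it assumes the television data set is attached and that tv, phys, and life contain no missing values. A correlation closer to 1 or -1 suggests a more nearly linear relationship, but it does not replace looking at the plots.

```r
# Correlation of life expectancy with tv on each scale
cor(tv, life)
cor(1/tv, life)
cor(log(tv), life)

# The same comparison for phys
cor(phys, life)
cor(1/phys, life)
cor(log(phys), life)
```

Note that taking the reciprocal reverses the direction of the association, so compare absolute values.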

Which transformation makes the relationship with life expectancy most linear?

- Use S-PLUS to fit a model with both log(tv) and log(phys) as explanatory variables.
- Use your mouse to select Statistics:Regression:Linear....
- In the Formula box type
life ~ log(tv) + log(phys)

This means "life expectancy in years is modeled as a linear function of log(tv) and log(phys)". An intercept is included by default.
- Click on the Plots tab.
- Click on the plot Residuals versus Fitted Values
- Click on OK.
- Read the Report Window and look at the graphs.
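
The menu steps above can also be approximated from the Commands Window. The following is a minimal sketch, not the exact commands the dialog runs: it assumes the television data set is attached, and the object name fit is a hypothetical choice. If the data contain missing values, lm may need an extra argument such as na.action = na.omit.

```r
# Fit the model; 'fit' is a hypothetical name for the fitted model object
fit <- lm(life ~ log(tv) + log(phys))

# Print the report, including the table of coefficients
summary(fit)

# Plot residuals versus fitted values, with a reference line at zero
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)
```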

Examine the residual plot. Do you see much of a pattern?

In the Report Window, there will be a table labeled "Coefficients" with the fitted parameter values.

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept)  90.6222      4.3557  20.8056    0.0000
  log(phys)  -2.2589      0.7474  -3.0221    0.0047
    log(tv)  -2.9156      0.5907  -4.9358    0.0000

The column headed "Value" has the estimated intercept and the estimated slope for each explanatory variable. These are statistics that can be used to describe the relationship between these variables.

The column headed "Std. Error" has the estimated standard errors of the estimated coefficients.

The column headed "t value" has the t statistic for testing the null hypothesis that the true parameter value is 0.

The column headed "Pr(>|t|)" is the two-sided p-value of the hypothesis test.
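
The last two columns can be reproduced from the first two: each t value is the estimate divided by its standard error, and the two-sided p-value is twice the upper tail area of the t distribution beyond the absolute t value. The sketch below assumes the television data set is attached and contains no missing values; the names fit, tab, and tval are hypothetical.

```r
# Refit the model; 'fit' is a hypothetical name
fit <- lm(life ~ log(tv) + log(phys))

# Coefficient table: column 1 holds the estimates, column 2 the standard errors
tab <- summary(fit)$coefficients

# t value = estimate / standard error
tval <- tab[, 1] / tab[, 2]
tval

# Two-sided p-value, using the residual degrees of freedom of the fit
2 * (1 - pt(abs(tval), fit$df.residual))
```

Using the rounded values in the table above, -2.2589 / 0.7474 is about -3.02 and -2.9156 / 0.5907 is about -4.94, which agrees with the reported t values up to rounding.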

Are both variables useful for predicting life expectancy?

Notice that television has a larger (absolute) t value and a smaller p-value.

Comment on the following conclusion.

Our model is

(life expectancy in years) = 90.6 - 2.26 log(people per physician) - 2.92 log(people per television)

Doubling the number of televisions in a poor country is cheaper than doubling the number of physicians. If we doubled the number of televisions, this would halve the number of people per television, which would change the life expectancy by -2.92 log(0.5) = about 2 years. We can increase life expectancy in poor countries by shipping lots of televisions!

Is this conclusion justified?
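
The arithmetic in the quoted conclusion can be checked in the Commands Window. This checks only the calculation, using the rounded coefficients from the model above and the natural logarithm (which is what log computes). It says nothing about whether changing the number of televisions would change life expectancy.

```r
# Change in fitted life expectancy when people per television is halved
-2.92 * log(0.5)   # about 2.02 years

# The same calculation for halving people per physician, for comparison
-2.26 * log(0.5)   # about 1.57 years
```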

- Discuss the difference between association and causation. There is a negative association between the number of people per television and life expectancy. Does this mean that one variable has a causal relationship with the other? If not, what else might explain the association?

Last modified: April 20, 2001

Bret Larget, larget@mathcs.duq.edu