Introduction to Biostatistics
Exploratory Data Analysis I: Histograms, Means, and Medians
This lab assumes that you already know how to:
- log in to the system computers;
- use a browser to find the course Web page;
- start the S-PLUS software;
- and move back and forth between the browser and S-PLUS.
This lab will teach you to:
- enter data directly into S-PLUS;
- enter data into a simple text editor to read into S-PLUS;
- read in a prepared data set from the course Web page;
- use S-PLUS to draw histograms;
- use S-PLUS to calculate means and medians.
In this lab you should learn to:
- estimate the mean and the median of a distribution from a histogram;
- understand the qualitative differences
between the mean and the median;
- use a histogram to describe the shape of a distribution;
- use a histogram to identify skewness in a distribution.
- Entering data directly into S-PLUS.
The following data is from Exercise 2-4 on page 35
of the textbook and represents the time until death in hours
for thirteen sheep that were fed a toxic weed as part of an experiment.
44 27 24 24 36 36 44 44 120 29 36 36 36
Follow these steps to create a variable named deathTime with this data.
- Open a Commands Window by clicking on the button with the ">x" symbol
if you do not already have one open.
- In the Commands Window, create a variable called deathTime with the scan() function.
You can put spaces or single carriage returns between numbers.
A carriage return on a blank line ends the input.
See the example below. (Some of the characters below are computer output.)
> deathTime <- scan()
1: 44 27 24 24 36
6: 36 44 44 120 29
11: 36 36 36
- Calculating means and medians in the Commands Window.
You can calculate the mean and median with the mean() and median() functions.
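As a minimal sketch of this step, the commands below compute both summaries for the sheep data. The values are entered with c() rather than scan() only so that the example can be pasted in one piece; that choice is an assumption of this sketch, not the method described above. The lines are written for S-PLUS and are also valid R.

```r
# The thirteen death times from above, entered with c() so that
# no interactive input is needed.
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

mean(deathTime)     # arithmetic average, about 41.23
median(deathTime)   # middle value of the sorted data, 36
```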
- Creating a file and reading the file into S-PLUS.
Exercise 2-5 on page 35 contains ten columns of ten cholinesterase indices.
Assume that the first column contains measurements from men and that the second column
contains measurements from women.
Ignore the final eight columns of data.
You can enter this data with two variables into a file to read into S-PLUS
following these steps.
- Click on the Start Button and select Programs:Accessories:NotePad.
- Enter the data into the file including a header row with the variable names.
- Save this file to the Desktop
by selecting Save As... from the File menu and giving the file a name.
- Import the data into S-PLUS by selecting Import Data from the File menu.
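The Import Data menu is the method this lab uses. As a command-line alternative, the sketch below reads a small file of the same layout with read.table(). The variable names sex and index, the data frame name cholData, and every value in the file are hypothetical placeholders, not the textbook data. The file is written to a temporary location only so that the example is self-contained.

```r
# Write a tiny placeholder file: a header row, then one row per measurement.
# All values are made up for illustration; they are NOT from Exercise 2-5.
cholFile <- tempfile()
cat("sex index", "M 1.0", "M 2.0", "F 3.0", "F 5.0", "",
    sep = "\n", file = cholFile)

# header=T tells read.table() that the first row holds the variable names.
cholData <- read.table(cholFile, header = T)
cholData
```

With a file saved from NotePad, the call would use that file's path in place of cholFile.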
- Calculating Means and Medians in the Commands Window.
To refer to variables in a data set by name, you need to attach the data set.
You can find the mean of all the index measurements.
You can find the mean of the index measurements separately
for males and females.
The square brackets select the subset of the index variable
for which the logical statement inside is true.
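The commands for this step might look like the sketch below. The data frame cholData, its variables sex and index, and all of its values are hypothetical stand-ins for the imported data set; substitute the names used in your own file.

```r
# Hypothetical stand-in for the imported data set; values are made up.
cholData <- data.frame(sex = c("M", "M", "F", "F"),
                       index = c(1, 2, 3, 5))

attach(cholData)           # makes sex and index available by name

mean(index)                # all measurements: 2.75
mean(index[sex == "M"])    # males only: 1.5
mean(index[sex == "F"])    # females only: 4
median(index[sex == "F"])  # median for females: 4
```

Here sex == "M" is the logical statement, and index[sex == "M"] keeps only the index values for which it is true.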
- Reading in data from the Web page.
Find the HARVEST data set on the course Web page and save it to the Desktop.
Import the data into S-PLUS.
- Using S-PLUS to draw a histogram.
A histogram is a bar graph
for displaying the distribution
of a single quantitative variable.
Make a histogram in S-PLUS following these steps.
- Click on the ``2D Plots'' button,
which is on the ruler and has a small picture with a bar graph
and a jagged line.
This opens up the Plots2D palette.
- Click the histogram button, which has a picture of a little histogram.
A graphics window and a dialog box will open.
- Click the little arrow next to ``Data Set''
and then click on the name of the data frame where your variable is.
- Click on the little arrow next to ``x Column(s)''
and then click on the name of the variable.
- Finally, click on the OK button.
The default number of bars is often a poor choice.
You can follow these steps to make a better graph.
- Complete the first four steps above.
- Click on the ``Options'' tab.
- Change ``Number of bars'' from ``Auto'' to a number, such as 15.
- If the variable is integer valued, select ``Integer'' instead of ``Continuous''.
- Click on the OK button.
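The menu steps above have a command-line counterpart in the hist() function. The sketch below is an assumption about how one might do this from the Commands Window and is not part of the lab's menu instructions. It reuses the deathTime values from earlier, and nclass requests an approximate number of bars.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

hist(deathTime)               # default choice of bars
hist(deathTime, nclass = 10)  # ask for roughly 10 bars instead
```

The software treats nclass as a suggestion, so the number of bars drawn may differ slightly from the number requested.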
- Interpreting histograms.
The center of a histogram may be described in two ways.
The median is the location that divides the shaded area
of the histogram in half.
The mean is the location at which the histogram would
balance if the histogram were made from a uniform solid material.
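The two descriptions can be checked numerically. The sketch below applies the same ideas to the raw deathTime values rather than to a drawn histogram: deviations from the mean cancel out (the balance point), and equal numbers of observations fall on either side of the median (the halving point).

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# Balance point: deviations from the mean sum to zero (up to rounding).
sum(deathTime - mean(deathTime))

# Halving point: as many observations lie below the median as above it.
sum(deathTime < median(deathTime))   # 4 values below 36
sum(deathTime > median(deathTime))   # 4 values above 36
```

The remaining five observations are equal to the median itself.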
If a histogram looks similar to its mirror image,
we say the histogram is symmetrical.
If the left half of the data is more spread out than the right half of the data,
we say that the distribution is skewed to the left.
Similarly, if the right half of the data is more spread out than the left half of the data,
we say that the distribution is skewed to the right.
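Skewness affects the mean and the median differently. As an illustration, the sketch below uses the deathTime values from earlier, where a single large value (120) stretches the right half of the data. The name trimmed and the cutoff of 100 are choices made for this example only.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

mean(deathTime)      # about 41.23, pulled toward the long right tail
median(deathTime)    # 36

# Drop the one extreme value (120) and compare again.
trimmed <- deathTime[deathTime < 100]
mean(trimmed)        # about 34.67
median(trimmed)      # still 36
```

The mean moves by more than six hours when one observation is removed, while the median does not move at all.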
Make histograms of the variables SBPCB, DBPCB, and HRCB.
Which is most symmetrical?
Which is skewed to the right?
Which is skewed to the left?
- Calculating means and medians when there are missing values.
The HARVEST data set includes many missing values,
because not every individual was measured at every time point in the study,
and for some individuals, smoking or exercise information was not collected.
Missing data is represented by the code ``NA'' in S-PLUS.
If you ask S-PLUS to calculate the mean or median of a variable
that includes missing data, it gives ``NA'' as the result.
You can override this behavior with the option na.rm=T,
which removes missing values before the calculation.
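A minimal sketch of this behavior follows. The vector x and its values are hypothetical and are not taken from the HARVEST data set.

```r
# Hypothetical measurements with one missing value; not from HARVEST.
x <- c(120, NA, 130, 110)

mean(x)                # NA, because one value is missing
mean(x, na.rm = T)     # 120, the mean of the three observed values
median(x, na.rm = T)   # 120
```

The same option works on a variable from an attached data set, with the variable name in place of x.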
Print this answer sheet to record your answers.
- Obtain the heights (in inches) of fifteen male and fifteen female students.
Create a text file (using NotePad or another text editor)
with this information
in a format ready to read into S-PLUS.
The file should begin with a header row containing the variable names.
- Read the data into S-PLUS
and use S-PLUS to calculate the mean and median heights
of all thirty individuals combined
as well as for the males and females separately.
- Load the cereal data set into S-PLUS.
Find a variable that is skewed to the right
by plotting its histogram.
Calculate the mean and median of this variable.
Is the mean larger than the median as expected?
Further S-PLUS help is available in this
Last modified: January 4, 2001