Organizing data in R: MarkDown Page Without Figures

Organizing data in R
========================================================

The basic tabular data structure (rows correspond to observations, columns to variables) is called a `data.frame` in `R`.
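As a minimal sketch of this structure, the following builds a small data frame by hand. The object name `scores` and its columns are made up for illustration and do not come from any dataset discussed here.

```r
# A hypothetical data frame: 3 observations (rows) of 3 variables (columns)
scores <- data.frame(
  id    = 1:3,                        # integer identifier
  group = factor(c("a", "b", "a")),   # categorical variable stored as a factor
  score = c(2.5, 3.1, 4.0)            # numeric measurement
)

str(scores)    # compact description of each column
dim(scores)    # number of rows, then number of columns
```

Each column is a vector of a single type, and all columns have the same length, which is what makes the row/observation reading possible.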

All `R` distributions provide the `datasets` package, which contains several sample datasets; see

```r
help(package = "datasets")
```

In an interactive session this will bring up the index of help pages for the package.

An alternative is to list the names of objects in a package

```r
ls("package:datasets")
```

```
## [1] "ability.cov" "airmiles"
## [3] "AirPassengers" "airquality"
## [5] "anscombe" "attenu"
## [7] "attitude" "austres"
## [9] "beaver1" "beaver2"
## [11] "BJsales" "BJsales.lead"
## [13] "BOD" "cars"
## [15] "ChickWeight" "chickwts"
## [17] "co2" "CO2"
## [19] "crimtab" "discoveries"
## [21] "DNase" "esoph"
## [23] "euro" "euro.cross"
## [25] "eurodist" "EuStockMarkets"
## [27] "faithful" "fdeaths"
## [29] "Formaldehyde" "freeny"
## [31] "freeny.x" "freeny.y"
## [33] "HairEyeColor" "Harman23.cor"
## [35] "Harman74.cor" "Indometh"
## [37] "infert" "InsectSprays"
## [39] "iris" "iris3"
## [41] "islands" "JohnsonJohnson"
## [43] "LakeHuron" "ldeaths"
## [45] "lh" "LifeCycleSavings"
## [47] "Loblolly" "longley"
## [49] "lynx" "mdeaths"
## [51] "morley" "mtcars"
## [53] "nhtemp" "Nile"
## [55] "nottem" "occupationalStatus"
## [57] "Orange" "OrchardSprays"
## [59] "PlantGrowth" "precip"
## [61] "presidents" "pressure"
## [63] "Puromycin" "quakes"
## [65] "randu" "rivers"
## [67] "rock" "Seatbelts"
## [69] "sleep" "stack.loss"
## [71] "stack.x" "stackloss"
## [73] "state.abb" "state.area"
## [75] "state.center" "state.division"
## [77] "state.name" "state.region"
## [79] "state.x77" "sunspot.month"
## [81] "sunspot.year" "sunspots"
## [83] "swiss" "Theoph"
## [85] "Titanic" "ToothGrowth"
## [87] "treering" "trees"
## [89] "UCBAdmissions" "UKDriverDeaths"
## [91] "UKgas" "USAccDeaths"
## [93] "USArrests" "USJudgeRatings"
## [95] "USPersonalExpenditure" "uspop"
## [97] "VADeaths" "volcano"
## [99] "warpbreaks" "women"
## [101] "WorldPhones" "WWWusage"
```

or, often of more interest, list the names and a brief description of the structure

```r
ls.str("package:datasets")
```

```
## ability.cov : List of 3
## $ cov : num [1:6, 1:6] 24.64 5.99 33.52 6.02 20.75 ...
## $ center: num [1:6] 0 0 0 0 0 0
## $ n.obs : num 112
## airmiles : Time-Series [1:24] from 1937 to 1960: 412 480 683 1052 1385 ...
## AirPassengers : Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
## airquality : 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## anscombe : 'data.frame': 11 obs. of 8 variables:
## $ x1: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x2: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x3: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x4: num 8 8 8 8 8 8 8 19 8 8 ...
## $ y1: num 8.04 6.95 7.58 8.81 8.33 ...
## $ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
## $ y3: num 7.46 6.77 12.74 7.11 7.81 ...
## $ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
## attenu : 'data.frame': 182 obs. of 5 variables:
## $ event : num 1 2 2 2 2 2 2 2 2 2 ...
## $ mag : num 7 7.4 7.4 7.4 7.4 7.4 7.4 7.4 7.4 7.4 ...
## $ station: Factor w/ 117 levels "1008","1011",..: 24 13 15 68 39 74 22 1 8 55 ...
## $ dist : num 12 148 42 85 107 109 156 224 293 359 ...
## $ accel : num 0.359 0.014 0.196 0.135 0.062 0.054 0.014 0.018 0.01 0.004 ...
## attitude : 'data.frame': 30 obs. of 7 variables:
## $ rating : num 43 63 71 61 81 43 58 71 72 67 ...
## $ complaints: num 51 64 70 63 78 55 67 75 82 61 ...
## $ privileges: num 30 51 68 45 56 49 42 50 72 45 ...
## $ learning : num 39 54 69 47 66 44 56 55 67 47 ...
## $ raises : num 61 63 76 54 71 54 66 70 71 62 ...
## $ critical : num 92 73 86 84 83 49 68 66 83 80 ...
## $ advance : num 45 47 48 35 47 34 35 41 31 41 ...
## austres : Time-Series [1:89] from 1971 to 1993: 13067 13130 13198 13254 13304 ...
## beaver1 : 'data.frame': 114 obs. of 4 variables:
## $ day : num 346 346 346 346 346 346 346 346 346 346 ...
## $ time : num 840 850 900 910 920 930 940 950 1000 1010 ...
## $ temp : num 36.3 36.3 36.4 36.4 36.5 ...
## $ activ: num 0 0 0 0 0 0 0 0 0 0 ...
## beaver2 : 'data.frame': 100 obs. of 4 variables:
## $ day : num 307 307 307 307 307 307 307 307 307 307 ...
## $ time : num 930 940 950 1000 1010 1020 1030 1040 1050 1100 ...
## $ temp : num 36.6 36.7 36.9 37.1 37.2 ...
## $ activ: num 0 0 0 0 0 0 0 0 0 0 ...
## BJsales : Time-Series [1:150] from 1 to 150: 200 200 199 199 199 ...
## BJsales.lead : Time-Series [1:150] from 1 to 150: 10.01 10.07 10.32 9.75 10.33 ...
## BOD : 'data.frame': 6 obs. of 2 variables:
## $ Time : num 1 2 3 4 5 7
## $ demand: num 8.3 10.3 19 16 15.6 19.8
## cars : 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
## ChickWeight : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 578 obs. of 4 variables:
## $ weight: num 42 51 59 64 76 93 106 125 149 171 ...
## $ Time : num 0 2 4 6 8 10 12 14 16 18 ...
## $ Chick : Ord.factor w/ 50 levels "18"
## $ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## chickwts : 'data.frame': 71 obs. of 2 variables:
## $ weight: num 179 160 136 227 217 168 108 124 143 140 ...
## $ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
## co2 : Time-Series [1:468] from 1959 to 1998: 315 316 316 318 318 ...
## CO2 : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 84 obs. of 5 variables:
## $ Plant : Ord.factor w/ 12 levels "Qn1"
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
## $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
## $ conc : num 95 175 250 350 500 675 1000 95 175 250 ...
## $ uptake : num 16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
## crimtab : 'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
## discoveries : Time-Series [1:100] from 1860 to 1959: 5 3 0 2 0 3 2 3 6 1 ...
## DNase : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 176 obs. of 3 variables:
## $ Run : Ord.factor w/ 11 levels "10"
## $ conc : num 0.0488 0.0488 0.1953 0.1953 0.3906 ...
## $ density: num 0.017 0.018 0.121 0.124 0.206 0.215 0.377 0.374 0.614 0.609 ...
## esoph : 'data.frame': 88 obs. of 5 variables:
## $ agegp : Ord.factor w/ 6 levels "25-34"
## $ alcgp : Ord.factor w/ 4 levels "0-39g/day"
## $ tobgp : Ord.factor w/ 4 levels "0-9g/day"
## $ ncases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ncontrols: num 40 10 6 5 27 7 4 7 2 1 ...
## euro : Named num [1:11] 13.76 40.34 1.96 166.39 5.95 ...
## euro.cross : num [1:11, 1:11] 1 0.3411 7.0355 0.0827 2.3143 ...
## eurodist : Class 'dist' atomic [1:210] 3313 2963 3175 3339 2762 ...
## EuStockMarkets : mts [1:1860, 1:4] 1629 1614 1607 1621 1618 ...
## faithful : 'data.frame': 272 obs. of 2 variables:
## $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
## $ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
## fdeaths : Time-Series [1:72] from 1974 to 1980: 901 689 827 677 522 406 441 393 387 582 ...
## Formaldehyde : 'data.frame': 6 obs. of 2 variables:
## $ carb : num 0.1 0.3 0.5 0.6 0.7 0.9
## $ optden: num 0.086 0.269 0.446 0.538 0.626 0.782
## freeny : 'data.frame': 39 obs. of 5 variables:
## $ y : Time-Series from 1962 to 1972: 8.79 8.79 8.81 8.81 8.91 ...
## $ lag.quarterly.revenue: num 8.8 8.79 8.79 8.81 8.81 ...
## $ price.index : num 4.71 4.7 4.69 4.69 4.64 ...
## $ income.level : num 5.82 5.83 5.83 5.84 5.85 ...
## $ market.potential : num 13 13 13 13 13 ...
## freeny.x : num [1:39, 1:4] 8.8 8.79 8.79 8.81 8.81 ...
## freeny.y : Time-Series [1:39] from 1962 to 1972: 8.79 8.79 8.81 8.81 8.91 ...
## HairEyeColor : table [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
## Harman23.cor : List of 3
## $ cov : num [1:8, 1:8] 1 0.846 0.805 0.859 0.473 0.398 0.301 0.382 0.846 1 ...
## $ center: num [1:8] 0 0 0 0 0 0 0 0
## $ n.obs : num 305
## Harman74.cor : List of 3
## $ cov : num [1:24, 1:24] 1 0.318 0.403 0.468 0.321 0.335 0.304 0.332 0.326 0.116 ...
## $ center: num [1:24] 0 0 0 0 0 0 0 0 0 0 ...
## $ n.obs : num 145
## Indometh : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 66 obs. of 3 variables:
## $ Subject: Ord.factor w/ 6 levels "1"
## $ time : num 0.25 0.5 0.75 1 1.25 2 3 4 5 6 ...
## $ conc : num 1.5 0.94 0.78 0.48 0.37 0.19 0.12 0.11 0.08 0.07 ...
## infert : 'data.frame': 248 obs. of 8 variables:
## $ education : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 1 2 2 2 2 2 2 ...
## $ age : num 26 42 39 34 35 36 23 32 21 28 ...
## $ parity : num 6 1 6 4 3 4 1 2 1 2 ...
## $ induced : num 1 1 2 2 1 2 0 0 0 0 ...
## $ case : num 1 1 1 1 1 1 1 1 1 1 ...
## $ spontaneous : num 2 0 0 0 1 1 0 0 1 0 ...
## $ stratum : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pooled.stratum: num 3 1 4 2 32 36 6 22 5 19 ...
## InsectSprays : 'data.frame': 72 obs. of 2 variables:
## $ count: num 10 7 20 14 14 12 10 23 17 20 ...
## $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
## iris : 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## iris3 : num [1:50, 1:4, 1:3] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## islands : Named num [1:48] 11506 5500 16988 2968 16 ...
## JohnsonJohnson : Time-Series [1:84] from 1960 to 1981: 0.71 0.63 0.85 0.44 0.61 0.69 0.92 0.55 0.72 0.77 ...
## LakeHuron : Time-Series [1:98] from 1875 to 1972: 580 582 581 581 580 ...
## ldeaths : Time-Series [1:72] from 1974 to 1980: 3035 2552 2704 2554 2014 ...
## lh : Time-Series [1:48] from 1 to 48: 2.4 2.4 2.4 2.2 2.1 1.5 2.3 2.3 2.5 2 ...
## LifeCycleSavings : 'data.frame': 50 obs. of 5 variables:
## $ sr : num 11.43 12.07 13.17 5.75 12.88 ...
## $ pop15: num 29.4 23.3 23.8 41.9 42.2 ...
## $ pop75: num 2.87 4.41 4.43 1.67 0.83 2.85 1.34 0.67 1.06 1.14 ...
## $ dpi : num 2330 1508 2108 189 728 ...
## $ ddpi : num 2.87 3.93 3.82 0.22 4.56 2.43 2.67 6.51 3.08 2.8 ...
## Loblolly : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 84 obs. of 3 variables:
## $ height: num 4.51 10.89 28.72 41.74 52.7 ...
## $ age : num 3 5 10 15 20 25 3 5 10 15 ...
## $ Seed : Ord.factor w/ 14 levels "329"
## longley : 'data.frame': 16 obs. of 7 variables:
## $ GNP.deflator: num 83 88.5 88.2 89.5 96.2 ...
## $ GNP : num 234 259 258 285 329 ...
## $ Unemployed : num 236 232 368 335 210 ...
## $ Armed.Forces: num 159 146 162 165 310 ...
## $ Population : num 108 109 110 111 112 ...
## $ Year : int 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 ...
## $ Employed : num 60.3 61.1 60.2 61.2 63.2 ...
## lynx : Time-Series [1:114] from 1821 to 1934: 269 321 585 871 1475 ...
## mdeaths : Time-Series [1:72] from 1974 to 1980: 2134 1863 1877 1877 1492 ...
## morley : 'data.frame': 100 obs. of 3 variables:
## $ Expt : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Run : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Speed: int 850 740 900 1070 930 850 950 980 980 880 ...
## mtcars : 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## nhtemp : Time-Series [1:60] from 1912 to 1971: 49.9 52.3 49.4 51.1 49.4 47.9 49.8 50.9 49.3 51.9 ...
## Nile : Time-Series [1:100] from 1871 to 1970: 1120 1160 963 1210 1160 1160 813 1230 1370 1140 ...
## nottem : Time-Series [1:240] from 1920 to 1940: 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5 ...
## occupationalStatus : 'table' int [1:8, 1:8] 50 16 12 11 2 12 0 0 19 40 ...
## Orange : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 35 obs. of 3 variables:
## $ Tree : Ord.factor w/ 5 levels "3"
## $ age : num 118 484 664 1004 1231 ...
## $ circumference: num 30 58 87 115 120 142 145 33 69 111 ...
## OrchardSprays : 'data.frame': 64 obs. of 4 variables:
## $ decrease : num 57 95 8 69 92 90 15 2 84 6 ...
## $ rowpos : num 1 2 3 4 5 6 7 8 1 2 ...
## $ colpos : num 1 1 1 1 1 1 1 1 2 2 ...
## $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ...
## PlantGrowth : 'data.frame': 30 obs. of 2 variables:
## $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
## $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
## precip : Named num [1:70] 67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...
## presidents : Time-Series [1:120] from 1945 to 1975: NA 87 82 75 63 50 43 32 35 60 ...
## pressure : 'data.frame': 19 obs. of 2 variables:
## $ temperature: num 0 20 40 60 80 100 120 140 160 180 ...
## $ pressure : num 0.0002 0.0012 0.006 0.03 0.09 0.27 0.75 1.85 4.2 8.8 ...
## Puromycin : 'data.frame': 23 obs. of 3 variables:
## $ conc : num 0.02 0.02 0.06 0.06 0.11 0.11 0.22 0.22 0.56 0.56 ...
## $ rate : num 76 47 97 107 123 139 159 152 191 201 ...
## $ state: Factor w/ 2 levels "treated","untreated": 1 1 1 1 1 1 1 1 1 1 ...
## quakes : 'data.frame': 1000 obs. of 5 variables:
## $ lat : num -20.4 -20.6 -26 -18 -20.4 ...
## $ long : num 182 181 184 182 182 ...
## $ depth : int 562 650 42 626 649 195 82 194 211 622 ...
## $ mag : num 4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
## $ stations: int 41 15 43 19 11 12 43 15 35 19 ...
## randu : 'data.frame': 400 obs. of 3 variables:
## $ x: num 0.000031 0.044495 0.82244 0.322291 0.393595 ...
## $ y: num 0.000183 0.155732 0.873416 0.648545 0.826873 ...
## $ z: num 0.000824 0.533939 0.838542 0.990648 0.418881 ...
## rivers : num [1:141] 735 320 325 392 524 ...
## rock : 'data.frame': 48 obs. of 4 variables:
## $ area : int 4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
## $ peri : num 2792 3893 3931 3869 3949 ...
## $ shape: num 0.0903 0.1486 0.1833 0.1171 0.1224 ...
## $ perm : num 6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
## Seatbelts : mts [1:192, 1:8] 107 97 102 87 119 106 110 106 107 134 ...
## sleep : 'data.frame': 20 obs. of 3 variables:
## $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
## $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## stack.loss : num [1:21] 42 37 37 28 18 18 19 20 15 14 ...
## stack.x : num [1:21, 1:3] 80 80 75 62 62 62 62 62 58 58 ...
## stackloss : 'data.frame': 21 obs. of 4 variables:
## $ Air.Flow : num 80 80 75 62 62 62 62 62 58 58 ...
## $ Water.Temp: num 27 27 25 24 22 23 24 24 23 18 ...
## $ Acid.Conc.: num 89 88 90 87 87 87 93 93 87 80 ...
## $ stack.loss: num 42 37 37 28 18 18 19 20 15 14 ...
## state.abb : chr [1:50] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" ...
## state.area : num [1:50] 51609 589757 113909 53104 158693 ...
## state.center : List of 2
## $ x: num [1:50] -86.8 -127.2 -111.6 -92.3 -119.8 ...
## $ y: num [1:50] 32.6 49.2 34.2 34.7 36.5 ...
## state.division : Factor w/ 9 levels "New England",..: 4 9 8 5 9 8 1 3 3 3 ...
## state.name : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## state.region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## state.x77 : num [1:50, 1:8] 3615 365 2212 2110 21198 ...
## sunspot.month : Time-Series [1:2988] from 1749 to 1998: 58 62.6 70 55.7 85 83.5 94.8 66.3 75.9 75.5 ...
## sunspot.year : Time-Series [1:289] from 1700 to 1988: 5 11 16 23 36 58 29 20 10 8 ...
## sunspots : Time-Series [1:2820] from 1749 to 1984: 58 62.6 70 55.7 85 83.5 94.8 66.3 75.9 75.5 ...
## swiss : 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
## Theoph : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 132 obs. of 5 variables:
## $ Subject: Ord.factor w/ 12 levels "6"
## $ Wt : num 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 ...
## $ Dose : num 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 ...
## $ Time : num 0 0.25 0.57 1.12 2.02 ...
## $ conc : num 0.74 2.84 6.57 10.5 9.66 8.58 8.36 7.47 6.89 5.94 ...
## Titanic : table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## ToothGrowth : 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## treering : Time-Series [1:7980] from -6000 to 1979: 1.34 1.08 1.54 1.32 1.41 ...
## trees : 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
## UCBAdmissions : table [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
## UKDriverDeaths : Time-Series [1:192] from 1969 to 1985: 1687 1508 1507 1385 1632 ...
## UKgas : Time-Series [1:108] from 1960 to 1987: 160.1 129.7 84.8 120.1 160.1 ...
## USAccDeaths : Time-Series [1:72] from 1973 to 1979: 9007 8106 8928 9137 10017 ...
## USArrests : 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
## USJudgeRatings : 'data.frame': 43 obs. of 12 variables:
## $ CONT: num 5.7 6.8 7.2 6.8 7.3 6.2 10.6 7 7.3 8.2 ...
## $ INTG: num 7.9 8.9 8.1 8.8 6.4 8.8 9 5.9 8.9 7.9 ...
## $ DMNR: num 7.7 8.8 7.8 8.5 4.3 8.7 8.9 4.9 8.9 6.7 ...
## $ DILG: num 7.3 8.5 7.8 8.8 6.5 8.5 8.7 5.1 8.7 8.1 ...
## $ CFMG: num 7.1 7.8 7.5 8.3 6 7.9 8.5 5.4 8.6 7.9 ...
## $ DECI: num 7.4 8.1 7.6 8.5 6.2 8 8.5 5.9 8.5 8 ...
## $ PREP: num 7.1 8 7.5 8.7 5.7 8.1 8.5 4.8 8.4 7.9 ...
## $ FAMI: num 7.1 8 7.5 8.7 5.7 8 8.5 5.1 8.4 8.1 ...
## $ ORAL: num 7.1 7.8 7.3 8.4 5.1 8 8.6 4.7 8.4 7.7 ...
## $ WRIT: num 7 7.9 7.4 8.5 5.3 8 8.4 4.9 8.5 7.8 ...
## $ PHYS: num 8.3 8.5 7.9 8.8 5.5 8.6 9.1 6.8 8.8 8.5 ...
## $ RTEN: num 7.8 8.7 7.8 8.7 4.8 8.6 9 5 8.8 7.9 ...
## USPersonalExpenditure : num [1:5, 1:5] 22.2 10.5 3.53 1.04 0.341 44.5 15.5 5.76 1.98 0.974 ...
## uspop : Time-Series [1:19] from 1790 to 1970: 3.93 5.31 7.24 9.64 12.9 17.1 23.2 31.4 39.8 50.2 ...
## VADeaths : num [1:5, 1:4] 11.7 18.1 26.9 41 66 8.7 11.7 20.3 30.9 54.3 ...
## volcano : num [1:87, 1:61] 100 101 102 103 104 105 105 106 107 108 ...
## warpbreaks : 'data.frame': 54 obs. of 3 variables:
## $ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
## $ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
## $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
## women : 'data.frame': 15 obs. of 2 variables:
## $ height: num 58 59 60 61 62 63 64 65 66 67 ...
## $ weight: num 115 117 120 123 126 129 132 135 139 142 ...
## WorldPhones : num [1:7, 1:7] 45939 60423 64721 68484 71799 ...
## WWWusage : Time-Series [1:100] from 1 to 100: 88 84 85 85 84 85 83 85 88 89 ...
```

When examining a new `R` package, `ls.str` is a good way to begin.

Note that in the calls to `ls` and `ls.str` the package name is given as a character string `"package:datasets"`. This convention is also used in describing which packages are attached in a session.

```r
sessionInfo()
```

```
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.3
##
## loaded via a namespace (and not attached):
## [1] codetools_0.2-8 digest_0.6.3 evaluate_0.4.4 formatR_0.8
## [5] stringr_0.6.2 tools_3.0.1
```
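The `"package:..."` strings themselves appear on the search path, which `search()` returns as a character vector. The sketch below assumes a default session in which `datasets` is attached; the variable name `attached` is only illustrative.

```r
# The search path lists attached environments, packages as "package:<name>"
attached <- search()
print(attached)

# Check whether a particular package is attached in this session
"package:datasets" %in% attached
```

Any name printed here in the `package:` form can be passed to `ls` or `ls.str` in the same way as `"package:datasets"`.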

## Initial examination of data

The `str` function and the data set's help page, if it exists, are where I begin examining data

```r
str(ToothGrowth)
```

```
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
```

We see that `supp`, the type of supplement, is a factor, as it should be, and both `dose` and `len`, the response, are numeric. It looks as if `dose` may have only a few levels

```r
xtabs(~dose, ToothGrowth)
```

```
## dose
## 0.5 1 2
## 20 20 20
```

and, indeed, these data are typical textbook data from a small, carefully balanced experiment.

```r
xtabs(~supp + dose, ToothGrowth)
```

```
## dose
## supp 0.5 1 2
## OJ 10 10 10
## VC 10 10 10
```

## Visualization of `ToothGrowth`

I usually start with the `lattice` graphics package for visualization because I am familiar with it.

```r
library(lattice)
```

The `ggplot2` package is widely used and deservedly so. I do not recommend using the base graphics capabilities.

The `ToothGrowth` data consist of a numeric response, `len`, one categorical covariate, `supp`, and one covariate, `dose`, that could be considered numeric or categorical.

If we want to consider `dose` on a continuous scale we could create an interaction plot (`type=c("g","p","a")`)
[Figure omitted: chunk `interactionplot`]
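The code for this chunk is not shown on the page. One way such an interaction plot might be drawn with `lattice` is sketched below; the grouping by `supp`, the axis labels and the legend settings are assumptions, and only the `type` argument is taken from the text.

```r
library(lattice)

# Interaction plot: grid ("g"), points ("p") and a line joining the
# averages ("a") at each dose, drawn separately for each supplement type
p1 <- xyplot(len ~ dose, data = ToothGrowth, groups = supp,
             type = c("g", "p", "a"),
             auto.key = list(columns = 2, lines = TRUE),
             xlab = "Dose", ylab = "Tooth length")
print(p1)   # lattice objects must be printed to be drawn in a script
```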

The shape of the curves (and choice of levels) indicates that the logarithm of the dose may be a better scale.
[Figure omitted: chunk `interaction2`]
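A logarithmic dose axis can be requested through the `scales` argument. This is a sketch rather than the original chunk: the base-2 logarithm is an assumption, chosen because the dose levels 0.5, 1 and 2 are then equally spaced.

```r
library(lattice)

# Same interaction plot, with dose shown on a log (base 2) scale
p2 <- xyplot(len ~ dose, data = ToothGrowth, groups = supp,
             type = c("g", "p", "a"),
             scales = list(x = list(log = 2)),
             auto.key = list(columns = 2, lines = TRUE),
             xlab = "Dose (log scale)", ylab = "Tooth length")
print(p2)
```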

The only problem with this plot is that it wastes space on the horizontal axis. An alternative is to use the horizontal axis for the response, as in, for example, boxplots

[Two figures omitted: chunk `bwplot`]
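The original chunk produced two figures whose code is not shown. The sketch below gives two plausible boxplot layouts with the response on the horizontal axis; which variable is used for the panels in each is an assumption.

```r
library(lattice)

# Response on the horizontal axis, dose as the categorical vertical axis,
# one panel per supplement type stacked vertically
b1 <- bwplot(factor(dose) ~ len | supp, data = ToothGrowth,
             layout = c(1, 2), xlab = "Tooth length", ylab = "Dose")

# Alternative: supplement type on the vertical axis, one panel per dose
b2 <- bwplot(supp ~ len | factor(dose), data = ToothGrowth,
             layout = c(1, 3), xlab = "Tooth length")

print(b1)
print(b2)
```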

or dotplots

[Three figures omitted: chunk `dotplots`]
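With only ten observations per cell, a dotplot shows every point. The following is a sketch of one such plot, not a reconstruction of the three omitted figures; the grouping and the line joining the averages are assumptions.

```r
library(lattice)

# Every observation is shown; type "a" adds a line through the group averages
d1 <- dotplot(factor(dose) ~ len, data = ToothGrowth, groups = supp,
              type = c("p", "a"),
              auto.key = list(columns = 2, lines = TRUE),
              xlab = "Tooth length", ylab = "Dose")
print(d1)
```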

or comparative density plots
[Two figures omitted: chunk `densityplots`]
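A comparative density plot overlays one density estimate per group. The sketch below assumes panels by dose and curves by supplement type; the original chunk may have arranged the comparison differently.

```r
library(lattice)

# One panel per dose, one density curve per supplement type
dens <- densityplot(~ len | factor(dose), data = ToothGrowth, groups = supp,
                    layout = c(1, 3),
                    auto.key = list(columns = 2, lines = TRUE),
                    xlab = "Tooth length")
print(dens)
```

With ten observations per curve the estimates are rough, so these plots are best read alongside the dotplots or boxplots.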

## Reading data over the Internet

One can give a URL instead of a file name as an argument to functions such as `read.csv` and `read.delim`. Consider the data at http://www-personal.umich.edu/~bwest/classroom.csv

```r
str(class <- read.csv("http://www-personal.umich.edu/~bwest/classroom.csv"))
```

```
## 'data.frame': 1190 obs. of 12 variables:
## $ sex : int 1 0 1 0 0 1 0 0 1 0 ...
## $ minority: int 1 1 1 1 1 1 1 1 1 1 ...
## $ mathkind: int 448 460 511 449 425 450 452 443 422 480 ...
## $ mathgain: int 32 109 56 83 53 65 51 66 88 -7 ...
## $ ses : num 0.46 -0.27 -0.03 -0.38 -0.03 0.76 -0.03 0.2 0.64 0.13 ...
## $ yearstea: num 1 1 1 2 2 2 2 2 2 2 ...
## $ mathknow: num NA NA NA -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 ...
## $ housepov: num 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 ...
## $ mathprep: num 2 2 2 3.25 3.25 3.25 3.25 3.25 3.25 3.25 ...
## $ classid : int 160 160 160 217 217 217 217 217 217 217 ...
## $ schoolid: int 1 1 1 1 1 1 1 1 1 1 ...
## $ childid : int 1 2 3 4 5 6 7 8 9 10 ...
```

Data sets like this use artificial numeric coding of variables that are in fact categorical. If we summarize these data

```r
summary(class)
```

```
## sex minority mathkind mathgain
## Min. :0.000 Min. :0.000 Min. :290 Min. :-110.0
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:439 1st Qu.: 35.0
## Median :1.000 Median :1.000 Median :466 Median : 56.0
## Mean :0.506 Mean :0.677 Mean :467 Mean : 57.6
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:495 3rd Qu.: 77.0
## Max. :1.000 Max. :1.000 Max. :629 Max. : 253.0
##
## ses yearstea mathknow housepov
## Min. :-1.610 Min. : 0.0 Min. :-2.50 Min. :0.012
## 1st Qu.:-0.490 1st Qu.: 4.0 1st Qu.:-0.72 1st Qu.:0.085
## Median :-0.030 Median :10.0 Median :-0.13 Median :0.127
## Mean :-0.013 Mean :12.2 Mean : 0.03 Mean :0.178
## 3rd Qu.: 0.398 3rd Qu.:20.0 3rd Qu.: 0.85 3rd Qu.:0.255
## Max. : 3.210 Max. :40.0 Max. : 2.61 Max. :0.564
## NA's :109
## mathprep classid schoolid childid
## Min. :1.00 Min. : 1 Min. : 1.0 Min. : 1
## 1st Qu.:2.00 1st Qu.: 80 1st Qu.: 26.0 1st Qu.: 298
## Median :2.30 Median :157 Median : 54.0 Median : 596
## Mean :2.61 Mean :158 Mean : 52.9 Mean : 596
## 3rd Qu.:3.00 3rd Qu.:239 3rd Qu.: 79.0 3rd Qu.: 893
## Max. :6.00 Max. :312 Max. :107.0 Max. :1190
##
```

we get nonsensical numerical summaries of characteristics like `sex`. We should change these variables to factors.

```r
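class <- within(class, {
    # codes 0/1 become labelled levels, matching the output below
    sex <- factor(sex, labels = c("M", "F"))
    minority <- factor(minority, labels = c("N", "Y"))
    # identifiers are categorical, not numeric
    classid <- factor(classid)
    schoolid <- factor(schoolid)
    childid <- factor(childid)
})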
str(class)
```

```
## 'data.frame': 1190 obs. of 12 variables:
## $ sex : Factor w/ 2 levels "M","F": 2 1 2 1 1 2 1 1 2 1 ...
## $ minority: Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ mathkind: int 448 460 511 449 425 450 452 443 422 480 ...
## $ mathgain: int 32 109 56 83 53 65 51 66 88 -7 ...
## $ ses : num 0.46 -0.27 -0.03 -0.38 -0.03 0.76 -0.03 0.2 0.64 0.13 ...
## $ yearstea: num 1 1 1 2 2 2 2 2 2 2 ...
## $ mathknow: num NA NA NA -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 ...
## $ housepov: num 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 ...
## $ mathprep: num 2 2 2 3.25 3.25 3.25 3.25 3.25 3.25 3.25 ...
## $ classid : Factor w/ 312 levels "1","2","3","4",..: 160 160 160 217 217 217 217 217 217 217 ...
## $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ childid : Factor w/ 1190 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
```

```r
summary(class)
```

```
## sex minority mathkind mathgain ses
## M:588 N:384 Min. :290 Min. :-110.0 Min. :-1.610
## F:602 Y:806 1st Qu.:439 1st Qu.: 35.0 1st Qu.:-0.490
## Median :466 Median : 56.0 Median :-0.030
## Mean :467 Mean : 57.6 Mean :-0.013
## 3rd Qu.:495 3rd Qu.: 77.0 3rd Qu.: 0.398
## Max. :629 Max. : 253.0 Max. : 3.210
##
## yearstea mathknow housepov mathprep
## Min. : 0.0 Min. :-2.50 Min. :0.012 Min. :1.00
## 1st Qu.: 4.0 1st Qu.:-0.72 1st Qu.:0.085 1st Qu.:2.00
## Median :10.0 Median :-0.13 Median :0.127 Median :2.30
## Mean :12.2 Mean : 0.03 Mean :0.178 Mean :2.61
## 3rd Qu.:20.0 3rd Qu.: 0.85 3rd Qu.:0.255 3rd Qu.:3.00
## Max. :40.0 Max. : 2.61 Max. :0.564 Max. :6.00
## NA's :109
## classid schoolid childid
## 26 : 10 11 : 31 1 : 1
## 42 : 10 12 : 27 2 : 1
## 13 : 9 71 : 27 3 : 1
## 189 : 9 76 : 27 4 : 1
## 205 : 9 77 : 24 5 : 1
## 253 : 9 31 : 22 6 : 1
## (Other):1134 (Other):1032 (Other):1184
```

The `childid` variable is redundant but there is no harm in retaining it.

For a categorical variable the summary is a frequency table. If the number of levels is large, the ones with the largest counts are listed first. Thus the largest number of students sampled from a single class is 10. To look at the distribution of the counts we can apply `xtabs` twice.

```r
xtabs(~xtabs(~classid, class))
```

```
## xtabs(~classid, class)
## 1 2 3 4 5 6 7 8 9 10
## 42 53 53 61 39 31 14 13 4 2
```

Out of the 312 classrooms, 42 have only one student in the study, whose purpose is to determine the effects of teacher training on student performance.

### Class-specific and school-specific variables

Many of the variables are characteristics of teachers and should be constant within a class. We should check that this is true.

```r
str(classvars <- unique(subset(class, select = c("yearstea", "mathknow", "housepov",
                                                 "mathprep", "classid", "schoolid"))))
```

```
## 'data.frame': 312 obs. of 6 variables:
## $ yearstea: num 1 2 1 2 12.5 ...
## $ mathknow: num NA -0.11 -1.25 -0.72 NA 0.45 0.99 1.61 1.14 -1.05 ...
## $ housepov: num 0.082 0.082 0.082 0.082 0.082 0.086 0.086 0.086 0.086 0.365 ...
## $ mathprep: num 2 3.25 2.5 2.33 2.3 3.83 2.25 3 2.17 2 ...
## $ classid : Factor w/ 312 levels "1","2","3","4",..: 160 217 197 211 307 11 137 145 228 48 ...
## $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 1 2 2 2 3 3 3 3 4 ...
```

```r
summary(classvars)
```

```
## yearstea mathknow housepov mathprep
## Min. : 0.0 Min. :-2.50 Min. :0.012 Min. :1.00
## 1st Qu.: 4.0 1st Qu.:-0.76 1st Qu.:0.085 1st Qu.:2.00
## Median :10.0 Median :-0.19 Median :0.142 Median :2.30
## Mean :12.3 Mean :-0.08 Mean :0.191 Mean :2.58
## 3rd Qu.:20.0 3rd Qu.: 0.62 3rd Qu.:0.263 3rd Qu.:3.00
## Max. :40.0 Max. : 2.61 Max. :0.564 Max. :6.00
## NA's :27
## classid schoolid
## 1 : 1 11 : 9
## 2 : 1 12 : 5
## 3 : 1 15 : 5
## 4 : 1 17 : 5
## 5 : 1 33 : 5
## 6 : 1 46 : 5
## (Other):306 (Other):278
```

```r
xtabs(~xtabs(~schoolid, classvars))
```

```
## xtabs(~schoolid, classvars)
## 1 2 3 4 5 9
## 13 34 26 21 12 1
```

The important information from the summary is that there are 312 rows in this data frame, corresponding to the 312 classes. If any of the other variables were not constant within class, we would have a greater number of rows.
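The logic of this check can be seen on a toy example. The data frame `toy` below is entirely made up: class 2 deliberately has two different values of `yearstea`, so the reduced data frame has more rows than there are classes.

```r
# Hypothetical data: 5 students in 3 classes; yearstea varies within class 2
toy <- data.frame(
  classid  = factor(c(1, 1, 2, 2, 3)),
  yearstea = c(5, 5, 10, 12, 3),
  student  = 1:5
)

# Keep only the class-level columns and drop duplicated rows
toyvars <- unique(subset(toy, select = c("yearstea", "classid")))

nrow(toyvars)           # 4 rows ...
nlevels(toy$classid)    # ... but only 3 classes, so something is inconsistent

# Classes that appear more than once are the inconsistent ones
names(which(table(toyvars$classid) > 1))
```

When the two counts agree, as they do for the 312 classes above, every selected variable is constant within class.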

We also see that the number of classes sampled per school is highly unbalanced and a large proportion of the schools have only one or two classes sampled.

A check on the school-specific variables shows they are consistent

```r
str(schoolvars <- unique(subset(classvars, select = c("housepov", "schoolid"))))
```

```
## 'data.frame': 107 obs. of 2 variables:
## $ housepov: num 0.082 0.082 0.086 0.365 0.511 0.044 0.148 0.085 0.537 0.346 ...
## $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
```

[Two figures omitted: chunk `housepovdens`]