Organizing data in R

The basic tabular data structure (rows correspond to observations, columns to variables) is called a data.frame in R.

All R distributions provide the datasets package, which contains many sample data sets; see

help(package = "datasets")

In an interactive session this will bring up the index of help pages for the package.
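A related listing that also works in a non-interactive session, offered as one option rather than as the author's method, uses the data function from the utils package. Its result carries a one-line title for each data set. The object name idx is only illustrative.

```r
# data() with a package argument returns an index of that package's data sets
idx <- data(package = "datasets")

# The results component is a character matrix; Item and Title are two of its columns
head(idx$results[, c("Item", "Title")])
```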

An alternative is to list the names of objects in a package

ls("package:datasets")
##   [1] "ability.cov"           "airmiles"             
##   [3] "AirPassengers"         "airquality"           
##   [5] "anscombe"              "attenu"               
##   [7] "attitude"              "austres"              
##   [9] "beaver1"               "beaver2"              
##  [11] "BJsales"               "BJsales.lead"         
##  [13] "BOD"                   "cars"                 
##  [15] "ChickWeight"           "chickwts"             
##  [17] "co2"                   "CO2"                  
##  [19] "crimtab"               "discoveries"          
##  [21] "DNase"                 "esoph"                
##  [23] "euro"                  "euro.cross"           
##  [25] "eurodist"              "EuStockMarkets"       
##  [27] "faithful"              "fdeaths"              
##  [29] "Formaldehyde"          "freeny"               
##  [31] "freeny.x"              "freeny.y"             
##  [33] "HairEyeColor"          "Harman23.cor"         
##  [35] "Harman74.cor"          "Indometh"             
##  [37] "infert"                "InsectSprays"         
##  [39] "iris"                  "iris3"                
##  [41] "islands"               "JohnsonJohnson"       
##  [43] "LakeHuron"             "ldeaths"              
##  [45] "lh"                    "LifeCycleSavings"     
##  [47] "Loblolly"              "longley"              
##  [49] "lynx"                  "mdeaths"              
##  [51] "morley"                "mtcars"               
##  [53] "nhtemp"                "Nile"                 
##  [55] "nottem"                "occupationalStatus"   
##  [57] "Orange"                "OrchardSprays"        
##  [59] "PlantGrowth"           "precip"               
##  [61] "presidents"            "pressure"             
##  [63] "Puromycin"             "quakes"               
##  [65] "randu"                 "rivers"               
##  [67] "rock"                  "Seatbelts"            
##  [69] "sleep"                 "stack.loss"           
##  [71] "stack.x"               "stackloss"            
##  [73] "state.abb"             "state.area"           
##  [75] "state.center"          "state.division"       
##  [77] "state.name"            "state.region"         
##  [79] "state.x77"             "sunspot.month"        
##  [81] "sunspot.year"          "sunspots"             
##  [83] "swiss"                 "Theoph"               
##  [85] "Titanic"               "ToothGrowth"          
##  [87] "treering"              "trees"                
##  [89] "UCBAdmissions"         "UKDriverDeaths"       
##  [91] "UKgas"                 "USAccDeaths"          
##  [93] "USArrests"             "USJudgeRatings"       
##  [95] "USPersonalExpenditure" "uspop"                
##  [97] "VADeaths"              "volcano"              
##  [99] "warpbreaks"            "women"                
## [101] "WorldPhones"           "WWWusage"

or, often of more interest, list the names and a brief description of the structure

ls.str("package:datasets")
## ability.cov : List of 3
##  $ cov   : num [1:6, 1:6] 24.64 5.99 33.52 6.02 20.75 ...
##  $ center: num [1:6] 0 0 0 0 0 0
##  $ n.obs : num 112
## airmiles :  Time-Series [1:24] from 1937 to 1960: 412 480 683 1052 1385 ...
## AirPassengers :  Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
## airquality : 'data.frame':   153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
## anscombe : 'data.frame': 11 obs. of  8 variables:
##  $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
##  $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
##  $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
##  $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
##  $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
## attenu : 'data.frame':   182 obs. of  5 variables:
##  $ event  : num  1 2 2 2 2 2 2 2 2 2 ...
##  $ mag    : num  7 7.4 7.4 7.4 7.4 7.4 7.4 7.4 7.4 7.4 ...
##  $ station: Factor w/ 117 levels "1008","1011",..: 24 13 15 68 39 74 22 1 8 55 ...
##  $ dist   : num  12 148 42 85 107 109 156 224 293 359 ...
##  $ accel  : num  0.359 0.014 0.196 0.135 0.062 0.054 0.014 0.018 0.01 0.004 ...
## attitude : 'data.frame': 30 obs. of  7 variables:
##  $ rating    : num  43 63 71 61 81 43 58 71 72 67 ...
##  $ complaints: num  51 64 70 63 78 55 67 75 82 61 ...
##  $ privileges: num  30 51 68 45 56 49 42 50 72 45 ...
##  $ learning  : num  39 54 69 47 66 44 56 55 67 47 ...
##  $ raises    : num  61 63 76 54 71 54 66 70 71 62 ...
##  $ critical  : num  92 73 86 84 83 49 68 66 83 80 ...
##  $ advance   : num  45 47 48 35 47 34 35 41 31 41 ...
## austres :  Time-Series [1:89] from 1971 to 1993: 13067 13130 13198 13254 13304 ...
## beaver1 : 'data.frame':  114 obs. of  4 variables:
##  $ day  : num  346 346 346 346 346 346 346 346 346 346 ...
##  $ time : num  840 850 900 910 920 930 940 950 1000 1010 ...
##  $ temp : num  36.3 36.3 36.4 36.4 36.5 ...
##  $ activ: num  0 0 0 0 0 0 0 0 0 0 ...
## beaver2 : 'data.frame':  100 obs. of  4 variables:
##  $ day  : num  307 307 307 307 307 307 307 307 307 307 ...
##  $ time : num  930 940 950 1000 1010 1020 1030 1040 1050 1100 ...
##  $ temp : num  36.6 36.7 36.9 37.1 37.2 ...
##  $ activ: num  0 0 0 0 0 0 0 0 0 0 ...
## BJsales :  Time-Series [1:150] from 1 to 150: 200 200 199 199 199 ...
## BJsales.lead :  Time-Series [1:150] from 1 to 150: 10.01 10.07 10.32 9.75 10.33 ...
## BOD : 'data.frame':  6 obs. of  2 variables:
##  $ Time  : num  1 2 3 4 5 7
##  $ demand: num  8.3 10.3 19 16 15.6 19.8
## cars : 'data.frame': 50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
## ChickWeight : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 578 obs. of  4 variables:
##  $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
##  $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
##  $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## chickwts : 'data.frame': 71 obs. of  2 variables:
##  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
## co2 :  Time-Series [1:468] from 1959 to 1998: 315 316 316 318 318 ...
## CO2 : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 84 obs. of  5 variables:
##  $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
##  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
## crimtab :  'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
## discoveries :  Time-Series [1:100] from 1860 to 1959: 5 3 0 2 0 3 2 3 6 1 ...
## DNase : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   176 obs. of  3 variables:
##  $ Run    : Ord.factor w/ 11 levels "10"<"11"<"9"<..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ conc   : num  0.0488 0.0488 0.1953 0.1953 0.3906 ...
##  $ density: num  0.017 0.018 0.121 0.124 0.206 0.215 0.377 0.374 0.614 0.609 ...
## esoph : 'data.frame':    88 obs. of  5 variables:
##  $ agegp    : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alcgp    : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ tobgp    : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ ncases   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ncontrols: num  40 10 6 5 27 7 4 7 2 1 ...
## euro :  Named num [1:11] 13.76 40.34 1.96 166.39 5.95 ...
## euro.cross :  num [1:11, 1:11] 1 0.3411 7.0355 0.0827 2.3143 ...
## eurodist : Class 'dist'  atomic [1:210] 3313 2963 3175 3339 2762 ...
## EuStockMarkets :  mts [1:1860, 1:4] 1629 1614 1607 1621 1618 ...
## faithful : 'data.frame': 272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
## fdeaths :  Time-Series [1:72] from 1974 to 1980: 901 689 827 677 522 406 441 393 387 582 ...
## Formaldehyde : 'data.frame': 6 obs. of  2 variables:
##  $ carb  : num  0.1 0.3 0.5 0.6 0.7 0.9
##  $ optden: num  0.086 0.269 0.446 0.538 0.626 0.782
## freeny : 'data.frame':   39 obs. of  5 variables:
##  $ y                    : Time-Series  from 1962 to 1972: 8.79 8.79 8.81 8.81 8.91 ...
##  $ lag.quarterly.revenue: num  8.8 8.79 8.79 8.81 8.81 ...
##  $ price.index          : num  4.71 4.7 4.69 4.69 4.64 ...
##  $ income.level         : num  5.82 5.83 5.83 5.84 5.85 ...
##  $ market.potential     : num  13 13 13 13 13 ...
## freeny.x :  num [1:39, 1:4] 8.8 8.79 8.79 8.81 8.81 ...
## freeny.y :  Time-Series [1:39] from 1962 to 1972: 8.79 8.79 8.81 8.81 8.91 ...
## HairEyeColor :  table [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
## Harman23.cor : List of 3
##  $ cov   : num [1:8, 1:8] 1 0.846 0.805 0.859 0.473 0.398 0.301 0.382 0.846 1 ...
##  $ center: num [1:8] 0 0 0 0 0 0 0 0
##  $ n.obs : num 305
## Harman74.cor : List of 3
##  $ cov   : num [1:24, 1:24] 1 0.318 0.403 0.468 0.321 0.335 0.304 0.332 0.326 0.116 ...
##  $ center: num [1:24] 0 0 0 0 0 0 0 0 0 0 ...
##  $ n.obs : num 145
## Indometh : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':    66 obs. of  3 variables:
##  $ Subject: Ord.factor w/ 6 levels "1"<"4"<"2"<"5"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ time   : num  0.25 0.5 0.75 1 1.25 2 3 4 5 6 ...
##  $ conc   : num  1.5 0.94 0.78 0.48 0.37 0.19 0.12 0.11 0.08 0.07 ...
## infert : 'data.frame':   248 obs. of  8 variables:
##  $ education     : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 1 2 2 2 2 2 2 ...
##  $ age           : num  26 42 39 34 35 36 23 32 21 28 ...
##  $ parity        : num  6 1 6 4 3 4 1 2 1 2 ...
##  $ induced       : num  1 1 2 2 1 2 0 0 0 0 ...
##  $ case          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ spontaneous   : num  2 0 0 0 1 1 0 0 1 0 ...
##  $ stratum       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pooled.stratum: num  3 1 4 2 32 36 6 22 5 19 ...
## InsectSprays : 'data.frame': 72 obs. of  2 variables:
##  $ count: num  10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
## iris : 'data.frame': 150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## iris3 :  num [1:50, 1:4, 1:3] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## islands :  Named num [1:48] 11506 5500 16988 2968 16 ...
## JohnsonJohnson :  Time-Series [1:84] from 1960 to 1981: 0.71 0.63 0.85 0.44 0.61 0.69 0.92 0.55 0.72 0.77 ...
## LakeHuron :  Time-Series [1:98] from 1875 to 1972: 580 582 581 581 580 ...
## ldeaths :  Time-Series [1:72] from 1974 to 1980: 3035 2552 2704 2554 2014 ...
## lh :  Time-Series [1:48] from 1 to 48: 2.4 2.4 2.4 2.2 2.1 1.5 2.3 2.3 2.5 2 ...
## LifeCycleSavings : 'data.frame': 50 obs. of  5 variables:
##  $ sr   : num  11.43 12.07 13.17 5.75 12.88 ...
##  $ pop15: num  29.4 23.3 23.8 41.9 42.2 ...
##  $ pop75: num  2.87 4.41 4.43 1.67 0.83 2.85 1.34 0.67 1.06 1.14 ...
##  $ dpi  : num  2330 1508 2108 189 728 ...
##  $ ddpi : num  2.87 3.93 3.82 0.22 4.56 2.43 2.67 6.51 3.08 2.8 ...
## Loblolly : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':    84 obs. of  3 variables:
##  $ height: num  4.51 10.89 28.72 41.74 52.7 ...
##  $ age   : num  3 5 10 15 20 25 3 5 10 15 ...
##  $ Seed  : Ord.factor w/ 14 levels "329"<"327"<"325"<..: 10 10 10 10 10 10 13 13 13 13 ...
## longley : 'data.frame':  16 obs. of  7 variables:
##  $ GNP.deflator: num  83 88.5 88.2 89.5 96.2 ...
##  $ GNP         : num  234 259 258 285 329 ...
##  $ Unemployed  : num  236 232 368 335 210 ...
##  $ Armed.Forces: num  159 146 162 165 310 ...
##  $ Population  : num  108 109 110 111 112 ...
##  $ Year        : int  1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 ...
##  $ Employed    : num  60.3 61.1 60.2 61.2 63.2 ...
## lynx :  Time-Series [1:114] from 1821 to 1934: 269 321 585 871 1475 ...
## mdeaths :  Time-Series [1:72] from 1974 to 1980: 2134 1863 1877 1877 1492 ...
## morley : 'data.frame':   100 obs. of  3 variables:
##  $ Expt : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Run  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Speed: int  850 740 900 1070 930 850 950 980 980 880 ...
## mtcars : 'data.frame':   32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
## nhtemp :  Time-Series [1:60] from 1912 to 1971: 49.9 52.3 49.4 51.1 49.4 47.9 49.8 50.9 49.3 51.9 ...
## Nile :  Time-Series [1:100] from 1871 to 1970: 1120 1160 963 1210 1160 1160 813 1230 1370 1140 ...
## nottem :  Time-Series [1:240] from 1920 to 1940: 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5 ...
## occupationalStatus :  'table' int [1:8, 1:8] 50 16 12 11 2 12 0 0 19 40 ...
## Orange : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':  35 obs. of  3 variables:
##  $ Tree         : Ord.factor w/ 5 levels "3"<"1"<"5"<"2"<..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ age          : num  118 484 664 1004 1231 ...
##  $ circumference: num  30 58 87 115 120 142 145 33 69 111 ...
## OrchardSprays : 'data.frame':    64 obs. of  4 variables:
##  $ decrease : num  57 95 8 69 92 90 15 2 84 6 ...
##  $ rowpos   : num  1 2 3 4 5 6 7 8 1 2 ...
##  $ colpos   : num  1 1 1 1 1 1 1 1 2 2 ...
##  $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ...
## PlantGrowth : 'data.frame':  30 obs. of  2 variables:
##  $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
##  $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
## precip :  Named num [1:70] 67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...
## presidents :  Time-Series [1:120] from 1945 to 1975: NA 87 82 75 63 50 43 32 35 60 ...
## pressure : 'data.frame': 19 obs. of  2 variables:
##  $ temperature: num  0 20 40 60 80 100 120 140 160 180 ...
##  $ pressure   : num  0.0002 0.0012 0.006 0.03 0.09 0.27 0.75 1.85 4.2 8.8 ...
## Puromycin : 'data.frame':    23 obs. of  3 variables:
##  $ conc : num  0.02 0.02 0.06 0.06 0.11 0.11 0.22 0.22 0.56 0.56 ...
##  $ rate : num  76 47 97 107 123 139 159 152 191 201 ...
##  $ state: Factor w/ 2 levels "treated","untreated": 1 1 1 1 1 1 1 1 1 1 ...
## quakes : 'data.frame':   1000 obs. of  5 variables:
##  $ lat     : num  -20.4 -20.6 -26 -18 -20.4 ...
##  $ long    : num  182 181 184 182 182 ...
##  $ depth   : int  562 650 42 626 649 195 82 194 211 622 ...
##  $ mag     : num  4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
##  $ stations: int  41 15 43 19 11 12 43 15 35 19 ...
## randu : 'data.frame':    400 obs. of  3 variables:
##  $ x: num  0.000031 0.044495 0.82244 0.322291 0.393595 ...
##  $ y: num  0.000183 0.155732 0.873416 0.648545 0.826873 ...
##  $ z: num  0.000824 0.533939 0.838542 0.990648 0.418881 ...
## rivers :  num [1:141] 735 320 325 392 524 ...
## rock : 'data.frame': 48 obs. of  4 variables:
##  $ area : int  4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
##  $ peri : num  2792 3893 3931 3869 3949 ...
##  $ shape: num  0.0903 0.1486 0.1833 0.1171 0.1224 ...
##  $ perm : num  6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
## Seatbelts :  mts [1:192, 1:8] 107 97 102 87 119 106 110 106 107 134 ...
## sleep : 'data.frame':    20 obs. of  3 variables:
##  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
##  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## stack.loss :  num [1:21] 42 37 37 28 18 18 19 20 15 14 ...
## stack.x :  num [1:21, 1:3] 80 80 75 62 62 62 62 62 58 58 ...
## stackloss : 'data.frame':    21 obs. of  4 variables:
##  $ Air.Flow  : num  80 80 75 62 62 62 62 62 58 58 ...
##  $ Water.Temp: num  27 27 25 24 22 23 24 24 23 18 ...
##  $ Acid.Conc.: num  89 88 90 87 87 87 93 93 87 80 ...
##  $ stack.loss: num  42 37 37 28 18 18 19 20 15 14 ...
## state.abb :  chr [1:50] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" ...
## state.area :  num [1:50] 51609 589757 113909 53104 158693 ...
## state.center : List of 2
##  $ x: num [1:50] -86.8 -127.2 -111.6 -92.3 -119.8 ...
##  $ y: num [1:50] 32.6 49.2 34.2 34.7 36.5 ...
## state.division :  Factor w/ 9 levels "New England",..: 4 9 8 5 9 8 1 3 3 3 ...
## state.name :  chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## state.region :  Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## state.x77 :  num [1:50, 1:8] 3615 365 2212 2110 21198 ...
## sunspot.month :  Time-Series [1:2988] from 1749 to 1998: 58 62.6 70 55.7 85 83.5 94.8 66.3 75.9 75.5 ...
## sunspot.year :  Time-Series [1:289] from 1700 to 1988: 5 11 16 23 36 58 29 20 10 8 ...
## sunspots :  Time-Series [1:2820] from 1749 to 1984: 58 62.6 70 55.7 85 83.5 94.8 66.3 75.9 75.5 ...
## swiss : 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
## Theoph : Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':  132 obs. of  5 variables:
##  $ Subject: Ord.factor w/ 12 levels "6"<"7"<"8"<"11"<..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ Wt     : num  79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 ...
##  $ Dose   : num  4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 ...
##  $ Time   : num  0 0.25 0.57 1.12 2.02 ...
##  $ conc   : num  0.74 2.84 6.57 10.5 9.66 8.58 8.36 7.47 6.89 5.94 ...
## Titanic :  table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## ToothGrowth : 'data.frame':  60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## treering :  Time-Series [1:7980] from -6000 to 1979: 1.34 1.08 1.54 1.32 1.41 ...
## trees : 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
## UCBAdmissions :  table [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
## UKDriverDeaths :  Time-Series [1:192] from 1969 to 1985: 1687 1508 1507 1385 1632 ...
## UKgas :  Time-Series [1:108] from 1960 to 1987: 160.1 129.7 84.8 120.1 160.1 ...
## USAccDeaths :  Time-Series [1:72] from 1973 to 1979: 9007 8106 8928 9137 10017 ...
## USArrests : 'data.frame':    50 obs. of  4 variables:
##  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
## USJudgeRatings : 'data.frame':   43 obs. of  12 variables:
##  $ CONT: num  5.7 6.8 7.2 6.8 7.3 6.2 10.6 7 7.3 8.2 ...
##  $ INTG: num  7.9 8.9 8.1 8.8 6.4 8.8 9 5.9 8.9 7.9 ...
##  $ DMNR: num  7.7 8.8 7.8 8.5 4.3 8.7 8.9 4.9 8.9 6.7 ...
##  $ DILG: num  7.3 8.5 7.8 8.8 6.5 8.5 8.7 5.1 8.7 8.1 ...
##  $ CFMG: num  7.1 7.8 7.5 8.3 6 7.9 8.5 5.4 8.6 7.9 ...
##  $ DECI: num  7.4 8.1 7.6 8.5 6.2 8 8.5 5.9 8.5 8 ...
##  $ PREP: num  7.1 8 7.5 8.7 5.7 8.1 8.5 4.8 8.4 7.9 ...
##  $ FAMI: num  7.1 8 7.5 8.7 5.7 8 8.5 5.1 8.4 8.1 ...
##  $ ORAL: num  7.1 7.8 7.3 8.4 5.1 8 8.6 4.7 8.4 7.7 ...
##  $ WRIT: num  7 7.9 7.4 8.5 5.3 8 8.4 4.9 8.5 7.8 ...
##  $ PHYS: num  8.3 8.5 7.9 8.8 5.5 8.6 9.1 6.8 8.8 8.5 ...
##  $ RTEN: num  7.8 8.7 7.8 8.7 4.8 8.6 9 5 8.8 7.9 ...
## USPersonalExpenditure :  num [1:5, 1:5] 22.2 10.5 3.53 1.04 0.341 44.5 15.5 5.76 1.98 0.974 ...
## uspop :  Time-Series [1:19] from 1790 to 1970: 3.93 5.31 7.24 9.64 12.9 17.1 23.2 31.4 39.8 50.2 ...
## VADeaths :  num [1:5, 1:4] 11.7 18.1 26.9 41 66 8.7 11.7 20.3 30.9 54.3 ...
## volcano :  num [1:87, 1:61] 100 101 102 103 104 105 105 106 107 108 ...
## warpbreaks : 'data.frame':   54 obs. of  3 variables:
##  $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
##  $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
## women : 'data.frame':    15 obs. of  2 variables:
##  $ height: num  58 59 60 61 62 63 64 65 66 67 ...
##  $ weight: num  115 117 120 123 126 129 132 135 139 142 ...
## WorldPhones :  num [1:7, 1:7] 45939 60423 64721 68484 71799 ...
## WWWusage :  Time-Series [1:100] from 1 to 100: 88 84 85 85 84 85 83 85 88 89 ...

When examining a new R package, ls.str is a good way to begin.
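When only the data frames in a package are of interest, the listing can be filtered. The following is a minimal sketch, assuming the datasets package is attached (as it is in a default session); the names nms, is_df and df_names are illustrative.

```r
# Names of all objects in the attached datasets package
nms <- ls("package:datasets")

# Keep only the names whose objects are data frames;
# get() looks each name up in the package's environment
is_df <- vapply(nms,
                function(nm) is.data.frame(get(nm, "package:datasets")),
                logical(1))
df_names <- nms[is_df]
head(df_names)
```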

Note that in the calls to ls and ls.str the package name is given as the character string "package:datasets". Attached packages appear under names of this form on the search path. The sessionInfo function lists the packages attached in a session, including datasets.

sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.3
## 
## loaded via a namespace (and not attached):
## [1] codetools_0.2-8 digest_0.6.3    evaluate_0.4.4  formatR_0.8    
## [5] stringr_0.6.2   tools_3.0.1
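The "package:" form of the name can be seen directly on the search path. A minimal sketch using the base function search; the name sp is illustrative and the exact entries will vary from session to session.

```r
# search() returns the search path; attached packages appear as "package:<name>"
sp <- search()
sp

# Check whether the datasets package is attached in this session
"package:datasets" %in% sp
```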

Initial examination of data

The str function and the data set's help page, if it exists, are where I begin examining data

str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

We see that supp, the type of supplement, is a factor, as it should be, and that both dose and len, the response, are numeric. It looks as if dose may take only a few distinct values

xtabs(~dose, ToothGrowth)
## dose
## 0.5   1   2 
##  20  20  20

and, indeed, these data are typical textbook data from a small, carefully balanced experiment.

xtabs(~supp + dose, ToothGrowth)
##     dose
## supp 0.5  1  2
##   OJ  10 10 10
##   VC  10 10 10
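The balance can also be checked programmatically instead of by eye. This is a small sketch and not part of the original analysis; the names tab and balanced are illustrative.

```r
# Cross-tabulate supplement type by dose
tab <- xtabs(~ supp + dose, ToothGrowth)

# The design is balanced if every cell holds the same number of observations
balanced <- length(unique(as.vector(tab))) == 1
balanced
```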

Visualization of ToothGrowth

I usually start with the lattice graphics package for visualization because I am familiar with it.

library(lattice)

The ggplot2 package is widely used and deservedly so. I do not recommend using the base graphics capabilities.

The ToothGrowth data consist of a numeric response, len, one categorical covariate, supp, and one covariate, dose, that could be considered numeric or categorical.

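If we want to consider dose on a continuous scale we could create an interaction plot. In lattice, type = c("g", "p", "a") draws a reference grid, the points, and a line joining the group averages.
[Figure: plot of chunk interactionplot]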
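The code for this figure is not shown in the source. A minimal sketch of a lattice call that should give a comparable interaction plot follows; the grouping by supp, the axis labels and the name p1 are assumptions.

```r
library(lattice)

# Response against dose, with one line per supplement type.
# type: "g" = reference grid, "p" = points, "a" = line through the group averages
p1 <- xyplot(len ~ dose, data = ToothGrowth, groups = supp,
             type = c("g", "p", "a"), auto.key = TRUE,
             xlab = "Dose", ylab = "Tooth length")
print(p1)  # lattice objects must be printed explicitly inside scripts
```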

The shape of the curves, together with the choice of levels (each dose is double the previous one), suggests that the logarithm of the dose may be a better scale.
[Figure: plot of chunk interaction2]
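Again the plotting code is not shown in the source. One way to put dose on a logarithmic axis in lattice, sketched under the assumption that base 2 is wanted because the doses double, is the scales argument; the name p2 is illustrative.

```r
library(lattice)

# Same interaction plot with dose on a base-2 logarithmic horizontal axis
p2 <- xyplot(len ~ dose, data = ToothGrowth, groups = supp,
             type = c("g", "p", "a"), auto.key = TRUE,
             scales = list(x = list(log = 2)),
             xlab = "Dose (log scale)", ylab = "Tooth length")
print(p2)
```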

The only problem with this plot is that it wastes space on the horizontal axis. An alternative is to use the horizontal axis for the response, as in, for example, boxplots

[Figures: two plots of chunk bwplot]
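A sketch of horizontal boxplots with the response on the horizontal axis follows. The conditioning on supp and the name p3 are assumptions, since the original chunk is not shown.

```r
library(lattice)

# Horizontal boxplots: response on the x axis, dose (as a factor) on the y axis,
# with one panel per supplement type
p3 <- bwplot(factor(dose) ~ len | supp, data = ToothGrowth,
             xlab = "Tooth length", ylab = "Dose")
print(p3)
```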

or dotplots

[Figures: three plots of chunk dotplots]
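A dotplot shows every observation rather than a five-number summary. The following is one possible layout and not necessarily any of the three in the source; the name p4 is illustrative.

```r
library(lattice)

# One row per dose level, points distinguished by supplement type
p4 <- dotplot(factor(dose) ~ len, data = ToothGrowth, groups = supp,
              auto.key = TRUE,
              xlab = "Tooth length", ylab = "Dose")
print(p4)
```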

or comparative density plots
[Figures: two plots of chunk densityplots]
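Comparative density plots can be sketched with densityplot from lattice. The arrangement below (one panel per dose, densities overlaid by supplement) is an assumption, as is the name p5.

```r
library(lattice)

# Kernel density estimates of the response, overlaid by supplement type,
# with one panel per dose stacked vertically for easy comparison
p5 <- densityplot(~ len | factor(dose), data = ToothGrowth, groups = supp,
                  auto.key = TRUE, layout = c(1, 3),
                  xlab = "Tooth length")
print(p5)
```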

Reading data over the Internet

One can give a URL instead of a file name as an argument to functions such as read.csv and read.delim. Consider the data at http://www-personal.umich.edu/~bwest/classroom.csv

str(class <- read.csv("http://www-personal.umich.edu/~bwest/classroom.csv"))
## 'data.frame':    1190 obs. of  12 variables:
##  $ sex     : int  1 0 1 0 0 1 0 0 1 0 ...
##  $ minority: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ mathkind: int  448 460 511 449 425 450 452 443 422 480 ...
##  $ mathgain: int  32 109 56 83 53 65 51 66 88 -7 ...
##  $ ses     : num  0.46 -0.27 -0.03 -0.38 -0.03 0.76 -0.03 0.2 0.64 0.13 ...
##  $ yearstea: num  1 1 1 2 2 2 2 2 2 2 ...
##  $ mathknow: num  NA NA NA -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 ...
##  $ housepov: num  0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 ...
##  $ mathprep: num  2 2 2 3.25 3.25 3.25 3.25 3.25 3.25 3.25 ...
##  $ classid : int  160 160 160 217 217 217 217 217 217 217 ...
##  $ schoolid: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ childid : int  1 2 3 4 5 6 7 8 9 10 ...
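The same mechanism can be tried without network access by pointing read.csv at a file:// URL. This sketch uses a made-up two-row file and not the classroom data, and the way the URL is built assumes a Unix-alike absolute path; the names tmp, u and toy are illustrative.

```r
# Write a small made-up CSV file to a temporary location (a stand-in for a remote file)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,3.5", "2,4.0"), tmp)

# Build a file:// URL for it; this form assumes a Unix-alike absolute path
u <- paste0("file://", normalizePath(tmp))

# read.csv accepts the URL just as it would a plain file name
toy <- read.csv(u)
str(toy)
```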

Data sets like this use artificial numeric coding of variables that are in fact categorical. If we summarize these data

summary(class)
##       sex           minority        mathkind      mathgain     
##  Min.   :0.000   Min.   :0.000   Min.   :290   Min.   :-110.0  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:439   1st Qu.:  35.0  
##  Median :1.000   Median :1.000   Median :466   Median :  56.0  
##  Mean   :0.506   Mean   :0.677   Mean   :467   Mean   :  57.6  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:495   3rd Qu.:  77.0  
##  Max.   :1.000   Max.   :1.000   Max.   :629   Max.   : 253.0  
##                                                                
##       ses            yearstea       mathknow        housepov    
##  Min.   :-1.610   Min.   : 0.0   Min.   :-2.50   Min.   :0.012  
##  1st Qu.:-0.490   1st Qu.: 4.0   1st Qu.:-0.72   1st Qu.:0.085  
##  Median :-0.030   Median :10.0   Median :-0.13   Median :0.127  
##  Mean   :-0.013   Mean   :12.2   Mean   : 0.03   Mean   :0.178  
##  3rd Qu.: 0.398   3rd Qu.:20.0   3rd Qu.: 0.85   3rd Qu.:0.255  
##  Max.   : 3.210   Max.   :40.0   Max.   : 2.61   Max.   :0.564  
##                                  NA's   :109                    
##     mathprep       classid       schoolid        childid    
##  Min.   :1.00   Min.   :  1   Min.   :  1.0   Min.   :   1  
##  1st Qu.:2.00   1st Qu.: 80   1st Qu.: 26.0   1st Qu.: 298  
##  Median :2.30   Median :157   Median : 54.0   Median : 596  
##  Mean   :2.61   Mean   :158   Mean   : 52.9   Mean   : 596  
##  3rd Qu.:3.00   3rd Qu.:239   3rd Qu.: 79.0   3rd Qu.: 893  
##  Max.   :6.00   Max.   :312   Max.   :107.0   Max.   :1190  
## 

we get nonsensical numerical summaries of characteristics like sex. We should change these variables to factors.

class <- within(class, {
    sex <- factor(sex, labels = c("M", "F"))
    minority <- factor(minority, labels = c("N", "Y"))
    classid <- factor(classid)
    schoolid <- factor(schoolid)
    childid <- factor(childid)
})
str(class)
## 'data.frame':    1190 obs. of  12 variables:
##  $ sex     : Factor w/ 2 levels "M","F": 2 1 2 1 1 2 1 1 2 1 ...
##  $ minority: Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mathkind: int  448 460 511 449 425 450 452 443 422 480 ...
##  $ mathgain: int  32 109 56 83 53 65 51 66 88 -7 ...
##  $ ses     : num  0.46 -0.27 -0.03 -0.38 -0.03 0.76 -0.03 0.2 0.64 0.13 ...
##  $ yearstea: num  1 1 1 2 2 2 2 2 2 2 ...
##  $ mathknow: num  NA NA NA -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 ...
##  $ housepov: num  0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 ...
##  $ mathprep: num  2 2 2 3.25 3.25 3.25 3.25 3.25 3.25 3.25 ...
##  $ classid : Factor w/ 312 levels "1","2","3","4",..: 160 160 160 217 217 217 217 217 217 217 ...
##  $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ childid : Factor w/ 1190 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
summary(class)
##  sex     minority    mathkind      mathgain           ses        
##  M:588   N:384    Min.   :290   Min.   :-110.0   Min.   :-1.610  
##  F:602   Y:806    1st Qu.:439   1st Qu.:  35.0   1st Qu.:-0.490  
##                   Median :466   Median :  56.0   Median :-0.030  
##                   Mean   :467   Mean   :  57.6   Mean   :-0.013  
##                   3rd Qu.:495   3rd Qu.:  77.0   3rd Qu.: 0.398  
##                   Max.   :629   Max.   : 253.0   Max.   : 3.210  
##                                                                  
##     yearstea       mathknow        housepov        mathprep   
##  Min.   : 0.0   Min.   :-2.50   Min.   :0.012   Min.   :1.00  
##  1st Qu.: 4.0   1st Qu.:-0.72   1st Qu.:0.085   1st Qu.:2.00  
##  Median :10.0   Median :-0.13   Median :0.127   Median :2.30  
##  Mean   :12.2   Mean   : 0.03   Mean   :0.178   Mean   :2.61  
##  3rd Qu.:20.0   3rd Qu.: 0.85   3rd Qu.:0.255   3rd Qu.:3.00  
##  Max.   :40.0   Max.   : 2.61   Max.   :0.564   Max.   :6.00  
##                 NA's   :109                                   
##     classid        schoolid       childid    
##  26     :  10   11     :  31   1      :   1  
##  42     :  10   12     :  27   2      :   1  
##  13     :   9   71     :  27   3      :   1  
##  189    :   9   76     :  27   4      :   1  
##  205    :   9   77     :  24   5      :   1  
##  253    :   9   31     :  22   6      :   1  
##  (Other):1134   (Other):1032   (Other):1184
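An alternative to converting after the fact is to request factors when the file is read, using a named colClasses vector. The sketch below uses a few made-up rows in the same 0/1 coding style and not the real classroom file. Note that the levels are then the raw codes "0" and "1" and would still need relabelling.

```r
# A few made-up rows in the same coding style (not the real classroom data)
csv_text <- "sex,minority,mathgain,classid
1,1,32,160
0,1,109,160
1,0,56,217"

# Named colClasses entries apply to the matching columns;
# the types of the remaining columns are inferred as usual
toy <- read.csv(text = csv_text,
                colClasses = c(sex = "factor", minority = "factor",
                               classid = "factor"))
str(toy)
```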

The childid variable is redundant but there is no harm in retaining it.

For a categorical variable the summary is a frequency table. If the number of levels is large, the ones with the largest counts are listed first. Thus the largest number of students sampled from a single class is 10. To look at the distribution of the counts we can apply xtabs twice.

xtabs(~xtabs(~classid, class))
## xtabs(~classid, class)
##  1  2  3  4  5  6  7  8  9 10 
## 42 53 53 61 39 31 14 13  4  2
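The double tabulation is easier to follow on a tiny made-up example. Here the inner xtabs counts members per group and the outer xtabs counts how many groups have each size; the data and the names toy, sizes and size_dist are illustrative.

```r
# Made-up group labels: "a" has 2 members, "b" has 1, "c" has 3, "d" has 1
toy <- data.frame(g = c("a", "a", "b", "c", "c", "c", "d"))

# Inner tabulation: number of members in each group
sizes <- xtabs(~ g, toy)

# Outer tabulation: number of groups of each size
size_dist <- xtabs(~ sizes)
size_dist
```

Two groups have a single member, one has two and one has three, so the counts under sizes 1, 2 and 3 are 2, 1 and 1.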

Of the 312 classrooms, 42 have only one student in the study. The purpose of the study is to determine the effects of teacher training on student performance.

Class-specific and school-specific variables

Many of the variables are characteristics of teachers and should be constant within a class. We should check that this is true.

str(classvars <- unique(subset(class, select = c("yearstea", "mathknow", "housepov", 
    "mathprep", "classid", "schoolid"))))
## 'data.frame':    312 obs. of  6 variables:
##  $ yearstea: num  1 2 1 2 12.5 ...
##  $ mathknow: num  NA -0.11 -1.25 -0.72 NA 0.45 0.99 1.61 1.14 -1.05 ...
##  $ housepov: num  0.082 0.082 0.082 0.082 0.082 0.086 0.086 0.086 0.086 0.365 ...
##  $ mathprep: num  2 3.25 2.5 2.33 2.3 3.83 2.25 3 2.17 2 ...
##  $ classid : Factor w/ 312 levels "1","2","3","4",..: 160 217 197 211 307 11 137 145 228 48 ...
##  $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 1 2 2 2 3 3 3 3 4 ...
summary(classvars)
##     yearstea       mathknow        housepov        mathprep   
##  Min.   : 0.0   Min.   :-2.50   Min.   :0.012   Min.   :1.00  
##  1st Qu.: 4.0   1st Qu.:-0.76   1st Qu.:0.085   1st Qu.:2.00  
##  Median :10.0   Median :-0.19   Median :0.142   Median :2.30  
##  Mean   :12.3   Mean   :-0.08   Mean   :0.191   Mean   :2.58  
##  3rd Qu.:20.0   3rd Qu.: 0.62   3rd Qu.:0.263   3rd Qu.:3.00  
##  Max.   :40.0   Max.   : 2.61   Max.   :0.564   Max.   :6.00  
##                 NA's   :27                                    
##     classid       schoolid  
##  1      :  1   11     :  9  
##  2      :  1   12     :  5  
##  3      :  1   15     :  5  
##  4      :  1   17     :  5  
##  5      :  1   33     :  5  
##  6      :  1   46     :  5  
##  (Other):306   (Other):278
xtabs(~xtabs(~schoolid, classvars))
## xtabs(~schoolid, classvars)
##  1  2  3  4  5  9 
## 13 34 26 21 12  1

The important information from the summary is that there are 312 rows in this data frame, corresponding to the 312 classes. If any of the other variables were not constant within class, unique would retain more than one row for that class and we would have a greater number of rows.
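The logic of this check can be seen on a small made-up example, where the number of distinct combinations is compared with the number of classes. The data and the names toy, n_combos and per_class are illustrative and not taken from the classroom file.

```r
# Made-up data: teacher experience should be constant within each class
toy <- data.frame(classid  = factor(c(1, 1, 2, 2, 3)),
                  yearstea = c(5, 5, 12, 12, 3))

# Number of distinct (yearstea, classid) combinations
n_combos <- nrow(unique(toy[, c("yearstea", "classid")]))

# If the variable is constant within class this equals the number of classes
n_combos == nlevels(toy$classid)

# Count the distinct values per class to locate any inconsistent class
per_class <- tapply(toy$yearstea, toy$classid, function(v) length(unique(v)))
per_class
```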

We also see that the number of classes sampled per school is highly unbalanced: 47 of the 107 schools have only one or two classes sampled, while one school has nine.

A check on the school-specific variables shows they are consistent: the distinct combinations of housepov and schoolid give exactly 107 rows, one per school

str(schoolvars <- unique(subset(classvars, select = c("housepov", "schoolid"))))
## 'data.frame':    107 obs. of  2 variables:
##  $ housepov: num  0.082 0.082 0.086 0.365 0.511 0.044 0.148 0.085 0.537 0.346 ...
##  $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

[Figures: two plots of chunk housepovdens]