## Stat 692 Fall 2014 Notes

These are Yandell's notes from Fall 2014 sessions to augment lectures. These are open to read, somewhat for my benefit and yours. Comments welcome. First link is to Bates's GitHub [1] document; second is to notes below by date.

- Introduction to R and RStudio (Intro.Rmd [2])
- Organizing Data in R (Data.Rmd [3]; 2014-09-12)
- Data Visualization with ggplot2 (ggplot2.Rmd [4]; 2014-09-19)
- Quadrat Data Project (Quadrat.Rmd [5]; 2014-09-26)
- Linear Models (lm.Rmd [6]; 2014-10-10)
- Reading Big Data (ReadingBigData.Rmd [7] (2014-10-17)
- Sleep Transcriptome Data Project (sleepTranscriptome.Rmd [8])
- Multi-factor experiments (ch09.Rmd [9]; 2014-10-31)
- Regression models (ch10.Rmd [10]; 2014-11-07)
- Plotting Densities (densities.Rmd [11]; 2014-11-14)
- Plotting PMF's and Probability Densities (PMFs.Rmd [12]; 2014-11-21)
- Simulations (Simulations.Rmd [13], SimSpeed.Rmd [14])
- [Thanksgiving]
- Creating Packages with R (2014-12-05)
- Special Topics Suggested by Students (2014-12-12)

## 2014-09-12

github notes

cd

git clone https://github.com/dmbates/stat692 [1]

cd stat692

git pull

See also github links at http://www.stat.wisc.edu/network-skills#beyondR [15]

Add notes how to clone git repositorywithin Rstudio (if possible)

===============

Using Install button to get package polynom

Kernighan and Plaugher

Elements of Programming Style

http://en.m.wikipedia.org/wiki/The_Elements_of_Programming_Style [16]

stopifnot()

see Doug's Data.Rmd at his github

help for package

search path

datasets for intro classes--what is out there

count.fields

$ as "syntactic sugar" vs [[]]

## 2014-09-19

moodle.wisc.edu course page

first homework assignment at moodle

Data.Rmd

with() better than attach() ... detach()

within() to change assignments in structure

count.fields()

manual: input/output ... http://cran.r-project.org/doc/manuals/r-release/R-data.html [17]

read.table: tell what fields are

data organization

wide and long formats

long format--need example

factor and ordered factor

subset

factors, can use drop=TRUE to explicily drop unused levels

or apply factor() to a factor()

stack/unstack; reshape

xtabs(~xtabs(~schoolid, classroom))

xtabs(~xtabs(~classid, classroom))

data visualization

ggplot2.org

pima indian data from faraway

qplot

empirical density plot

use factors for things that should be factors

don't do dumb things with boxplots

## 2014-09-26

homework comments

be sure to document what you did so you will later recall details

childid problem

classroom
minority
childid
schoolid
classid
}

str(rownames(classroom))

## note: evaluates as character string

all(str(rownames(classroom) == classroom$childid))

loops

R prefers vectorized calculation--work on the whole object

look ahead

cd /afs/cs.wisc.edu/p/stat/data

ls

cd MS.exam

ls -l

cd s11

ls -l

linux basic tools

head file.txt

wc -l file.txt

ls -lh

wc

less or zless

zcat file.gz | wc -l

homework

ex2.Rmd

details of Rmd document

"---" YAML metadata section at top

html_document to pdf_document

preliminary section

set options

```{r initial,echo=FALSE,print=FALSE,cache=FALSE}

library(knitr)

library(ggplot2)

load("classroom.rda")

opts_chunk$set(cache=TRUE)

options(width=76, show,signif.stars=FALSE, str=strOptions(strict.width="cut"))

set.seed(1234

```

comments on cache and chunks

see Rstudio home->products->R packages

knitr (caution--poor navigation of document)

R markdown (read this one)

full specification of pandoc markdown

note: Rstudio has lookahead for options as you type

```{r hist1,echo=FALSE,fig.align="center",fig.cap="Histogram of blah blah",

fig.height=2.5,fig.width=6.5,fig.lp="fig:hist1",fig.pos="top"}

qplot(mathkind,data=classroom, geom="histogram")

```

somehow this did not work; will fix later

back to ggplot2 slides

bivariate plots

log-log scale spreads data out nicely

comparative box plots--horizontal

regression or ancova

stat500 data with geom_smooth(method="lm") + coord_equal()

watch aspect ratio (studied by Bill Cleveland): should be ~45 degree angle

add reference line using geom_abline

note Galton's regression toward the mean

ancova

cathedral data

geom_point(aes(color=style)) + geom_smooth(aes=(col=style),method="lm")

different panels or facets using formula (left=rows, right=columns)

facet_grid(.~style)

Tufte's small multiples

Cleveland: details of how to do this well

## 2014-10-10

assignment due NEXT Tuesday to produce several figures

look at MS exam data

ex: Spring 2011 MS exam problem on Quadrat Summary

auto conversion

think of names of variables and number of unique values--are they labels?

note "X" prefix on variable identifier

single bracket on data.frame gets data.frame: summ[1] vs summ[[1]]

the first one is like using subset() on a data.frame

single vs double ampersand (&)

single: component-wise AND

double: programmatic AND (evaluates first as scalar; if FALSE do not evaluate second)

are there any entries that have Browsed but not Present?

nrow(subset(summ, X2m_SppPresent == "N" & X2m_SPPBrowsed == "Y"))

-----------------------

Linear Models

brain weight data

ggplot, scale_x_log10()

geom_smooth(): mostly linear but a bit of a bump

linear: geom_smooth(method="lm")

model fitting functions

style suggestion for reproducibility

use data option with data.frame after cleaned

keep Rmd document with steps of cleaning and steps in analysis

note explicit inclusion of intercept as "~ 1 +"

model.matrix()

gaussian model: y ~ N(Xb, s^2 I)

linear model form works here, but not for more general models (e.g. Poisson)

str() of model.matrix -- notice "assign" attribute

many subtle things go on behind the scenes

model matrix more appropriate name than "design matrix"

suppress intercept: "~ 0 +"

coef(summary(fm1)) includes variability but coef(fm1) does not

vcov(fm1) provides variance-covariance matrix of coefficients

note: stderr() is error message stream, NOT standard error

offset(log(BodyWt)) to test if slope is 1

plot of model fit: get 4 of 6 possible diagnostic plots

think about aspect ratio

diagnostics useful, but designed for blind selection of "best" model

often better these days to use visualization tools to explore

methods for lm objects

generic functions such as print, summary, plot

what methods does a particular class have?

non-visible functions (not in NAMESPACE)

extractor functions such as coef, deviance, logLik

use methods that exist to "future proof" your code

role of methods can change in the future

provide extractors in the code

## 2014-10-17

Rmd and figures

short wide figures more flexible

---

title:

author:

date:

preamble

output:

pdf_document:

fig_caption:yes

fig_height: 3

---

preliminary R section

set options, load libraries, create initial objects (such as plots)

## subsection (can be numbered if desired--see manual)

now look at report and plots for kindergarten math scores

notice outlying points--what is going on?

test has minimum and maximum scores

minimum is allegedly adjusted for possibility of guessing: artificial value

notice more variability in boys than girls

style: figure captions as complete sentences, explaining plot. Enough information so busy reader can read caption and comprehend what is going on.

side-by-side plots: perceive shape but not shift

histogram vs quantile-quantile vs histogram vs boxplots

note minority vs non-minority

scatter plot of score vs gain (=difference)

notice collinearity: pre and post scores each have variability

low initial scores have high gains

loess curve: see http://en.m.wikipedia.org/wiki/Local_regression [18]

be careful of interpreting intercepts: are they meaningful?

scatterplot of pre vs post scores (more appropriate)

use theme(aspect = 1) or coord_equal() to set aspect ratio to one

facet_grid(sex ~ .) puts one on top of the other

don't know how to put labels on left rather than top

facets use the same axes: can look at patterns between

experiment: what communicates best

is the variability among schools, classrooms, individuals?

be careful about confounding these nested sources of variability

coord_flip()

what kind of spread within classroom of individuals?

careful of outliers and leverage if plotting gains

gains are not effective measure of instructor!

think about ordering of classrooms

by ID: meaningless

by mean

school11
classid
note: reorder has third argument for function to use for reordering

digression on multilevel modelling/hierarchical modeling

caution on longitudinal data and migration

==========================

Dealing with Big Data

what is Big Data? (http://www.stat.wisc.edu/bigdata)

volume: very large

velocity: arriving and changing rapidly

variety: many sources

veracity:

validity

volatility

Big Data in R

R is slow interpreter and is single threaded

in-memory storate, conservative copying

R is not sutied to real-time processing of data streams

Moore's law

number of transistors on IC would double in approximately 18 months

roughly interpreted as speed of processor

density of circuits is now incredible (sub-micron geometries)

density of memory has exploded (solid state drive)

but cannot speed up clock rate (related to dissipation of energy)

modern computing architectures evolving quickly

algorithms designed around how to move data

to use effectively, want to spread out the calculation

goal: distribute work across multiple cores

divide and conquer: map-reduce, also known as hadoop

use threads, but have to prepare them

can use snow, mpi, to farm out jobs

big volume data in R

read "R Data Import/Export" http://cran.r-project.org/doc/manuals/r-release/R-data.html

archive.ics.uci.edu/ml/machine-learning-databases/00216

4.5 Gb size

next time: linear models,extractors,contrasts

## 2014-10-31

Linear Models

QR decomposition

see notes

log(BodyWt) as response for brain weight data

methods for qr objects

note Q1 and Q2 are orthogonal, which implies directly that resid and fitted are orthogonal

decomposing into orthogonal subspaces

Fisher with poor eyesight spent evenings thinking rather than reading

statistics is not a formula, instead it refers to the model

note: Householder usually yields negative intercept, due to reflection

style: extract pieces from model rather than going back to original data

slides show how to extract RSS, coefficients, etc.

very useful for simulations with same predictors (even if shuffled)

model.frame and model.response: caution on missing data

model.frame removes missing values and applied functions/transformations

model.response only pulls out response for non-missing cases

relation of linear algebra (Math 340) to numerical linear algebra

never evaluate inverses except R1 if interested in (X'X)^-1

computational rank--tolerances "close" to zero

basic linear algebra subroutines

LAPACK benchmark

Revolution R Open (http://www.revolutionanalytics.com/revolution-r-open [19]) has accelerated BLAS using multi-threading

but does not affect LS calculations due to rank deficiencies with factors (more later)

Categorical covariates

degrees of freedom: remove 1 for intercept

JEdward Deming: burning the toast and then scraping it--dummy variables

xtable package by David Dahl

```{r nm, results="asis", echo=FALSE}

options(xtable.type="HTML")

xtable(fm2)

oldclass
class(fm2)
xtable(fm2)

class(fm2)
```

first call gives ANOVA table; second call gives coefficient table

Model matrix for factor spray

coef = difference of levels with first level, which is a contrast

orthogonal contrasts, orthogonal polynomials

contr.treatment and contr.poly are most often used

contr.SAS matches SAS

contr.sum(2) for 2-level factorial designs

Box, Hunter and Hunter book: http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471718130.html

details of 2^3 model matrix and design matrix

## 2014-11-07

questions on homework

facet wrap

ch 9: multi-factor experiments

9.1 anova

formula ".~." is old formula, for instance in update()

battery separator data

options(contrasts = c("contr.treatment","contr.helmert"))

summary(fm1
notice 3-way interaction is highly significant

heredity principle: retain all parents of higher order terms

model.matrix(fm1) gives you encoding

contrast choice gives you orthogonal matrix

more stable

each coefficient is independent

followup: looking at subsets (temp == "high")

blocking factors

gl() generate levels

ftable(xtabs()) useful to flatten tables

blocking factor--op--always goes in first

unreplicated multi-factor design

have to give up something, usually highest order interactions

9.2 2^k factorial designs

saturated model if use all interactions--no d.f. for error

QQ-plot: outliers suggest significant effects

normal plot has challenges with sign of signals

half-normal is more stable: distribution is chi = square root of chi-square

should go through the origin

look for points above line through origin

9,3 fractional factorial design

run only part of the experiment (save money and time)

but be careful how you lay it out (plan ahead)

control what main effects and interactions are aliased (completely confounding)

see notes for more details--this is material for Stat 424 for instance

ch 10 regression models

10,1 inference for a regression line

timetemp data: ancova

are the lines different?

are they parallel?

is there a non-negligible slope?

Venables: analysis of variance is really about comparing models, not terms in models

10.2 other regression models

note I(x^2) to create second-order numeric interaction

look at Rmd code to see details

confidence intervals confint()

predict() to get prediction values and confidence intervals

design is according to variables

model.matrix is derived from design

continuous and categorical variates

back to timetemp data

factor coefficient typeOEM is important: repaired vs OEM

change in slope measured by typeOEM:temp

## 2014-11-14

SimSpeed.Rmd for today

case study with large simulation of student

good packages to study:

datatable package

dplyr and magrittr packages by Hadley Wickham

using alr4 packages

why not attach package?

keeps workspace cleaner

use "LazyData" in DESCRIPTION file for package

then only loads what you need

if that is not done, then

data("ufc", package="alr4")

will explicitly load "ufc" without other stuff in package

str(ufc)

library(ggplot2)

## clean up data

qplot(Species, Dbh, data=ufc,geom="boxplot",xlab="Tree Species",

ylab="Diameter at breast height (mm)") + coord_flip()

## reorder in meaningful way

qplot(reorder(Species,Dbh), Dbh, data=ufc,geom="boxplot",xlab="Tree Species",

ylab="Diameter at breast height (mm)") + coord_flip()

## could add "+ scale_y_log10()"

## meaningful scatterplot

qplot(Dbh, Height, data=ufc,geom="point",

xlab="Diameter at breast height (mm)",

ylab="Tree height (dm)") +

scale_y_log10() + scale_x_log10() +

facet_wrap(~ Species)

## note some species are rare--drop them?

## fit model for one species

ufcwc
m1
options(show.signif.stars = FALSE)

summary(m1)

## be sure to do reality check--look at fit

qplot(Dbh, Height, data=ufcwc) + geom_smooth(method="lm")

qplot(Dbh, Height, data=ufcwc) + geom_smooth()

## poor fit on linear scale suggests log-log fit

summary(m2
## residual plot

plot(m2, which=1)

reduced
str(reduced)

summary(m3
anova(m3)

## and also model with interaction (replace "+" by "*")

-------------------------------------------

Simulation Speed handout

## pull out R code

library(knitr)

purl("SimSpeed.Rmd")

## then march through the SimSpeed.R code

system.time(resRepl
## replicate statistic mean() N times with sample size n

qplot(resRep) + geom_density()

caution on loops: set result vector before you do the loop

don't append

resloop1
system.time(for(i in 1:N) resLoop1[i]
## don't do

resloop1

another approach: tradeoff space vs speed

str(MM
system.time(resColMeans
very fast, but creates large matrix

another way: meta-programming using apply()

apply function mean() over 2nd index

resApply(MM, 1, mean)

force garbage collection: gc()

do this before any systematic timing to reduce variability

family of apply meta-functions

## 2014-11-21

case study of analysis

lessons

use profiling capabilities (based on functions)

things take a long time in areas of code you don't expect

local environments within functions are removed

think (and write) in terms of vectors

be aware of column major ordering: work down columns in nested loops

for (k) for (j) for (i) value[i,j,k]

vast differences in speed for different areas of memory

avoid creating vectors by appending in loops

careful about naming objects: readability, searchability

always set.seed() for simulations

makes strange behavior repeatable

start with 2000 4x4 matrices

generate new set of 4x5 matrices by adding a column

evaluate a probability (need all matrices available)

then shift columns over and drop the last

then iterate this whole process

lists vs arrays

array is homogeneous, all elements of same type, easier to access

list is heterogeneous, need set of (indirect) pointers to different types

chain
## NB: NA is by default a floating point number

## advantage of NA: find out which values are not set (by sum(is.na(...)))

vectorize runif(n) rather than for(i in 1:n) runif(1)

rfr
chain
in this case, loop does not cost, but sometimes it does

caution: your guess of where code spends time may be way off

rdirichlet(): scalar returns vector, vector returns matrix

cbind()

both these functions can be quite slow

environments

function call creates an environment

environment removed at exit

gtools package: dirichlet distribution stuff among other

rep(0,n) is slower than numeric(n)

watch out for 1:n when n=0 is {1,0}

consider seq_len(n)

function advice

pass in numeric arguments

figure out dimensions inside (no need to pass)

create temporary objects inside (will be removed with local env)

vectorize where you can

short code (with documentation) on whole object often easier to figure out

R is a functional language

unit testing: "trust but verify" (RM Nixon)

arrays and indexing operator

very flexible

again, work on whole object if you can

compiler package can "byte compile" functions

may get 30% speed increase

functional programming: apply function to an object

LISP structure sitting under R

functional vs mechanistic

divide and conquer

modern statistical ideas concern patterns and processes, which are functions

organizing code by functions can help reveal/communicate methods

Timing

set.seed() ## make repeatable

system.time(myfun(args))

profiling to measure where code bogs down

call stack written out

Rprof() ## turn on

myfun(args)

Rprof(NULL) ## turn off

summaryRprof()

know several ways to use indices into arrays

diag() can take long time

for [i,j,k] can use an array as index

help("[")

lst[1] vs lst[[1]] for list: what do you get

## 2014-12-05

http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/ [20]

install.packages("devtools")

install.packages("roxygen2")

github and bitbucket as code repositories

library(devtools)

install_github("klutometris/roxygen")

Roxygen

emacs ESS widget

M-x ess-roxy-update-entry create header

##' @title name of function

##' @param

##' @return a modified variable

##' @export if function is to be exported

create package withing Rstudio

new project -> new R package -> add source files

x create git repository

x use packrat with this project (snapshot of depends packages)

DESCRIPTION

License: GPL2 usually

GPL3 related to FFF & Apple Xtools

other possible licenses

NAMESPACE

Read-and-delete-me: read it for instructions

now that you have a package

devtools::document()

takes @arguments directly out of code file using roxygen

keeps document information in one file (with code)

updates NAMESPACE and man pages

script in source as example

##' @examples

##' L
...

NULL

==================

build tag under upper right Rstudio window

Build & Reload: try to build package

Check: check for errors and all sorts of stuff

design philosophy

go from a file to a package to a git repository

all content built into signngle file

always write examples

separate area for tests: do functions work together

testthat by Hadley Wickham (watch out for approximate equal floating point)

and produce result you think they should

devtools::test tests all test_that in a package

=======================

cran.rstudio.com

R Manuals

R Data Import/Export

Writing R Extension

complete references but intimidating

documents CRAN standards

=========================

Rsave datasets

extra directories saved in installed package

inst directory

special data for tests

raw source before manipulation

=============

r-pkgs.had.co.nz/intro.html

R packages by Hadley Wickham

nicely written, compatible with Rstudio

see also his book

jekyll: fancy web pages?

=================

philosophy

long file or many short files, say ordered by type of function

simulations all in one place

update from one set of values to another

===================

efficiencies

compile code into byte code

LazyCode

LazyData

===============

philosophy on proprietary notebooks

need license

v. open source

transfer of "program culture" training across packages

## 2014-12-12

Special Topics Suggested by Students

publication, tables, graphics

LaTeX: Rmd -> Rnw

Rmd = R + markdown

Rnw = R + LaTeX

beamer

xtable package for tables ready for LaTeX

special LaTeX packages for extra features

graphics

avoid base R graphics

ggplot2 preferred these days

think about design of plots

everyone looks at figures, tables are for obligated stuff

parallel processing

examples of reports

exploring R functions

read.csv: type name and look at what it is (but comments stripped)

cd /unsup/R-devel/src/library

cd base

ls -la

you will see man, R, inst, demo folders

cd R

look at individual *.R files

or look in utils for readtable.R

.Internal or .Primitive or .Call go to compiled C code

not easy to look at these (such as exp())

see R journal for Uwe Ligges article on how to find these

http://stackoverflow.com/questions/14035506/how-to-see-the-source-code-o... [21]

writing R functions

adv-r.had.co.nz Hadley Wickham

think in terms of the whole object

alternative languages

Julia, Python, Matlab

characteristics of R (mid 1980s design)

interpreted, vectorized, functional semantics

each function evaluation traces through code

don't modify arguments -> do a lot of copying

storage allocation and garbage collection are important

garbage collector

global stop mark sweep

generational garbage collector

reference counting for localized garbage collecting

Ed Deming: bruning the toast and then scraping it

homo-iconic

everything is (internally) a vector

there are no scalars

running an R loop is potentially expensive

updating is expensive as well

Matlab/Octave

Matlab: proprietary with many nice GUI features

Octave: open source implementation (bug for bug compatibility)

developed by John Eaton, UW Chem Engr

community has developed but is not flourishing

everything is a matrix; vectorization

R API and Rcpp

API: interface for other languages (C, Fortran)

has to be in R format, etc.

Rcpp: Dirk Eddelbuettel and Romain Francois http://www.rcpp.org [22]

http://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-package.pdf [23]

template meta-programming, but complicated (seamless?)

Python: general purpose programming language

used extensively in bioinformatics

scripting language, can be used as a filter between applications

object oriented in C++/Java sense: methods associated with a data type

many libraries / packages

technical computing requires add-ons to python

byte compiler and derivative languages

Julia http://julialang.org [24] from MIT

recently developed (2012)

uses modern CS tools (just in time JIT compiler)

most successful JIT: javascript and Google's V8 engine

LLVM low level virtual machine project

U Chicago and Argonne Nat Lab

open source compiler technology, highly modular

youtube EuroSciPy 2014 Steven Johnson talk

http://www.youtube.com/watch?v=jhlVHoeB05A [25]

wide variety of data types

function based, uses multiple dispatches

dispatch on signature of input formats (think R S3 and S4 methods)

multiple dispatches consider several arguments

see Bates presentations under github account

powerful, flexible language where the future lies