Professional Skills for Data Science

These materials are for statisticians at all levels who want to learn more about modern network and computing tools for statistics. The course for Fall 2015 is: Stat 627 Professional Skills for Data Science (1-2 cr, Thursdays 1-2:15pm, 1010 MSC). [This is NOT the Stat 327 "Data Analysis with R" course.] It is recommended for all new graduate students. It emphasizes training in best practices with R, focusing on how to organize your work (using RStudio), how to build excellent graphics (using ggplot2), and how to document your work (using R Markdown), among other things. The course is led by Brian Yandell and Doug Bates in Fall 2015; see the Moodle site, the Stat 627 Syllabus, and Jenny Bryan's Stat 545 at UBC. The course covers:

  • Installation of the RStudio application on personal computers and use of RStudio on departmental computers.
  • Representation of data in R, including factors and data frames.
  • Manipulation and display of data in R.
  • Visualization of data with the ggplot2 package for R.
  • The "R Markdown" language for literate programming and reproducible research.
  • The R formula language and its use for fitting linear models and generalized linear models.
  • Contrast specifications for factors and their impact on interpretation of coefficient estimates.
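
As a small illustration of the last two topics, the sketch below fits the same one-factor linear model under two contrast specifications. The data frame dat and its columns y and group are invented for this example and are not course data; the code assumes R's default contrast options are in effect.

```r
# Toy data, made up for illustration: one response and one
# three-level factor with group means 11, 21 and 31.
dat <- data.frame(
  y     = c(10, 12, 11, 20, 22, 21, 30, 32, 31),
  group = factor(rep(c("a", "b", "c"), each = 3))
)

# Default treatment contrasts: the intercept is the mean of the
# reference level "a"; the other coefficients are differences from "a".
fit_trt <- lm(y ~ group, data = dat)

# Sum-to-zero contrasts: the intercept is the average of the group
# means; the coefficients are deviations of "a" and "b" from it.
fit_sum <- lm(y ~ group, data = dat,
              contrasts = list(group = "contr.sum"))

print(coef(fit_trt))
print(coef(fit_sum))

# The two parameterizations describe the same fitted model.
same_fit <- isTRUE(all.equal(fitted(fit_trt), fitted(fit_sum)))
print(same_fit)
```

The coefficient estimates differ between the two fits even though the fitted values agree, which is why the contrast specification has to be known before a coefficient can be interpreted.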

The raw material for Fall 2014 was at with more material in Yandell's Stat 692 Notes related to Bates's talks.

Much of the material below was added for Fall 2013. We used the following network tools to deliver information: Drupal (open-access web pages, including this page); Box (drop box for collaboration); and Moodle (course collaboration environment).

Resource Topics and Useful Links

We are in the age of big data, when it is not enough to think of what you can do with the computer on your desk or lap. Further, statistics as a field is being transformed by analytics, the process of discovery and communication of meaningful patterns in data. Today, statisticians of all flavors, from the most pragmatic to the most theoretical, need a variety of computational tools, and we need access to vast resources across computer networks. While many of us work in closed shops, behind proprietary walls, the world of open source is core to sharing methods and information. Thus, it is important to understand the power and role of the Linux operating system and of the R language. However, these are only the beginning. While we focus on computational skills, communication is key to the field of statistics, and to science in general. Visualization is at the core of communicating complex ideas, providing a window of insight into the world of data. Today, much communication is online, and we must learn how to leverage online data and network tools to advance in the field and to do our work effectively.

Communication & Writing

Many of the links above have communication resources.

Visualization & Graphics

Visualization is the key to quick insights with data.

UW-Madison Network Information

Statistics is a department embedded in the UW infrastructure. Much of our system, including email and backup, is coordinated with the Computer Systems Lab (CSL), but the wiring and wireless infrastructure is maintained by the campus. As such, it is important to learn about our system, the CSL system, and the campus systems.

Linux Operating System & bash Shell

Linux is the "operating system", the system that organizes work done on many computers, including the main UW Statistics machines. When you type instructions, or commands, on Linux, you do this with a "shell", which has a language structure worth learning. The primary shell for Linux is the Bourne-again shell (bash), written by Brian Fox and named humorously after Stephen Bourne, the author of the earlier Unix Bourne shell (sh).

Going Beyond R

R is one of many languages and other electronic tools of value to statisticians.

Data Repositories

We have a local directory /p/stat/Data with data sets from Devore's "Engineering Statistics" (Devore, used in Statistics 312), Box, Hunter & Hunter's "Statistics for Experimenters" (BHH, used in Statistics 424), and Milliken & Johnson's "Analysis of Messy Data" (MJ); see also Yandell's "Practical Data Analysis". The Devore directory has both portable Minitab worksheets and system-specific worksheets. It is useful to consult StatLib or the Virtual Library: Statistics for the official lists of datasets maintained by the statistics community. Also, the Internet Scout Toolkit is an excellent source for datasets from many disciplines and organizations.