Big data sets are increasingly central to the field of statistics, and touch on all three themes. Quality research with big data requires adopting basic lab skills of data management and analysis. Big data is rarely amenable to pre-packaged analysis; researchers need to learn how to derive models and methods for big data sets, paying attention to their complicated structure. Getting things done in a timely and reproducible way demands that researchers develop advanced algorithmic skills to implement data analysis methods efficiently. Big data encompasses several important research areas:
- Very large datasets, from gigabytes to terabytes to petabytes and even exabytes. Issues include storage, retrieval, reuse, compression, and harvesting from high-volume data.
- Very large simulations, with millions or billions of repetitions of complicated algorithms. These studies often involve double re-sampling of data to study properties. Some studies can be easily parallelized, while others involve complicated dependencies over multiple iterations.
- Interconnectivity of multiple datasets across the internet, many actively updated by user groups. Access tools vary in their degree of openness and standardization, making connectivity challenging.
- Pipelines or workflows that allow somewhat standardized analyses to be conducted through a series of steps. These range from R (or Matlab) packages to online tools.
- Large software package management. Software is in fact another form of data. Composed of many interconnecting pieces, it requires careful management to remain current and useful.
- Visualization of big data. This includes static images and dynamic “Google Maps”-style tools that allow users to drill down or up at will.
Big data requires large compute engines and efficient data storage and retrieval. High-throughput (HT) problems parallelize into many small, independent pieces that run on many processors using Condor and the Open Science Grid. High-performance (HP) problems have complicated dependencies that require tight coupling of fast processors. Many statistical models now involve highly articulated, hierarchical data structures with widely disparate data forms and Monte Carlo re-sampling methods, and so blend HT and HP computing needs. Some studies require extensive simulations but very little data, while other data-driven studies draw in complicated ways on highly structured big data. The latter need particular attention to rapid data access, ideally through a sophisticated data management system (Netezza, Oracle) for raw data, processed data, metadata, and software algorithms.
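The HT pattern can be illustrated by splitting independent simulation replicates into chunks and mapping them across workers. In the sketch below, a Python thread pool stands in for a real HT scheduler such as Condor; the chunk sizes, seeds, and toy statistic are assumptions for illustration only:

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

def run_chunk(task):
    """One independent work unit: (seed, n_reps) -> list of replicate statistics.

    On a real HT system, each chunk would be a separate job submitted to the
    scheduler; a local thread pool stands in for that here.
    """
    seed, n_reps = task
    rng = random.Random(seed)  # per-chunk seed keeps chunks independent
    return [statistics.fmean(rng.gauss(0, 1) for _ in range(100))
            for _ in range(n_reps)]

# 4 chunks x 250 replicates = 1000 independent simulation replicates
tasks = [(seed, 250) for seed in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [r for chunk in pool.map(run_chunk, tasks) for r in chunk]

grand_mean = statistics.fmean(results)
```

Because no chunk depends on any other, the work scales out to as many processors as are available; HP problems lack exactly this independence, which is why they need tightly coupled processors instead.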
Big data research teams need new levels of technical assistance, involving intense collaboration between faculty and staff on how best to use computing resources to ensure long-term access to and reliability of data and algorithms. With access comes the need to address data security issues, in terms of privacy, confidentiality, protection of unpublished data, etc. All these considerations require attention to best practices of data integrity and reproducibility of research.
Today, we translate methods into accessible data tools, particularly for visualization in the context of analytical models. While statisticians encounter a wide range of big data problems as “users” of large compute engines and databases, they are increasingly driving the development of new methods that stretch the capabilities of these technologies. Because statisticians work closely with researchers across multiple disciplines, they can anticipate problems and gaps in current technology, and hence can help lead the direction of technology innovation. It behooves the University of Wisconsin-Madison to include the Statistics Department early in conversations about big data, and to ensure the department has the resources to contribute effectively to these pressing problems.