UW-Madison’s Relationships with Data


This document is in response to a request for discussion with members of the Administrative Excellence Data Team (Phase 3) as part of their "ongoing effort to learn more about campus needs regarding the use of data center services. Our intent is to build upon the needs and requirements collected in Spring 2012 by members of the Phase 2 team. Data collected will inform the team as we follow through with our charge – designing and implementing an enterprise-wide data center service model in 2013." [in email from Mark Sweet]

Data center services and data center aggregation, as concepts, capture just a small part of what data means to a major university. Certainly, we need massive data storage, secure backups and multi-level curation. However, a comprehensive approach must address collaboration and analytics, which span technology, design and human communications. People who access data span the full range of experience. We need appropriate tools and expertise infrastructure for experts to excel, for novices to strengthen their skills, and for all to work efficiently. Below are thoughts about different aspects of the relationship of UW-Madison with data.

Administrative Excellence

Regular operations of units such as departments and programs require many data sources that are still not mutually compatible, despite best efforts. Staff must access multiple interfaces, and the primary mode of action is pull-down menus to do one thing at a time. Sometimes, improvements that provide browse modes (one line per course, for instance) would save substantial time. Staff need to share much information with faculty leaders (chairs and directors, and committees), but the normal mode is to prepare spreadsheets and documents and either email them or hand-place them on a (hopefully secure) website.

The campus knowledge-base is an improvement, where information is beginning to be organized. If this can utilize authentication so that individuals can drill down to more sensitive material as appropriate, then expansion of kb.wisc.edu to unit administrators would seem a wise investment. For instance, a chair wanting to know funding levels of graduate students or the developing plan for courses in the coming year could immediately access this directly through kb.wisc.edu rather than asking a staff member to drop what they are doing, pull out a (hopefully) current copy, and email it. Further, the chair or administrator may want to share plans as they develop with faculty so those individuals can plan better. To complicate matters, decisions and reports often require disparate data gathered by different staff; providing guidance on (easy!) improved database management would be useful.

Security and proper vetting are vital, but they should not be left to each administrative unit to figure out. Rather, they should be part of the infrastructure. A flexible knowledge-base should also be part of the campus infrastructure for all administrative units, not just to use, but to adapt to particular needs.

IT administration across campus is complicated and dispersed. Well-staffed IT groups can create test environments at very low cost (old hardware, software, access to all resources) with minimal implementation time. How would a single data center be cost-effective at deciphering special cases for a campus of 40,000+? There is also concern about introducing more middle-men, or middle-ware. Virtual machines are OK for handling network services (DHCP, DNS, printing, etc.) but computational nodes become more complicated. We would need considerable user-interface development to allow such a data center to cover widely varying department needs.

Research Quality

Top-quality research today requires Internet access to Big Data, both locally and spread globally. Big Data are typically not static, but may be ephemeral or may change dynamically. Beyond technology, research quality depends heavily on people, and UW-Madison lacks a robust human infrastructure for research collaboration. This human component is in great demand both within and beyond the university, which explains the tremendous explosion of professional degree programs in analytics. Analytics is the discovery and communication of meaningful patterns in data. Analytics concerns the design, analysis, visualization, and interpretation needed to get from data to information, and from information to wisdom.

The Information Technology Committee, in its 2012 "Elevating Research Computing Cyberinfrastructure at UW-Madison", recommended hiring a distributed team of "experts who can serve as Facilitators to be deployed as needed for getting new research computing projects up and running". UW-Madison has long-standing, sustainable models of Analytics Facilitators in the Cancer Informatics Shared Resource, the Biometry Program and the Social Science Computing Cooperative. An integrated, distributed campus-level network should include analytics experts in technology, design, inference and communication. These people will share methodology and tools among themselves, and collaborate directly with researchers in units across campus.

Centrally managing data storage, encouraging larger scale relational database solutions, and enabling access to high-throughput computing resources, all with attention to disparate needs across units, are now necessary for research quality. Today’s wicked problems require teams of individuals transcending traditional disciplinary boundaries. Traditional publishing is gradually giving way to more interactive modes of communication. Quality research in this setting must be reproducible. Reproducible research relies on robust workflow systems that enable researchers to exactly reproduce results from data, and to transfer knowledge of research methods between individuals. While UW-Madison has made excellent progress on high-throughput computing and is finding solutions for centralizing data, it has fallen behind on workflow systems. Existing workflow systems are less than ideal, but they are coming along, and we should be leaders in their use and development.

Research and administration intersect when considering grant budgets and personnel management. Faculty invest valuable research time (hours to days) identifying and managing personnel categories (RA, postdoc, scientist) and time frames for their team members, in addition to working out direct and indirect costs for grants. Much faculty-staff time is devoted to these interchanges. Some larger departments have developed in-house tools for this exchange; these need to be migrated to all departments and research units. The pi.wisc.edu site provides valuable snapshots of total budget by grant, but does not go far enough. Cayuse has important tools, but much planning goes on outside of grant submission.

Here, and elsewhere, it is important to remember that data extend beyond raw data to include meta-data that detail the provenance and design, processed data for various analyses, analytical methods spanning specific instructions to organized packages, and static and interactive visualizations ranging from tables to graphs to reports.

Educational Innovation

Educational Innovation (EI) is in its infancy, and will likely transform the shape and perception of universities. It begins with transforming our curriculum at all levels. While great strides have been made in putting course information, including course change protocols, online, there are still many roadblocks for departments. The process of planning course schedules for a unit each semester requires faculty and staff to exchange much information, and to organize complicated grids to prevent class overlap and ensure proper sequencing. This is especially complicated when coordinating schedules across departments. Room assignment is a part of this, but simply getting class timings right is daunting.

The admissions process is complicated. G-WIS and G-PAS are quite valuable, but need to be integrated into a "cradle-to-grave" system that allows units to follow students from first contact through advancement. Clearly, access and privacy issues must be addressed with this information, but campus already has some solutions for these aspects.

As we expand continuing studies programs to more non-traditional audiences, we will likely have many more non-traditional students than traditional ones, and many will rarely visit UW-Madison. EI needs technological infrastructure for online and blended instruction, which is being developed quickly for massive open online courses (MOOCs). In addition, successful EI for data will require scalability for working with more people and with Big Data. It will also require substantial investment in infrastructure to transfer pedagogical knowledge on data analytics across disciplines.

However, this is only the beginning. As we better meet the instructional needs of individuals in the context of their work life, and of businesses evolving in their missions in our data-rich society, we will find new ways to collaborate. This in turn will lead to new win-win funding sources and business-to-university collaborations. The UW College of Engineering is well along this path.

Data, UW and Society at Large

UW-Madison should lead thinking in society about making sense from data. Projects such as CHTC, DeepDive, LOCI, and SSEC/CIMSS, among others, show the tremendous potential and the leadership already in place at UW-Madison. How can we efficiently transfer Big Data knowledge to other business realms? We don't know what the next Big Data problems will be, or what analytics methodology will be needed; how can we establish an efficient and scalable infrastructure that enables research excellence and educational innovation for societal good?

The Wisconsin Institutes for Discovery (WID) provides the focal point for innovation between the university and society at large. A proposed extension, WID II: Big Data Analytics, would create a new building where campus units with missions in data analytics would reside, intermingle, and grow. Such a community should span the humanities, arts and sciences. This community would be a catalyst for interchange among analytics staff, a bridge between analytics methods and other research and teaching across campus, and a focus for industry-university relations regarding data analytics. These structures are being created at other institutions, but UW-Madison is far behind.

Departments centrally involved in developing and teaching data analytics methodology include Computer Science (L&S), Statistics (L&S), Library & Information Science (SLIS), Operations & Information Management (Bus), Electrical & Computer Engineering (CoE), and Biostatistics & Medical Informatics (SMPH). Mathematics (L&S) is core to all these disciplines, and should also be involved. Many other units, notably Population Health, Sociology, Economics, and the Social Science Computing Cooperative, have profound involvement in data analytics methodology development.

Analytics is most grounded when focused on team-based projects rather than on methodology in a vacuum. Therefore, one might argue that the current structure of WID with its themes is ideal. However, student and society interest in STEM disciplines is growing, and the core STEM departments on campus are barely able to house the staff they currently have, much less the expected expansion in coming years. Further, new analytics development for Big Data will require large teams of analysts who can easily interact, both online and in person. The challenge will be finding the right balance of physical proximity and dispersion of expertise across campus to meet the needs of the decades ahead.

Rebuttal to Simply Aggregating Data

We fully endorse the aggregation of data in the creation of centers for storage, backup, archiving, access and other aspects of data management. These should become a part of our infrastructure, on a par with lights, phone, and wired and wireless Internet. However, doing this alone misses the points below:
• The funding model should recognize that upgrading data infrastructure enhances productivity at all levels and will reap its rewards in new funding, much of which we cannot yet imagine, including grants, Educational Innovation, and new donors.
• Top-level savings should be reinvested into this infrastructure, including hardware, software and wetware (people); unit-level savings should be guided through education/training toward efficient use of this infrastructure.
• Keep the focus on people, our most valuable resource. The business community knows this of its own workforce. The university excels through its people. Technical and design/analytics support are key to success.
• Aggregated data should be easier to use than the current model or there will be limited buy-in, either through service fees or use of features. Large units will cherry-pick the features they need, but small units will ignore them, leading to the security and confidentiality leaks that we all fear.
• Incremental innovation that can be easily assessed will come from cautious investment in data aggregation. Punctuated innovation, with deeper results that are more difficult to measure, will only emerge from investing in a new culture embracing Big Data Analytics.
The Department of Statistics is making these points because our unique mission concerns data analytics, from design through inference to interpretation. We believe it is in the best interest of the campus to address these important needs of data aggregation in the broader context of why data are collected and how we will use them.

Brian S. Yandell, January 2013