By Allison Campbell-Jensen
Librarians, data curators, and colleagues from around the country — from Duke University and Pennsylvania State University to Pacific Northwest National Laboratory and University of Nevada, Reno — recently gathered in Wilson Library’s Research Collaboration Studio for two days of energized exchanges and mini-seminars with other experts in their field of data curation. The lively event was hosted by the Data Curation Network, which is based at the University of Minnesota Libraries.
A small group who wanted to learn about Stata and how they might use it to manage statistical data climbed onto high stools around a computer screen dominating the tongue of a tabletop.
One asked: “This looks like a bar. Are we doing shots?” No, Carlos Ramirez-Reyez, we are not — unless you count cups of coffee, doughnuts, and energy bars that fuel the workshop’s participants at 9:30 in the morning.
This reporter sat in on sections of a recent Specialized Data Curation Workshop hosted by the DCN. For those who sense a boundary between data-reliant research and scholarly studies that rely on narrative approaches, this was like dropping into a new region of a familiar country. One may know the language, but the natives seem to speak in a different, shared dialect.
Don’t worry: They will translate.
Why curate data, anyway?
The data to which these folks apply their expertise likely has implications beyond the single research project for which it was gathered.
According to the DCN: “Data curation enables data discovery and retrieval, maintains data quality, adds value, and provides for re-use over time through activities including authentication, archiving, metadata creation, digital preservation, and transformation. Data curators collaborate with researchers to share data ethically and in ways that are findable, accessible, interoperable and reusable (FAIR).”
In other words, data curators help empower researchers of all sorts to share data that is more understandable and reusable.
Researchers, librarians, funders, and data curators also have been preparing for the federal National Institutes of Health Data Management & Sharing Policy (DMSP), which goes into effect Jan. 25, 2023. All NIH research proposals generating scientific data will be required to have a data management and sharing plan. The Libraries’ Research Data Services offers numerous resources and learning opportunities for the University campus community about making data not only open to all but also understandable and interoperable.
Ope! Interoperable means that a variety of software tools or programs can be used to access the raw data that underlies a study’s conclusions. And open is the direction in which almost all research data is heading, which means content will be available without having to pay for access.
“Learning from each other is key,” said one participant in a discussion led by Libraries’ Wanda Marsolek. Music is the grounding of Marsolek’s life. They developed a playlist by surveying data curators primarily from North America. The resulting music selections for focused work are not restricted to data curators: Listen in on YouTube.
Which data management tool to use?
Sophia Lafferty-Hess, Senior Research Data Management Consultant at Duke University Libraries and a workshop instructor, has created a test example of how to use Stata tools to analyze a statistical study. She says, “Having this analysis ready to use is great.” A data curator could export the data to a CSV (comma-separated values) file, which saves it in a tabular format. And be sure to export the codebook created for each study, she reminded them, as that may contain clues about such problems as missing codes for numeric values.
To concoct an example: If 1=female and 2=male, what does 3 mean, if it has not been defined in the codebook? Maybe non-binary or not answered? The data curator should consult the researcher to find out.
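A check like this can be sketched in a few lines of Python. This is only an illustration, not the workshop's method: the variable name `sex`, the sample rows, and the helper `find_undefined_codes` are all hypothetical, and the codebook is assumed to be a simple mapping from each code to its label, following the concocted example.

```python
from collections import Counter

# Hypothetical codebook for one variable, following the concocted example:
# only codes 1 and 2 are defined.
CODEBOOK = {"sex": {1: "female", 2: "male"}}

# Hypothetical rows, as they might look after a dataset is exported to CSV.
ROWS = [
    {"id": 101, "sex": 1},
    {"id": 102, "sex": 2},
    {"id": 103, "sex": 3},
    {"id": 104, "sex": 3},
]


def find_undefined_codes(rows, codebook):
    """Count values that appear in the data but have no label in the codebook."""
    undefined = {}
    for variable, labels in codebook.items():
        counts = Counter(
            row[variable] for row in rows if row[variable] not in labels
        )
        if counts:
            undefined[variable] = dict(counts)
    return undefined


if __name__ == "__main__":
    for variable, counts in find_undefined_codes(ROWS, CODEBOOK).items():
        for code, n in sorted(counts.items()):
            print(f"{variable}: code {code} appears {n} time(s) but is not in the codebook")
```

The script only flags the undefined code. What the code actually means is still a question for the researcher.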
Still, Lafferty-Hess has greater sympathy for researchers, after sitting in their seats and taking a half-week to prepare this test example. While Stata is used by many economists, “R is becoming a pretty popular platform for a lot of scientists.” Lafferty-Hess adds: “If I were going to learn a new program, it would be R.” A big if. And R? Oh. It’s another programming language.
Data curators’ human faces
This late October workshop was convened by the Data Curation Network (DCN), a growing consortium of 17 research institutions and academic libraries hosted by the University of Minnesota Libraries since its inception in 2018.
Mikala Narlock, DCN Director at the U of M Libraries; Shanda Hunt, Data Curation Specialist and Public Health Librarian; and Wanda Marsolek, Engineering Liaison and Data Curation Librarian, were among the local hosts.
The Data Curation Network, which is member funded but was originally supported by the Alfred P. Sloan Foundation, has as its motto: Ethical. Reusable. Better.
Besides offering education for novices in the data curator role and for those who wish to hone their skills, DCN provides a platform to share issues, data challenges, and solutions among the coterie of professionals who strive to make researchers’ data available to others.
Data curators need a record or log of what they have done while wrangling data into its most accessible version. This allows future scholars to better understand how the data was created, compiled, and made accessible, which maximizes the reusability of the dataset.
Yet the data curators want their equivalent of scribbles in the margin, like questions to themselves, to remain private in the end. Airtable was recommended by Shannon Sheridan, from Seattle. “It’s online, so it’s a collaborative tool.”
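For curators who keep such a log in a script instead of an online tool, the idea might look something like the following Python sketch. It is an assumption-laden illustration, not a DCN standard: the field names, the `private` flag, and the helpers `log_entry` and `public_log` are hypothetical, and the sample entries are invented.

```python
import json
from datetime import datetime, timezone


def log_entry(action, detail, private=False, when=None):
    """Build one record describing a curation step or a note to self."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "action": action,
        "detail": detail,
        "private": private,  # hypothetical flag: True keeps the entry out of the shared log
    }


def public_log(entries):
    """Return the shareable entries as text, one JSON object per line."""
    shared = [entry for entry in entries if not entry["private"]]
    return "".join(json.dumps(entry) + "\n" for entry in shared)


if __name__ == "__main__":
    # Invented sample entries: one step to share, one scribble in the margin.
    entries = [
        log_entry("export", "Saved the dataset as CSV alongside its codebook"),
        log_entry("question", "What does code 3 mean? Ask the researcher.", private=True),
    ]
    print(public_log(entries), end="")
```

Only the first entry reaches the shared log here. The question to self stays in the curator's own records.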
One person in a group led by Marsolek comments: “It’s really nice to have this support system. … [and to know] ‘OK, I’m not alone.’”
The variety of approaches to data curation, the data management tools chosen, the different styles of individuals — recognizing all these particularly human choices in working with data benefits all at the workshop, says Marsolek.
Adds Narlock: “We’re all doing the best we can.”
“The storytelling — that’s what we are here for,” says Marsolek — in whichever language, platform, or software.