A day in the life of a data curator: the steps, challenges, and rewards of the data review process
Dr Marta Teperek
Head, Research Data Services
Jan van der Heul
4TU.ResearchData is an international data and software repository that holds 8,000+ science, engineering and design datasets and is run by a consortium of technical universities in the Netherlands.
Whilst the technology underpinning 4TU.ResearchData is provided by Figshare, a team of dedicated staff members is responsible for managing and maintaining various aspects of the data repository, which highlights the importance of human infrastructure in supporting researchers with data publication.
Meet our data curator
Jan van der Heul is one of 4TU.ResearchData’s data curators. His role advances the organization's mission and vision of making research datasets published in the repository as findable, accessible, interoperable and reusable (FAIR) as possible.
“The data review process provides an essential service to our community by supporting researchers with the curation, sharing, access, and long-term preservation of their data,” says Jan.
“Every data and software submission is thoroughly reviewed to check the validity of the [meta]data and to ensure quality requirements of the repository are met.”
He explains that proper data curation enables datasets to be more easily found, understood and reused to benefit wider society.
“We’re not just a ‘Dropbox’ for data,” says Jan. “But rather our repository provides an intuitive infrastructure that allows researchers to discover, download and reuse data to avoid duplication of time and effort spent unnecessarily creating new datasets.”
Quality control checks on data
Jan conducts quality control checks on data and software code submissions according to 4TU.ResearchData’s review guidelines. He provides researchers with detailed feedback via email before their submission is accepted and completed.
Checks are first carried out on the data to make sure that files are completely and correctly uploaded and that they adhere to 4TU.ResearchData’s guidance on preferred file formats.
Jan describes scenarios in which researchers need assistance in improving the quality and FAIRness of their data.
“Sometimes, researchers don’t upload their data files but provide links to data stored on their personal computer which we can’t access and could easily be lost. In this case, we request that researchers upload the relevant data files.”
He adds that the choice of file format is also critical to ensure that the data can be reused in the future.
“In the event that researchers upload data in unconventional or proprietary file formats, I ask them to convert them to standard, interoperable, open formats to guarantee their long-term sustainability and reuse.”
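A format check of this kind can be partly automated. The sketch below is only an illustration: the allow-list of extensions is a hypothetical stand-in, not 4TU.ResearchData's actual list of preferred formats, and the function name is invented for this example.

```python
from pathlib import Path

# Hypothetical allow-list for illustration only; the repository's own
# guidance on preferred file formats is the authoritative source.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".json", ".xml", ".nc", ".pdf", ".png"}


def flag_non_preferred(filenames):
    """Return the filenames whose extension is not in the allow-list."""
    flagged = []
    for name in filenames:
        # Compare extensions case-insensitively, so ".CSV" matches ".csv".
        suffix = Path(name).suffix.lower()
        if suffix not in PREFERRED_EXTENSIONS:
            flagged.append(name)
    return flagged


submission = ["results.csv", "model_output.nc", "notes.docx", "raw_data.XLSX"]
print(flag_non_preferred(submission))
```

A flagged file is not necessarily wrong; it is simply a prompt for the curator to ask whether an open equivalent exists.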
Jan also mentions that a large amount of data published in the repository is NetCDF (Network Common Data Form) data, a file format for storing large multidimensional array data and embedded metadata.
He recommends that researchers transfer their NetCDF data to 4TU.ResearchData’s OPeNDAP server.
“The OPeNDAP protocol allows access and analysis of NetCDF data from a remote server without the need to download the data files. This helps to promote data reuse as researchers can inspect the embedded metadata as well as specific ranges, slices, and subsamples of the data,” explains Jan.
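To illustrate what requesting "specific ranges, slices, and subsamples" can look like in practice, here is a minimal sketch of how a DAP2-style request URL is assembled. The server address and variable name are hypothetical placeholders, and the helper functions are invented for this example; real clients and libraries usually build these requests for you.

```python
def opendap_subset_url(dataset_url, variable, slices, response="ascii"):
    """Build a DAP2 request URL for a hyperslab of one variable.

    slices holds one (start, stride, stop) index triple per dimension.
    In DAP2 constraint expressions the stop index is inclusive.
    """
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in slices
    )
    return f"{dataset_url}.{response}?{variable}{hyperslab}"


def opendap_metadata_url(dataset_url, kind="das"):
    """Build the URL for the attribute (das) or structure (dds) response."""
    if kind not in ("das", "dds"):
        raise ValueError("kind must be 'das' or 'dds'")
    return f"{dataset_url}.{kind}"


# Hypothetical dataset location, used purely for illustration.
base = "https://opendap.example.org/data/sample.nc"

# First 10 time steps, every second grid point from index 0 to 4.
print(opendap_subset_url(base, "temperature", [(0, 1, 9), (0, 2, 4)]))
print(opendap_metadata_url(base, "das"))
```

Only the requested slice travels over the network, which is why the full file never has to be downloaded. Some HTTP clients may need the square brackets percent-encoded.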
The data file contents and structure are checked to make sure information is clear, understandable and aligns with 4TU.ResearchData’s data collection policy.
“I advise that datasets are deposited in English as the universal language and that they don’t contain ambiguous keyboard characters. I also ensure tabular datasets are formatted with legible headers and labels,” says Jan.
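Some of these header checks lend themselves to a quick script. The following sketch assumes a CSV file whose first row is the header, and it treats non-ASCII characters as a rough proxy for "ambiguous" characters, which is an assumption of this example rather than a rule stated by the repository.

```python
import csv
import io


def header_problems(csv_text):
    """Return human-readable problems found in the first (header) row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    problems = []
    seen = set()
    for position, label in enumerate(header, start=1):
        cleaned = label.strip()
        if not cleaned:
            problems.append(f"column {position}: empty header")
        elif not cleaned.isascii():
            problems.append(f"column {position}: non-ASCII characters in {cleaned!r}")
        # Duplicate labels are compared case-insensitively.
        if cleaned and cleaned.lower() in seen:
            problems.append(f"column {position}: duplicate header {cleaned!r}")
        seen.add(cleaned.lower())
    return problems


sample = "time_s,temp_C,,time_s\n0,21.5,1,0\n"
for problem in header_problems(sample):
    print(problem)
```

A script like this only catches mechanical issues. Whether a label is actually understandable to another researcher still needs a human reader.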
Another essential aspect of Jan’s work is to prevent researchers from publishing data containing personally identifiable, sensitive, or inappropriate information.
“In the past, I’ve reviewed medical datasets that contain highly sensitive patient data, including patient photographs, names, and diagnoses. In cases such as this, I advise that researchers anonymize or pseudonymize their data and have informed consent to share their data before openly publishing in our repository.”
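As a rough illustration of what pseudonymization can involve, the sketch below replaces a direct identifier with a keyed hash and drops a field that cannot be safely shared. It shows one common technique (HMAC-SHA256), not a procedure prescribed by 4TU.ResearchData, and the field names `patient_name` and `photo` are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Map an identifier to a stable pseudonym using a keyed hash."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return "P-" + digest.hexdigest()[:12]


def pseudonymize_records(records, secret_key,
                         id_field="patient_name", drop_fields=("photo",)):
    """Return copies of the records with the identifier replaced and
    the listed fields removed. The input records are left unchanged."""
    cleaned = []
    for record in records:
        kept = {k: v for k, v in record.items() if k not in drop_fields}
        if id_field in kept:
            kept[id_field] = pseudonymize(str(kept[id_field]), secret_key)
        cleaned.append(kept)
    return cleaned


# The key must stay private and must not be published with the dataset.
key = b"replace-with-a-private-key"
records = [
    {"patient_name": "Alice Example", "photo": "a.jpg", "visit": 1},
    {"patient_name": "Bob Example", "photo": "b.jpg", "visit": 1},
    {"patient_name": "Alice Example", "photo": "a2.jpg", "visit": 2},
]
for row in pseudonymize_records(records, key):
    print(row)
```

The same person receives the same pseudonym across records, so repeated visits remain linkable. Remaining fields can still identify someone indirectly, so a step like this supports rather than replaces a proper review and informed consent.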
Aside from assessing data files, Jan makes suggestions to help researchers enrich their metadata, which improves the discoverability, reusability, and reproducibility of their research.
“I look for peer-reviewed journal publications that accompany the dataset, check if the researcher has previously published datasets, and explore online resources, such as Scopus and Web of Science to collect relevant metadata. From this, I can suggest a more descriptive title, subject categories and keywords to describe the dataset. Sometimes it’s possible to add information about the organization that contributed to the creation of the dataset, the funding organization, and authorship,” he says.
As part of the metadata curation process, Jan also advises that authors and co-authors add their respective ORCID iDs: unique, persistent identifiers that distinguish researchers with the same name and ensure the correct attribution of the dataset.
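An ORCID iD carries a built-in check character (ISO 7064 MOD 11-2), so a mistyped iD can often be caught before it reaches the metadata record. The sketch below checks only the format and checksum. It does not confirm that the iD is registered or that it belongs to the named author, and the function names are chosen for this example.

```python
def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the 16-character format and the final check character."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact.isascii():
        return False
    if not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15]


# 0000-0002-1825-0097 is the iD of ORCID's fictitious test researcher.
print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False
```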
To improve reproducibility, the metadata record should contain a description detailing the context and contents of the dataset.
“A good description provides information about the purpose and type of study, data collection methods, and any legal and ethical requirements. I recommend that researchers upload a README file for each dataset. This is a text or PDF file that provides data-specific information such as parameters, variables, column headings, units, codes, and symbols used,” explains Jan.
4TU.ResearchData offers researchers the option of linking additional resources to their dataset, such as peer-reviewed journal publications, supporting datasets, and GitHub accounts for software development. Jan dedicates time to validating these additional resources by checking that the links have been entered as full, valid URLs that resolve to the desired location.
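The first half of that check, whether a link is a full URL at all, can be sketched with Python's standard library. The function name is invented for this example, and the sketch deliberately stops short of testing whether a link resolves, since that requires a network request and a human judgment about whether the destination is the intended one.

```python
from urllib.parse import urlparse


def is_full_web_url(link):
    """True if the link has an http or https scheme and a host name."""
    parts = urlparse(link.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Illustrative inputs only; none of these point at real submissions.
candidates = [
    "https://github.com/example/repo",
    "github.com/example/repo",
    "file:///home/me/data.csv",
    "C:\\Users\\me\\data.csv",
]
for link in candidates:
    print(link, is_full_web_url(link))
```

Links without a scheme and paths to a personal computer both fail the check, which also covers the earlier case of data that reviewers cannot access.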
Finally, he checks that a license has been selected to specify the reuse requirements of data and software and suggests suitable open licenses when necessary. In addition, if a dataset is published under embargo, he confirms this choice with researchers and advises that they provide a rationale for their choice.
Challenges and rewards
Jan reveals that the main challenge of the review process is the time required to review datasets when metadata fields are only partially completed.
“Usually, I review datasets within 24 hours of submission but incomplete submissions take more time. Then, once we’ve made suggestions we have to wait for researchers to make amendments to their submission before we can publish.”
Despite this difficulty, Jan explains that the process is highly rewarding.
“My personal contact with researchers guides them through the process and helps them learn how to publish better quality FAIR data. It’s gratifying to receive their positive feedback once I’ve helped them succeed in publishing their data.”
Read more about Jan and his colleagues' efforts on 4TU.ResearchData’s testimonials page.