Open Source and Open Data: Collaboration is Key
As the world has risen to the challenge of the COVID-19 pandemic, researchers and the public alike have developed a greater appreciation for accurate and reliable open data sources. From the National Institutes of Health’s Open-Access Data and Computational Resources to Address COVID-19 to the local data sources that inform our nightly news updates, open data have become a more important force in our lives than ever before. People have a stake in data and, increasingly, people are contributing their time and getting involved in developing the tools that help researchers, and the world at large, interact with that data. One way we are achieving this locally at Northwestern University is through participation in open source data repository development.
InvenioRDM: an open source platform
The open source coding community is responsible for dozens of software solutions crucial to our daily lives, including web browsers, content platforms, and operating systems. The open source Python programming language, with its structured, general-purpose, object-oriented base, serves as the basis for the development of the new open source, turn-key repository InvenioRDM, currently being developed by an international team of highly engaged collaborators coordinated by CERN, the European Organization for Nuclear Research. While a version of the Invenio framework has existed for over 20 years, its modernization began in 2018, with the goal of making the institutional repository modular, scalable, customizable, and ultimately more accessible.
From the beginning of this process, the InvenioRDM product managers have worked closely with an international team of partners including:
• Northwestern, Caltech, and NYU in the US
• Various European universities and organizations
• Eko Konnect — a cluster of the Nigerian Research and Education Network (NgREN)
• The Turkish Academic Network and Information Center
• The National Institute of Informatics of Japan
In addition to the partners, dozens of users from around the globe have independently installed versions of the repository software and launched them at their own institutions. Both in daily development and in distributed user support, the open source InvenioRDM team has worked hands-on and collaboratively to help peers stand up the software and support open resources and data at their institutions.
Metadata, DOIs, and controlled access
Though the coding team is distributed, we have prioritized agreement on a base metadata model that is compliant with data sharing mandates from the European Union and increasing mandates from US funding agencies, while simultaneously maximizing findability of data for users of the repository. Inspired by the open and participatory nature of the project, we instituted community-based project meetings tailored for non-technical but highly involved users of the repository at the partner institutions. These users have provided significant subject matter expertise as the key users of metadata while either cataloging their own deposits or searching for deposited data from other researchers.
Through these conversations, and bolstered by the project’s use of DataCite to mint unique digital object identifiers (DOIs), the partners agreed upon the use of the DataCite schema for InvenioRDM’s data model. DataCite also supports data discoverability by hosting DataCite Commons, a free online tool through which users can discover the minimum required metadata that is provided with each resource that registers for a DataCite DOI. Taking these curation and accessibility conversations a step further, the InvenioRDM community’s Metadata Interest Group committed to the use of the COAR Access Rights Controlled Vocabulary, which has allowed us to tag each data record with one clear designation: Open Access, Embargoed, Metadata Only, or Restricted.
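To make the four access designations concrete, here is a minimal Python sketch of how a record might carry one of them and how a system could decide whether its files are publicly downloadable. This is an illustration only, not InvenioRDM's actual code: the `Record` class, the `embargo_until` field, and the `files_public` function are hypothetical names, and the rule that embargoed files open up on the embargo date is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class AccessRight(Enum):
    """The four designations named in the text (labels only)."""
    OPEN = "Open Access"
    EMBARGOED = "Embargoed"
    METADATA_ONLY = "Metadata Only"
    RESTRICTED = "Restricted"


@dataclass
class Record:
    """Hypothetical, simplified repository record."""
    title: str
    access: AccessRight
    embargo_until: Optional[date] = None  # only meaningful for EMBARGOED


def files_public(record: Record, today: date) -> bool:
    """Return True if the record's files may be downloaded by anyone.

    Assumption for this sketch: embargoed files become public on the
    embargo date; Metadata Only and Restricted files never do.
    """
    if record.access is AccessRight.OPEN:
        return True
    if record.access is AccessRight.EMBARGOED and record.embargo_until is not None:
        return today >= record.embargo_until
    return False
```

In this sketch the metadata of every record stays discoverable regardless of its designation; only access to the files varies.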
Hosting and disseminating institutional repository records designated as Metadata Only was a key motivating factor in Northwestern’s commitment to the InvenioRDM open source repository as this feature helps to serve the needs of local researchers who wish to make their datasets discoverable, regardless of the file deposit location. Librarians at Galter Health Sciences Library & Learning Center work to preserve and disseminate the scholarly output and data of biomedical researchers while respecting the privacy restrictions that must be upheld for datasets containing personally identifiable information (PII), a common occurrence in medical datasets. The Metadata Only record serves this need as it enables robust description and active curation of medical datasets through a vetted standard that maps well to Dublin Core and Schema.org, among others, while not requiring deposits of the datasets themselves. These Metadata Only records are compliant with funders’ data sharing requirements, such as the recently updated requirements of the National Institutes of Health, going into full effect in 2023, while enabling data sharing upon request through Data Use Agreements, thus protecting patient privacy.
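As a rough illustration of what a Metadata Only record might look like, the Python sketch below describes a dataset without attaching any files and checks that the properties DataCite treats as mandatory (identifier, creator, title, publisher, publication year, and resource type) are present. The dictionary keys, the `missing_fields` helper, and all of the example values, including the DOI and the contact note, are hypothetical placeholders rather than InvenioRDM's real field names or a real deposit.

```python
# Simplified stand-ins for DataCite's mandatory properties.
# The key names are assumptions made for this sketch.
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "title",
    "publisher",
    "publication_year",
    "resource_type",
)


def missing_fields(record: dict) -> list:
    """List the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# A hypothetical Metadata Only record: rich description, no files.
example_record = {
    "identifier": "10.5555/example-dataset",  # placeholder DOI
    "creators": ["Example, Researcher"],
    "title": "Example clinical dataset",
    "publisher": "Example University",
    "publication_year": 2023,
    "resource_type": "Dataset",
    "access_right": "Metadata Only",
    "files": [],  # the data stay where they are; nothing is deposited
    "access_note": "Available on request under a Data Use Agreement.",
}

print(missing_fields(example_record))
```

The point of the sketch is that the record can be complete and citable while its `files` list stays empty, which is how sensitive datasets can be made discoverable without being deposited.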
Collaborations and repository best practices
The collaborative nature of the repository work has inspired and motivated the team and continues to do so as we explore additional metadata and other enhancements. Open source tools have a critical role to play in the data sharing ecosystem, encouraging collaborations between developers, librarians, and subject matter experts. Through sharing ideas and working together to design system improvements, experts from each of these professions learn from their peers and find new skills and perspectives to bring to their own work. As librarians and researchers work to test repository improvements made by developers, each group learns from the others about workflows, usability, controlled vocabularies, and data and metadata standards. Each group comes away with a greater appreciation of their role in the lifecycle of data preservation and with a clearer idea of what they can do to make data accessible and discoverable.
Repositories of all types have served as a guide and inspiration in this process, demonstrating how data can be effectively curated and preserved for any field of research. The best practices these repositories have incorporated, such as vetted schemas and controlled vocabularies, embedded file viewers, comprehensive deposit agreements, and adoption of Creative Commons licenses, have set standards toward which all new development efforts strive. The open source InvenioRDM project continues to work towards these goals while acknowledging and supporting our project partners across the globe and advancing open data cataloging and discoverability every step of the way.