June 1, 2021
Fernando Rios, Chun Ly
In this webinar, Chun Ly and Fernando Rios will discuss the data curation workflow that has been implemented at the University of Arizona for their institutional instance of Figshare, ReDATA, available at https://arizona.figshare.com/. They will discuss the best practices they followed for curation and software development, the strengths and weaknesses of doing curation in the Cloud, and how they have used the Figshare API for a semi-automated workflow.
Please note that the transcript was generated with software and may not be entirely correct.
0:05
Thank you for joining us this morning, afternoon, evening, wherever you are, at the webinar today. My name is Megan Hardeman; if I haven't met you before, I'm Head of Engagement at Figshare.
0:19
And I'm joined today by a couple of people from the University of Arizona, one of our Figshare for Institutions customers, talking today about data curation in the cloud using the Figshare API.
0:36
And if you have any questions, please feel free to put them in the questions area or in the chat, and we will be able to see them no matter where you put them. As a brief introduction, our guest speakers today are Chun Ly, Research Data Systems Integration Specialist, and Fernando Rios, Research Data Management Specialist, both from the University of Arizona.
1:07
Um, I think that's all of my administrative introduction at the beginning, so I'll hand over ...
1:18
Thank you, that was a very nice introduction, Megan.
1:22
Good day everyone, my name is Chun Ly … and today we're going to be talking about our experience with using the Figshare API for data curation. And of course, data curation itself is a very complex issue for many different organizations.
1:43
So, keep in mind that what we do may differ from what your needs are.
1:47
But nevertheless, we wanted to share, for my hour, our story of how we do data curation, particularly in the cloud.
1:54
And hopefully, some of you can gain from the insights that we've learned over the past year.
2:01
You're welcome to follow us. We are on social media.
2:04
Our organization, our data repository, is at UA ReDATA on Twitter and Instagram, and you can also follow us individually at ... and ….
2:18
So, just to give you a very brief overview of what I'll be discussing: I'll start with providing an overview of our repository, which is called ReDATA, and where it stands currently.
2:31
I'll provide a bit of an overview of what we mean by data curation.
2:38
I'll talk a bit about the ... process that we have with researchers in terms of research data publishing.
2:45
I'll go into a little bit more of the nitty-gritty detail of our curation workflow and some of the technical aspects.
2:50
What I want to focus on for this talk is actually a mix of both introductory information, as well as a bit more on the technical side. I know some people are interested in how we utilize the Figshare API for our data curation, so we'll get into that a little bit more.
3:05
I will also discuss some of the paths that we have identified moving forward.
3:09
And hopefully there'll be enough time afterwards to answer any questions.
3:12
That you have.
3:18
So you might have noticed on the first slide, we had this term ReDATA. What is ReDATA, you might be wondering? As the name suggests, it's the University of Arizona's institutional research data repository.
3:32
And in particular, it's a free service that we make available to our university through the libraries.
3:39
And our focus is on what we consider publishing and archiving non-traditional research products.
3:45
So that includes data, software, code, and other forms of media.
3:50
For example, presentations. Essentially, what we're focusing on is publishing final research outputs.
3:58
And to give you some perspective: this is our data repository.
4:02
We also have what we call a sister repository, the campus repository, which focuses more on standalone documents, manuscripts, and theses and dissertations.
4:12
So the University of Arizona Libraries operates two repositories for research outputs: one on the publication side, as well as one on the data side.
4:25
And the data repository came about more recently.
4:30
What was identified across campus was that there was a specific need for the university to make available and steward research data.
4:40
So the Research Data Repository, ReDATA, came about through funding from our Provost's office. And much of this is meant to demonstrate our commitment towards meeting the mandates that federal and state agencies, as well as journals and publishers, have towards data sharing.
5:01
Particularly, what we're focusing on is making available digital object identifiers, otherwise known as DOIs, and on the long-term stewardship of these research data outputs.
5:12
And something that we'll talk about throughout my presentation is that we're trying to move towards open science at the university. So this is meant in many ways to support open science, reproducibility, and following guidelines that are well known in open science, such as FAIR.
5:32
Just kinda give you some perspective. We're a fairly new data repository.
5:36
We launched last year, during Open Access Week, to the entire campus.
5:43
This is actually something that we aimed for even before the pandemic began: to launch during the last quarter of 2020.
5:54
And when we launched, what we decided to do was make it available to all University of Arizona researchers.
6:00
So that incorporates accounts for about 77,000 individuals that have access to the service.
6:07
And it's not just faculty, staff, and researchers that have access; we also wanted to ensure that young researchers, such as graduate and undergraduate students, have access to the service. So that was our goal from the beginning.
6:21
And in many ways, this is about allowing more people to enter the world of STEM and open science research.
6:31
But prior to us actually going live to our university in October, our Figshare for Institutions instance was launched in early March of 2020.
6:42
And what that allowed us to do was to work with what we called early adopters: researchers that had data that they wanted to publish. They were very much on the path towards open science, and they wanted to share their data with other researchers and with the community in general.
7:01
And I'll talk a bit about this more throughout my talk, but in many ways, the work that we did with our early adopters helped shape how we do research data curation. It informed us of the pitfalls in data curation, how we can improve on certain areas, and which aspects to automate.
7:19
And we've been growing very rapidly. At the end of 2020, we had about 24 deposits, much of it from working with early adopters and other organizations.
7:29
We've now more than quadrupled that number,
7:34
passing over 100 deposits as of this morning.
7:37
So, nevertheless, this is an opportunity that we see, as we're very much on a course to continue to grow and to continue to encourage open science and sharing.
7:50
Today's focus will be on data curation, but I wanted to step aside a little bit and talk about how we managed accounts when we were deploying our service.
8:00
And part of the reason why I want to discuss this is that account management can differ from one institution to another when maintaining a data repository.
8:09
And our approach might be different, and this might be a benefit to other people, particularly if any new organizations are becoming members or are starting their own Figshare for Institutions membership.
8:21
This might be a benefit.
8:24
And what we wanted to do when we launched last year was ensure that, when we manage researchers' accounts on our instance, that management would be as minimal as possible.
8:36
And particularly, what we wanted to do was allow users to search for data based on research teams or groups.
8:48
Figshare itself has the ability to organize data in such a manner.
8:52
And so rather than us doing it manually, we wanted to kind of automate that process.
8:57
And in order to do this, what we realized was that we could actually use the university directory service that we have of all our members at the university.
9:05
In particular, we take information about where they stand in terms of their classification: researcher, staff, faculty, undergraduate, or graduate student.
9:15
As well as their affiliation, what department they're associated with.
9:18
And that allows us to associate the specific research groups that are commonly utilized by a given researcher,
9:26
in every discipline area, on the data repository itself.
9:32
Another aspect of our account management is ORCID integration, and that's something that's already available in Figshare: as a user, when they log in, they can connect their ORCID iD directly. What we also wanted to ensure was to enable that connection directly via systems that we have.
9:53
And this is in part to ensure that the ORCID metadata are stored with each dataset.
9:59
For users who are part of the research system, it enables the researchers to get credit for the data publication.
10:08
It increases findability.
10:10
And ultimately, this is part of a larger ecosystem for research at the university, where it is a very central focus for us.
10:23
So today's topic, again, is research data curation. So what do we mean by research data curation itself?
10:30
And this can vary from one organization to another.
10:33
This is how we define it: it's the practice of examining research outputs to ensure that data quality and management are meeting guidelines.
10:43
And in particular, when we envisioned this data repository, what we wanted to ensure was that it wasn't what we call a data dump. It isn't similar to other data storage services, like Google Drive and Box.
10:56
This is a repository for final, published research outputs and works.
11:02
We wanted to make sure that we continue to curate the data so that it's actually reusable by many other organizations.
11:09
Or, let me rephrase that: it'll be usable by other users that are interested in exploring the research data that we have.
11:16
And that's kind of why we also call it ReDATA.
11:19
It's the idea of ensuring the reusability of the data that's available on the archive.
11:25
Another reason why we do data curation is to ensure that we comply with institutional policies.
11:32
So, these are policies that pertain to the release of sensitive and restricted data, which cannot be released; it pertains to research that involves human subjects and animal care.
11:44
And, in particular, since we are occupying lands of Native Americans, we ensure that for any information or data based on knowledge from tribal resources, individuals, or peoples, we work with the tribal office for consultation before we publish.
12:06
So just to kinda give you an understanding of how the workflows are for our researchers, as well as us.
12:12
We follow a process of basically self-deposit followed by data curation. Essentially, the researchers will upload the data and submit it for review to us.
12:22
This is very much similar for many users: if they're using, for example, the public Figshare instance, they would just upload the data itself and submit it for publication.
12:35
So, the process is three steps for these users.
12:39
The first of which is they still have to prepare the dataset, much like any form of publication.
12:44
And for our users, what we do is provide them with a checklist so that, in case they forget something, they can just follow the rubrics and make sure that the data is ready for publication.
12:55
We also provide other guides and tutorials to our users, particularly examples of how to deposit data, with more detailed steps.
13:08
More often, our users will utilize the user interface that's available from our platform at arizona.figshare.com.
13:17
Essentially, they'll just drag and drop or search for the files and upload the content through the UI.
13:25
And when they've completed the data upload and provided the set of minimal metadata that describes their data, they can submit it for the curatorial review process.
13:38
After they submit, a member of our team, a data curator, will contact them to acknowledge that we received their deposit.
13:49
And what we ask users to do is fill out two forms, the first of which is the deposit agreement, which outlines the terms of our policies.
13:59
It outlines what our obligations are as a data repository, as well as the obligations that we expect of the depositor, the researcher that's publishing their data.
14:12
The second form is the readme form, and I'll talk about this later on in my talk. I'll start with the deposit agreement first.
14:20
And this ties more into certain aspects of data curation.
14:25
So, with the deposit agreement, as I mentioned, we gather information about the researchers, and particularly what we want to ensure is compliance with university guidelines.
14:36
This is required documentation; we cannot begin the data curation without it.
14:46
So, users have to answer a number of questions that ensure compliance with the university guidelines and policies.
14:53
So, whether or not any of the data is of a sensitive or restricted nature, for example.
14:59
And this allows us to understand each individual dataset in a more granular fashion. For datasets that contain either sensitive data or de-identified data pertaining to human subjects or other aspects, it allows us to remember, when we do the data curation, to check for any potentially identifiable information.
15:23
It also allows us to communicate with the other organizations on campus that manage university compliance. So we work very closely with the human subjects office that we have at the university.
15:37
If there's data that involves vertebrate animals, we work directly with the animal care center, and likewise for tribal data.
15:45
So this is our method for ensuring compliance, as well as preparing ourselves for more complex matters pertaining to data curation.
16:01
And, of course, we also ensure that the users, the researchers or depositors, are depositing the data for publication.
16:08
We ensure that they're the creator of it, so that there is not an issue in terms of copyright infringement or other aspects.
16:16
In any case, that's enough about the deposit agreement that we have.
16:21
After they complete the deposit agreement and the readme form,
16:25
and I'll talk about the readme form later as I discuss our data curation workflow,
16:30
we can actually begin the data curation process. That's done by one member of our small team.
16:40
And essentially, we have a set of rubrics that we look at to ensure, as much as possible, the FAIRness of the data itself.
16:49
And in that process, we are actively working with the researchers: we review the dataset.
16:56
And we provide them with a report, and some of that report will indicate changes that we made, for example for simple typographical errors.
17:05
It also includes recommendations, and sometimes we may actually require information from them, to ensure that the data is as usable as possible.
17:21
When the data is at a state where it's actually fairly good, both for us and for them, and there is no embargo in place for publication, we can go ahead and publish the data.
17:35
So, revisiting my talk title itself: we mentioned data curation in the cloud.
17:42
And I want to do a little bit of a segue and mention that we're in a different position than other organizations in terms of data curation.
17:51
And in particular, when we were moving forward, planning, and building ReDATA, we had originally considered using traditional on-premises resources for the data curation part, because the data curators usually have to look at the data themselves, and so it's more beneficial for it to be an on-prem resource.
18:13
But in early 2020, what we realized was that there needed to be a shift in strategy for data curation.
18:20
And in particular, for our services, the move toward cloud computing required that we transition away from using on-prem support that was managed by the library's information technology division, to using a third-party cloud service that we maintain.
18:41
And just for those that are curious what platform we're using: we use DigitalOcean.
18:48
I won't go into great detail on why we went with DigitalOcean over AWS and Google, but I do list the reasons here.
18:54
There are just certain advantages of DigitalOcean for our needs: it's fairly transparent in terms of cost, it's cheaper, and we find that it's really meeting our needs for now.
19:07
And, of course, there are advantages and disadvantages of using an on-prem versus a cloud approach to maintaining your data curation and your compute infrastructure, and I list these here, with the advantages on the left and the disadvantages on the right.
19:25
I'll start with cost, because that's the primary, major factor between on-prem and cloud.
19:31
What we found from our experience over the past year was that, by going to the cloud, one of the advantages is that, in the short term, using the cloud infrastructure was more affordable.
19:45
Basically, we understood what we needed for our services to function, so we only provision what we need.
19:52
Of course, associated with cost is the fact that instead of purchasing a single on-prem resource, a one-time purchase, our costs will vary over time; in particular, they're variable on a monthly basis.
20:09
And while we can predict how much we're going to be utilizing over the next year or two, it's very hard for us to predict what the future costs will be in five-plus years. So that's something that we keep in mind as we move forward.
20:22
And, very much, it's going to depend on how much the data repository grows and how much data curation we need to do in the future.
20:30
Um, nevertheless, as we get towards the three-to-four-year timescale, we'll actually think about what the next steps are in terms of maintaining costs.
20:42
Another disadvantage that we learned through our experience is that working with a third-party vendor also introduces a certain level of bureaucracy at the university.
20:54
In particular, instead of a one-time purchase, we're making monthly, variable expenditures for our cloud infrastructure. So that requires a lot more bookkeeping on our end to ensure that we're following university guidelines.
21:10
On the benefit side, going with a third-party vendor for infrastructure services means they continue to maintain the infrastructure; if there's any deterioration, they upgrade their systems accordingly.
21:24
So that's the advantage, is that there are many, many aspects of infrastructure that we don't necessarily have to continue to maintain.
21:33
Related to that, of course, is expertise: our team is very savvy with computers, but nevertheless we had not had much experience with cloud infrastructure.
21:46
So much of last year has been for our team to continue to learn how to utilize cloud infrastructure.
21:56
But, of course, one of the advantages, at the same time, that we gain from going to the cloud
22:00
is that our team has been able to embrace what we call development operations practices, or DevOps.
22:08
In addition, it allows us greater flexibility overall.
22:13
So there's certainly that benefit that our team has gained, which we shouldn't forget.
22:21
Of course, finally, the last thing is that with any cloud infrastructure, you have to keep an eye on the costs themselves, and that's something that we have to continue to do.
22:36
So, in terms of how we're doing data curation: ultimately, what we're doing now is something that is both automated and interactive simultaneously. And that reflects the complexity that we found with data curation.
22:54
So, kinda to give you some perspective, right?
22:56
What we found from working with our early adopters was that data curation is very detail oriented.
23:02
We're looking at contents to make sure that they are findable.
23:06
We're checking metadata.
23:07
We're checking author lists, et cetera.
23:11
And so, depending on the complexity of the dataset, it can take anywhere from an hour to four hours to basically complete data curation for a single deposit.
23:23
And one thing we also identified is that, as the repository grows, and as more and more users move towards open science and want to publish their data, it becomes difficult for us to scale, in part because some of the processes that ensure we follow very good best practices for data curation can also slow us down.
23:44
So, one thing we wanted to emphasize, or do in our workflow, was to ensure some level of automation,
23:52
such that some of the work that is manually conducted could be simplified.
23:57
Now, keep in mind that with data curation, research data vary very widely from field to field.
24:03
So, our goal is not to automate everything. Our goal is to identify what could be automated and should be automated,
24:11
and what areas are too complex for us to automate
24:15
and require more human intervention.
24:22
And, in addition to being able to curate any dataset, we also have to keep in mind that we have multiple workflows for data publication.
24:33
And what I mean by that is, we've worked specifically with different organizations and labs throughout the university, and these organizations do heavy data publishing.
24:43
And the workflows that we developed with them may be different from the workflow that we have for a general University of Arizona researcher that is publishing, say, one dataset.
24:54
Just to give you some perspective, we work with the NASA Space Grant Consortium …, in Arizona, to publish symposium work that happens on an annual basis.
25:05
To date, we've actually published a total of nine datasets of student research.
25:12
We also work with the USA National Phenology Network to mint DOIs for metadata records for data that they have in their data portal.
25:23
One of the events that the university also organizes is what we call a data visualization challenge.
25:31
And that allows students, undergraduate and graduate students, to tell stories about data visualization, about how they visualize data.
25:40
And so we have this annual competition. Now, we had the second one this year.
25:44
And those data visualizations are made publicly available and minted with DOIs as well.
25:50
And finally, the one that we're working on the most right now is a recent conference held by the linguistics department at the university, making available both oral and poster presentations.
26:04
These are works that have been done by University of Arizona researchers, as well as international researchers that participated in this conference.
26:17
So, the software that we developed.
26:21
And keep in mind that the software that we developed is actually for our needs; you're welcome to utilize it for your needs, if it works for you.
26:30
It's called LD-Cool-P, the Library Data Curation Tool in Python.
26:33
It's publicly available; you can see the link at the bottom here. It's on GitHub.
26:38
It's pure Python code that we developed,
26:42
and it is back-end software.
26:43
Essentially, our team executes a number of scripts that interact with the API and allow us to retrieve data, mint DOIs, and gather metadata.
26:55
And our software interacts with two APIs at the moment.
26:59
The first of which is the Figshare API, since our data repository is a Figshare for Institutions instance; the second is the Qualtrics API. And that's in part because, as I mentioned earlier, we have these forms that researchers provide: the deposit agreement and the readme form, which I'll discuss more later.
27:18
For those, we can utilize the Qualtrics API to gather the metadata for those records and retrieve that information.
27:26
I've kind of talked a bit about open science at the beginning of my talk and how we're trying to ensure open science across campus.
27:33
And very much along the same lines, we're also trying to ensure open source. So, following our own words, we also make our software completely publicly available.
27:42
It's licensed under an MIT license.
27:46
And the work that we do to maintain this software is publicly available.
27:52
We follow a GitHub flow workflow, and we do everything from issue tracking to PRs and project management directly on GitHub itself.
28:06
In terms of what the software does, it does a number of things, the first of which is metadata retrieval.
28:12
Essentially, we're getting information that the user provides, such as the title, the license, and the files that they uploaded. We also do secure data retrieval.
28:25
Um, one thing that we wanted to make sure of in our work is that when we do data curation, we follow our own workflow.
28:32
And we organize ourselves very much following best practices for data management.
28:37
So we actually built workflow management into our code, so that our curators, our team of curators, can continue to ensure that we comply with data management practices in the data curation itself.
28:51
One of the advantages of the Figshare API is that it actually communicates with the DataCite API, which is our DOI provider.
29:02
So, within the software, we can also mint DOIs as we're actually doing the data curation.
29:09
One area that we're working on: we're focusing on data curation, but at the same time, we're also focused on long-term stewardship and preservation.
29:18
So, much of our work now is actually moving towards ensuring that the datasets that we have curated follow practices for data preservation.
29:27
Ultimately, we're going to be moving towards BagIt, and bagging all of our data curation items.
29:33
So a lot of the software itself is preparing for data preservation as well.
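To make that concrete, here is a minimal sketch of what bagging a curated deposit could look like with the bagit-python library; the folder path and bag-info values are hypothetical stand-ins, not ReDATA's actual configuration.

```python
# Minimal sketch: packaging a curated deposit as a BagIt bag with the
# bagit-python library. Folder path and bag-info values are hypothetical.
import bagit

# make_bag() converts the folder in place into a bag, adding payload
# manifests with checksums plus bagit.txt and bag-info.txt metadata.
bag = bagit.make_bag(
    "DEPOSIT_12345/DATA",  # hypothetical curated deposit folder
    {"Source-Organization": "University of Arizona Libraries"},
    checksums=["sha256"],
)

# Before replication or long-term storage, fixity can be re-checked:
bag.validate()  # raises BagValidationError if any file has changed
```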
29:39
And finally, I think what's unique about our workflow, and may differ from others, is that, as we talk about FAIR and reusability of data, we focus on making sure that users can access and understand the data.
29:59
And one of the best practices among data management principles is to document your data or your code, essentially providing a README.txt.
30:09
So what we've done to benefit our users is to approach this in a different manner,
30:16
going towards a templating approach. I'll discuss this more in a second.
30:23
Essentially, what we're doing is treating readme data as metadata, ultimately. But to give you some perspective of how we arrived at this approach:
30:33
one of the challenges that we have identified among researchers
30:37
is a struggle to provide a README.txt.
30:41
I mean, what do you include in the README.txt,
30:43
and what do you not include, right? Sometimes, when you're ingrained in the research, and I'm speaking as a researcher myself, I'm an astrophysicist,
30:52
you tend to forget how you want other people to utilize your data.
30:58
And we've approached this a couple of ways.
31:00
Initially, when we were working with the early adopters, we provided them with a template, saying: hey, this is just a working template for you to utilize.
31:10
And we provided them with instructions to populate the README.txt so that they could submit it as part of their datasets.
31:18
But what we also realized is that this was actually kind of tedious for the user.
31:22
First and foremost, users are providing us with the title and metadata information when they're uploading the data, but then they're actually providing that information again.
31:34
And that, of course, increases the probability of human error, which increases work for data curators to make sure that the information provided is consistent.
31:46
We also find that sometimes users might use their own template form, and so they might upload a different template, or a different README
31:52
.txt, that doesn't conform to our readme form.
31:56
Unfortunately, that also creates more work for both parties, in that there might be information missing that we recommend incorporating.
32:05
And one thing that we've learned is that, as it becomes challenging for people to think about documenting their data, their code, and their software, that learning curve, or learning barrier, makes it harder for people to adopt open science.
32:22
So we want to break down that barrier, and put our researchers, the people who are moving towards data publishing, on an easier path towards making their work available for other users to use.
32:38
And so our approach is: instead of requiring a README.txt from users, what we require instead is for users to provide the content.
32:51
What aspects are important about their data? What is specific to their dataset that allows people to understand it?
32:57
Like, what software are you using to generate these plots?
33:02
Are there any specific libraries that you need?
33:06
Or, if someone's providing a machine-readable table, descriptions of the columns themselves.
33:15
And what we found is that this is ultimately a win-win for both the users and the data curators, in part because it allows the users to not think about the document itself, but just think about the content that allows other users to reuse their data.
33:33
And the other advantage is that by treating it as metadata, it allows us to automate the construction of this
33:41
README.txt, which reduces human error entirely for both parties.
33:50
So this is just an example of a template that we have. It's actually our main template.
33:56
If you're not familiar, we're using something called Jinja; it's a template engine that's often utilized for web content, for example.
34:07
And this template actually has syntax to utilize metadata, so we can get metadata directly from Figshare here.
34:17
And we can also get metadata from other sources, say, in this case, the Qualtrics API.
34:23
What I'm showing you is only just the top of the template itself.
34:27
There is additional content below where the users provide more information.
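For illustration, here is a minimal sketch of this templating idea with Jinja2; the field names and values are hypothetical stand-ins, not ReDATA's actual template.

```python
# Sketch: rendering a README.txt from deposit metadata with Jinja2.
# Field names and values are hypothetical, not ReDATA's actual template.
from jinja2 import Template

template = Template("""\
{{ title }}
{{ "-" * (title | length) }}

Authors: {% for a in authors %}{{ a }}{% if not loop.last %}, {% endif %}{% endfor %}
DOI: {{ doi }}
License: {{ license }}

DESCRIPTION
{{ description }}
""")

# Metadata as it might be gathered from the Figshare and Qualtrics APIs
metadata = {
    "title": "Example Geospatial Dataset",
    "authors": ["A. Researcher", "B. Collaborator"],
    "doi": "10.xxxx/example.12345",
    "license": "CC BY 4.0",
    "description": "Plots generated with Python 3.9 and matplotlib.",
}

print(template.render(**metadata))  # README.txt content, ready for review
```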
34:34
And what it allows us to do is construct a README.txt like this. This is just an example; this happened to be a deposit published last year.
34:45
And as you can see, all the metadata about the deposit is already automatically populated into it, which allows our team to inspect it and make sure everything's good.
34:55
And if everything's good, we can go ahead and upload this to the deposit before we publish.
35:02
Now, in terms of the form itself, there's a readme form.
35:07
Much of what we ask for is actually entirely optional, or optional but highly recommended.
35:12
Essentially, if the users fill out the form and do not wish to provide additional information, we can construct a barebones, basic README.txt that's based off of the metadata that they've already provided when they uploaded their dataset.
35:27
And that's often the case.
35:28
If there's a dataset that's fairly simple, like a single CSV file or figure, you don't necessarily need to go into great detail about the data.
35:36
But often, in our review process, if we find that datasets are more complicated, like there are multiple CSV files and code,
35:44
we do encourage users to provide more descriptions so that other people can understand the data and utilize it.
35:52
So some of the information that we ask for is descriptions of the contents themselves: the files, and any folders that they've organized.
36:01
We also ask if they utilized any specific materials and methods; so, in this case, there might be a specific programming language that they're utilizing.
36:10
We also try to encourage our users to kind of identify contributor roles.
36:16
Essentially, when the data that's been published has multiple authors and multiple contributors, they identify the roles that each individual played in making the data available.
36:28
So, in this circumstance, we basically follow the CRediT taxonomy. And, in addition to that, we ask for any additional information that they want to provide about their data that can extend upon the metadata itself.
36:43
And, you know, I talked a bit about automation: this is all metadata itself, so as changes are made, changes to the title, changes to the author information, changes to the descriptions, we can actually grab those metadata and basically construct a revision to the readme itself and upload a new version.
37:06
So, just to give you some perspective: we actually run this on our systems, and we have this command-line interface, and this actually allows us to retrieve the data.
37:15
In this case, it's actually a dataset that was published recently, some geospatial data, and it's going through our workflow: it's creating the organizational structure,
37:23
it's retrieving metadata information,
37:25
and it's downloading the data that was submitted for review.
37:30
It downloads a curation report; that's actually a report that we use to fill in
37:37
our information. It's also retrieving information from our deposit agreement, for example.
37:43
And as you can see, it's very interactive: it allows users to decide what the next steps are.
37:48
For example, as you can see, we can get the metadata itself from the form, and because our software is generalized enough, we can work with different templates. And I'm actually going to show you what the contents of the output look like. This is our organizational structure: we have multiple folders for data, for metadata, and for our own data curation workflow.
38:16
So I've talked a bit about the software that we're developing. In particular, LD-Cool-P, or Library Data Curation Tool in Python, is a monorepo; it's primarily for the workflow that we have for ReDATA.
38:30
However, we understand that data curation is a fairly complex topic, but nevertheless some aspects of it can be automated.
38:37
And the automation that we've done by utilizing the Figshare API can be very beneficial to other organizations and other institutional repositories as well.
38:46
In particular, what we're trying to do now is move towards creating it in such a manner that it's very modular and that other people can utilize the software that we've developed.
38:57
And now I'm happy to announce for our team that the Figshare API component that we've been developing in the past year,
39:03
ldcoolp-figshare, is now publicly available.
39:07
While it's still in beta, and we're still working on improving it, nevertheless, it is available.
39:12
You can pip install it with a single command. If you're interested, the repository is publicly available; once again, it's under an MIT license. Any software that we develop is under MIT.
39:26
We only require a modern Python, so anything from 3.7 to 3.9.
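For reference, the single-command install, assuming the package name as published on the team's GitHub (this may change as the beta evolves):

```
pip install ldcoolp-figshare
```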
39:33
To give you some understanding of how the software works, I'm going to discuss it a bit more.
39:37
It's object oriented, and it uses open source libraries such as requests and pandas.
39:42
Just to give you an understanding: we utilize pandas like a relational database.
39:46
So any interactions that you do with your instance will give you a dataset that allows you to interact with it.
39:55
So whether you're looking at the list of curation items, the list of researchers, or the metadata itself, that's all available in relational form.
40:06
The code actually includes type annotations, or hints, so for those that are developers and want to contribute, that works with your IDE, which makes things a little bit easier.
40:17
And we also make available some documentation.
40:20
We're still working on our documentation, and it doesn't have many examples yet, but we have our Sphinx documentation publicly available on Read the Docs at ….
40:31
And of course, we are working on improving the software.
40:35
So one of the aspects that we'll need to increase is unit testing for ldcoolp-figshare.
40:44
If you have the software installed, it's actually fairly straightforward to use. We've tried to simplify it as much as possible.
40:51
Depending on your preferred Python interpreter, whether it's Python or IPython, once you have the software installed, you should be able to do a simple import of the software.
41:03
So: from ldcoolp_figshare import FigshareInstituteAdmin. This is the class that we utilize.
41:12
You'll need to have an API token for the Figshare instance that you have.
41:19
And keep in mind, if you do create a token, we recommend that you create it on an account that has both administrative and curation access.
41:27
This allows you to access information about your users as well as conduct the data curation that's necessary.
41:34
You can set a flag called stage. That's one of the advantages of the software: we make it work with both production and stage instances.
41:42
So if you want to explore other aspects, you can set stage equal to True.
41:46
Make sure the token is actually for a stage instance.
41:50
Here, I'm defining an object, as you can see.
41:52
So I basically define fs_admin, providing both the token and the stage flag.
41:58
And with that, you actually have a complete object that allows you to interact with the Figshare API for data curation.
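Putting those steps together, a minimal sketch; the import and constructor follow the talk, while the token value is a placeholder.

```python
# Sketch of instantiating the ldcoolp-figshare admin class, following the
# steps described above. The token value is a placeholder, not a real key.
from ldcoolp_figshare import FigshareInstituteAdmin

token = "MY_FIGSHARE_API_TOKEN"  # needs administrative + curation access
stage = False                    # set True to target a stage instance

fs_admin = FigshareInstituteAdmin(token=token, stage=stage)
```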
42:06
So if you wanted to, say, get a list of users that are logged into your service, for example, you can use the get_account_list() method, and that will provide you with that data
42:17
in a pandas DataFrame of all the account information. This will include not only their username, but their email address and any organization or group that they're associated with.
42:27
For example, you can retrieve a list of curation items as well, using get_curation_list(). That will also provide you with a DataFrame.
42:39
It will provide information like who the depositor is, the author list, et cetera.
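As a sketch of the two calls just described (method names as given in the talk; both return pandas DataFrames):

```python
# Both calls return pandas DataFrames, so standard DataFrame operations
# apply. The inspections below are illustrative.
accounts_df = fs_admin.get_account_list()   # one row per user account
curation_df = fs_admin.get_curation_list()  # one row per curation item

print(accounts_df.head())  # e.g. usernames, email addresses, groups
print(len(curation_df))    # e.g. number of items under curation
```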
42:47
In addition, we have other aspects. What I'm illustrating is just an example; it's not meant to be inclusive of everything.
42:55
But if you're working with a specific deposit, you can check to see if a DOI has been minted, using the DOI check method.
43:04
And, in addition, you can actually reserve a DOI, where it will check to see if the DOI has been minted; if it hasn't, it will prompt you whether or not you want to reserve it.
43:14
And then, something we utilize a lot for our data curation: you want to download the data, and that information is stored within the curation API for each individual
43:24
deposit under curation.
43:26
So you can actually get that information.
43:29
This will provide you with a Python object, a dictionary, that contains all the metadata associated with the data curation, including the paths to specific files for retrieval.
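A sketch of those deposit-level calls; the method names follow the talk and the project documentation, but the IDs are made up and exact signatures may differ.

```python
# Sketch of DOI checks and curation-detail retrieval for one deposit.
# IDs are hypothetical; treat signatures as illustrative, not definitive.
article_id = 1234567   # hypothetical Figshare article (deposit) ID
curation_id = 7654321  # hypothetical curation ID for that deposit

doi_info = fs_admin.doi_check(article_id)  # has a DOI been minted yet?
fs_admin.reserve_doi(article_id)           # prompts before reserving a DOI

curation_dict = fs_admin.get_curation_details(curation_id)
# -> dictionary of curation metadata, including paths to the files
#    submitted for review, which the curation workflow then downloads
```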
43:44
So I've talked a bit about our data curation workflow, and in addition, I've talked about the technical side of how we automate some of the aspects of data curation.
43:55
So, as always, data curation continues to evolve, and as such, our software continues to evolve.
44:02
And so we are certainly on a path towards continuing to make progress.
44:06
As I mentioned, LD-Cool-P is a monorepo, so we're trying to reduce the technical debt that LD-Cool-P currently introduces.
44:13
So much of our team will be working towards refactoring the code, making it so that
44:21
other people can utilize it as well, and making it as modular as possible.
44:26
A lot of our quality assurance is done manually, or done with some unit tests, but it's not complete.
44:31
So we're hoping to extend our unit tests to essentially allow for end-to-end data curation, to ensure that the curation workflows work as intended.
44:43
As I've explained, LD-Cool-P is primarily backend software that we use, and, of course, our team currently is very familiar with computers, Unix, and other platforms on the command-line side.
44:58
But as our service grows, we need to increase the number of data curators that can help curate.
45:05
And so what we're hoping to move towards, perhaps later this year, is to build our own user interface to allow our members to actively conduct
45:16
data curation, since the backend itself is a lot more complicated.
45:19
Um, documentation can always be improved.
45:23
Um, we're also looking at building our own readme tool. I've talked a bit about the readme form and how users have been utilizing it to provide the readme.
45:30
But, of course, that approach has its benefits and limitations.
45:35
We're looking at building our own user interface so that users can provide all the readme data, all their metadata, for as many deposits as they're submitting.
45:45
And, of course, as we continue to grow and scale, the biggest pain point is storage; to keep the costs down and allow it to scale as needed, we need to move toward object storage instead.
46:02
So, block storage is definitely fine for now, but I think on a year-or-two timescale, we'll probably have to transition over.
46:10
I think that's it. I'll stop there, and I'm sure there are many questions.
46:16
I'll go ahead and answer any questions.
46:20
Thank you so much, Chun. I don't think I've seen any come through yet, but I'll give people a chance to ask them in case they wanted to wait until the end. I might just ask one while I give people a chance to type.
46:39
Well, first of all, thank you so much.
46:42
I thought it was very informative and really, really interesting to see what you're doing there at the University of Arizona.
46:48
And I think anything that lowers the barrier to engagement is an excellent endeavor.
46:57
In my opinion. Have you seen a relationship between lowering the barrier and an increase in engagement? You've had quite a few uploads, as you said when announcing just over 100. Would you say the service has had an impact on that?
47:17
I would say so. I think when we've worked with individual users, we've actually asked for feedback, right?
47:24
And the feedback we received has been positive on the data curation.
47:27
So people have found that, like: oh, I didn't think about how to present my data for other people to use, right? And so they've seen the value. And so I think in some ways it's a teaching moment as well, to say:
47:40
think about open science, in some respects.
47:44
So there's that advantage to that.
47:46
Um, you know, our goal is to try to get as much data published as possible.
47:53
So it's not only the data curation aspect, it's also the engagement that we do on social media, where we actually announce the data of researchers, and that kind of increases the research impact as well.
48:08
That's not necessarily thought of as a tangible component of making your data publicly available.
48:17
Great. Thank you.
48:19
I've had a couple of questions come, and then the first one is, how big is your team that works on this?
48:27
Very good question.
48:28
Our current team is a total of three people; actually, I think at the moment we're at four or so.
48:34
Actually, yeah, four.
48:36
So we're a small team, but we're a very energetic team, and we try as much as possible to be very explicit and verbose in our communication.
48:47
So we ensure that everyone's on the same page; when questions arise, we try to respond to them as quickly as possible. We get questions from the researchers as well.
49:02
Thank you. Yes, and as Fernando pointed out, half of our team are students.
49:08
So we do recruit graduate assistants to work with us, particularly on some aspects of data curation, and on some aspects of software development as well.
49:19
So these are great opportunities for students that are particularly interested in information science and open science to get involved in that aspect; that's valuable to many students.
49:33
Excellent, great, thank you. And do curators do any under-the-hood analysis to determine compliance with FAIR principles?
49:44
We do. I can go to a backup slide on this. When we do the data curation, there are certainly things that we look for, and there are certain things where we feel it's not something that we should go down that route for, right?
50:01
And we might recommend, say: hey, you might consider providing more documentation in your code, like maybe a couple of comment hashes here, but we don't necessarily require that.
50:17
I can go to my backup slide.
50:21
And so, in terms of what we prioritize: pretty much everything in level one,
50:28
and most things in level two. This is adapted from Lafferty-Hess and collaborators, in terms of their work on what other organizations have done in data curation. So, anything that's highlighted in blue is what we're currently doing.
50:41
And so, these are things that we check for.
50:44
These are things that we are conducting, and some of the things that we either recommend or require from users themselves; and this is how we prioritize.
50:57
And, you know, what we do is kind of similar to how Figshare has metadata about the process; we do something very similar
51:06
in terms of how we follow our workflow.
51:09
So we basically prioritize different aspects of what we want to do.
51:15
So first, for any aspect of data curation, we put them in different categories: is this data management, is this FAIR, is this metadata, copy editing, et cetera. Any item could fall under multiple categories. And then the areas refer to where the data curation is happening: is it happening in the Figshare instance?
51:34
Is it happening in the documentation?
51:37
For example. And then there are different levels. There are levels that we conduct,
51:41
levels that we might require or might recommend, and then there are certain aspects that we identify as out of scope.
51:49
And then there is an action where we reject: if someone submitted data that is sensitive, for example, we would have to immediately reject the deposit itself and actually scrub the server, among other aspects,
52:00
because it contains sensitive information. And we prioritize the levels at everything from very high to very low.
52:08
Hope I'm answering your question.
52:09
So, this is kind of like our, quote unquote, cheat sheet, where essentially we have a list of action items that we need to do for data curation.
52:18
And we basically prioritize: is it something that we require, is it something that we conduct, et cetera.
52:24
We have everything color-coded, as you see, and then we have example responses, which makes things a little easier for our data curators in terms of what we've done in the past. A lot of this was informed by the data curation that we've done with early adopters until now, and we continue to evolve this.
52:39
Did that answer your question?
52:43
We'll see what they say, but I thought it was very informative. Thank you.
52:51
I don't actually see the question, so yeah, I'm not sure.
52:57
I think maybe it's only visible to the host, I don't know, but I'll ask them. And yes, that answered the question. So thank you.
53:08
There's another question about what you do with non-standard metadata.
53:15
So that's a very good question. I'm wondering if it's for Fernando; I can answer, but Fernando may also have some other insights as well.
53:23
But, I mean, with ReDATA, we treat it as a general repository.
53:31
So we're not necessarily looking at developing more custom metadata for specific datasets. For example, within the Figshare instance, we do have some capability for that through custom fields, but we're not utilizing that at the moment.
53:46
In general, we want to keep it as simple as possible.
53:50
But that's something that might change over time, as demand grows in certain areas.
53:57
It could be geospatial, or it could be other forms of data that require more custom metadata to describe them.
54:06
And keep in mind that even though we have a data repository that we maintain, we also recommend other data repositories. So if someone is doing genetic research, we might recommend that a more appropriate repository is a disciplinary repository like NCBI, for example.
54:28
Thank you very much.
54:30
So I just wanted to clarify: you basically use the APIs to double-check the self-uploads before they're published, so nothing goes live until you choose to make it live? Right. And we communicate; I mean, we communicate about GDPR compliance, we'll communicate with, you know, the Institutional Review Board, and we'll communicate with the researchers.
54:51
Sometimes researchers are working to make available a dataset and, at the same time, publishing the paper, and there might be an embargo in place, for example.
55:01
So we might hold off as well, and set an embargo for a specific date, or wait until the researcher says you can go ahead and publish the dataset.
55:12
Thank you. One more question: are there any restrictions on file types that you accept for deposit?
55:22
So essentially, we are a data repository, so anything that identifies as research data, or a non-traditional research output, will be included.
55:31
So, a document: if someone were to submit just a manuscript, that we would not accept. If they were to submit data, and they also wanted to provide a copy of the manuscript, we would accept that as part of the data,
55:43
since the paper might actually provide additional descriptions about the data.
55:49
In terms of formats, not necessarily; we do have recommendations.
55:53
We recommend machine-readable tables, so we might recommend a CSV or tab-separated TSV over Excel, but we don't necessarily say that Excel is not allowable in that case, for example.
56:06
It's really dependent on the field itself. Some fields might use HDF5.
56:12
Other fields might have binary files, other forms of binary files; some might just have images that they're providing, for example. So, as long as it's described what the data is and how to utilize it, right?
56:28
If there are specific tools that are needed to look at the data, for example, we're fine with that; preferably open source tools, but that's not necessarily always the case.
56:39
Makes sense.
56:41
Great. I don't think there are any other questions, so it just leaves me to thank you both for coming and presenting, and thank you to everyone for coming and asking questions. It's been really informative.
56:58
Andrew?
57:00
Yeah, that's it. Thank you so much, and have a good rest of your day, everyone.
57:05
Thank you. Thank you for this opportunity. We enjoyed it.
57:09
OK, Thank you.