An introduction to sharing big data on Figshare+

October 13, 2022

Ana Van Gulick

Figshare+ is Figshare’s repository platform for publishing big datasets, FAIR-ly. In this webinar, Figshare’s Head of Data Review, Ana Van Gulick, will present how researchers can use Figshare+ to publish larger datasets (over 20GB up to many TBs) to meet funder and publisher requirements for sharing data openly. The webinar will include an overview of Figshare+ including how to get started and the deposit process including checks by Figshare’s data review team as well as guidance on best practices for organizing your dataset and adding documentation and metadata to improve discoverability and reusability.

Transcript

Please note that the transcript was generated with software and may not be entirely correct.

0:14 Hello to everyone who is already online. Just gonna give it another minute or so before we get going with the webinar.

0:40 So, and if you can't hear me at this point, please do say in the chat or the Q&A, and we'll try to iron any technical difficulties out.

OK, so we're just about at the hour, so hello, everyone, and welcome to the webinar today, which is an introduction to Sharing Big Data on Figshare+. I'm just going to start by sharing some small housekeeping bits. So all attendees are in listen-only mode, but if you would like to ask a question or ask for clarification at any point, please use the Q&A function or the chat; I'll be monitoring both. And if we don't get time to answer all the questions, in the eventuality we get loads, then we'll be sure to follow up with you afterwards, because we'll have visibility over who asked which questions. We're recording today's session, and we're going to share it with all registrants in the next couple of days. So if you have to drop off at any point, not to worry, we'll share the full video with everyone that's registered. So without further ado, I'll pass over to Ana to kick things off. Thank you.

1:36 Hello everyone. Thank you for joining us today. I am Figshare's Government and Funder Lead and Head of Data Review, and I manage the Figshare plus repository with my colleague, Dan Valen. So I'm excited to tell you about this new Figshare+ repository today, launched just about a year ago specifically for hosting larger datasets.

2:02 So let's jump right into it.

2:06 Yes.

2:08 Yeah, so here's Figshare plus the repository. It's built on our Figshare infrastructure for standard compliant data repositories. So that everything is discoverable, searchable, re-usable, and tracked. And so we're excited to tell you more about why we did that and how you can use it.

2:32 I want to back up for just a moment, if you're new to Figshare.

2:37 So Figshare is a trusted, Cloud based repository for storing, sharing and discovering research outputs.

2:45 And our flagship repository is figshare.com, which is our free generalist repository and this is available to researchers free of charge around the world to share any scholarly output.

3:02 And Figshare just turned 10; we're celebrating our 10th birthday this year.

3:07 And over that time, we've hosted more than four million research outputs, have more than half a million users, and host hundreds of terabytes of data and other research materials, which have been cited more than 100,000 times. And then we also support research repositories for more than 80 research organizations around the world: academic institutions,

3:32 funders, government agencies, publishers, and corporate entities.

3:40 So, Figshare plus is built on that same infrastructure. And it is basically a portal, a customized Figshare repository, and it launched one year ago tomorrow. So figshare.com is 10 and Figshare plus is one. And we designed this repository to be a little more customized to the needs of publishing big datasets and also to help ensure that that data was FAIR, or findable, accessible, interoperable, and re-usable.

4:17 So why did we launch Figshare plus?

4:20 Well, really, it's a story of our support tickets on figshare.com.

4:27 And so we would often get requests from researchers who had larger datasets. So the limit on figshare.com is 20 gigabytes, and that's free of charge. You know, just go create an account.

4:42 Upload your data, self publish it. And you can publish datasets and single files up to 20 gigabytes which, you know, works for many, many use cases. Lots of data and other materials fit into that.

4:57 However, we're seeing an increasing number of datasets that are larger than that. With, you know, computing, increasing technology for how researchers gather data, and data science expanding, the growth of larger datasets over 20 gigabytes is really happening. So we would get requests from researchers who said, I have a 50 gigabyte dataset, a 100 gigabyte dataset,

5:23 I have a terabyte, I have a five terabyte dataset. Can you help me? And we would try to grant those requests as often as we could.

5:34 But, you know, there is a real cost to hosting large datasets in the cloud and if you've shopped for Cloud Storage for your own research group or your institution, you'll be aware of that.

5:48 And the one thing that we do with Figshare is make sure not only that the data can be shared, but also that it can be openly accessed, and that it can be accessed free of charge. So, not only are we storing the datasets, but we're also paying those access fees, the download or access costs.

6:11 And so, we needed to make a decision about how we could support these larger datasets in a way that was sustainable for us and to make sure that the data sharing and data repository ecosystem was a sustainable part of the research ecosystem.

6:27 And so, that is part of the reason we launched this dedicated platform, so that we can have a sustainable model for covering some of the costs of hosting these larger datasets, one that includes transparent pricing that can be built into funding plans and grants.

6:48 Many funders and publishers are now requiring researchers to share data.

6:53 So, we wanted to help support that mandate, but then many funders also allow data sharing costs to be billed to a grant.

7:02 So, with these transparent costs, people can build Figshare plus into data management and sharing plans, and plan for those costs in advance, and, hopefully, bill them, you know, bill them to a funder.

7:16 And this is all part of, kind of this shifting research data ecosystem.

7:22 As data sharing becomes more common, and big data becomes more common, and, you know, projecting a sustainable way to do that.

7:32 So, let's say we want to support this large increase in larger datasets, sharing those datasets in a flexible generalist repository. So, there's many discipline specific repositories that are great for specific methods of data, specific disciplines, and research communities. But we've also found that there are datasets that exist that don't have a home, that don't have a discipline specific repository, that is a good fit.

8:01 Or it may be that that repository won't accept the datasets because they're too large, or because of the format.

8:07 Or maybe there's a second version of that dataset that otherwise wouldn't be accepted. And so that's where figshare.com offers that flexibility, and so does Figshare plus for larger datasets. Similarly, you can share a variety of file types or different types of outputs.

8:23 That might not be accepted in a repository like that.

8:29 And then, the final bullet here is that we wanted to provide some expert guidance.

8:34 And this came from some lessons we learned through work we did with NIH, and that we've done with a couple of other research organizations

8:46 and people who use our Figshare infrastructure to review datasets: we found there's a big impact to having some human guidance in depositing datasets. So, figshare.com is great because it's self-service; it's very quick, easy, low barrier to data sharing.

9:03 We wanted to make sure that if people were going through the process of sharing large datasets, we could provide some support in doing that, and then also help review those datasets to check them and make sure that there was complete and high-quality metadata that describes them, that links them, that provides context and citability and reusability to the work and makes it discoverable and re-usable.

9:31 So this is the repository here, plus.figshare.com. We have had a number of datasets published over the past year, 32 different depositors, and we're actually hosting about five terabytes of data. Not all of it published yet, but in progress.

9:53 So it's been great to see that we've gotten some users who've found the need for this repository since then.

10:00 I also want to point you to

10:06 this site, which is on our Knowledge Portal. So this is at knowledge.figshare.com/plus. And this is a description of the Plus repository and its features. It's where you can find that transparent pricing for the data storage that you need. And it's also where you can get started with purchasing Figshare plus. So you can find more of the information there that I'll be going over today.

10:33 So what are the features of Figshare plus?

10:37 So as I said, it's intended for all of those datasets that need a flexible home and that are too large to be stored or published on figshare.com.

10:49 So this would be any dataset that's over 20 gigabytes and up to many terabytes. So our upper limit is actually files of five terabytes, and this is imposed by our Cloud Storage through AWS.

11:06 We also allow up to 5000 files per dataset, which is larger than the figshare.com 500 limit. At the same time, best practice might be to re-organize your files into some folder hierarchy, rather than having 5000 individual files, But that's something that our review team can help guide you on when you get to the deposits. And should you really need to publish 10000 files in a dataset, we could lift that limit for you as well by request.
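The limits just described (a dataset over 20 gigabytes, single files of at most five terabytes, and up to 5000 files per dataset unless the limit is lifted on request) could be checked locally before starting a deposit. The following is only a minimal sketch: the function names are hypothetical, and it assumes decimal units (1 GB = 1000^3 bytes), which the talk does not specify.

```python
# Hypothetical pre-deposit check; limits are taken from the talk, units are an assumption.
GB = 1000**3
TB = 1000**4

FIGSHARE_COM_LIMIT = 20 * GB    # free figshare.com limit mentioned in the talk
MAX_FILE_BYTES = 5 * TB         # upper limit for a single file
MAX_FILES_PER_DATASET = 5000    # default file-count limit (can be lifted by request)


def needs_figshare_plus(file_sizes: list[int]) -> bool:
    """True when the total size exceeds the free figshare.com limit."""
    return sum(file_sizes) > FIGSHARE_COM_LIMIT


def check_dataset(file_sizes: list[int]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if len(file_sizes) > MAX_FILES_PER_DATASET:
        problems.append(
            f"{len(file_sizes)} files exceeds the default limit of "
            f"{MAX_FILES_PER_DATASET}; consider zipping or ask for the limit to be lifted"
        )
    for index, size in enumerate(file_sizes):
        if size > MAX_FILE_BYTES:
            problems.append(f"file #{index} is larger than the 5 TB per-file limit")
    return problems
```

For example, `check_dataset([6 * TB])` reports one problem, while three 10 GB files pass the check and `needs_figshare_plus` returns `True` for them because together they exceed 20 GB.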

11:39 Figshare plus supports uploading of these large files via the browser, but also via the Figshare API, which I'll point you to. So that's great for larger datasets that may be more time-intensive to upload, and it allows you another way to do that.

11:58 Compared to figshare.com, which offers only a CC0 or a CC BY license, which requires attribution,

12:08 Figshare plus also offers a few additional Creative Commons licenses. So you can restrict things to non-commercial, no-derivatives, share-alike, and those combinations of Creative Commons licenses if you need to.

12:23 We always encourage you to use the most open license that you can, especially for datasets CC zero may be a good choice to make sure that the data is broadly re-usable.

12:37 However, we do understand there are use cases for datasets, or how consent was written, that may necessitate certain types of licenses. And so again, for these large datasets, we recognize the use cases may be a little more nuanced. So we've included more of the license options in this repository.

12:57 All the data is stored securely and persistently in the cloud. It's backed up additionally outside of Figshare, so you can be sure that it will be persistently available over time. And everything that's published is made fully open access.

13:13 So you have the option to apply an embargo to the files if you need to restrict them for a short period of time or even permanently, but we want to try to make the data as open as possible, however, recognizing that sometimes restrictions are necessary.

13:32 One thing to note with Figshare plus is that it is its own separate repository.

13:39 But the data is not siloed there in terms of discoverability.

13:43 Published data in Figshare plus is discoverable across Figshare, including on figshare.com, and across search engines like Google, Google Dataset Search for datasets, Dimensions, DataCite Commons, and other indexes of open data.

14:02 And DataCite brings us to the fact that each item published on Figshare plus has a unique, citable, and trackable DataCite DOI.

14:11 That's a digital object identifier, something you may be familiar with from publications. And this is a unique, persistent identifier, a link that will resolve to the dataset over time. And so we send all of the important metadata about the dataset to DataCite.

14:32 And it is tied together with that DOI, including your ORCID ID, so you can link your author, ORCID.

14:39 If you don't have an ORCID yet, I do encourage you to get one. and this is your unique identifier as a researcher.

14:48 We can also associate other related works with the dataset, so make sure that associated publications or datasets are linked.

14:56 And importantly, for the new funder mandates, make sure that your funding is linked to it so that funders will know when you have shared data that they supported, and will be able to see all the results of their funding.

15:11 Lastly, and similar to figshare.com, every item on Figshare plus has openly tracked metrics, including views, downloads, and citations, which are pulled from the full text of the scholarly literature, allowing you to see how your data is re-used. And also an Altmetric score, which shows attention from other sources outside of scholarly literature, such as social media, news media, and things like that, that may kind of occur first, before a traditional scholarly citation.

15:46 So, I mentioned that there is a one-time cost for figshare plus.

15:50 So this is a one-time data publishing charge, and this is tiered based on the amount of storage that's needed for the dataset. So that's one of the things you'll want to consider first is how large is your dataset and include that when you submit your purchase request.

16:07 This data publishing charge is just a one-time fee, and it also includes our deposit support over e-mail and that dataset review and revision.

16:18 So, our lowest tier begins at 100 gigabytes, and then it's tiered in tiers of 250 gigabytes up from that. And these include the review fee for each deposit. If you have more than five terabytes of data, as long as no single file is more than five terabytes, we're happy to help.

16:42 But please do just get in touch with us, or, we've also provided how you can do the math yourself to figure out the costs. But on our order form, if it's more than five terabytes, just put comments about how much storage you'll need.

16:59 So, what can be shared?

17:00 And Figshare plus is a repository that's really focused on data and those other materials that are associated with reproducing a research result. So, it's not a repository for, say, posters, presentations, maybe publications, or preprints, but focused on data and those other research outputs that go along with it. So this could be data, code or software.

17:33 Image, video, other multimedia files.

17:36 Workflows or documentation for how the data was collected or analyzed.

17:42 Figures perhaps showing visualizations of the data or how the data was collected. Something to note about the methodology, perhaps. How something was designed to capture the data.

17:55 Any of those things.

17:57 So it could be that this is the entire dataset or maybe part of the dataset is in a discipline specific repository and these are other parts of it or other versions that go along with it as well. That's quite flexible.

18:11 But we would ask that you kind of group a deposit in Figshare plus to correspond with either a specific publication or a specific research project. It does not need to be published work yet, or even in the future.

18:26 We're happy to accept datasets

18:28 supporting results or things that may not be published, but the deposit should be clustered so that it's all data supporting a single research result or a single research project.

18:41 You can upload any file type and then organize these files in a couple of different ways, so if you're familiar with figshare, you might be familiar with our concept of items.

18:52 So an item is sort of that page view of a published dataset or other research output, and it has the files. And then it has the metadata describing it: the title, authors, description.

19:06 And it has a unique DOI pointing to it, and an item could have a single file, or it could have many files. And in the case of Figshare plus, probably many files, if it's a large dataset. And these many files could be single files, or they could also be zipped or archive files, which might be quite important for preserving the file hierarchy that you have in a large and complex dataset.

19:32 So, we're going to be introducing new ways to organize files within a hierarchy. It's on our Roadmap.

19:41 It's on our public roadmap for figshare in the next year or two.

19:45 However, at the moment, the best way to capture that file hierarchy is to zip portions of the data.

19:52 Not creating files, zip files that are really, really large and unwieldy for someone else to work with when they download, but clustering them by subject, or experiment, or method.

20:05 Or, time, phase, or type of data, something that makes sense, given the structure of your data, and then uploading them. However, you might also have multiple items.
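The advice above about zipping portions of the data, clustered by subject, experiment, method, or similar, could be scripted roughly as follows. This is a minimal sketch, not a Figshare tool: it assumes the dataset is already organized so that each top-level subfolder is one such cluster, and the function name `zip_by_group` is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_by_group(dataset_dir: Path, out_dir: Path) -> list[Path]:
    """Create one zip archive per top-level subfolder of dataset_dir.

    Assumes each top-level subfolder is one logical group (an experiment,
    a method, a time phase, ...). Keep out_dir outside dataset_dir.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for group in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        archive = out_dir / f"{group.name}.zip"
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for file in sorted(group.rglob("*")):
                if file.is_file():
                    # Store paths relative to the dataset root so the
                    # folder hierarchy is preserved inside the archive.
                    zf.write(file, arcname=file.relative_to(dataset_dir).as_posix())
        archives.append(archive)
    return archives
```

Calling `zip_by_group(Path("my_dataset"), Path("to_upload"))` would produce one archive per subfolder, which keeps each zip to a size someone else can reasonably download and work with.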

20:15 And this is something that you can decide, again, based on the structure of your data and I always tell people to kinda design the items in the way that you think they would be cited.

20:26 So probably someone's not going to cite individual subject data, right?

20:33 If you have data from, you know, 60 mice, they're not gonna cite the dataset collected from mouse one and the dataset collected from mouse two separately; they're gonna cite the entire dataset.

20:48 However, you might have multiple experiments that might be cited separately, or perhaps you have different types of outputs. Maybe you have a dataset, and then you have the analysis code that went with it or the videos that accompanied it or something like that. And so you might want to cluster those into different items because they may need to be described or licensed or cited separately.

21:11 And if you do have multiple items, you can have up to 10 items per Figshare plus deposit.

21:18 This is just because we're reviewing each one separately and, you know, you want to keep them within that.

21:26 You want to keep them together as not too many different research outputs, but even then, cluster them into a collection. So the collection is another Figshare feature, common across figshare.com and all of our repositories, and this allows you to create a public grouping of other published items. So you can do this on your own in Figshare plus and then put the published items in there. And importantly, a collection also has a unique title, description, and DOI. So you can use the collection DOI then to point to all items: include that DOI in your manuscript or publication to point to all of the data with one link.

22:07 So, I mentioned that Figshare plus datasets are reviewed. And so this shows that workflow: a dataset is submitted and comes to our data review team at Figshare. We'll take a look at it, get back to you within a few business days, and send you some required or suggested revisions that need to be made. You can easily make those right inside the repository, just updating the datasets that you've submitted, and then we'll approve them to be published.

22:37 This is really focused on trying to improve the metadata, to enhance the discoverability and reusability of the work, to make sure that there's context to it and that it's well linked

22:49 to people, organizations, funding, other outputs. So we do conduct a spot check on a sample of the files. We just want to check there's some documentation, that they match the description,

23:01 and that they can be opened. But with really large datasets, we're not doing an exhaustive curation. So, curation means many different things to many different people, and one could curate datasets for days, and could do computational reproducibility of them, and things like that.

23:21 These checks are focused quite a bit more on the metadata and the discoverability and context of the dataset.

23:30 So, what we'll be looking for is a descriptive title that provides context to the work and, importantly, is not identical to the associated publication.

23:39 That's an important one, just to separate these outputs in your scholarly record. That's a good practice. They will have their own unique DOIs and citations.

23:51 However, if you can include something that's different, such as "dataset supporting" this work, that does help differentiate them as two different scholarly outputs that you want to get credit for

24:04 individually. The dataset should be viewed as its own important scholarly output.

24:12 We'll check the item type that you've selected, as well as maybe advise on that file organization

24:18 I just went over. We check that you affirm that you're not sharing personally identifiable or sensitive data. Importantly, that should not be shared in Figshare plus.

24:30 You can share de-identified data as per your consent language and IRB or other protocols, and any other requirements from your research institution. But we would ask that it be de-identified and not sensitive in nature.

24:48 We check that the metadata description provides good context to the work. If someone came across this dataset on its own independent of the paper, would they know what the research question was?

25:02 What the research methodology was, what the structure of the dataset is, what's contained there, where can they find more information, and how can they use it?

25:13 Checking that an appropriate license has been applied.

25:15 This is really just guidance; usually, we're looking to see if it's a CC license or a software-specific license, because you may want to select one of the software-specific licenses as well, if that's what you're sharing.

25:32 Check that the funding information is specified and linked. I'll show you how you can do that. You can create a direct link out to the funding information in Dimensions, the database of our sister company,

25:45 Dimensions.

25:46 It's a link right to a specific grant. And also check that supporting papers or preprints are linked.

25:52 So, this is to build that persistent identifier map within the scholarly record. You want to know that this dataset supports this publication, and this publication describes this dataset, and make sure those are linked.
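The review checks described above could be approximated as a self-check before submitting. This is only an illustrative sketch: the dictionary keys (`title`, `publication_title`, `description`, `license`, `funding`, `related_works`, `confirmed_no_sensitive_data`) are hypothetical names chosen for this example, not Figshare metadata field names, and the real review is done by people, not by a script.

```python
def review_metadata(item: dict) -> list[str]:
    """Return a list of issues a depositor may want to fix before submitting."""
    issues = []
    title = item.get("title", "").strip()
    paper_title = item.get("publication_title", "").strip()

    if not title:
        issues.append("add a descriptive title")
    elif paper_title and title.lower() == paper_title.lower():
        issues.append("title is identical to the associated publication")

    if not item.get("description", "").strip():
        issues.append("add a description giving context, methods, and structure")
    if not item.get("license"):
        issues.append("select a license")
    if not item.get("funding"):
        issues.append("specify and link funding information")
    # Only ask for a link when an associated paper or preprint is known to exist.
    if paper_title and not item.get("related_works"):
        issues.append("link the supporting paper or preprint")
    if not item.get("confirmed_no_sensitive_data", False):
        issues.append("confirm no personally identifiable or sensitive data is shared")
    return issues
```

An item whose title simply repeats the paper title would get the "identical to the associated publication" issue, which mirrors the point above about treating the dataset as its own scholarly output.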

26:09 So, what's the process like for figshare plus, and how do you get started?

26:14 This is pretty simple.

26:16 Just go to that website on our Knowledge Portal that I pointed you to, and there is an order form.

26:22 So you can get started by filling out this form, telling us a little bit about the dataset to be shared.

26:29 It helps if you can tell us just a sentence or two about what the methodology is, or the research question, what type of data, and if there's a publication associated with it or a manuscript title.

26:43 Just helps us make sure that it's a good fit for the repository, and then tell us importantly, how much storage you'll need.

26:51 This will certainly impact the pricing. And also make sure that we can set up an account for you and get that configured properly.

27:01 Give us your contact information, both to create a user account and to get in touch with you, and share your billing information for an invoice.

27:11 So once we get that, we'll approve your data publishing request, assuming it's appropriate, and most of them that we get are.

27:19 Then we'll set up a new user account for you on Figshare plus. So if you're already a figshare.com user, you will need to create a separate account, or we'll create an account and login with Figshare plus. That will be a separate user account. However, you can still add the same ORCID ID to it, and you can use the same e-mail address. And that's not a problem. And in the future it's possible you might be able to merge your author profiles as well.

27:44 Something we're looking to do down the road. But we'll create a new user account for you on this repository, and we'll create a project, and we'll allocate the storage to the project.

27:55 So this is perhaps a little bit different from figshare.com where you might have created an item in your my data.

28:03 Here you're going to use a group project, because the storage for your data publishing charge won't be allocated to your user account. You'll actually have zero storage in your user account, but that storage will be allocated to the project. So you'll go to Projects and create an item in there, and I'll show you a screenshot of what that looks like.

28:24 So you'll get an e-mail then with login and upload instructions, and you can go log into your account, reset your password, set up a profile, and sync your ORCID ID with your account. So we can make sure that your ORCID is associated with your author profile there and will be included in any metadata.

28:44 We'll send an invoice for the data publishing charge, with the billing address on it that you've indicated. This invoice will be sent via e-mail, though. So, if it does need to be directed to someone else, let us know.

28:57 And this is payable by credit card, Although, if your institution has other requirements, do let us know when we get in touch, and we'll try to work with them.

29:07 And then, you can create items in the project, begin to upload files to those items, and add metadata, and you have 12 months to submit the dataset.

29:18 This is just to keep everything moving along. They're starting to publish it.

29:25 However, we recognize that academic scholarly publishing can be a lengthy process.

29:32 And so if you're waiting to publish your dataset for papers to be published, and you need longer than that time, just get in touch with us. We're happy to extend it.

29:42 We just want to make sure that we're following through and getting your datasets published in a timely fashion. Not that it would ever take us 12 months, but we see some people publishing in two days, and some people publish in 18 months because they're waiting for a paper to go through review. And there's just a wide variety there, and we're happy to support it either way. You can also publish the dataset with an embargo on the files,

30:05 so that the DOI would be live, which is nice, so that copy editors at the journal can see that the dataset is there, and you can include a private link to it for reviewers if you'd like them to be able to see the files.

30:20 But it gives you, flexibility to get through that review process with us, and then just remove the embargo on the files, when the paper comes out, if you're waiting to sync those up. Of course, that's a personal choice. You're welcome to release a dataset before a paper comes out as well.

30:38 So, after you submit the dataset to be published, we'll take a look at it, review it, e-mail you with required or suggested revisions.

30:48 The revised dataset will then be approved, The dataset will be published and made publicly available.

30:57 And then you can include those live DOIs in publications and grant reports. And you can also reserve those DOIs in advance, so that you'll always know what the DOI for a Figshare plus dataset will be.

31:14 And after it's published, we would ask you to keep the dataset and the metadata up to date. Importantly, if there wasn't a peer reviewed paper, at the time you published, go back and add it later. So that we can make sure those links are maintained. You'll always have access to go into your figshare plus account, make updates, submit those updates to be approved, and pushed to the live version.

31:39 I do want to give a shout-out to where our API documentation is here. You can use our API to upload; it will be a two-step process for Figshare plus, because we're using the collaborative project feature of Figshare to allocate the storage. So when using the API, you'll first create an item in the project, and then you'll add files to that item.

32:07 And I would point you to where our API documentation lives at docs.figshare.com.
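The two-step process described above (first create an item in the project, then add files to that item) might look roughly like the sketch below. It only builds the two HTTP requests and does not send them. The base URL, the endpoint paths, the `token` authorization scheme, and the `name`/`size`/`md5` payload are written from memory of the public Figshare API and should be treated as assumptions to confirm at docs.figshare.com; the function names are hypothetical.

```python
import hashlib
import json
import os
import urllib.request

API_BASE = "https://api.figshare.com/v2"  # assumed base URL; confirm in the API docs


def _json_post(url: str, payload: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_item_request(project_id: int, title: str, token: str) -> urllib.request.Request:
    """Step 1: create a draft item inside the project that holds the storage."""
    url = f"{API_BASE}/account/projects/{project_id}/articles"  # assumed endpoint
    return _json_post(url, {"title": title}, token)


def initiate_upload_request(article_id: int, path: str, token: str) -> urllib.request.Request:
    """Step 2: register a file on that item before uploading its contents."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Hash in 1 MiB chunks so very large files are never loaded into memory.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            md5.update(chunk)
    payload = {
        "name": os.path.basename(path),
        "size": os.path.getsize(path),
        "md5": md5.hexdigest(),
    }
    url = f"{API_BASE}/account/articles/{article_id}/files"  # assumed endpoint
    return _json_post(url, payload, token)
```

Each request could then be sent with `urllib.request.urlopen(request)`. Uploading the file contents in parts and marking the upload complete are further steps covered in the API documentation and are not shown here.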

32:13 And also, to let you know that we're having an API for end users webinar on November 1, where you can learn more about using our API. The registration for that should be live any day now. Unfortunately, I couldn't give you the registration link today, but when it is live, quite shortly, it will be available at this Digital Science events page; Digital Science is our parent company, and that's where all of our webinar events are. You can also see our State of Open Data report has just come out today, and there's a webinar coming up on that, and one about sharing NIH data as well in November. So, all of our events are there.

32:55 Another important resource for Figshare can be found on our help site. So helpdesk.figshare.com has lots of resources about using our repositories.

33:06 And one of them is this guide to sharing data on Figshare plus that you can find here, and that has a number of the best practices that I'm going over today, as well as that workflow written up, so you can easily find it there anytime.

33:23 So here's a few screenshots of what that workflow looks like.

33:27 So I mentioned that one thing we will do is create a new user account.

33:32 If this is your first figshare plus Deposit, we'll create that user account and you'll be able to login and reset the password.

33:40 And we will invite that user account to be a collaborator on a group project. So in figshare, you have my data, which has all of your items you've created.

33:51 We have projects which are collaborative groups where you can create items and share them with other people and projects can have storage allocated to them. And then collections, which are groupings of public items, rather than draft items like projects.

34:08 And then you'll see this activity tab.

34:10 So, when you log in after you've been invited to join the Figshare plus project for your specific data submission, you will see that the Figshare plus admin has invited you to collaborate on a project.

34:25 And these projects will be titled with the schema of submitting author last name, dataset, and then a year and month, not the word demo. That's just for me.

34:37 So you will see this project invitation is good for a number of weeks, and you can go in and join the project, and then you'll be a collaborator. You'll be able to create new items in the project, and submit them to be published.

34:53 So, then if you go to your Projects tab after joining, you will see the project is there.

35:00 It's on Figshare plus, and it's titled like Dataset one, October 22.

35:08 And if you click into that project, you'll see I've put a description there that you can expand. These are just some quick instructions to remind you how to use this project. It reminds you that this is for publishing the dataset supporting the paper. The made-up title was tuna and salmon preference in felines, with a typo in it.

35:32 And it will remind you to create up to 10 items within this project, upload your files, add metadata and then submit those items to be published. And it also reminds you not to publish this project publicly.

35:47 Projects don't have DOIs. So while they can be published publicly, I would suggest the better practice, using Figshare functionality, would be to use collections if you want a grouping of all of your items published together. And it also points you to that guide that I just showed you in our help page with all of these tips. You'll see that the storage is allocated here. So, I hadn't uploaded anything to this project at the time I took this, but you will see that it was allocated 500 gigabytes, so the storage will be shown there for the project.

36:28 So then if you go to add a new item, it'll take you to our edit item page, so you can begin by uploading the files, including those zip files, and you'll see a progress bar. So this is important for larger datasets. If you have only uploaded smaller files to Figshare before, it may have felt that they just uploaded instantaneously. But with large files, it may take a bit. It should be batched through the browser, so it should stop and resume on its own and batch that upload.

37:07 Again, it may be more effective to use the API for very large files. But this green bar moving across here is the progress bar for uploading those files.
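Editor's sketch: the batched, resumable upload described above can be pictured as splitting a file into fixed-size parts and keeping track of which parts are already sent. The Python below is a minimal local illustration only. It does not call the Figshare API, and the names `Part`, `plan_parts`, `file_md5` and `remaining_parts` are hypothetical helpers introduced here, not part of any Figshare tool.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Part:
    number: int  # 1-based part index
    start: int   # first byte offset (inclusive)
    end: int     # last byte offset (inclusive)


def plan_parts(file_size: int, part_size: int) -> list[Part]:
    """Split file_size bytes into consecutive parts of at most part_size bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts: list[Part] = []
    start = 0
    while start < file_size:
        end = min(start + part_size, file_size) - 1
        parts.append(Part(len(parts) + 1, start, end))
        start = end + 1
    return parts


def file_md5(path: Path, block_size: int = 1024 * 1024) -> str:
    """MD5 of a file, read block by block so a large file never sits fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def remaining_parts(parts: list[Part], done: set[int]) -> list[Part]:
    """Parts still to send after an interruption; `done` holds finished part numbers."""
    return [part for part in parts if part.number not in done]
```

A client that records finished part numbers can call `remaining_parts` after a dropped connection and continue from there instead of starting over. The checksum helper is one way to confirm afterwards that the uploaded file matches the local copy.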

37:20 Then you'll fill out your metadata just as you would on figshare.com or any Figshare repository. And we've added a few custom metadata fields at the bottom, and customized a few of these ones here at the top, but title remains the same, and I would just say to make sure you add a meaningful title that provides context to the work.

37:41 Authors will work the same as on figshare.com. You should list all the appropriate authors.

37:48 Don't just list yourself unless you're really the only one who created the dataset.

37:54 And you can re-arrange the order of the authors. You can search for someone who has a figshare.com account and add the Author profile.

38:03 You can search by their name, e-mail address, or ORCID.

38:08 You can also add someone who doesn't have a Figshare account, and you can add them with their name, and then also include their e-mail or ORCID if you would like.

38:18 And best practice would be to add their ORCID, if at all possible, so that can be included in the metadata.

38:25 For categories, we use the Australian and New Zealand Fields of Research codes. So this is an ontology from which you can select one or more research categories to describe the area of research.

38:40 Item types.

38:41 So these are the item types available on Figshare+, so you've got dataset, software, and media, which could be a dataset or media files, right? So are imaging files considered data, or are they considered media?

39:00 That's something we defer to you on, in terms of, you know, what are the practices in your research community, and how do you want that?

39:07 We also allow figures, if they're supporting it, workflows, or online resources. So that gives you a sense of what your options are for item type, but most of the time, you'll probably be sharing datasets.

39:20 And then you'll add keywords. I recommend at least five per item, similar to the way you would add keywords to a publication, and then a description where you can explain the work in quite a bit of detail. You can add hyperlinks.

39:37 You can add headings and bulleted lists, you can add documentation to this.

39:45 What's here will be available and searchable in the interface, so I sometimes encourage you to add documentation both to this metadata description and in a README file, so that it's in the downloaded files when someone downloads all the files of a dataset. And there's nothing wrong with putting information in both places. One is more discoverable and searchable, and easily seen by someone.

40:10 The other perhaps speaks more to the re-use of the data and a secondary user of your dataset having that information at the ready when they go to re-use the dataset.

40:23 But I sometimes advise people, when creating a description, to not just describe what's in the dataset, but also describe the context of it. So, does it support a specific manuscript? And what was that broader research question?

40:39 Perhaps you might even want to include the abstract of that research paper to say, this is the larger question we're answering, and then here's the dataset and the components, and the files, and the fields that make it up.
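Editor's sketch: one way to keep the metadata description and the README file in step is to build the README from the same few pieces of text. The Python below is a minimal illustration under that assumption. The function name `build_readme` and the section headings are hypothetical choices, not a Figshare requirement.

```python
def build_readme(title: str, context: str, publication: str, files: dict[str, str]) -> str:
    """Assemble a plain-text README from a title, context, related paper and file notes."""
    lines = [
        title,
        "",
        "Context",
        context,
        "",
        "Related publication",
        publication,
        "",
        "Files",
    ]
    # One line per file: what it is and what it contains.
    for name, description in files.items():
        lines.append(f"- {name}: {description}")
    return "\n".join(lines) + "\n"
```

The returned text can be saved as a README file inside the uploaded files and also pasted into the description field, so the same context is available both in the interface and in the download.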

40:57 Continuing, if you scroll down in this edit item page, there are a number of other fields.

41:03 One of them is the funding sources. So, in this funding field, you can add free text to describe your funding sources for the work.

41:12 You can also search within this field by either grant title, or grant ID, or grant number. And so, I would suggest doing this and adding awards this way, if they're indexed. So, this pulls from the Dimensions Grants database, and it has many, many, many grants indexed.

41:34 It may not have them all now, so, if you're not able to find yours, do feel free to add the funding information as free text. That's not a problem, but I've included a few tips here, specifically for some large US funders.

41:47 So, for NIH, there's a lot of different ways to start an NIH Award Number.

41:53 So, here you want to start with the activity code, then the institute code, and then that six digit serial number, whereas for NSF funding, you just want to start with that six digit grant ID. So, there's a little bit of nuance to that. It's also something we're happy to help with during the review process. If you have trouble adding them, just add as much as you can and we will try to get them linked.
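Editor's sketch: the NIH pattern described above (activity code, then institute code, then a six digit serial number) can be checked with a small regular expression before searching the funding field. This is a minimal illustration with assumptions: that the activity code is three characters starting with a letter, that the institute code is two letters, and that spaces between the parts are optional. The name `parse_nih_award` is hypothetical and the award number in the usage note is made up.

```python
import re
from typing import NamedTuple, Optional


class NihAward(NamedTuple):
    activity_code: str
    institute_code: str
    serial_number: str


# Assumed layout: activity code (3 chars), institute code (2 letters), serial (6 digits).
_NIH_PATTERN = re.compile(r"^([A-Z][A-Z0-9]{2})\s*([A-Z]{2})\s*(\d{6})")


def parse_nih_award(text: str) -> Optional[NihAward]:
    """Return the three leading parts of an NIH-style award number, or None if absent."""
    match = _NIH_PATTERN.match(text.strip().upper())
    return NihAward(*match.groups()) if match else None
```

For the made-up input "R01 CA123456", the three parts can be joined back into one search string with `"".join(parse_nih_award("R01 CA123456"))`. A result of `None` is a hint to fall back to free text and leave the linking to the review team.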

42:18 Then there's a few fields in the metadata here where you want to link out to related materials, so we specifically, on figshare plus, added an associated preprint DOI field.

42:29 Just to remind people to put that there, it could also be added to the references field.

42:34 But, we were seeing, you know, that, in this use case with sharing large datasets, there was frequently a preprint that had already been published, or would be published soon.

42:44 So, include that link there. Resource title and DOI would be for the peer reviewed publication.

42:50 If that's not available, you can come back and add that at any point, and that will create a link out to the publication. Then, the references field: put anything else here that relates to the dataset. It could be the collection in ... Plus, if you've created one. Could be another dataset in figshare ... plus another repository.

43:12 Could be a GitHub repository, a website, a clinical trials registration, visualization sites, anything else that would help people understand the work. And you can put those hyperlinks in the description as well as copying them into the references. And we've just launched in beta a new edit item page. So, the interface will be changing a little bit shortly, and one thing that will also change in the next few months is that this references field will include a relationship type.

43:46 And so you will add a link to the reference and then there'll be a drop-down where you'll say this is supporting it, or this is cited by this, or this is citing this work.

43:58 And that again, helps us build that map of how scholarly outputs are related to one another.

44:05 Licenses: here are the licenses that I mentioned for Figshare+. So it's an expanded list of Creative Commons licenses: those that require attribution, fully open access with CC0, where it's fully reusable with rights waived, or other restrictions. And then we have a few software specific licenses like Apache, BSD, GPL, and MIT.

44:29 And we've added a few custom fields here for Figshare+ that you'll notice are not on figshare.com.

44:36 One thing we'd like you to do is specify the research institutions that you or your authors are affiliated with. This is another thing that will be forthcoming to the platform in the future, using a controlled identifier called ROR, the Research Organization Registry. But we think this is really important. Institutions are curious where their datasets are shared.

45:01 And so we've asked you to put that here as free text currently, and a contact e-mail. So this will be public. So do put an e-mail address that will be persistent, and that someone could use to contact you and ask for more information about the datasets, similar to what you would put on a publication.

45:21 And then lastly, there are a few things to affirm at the bottom for each dataset: that you're not sharing sensitive or human identifiable data, and that you have the right to distribute this data under the license selected, per any requirements of your funder, your institution, your human subjects protocols, et cetera. Then optionally, if you have any competing interests, we would want you to declare those there as well.

45:51 And so, after completing that all and publishing it, here's a few examples of datasets published on figshare plus.

45:59 So here, you can see one that is a 74 gigabyte dataset. And you'll see the expansion on the left here of the zip file. So we've got a number of different zip files. And then, in this view, you can see each of the files within them and see that file hierarchy.

46:24 And this is a collection of fossil Elise, Laura Lewis, from Penn State with collaborators at a number of different institutions that was funded by the NSF.

46:35 And here's the description for it. And you can see it includes a lot of information, including a link to the publication, which is also shown here, linking out to the publisher's publication record.

46:49 You'll see the usage metrics of this.

46:51 As of yesterday, it had 859 views and 216 downloads. It hadn't been cited yet, but it was just published a few months ago.

47:00 And it has an Altmetric score, and so this Altmetric icon shows the attention it's gotten, with each color corresponding to a different type of attention: blue for Twitter and red for news outlets.

47:14 Um, they've selected a number of categories and selected many keywords.

47:19 And this is an example of putting lots and lots of keywords, which is great for discoverability, and then a license, and here, they've selected a CC BY license requiring attribution.

47:34 And you can see there were many funding sources. I couldn't capture them all here.

47:42 That's in addition to their description of the dataset. But at the beginning of these funding sources, this first one links out to a grant from the Directorate for Geosciences, and it's an NSF award.

47:53 So when you add funding sources through that Dimensions integration I was talking about, these will be hyperlinks pointing out to more information.

48:04 Here's another example of a dataset. They've got a description of the dataset first that gives you an overview of it, and then a number of archived files that contain the datasets.

48:17 This one is actually really large.

48:19 This is a 4.928 terabyte, nearly five terabyte, dataset.

48:27 So this is one of the largest ones that we're hosting currently. And then here's one more example from ... Plus of a collection. So this is a collection in Figshare+ that contains seven different datasets.

48:43 Each dataset has its own DOI. In this case, they're all datasets. You can see some of them have a different order of authors. And then this collection also has its own description, its own metrics, and its own citation, including a unique DOI for the collection that would point to all of these.

49:10 And so that's it for me today. I'm gonna stop there. I'd be happy to show you the platform, but I do want to stop and save time for questions.

49:23 So you can go to plus.figshare.com at any point and get started with ... Plus. And thank you so much for joining us. If you have questions, please put them in the question box now, and Laura will help me field them.

49:43 Yes, perfect, so thank you, Ana. We have two in there already, so the first one is ..., with a hosted Figshare institution's IR, for instance, say my IR's policy caps deposit size to 50 GB, but the user needs something larger.

50:03 That's a great question and I actually should have mentioned the relationship with figshare, institutional repositories.

50:13 So one thing we do when someone submits a data publishing request on Figshare+ is check if they're coming from an institution that has a Figshare repository already.

50:27 And if so, we will advise them to make sure that they've already checked with their repository administrators there, in the library or elsewhere, wherever it resides, to ensure that that repository is not available to them. And so there are cases where it may not be, right.

50:49 So, it's quite possible that someone's Figshare repository may put a cap on file size, 50 gigabytes or smaller or larger, in which case, if that's not a good fit and the repository would like that researcher to use Figshare+,

51:06 then we're happy to put the dataset in Figshare+ and publish it there.

51:11 It would be two separate accounts and two separate researcher profiles. So, there's not currently a way to integrate them.

51:21 However, one thing we could do is mirror the Figshare+ dataset in the institutional repository, so that way it would have a Figshare+ DOI, not the institutional one.

51:34 And it would not be uploaded there. However, it would be discoverable and visible in both repositories.

51:44 So that's, I think, how we would do that.

51:48 If you have a Figshare repository and you're interested in having researchers use Figshare+ on a regular basis for certain types of data or large datasets, do get in touch with us and we're happy to have a discussion with you about how to set up a workflow for them.

52:12 Thank you. And then the other one is, what is your current data retention policy? Do you preserve indefinitely?

52:21 Yeah, so that's another good question. So, we do intend to preserve for the life of the repository, which we would hope is indefinite and our preservation plans include keeping a copy of the dataset accessible indefinitely. But per the legal terms, it would be a 10 year retention period.

52:48 And so, again, if that's something that is a specific requirement of your funder, or your institution, contact us, and we can see if we can make any special arrangements for specific datasets.

53:04 But it's really that we have a preservation policy for the life of the repository, which we hope will be indefinite.

53:15 And then we have backups of all of that so that they can be kept available with those previous links.

53:26 Thank you. Next question is: can you have multiple datasets at five terabytes, or are you limited to one dataset of that size?

53:35 Yeah, no limit. That's part of the reason that we use the group structure,

53:41 so that rather than allocating the storage to the user account, you could have multiple Figshare+ deposits going on at once and they would be using different projects that had different titles and different storage allocated to them.

53:57 So yeah, really, the limit is on the storage that's allocated to that project. And we'll keep that allocated until you publish the dataset, for 12 months, or longer if you need it, if you get in touch.

54:13 And so you can have a project with five terabytes, and you can have another project with five terabytes or two terabytes, or whatever it is you need. The only limit is just that single file restriction of five terabytes, which would apply in each one. But you could have a 10 terabyte project as well, if you needed, going on at the same time.
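Editor's sketch: the two limits mentioned in this answer, a single file restriction of five terabytes and the storage allocated to each project, can be checked locally before starting an upload. The Python below is a minimal illustration. It assumes decimal terabytes (10^12 bytes), which the talk does not specify, and `check_deposit` is a hypothetical helper, not a Figshare function.

```python
TB = 10**12                # assumption: decimal terabytes
MAX_FILE_BYTES = 5 * TB    # single-file limit mentioned in the talk


def check_deposit(file_sizes: dict[str, int], allocated_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the files fit the stated limits."""
    problems = []
    for name, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: exceeds the single-file limit")
    if sum(file_sizes.values()) > allocated_bytes:
        problems.append("total size exceeds the project allocation")
    return problems
```

Because storage is allocated per project rather than per user, the check is run once per project with that project's own allocation.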

54:37 Great. Thank you.

54:37 That's all the questions that we have at the moment, and this is really great.

54:45 Yeah, I'll give it a final few seconds to see if any more come through. If you do think of anything else after the webinar, you've got the contact details on the screen, so do feel free to reach out.

54:58 And other than that, I just have to say thank you very much, everyone, for joining, and thanks for your presentation, Ana, and we'll be sharing the recording with everyone.

55:08 Great. Thank you so much, everyone, and have a good day.
