The next 10 years of Open Data

December 13, 2022

Mark Hahnel

A Figshare webinar that looks ahead to the next 10 years of open data. What should the roadmap of open data uptake look like in academia?

Figshare celebrated their 10th anniversary in 2022 and have been reflecting on 10 years of providing leading repository software to universities, publishers, funders, government agencies, pharmaceutical organizations, labs and more. As we embark on the next phase of our journey, this webinar will take stock of the current landscape of Open Data and what the coming years could bring for Figshare and the community as a whole.

2022 also saw the so-called ‘seismic’ OSTP memo and in January 2023, the NIH’s new Data Management and Sharing Policy will take full effect. During our webinar we’ll discuss the rise of national and international open data mandates and what they mean for publishers, universities and importantly researchers themselves.

Transcript

Please note that the transcript was generated with software and may not be entirely correct.

0:04 We'll wait for a few more people to get online, and there's quite a few of you already, and then I will give the housekeeping bits and pass over to Mark for today's presentation. If you can't hear me at this point, please say, or if it's crackly or anything, and we'll try to get it sorted before we kick off.

0:27 Cool, so I think everything's working, and there's quite a few people on. So we will move on and say hello everyone, and welcome to Figshare's last webinar of 2022, which is the next 10 years of open data. So as I said, I'll pass over to Mark, but here are the housekeeping bits, just for you all. All attendees are in listen-only mode, but if you would like to ask a question, or need clarification, or are having a technical hitch, you can put it in the question box in the GoToWebinar side panel, or you can put it in the chat, and I will monitor both. General questions, aside from those about technical issues, we will come to at the end during a dedicated Q&A section, and in the eventuality that we get loads of questions and we don't get time to get onto yours, not to worry, we'll make sure to follow up with you afterwards because we'll be able to see who asked what.

1:17 We are also recording today's session, so we'll share that recording with everyone in the next couple of days. If you do have to drop off at any point, not to worry, you'll receive the recording. So I think that covers just about everything, and I will pass over to Mark for the rest of today's presentation.

1:40 Thank you, and hi everybody. Thanks for coming along. I'm sure Laura will let me know if she can see my screen. For those who don't know me, my name is Mark Hahnel.

1:53 I'm the founder, and we have a classic thing, that this always happens.

1:58 We're not 100% sure why.

2:01 The screen goes AWOL for a second here.

2:07 We've had this a few times, we're not sure why.

2:11 It usually rights itself, and we're away again. Now, what were we doing? OK.

2:18 Um, there we go. So, I'll take it from here.

2:23 We are 10 years old, which is why we've been doing a set of webinars around the last 10 years and the next 10 years of open data. This is the final one of the year, so I'm going to use it for

2:37 a few things. I'm just going to move on to the next slide. It's both sides of it.

2:44 So I'm going to give a little state of where we are right now, and what's been happening, and what we can learn from the last 10 years, where we see the space moving, and what's been happening with some other stuff that's going on.

2:56 And then I'm going to end with some outlandish predictions that you can all screengrab and say, Mark thought this was going to happen in the next 10 years. How ridiculous. He didn't have a clue what was going on. But I'm sure there'll be plenty of unknown unknowns in the next 10 years, as well.

3:14 I should be talking for just about 30 minutes, so if you have any questions, please, as Laura mentioned, put them in the chat box.

3:21 And feel free to ask me about any of the predictions or any of the data that I am showing. I will be showing a bit of new data that we've been collecting around,

3:31 um, how we can start linking information together.

3:35 Um, the first thing, obviously, I've got up on the screen already, is just looking at other areas. We often think that academia and research doesn't move too fast, or, with new changes in research,

3:49 we're often struggling to move along when it comes to, uh, peer review, or when we're trying to think of new, innovative ways to improve the academic dissemination of knowledge.

4:04 And I think there's also this idea that, because of the legacy of academic publishing and academic content, we know about perverse incentives popping up, whether it's, you know, impact factors, or countries, governments, trying to get ahead in research and providing financial incentives for the researchers.

4:28 We've seen that, we've seen lots of them over the last decade, and I think, with open access paper publishing in particular, we have seen this fantastic flip from the vast majority, 60% of publications in 2011, being closed access publications. We now see over 50% of academic publications being open access from the start. We will see this continue to grow with things like the OSTP memo saying that you need to make publications openly available with no embargo time. I think that's going to force innovation in the space for a lot of journals,

5:09 a lot of publishers.

5:11 But we also see something I've mentioned a bit, this idea that gold open access is by far and away the biggest way to make content openly available, which I see as a failing of the repository system which we work in, and something that we're hoping to improve on over the next few years.

5:31 But it's also, you know, the path of least resistance for researchers.

5:34 There were many people who said that there's no way you could be publishing 1.75, two million papers a year gold open access, the academic publishing ecosystem can't support that, and yet, here we are. So, whether it's more cost effective or less cost effective,

5:54 it is providing open access, it is making things more openly available for everybody.

5:59 But then you get to this weird conundrum that, in a lot of nations where there is not the same level of publishing funding freedom, or academic funding freedom, you get this two-tier system again, where everyone may have access to the content now, but not everybody can publish their research in an equitable manner.

6:26 So I think this is important to think about as we look to the emerging data space.

6:33 So, if we focus specifically on open academic data, what I can tell you now is there is more of it.

6:42 And this is just one little graph which is showing the evolution of free Figshare usage over time.

6:50 So we have, can I start that GIF again?

6:54 So what we've seen there, it goes from 2013 onwards, is if you look at the academic literature, published articles.

7:06 And if you look at 33 million indexed articles, you can see the number of papers that mention a link to free Figshare. That is, Figshare provides infrastructure for repositories as well as having a free version. If you just look at the DOI string that we use for the free version, 10.6084, then you can see that this is popping up in more and more publications.
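To make the DOI-string idea concrete, here is a minimal Python sketch of how one might count, per year, the articles whose text mentions a DOI starting with the 10.6084 prefix named in the talk. The input shape (year, full text pairs), the sample sentences and the loose suffix pattern are all assumptions made for illustration; this is not how the graph shown in the webinar was actually produced.

```python
import re
from collections import Counter

# Assumption: DOIs from the free Figshare service start with the prefix
# 10.6084, as stated in the talk. The suffix pattern is deliberately loose.
FIGSHARE_DOI = re.compile(r"\b10\.6084/[^\s\"<>]+", re.IGNORECASE)


def mentions_figshare(full_text: str) -> bool:
    """Return True if the text contains at least one 10.6084 DOI."""
    return FIGSHARE_DOI.search(full_text) is not None


def count_by_year(articles):
    """Count articles mentioning a 10.6084 DOI, grouped by publication year.

    `articles` is an iterable of (year, full_text) pairs, a hypothetical
    input shape chosen for this sketch.
    """
    counts = Counter()
    for year, text in articles:
        if mentions_figshare(text):
            counts[year] += 1
    return dict(sorted(counts.items()))


# Made-up sample sentences, for illustration only.
sample = [
    (2013, "Data are at https://doi.org/10.6084/m9.figshare.1234567."),
    (2013, "No data were deposited."),
    (2021, "See doi:10.6084/m9.figshare.7654321.v2 for the raw files."),
    (2021, "Available at 10.6084/m9.figshare.1111111"),
]
print(count_by_year(sample))
```

Matching on the prefix alone keeps the sketch simple, but it would miss datasets hosted on institutional or publisher Figshare instances that use their own DOI prefixes, which is the same limitation the speaker notes for the graph.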

7:35 Still a low number, but this is one generalist repository. It is not all of the Figshare infrastructure, it is not all of the other generalist repositories, it's not all of the other subject-specific repositories.

7:46 But what we can also tell from that, and this is based on our sister organization, Dimensions, is that this only includes publications where a funder and research organization are identified.

7:58 What I think is really important about this is, we have built this data to be good data.

8:05 So, there are more datasets that have been linked from academic publications, but if they haven't had their funder or their research organization identified, and identified in a machine-readable way, we've taken that out of the, uh, the equation. We've cleaned the data, so to speak. And I think that's an important thing to think about when we think about how data is going to be used.
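The cleaning step described here can be pictured as a simple filter. The sketch below, in Python, keeps only publication records where at least one funder and at least one research organization carry a persistent identifier. The record layout, the field names (`funders`, `research_orgs`, `id`) and the example values are hypothetical and chosen only for illustration; they are not the Dimensions data model.

```python
def has_identifier(entries):
    """True if at least one entry carries a non-empty persistent identifier."""
    return any(entry.get("id") for entry in entries)


def is_well_linked(publication):
    """Keep a record only if both a funder and a research organization
    are identified in a machine-readable way (hypothetical field names)."""
    return (
        has_identifier(publication.get("funders", []))
        and has_identifier(publication.get("research_orgs", []))
    )


# Made-up records, for illustration only.
publications = [
    {
        "doi": "10.1234/a",
        "funders": [{"name": "Example Funder", "id": "funder-001"}],
        "research_orgs": [{"name": "Example University", "id": "org-001"}],
    },
    {
        "doi": "10.1234/b",
        # Funder only named in free text, so there is no identifier.
        "funders": [{"name": "Some funder in the acknowledgements", "id": None}],
        "research_orgs": [{"name": "Example University", "id": "org-001"}],
    },
    {"doi": "10.1234/c", "funders": [], "research_orgs": []},
]

clean = [p["doi"] for p in publications if is_well_linked(p)]
print(clean)
```

A filter like this trades coverage for quality: records that are probably genuine but lack identifiers are dropped, which is why the resulting counts read as a lower bound.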

8:28 There's this idea that if it is not of a certain quality when it comes to persistent identifiers and linkages to other types of content, then it can just not be consumed.

8:39 And we still see this wonderful animation, where we know that 4000 organizations and a thousand funders account for all of this open data that is being made openly available. And if we look at our State of Open Data report that we do every year,

8:55 we found that 17% of those surveyed in 2022 said they were required to follow a policy on data sharing for their most recent piece of research. So what I'd say there is, there's lots of data.

9:09 That graph that I showed you beforehand, the animated GIF, showed you

9:13 that there's more data being made available.

9:17 There is more data that is being made available linked to more funders. There is more data being made available that is linked to more research organizations.

9:26 And one of the reasons is that funders are saying that you have to make your data available if we fund you. In January of next year, we'll see the NIH policy come into fruition, which mandates that if they fund you, the biggest biomedical funder in the world, you will have to make all of your research data available when you publish your paper.

9:49 And I think we've seen the UNESCO movements on it this year. We've seen some other growing momentum saying you're going to have to make your data available.

9:59 I think, particularly in North America, we will see more of this as we get towards 2026, when the Office of Science and Technology Policy, the OSTP, memo comes out, um, and says that as of 2026, there are zero embargoes, publications are openly available, and the research data will need to be made openly available as well.

10:23 Um, and so, again, from the State of Open Data, we've seen consistent themes for the last six years that we've been doing the State of Open Data, which asks: what motivates you to share your data?

10:38 And a large majority of people will make their data openly available if it gives them some credit.

10:46 So, we have established things. Things will always change over time.

10:50 There are these unknown unknowns, but there's the funder requirements, um, which is just cut off at the bottom here, 56%, and all of the other ones, the larger ones, are around impact of my research.

11:03 So you've got those two key parts of the equation that I don't think are going to go anywhere. I don't think they'll be toppled in terms of what are the most important reasons why researchers are making their data openly available.

11:17 And that is: I'm being told to, and I need credit for my research in order to continue on my career. So I think these are established facts now. And we've had six years of longitudinal data that says these are the most important. Things may change,

11:32 but I think that will continue for the next 10 years, at least. As I continue with this scene setting of where we are, before moving into the forward-facing side of things,

11:44 I think a really interesting research report that came out from our sister organization, Ripeta, looking at the state of trust and integrity in research, really identifies the other side of researchers making their data openly available. And that's this idea of compliance. Can we check who is doing what, and are they complying with what they said they would do as part of their funding?

12:14 And I think this is going to be a really important feature going forward, as we don't have any funders at the moment doing consistent checks on the accountability of their researchers to make their data openly available.

12:29 I've said this before, if anybody on the call is brave enough to do this.

12:34 I think it's an interesting idea to go and look at every paper that says data are available upon request, requesting that data.

12:42 And then, if they refuse to make that data openly available to you, request that the publication be taken down, as they are not making their data available on request, and therefore they are not being transparent about their research, which I think is important.

13:00 What we also saw from this report is the importance of data availability statements.

13:07 And the importance of linking to the publication. The reason I mention linking to the publication is, I think there is a gradient that we are looking at in terms of open academic data, which is this idea that only data which is very homogeneous and very well described can ever be useful.

13:32 And the way I like to think about it is, um, data for the machines is going to have the biggest impact in the long term.

13:42 But it doesn't mean that the data that is not accessible for machines does not have a huge amount of benefit to research going forward.

13:53 So I think these data availability statements from academic papers will continue to grow. I think we'll move towards 100% of publications having a data availability statement within them. If you are a publisher who is not doing that, I'd be interested to know why.

14:13 One of the questions there as well is whether publishers work as a completely separate entity, and has the dissemination of data, as well as the dissemination of papers, become part of their core reason for being? And what I think we've seen is this kind of decentralization of academic publishing systems.

14:38 So when I say that, Figshare itself, we provide academic repository infrastructure.

14:46 We provide that for publishers like Springer Nature, like PLOS. We provide that for academic institutions like UCL up the road from me, or Carnegie Mellon, University of Melbourne or University of Amsterdam. It's a global thing.

15:05 But we also work with different types of research organizations, and I think this is something that wasn't evident 10 years ago, and I don't think this is going to be something where you have one particular group, or one particular stakeholder, take ownership of publishing of datasets in the way that we have seen previously with academic publishers as the way you publish papers.

15:31 There is a small minority, maybe a growing minority, of university presses and different organizations where you can publish your academic research in the form of a peer-reviewed paper. But I think the majority of them are academic publishers.

15:49 So, when we look at the data availability statements and location of datasets, it's really interesting to see that these are all different funders, and how they've grown over time. I think it's interesting here, the NIH, going from 20 12 to 40%.

16:08 The NSFC, which is a Chinese research funding organization, going from 11% to over 30%, and this is just in the last five years. So we're seeing a real growth here, and I think this will continue up to 100%. In terms of where the datasets live,

16:25 the biggest one for all of them, except for maybe the German one here, is that the dataset lives in a repository.

16:32 So, I think this has become the established norm.

16:35 I think a repository, um, can cover a whole plethora of different types of stakeholders. It can be a funder repository, it can be a publisher repository, it can be a subject-specific repository. It can be a generalist repository like the free figshare.com.

16:52 Um, I think the datasets living in files and in papers may continue, but I think we'll see a push to move towards repository publishing and data availability statements, away from supplemental materials.
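As a rough illustration of how the categories mentioned here (repository, supplementary files, upon request, not public) might be assigned to data availability statements, here is a small keyword-based Python sketch. The rules, labels and example statements are assumptions made for this example; the analysis shown in the webinar presumably relies on far more robust text mining than a handful of regular expressions.

```python
import re

# Ordered rules: the first matching pattern wins. Both the labels and the
# keyword lists are illustrative assumptions, not an established taxonomy.
RULES = [
    ("upon request",
     re.compile(r"\b(up)?on (reasonable )?request\b", re.IGNORECASE)),
    ("repository",
     re.compile(r"\b(figshare|dryad|zenodo|osf|dataverse|mendeley data|repository)\b",
                re.IGNORECASE)),
    ("supplementary files",
     re.compile(r"\bsupplement(al|ary)\b|\bsupporting information\b", re.IGNORECASE)),
    ("not public",
     re.compile(r"\bnot (publicly )?available\b|\bcannot be shared\b", re.IGNORECASE)),
]


def classify(statement: str) -> str:
    """Return the label of the first rule that matches the statement."""
    for label, pattern in RULES:
        if pattern.search(statement):
            return label
    return "unclassified"


# Made-up statements, for illustration only.
examples = [
    "The data are available from the corresponding author upon reasonable request.",
    "Raw data are deposited in Dryad.",
    "All data are included in the supplementary material.",
    "The data are not publicly available due to privacy restrictions.",
]
for text in examples:
    print(classify(text), "<-", text)
```

Putting the "upon request" rule first is a deliberate choice in this sketch: a statement that names a repository but still gates access behind a request is counted as request-only.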

17:06 We've seen some publishers already stop having supplemental materials. And then upon request and not public is something where I feel we need to have a bigger level of accountability and a bit bigger,

17:19 a bigger crackdown across academia.

17:24 There's always going to be reasons why you can't make certain types of data openly available, right? The European Commission line of as open as possible, as closed as necessary, is very true when it comes to certain types of data.

17:37 You don't want to be publishing the locations of the, uh, last

17:45 members of a soon-to-be-extinct species, right?

17:49 But I think it's something that is overused, this data are available upon request because only I should have access to it.

17:58 If we start looking at the data, again, this is from a Dimensions integrity model, we can see, I started looking at generalist repositories and you can see a group of them here. And I think what's interesting in looking at different repositories, this is the repositories used by time since the corresponding author's first publication. So it's how long they have been publishing academic content.

18:29 And things to consider: newer researchers are making their data available at a greater rate, in generalist repositories, I should say, at a greater rate. This is just filtered on biological sciences as well, because I think there's also going to be this thematic,

18:49 um, question that needs to be answered.

18:52 And this is really interesting for me, because what we're starting to get into now is: should I make my data openly available, or not,

19:01 was the question 10 years ago. And the change to where I make my data available and why I make my data available, we'll see grow.

19:11 So, you know, 6.5% of

19:18 researchers were making their data available in Dryad, if they're new researchers.

19:24 And you see a consistent theme that the more time a researcher has spent since their first publication, the less chance they are using a generalist repository. It might be that they're using other types of repository, but I would hazard a guess here,

19:44 don't quote me on this, that there is less publication of data by more established researchers who are not used to this new way of working, whereas the more junior researchers who are new to this space, who've grown up with digital technologies, are more used to making their data openly available.

20:08 The other thing I wanted to show here as well, if you're just looking at the generalist repositories, is, um, on the left-hand side here, you see the biological sciences that I mentioned, and on the right-hand side here, I've just filtered on psychology.

20:24 So this is the number of articles, uh, in Dimensions, where you look in the full text and they've said, I've made my data available in a generalist repository and it's this one.

20:37 So they've put that in. What's really interesting here is the Open Science Framework, the Center for Open Science, started out of the psychology background.

20:47 There was supposed to be a reproducibility crisis in psychology that was first highlighted around 2010 to 2011.

20:57 And what we've seen here is Center for Open Science, OSF, the Open Science Framework, has the vast majority of open psychology data of all of the generalist repositories.

21:12 If you compare these repositories, Dryad, Figshare, Zenodo, Center for Open Science, Mendeley Data and Harvard Dataverse, they all are very similar. I can talk about why they are very similar in functionality in a little minute.

21:26 But I think it's very interesting that you can see there's a huge difference here in the amount of psychology open data, and that is because the psychology community has chosen OSF, because it's been promoted in that way.

21:41 OSF, Center for Open Science, does not have more subject-specific metadata for psychological research.

21:48 Should it have more? More functionality catered for certain researchers in there, and therefore have a thematic repository, a repository around a specific type?

21:59 I think this is where we might get to in the next 10 years. We might start seeing these thematic repositories, or repositories for different areas of research. So not GenBank, not saying we just want to take your FASTQ files, a very specific file, or your ... files, very specific file types with very homogeneous metadata. I think you might get

22:22 thematic research grouped together. It still might be heterogeneous, but it's all based around a certain field and therefore you can start asking for more specific metadata.

22:33 You know, we could have a life sciences research thematic repository, or a, uh, more specific human biological sciences research repository. And it could have things like age of subject, or you could just have the species that was involved. Whereas geological sciences probably don't care about the species because it's not a metadata field that makes sense.
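One way to picture field-specific metadata layered on top of a generalist core is sketched below in Python. The class names and fields (`species`, `subject_age_years`, `sample_location`) are hypothetical and only echo the examples given in the talk; they do not correspond to any actual Figshare schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CoreMetadata:
    """Fields a generalist repository might ask of every dataset (assumed)."""
    title: str
    doi: str
    description: str = ""


@dataclass
class HumanBiologyMetadata(CoreMetadata):
    """Extra fields that only make sense for human biological research."""
    species: Optional[str] = None
    subject_age_years: Optional[float] = None


@dataclass
class GeologyMetadata(CoreMetadata):
    """Geology has no use for a species field, so it is simply absent."""
    sample_location: Optional[str] = None


def missing_fields(record) -> list:
    """List the metadata fields that are still empty for a given record."""
    return [f.name for f in fields(record) if getattr(record, f.name) in (None, "")]


# Made-up record, for illustration only.
record = HumanBiologyMetadata(
    title="Example cohort measurements",
    doi="10.1234/example",
    description="Illustrative record",
    species="Homo sapiens",
)
print(missing_fields(record))
```

The thematic classes inherit the generalist core, so a curator or a submission form could flag exactly which field-specific details are still missing without changing what every other discipline is asked for.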

23:03 So, in saying where we're going, I think compliance is going to become a big thing for funders in checking the accountability of their researchers, and we can start to see this already.

23:19 You can filter on, um, published research and say, if I just want to look at Australia and the different funders, can you tell me how many papers are being published where they have these different trust markers?

23:33 We just saw a paper come out, or a blog post, in PLOS, where they're working with an organization called DataSeer to try and look at data availability and code availability. And so this is a publisher that has the data available, that has the tools available to it, that can start looking at compliance, and start looking at trends to understand who is making what available

23:55 when. I think if you combine this with, you have to make your data available, then there's going to be a bigger level of accountability for those researchers, based on compliance.

24:08 Um, I think, by having this data openly available as well, there will also be a loop effect in terms of the funders getting pressure from other funders to act more appropriately.

24:26 So, in the same way that there is going to be pressure on researchers to make their data openly available, and comply with policies and have that compliance checked, I think we'll also start, as we like to do,

24:40 comparing different funders and seeing how different funders are acting, and how different funders are encouraging transparency in the research that they fund. I think there will be a level of accountability that they hold themselves to going forward as well.

24:58 So the question that I get asked a lot, specifically because we are a generalist repository, is this thing I brought up before about how useful is any of this.

25:10 There's different types of data and we tend to look at either end.

25:15 We don't look at the middle. OK, well, I can read that paper and I can understand what that data says based on the context of the paper, because I'm a human and I can operate at that level.

25:27 And then you have the other end of the spectrum, which is the machines. The machine learning algorithms can elucidate all of this knowledge, if they're just fed very consistent files and very consistent metadata.

25:43 I think the space that's underexposed at the moment is this stuff in the middle.

25:49 So if we think about how useful is any of this, I've put this before.

25:54 This idea that data with no checks can be useful.

26:00 So this is academic files and metadata that are available on the internet, in repositories that follow best practice norms.

26:07 This is researchers who use generalist repositories, the unchecked, the uncurated.

26:13 But they just follow best practices, and they describe their data well. We see a lot of software made available that follows best practices, because it's a community norm. You add a README.

26:23 You add content.

26:25 But we also have content that is made available that you have papers adding the context to. So this is nice.

26:32 This data with no checks that can be useful, I think, is at one end of the spectrum.

26:38 And I think everybody thinks about interoperable and re-usable data, which is, you know, files are in an open, preservation-optimized format, subject-specific metadata schemas. This is the Shangri-La, this is the holy grail that we're going to get to, right?

26:53 So when everybody thinks about the future of open data, the next 10 years of open data, it's this or nothing, interoperable and re-usable or nothing.

27:03 And I don't think that's going to happen at all.

27:05 I think, um, top level, data with no checks can be useful, will continue to exist.

27:13 But everything will shift downwards.

27:15 There will be a concerted push to go from data with no checks all the way to FAIR: findable, accessible, interoperable and re-usable data.

27:26 And the way we'll see that happen is moving things towards findable and accessible, and then moving things that are currently findable and accessible towards interoperable and re-usable.

27:37 And this is going to be a general push downwards.

27:41 But at the same time, I think we're going to see a swelling of information in the middle of this, which is findable and accessible data that is described OK. And so the question there is, well, how useful is that?

27:54 And I was delighted to see this paper come out recently, which is answering a question about a lot of the data that is made available on the generalist repository Figshare that is not well described,

28:11 and therefore, is it of any use to anybody? I get a lot of pushback from people who say, well, it's not well described, it's ****, it's just wasting space on the internet.

28:21 And what this research group provided was a resource for automated search and collation of geochemical datasets from journal supplements.

28:31 So they went through the supplementary data of journals on Figshare, they used the Figshare API, and they generated this dataset for 150,000 some things and oxygen isotope analyses.

28:48 So this is large-scale data analysis across what seemingly is heterogeneous research.
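For readers curious what harvesting supplementary tables through an API could look like, here is a hedged Python sketch. It is not the method used in the paper being described. The base URL, the `/articles/search` and `/articles/{id}/files` paths, and the `search_for`, `name` and `download_url` field names are written from memory of the public Figshare API v2 and should be treated as assumptions to check against the official documentation. The network functions are defined but not called here.

```python
import json
import urllib.request

# Assumed base URL of the public Figshare API; verify before use.
API_ROOT = "https://api.figshare.com/v2"


def build_search_request(query, page=1, page_size=10):
    """Build (but do not send) a POST request for an article search.

    Endpoint path and body keys are assumptions based on the public docs.
    """
    body = json.dumps(
        {"search_for": query, "page": page, "page_size": page_size}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{API_ROOT}/articles/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tabular_files(files, extensions=(".csv", ".xlsx", ".xls")):
    """Keep (name, download_url) pairs for files that look like data tables."""
    return [
        (f["name"], f["download_url"])
        for f in files
        if f.get("name", "").lower().endswith(extensions)
    ]


def fetch_json(request):
    """Send a request and decode the JSON response (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def collect_tables(query):
    """Search for items and gather their table-like files (network I/O)."""
    tables = []
    for hit in fetch_json(build_search_request(query)):
        files_request = urllib.request.Request(
            f"{API_ROOT}/articles/{hit['id']}/files"
        )
        tables.extend(tabular_files(fetch_json(files_request)))
    return tables


# Offline illustration with a made-up file listing.
example_files = [
    {"name": "Table_S1.xlsx", "download_url": "https://example.org/files/1"},
    {"name": "Figure_S2.pdf", "download_url": "https://example.org/files/2"},
]
print(tabular_files(example_files))
```

Keeping request construction and response filtering separate from the network calls means the logic can be tried on saved responses first, which is also a polite way to develop against a shared public API.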

28:57 And from this as well, they propose a set of guidelines for formatting supplementary data tables that allow for published data to be more readily used by the community.

29:09 So what they've said is, for a very specific, niche area of research, data with no checks are useful. They can create novel data based on that top level.

29:23 But what they're saying is, if we want to move down this gradient towards interoperable and re-usable,

29:31 here is a community-led set of guidelines in order for the research to be more useful. And so this is fantastic.

29:43 This is essentially a thematic repository, but if everybody in the space starts following these credentials and these guidelines, then you can make use of generalist repositories and make content available in a way that is very findable, accessible, interoperable, and re-usable.

30:08 And a lot of this is in the formatting of particular spreadsheets.

30:15 But it also relies on existing repositories to improve their metadata flexibility. And I think this is what we're looking at going forward, as well.

30:24 Um, so we also have examples of the other end of the gradient here as well. If you've seen me talk recently, you've seen I've got very excited about DeepMind and the work they're doing.

30:38 They have some improvements happening all the time, but they caused a wholesale change in academic research and novel data discovery in using machine learning,

30:53 artificial intelligence, on protein structure, to, um, go from 17% of all proteins in humans known about, to 100% of all protein structures in humans predicted with accuracy. And they have since evolved this into the bigger protein structure folding world. And this has huge ramifications for drug discovery. And it's already winning huge academic prizes.

31:25 And, as I said, I'm going to be making random predictions, or not random predictions, informed predictions from my mind, about where we're going. And I do think AlphaFold and DeepMind will win the Nobel Prize for some of the work they do, which I think will be very interesting as a tidbit of how academia is moving forward. So we're thinking about homogeneous, consistent, high-quality metadata, and how we get to that.

31:53 That is what we want to move towards, but I don't think that homogeneous means a repository for every type of content.

32:08 I think what it means is that the same types of content from the same fields of research will be treated in the same way by the researchers who are publishing. This can, of course, be helped, and I think this will, of course, be helped by curation.

32:24 I think dataset curation is one of the things that we're going to see more of in the next 10 years.

32:31 And it wouldn't surprise me if this becomes the model for every type of published dataset going forward, where you have some level of filter that is, this dataset has been checked for set criteria by a trained professional, in the same way that we have peer-reviewed journals. We have peer-reviewed publications, and you can say, I only want to see the peer-reviewed content, I don't want to see the preprints. I only want to see the peer-reviewed content.

32:58 We do know that by having curation, we have more metadata, and we do tick that box of more views and more downloads, more of the catering to the researchers getting what they want in terms of impact.

33:13 So I do think there'll be a positive loop effect there, where researchers will make more data available, they'll learn how to describe it better, and then they will subsequently have more impact and make more data available, a virtuous cycle, so to speak.

33:29 So, I've mentioned a lot about the machines. I think, yeah, we've got a really interesting thing happening all the time. There are all of these unknown unknowns.

33:42 I mentioned this perverse incentive structure at the beginning, and how one of the predictable but unforeseen aspects of open access publishing was predatory publishing, was a huge, um, percentage of the market being owned by gold open access publishing.

34:10 And I think there will be more concerns about the fact that the percentage of large-scale artificial intelligence experiments coming from academia has virtually gone to zero.

34:27 And so, you know, competition is good from industry and academia.

34:33 But it would be really good to have academic research projects investigating how artificial intelligence is being applied to open academic data, because I think that's a big concern. And I think that will happen.

34:51 I think one of the things you'll see is more research funding, and more models coming out of academic research groups to help assist with the generation of novel content.

35:06 I think that comes back to this research here, right? Researchers are getting more computational in everything that they do.

35:14 And so using an API to scrape 150,000 analyses is no longer something that is outside of the norm of a generic research project.

35:25 So I do think we'll see more computational theory in each of the research groups going forward.

35:33 This is a really interesting one where things just pop up, right? This is very zeitgeisty.

35:43 If you are familiar with ChatGPT, it is an AI model which interacts in a conversational way.

35:54 And I think it's very interesting to see how these things move along and how widely adopted they are. So, I think this is hilarious. You know, Facebook took 10 months to get a million users. ChatGPT took five days, since it launched last month.

36:08 And I think this has huge implications for the academic publishing industry writ large, in terms of the educational industry, the content generation industry. If I ask it, what are the benefits of open academic data, it can tell you more than I could ever tell you, in a much more concise way.

36:32 It can make research more transparent and reproducible, which can help improve the overall quality of research. It can make it easier for other researchers to build on existing work, potentially leading to new insights and discoveries. All of these, I consider to be facts.

36:48 All of these, I consider to be hugely insightful.

36:52 The thing we need to know more about is how it's all working, what's going on behind the scenes, because we need transparency in this. It can make research more transparent, more reproducible. We need open methodologies around the AI to understand exactly how it is inferring, if we're going to take it to the level of protein folding and drug discovery. We need transparency.

37:15 What are the barriers to global open academic data? So this is what ... GCT came up with.

37:22 Interestingly, as I mentioned, we do the State of Open Data report every year.

37:27 And if you ask researchers what their barriers to sharing data are, they come up with a very, very similar list as their top priorities. Interestingly, that holds whether you ask a human or ask a machine this question.

37:42 So, the bottom section here pretty much matches one for one.

37:47 The only slight difference is the concern about the misuse of data, which I've always inferred as follows: with the rise of anti-vax and COVID misinformation, and even fake news as a hashtag, the misuse of data for nefarious use cases has, I think, been growing.

38:11 The closest ChatGPT parallel here is intellectual property rights.

38:18 There may be concerns about sharing data that could be protected by intellectual property rights. That doesn't come up from the researchers. But maybe their concerns about misuse of data cover that. That's the human element of "I don't want people scooping my research. I want to get credit for all of my output."

38:34 Interestingly, if you ask ChatGPT, "What novel research findings could you elucidate from Figshare?", it says it is not possible to provide specific novel research findings from Figshare, as it is a platform for researchers to share and discover research data.

38:52 So it depends.

38:54 That's wrong, ChatGPT. We found out recently you can elicit new, novel research based on all of the outputs that have gone into a generalist repository.

39:03 So I just wanted to highlight that as a new and innovative way in which the research field will be impacted by machine learning and AI. We need transparency in the background to understand what we are treating as facts.

39:20 And when we have facts, we need methodologies, and we need the data from which we've elucidated those facts. And presenting things as facts that aren't necessarily facts needs citations.

39:32 So I'm going to round out with some predictions of what I think is going to happen in the next 10 years.

39:41 I think the next 10 years for open data is really focused on repositories, librarians and custodians of open academic research data.

39:50 I think the big thing for me, when I thought about this, is the sustainability of those repositories.

39:55 I think open academic data will move towards 100% publication in repositories of some format, because it's easier to query the APIs of repositories than it is to scrape the text of 100 million articles and look for files in there in a consistent manner.

40:15 It's easier to scrape those papers and look at data availability statements.
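
To illustrate the point about querying repository APIs in a consistent manner, here is a minimal sketch in Python. It assumes a hypothetical paginated repository API; `fetch_page`, the record fields, and the example DOIs are placeholders invented for illustration, not part of any real repository's interface.

```python
from typing import Callable, Dict, Iterator, List

Record = Dict[str, object]


def iter_records(fetch_page: Callable[[int, int], List[Record]],
                 page_size: int = 100) -> Iterator[Record]:
    """Yield every record from a paginated API until an empty page comes back."""
    page = 1
    while True:
        batch = fetch_page(page, page_size)
        if not batch:
            return
        yield from batch
        page += 1


def fake_fetch_page(page: int, page_size: int) -> List[Record]:
    """Hypothetical stand-in for an HTTP request to a repository API."""
    data = [{"id": i, "doi": f"10.0000/example.{i}"} for i in range(1, 8)]
    start = (page - 1) * page_size
    return data[start:start + page_size]


# Walk the whole (fake) repository three records at a time.
records = list(iter_records(fake_fetch_page, page_size=3))
```

Because every page has the same shape, the same loop works for seven records or for millions, which is the consistency that scraping article text does not offer.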

40:20 I think the technological interoperability of repositories is a really important thing, because I think there are some overlooked areas that I'll touch on: the quality of metadata and the provenance of the content, how it's linked to other files, how it's linked to other researchers, how it's being re-used.

40:42 And I think information designed for the machines and AI is something to think about.

40:46 So when we think about funding models for repositories, that's going to be a very important factor. As I mentioned, at Figshare we provide software-as-a-service infrastructure for hundreds of organizations around the world.

40:59 And so the chance that one of those organizations stops working with Figshare is not necessarily low.

41:07 But the chances that all of them do are much lower.

41:12 What's interesting is, if you look at the funding models, there are, you know, membership models, philanthropy, corporate sponsorship.

41:22 We've seen recently, with the call from Jisc, that Jisc has pulled the funding for CORE in the UK.

41:29 And I think that highlights the need for continued development of business models and sustainability models for these digital data repositories.

41:42 Interestingly as well, I think, is when we start to pull out analyses of the data that's already out there.

41:50 So this is Roche et al. They looked at some Dryad data and they said that half lacked the data needed to reproduce the work.

41:59 More than one third were

42:01 either not machine readable or essentially unusable in other ways, and that is data that has been curated.

42:10 And so I don't think curation is going to be the be-all and end-all.

42:13 But I do think that it does shift the needle in making the content more available and giving a bigger return on investment.

42:22 If you've funded some research for $5 million, then the funders will make the $300 available to curate that dataset, so it's more useful.

42:33 I think Dryad have been trailblazers in this space.

42:38 And the fact that they have to persuade funders that this is an important model is very different to

42:46 the funders demanding that they get the most out of their data. And I think that is one of the shifts we'll see in the next 10 years.

42:54 I think interoperability of metadata is being worked upon

42:58 by

43:00 this particular group, where the National Institutes of Health is funding the Generalist Repository Ecosystem Initiative. And it's basically making sure that all of the generalist repositories are working together in all of these different fields.

43:12 Things like adopting consistent metadata models, so that you should be able to query these repositories and pull out consistent metadata, in ways that can help you build on top of the research that's come

43:26 before. This is really interesting, because this is interoperability between lots of repositories that host lots of heterogeneous research data, which can be pulled together once it has been filtered. That's very different to

43:43 siloed repositories that aren't necessarily working together, even though they work in the same fields, or very adjacent fields. If you have a subject-specific repository that is looking at air quality, and a subject-specific repository that is looking at asthma, are they interoperable in the way that these generalist repositories will be going forward? And if you look at ...

44:11 data dot org, which has 3,065 data repositories listed, of course, to highlight here, if you click on some of them, they're not all there anymore. Some of them have gone away.

44:21 For some of them, the sustainability models didn't last.

44:26 But more concerning for me is 53%. So 47% use persistent identifier systems. They're not all the same one. The majority use digital object identifiers, which is great.

44:41 But 53% of them don't have any persistent identifier system, which I think is a real problem, as I see that as a core tenet of what a repository needs.

44:51 Of those 3,000 repositories, 39% of them don't have an API.

44:57 All of the generalist repositories do have an API,

45:00 the ones I've just listed, but not every repository that is made available has an API, and, again, there are different metadata schemas, different ways in which the API works.

45:11 But it is a real concern that 39% of these repositories have no API. And in terms of quality management, 54% say

45:20 yes, but what that means is subjective,

45:22 I'm sure. When it comes to metadata standards, that's something that I don't think will be solved in the next 10 years. I think, you know, of the classic xkcd comic, where it says we'll get more and more metadata standards.

45:37 But I think there will be a push towards core standards for cool research outputs.

45:45 If you can get, you know, much more minimal-information biological and biomedical specifications working, that's fantastic, but that needs to be combined with APIs and persistent identifiers to make things more interoperable.

46:01 What I think will happen for the vast majority of research (this is a piece from Todd Carpenter that came out recently) is, you know, can we just call things what they are?

46:09 We all know that research outputs should have persistent identifiers. They should have an ORCID. They should have a DOI, or a persistent identifier that's analogous. They should have a ROR, a Research Organization Registry field. They should have a consistent persistent identifier for funders.

46:31 All of these things are obvious things that will move the needle on a lot of this stuff.

46:37 So, can we just put it in the policy? Can the funders start being explicit about what they mean? It will annoy folks who have very niche

46:47 identifiers, as it will suggest that some identifiers are better than others, but as long as they are community-owned identifiers, I don't think that's an issue.
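
As a rough sketch of what an explicit identifier policy could look like when checked by a machine, the Python below reports which core identifiers a record is missing. It assumes the four identifiers meant in the passage above are a researcher identifier (such as ORCID), a DOI, an organization identifier (such as ROR) and a funder identifier. The field names and the example record values are hypothetical placeholders, not a real schema.

```python
from typing import Dict, List

# Assumed policy: every record carries these four persistent identifiers.
CORE_IDENTIFIERS = ("orcid", "doi", "ror", "funder_id")


def missing_identifiers(record: Dict[str, str]) -> List[str]:
    """Return the core identifier fields that are absent or blank in a record."""
    return [
        field for field in CORE_IDENTIFIERS
        if not str(record.get(field) or "").strip()
    ]


# Placeholder values for illustration only.
complete = {
    "orcid": "0000-0002-1825-0097",
    "doi": "10.0000/example.1",
    "ror": "https://ror.org/example",
    "funder_id": "funder:example",
}
partial = {"doi": "10.0000/example.2", "orcid": ""}

complete_gaps = missing_identifiers(complete)
partial_gaps = missing_identifiers(partial)
```

A check like this only confirms that the fields are populated. Validating that each identifier actually resolves would need a lookup against the relevant registry.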

46:57 So, my final slide, and then I'll let you all go. If there are any questions I'm happy to answer, but if not, you can contact me and pull me up on these predictions in 2030 and beyond.

47:08 So I do think researchers will be held to account in the next 10 years about their open data. I do think that the funders will be able to query this in ways that they have not been able to before.

47:18 I do think the research community will call out their fellow peers and say, "Data available upon request is not good enough in this example," and start holding researchers to account.

47:31 I think some repositories will need saving. I don't know how, I don't know who.

47:35 But I think there is a constant battle to make sure that this content persists.

47:42 I think that core metadata libraries will exist across academic data repositories. Others will be available. Extensions will be available.

47:50 But all repositories should have a core level of metadata and a core level of functionality.

47:57 I think this will provide a gap, in that the funding models for several different types of repositories are not the same.

48:07 And I think that subject-specific repositories, some of the essential repositories that we've had for 50 years, for 20 years, need to have an update to their sustainability model, or they need to have an update to their funding, so they're operating at the same level that will be expected of the generalist repositories.

48:28 If the generalist repositories are all pulling together funding information, and associated literature, and all of these things, then if you go back to my original animated GIF, I showed you that we'd filtered that data based on: does it have associated funding? And does it have an associated organization?

48:50 That is a good example of where data will be elucidated from generalist repositories, because they have standard interoperability.
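
The idea of filtering across repositories on shared fields can be sketched in Python as follows. This is an illustration under stated assumptions, not how any real repository works: the two input layouts, the `CoreRecord` fields and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CoreRecord:
    """A minimal shared metadata model (assumed for this sketch)."""
    identifier: str
    title: str
    funder: Optional[str]
    organization: Optional[str]


def from_repo_a(raw: dict) -> CoreRecord:
    """Map a flat record from hypothetical repository A onto the core model."""
    return CoreRecord(
        identifier=raw["doi"],
        title=raw["title"],
        funder=raw.get("funder_name"),
        organization=raw.get("institution"),
    )


def from_repo_b(raw: dict) -> CoreRecord:
    """Map a nested record from hypothetical repository B onto the core model."""
    attrs = raw["attributes"]
    grant = attrs.get("grant") or {}
    return CoreRecord(
        identifier=attrs["pid"],
        title=attrs["name"],
        funder=grant.get("funder"),
        organization=attrs.get("affiliation"),
    )


records = [
    from_repo_a({"doi": "10.0000/example.a1", "title": "Air quality readings",
                 "funder_name": "Example Funder", "institution": "Example University"}),
    from_repo_b({"attributes": {"pid": "10.0000/example.b1", "name": "Asthma admissions",
                                "grant": None, "affiliation": "Example Hospital"}}),
]

# Keep only records with both associated funding and an associated organization.
funded = [r.identifier for r in records if r.funder and r.organization]
```

Once both sources are mapped onto the same model, the filter is a single line. Without a shared model, each repository would need its own bespoke filter.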

49:01 I think there's going to be a gap between the generalist repositories and the subject-specific repositories, and the subject-specific repositories will still be fantastic for that model of "I can build on top of this research."

49:10 But when we want to ask big questions of the interoperable research between subjects and between fields, I think that is where new knowledge is going to be discovered. And I think that is something that will be essential. So I think that gap needs to be filled.

49:28 I think funders will start paying for dataset checks, like I mentioned with Dryad. I think that model will become ubiquitous.

49:35 I think we need to be aware of what's happened in open access publishing, but I think that's going to be happening in the next 10 years. And I think humans and AI, the machines, will provide more new knowledge than we expect. I think the ...

49:49 has shown us that we have no comprehension of the speed at which machine learning and artificial intelligence are advancing.

50:02 And so, take some of these things, like saying that the research data that was published on Figshare in 2011 that is poorly described is never going to be useful to anybody. As soon as you can start applying context,

50:14 and as soon as you can start pulling in other linkages, I think the machines will move in a way to discover new knowledge,

50:25 and at a rate that we can't comprehend right now.

50:28 So, these are my big predictions. Screengrab it, pull me up on this.

50:32 Tell me I'm talking nonsense, but if you have any other questions, or if you have any thoughts around this, please do get in touch with me at any of these places. And I'm happy to answer any questions.

50:43 Thank you, Mark. We have no questions in the box, so either everyone agrees, or they're saving it to follow up with you afterwards. I guess I'd just like to say thank you, everyone, for coming today, and thank you to those I can see online that have come to a lot of Figshare webinars this year. We're already working on our schedule of webinars for next year, so we'll be sending out invitations for those very soon. And so now is the time I can say, have a happy holiday break, and we will see you next year.

51:11 Thanks, everyone! Bye!
