State of Open Data 2022

October 25, 2022

Mark Hahnel

Find out the latest on researchers’ attitudes toward, and practices of, open data in the seventh installment of the State of Open Data survey and report. The longest-running longitudinal survey and analysis on open data practices, this year’s report pulls together a host of contributed pieces addressing the challenges and opportunities found in the survey results.

The full report and survey results are now openly and freely available here.

Transcript

Please note that the transcript was generated with software and may not be entirely correct.

Hello, everybody. I'm just gonna give it a couple more minutes for everyone to get online, and if you can't hear me at the moment, do say so, and we'll get it figured out before we kick everything off.

0:28

Lovely. There's already quite a few people online, so we'll kick things off. I'm just gonna share some housekeeping before I pass over to Mark and Greg for the bulk of today's presentation. So welcome, everyone, to our annual State of Open Data webinar. Pretty exciting to see so many of you joining us. All attendees are in listen-only mode, but if you do need to ask us something, or get some clarification, you can pop that in the chat box or the Q&A box on GoToWebinar, and I'll be monitoring both. If we don't get time to answer all the questions, in the eventuality we get loads, don't worry, we'll be sure to follow up individually with those that ask, because we'll be able to see who asked what. We are recording today's session and we'll be sharing that recording with everyone that registered over the next day or so. If you need to drop off at any point, not to worry, you will receive the recording. So without further ado, I will pass over to Mark and Greg. Thanks, everyone.

1:26

Thank you, Laura, and thanks to everybody for joining. I am Mark Hahnel and I'm the founder and CEO of Figshare. We've also got Greg Goodey, who is a Senior Research Analyst.

1:38

And this is the second time I've had this happen.

1:44

When I go to start my meeting and my slides run away, so bear with me.

1:50

Normal service should be resumed. So I'm Mark Hahnel, founder at Figshare, and we have Greg Goodey, who's a Senior Research Analyst at Springer Nature, who's going to be talking today.

2:02

So for those who don't know, we'll be talking about our report that just came out a couple of weeks ago, the State of Open Data. If you haven't seen it, go and download it. We'll be digging into some of the data and some of the broader concepts around it.

2:16

With regards to why we have an interest, from the Figshare side of things: for those who don't know, we build repositories. We build data repositories, and we have done for a decade.

2:29

And the technology side of things is just one part of this puzzle, right? It takes technology, it takes culture change, and it takes people.

2:41

And so, whilst we build repositories, we're very much focused on the, um, policy-compliant infrastructure side of things, which is taking care of what I think, I don't want to say boring things,

2:55

but very important things, such as web accessibility, ISO certification, persistence, and preservation of content on an active data level. And so, we really want to understand what's happening with the researchers day by day. And by doing a longitudinal study, we can start thinking about how we can cater our technology in order to better suit this explosion of data publishing that's happening.

3:23

So, I'll pass it over now to Greg to talk us through the survey itself.

3:31

Thanks, Mark. So yes, as Mark said, I'm Greg. I'm a Senior Research Analyst at Springer Nature. I hope you can all hear me OK. So I sit within the market intelligence team of Springer Nature, whose aim is to advocate and educate our different teams or business units about the perceptions and behaviors of our customers.

3:47

We do this through traditional market research approaches.

3:50

So, within this role, I have now, for the last five years, run the bulk of the State of Open Data survey, from managing and developing it to analyzing the data in and of itself.

4:02

Similar to what Mark said about Figshare for Springer Nature.

4:07

The reason for our involvement is that, as a business, we believe an open research environment is a positive thing, and we realize that we have a responsibility to support that transition, within the community and within our own business. You know, we're still on a journey to be fully transparent and open. Therefore, being involved in building an evidence base for the needs of the community, and recording evidence over a very long timeframe, is key to the efforts that we're trying to make and conduct. And I believe, you know, the State of Open Data survey, and its part of the report, does provide a snapshot of this landscape.

4:46

Slide

4:50

Um.

4:51

So, my involvement today is really just to talk a little bit about how the survey is developed and who responded to the survey, to give context to the findings and the discussion and, you know, the outcomes of what we learned.

5:07

So, really, quickly, if I just give you an idea of what the survey is.

5:10

So, it's a digitally available survey that's optimized for both computer and mobile, so that people can access it however they wish. And we translate it to encourage better uptake within specific regions. This year, the translation program extended a little bit: we translated it into French. And we have plans to continue to translate into additional languages for 2023.

5:33

The survey is live for approximately two months, every year through June and July.

5:37

And we distribute via a number of channels that we as a business have available to us, or as businesses have available to us, which include emails to our web registrants and readership, social media outreach, and blog posts.

5:50

This year, all our efforts resulted in round about 5,000 responses.

5:56

Um, I think I've lost the slides, but I'll keep talking.

6:04

That is an increase on the last two years, so, you know, it was a very good year in terms of the number of responses we got, but it is worth stating that this respondent pool is a sample, which comes with inherent biases.

6:18

In terms of, you know, who we can actually reach. So, what I'm trying to show in the next couple of slides is how balanced our survey sample is, and therefore how representative it is of our global community.

6:30

Next slide, please.

6:34

So, this somewhat overwhelming slide here illustrates the demographic pool for 2022.

6:42

As can be seen on the right-hand side, we have a fairly well spread set of responses across different fields of interest and career stages of respondents. On career stage in particular: last year, over half of the panel was made up of senior academics, and we managed to balance that a lot more this year, so that each seniority level is more evenly represented among the respondents.

7:06

So we are trying to balance that. Geographically, we can also say that we have sufficient representation across continents to be able to conduct some comparative analysis, although the other thing that is obviously quite clear is that there is a bias towards the northern hemisphere.

7:22

One particular difference in 2022 is that we saw a significant increase in the responses from China, which is now the country where we see the most responses, and it's the first time that the US hasn't given us the largest response rate from an individual region.

7:36

And that China response significantly increased, from 3% of the respondent population to 11% this year. That was due to a number of factors, including a really good social media campaign set up by the marketing teams out there.

7:53

And 2022 was also the first year in which we looked at the economic status of the countries the respondents were from. And it's interesting to, you know, see that difference.

8:04

And we're making sure that next year we try to attribute and balance that a little bit more because, as we have seen in some of the results, there is a clear differentiation between the beliefs and behaviors of people from countries with different economic backgrounds.

8:19

So, always in development, always trying to strengthen that panel, but generally representative, and at least we can identify whether or not there is a particular shape to it.

8:29

Next slide, please.

8:31

And then finally, one thing that I just wanted to comment on as well is one question I always like to try and address, because it's one question I always get at the end: how biased is our sample towards those who are actively engaged in advocating open science, and open data more generally?

8:43

To address this in the survey itself, we asked respondents to what extent they agree or disagree with a number of statements that are highlighted on the left-hand side of the slide.

8:54

This is an adaptation of the question we asked in 2021, but we've brought in a little bit more detail based on feedback that we received in 2021.

9:08

And I think this has enabled us to pull out a few different segments from what we did last year, for anyone who attended the webinar then. Perhaps unsurprisingly, our panel is made up predominantly of those who are pro open science, the open science advocates. That segment is made up of those who strongly agree with all of the statements on the left-hand side.

9:26

But we've also got fairly strong representation of those who are open publishing advocates, as well.

9:30

So, those who strongly agree only with making articles and data openly available, as a slight differentiation.

9:38

But I would also say that we have a healthy response from those who are sort of agnostic, or those who are completely anti open science as well, so those who don't believe in open science practices. They make up sort of 15 to 20% of our panel. This means that, you know, while we might not necessarily be able to weight to the global average, and I'm not sure what the global average of those who are agnostic or anti open science is, we can at least try to highlight the differences between those who advocate for an open ecosystem and those who don't. We do have ways of pulling them out, or at least identifying the differences, if people want to look at the others.
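The segmentation Greg describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the survey team's actual method: the statement keys (`open_access_articles`, `open_data`, `open_peer_review`), the answer wording, and the catch-all `other` segment are all hypothetical, since the talk only defines the two advocate groups.

```python
# Hypothetical statement keys and answer wording; the real survey text is not
# given in the talk, so treat every name here as an assumption.
STRONGLY_AGREE = "strongly agree"
PUBLISHING_STATEMENTS = frozenset({"open_access_articles", "open_data"})


def classify_respondent(answers: dict[str, str]) -> str:
    """Assign one respondent to a segment based on their agreement answers."""
    strong = {
        key
        for key, answer in answers.items()
        if answer.strip().lower() == STRONGLY_AGREE
    }
    # Strongly agrees with every statement shown.
    if answers and strong == set(answers):
        return "open science advocate"
    # Strongly agrees only with making articles and data openly available.
    if strong == PUBLISHING_STATEMENTS:
        return "open publishing advocate"
    # Agnostic and opposed respondents are not defined precisely in the talk,
    # so this sketch groups them together.
    return "other"


example = {
    "open_access_articles": "Strongly agree",
    "open_data": "Strongly agree",
    "open_peer_review": "Neither agree nor disagree",
}
print(classify_respondent(example))  # prints "open publishing advocate"
```

Splitting the `other` group into agnostic and opposed respondents would need the actual thresholds used in the analysis, which the transcript does not state.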

10:17

In terms of the full results, I'll hand back to Mark to sort of go through what the highlights are.

10:23

Thanks so much, Greg. Every time I turn my camera on, I lose the slide. So hopefully they're back.

10:29

But, yeah, that's really interesting. Hey, I love the term "non-believer of open science", as if it's something that is fiction. Open science is happening, folks. I'll tell you more about the trends over the years, to see how that is coming about. Obviously, as Greg was saying as well, it is hard to get a fully representative view of every single demographic.

10:57

But it is fantastic that we are getting this diversity covered.

11:00

And I think in terms of the thought leadership pieces as well, it's fantastic to see such a high caliber of thought leadership pieces, and we've got some talk on each of them coming in this session.

11:12

They not only highlight the level at which this is being actively thought about, but also the different aspects of the global population looking at it. I think it's fascinating that China is responding so well. As Greg mentioned,

11:28

There is a big, there was a big marketing push in that segment, as well. So congratulations to the team for getting that out there.

11:37

But we're also seeing, my colleague, Daniel Hook, recently brought out a white paper, looking at, um, the five big metrics around academic publishing, and how China is advancing. So I think that's a really interesting, additional bit of color commentary to, to think about there.

11:57

I'm going to talk about now, what, what we're seeing over the years, what we're seeing this year, just some pull out highlights.

12:05

And then, try to kind of think about how this means we're moving forward in this space.

12:09

So, some factual stuff, some presentations from the different sides, and some more thoughts on what this could mean, and how we, as a community, can move the space forward. As has been mentioned already, it has been running for seven years. We have had feedback from 192 countries over that time. And that sustained look has really helped. So, this is from last year: some of the really interesting things that we found,

12:39

and how we've tried to think about them in our 2022 improvements to the way we communicate with researchers, the way we build our technology stack, and the way that Springer Nature engages, as well as other publishers.

12:58

So, we know that last year, we had a big, a big spike in concerns about misuse of data.

13:06

I think this idea of trust in a time of COVID, and in academic publishing in general, is something that we're aware of. And so, how can we start to build trust in data publishing?

13:21

Not receiving appropriate credit or acknowledgement has continued to increase as a factor in why researchers might not want to share their data.

13:37

And then a lot of the other ones were always about, kind of, you know, who's gonna pay for it? And who's going to help me do it? because still, researchers don't know a lot about licensing. They have a lot on their plate. So these were common themes we saw last year, and a lot of them weren't too surprising.

13:55

What did surprise me last year was this idea about funders: two thirds of researchers said funders should be mandating sharing of research data from the researchers they fund.

14:08

And they were saying that if funders do say that you have to make your data available, they should remove funding if you're not doing it. And I think this is along the lines of: if we fund you, you have to make your data available, and it has to come from the funder. Otherwise, certain researchers may feel they're not getting the same advantage as their peers who might not be sharing their data. So we can talk about that a little bit as well. I've also got a little bit about how researchers are actually acting these days, thanks to some colleagues.

14:46

So we can see whether what people say and what people do is consistent.

14:56

This year, in 2022, we have seen a growing push. Probably the big news of 2022 happened very early on, in the NIH issuing a seismic mandate.

15:08

As we saw over at Nature, it is being hotly debated how seismic it is. But from the conversations we've had, it has definitely moved the needle on how researchers will have to act, particularly in the life sciences.

15:27

This is also fantastic in that we've got participation from the NIH in the report itself. So please do check that out. We will hear from them later on.

15:38

We also saw UNESCO issuing an Open Science policy.

15:44

We've seen several funders around the world increase what they are doing.

15:49

But I think the really interesting thing about the NIH policy coming out, which goes live in January 2023, so just three months from now, is that they've backed it up with funding around some of these different spaces.

16:04

I'm co-chair of the Open Metrics Group of the Generalist Repository Ecosystem Initiative.

16:11

And this is the NIH funding repositories to work together to develop consistent metrics. As always, you know, if there is a subject-specific repository, you should be making your data openly available there.

16:26

If there is an institutional or thematic repository, that should go next because there's usually someone there to help you.

16:33

Then the generalist repositories listed on the right here can work together to try and make sure that (a) everybody has a place to make their data available, and (b) we can make the metadata more interoperable. We can ask big questions of the data, and we can elucidate more knowledge from the data.

16:55

And I really like the idea of the Open Metrics subgroup, because when we think about what I showed you previously on, how researchers are concerned about sharing their data, because they're not getting credit.

17:12

I think one of the first problems we have to solve there is, how do we measure credit, Right?

17:18

The funders can't come out and say, "we're going to measure credit based on downloads" at this point, because your data might be videos, and nobody downloads videos off the internet anymore.

17:31

So, implementing open metrics, so that we have a level playing field across different repositories and consistency in the ways that we measure impact,

17:40

should then lead to a way in which funders themselves can provide credit to researchers for sharing their data.

17:51

And it won't just be a sticks approach; we'll have the carrots as well.

17:57

We've obviously seen in publications that publishing your data in a repository is associated with a 25% increase in citations to the paper itself. That's the heavily cited paper in PLOS Biology.

18:11

But while I think this is a huge statement, for researchers to see that they can get more credit for their data, I think it is also very hard for researchers to keep on top of academic papers outside of their specific fields. So, having some kind of reward mechanism should move the space forward.

18:35

The NIH was the big one. Obviously I mentioned UNESCO; NASA and other folks had policies come out, and some big publishers too. But if we look at the ... powered Sherpa Juliet, we know that 52 funders are listed that require data archiving as a condition of funding, and a further 34 encourage it. A lot of them also require data management plans,

18:58

which, the theory has it, gets researchers thinking about publishing their data from the very start.

19:07

What's fascinating about this, this is the first snippet from this year, is, When the rubber hits the road, do the researchers want to, are they really the Open science advocates that they say they were?

19:19

And so, more than two thirds of respondents to the State of Open Data 2022 are supportive, to some extent, of a national mandate for making research data openly available. But this number has been declining.

19:33

And my hypothesis here is that when researchers are asked, "Do you think open access is a good thing? Do you think open data is a good thing? Do you think open research and open science is a good thing?", it's very easy to say yes.

19:46

Um, which is why it's more puzzling to me that we have people who are non-believers in open science.

19:55

But, um, but when the rubber hits the road, are they actually doing it?

19:59

Because, although we try and keep the barrier low, there is some extra work needed there.

20:06

And researchers already have a lot of their allocated time going to administrative purposes, like going for new grants, setting up things like ORCIDs, and making sure they're complying with the institutional policies, the publisher policies, and the funder policies. So the decline, I think, is something to flag.

20:25

I think we have to be very realistic about what researchers actually feel as the space evolves.

20:33

An interesting fact from this year's report: this was who researchers would be willing to receive support from to help in reviewing, curating, and preparing the data for public release.

20:45

And I think this is the one. When I think about publication of datasets,

20:52

I don't think of the metadata improvements as peer review, I don't think, with AI.

20:58

I think it's about kind of an integrity check on the actual data publication itself. Do you have the right license? It's not saying, is this data novel? Can I rerun the experiment, the reproducibility and FAIR side of things?

21:13

It's really at the point of: who would you go to for help in

21:17

checking that you're doing things right in publishing your data? 41% said they relied on publishers.

21:24

38% said they relied upon their own institution: research offices, peers, and librarians. I think there's not much in this, but it is interesting that publishers is, as it has been in previous years, slightly higher. I think this is also related to the fact that researchers are busy and don't always proactively go out and seek help within their institution.

21:50

But, at the same time, when they go to publish their papers, which is often the time they have to make the data available, thanks to great policies from Springer Nature, or PLOS,

22:02

or lots of other publishers who now have data policies, that is the first point at which they're faced with, I need to make my data available, and the publishers are the folks I usually go to for disseminating my research.

22:15

Um, some other quick tidbits, just high-impact statements on what we're seeing in the

22:27

major findings. Of course, the data is made available, so we will be sharing the links to that, so everybody can go and dig into the data themselves. If you have any questions about the actual makeup of the survey, Greg can help with that at the end.

22:42

Although 38% said that they'd go to the library for help, 72% said they would rely on an internal resource for help with managing or making their data openly available.

22:56

So I work with librarians all the time, and I know there's a lot of librarians doing heroic jobs, trying to engage with researchers day by day, one at a time, groups at a time.

23:10

But I think there's also this idea that not every library has something, not every institution has something to help make the data available.

23:23

Or people who can cover this broad spectrum of new things that researchers need to do around open access, open data, and policy compliance. So I think, again, that's something we need to be thinking about for an equitable future here.

23:39

67% of researchers said citation of their research papers would motivate them to share their data.

23:47

So that's fantastic news that we know that if you share your data in a repository, on average, you will get more citations to your paper. That was on half a million papers. So it's no small numbers. It doesn't mean that everybody is going to.

24:03

But I think this is really encouraging, and if you're if you're talking to your researchers at your institution, this is a fantastic message to be telling them.

24:12

It's also a fantastic message for institutions wanting to encourage data

24:20

publishing at the institution, because we know that when we talk about rankings, however much we don't want to, and however much we think that research quality is independent of where it's published, we do know that rankings are real, and rankings do exist.

24:38

And universities take them very seriously, rightly so. 70% of researchers said they were required to follow a policy on sharing data for their most recent piece of research.

24:50

So this is what I was saying about, it's happening.

24:54

This is moving along. We know that this is only going to grow as the funder policies grow.

25:02

But 75% said they received too little credit for sharing their data openly, so that the credit issue is really the big one that we want to solve.

25:09

There are still a lot of questions about how we do that.

25:12

Any suggestions? I'd love to hear them at the end.

25:15

So now we're going to run through some short videos about the contribution pieces from the expert leaders in the space who contributed to the report.

25:25

First of all, we got Samuel Simango, Manager of Research Data Services at Stellenbosch University in South Africa. And he's talking about the steps the university has taken to comply with the National South African Open Data Strategy.

25:41

Video from Samuel Simango - Stellenbosch University

28:36

So, an excellent piece from Samuel, as he mentions, You know, it brings into account the national driver there.

28:44

The fact that there are nine different policies that institutions are having to think about, but also the fact that Stellenbosch has somebody like Samuel on the team to be already thinking about these areas.

28:56

And we need to make sure that that is happening globally.

29:05

So, again, here we have Juan Miguel Palma Pena

29:11

from the National Autonomous University of Mexico (UNAM). This is the first ever report contribution from the LAC region.

29:23

Let's hear from Juan Miguel.

29:31

Video from Juan Miguel Palma Pena - UNAM

33:12

Thank you, Juan Miguel. Again, big points there.

33:16

The fact that there are similar trends happening at different speeds in different territories. Fantastic to see nine countries involved. Fantastic to see Dataverse being used, our good friends at Dataverse, who we're working with on the Generalist Repository Ecosystem Initiative so we can get some normalization across the globe.

33:38

Also interesting and relevant is a contribution piece from the Computer Network Information Center of the Chinese Academy of Sciences, who talk about the infrastructure they've been developing and how it has helped result in a 21x increase in data deposits in China.

33:57

Um, next up we have Kate McKellar of Wiley on the contribution that she authored with Rebecca Grant and Matt Cannon about a separate survey that they ran back in the spring, looking at the results with the additional context of the State of Open Data survey results. So, let's hear from Kate.

34:23

Video from Kate McKellar - Wiley

36:37

Thank you very much to Kate, Rebecca, and to Matt for their report.

36:41

Again, a broad range of topics here, including those thematic requirements. You may see this if you're working at an institution and you're having to deal with different research groups.

36:54

The common thing I talk about when I speak to humanities researchers is, you know, what are the files you're producing, as opposed to data, because everybody works computationally now, and everybody has files of some sort.

37:09

OK, and last but not least in this section is a summary presentation video from the NIH, which we're very grateful for, on their contribution to the report, with an overview of the upcoming policy.

37:23

Ishwar, Amy, Taunton and Susan.

37:28

Video from the NIH

40:31

I think you'll all agree that we're very fortunate to have Susan Gregurick, Taunton Paine,

40:39

Amy Hafez and Ishwar Chandramouliswaran

40:43

(I know Ishwar very well, and I always struggle with the surname), who, as you can see by their job titles, are really pushing the needle in terms of research data sharing and using the same language across the world, with regards to why it's important and how to make it an actionable plan for research data moving forward.

41:10

Thank you very much to all of our contributors there. We will be sharing the slides afterwards. They will be made available, as well as this recording. I just wanted to touch a little bit on the other side of this, and on some colleagues of ours at Digital Science looking at what researchers actually have been doing with data so far. So we have some benchmarking before we go into big policies like the OSTP memo that came out in August,

41:38

with its aim of full compliance by 2026, and, obviously, the NIH policy that is going in in January, to add to the 51 other policies that are already live.

41:52

So, Digital Science also had a report on the state of trust and integrity in research, with perspectives on data sharing policies and practices. They looked at five different funders from around the globe.

42:06

Um, a private funder in the Bill and Melinda Gates Foundation, the European Commission, the National Institutes of Health, the National Science Foundation of China, and the German Federal Ministry of Education and Research.

42:20

And what this is looking at is what researchers have been doing with data over the last five or so years. If you look at data availability, statements, and location of the datasets. So data availability statements, for those who don't know, is in the published article, when it says, I have made my data available and you can find it here.

42:45

On the left-hand chart you can see that, although it looks relatively flat on these axes, even the flattest one here, the Bill and Melinda Gates Foundation, shows growth from 40% to 50% of the articles funded by them that have a data availability statement. If you look at the NIH, which is the purple line here, you can see it's gone from about 12% up to 40%.

43:14

All of the other funders have gone up and to the right. So there is growth here.

43:19

If we look at where researchers are saying that data is made available, for nearly all of the funders "in a repository" is the biggest statement. So that's fantastic to see.

43:31

Me personally, I get upset about "data available on request", which in Germany is still leading the way for the BMBF.

43:38

But we're seeing a real adoption over time, and these are high percentages. No

43:44

one is under 28%, which I think is really encouraging in terms of how many researchers are already having to make their data available. When we look at where researchers are making the data available,

43:58

we looked at a few repositories, and you can see the difference between 2016 and 2019.

44:02

Uh, for me, what's fascinating here, more than anything, is the growth of GitHub.

44:07

For those who don't know, GitHub is a fantastic tool, 86 million users around the world. But, it's not a repository in the true sense, in that it is not persistent and it doesn't provide digital object identifiers.

44:20

There is a fantastic team at GitHub working with repositories to try and ensure the software code and datasets made available in

44:31

GitHub can be sent to repositories like Figshare and Zenodo in order to get a DOI and pull that back onto the landing page on GitHub.

44:38

But it won't be something that becomes mandatory across the GitHub system, as they serve a lot of different communities other than academics.

44:49

What I found really interesting about this is when we compare it just to the information on Figshare.

44:55

On the graph on the top left you can see how this growth is happening.

44:59

So again, the generalist repository is going up and to the right, but GitHub is really exploding here. And I think the message I would have for researchers who are interested in making their data available, or their code available on GitHub, is: you've already done the heavy lifting. You've already done most of the hard work. Now you just need to snapshot it and put it in a repository, so it does persist.

45:23

It's also interesting to see, when we look at the Figshare line, which is the bottom right graph,

45:30

the data availability statement numbers going up and to the right: that is how many papers had a data availability statement saying that the data was available in Figshare, the free figshare.com. If you have a Figshare repository, we're not talking about those; this is just the generalist free repository.

45:49

But if you search Dimensions for just the same DOI throughout the whole paper.

45:56

So, looking in the full text of the article (the methods section, the references, the data availability statement), you see about double that number.

46:04

So it's fantastic to see that the actual number of datasets being made available is probably growing at twice the rate that we're seeing in this top left graph. So it really is moving along.

46:16

And I think that's really the take-home message: there is just an abundance of technology, culture and people, from publishers, from institutions, from societies, from funders, all moving in the same direction, towards FAIR data. And this is what's causing the change.

46:35

In terms of future compliance, this also means that we can start looking and saying, I mentioned metrics before being a good impact, as it is a carrot.

46:45

The same report, and this is also from a company, the Dimensions Integrity Module, can also start looking, funder by funder, at how many of their papers have code availability statements and data availability statements, two different fields.

47:03

And this is really powerful: to, at some point, be able to say, well, if you've had to make your data available, have you made your data available? And I think that's really moving the space along.

47:12

So, I'm just going to finish up with a couple of slides on what's coming next and where the space is moving.

47:19

If you do have any questions, get them in now for myself or for Greg.

47:23

Figshare turned 10 this year, so I've been looking back a decade and looking forward a decade a lot.

47:29

Just over 10 years ago, this paper came out... oh, sorry.

47:32

This book came out called The Fourth Paradigm, led by Jim Gray.

47:38

And Jim Gray was imagining a world in which all of the files were available online, all of the scientific literature was openly available, and all of the academic data was available. In 2009/2010, when this came out, it seemed like a long way off for the majority of academia. There were already great repositories, like GenBank and PDB.

48:00

But it's a pretty lofty vision that the fourth paradigm is going to allow us to move further, faster, and allow working with AI and machine learning and have the computers do the heavy lifting.

48:14

And the researchers find the new knowledge within the trends that the machine learning and the AI are creating.

48:25

Dr. Gray said we want to have a world in which all the science literature is online, all the science data is online, and they interoperate with each other.

48:34

So, I think this really touches on what a lot of people want from academic dissemination of content, which is: they want it to be fast, they want it to be good, and they want it to be open.

48:46

Fast and good are subjective.

48:48

But open is pretty binary, and good means different things for different types of content. Like I mentioned before, peer review of journal articles is always going to be the context around the paper, the context around the research.

49:05

Fast can be at the click of a button. But then, if you pull on one of these things, you often move the other thing.

49:12

So, I think preprints are a great example, which are very fast and open, but good is something that's being worked on, right? The peer review is not there.

49:23

So, if we look at this for academic papers, in the last 10 years, since The Fourth Paradigm came out, we saw in 2019, in some data from Dimensions, that open access publishing became the majority of published content for the first time. And this is continuing to grow.

49:41

I'm sure it will never get to 100%, but it'll get pretty close.

49:46

And then we saw last year, the end of last year, DeepMind, from Google, taking data from the Protein Data Bank.

49:58

And, uh, they created a protein structure database, and in year one, um, they went from 26% of the protein structures in humans.

50:13

That had taken 50 years to create using crystallography, which is very expensive.

50:18

They went from 26% to 99% overnight. And then, in the year since, they've gone from nearly one million structures to over 200 million structures. This is a real step change in an academic field, due to open data. It's very homogeneous, it's very consistent in the way it's described, and in the types of files that they are.

50:41

And we've already seen, just a month ago now, they won the Breakthrough Prize, which is a North American prize, but this is this fourth paradigm step change in research.

50:56

One thing to note on that is that there is a widening chasm separating industry from academia in large-model AI.

51:04

In fact, we've gone in 60 years from nearly 100% of large-scale AI results coming from academia to basically zero. So that is something to be aware of, from an academic viewpoint. But I think this is something that a lot of people didn't expect to be happening in 10 years.

51:23

So to sum up, I think that this is what the next 10 years of data publishing is going to look like. We did this pilot with the NIH, where researchers submitted data.

51:33

They were checked for mainly the findable and accessible of FAIR, um, and made available. So, it was relatively quick.

51:43

Uh, it was relatively fast.

51:47

It was good, in the sense of what we were trying to achieve at the time with academic data publishing, and what a lot of universities are trying to achieve, publishers are trying to achieve by doing these checks on the datasets.

52:01

The results, as you can imagine: if you ask people for more metadata, you get more metadata. But we also saw more views and downloads, which is fantastic.

52:08

When you're trying to encourage researchers to describe their data better in order to have more impact with their research, that's back to the credit system again. I do want to highlight that we shouldn't be letting perfect be the enemy of good here.

52:25

Not everybody has the resources to check every file at their institution or at a publisher.

52:32

Data with no checks can be useful.

52:35

A lot of the context is often in the paper.

52:38

And so, if a paper links to a badly described dataset, as a standalone object it's not very useful for the future, for the

52:46

uh, fourth paradigm, AI knowledge generation, but it is very useful for the transparency and the reproducibility of that research today.

52:55

So if you read along this, I think this is also how we'll get the gradient of research data publishing, moving from little metadata to better, enhanced metadata to reproducibility, down the FAIR path lines.

53:10

So as Jim Gray said, the speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another. I don't think they need to collaborate in the way that Jim Gray predicted. I think that the repositories, the libraries, the publishers, the funders, everybody who's contributed to the State of Open Data reports this year and in the years previously, all of this is helping

53:32

make the center of this Venn diagram bigger so that we can fairly have access to the literature and the data and the ability to publish literature and data in a useful way.

53:43

So I'm going to end here. If there are any questions, I think we have a bit of time for them.

53:50

Otherwise.

53:52

I'll hand over to Laura to see if anything has come through.

53:56

Thanks very much, Mark. We have got a few questions, and so I'll start off with this one, which is: why would researchers publish their code and data in Figshare rather than in GitHub?

54:09

That is a great question.

54:12

This is the thing of meeting researchers where they are, right?

54:16

I don't think researchers will ever think of generalist repositories like Figshare, or the data repository at their university, or a data repository at a publisher

54:29

or their funder, as a place where they're actively developing the code.

54:34

It's a slightly different model.

54:36

The great thing about it is that so many researchers already use GitHub or Bitbucket or GitLab to make their data available.

54:47

So I don't think the message should ever be, don't publish your code on GitHub.

54:53

I think the message should be, if you've made your data, your code, available on GitHub, you are 90% of the way there to complying with your funder policy, I would hazard a guess, if they have a policy.

55:06

And that is that the data, the code, needs to be available in a place where it gets a persistent identifier and it is persistent. One of the problems with GitHub, of course, is that you can delete those files at any time.

55:18

If you could just delete all of your papers at any time, the academic publishing landscape would be very different, and the way in which we discover new knowledge would be very different. So I wouldn't say

55:30

Um, it's a question of publishing your code only on Figshare, or other generalist repositories. I'd say

55:39

it is about getting the message across that the hard work has already been done.

55:45

You can set up these easy, one-click syncs to repositories like Figshare and Zenodo, and it is on the repositories themselves to lower the barrier to entry for those researchers.

55:59

Great, thank you. This one relates specifically to the survey data, so I'll pass it to you, Greg ...: what is the current gender and age distribution, and trends, per the report? It's a good question. We don't actually ask about gender.

56:17

Um, and looking back, I'm sure we've had a reason for not doing so, but off the top of my head, I can't remember. In terms of age distribution, it's kind of similar to the seniority: it's a third under 45, a third to 65, and then 25% over that. So, again, it's a relatively even spread. But, yes, we don't have the gender.

56:47

Don't know why, historical thing, I imagine.

56:51

Yeah, and this one is quite a big question, somewhat more of a discussion one, but how can we improve the feeling of researchers towards getting credit for their research?

57:00

Because I know that, I think, it was 71% felt that they weren't getting enough credit, so I don't know if you have thoughts on how we can make that better.

57:10

I think one way we as a community could win is, obviously, what I was talking about before: it's very hard for funders to reward researchers without having a consistent way to track the impact.

57:27

It means different things to different communities, of course. And so, having a consistent message on how you can, in the UK, help with your REF, or in Australia, help with your ERA submission, it's very difficult to do that. I think one thing we do is search.

57:45

And so, I think, encouraging better citation, uh, workflows.

57:52

There is a big community push called Make Data Count.

57:57

And I think that by having consistent approaches to citing datasets and other types of research outputs, as well as the paper, we could encourage researchers to see that they could have more impact and get credit for all of their research outputs.

58:15

Right, thank you. Next question, I believe, relates to the survey questions themselves. So, are there any things specific to social science data? So, any questions asked? But perhaps it would be interesting to know, is it broken down by discipline, or do people have to? Yes, so we have humanities and social science broken out as a separate discipline, so within the data itself, you can do those.

58:38

Yes. You can separate out. We didn't ask specific questions for specific fields, they're pretty cross-discipline, and within the, you know, within the drop-down selection.

58:49

We tried to make them as diverse as we can to incorporate perceptions of those within different fields, similar to what is it came from.

58:59

Was there was referencing? Yeah, so we actually worked with Rebecca Grant and sort of collaborated a little bit on making sure that the survey considered the ***** and Social Science group as well. So yes, it can be looked at in that way. No, not specific questions. It was very generalist in that respect.

59:19

The last one we have at the moment is what if our publishers don't require disclosure.

59:28

What should we do if our publishers don't require a disclosure? I think that's a great question.

59:33

I know Springer Nature has over 2000 journals in their cohort, and every single one of them has one of their data policies, I believe, from must make your data available to

59:48

probably should make your data available. I think that everyone's being told that they should be doing something at Springer Nature.

59:54

I think if there's a publisher that's not requiring it, it's good practice to do so anyway, because you might have more impact, you might get more citations, and it might help

60:06

with that 25% increase in the citations to the paper itself. So for you as a researcher yourself, it might be good for your own impact.

60:13

And for the publishers who don't encourage this at the moment, I would suggest, and I appreciate that it's more work,

60:22

there are definitely areas where you could put two lines into your policy to say: make your data available in a subject-specific repository, or if not, one of these repositories. That would move the needle a little bit as well.

60:37

Well, thank you.

60:39

One more, just: what ways do you see for preventing usage of shared data without correct citations? Any law proposals or a hall of shame?

60:50

60:52

Greg, is Springer Nature working on a hall of shame? No, not that I'm aware of.

60:58

No, I don't know, so it's I don't know.

61:02

What I would say about that is it's no different from the system we have already. Right?

61:09

What happens when someone doesn't cite your paper?

61:11

You can, you know, it's usually because of the competition in academia.

61:17

So it is a consistent problem, but it's not a new problem.

61:23

So there might be innovative ways to see that with different types of research outputs from datasets to certain.

61:30

For instance, if you're citing a very well-known database, or set of data, it's very hard to argue that you haven't used it when you have.

61:41

Whereas with broader concepts, like you find in papers or several studies looking at the same things, it can be a bit more blurry on whether people have actually read your paper or not.

61:51

So, as I say, not a new one, if somebody else would like to create the Wall of Shame.

61:57

I, for several years, have wanted to e-mail every author who didn't have data available upon request. And there was a study that came out this year that looked at it. And I think 7%.

62:09

Thanks to sets digits and round.

62:12

So 93% did not make it available, and I think that could be grounds for retraction.

62:19

But I am not going to be the person to send those e-mails as I do not want to.

62:24

I want to make it to my 40th birthday.

62:28

And I'll say, from my perspective, there's also an element of, you know, sometimes it's not always the fault of the person who's maybe used the data.

62:36

It's not necessarily clear from the metadata, you know, how to cite these things. So making sure, you know, that that level of detail is included in any dataset that you are sharing, to avoid that as best as possible.

62:51

I think, you know, we need to consider both sides of that discussion as well, because sometimes it can be either accidents or just lack of consideration. So, you know, making sure that we're making it clear how things should be shared, from publishers right down to individual researchers, you know, making sure it's clear how that data can be used and how it should be referenced is important as well.

63:12

Thanks, Greg. That is all the questions for today, and as we are a little bit over, that brings us to a close. Thanks, Greg. Thanks, Mark, for the presentations, and to all the video contributors as well. It was great to hear from the different authors, and we'll share the recording and the slides after. And, yeah, any other follow-ups, please do get in touch.

63:32

Thanks, everyone.

63:34

Bye!

‍
