ftxbro 676 days ago [-]
Oh, it's by Nature, one of the beneficiaries of the corporate capture of publication metrics. This is like reading an article by De Beers about how you can propose without spending as much on diamonds as you're afraid you might.

Here's what you can do. Make a GitHub account. Make an account on arxiv.org or biorxiv.org. Publish your work to arXiv. Put all of your data into your GitHub account, including copies of whatever you put on arXiv and an extensive markdown README that serves as a synopsis of the paper. Disseminate it by announcing it on your Twitter or Mastodon or RSS blog or Substack or research-community Discord, and even post it to a site like Hacker News if it's relevant. If you are affiliated with an institution, their press office will put out a small release.
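For illustration, a minimal Python sketch of the "data on GitHub" step, pinning the dataset to a tagged release so it stays tied to the preprint version. The repo name, token, and file name are placeholders, and this is only one way to do it:

    # Hypothetical example: attach a dataset to a GitHub release via the REST API.
    import requests

    TOKEN = "ghp_..."             # placeholder personal access token
    REPO = "yourname/paper-data"  # hypothetical repository

    headers = {
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    }

    # 1. Create a tagged release so the dataset is pinned to a citable point in history.
    release = requests.post(
        f"https://api.github.com/repos/{REPO}/releases",
        headers=headers,
        json={"tag_name": "v1.0", "name": "Dataset accompanying the preprint"},
    ).json()

    # 2. Upload the data file as a release asset (GitHub caps individual release
    #    assets at around 2 GB, so this is for modest datasets only).
    with open("dataset.csv.gz", "rb") as f:
        requests.post(
            f"https://uploads.github.com/repos/{REPO}/releases/{release['id']}/assets",
            headers={**headers, "Content-Type": "application/gzip"},
            params={"name": "dataset.csv.gz"},
            data=f,
        )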

jkh1 676 days ago [-]
Can I put my 25 TB microscopy image data set on GitHub? Will they host it for free indefinitely? In the life sciences, there are dedicated public repositories (databases) where the data is hosted (free to the researcher), catalogued, standardized and curated to some extent. These repositories are searchable and often cross-reference each other. So you can find a data set even if you didn't know about it before. Putting data all over the internet in dumps like Zenodo, Dryad and the like is just not very useful. Advertising your work is probably good for your career but this is not what makes your data and work useful to others. It's how easy you make it for others to understand, access and combine your data with their own data. This means providing data and metadata using open community standards (there are already a bunch of these in the life sciences even if there are gaps in coverage).
ftxbro 676 days ago [-]
> "Can I put my 25 TB microscopy image data set on GitHub? Will they host it for free indefinitely? In the life sciences, there are dedicated public repositories (databases) where the data is hosted (free to the researcher), catalogued, standardized and curated to some extent."

Yes this is better. Like if you are a scientist and you find an amazing new cancer gene, then you should probably put it in the actual gene repository and not randomly on GitHub. Because you're this hypothetical scientist, you probably already know the exact niche place that is most appropriate for that. Like GenBank or whatever - the one that had the scandal where the pre-COVID coronavirus sequences were mysteriously deleted from it.

dekhn 675 days ago [-]
I talked with NIH program managers and leadership about this quite some time ago and tried to convince them to explicitly fund long-term data hosting for science publications. The budgeting got complicated quickly: paying for a bucket with 25 TB isn't a huge expense. Does NIH cut a deal with AWS, or GCP, or Azure, or (shudder) Oracle so they get discounts (these can cut storage costs in half or more)? Does it go in long-term storage or live storage? And that doesn't even include egress, which in my experience gets expensive fast when you have lots of downloaders. Does NIH cover the distribution costs, or do requesters pay? Do people running high-throughput jobs in clusters near the S3 storage get to work directly against the data, or do they make a duplicate?
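As a rough sketch of the "requesters pay" option, assuming an S3-style bucket (the bucket and key names below are made up): the host flags the bucket as Requester Pays, and each downloader's account is billed for egress instead of the host's.

    # Hypothetical sketch with boto3: requester-pays hosting of a shared dataset.
    import boto3

    s3 = boto3.client("s3")

    # Host side: flag the bucket so download/egress charges fall on the requester.
    s3.put_bucket_request_payment(
        Bucket="nih-imaging-archive",             # hypothetical bucket
        RequestPayment={"Payer": "Requester"},
    )

    # Requester side: every access must acknowledge the charge explicitly.
    obj = s3.get_object(
        Bucket="nih-imaging-archive",
        Key="project-123/embryo_stack_001.tiff",  # hypothetical object key
        RequestPayer="requester",
    )
    data = obj["Body"].read()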

Their response in this instance was to fund SRA-in-the-cloud and other ventures, such as having PIs in well-connected locations like U.Chicago rent datacenter space in an exchange, negotiate very cheap hosting and bandwidth, and then give people access to compute and data either there, or in AWS (https://www.uchicagomedicine.org/forefront/news/university-o...)

This still doesn't address the "high quality metadata problem", which IMHO is the NP-complete problem of biology.

ftxbro 675 days ago [-]
The way that grants and funding work right now, from what I've seen, is incompatible with this kind of thing. It's by definition a maintenance project, and funders do not want to pay for maintenance projects that run past the timeline of the grant, like five years or whatever. No PI is going to win the Nobel prize for building some kind of janitorial service for scientists. Also, the project has zero chance of curing cancer by itself. The PI who eventually cures every cancer might use it, but they'll be a PI on other grants, not that grant.
pizza 676 days ago [-]
BitTorrent could be an interesting fallback distribution method. There's already academictorrents.
MavropaliasG 676 days ago [-]
Use OSF (osf.io) instead, which is made for open science
ISL 676 days ago [-]
This approach alone will not get you the peer-review necessary for colleagues to take you seriously nor the publication record necessary to get a job.
ftxbro 676 days ago [-]
It's true, you should eventually submit it to the most topical free journal after doing something like these steps to make your scientific data accessible, discoverable and useful. It won't earn as many academic merit points as submitting to the journals that have lobbyists.
sacnoradhq 676 days ago [-]
If you are a computer scientist, deliver a permalink hosted Docker image for a paper that runs fully-unattended by default and produces the graphs and data in the paper verbatim. If you cannot do this, you must turn in your nerd badge and report to the cryptobros for reassignment. You win a Medal of Honor nerd badge if the code is a literate programming quine embedded in TeX source and plays a crack demoscene track while doing an AI DOOM speedrun in ANSI.
Helmut10001 676 days ago [-]
I did exactly that for a recent paper [1].

    1. 9 Jupyter Notebooks attached (with HTML conversions for the journal), all figures and statistics generated in notebooks, all commits of the entire 5-year research process versioned in a GitLab repo, using Jupytext for clarity
    2. All base data shared, using HyperLogLog to reduce privacy conflicts (see the sketch below)
    3. Versioned Docker image added to our registry, which includes Jupyter and the analysis environment used for the study (Carto-Lab Docker Version 0.9.0 [2])
    4. Post acceptance, I published another notebook showing how other users can load and work with the shared data, including making (limited) additional inferences [3]
    5. For the peer review process, I added all (redacted) notebook HTML files to a GitHub repo [4]
It was a fun experiment where I tried for maximum transparency in research. It maybe added a full year of additional work, but I still don't regret it. Given the quite specific audience for this paper, I doubt that anyone has ever tried to open the Jupyter Notebooks - I even doubt that the reviewers looked at them, at least judging from the comments during peer review.
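Since the HyperLogLog step (point 2) is probably the least familiar part, here's a toy Python sketch of the idea, not the paper's actual pipeline: only the sketch is published, so approximate distinct-user counts can be shared without releasing identifiable IDs.

    # Illustrative only: approximate distinct counts via HyperLogLog (datasketch library).
    from datasketch import HyperLogLog

    hll = HyperLogLog(p=12)   # 2^12 registers, roughly 1.6% relative error
    for user_id in ["u_1001", "u_1002", "u_1001", "u_1003"]:   # toy identifiers
        hll.update(user_id.encode("utf-8"))

    print(round(hll.count()))  # approximate number of distinct users (~3 here)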

[1] https://doi.org/10.1371/journal.pone.0280423

[2] https://gitlab.vgiscience.de/lbsn/tools/jupyterlab

[3] https://kartographie.geo.tu-dresden.de/ad/sunsetsunrise-demo...

[4] https://anonymous-peer12345.github.io/

sacnoradhq 675 days ago [-]
Outstanding! :D

Effortless reproducibility plus open source must be the gold standard, regardless of the form it takes.

As an aside, back when I used to live in a van, I wanted an app that could hyperlocally find a parkable location on a street or parking lot with the most shade for a given time of year. It seems roughly estimable if high-resolution height information were available and combined with the sun path. It could also be useful if one wanted to locate housing with minimum insolation.

Helmut10001 675 days ago [-]
Just another way landscape preference can be expressed!
anthk 676 days ago [-]
Uh, Emacs' literate programming with org-mode, plus Guix instead of Docker, gets you 100% reproducibility for your tasks.
JohnHammersley 676 days ago [-]
It's also interesting to look at this in the context of the "State of Open Data" report [1].

[1] https://digitalscience.figshare.com/articles/report/The_Stat...

JohnHammersley 676 days ago [-]
and in case anyone's interested in the meandering thoughts of two start-up founders in the scholarly comms space, Mark and I had a nice chat following a recent figshare milestone :) [1]

[1] https://www.digital-science.com/tldr/article/seven-million-o...

yawnxyz 676 days ago [-]
Heh, it takes a lot of work to convert illegible scribbles in lab notebooks into well-formatted numbers and descriptors that make sense to anyone but the person who did the experiment.

This includes other lab members on the same project...

analog31 676 days ago [-]
Something I've noticed, coming as no surprise, is that the movement towards open data and analysis got its start in fields where the data are collected in digital form in the first place. Of course lab notebooks could be authored directly on a computer, but it's still hard in fields such as wet chemistry, where you're wearing gloves, handling toxics, and don't have hands free to move a laptop around. Plus, in most academic labs, the workers are still expected to supply their own computers, and are justifiably reluctant to donate one to the cause.
rjsw 676 days ago [-]
For physical science, I think people could use ISO 10303 [1] to represent their experimental process and results.

A facility like CERN could have an accurate model of the equipment available, you just add the description of your experiment and the results to it.

[1] https://en.wikipedia.org/wiki/ISO_10303

jkh1 676 days ago [-]
The problem is that ISO standards are not open.
carlossouza 676 days ago [-]
The article doesn't fully answer its title, especially the "discoverable" part.
__MatrixMan__ 676 days ago [-]
I've been thinking about that problem.

I think the publishers (or maybe the universities? or anybody at the center of a community of experts, really) should host an API which maps sets of CTPH hashes to URLs (or ideally, CIDs for use in something like IPFS). The goal would be that anybody (author or otherwise) could attach metadata after publication.

Maybe it's criticism, maybe it's instructions on how to get the included code to run, maybe it's links to related research that occurred after the initial publication...

Suppose you have metadata to attach: you generate CTPHs for the article, pick a subset of them which corresponds to the location you want to anchor your metadata to, and upload the pair to the context aggregator (these would likely be topic-centered, so if it's a biology paper you'd find a biology aggregator).

When people view the paper, they can generate the same CTPH's and query the appropriate aggregator, and they'll get the annotations back which link locations in the article's text to metadata that, for whatever reason, was not included in the original publication.

I want to use CTPHs instead of DOIs or some such because they don't require a third party to index the items for you, and they still work even if you have only part of the article (like maybe the rest is hidden by pagination or a paywall). You could do a speech-to-text transcription, annotate it in this way, and somebody else who generated the same transcript could then find your annotations without ever creating an ID for the speech you're annotating.
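To make the mechanics concrete, here's a rough Python sketch of the reader side, with ssdeep as one real CTPH implementation; the aggregator URL and its response format are invented for illustration:

    # Hypothetical sketch: anchor annotation lookups by fuzzy hash rather than by DOI.
    import ssdeep
    import requests

    article_text = open("paper_fragment.txt").read()   # even a partial copy works
    fingerprint = ssdeep.hash(article_text)

    # Ask a (made-up) topic-centred aggregator for annotations near this hash.
    resp = requests.get(
        "https://bio-annotations.example.org/lookup",   # hypothetical endpoint
        params={"ctph": fingerprint},
    )

    for note in resp.json():
        # ssdeep.compare returns 0-100; keep annotations anchored to closely matching text.
        if ssdeep.compare(fingerprint, note["anchor_ctph"]) > 60:
            print(note["url"], "-", note["comment"])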

stonogo 676 days ago [-]
That'd be the 'metadata' section. Encouraging scientists to include metadata, as opposed to unlabeled binary dumps, is an ongoing effort.
carlossouza 676 days ago [-]
Metadata is necessary but not sufficient.

Imho there aren't enough tools to discover scientific data.

robwwilliams 676 days ago [-]
The semantic web was supposed to help long ago, and may finally be doing so. In www.genenetwork.org we are now using RDF, SPARQL, GraphQL and Xapian for speedy and flexible search that can represent much of our complex metadata. Surprising how long this has taken to catch on.
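For readers who haven't touched SPARQL, here's a hedged sketch of the kind of metadata query this enables; the endpoint URL and predicate names are placeholders, not GeneNetwork's actual schema:

    # Hypothetical example: query dataset metadata from a SPARQL endpoint.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://sparql.example.org/genenetwork")  # placeholder endpoint
    sparql.setQuery("""
        PREFIX gn: <http://example.org/genenetwork/terms/>
        SELECT ?dataset ?tissue WHERE {
            ?dataset gn:tissue  ?tissue ;
                     gn:species "Mus musculus" .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["dataset"]["value"], "-", row["tissue"]["value"])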
jkh1 676 days ago [-]
What seems to help in the life sciences is the existence of public repositories. These could be replaced by portals that collect info on data hosted elsewhere. But the main advantages are that they provide clear, well-known places to start looking, and they curate, standardize and organise the metadata to make it searchable.
stonogo 676 days ago [-]
You're right, and I consider this the frontline there: https://www.go-fair.org/
jkh1 676 days ago [-]
In the life sciences there are dedicated structured repositories. These are searchable by keywords and often cross-reference each other. They are the go-to places for finding data.
chaxor 676 days ago [-]
IPFS or torrents are the best options for distributing data.
JBorrow 676 days ago [-]
That is incorrect for scientific data. The limitations are:

a) Massive data volumes (~100 GB - 1 PB/project)

  i) This means that data is typically stored on limited-access machines like HPC clusters

  ii) This also means that shipping this data around is financially expensive, and cannot be supported purely by small client machines

b) A low number of seeders; scientific data is not exactly popular, and there may be network restrictions on uploads through the typically used networks;

c) The requirement for a data legacy; torrents are fantastic for ephemeral data (e.g. operating system builds), but are terrible for data that must be archived and kept for potentially decades to centuries.

0cf8612b2e1e 676 days ago [-]
Most scientific datasets are not that large. For every CERN type study there are 1000x biology papers with n=3 where the collected results sit in a single tab of an Excel document.
jhbadger 676 days ago [-]
Depends on what you mean by "biology". Maybe some traditional naturalist-style data about where butterfly species were spotted or something like that could just be a spreadsheet, but in more modern biology studies things like RNA-Seq are used, which generate gigabytes or even terabytes of data per paper.
jrumbut 676 days ago [-]
Very true, but if you're planning on expanding the work of a small, pilot study like that and you don't have the people who were involved in the original you probably need to recreate the study (to shake out the kinks in the protocol, confirm results for yourself, etc).

It would be challenging to find a solution robust enough for CERN type data but also simple enough for an n=3 undergraduate research project (that may have yielded some interesting results).

I don't know what the solution is there. My intuition is that university libraries could be involved, and that a data librarian could help you get your small study into shape or be embedded at a percentage effort on a large study.

jkh1 676 days ago [-]
In biology we now routinely produce datasets in the multi-terabyte range. It can easily be n = 3 x 10 TB, such as when imaging 3 fly embryos by light-sheet microscopy.
ISL 676 days ago [-]
Those biologists might have huge image archives, even if all they use is microscopes.
robwwilliams 676 days ago [-]
Those do not belong in IPFS. They won’t be replicated and may die.
staunton 676 days ago [-]
Scientific datasets like that can be very easily hosted at one of the repositories such as zotero. The only reasons people don't do that is a vague sense of insecurity about having someone declare their analysis botched, vague legal worries, vague unwillingness to do the very small amount of work required to publish data, or the hope to milk a dataset for more papers before anyone else gets a chance.
Blahah 676 days ago [-]
I guess you meant zenodo, not zotero.
staunton 676 days ago [-]
Yes, oops... I always mix them up.
chaxor 673 days ago [-]
I'm not sure I agree with this view. If the NIH or DoD, etc., serve data over FTP, they can certainly also serve it via IPFS. It would be much better for them to do so. Then, if anyone pulls that data (even if it's only a small dataset, say 2 TB, to their house or lab), it is now more available (since now NIH and other labs are sharing it). I would imagine that the data centers serving these files would be happy to reduce the bandwidth they have to serve by allowing other labs to help out as well.

It only adds. I don't understand how it subtracts.
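A rough sketch of the consumer side under that assumption (the CID below is a placeholder, not a real dataset): anyone can fetch a published dataset by its content ID through a public gateway, and pinning it on a local IPFS node turns that lab into an additional host.

    # Hypothetical example: fetch a dataset by CID and pin it locally.
    import requests

    CID = "bafybeigdyrzt..."   # placeholder content identifier

    # Fetch through a public HTTP gateway (no local IPFS install needed).
    data = requests.get(f"https://ipfs.io/ipfs/{CID}", timeout=60).content

    # If a local IPFS daemon (Kubo) is running, pin the CID so this machine
    # keeps serving the data to others - this is what spreads the bandwidth load.
    requests.post("http://127.0.0.1:5001/api/v0/pin/add", params={"arg": CID})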

robwwilliams 676 days ago [-]
Not in our 5-year experience trying to use it with GeneNetwork.org to share large and small datasets. IPFS is marketed as simple but is complex, or even over-engineered from some perspectives. Hate to say it, but Dropbox is much easier and more stable.

Hoping IPFS makes it someday because the idea is great.

Blahah 676 days ago [-]
It seems you have confused 'distributed' with... something else. Regardless of how easy it is for you, or how complex you found it, the data is distributed via IPFS.
nl 676 days ago [-]
I think you are thinking "distributed" in the "distributed computing/decentralized/p2p" sense.

In this discussion it means "distributed" in the "made available" sense.

The parent did not use IPFS because it didn't work for them. So no, the data was not distributed via IPFS.

anamexis 676 days ago [-]
What data?
Blahah 676 days ago [-]
Any data shared in the network.
anamexis 676 days ago [-]
I don't understand what you're responding to. GP said they tried using IPFS with their project, but it ended up being too complicated and they opted for Dropbox instead.
Blahah 676 days ago [-]
I don't understand what you think GP was responding to. Dropbox isn't distributed. The problem they articulated isn't relevant to the problem solved by ipfs.
anamexis 676 days ago [-]
OP was responding to this:

> IPFS or torrent are the best options for distributing data

And suggesting that IPFS is not a good option for distributing scientific data due to its complexity.

hsjqllzlfkf 676 days ago [-]
Every time I remember that torrents exist, that blows my mind.
stainablesteel 676 days ago [-]
Did they say to publish it in an overpriced black box so only their subscribers can view it?
epgui 676 days ago [-]
That's a problem, but it's a different problem. Stay relevant.
TechBro8615 676 days ago [-]
It's literally the same problem. If the black box disappears (as black boxes tend to do) then the data is no longer accessible, discoverable, or useful.