GitHub and more: sharing data & code

A recent Nature News article ‘Democratic databases: Science on GitHub’ discussed GitHub and other programs used for sharing code and data. As a measure of GitHub’s popularity, Nature News looked at citations of GitHub repositories in research papers from various disciplines (source: Scopus). The article also mentioned Bitbucket, Figshare and Zenodo as alternative tools for data and code sharing, but did not analyze their ‘market share’ in the same way.

Our survey on scholarly communication tools asked a question about tools used for archiving and sharing data & code, and included GitHub, Figshare, Zenodo and Bitbucket among the preselected answer options (Figure 1). Thus, our results can provide another measure of the use of these online platforms for sharing data and code.


Figure 1 – Survey question on archiving and sharing data & code

Open Science – in word or deed

Perhaps the most striking result is that of the 14,896 researchers among our 20,663 respondents (counting PhD students, postdocs and faculty), only 4,358 (29.3%) reported using any tools for archiving/sharing data. Notably, of the 13,872 researchers who answered the question ‘Do you support the goals of Open Science?’ (defined in the survey as ‘openly creating, sharing and assessing research, wherever viable’), 80.0% said ‘yes’. Clearly, for open science, support in theory and adoption in practice are still quite far apart, at least as far as sharing data is concerned.


Figure 2 – Support for Open Science among researchers in our survey
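As a rough illustration, the headline shares above could be recomputed from the public survey data along these lines. This is only a sketch: the file name and column names ("role", "tools_archiving_sharing", "supports_open_science") are assumptions for illustration, not the actual field names in the released dataset.

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical export of the public survey data

# Restrict to researchers (PhD students, postdocs and faculty)
researchers = df[df["role"].isin(["PhD student", "Postdoc", "Faculty"])]

# A researcher counts as sharing data/code if they ticked at least one tool
# in the archiving/sharing question ('none' answers assumed already removed)
shares_data = researchers["tools_archiving_sharing"].fillna("").str.len() > 0
print(f"Researchers using any archiving/sharing tool: {shares_data.mean():.1%}")  # ~29.3%

# Support for Open Science, among researchers who answered that question
answered = researchers["supports_open_science"].dropna()
print(f"Support for Open Science: {answered.eq('yes').mean():.1%}")               # ~80.0%
```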

Among those researchers that do archive and share data, GitHub is indeed the most often used tool, but just as many people indicate using ‘others’ (i.e. tools not offered as one of the preselected options). Figshare comes in third, followed by Bitbucket, Dryad, Dataverse, Zenodo and Pangaea (Figure 3).


Figure 3 – Survey results: tools used for archiving and sharing data & code

Among ‘others’, the most often mentioned tool was Dropbox (named by 496 researchers), with other tools trailing far behind. Unfortunately, the survey setup prevents a direct comparison of the number of responses for preset tools and tools mentioned as ‘others’ (see: Data are out. Start analyzing. But beware). Thus, we cannot say whether Dropbox is used more or less than GitHub, for example, only that it is the most often mentioned ‘other’ tool.
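A minimal sketch of how the free-text ‘other’ answers could be tallied is given below; the column name "other_tools_data_sharing" and the comma-separated answer format are assumptions about the public data, not its documented structure.

```python
from collections import Counter
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical export of the public survey data

mentions = Counter(
    name.strip().lower()
    for answer in df["other_tools_data_sharing"].dropna()   # free-text 'other' answers
    for name in str(answer).split(",")
    if name.strip() and name.strip().lower() not in {"none", "n/a", "-"}
)
print(mentions.most_common(10))  # in our data, Dropbox leads, with a long tail behind it
```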

Disciplinary differences

As mentioned above, 29.3% of researchers in our survey reported archiving and sharing code or data. Are there disciplinary differences in this percentage? We explored this earlier in our post ‘The number games’. We found that researchers in engineering & technology are the most inclined to archive/share data or code, followed by those in the physical and life sciences. Medicine, social sciences and humanities are lagging behind at more or less comparable levels (Figure 4). But it is also clear that, in all disciplines, archiving/sharing data or code is an activity that only a minority of researchers engage in.


Figure 4 – Share of researchers archiving/sharing data & code

Do researchers from different disciplines use different tools for archiving and sharing code & data? Our data suggest that they do (Table 1, data here). Percentages given are the share of researchers (from a given discipline) that indicate using a certain tool. For this analysis, we looked at the population of researchers (n=4,358) that indicated using at least one tool for archiving/sharing data (see also Figure 4). As multiple answers were allowed for disciplines as well as tools used, percentages do not add up to 100%.
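For readers who want to rebuild Table 1 themselves, a sketch along the following lines would work on the public data. The column layout assumed here (one boolean flag per preset tool, one per discipline) is purely illustrative; the released file may encode the answers differently.

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical export of the public survey data

tool_cols = ["uses_github", "uses_figshare", "uses_bitbucket", "uses_dryad",
             "uses_dataverse", "uses_zenodo", "uses_pangaea"]   # assumed column names
disc_cols = [c for c in df.columns if c.startswith("disc_")]    # assumed discipline flags

# Only researchers who use at least one archiving/sharing tool (n=4,358 in the post)
sharers = df[df[tool_cols].any(axis=1)]

# Share of sharers per discipline ticking each tool; rows need not sum to 100%
# because multiple disciplines and multiple tools could be selected
table = pd.DataFrame(
    {tool: [sharers.loc[sharers[d].astype(bool), tool].mean() for d in disc_cols]
     for tool in tool_cols},
    index=disc_cols,
)
print((table * 100).round(1))
```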

While it may be no surprise that researchers from Physical Sciences and Engineering & Technology are the heaviest GitHub users (and also the main users of Bitbucket), GitHub use is strong across most disciplines. Figshare and Dryad are predominantly used in the Life Sciences, which may partly be explained by the coupling of these repositories to journals in this domain (e.g. PLOS to Figshare, and GigaScience, along with many others, to Dryad).


Table 1: specific tool usage for sharing data & code across disciplines

A more surprising finding is that Dataverse seems to be adopted by some disciplines more than others. This might be due to the fact that there is often institutional support from librarians and administrative staff for Dataverse (which was developed at Harvard and is in use at many universities). Such support might increase use by people who have somewhat less affinity with ‘do-it-yourself’ solutions like GitHub or Figshare. An additional reason, especially for Medicine, could be the possibility of archiving data privately in Dataverse, with control over whom to grant access. This is often an important consideration when dealing with potentially sensitive and confidential patient data.

Another surprising finding is the overall low use of Zenodo – a CERN-hosted repository that is the recommended archiving and sharing solution for data from EU projects and institutions. The fact that Zenodo is available to anyone (and not just for EU project data) might not be widely known yet.

A final interesting observation, which might run counter to common assumptions, is that among researchers in Arts & Humanities who archive and share code, use of these specific tools is not lower than in the Social Sciences and Medicine. In some cases, it is even higher.

A more detailed breakdown, e.g. by research role (PhD student, postdoc or faculty), year of first publication or country, is possible using the publicly available survey data.

The number games

In our global survey on innovations in scholarly communication, we asked researchers (and people supporting researchers, such as librarians and publishers) what tools they use (or, in the case of people supporting researchers, what tools they advise) for a large number of activities across the research cycle. The results of over 20,000 respondents, publicly available for anyone to analyze, can give detailed information on tool usage for specific activities, and on which tools are preferentially used together in the research workflow. It is also possible to look specifically at results for different disciplines, research roles, career stages and countries.

But we don’t even have to dive into the data at the level of individual tools to see interesting patterns. Focusing on the number of people that answered specific questions, and on the number of tools people indicate they use (regardless of which tools those are), already reveals a lot about research practices in different subsets of our (largely self-selected) sample population.

Number of respondents
In total, we received 20,663 responses. Not all respondents answered all questions, though. The number of responses per activity can be taken to reflect whether that activity plays a role in the research workflow of respondents, or at least to what extent they use (or advise) concrete tools to carry out that activity (although we also counted answers like ‘manually’, ‘in person’, etc.).
On methodology

For each question on tool usage, we offered seven preselected choices that could be clicked (multiple answers allowed), and an ‘and also others’ answer option that, when clicked, invited people to manually enter any other tools they might use for that specific research activity (see Figure 1).


Figure 1 – Example question with answer options

We did not include a ‘none’ option, but stated at the beginning of the survey that people were free to skip any question they felt did not apply to them or did not want to answer. Nonetheless, many people still answered ‘none’ (or some variation thereof) in the ‘other’ option.

Since, methodologically, we cannot distinguish between people who skipped a question and people who answered ‘none’, we removed all ‘none’ answers from the survey results. We also adjusted the number of respondents who clicked the ‘and also other’ option to reflect only those who indicated using at least one tool for the specific research activity (excluding all ‘nones’ and empty answers).
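In code, this cleaning step comes down to something like the sketch below; the column name and the exact set of ‘none’ spellings we matched are simplified here for illustration, not a description of our actual scripts.

```python
import re
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical export of the raw survey data

NONE_LIKE = re.compile(r"^\s*(none|n/?a|nothing|-+)?\s*$", re.IGNORECASE)

def drop_none(answer):
    """Treat empty answers and variations of 'none' as if the question was skipped."""
    return pd.NA if pd.isna(answer) or NONE_LIKE.match(str(answer)) else answer

df["other_tools_outreach"] = df["other_tools_outreach"].map(drop_none)

# The 'and also other' count now only reflects respondents naming at least one real tool
print(df["other_tools_outreach"].notna().sum(), "respondents named a real 'other' tool")
```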

Figure 2 shows the percentage of respondents that answered each specific research activity question, for both researchers (PhD students, postdocs and faculty) and librarians. The activities are listed in the order they were asked about, illustrating that the variation in response rate across questions is not simply due to ‘survey fatigue’ (i.e. people dropping out halfway through the survey).

Figure 2 – Response rate per survey question (researchers and librarians)

What simple question response levels can already tell us

The differences in response levels to the various questions are quite marked, ranging from barely 15% to almost 100%. It is likely that two effects are at play here. First, some activities are relevant for all respondents (e.g. writing and searching for information), while others, like sharing (lab) notebooks, are specific to certain fields, explaining lower response levels. Second, some activities are not yet carried out by many researchers, or respondents choose not to use any tool for them. This may be the case for sharing posters and presentations, and for peer review outside that organized by journals.

Then there are also notable differences between researchers and librarians. As expected, researchers more often indicate tool usage to publish, while librarians are a bit more active in using or advocating tools for reference management and for selecting journals to publish in. Perhaps more interestingly, it is clear that librarians are “pushing” tools that support sharing and openness of research.


Disciplinary differences
When we look not just at the overall number of respondents per activity, but break that number down by discipline (Figure 3), more patterns emerge. Some of these are expected, some more surprising.


Figure 3 – Response rate per survey question across disciplines (researchers only)

As expected, almost all respondents, irrespective of discipline, indicate using tools for searching, getting access, writing and reading. Researchers in Arts & Humanities and Law report lower usage of analysis tools than those in other disciplines, and sharing data & code is predominantly done in Engineering & Technology (including computer sciences). The fact that Arts & Humanities and Law also score lower on tool usage for journal selection, publishing and measuring impact than other disciplines might be due to a combination of publication culture and the (related) fact that available tools for these activities are predominantly aimed at journal articles, not books.

Among the more surprising results are perhaps the lower scores for reference management in Arts & Humanities and Law (again, this could be partly due to publication culture, but most reference management systems handle books as well as journal articles). Scores for sharing notebooks and protocols were low overall, whereas we would have expected this activity to occur somewhat more often in the sciences (perhaps especially the life sciences). Researchers in Social Sciences & Economics and in Arts & Humanities relatively often use tools to archive & share posters and presentations and to do outreach (phrased in the survey as: “tell about your research outside academia”), and interestingly enough, so do researchers in Engineering & Technology (including computer science). Finally, peer review outside that done by journals is most common in Medicine, which is perhaps not that surprising given that many developments in open peer review have been pioneered by biomedical journals such as The BMJ and the BioMed Central journals.

You’re using HOW many tools?

How many apps do you have on your smartphone? On your tablet? Do you expect the number of online tools you use in your daily life as a researcher to be more or less than that?

Looking at the total number of tools that respondents to our survey indicate they use in their research workflow (including any ‘other’ tools mentioned, but excluding all ‘nones’, as described above), it turns out the average number of tools reported per person is 22 (Figure 4). The frequency distribution is somewhat skewed, with a longer tail of people using higher numbers of tools (mean = 22.3; median = 21.0).


Figure 4 – Frequency distribution of total number of tools used per person (20,663 respondents)
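The per-respondent count behind Figure 4 can be approximated as follows; the column naming (one flag per preset tool, plus a count of cleaned ‘other’ tools) is again an assumption about the public data, not its documented layout.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey_responses.csv")  # hypothetical export of the public survey data

preset_cols = [c for c in df.columns if c.startswith("uses_")]   # assumed preset-tool flags
df["n_tools"] = df[preset_cols].sum(axis=1) + df["n_other_tools"].fillna(0)

print(df["n_tools"].mean(), df["n_tools"].median())   # the post reports 22.3 and 21.0

df["n_tools"].plot.hist(bins=40)                      # skewed, with a tail of heavy tool users
plt.xlabel("Number of tools per respondent")
plt.show()
```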

We also wondered whether the number of tools a researcher uses varies with career stage (e.g. do early career researchers use more tools than senior professors?).

Figure 5 shows the mean number of tools mentioned by researchers, broken down by career stage. We used year of first publication as a proxy for career stage, as it is a more or less objective measure across research cultures and countries, and less likely to provoke a ‘refuse to answer’ than asking for age might have been.


Figure 5 – Number of tools used across career stage (using year of first publication as proxy)
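A sketch of the career-stage breakdown behind Figure 5 is given below, using illustrative year bins; the column names and bin edges are assumptions, not the exact ones used for the figure.

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical export of the public survey data

preset_cols = [c for c in df.columns if c.startswith("uses_")]
df["n_tools"] = df[preset_cols].sum(axis=1) + df["n_other_tools"].fillna(0)

# Bin year of first publication as a proxy for career stage (illustrative edges)
bins = [1950, 1991, 1996, 2001, 2006, 2011, 2016]
labels = [">25 yrs ago", "20-25 yrs", "15-20 yrs", "10-15 yrs", "5-10 yrs", "<5 yrs"]
df["career_stage"] = pd.cut(df["first_publication_year"], bins=bins, labels=labels)

print(df.groupby("career_stage", observed=True)["n_tools"].agg(["mean", "count"]))
```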

There is an increase in the number of tools used going from early to mid career stages, peaking for researchers who published their first paper 10-15 years ago. Conversely, more senior researchers seem to use fewer tools, with the number decreasing most for researchers who first published over 25 years ago. The differences are fairly small, however, and it remains to be seen whether they are statistically significant. There might also be differences across disciplines in these observed trends, depending on publication culture within disciplines. We have not explored this further yet.

It will be interesting to correlate career stage not only with the number of tools used, but also with the type of tools: do more senior researchers use more traditional tools that they have been accustomed to using throughout their career, while younger researchers gravitate towards innovative or experimental tools that have only recently become available? By combining results from our survey with information collected in our database of tools, these are the kinds of questions that can be explored.