Academic social networks – the Swiss Army Knives of scholarly communication

On December 7, 2016, at the STM Innovations Seminar we gave a presentation (available from Figshare) on academic social networks. For this, we looked at the functionalities and usage of three of the major networks (ResearchGate, Mendeley and Academia.edu) and also offered some thoughts on the values and choices at play both in offering and using such platforms.

Functionalities of academic social networks
Academic social networks support activities across the research cycle, from getting job suggestions and sharing and reading full-text papers to following the use of your research output within the system. We looked at the detailed functionalities offered by ResearchGate, Mendeley and Academia.edu (Appendix 1) and mapped these against seven phases of the research workflow (Figure 1).

In total, we identified 170 functionalities, of which 17 were shared by all three platforms. The largest overlap between ResearchGate and Academia lies in functionalities for discovery and publication (e.g. sharing of papers), while for outreach and assessment the two platforms have many functionalities that do not overlap. Examples of unique functionalities include publication sessions (time-limited feedback sessions on one of your full-text papers) and making metrics public or private in Academia, and Q&As, 'enhanced' full-text views and downloads, and the possibility to add additional resources to publications in ResearchGate. Mendeley is the only platform offering reference management and specific functionality for data storage according to FAIR principles. A detailed list of all functionalities identified can be found in Appendix 1.


Figure 1. Overlap of functionalities of ResearchGate, Mendeley and Academia in seven phases of the research cycle

Within the seven phases of the research cycle depicted above, we identified 31 core research activities. When the functionalities of ResearchGate, Mendeley and Academia are mapped against these 31 activities (Figure 2), it becomes apparent that Mendeley offers the most complete support for discovery, while ResearchGate supports archiving/sharing of the widest spectrum of research output. All three platforms support outreach and assessment activities, including impact metrics.


Figure 2. Mapping of functionalities of ResearchGate, Mendeley and Academia against 31 activities across the research workflow

What’s missing?
Despite the 170 distinct functionalities the three major academic social networks offer between them, some important functionalities are still missing. These largely center on integration with other platforms and services:

  • Connect to ORCID (only in Mendeley), import from ORCID
  • Show third party altmetrics
  • Export your publication list (only in Mendeley)
  • Automatically show and use clickable DOIs (only in Mendeley)
  • Automatically link to research output/object versions at initial publication platforms (only in Mendeley)

In addition, some research activities are underserved by the three major platforms. Most notable among these are activities in the analysis phase, where functionality to share notebooks and protocols might be a useful addition, as would text mining of full-text publications on the platform. And while Mendeley offers extensive reference management options, support for collaborative writing is currently not available on any of the three platforms.

If you build it, will they come?
Providers of academic social networks clearly aim to offer researchers a broad range of functionalities to support their research workflow. But which of these functionalities are used, and by which researchers? For that, we looked at the data of 15K researchers from our recent survey on scholarly communication tool usage. Firstly, looking at the question on which researcher profiles people use (Figure 3), it is apparent that of the preselected options, ResearchGate is the most popular. This is despite the fact that overall, Academia.edu reports a much higher number of accounts (46M compared to 11M for ResearchGate). One possible explanation for this discrepancy could be a high number of lapsed or passive accounts on Academia.edu – possibly set up by students.


Figure 3. Survey question and responses (researchers only) on use of researcher profiles. For an interactive version see http://dashboard101innovations.silk.co/page/Profiles

Looking a bit more closely at the use of ResearchGate and Academia in different disciplines (Figure 4), ResearchGate proves to be dominant in the 'hard' sciences, while Academia is more popular in Arts & Humanities and, to a lesser extent, in Social Sciences and Economics. Whether this is due to the specific functionalities the platforms offer, the effect of what one's peers are using, or even the names of the platforms (with researchers from some disciplines identifying more with the term 'Research' than 'Academia' or vice versa) is up for debate.


Figure 4. Percentage of researchers in a given discipline that indicate using ResearchGate and/or Academia (survey data)

If they come, what do they do?
Our survey results also give some indication as to what researchers are using academic social networks for. We had ResearchGate and Mendeley as preset answer options in a number of questions about different research activities, allowing a quantitative comparison of the use of these platforms for these specific activities (Figure 5). These results show that of these activities, ResearchGate is most often used as a researcher profile, followed by its use for getting access to publications and for sharing publications, respectively. Mendeley was included as a preset answer option for different activities; of these, it is most often used for reference management, followed by reading/viewing/annotating and searching for literature/data. The results also show that for each activity it was presented as a preset option for, ResearchGate is used most often by postdocs, while Mendeley is predominantly used by PhD students. Please note that these results do not allow a direct comparison between ResearchGate and Mendeley, except for the fourth activity in both charts: getting alerts/recommendations.


Figure 5. Percentage of researchers using ResearchGate / Mendeley for selected research activities (survey data)

In addition to choosing tools/platforms presented as preset options, survey respondents could also indicate any other tools they use for a specific activity. This allows us to check which other activities people use any of the academic social networks for, and to plot these against the activities these platforms offer functionalities for. The results are shown in Figure 6 and indicate that, in addition to activities supported by the respective platforms, people also carry out activities on social networks for which there are no dedicated functionalities. Some examples are using Academia and ResearchGate for reference management, and sharing all kinds of research outputs, including formats not specifically supported by the respective networks. Some people even indicate using Mendeley for analysis – we would love to find out what type of research they are carrying out!

For more extensive and alternative data on the use of these platforms' functionalities, please read the analyses by Ortega (2016), based on scraping millions of pages in these systems.


Figure 6. Research activities people report using ResearchGate, Mendeley and/or Academia for (survey data)

Good, open or efficient? Choices for platform builders and researchers
Academic social networks are built for and used by many researchers for many different activities. But what kind of scholarly communication do they support? At Force11, the Scholarly Communications Working Group (of which we both are steering committee members) has been working on formulating principles for scholarly communication that encourage open, equitable, sustainable, and research- and culture-led (as opposed to technology- and business-model-led) scholarship.

This requires, among other things, that research objects and all information about them can be freely shared among different platforms and not be locked into any one platform. While Mendeley has an API they claim is fully open, both ResearchGate and Academia are essentially closed systems. For example, all metrics remain inside the system (though Academia offers an export to csv that we could not get working) and by uploading full text to ResearchGate you grant them the right to change your PDFs (e.g. by adding links to cited articles that are also in ResearchGate).

There are platforms that operate from a different perspective, allowing a more open flow of research objects. Some examples are the Open Science Framework, F1000 (with F1000 Workspace), ScienceOpen, Humanities Commons and GitHub (with some geared more towards specific disciplines). Not all of these platforms support the same activities as ResearchGate and Academia (Figure 7), and there are marked differences in the level of support for activities: sharing a bit of code through ResearchGate is almost incomparable to the full range of options for this on GitHub. All these platforms offer alternatives for researchers wanting to conduct and share their research in a truly open manner.


Figure 7. Alternative platforms that support research in multiple phases of the research cycle

Reading list
Some additional readings on academic social networks and their use:

Appendix 1
List of functionalities within ResearchGate, Mendeley and Academia (per 20161204). A live, updated version of this table can be found here: http://tinyurl.com/ACMERGfunctions.


Appendix 1. Detailed functionalities of ResearchGate, Mendeley and Academia per 20161204. Live, updated version at http://tinyurl.com/ACMERGfunctions

Tools that love to be together

[updates in brackets below]
[see also follow-up post: Stringing beads: from tool combinations to workflows]

Our survey data analyses so far have focused on tool usage for specific research activities (e.g. GitHub and others: data sharing, Who is using altmetrics tools, The number games). As a next step, we want to explore which tool combinations occur together in research workflows more often than would be expected by chance. This will also facilitate identification of full research workflows, and subsequent empirical testing of our hypothetical workflows against reality.

Checking which tools occur together more often than expected by chance is not as simple as looking at which tools are most often mentioned together. For example, even if two tools are not used by many people, they might still occur together in people's workflows more often than expected based on their relatively low overall usage. Conversely, take two tools that are each used by many people: stochastically, a sizable proportion of those people will be shown to use both of them, but this might still be due to chance alone.

Thus, to determine whether the number of people that use two tools together is significantly higher than can be expected by chance, we have to look at the expected co-use of these tools given the number of people that use either of them. This can be compared to the classic example in statistics of taking colored balls out of an urn without replacement: if an urn contains 100 balls (= the population) of which 60 are red (= people in that population who use tool A), and from these 100 balls a sample of 10 balls is taken (= people in the population who use tool B), how many of these 10 balls would be red (=people who use both tool A and B)? This will vary with each try, of course, but when you repeat the experiment many times, the most frequently occurring number of red balls in the sample will be 6. The stochastic distribution in this situation is the hypergeometric distribution.


Figure 1. Source: Memrise

For any possible number x of red balls in the sample (i.e. 0-10), the probability of result x occurring at any given try can be calculated with the hypergeometric probability function. The cumulative hypergeometric probability function gives the probability that the number of red balls in the sample is x or higher. This probability is the p-value of the hypergeometric test (identical to the one-tailed Fisher test), and can be used to assess whether an observed result (e.g. 9 red balls in the sample) is significantly higher than expected by chance. In a single experiment as described above, a p-value of less than 0.05 is commonly considered significant.

In our example, the probability of getting at least 9 red balls in the sample is 0.039 (Figure 2). Going back to our survey data, this translates to the probability that in a population of 100 people, of which 60 use tool A and 10 use tool B, 9 or more people use both tools.
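For readers who want to reproduce this number without GeneProf, here is a minimal sketch using Python and scipy.stats.hypergeom (our choice of tooling, not part of the original analysis):

```python
from scipy.stats import hypergeom

M, K, n = 100, 60, 10  # population size, red balls (tool A users), sample size (tool B users)

# The most probable number of red balls in a sample of 10 is indeed 6
mode = max(range(n + 1), key=lambda k: hypergeom.pmf(k, M, K, n))
print(mode)  # -> 6

# Probability of drawing at least 9 red balls: the one-tailed hypergeometric p-value
p_at_least_9 = hypergeom.sf(8, M, K, n)  # sf(8) = P(X > 8) = P(X >= 9)
print(round(p_at_least_9, 3))  # -> 0.039
```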


Figure 2 Example of hypergeometric probability calculated using GeneProf.

In applying the hypergeometric test to our survey data, some additional considerations come into play.

Population size
First, for each combination of two tools, what should be taken as the total population size (i.e. the 100 balls/100 people in the example above)? It might seem intuitive to take the total number of respondents (20,663 for the survey as a whole) as that population. However, it is actually better to use only the number of respondents who answered both survey questions in which tools A and B occurred as answer options.

People who didn't answer both questions cannot possibly have indicated using both tools A and B. In addition, the probability that at least x people are found to use tools A and B together is lower in a large total population than in a small one. This means that the larger the population, the smaller the number of respondents using both tools needs to be for that number to be considered significant. Thus, excluding people that did not answer both questions (and thereby looking at a smaller population) sets the bar higher for two tools to be considered preferentially used together.
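To illustrate that last point with toy numbers (again a sketch, not survey data): keep 60 users of tool A, 10 users of tool B and 9 co-users fixed, and watch the one-tailed p-value shrink as the assumed population grows.

```python
from scipy.stats import hypergeom

a, b, c = 60, 10, 9  # users of tool A, users of tool B, co-users of both
for d in (100, 200, 500, 1000):  # candidate population sizes
    p = hypergeom.sf(c - 1, d, a, b)  # P(at least c co-users by chance alone)
    print(d, p)
# The larger the population, the smaller the expected overlap (a*b/d),
# so the same 9 co-users become ever more 'significant'.
```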

Choosing the p-value threshold
The other consideration in applying the hypergeometric test to our survey data is what p-value to use as a cut-off point for significance. As said above, in a single experiment, a result with a p-value lower than 0.05 is commonly considered significant. However, with multiple comparisons (in this case: when a large number of tool combinations is tested in the same dataset), keeping the same p-value will result in an increased number of false-positive results (in this case: tools incorrectly identified as preferentially used together).

The reason is that a p-value threshold of 0.05 accepts a 5% chance that an observed result is due to chance alone. With many simultaneous tests, there will inevitably be more results that seem positive but are in reality due to chance.

One possible solution to this problem is to divide the p-value threshold by the number of tests carried out simultaneously. This is called the Bonferroni correction. In our case, where we looked at 119 tools (7 preset answer options for 17 survey questions) and thus at 7,021 unique tool combinations, this results in a p-value threshold of 0.0000071.

Finally, when we not only want to look at tools used more often together than expected by chance, but also at tools used less often together than expected, we are performing a 2-tailed, rather than a 1-tailed test. This means we need to halve the p-value used to determine significance, resulting in a p-value threshold of 0.0000036.

Ready, set, …
Having made the decisions above, we are now ready to apply the hypergeometric test to our survey data. For this, we need to know for each tool combination (e.g. tool A and B, mentioned as answer options in survey questions X and Y, respectively):

a) the number of people that indicate using tool A
b) the number of people that indicate using tool B
c) the number of people that indicate using both tool A and B
d) the number of people that answered both survey questions X and Y (i.e. indicated using at least one tool (including ‘others’) for activity X and one for activity Y).

These numbers were extracted from the cleaned survey data either by filtering in Excel (a, b (12 MB), d (7 MB)) or through an R script (c; written by Roel Hogervorst during the Mozilla Science Sprint).

The cumulative probability function was calculated in Excel (values and calculations) using the following formulas:

=1-HYPGEOM.DIST((c-1),a,b,d,TRUE)
(to check for tool combinations used together more often than expected by chance)

and
=HYPGEOM.DIST(c,a,b,d,TRUE)
(to check for tool combinations used together less often than expected by chance)
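For those who prefer to stay out of Excel, the same two quantities can be computed with Python's scipy.stats.hypergeom; a sketch of our own, using the a/b/c/d definitions above and the Bonferroni-corrected threshold derived earlier:

```python
from scipy.stats import hypergeom

def tool_pair_pvalues(a, b, c, d):
    """One-tailed hypergeometric p-values for one tool pair.

    a -- respondents using tool A
    b -- respondents using tool B
    c -- respondents using both tools
    d -- respondents who answered both survey questions (the population)
    """
    # The hypergeometric distribution is symmetric in which tool is treated
    # as the 'sample', so this matches the Excel formulas above.
    p_more = hypergeom.sf(c - 1, d, a, b)  # = 1 - HYPGEOM.DIST(c-1, a, b, d, TRUE)
    p_less = hypergeom.cdf(c, d, a, b)     # =     HYPGEOM.DIST(c,   a, b, d, TRUE)
    return p_more, p_less

# Two-tailed Bonferroni threshold for 7,021 tool pairs, as derived above
THRESHOLD = 0.05 / 7021 / 2  # ~0.0000036

p_more, p_less = tool_pair_pvalues(a=60, b=10, c=9, d=100)  # toy numbers from the urn example
print(p_more < THRESHOLD, p_less < THRESHOLD)
```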


Figure 3 – Twitter

Bonferroni correction was applied to the resulting p-values as described above and conditional formatting was used to color the cells. All cells with a p-value less than 0.0000036 were colored green or red, for tools used more or less often together than expected by chance, respectively.

The results were combined into a heatmap with green-, red- and non-colored cells (Fig 4), which can also be found as the first tab in the Excel files (values & calculations).

[Update 20170820: we now also have made the extended heatmap for all preset answer options and the 7 most often mentioned ‘others’ per survey question (Excel files: values & calculations)]


Figure 4 Heatmap of tool combinations used together more (green) or less (red) often than expected by chance (click on the image for a larger, zoomable version).

Pretty colors! Now what?
While this post focused on methodological aspects of identifying relevant tool combinations, in future posts we will show how the results can be used to identify real-life research workflows. Which tools really love each other, and what does that mean for the way researchers (can) work?

Many thanks to Bastian Greshake for his helpful advice and reading of a draft version of this blogpost. All errors in assumptions and execution of the statistics remain ours, of course 😉 

GitHub and more: sharing data & code

A recent Nature News article, 'Democratic databases: Science on GitHub', discussed GitHub and other programs used for sharing code and data. As a measure of GitHub's popularity, Nature News looked at citations of GitHub repositories in research papers from various disciplines (source: Scopus). The article also mentioned BitBucket, Figshare and Zenodo as alternative tools for data and code sharing, but did not analyze their 'market share' in the same way.

Our survey on scholarly communication tools asked a question about tools used for archiving and sharing data & code, and included GitHub, Figshare, Zenodo and Bitbucket among the preselected answer options (Figure 1). Thus, our results can provide another measurement of the use of these online platforms for sharing data and code.


Figure 1 – Survey question on archiving and sharing data & code

Open Science – in word or deed

Perhaps the most striking result is that of the 14,896 researchers among our 20,663 respondents (counting PhD students, postdocs and faculty), only 4,358 (29.3%) reported using any tools for archiving/sharing data. Remarkably, of the 13,872 researchers who answered the question 'Do you support the goals of Open Science' (defined in the survey as 'openly creating, sharing and assessing research, wherever viable'), 80.0% said 'yes'. Clearly, for open science, support in theory and adoption in practice are still quite far apart, at least as far as sharing data is concerned.


Figure 2 Support for Open Science among researchers in our survey

Among those researchers that do archive and share data, GitHub is indeed the most often used, but just as many people indicate using ‘others’ (i.e. tools not mentioned as one of the preselected options). Figshare comes in third, followed by Bitbucket, Dryad, Dataverse, Zenodo and Pangaea (Figure 3).


Figure 3 – Survey results: tools used for archiving and sharing data & code

Among 'others', the most often mentioned tool was Dropbox (mentioned by 496 researchers), with other tools trailing far behind. Unfortunately, the survey setup invalidates direct comparison of the number of responses for preset tools and tools mentioned as 'others' (see: Data are out. Start analyzing. But beware). Thus, we cannot say whether Dropbox is used more or less than GitHub, for example, only that it is the most often mentioned 'other' tool.

Disciplinary differences

As mentioned above, 29.3% of researchers in our survey reported engaging in archiving and sharing code/data. Are there disciplinary differences in this percentage? We explored this earlier in our post 'The number games'. We found that researchers in engineering & technology are the most inclined to archive/share data or code, followed by those in the physical and life sciences. Medicine, social sciences and humanities are lagging behind at more or less comparable levels (Figure 4). But it is also clear that in all disciplines, archiving/sharing data or code is an activity that only a minority of researchers engage in.


Figure 4 – Share of researchers archiving/sharing data & code

Do researchers from different disciplines use different tools for archiving and sharing code & data? Our data suggest that they do (Table 1, data here). Percentages given are the share of researchers (from a given discipline) that indicate using a certain tool. For this analysis, we looked at the population of researchers (n=4,358) that indicated using at least one tool for archiving/sharing data (see also Figure 4). As multiple answers were allowed for disciplines as well as tools used, percentages do not add up to 100%.
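For readers who want to reproduce a breakdown like Table 1 from the public survey data, a rough sketch of the computation is given below. Note that the file name and the wide, boolean column layout are assumptions for illustration only; the released data set is structured differently and will need some reshaping first.

```python
import pandas as pd

# Assumed layout: one row per researcher, boolean columns per discipline and per tool
df = pd.read_csv("survey_cleaned_wide.csv")  # hypothetical file name

tools = ["GitHub", "Figshare", "Bitbucket", "Dryad", "Dataverse", "Zenodo", "Pangaea"]
disciplines = ["Life Sciences", "Physical Sciences", "Engineering & Technology",
               "Medicine", "Social Sciences & Economics", "Arts & Humanities", "Law"]

# Restrict to researchers who report at least one archiving/sharing tool
# (the real analysis also counts free-text 'other' tools here)
sharers = df[df[tools].any(axis=1)]

# Share of sharers in each discipline that uses each tool (multiple answers allowed,
# so rows will not add up to 100%)
shares = pd.DataFrame(
    {tool: [sharers.loc[sharers[disc], tool].mean() for disc in disciplines] for tool in tools},
    index=disciplines,
)
print((shares * 100).round(1))
```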

While it may be no surprise that researchers from Physical Sciences and Engineering & Technology are the dominant GitHub users (and also the main users of BitBucket), GitHub use is strong across most disciplines. Figshare and Dryad are predominantly used in the Life Sciences, which may partly be explained by the coupling of these repositories to journals in this domain (i.e. PLOS to Figshare, and GigaScience, along with many others, to Dryad).


Table 1: specific tool usage for sharing data & code across disciplines

As a more surprising finding, Dataverse seems to be adopted by some disciplines more than others. This might be due to the fact that there is often institutional support from librarians and administrative staff for Dataverse (which was developed by Harvard and is in use at many universities). This might increase use by people who have somewhat less affinity with 'do-it-yourself' solutions like GitHub or Figshare. An additional reason, especially for Medicine, could be the possibility of private archiving of data in Dataverse, with control over whom to give access. This is often an important consideration when dealing with potentially sensitive and confidential patient data.

Another surprising finding is the overall low use of Zenodo – a CERN-hosted repository that is the recommended archiving and sharing solution for data from EU-projects and -institutions. The fact that Zenodo is a data-sharing platform that is available to anyone (thus not just for EU project data) might not be widely known yet.

A final interesting observation, which might go against common assumptions, is that among researchers in Arts & Humanities who archive and share code, use of these specific tools is not lower than in the Social Sciences and Medicine. In some cases, it is even higher.

A more detailed breakdown, e.g. across research role (PhD student, postdoc or faculty), year of first publication or country is possible using the publicly available survey data.

The number games

In our global survey on innovations in scholarly communication, we asked researchers (and people supporting researchers, such as librarians and publishers) what tools they use (or, in the case of people supporting researchers, what tools they advise) for a large number of activities across the research cycle. The results from over 20,000 respondents, publicly available for anyone to analyze, can give detailed information on tool usage for specific activities, and on which tools are preferentially used together in the research workflow. It is also possible to look specifically at results for different disciplines, research roles, career stages and countries.

But we don’t even have to dive into the data at the level of individual tools to see interesting patterns. Focusing on the number of people that answered specific questions, and on the number of tools people indicate they use (regardless of which tools that are) already reveals a lot about research practices in different subsets of our (largely self-selected) sample population.

Number of respondents
In total, we received 20,663 responses. Not all respondents answered all questions, though. The number of responses per activity could be seen to reflect whether that activity plays a role in the research workflow of respondents, or at least, to what extent they use (or advise) concrete tools to carry out that activity (although we also included all answers like ‘manually’, ‘in person’ etc).
On methodology

For each question on tool usage, we offered seven preselected choices that could be clicked (multiple answers allowed), and an ‘and also others’ answer option that, when clicked, invited people to manually enter any other tools they might use for that specific research activity (see Figure 1).


Figure 1 – Example question with answer options

We did not include a ‘none’ option, but at the beginning of the survey stated that people were free to skip any question they felt did not apply to them or did not want to answer. Nonetheless, many people still answered ‘none’ (or any variation thereof) as their ‘other’ options.

Since, methodologically, we cannot make a distinction between people who skipped a question and people who answered 'none', we removed all 'none' answers from the survey results. We also adjusted the number of respondents that clicked the 'and also other' option to reflect only those that indicated using at least one tool for the specific research activity (excluding all 'nones' and empty answers).
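In practice, this cleaning step boils down to something like the sketch below (written in Python/pandas with made-up column names purely for illustration; the actual cleaning was done on the raw survey export):

```python
import pandas as pd

df = pd.read_csv("survey_raw.csv")  # hypothetical file and column names

# Free-text 'other' answers that are just a variant of 'none' count as no answer
col = "other_tools_outreach"
is_none = df[col].fillna("").str.strip().str.lower().isin({"", "none", "n/a", "na", "-"})
df.loc[is_none, col] = pd.NA

# A respondent only counts as clicking 'and also others' for this activity
# if at least one real tool name remains after cleaning
df["uses_other_outreach"] = df[col].notna()
```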

Figure 2 shows the percentage of respondents that answered each specific research activity question, both researchers (PhD students, postdocs and faculty) and librarians. The activities are listed in the order they were asked about, illustrating that the variation in response rate across questions is not simply due to 'survey fatigue' (i.e. people dropping out halfway through the survey).

Figure 2 – Response rate per survey question (researchers and librarians)

What simple question response levels already can tell us

The differences in response levels to the various questions are quite marked, ranging from barely 15% to almost 100%. It is likely that two effects are at play here. First, some activities are relevant for all respondents, e.g. writing and searching information, while others like sharing (lab) notebooks are specific to certain fields, explaining lower response levels. Second, there are some activities that are not yet carried out by many or for which respondents choose not to use any tool. This may be the case with sharing posters and presentations and with peer review outside that organized by journals.

Then there are also notable differences between researchers and librarians. As expected, researchers more often indicate tool usage for publishing, while librarians are somewhat more active in using or advocating tools for reference management and for selecting journals to publish in. Perhaps more interestingly, it is clear that librarians are "pushing" tools that support sharing and openness of research.


Disciplinary differences
When we look at not just the overall number of respondents per activity, but break that number down for the various disciplines covered (Figure 3), more patterns emerge. Some of these are expected, some more surprising.


Figure 3 – Response rate per survey question across disciplines (researchers only)

As expected, almost all respondents, irrespective of discipline, indicate using tools for searching, getting access, writing and reading. Researchers in Arts & Humanities and Law report lower usage of analysis tools than those in other disciplines, and sharing data & code is predominantly done in Engineering & Technology (including computer sciences). The fact that Arts & Humanities and Law also score lower on tool usage for journal selection, publishing and measuring impact than other disciplines might be due to a combination of publication culture and the (related) fact that available tools for these activities are predominantly aimed at journal articles, not books.

Among the more surprising results are perhaps the lower scores for reference management in Arts & Humanities and Law (again, this could be partly due to publication culture, but most reference management systems support citation of books as well as journals). Scores for sharing notebooks and protocols were low overall, whereas we would have expected this activity to occur somewhat more often in the sciences (perhaps especially the life sciences). Researchers in Social Sciences & Economics and in Arts & Humanities relatively often use tools to archive & share posters and presentations and to do outreach (phrased in the survey as: "tell about your research outside academia"), and interestingly enough, so do researchers in Engineering & Technology (including computer science). Finally, peer review outside that done by journals is most often done in Medicine, which is perhaps not that surprising given that many developments in open peer review are pioneered in biomedical journals such as The BMJ and BioMed Central journals.

You’re using HOW many tools?

How many apps do you have on your smartphone? On your tablet? Do you expect the number of online tools you use in your daily life as a researcher to be more or less than that?

Looking at the total number of tools that respondents to our survey indicate they use in their research workflow (including any ‘other’ tools mentioned, but excluding all ‘nones’, as described above), it turns out the average number of tools reported per person is 22 (Figure 4). The frequency distribution curve is somewhat skewed as there is a longer tail of people using higher numbers of tools (mean = 22.3; median = 21.0).


Figure 4 – Frequency distribution of total number of tools used per person (20,663 respondents)

We also wondered whether the number of tools a researcher uses varies with career stage (e.g. do early career researchers use more tools than senior professors?).

Figure 5 shows the mean values of the number of tools mentioned by researchers, broken down by career stage. We used year of first publication as a proxy for career stage, as it is a more or less objective measure across research cultures and countries, and less likely to invoke 'refuse to answer' than asking for age might have been.


Figure 5 – Number of tools used across career stage (using year of first publication as proxy)

There is an increase in the number of tools used going from early to mid career stages, peaking for researchers who published their first paper 10-15 years ago. Conversely, more senior researchers seem to use fewer tools, with the number of tools decreasing most for researchers who first published over 25 years ago. The differences are fairly small, however, and it remains to be seen whether they are significant. There might also be differences across disciplines in these observed trends, depending on publication culture within disciplines. We have not explored this further yet.

It will be interesting to correlate career stage not only with the number of tools used, but also with type of tools: do more senior researchers use more traditional tools that they have been accustomed to using throughout their career, while younger researchers gravitate more to innovative or experimental tools that have only recently become available? By combining results from our survey with information collected in our database of tools, these are the type of questions that can be explored.

Mozilla Science Lab Global Sprint 2016 – getting started with analysis


On June 2-3 between 9:00-13:00, we want to bring a group of smart people together in Utrecht to kickstart the analysis of our survey data. This event will be part of the Mozilla Science Lab Global Sprint 2016.

What is Mozilla Science Lab?
Mozilla Science Lab is a community of researchers, developers, and librarians making research open and accessible and empowering open science leaders through fellowships, mentorship, and project-based learning.

What is the Global Sprint?
This two-day sprint event brings together researchers, coders, librarians and the public from around the globe to hack on open science and open data projects in their communities. This year, it has four tracks anyone can contribute to: tools, citizen science, open educational resources and open data. There are about 30 locations participating in this year’s sprint – our event in Utrecht is one of them.

What will we do during the sprint?
A quick exploration of the survey results can be done in an interactive dashboard on Silk (http://dashboard101innovations.silk.co), but many more in-depth analyses are possible. Some examples can already be found on Kaggle. During this Mozilla Science Lab Sprint, we intend to make a head start with these analyses by bringing together people with expertise in numerical and textual analysis.

The survey results can provide insights into current practices across various fields, research roles, countries and career stages, and can be useful for researchers interested in changing research workflows. The data also make it possible to correlate research tool usage with stances on Open Access and Open Science, and contain over 10,000 free-text answers on what respondents consider the most important developments in scholarly communication.

So: two half-days (mornings only) of coding/hacking to discover patterns, make connections and get information from our data. If you have ideas of your own on what to do with our data, or want to help us realize our ideas, join us! You bring your laptop and your mind; we supply a comfortable space, 220V, coffee, WiFi and something healthy (or less so) to eat – and of course, the data to work with!

You can join us on either of these days or both days if you like. On both days there will be a short introduction to get people started. Ideally the first day will result in some analyses/scripts that participants can build on during the second day.

Who can join?
We invite all smart people, but especially those with experience in e.g. R, Python, Google Refine, NVIVO or AtlasTI. Any and all ideas for analysis are welcome!

You can register here, or just contact us to let us know you’re coming.

Where? When?
The sprint will take place on June 2-3 on the Uithof in Utrecht, either in the Utrecht University Library building, or another space close by.

Please note that the sprint will take place between 9:00-13:00 on both days. Participants are of course free to continue working after these hours, and online support will be given wherever possible.


Data are out. Start analyzing. But beware.

Now that we have our data set on research tool usage out and shared the graphical dashboard, let the analysis start! We hope people around the world will find the data interesting and useful.

If you are going to do in-depth analyses, make sure to read our article on the survey background and methods. It helps you understand the type of sampling we used and the resulting response distributions. It also explains the differences between the raw and cleaned data sets.

For more user-friendly insights, you can use the graphical dashboard made in Silk. It is easy to use, but still allows for quite sophisticated filtering and even supports filtering answers to one question by answers given to another question. Please be patient with Silk: it crunches a lot of data and may sometimes need a few seconds to render the charts.


Example chart that also shows filter options in the dashboard

When looking at the charts and when carrying out your analyses, please note two things.

First, whatever you are going to do, make sure to account for the fundamental difference between results from preset answers (entered by simply clicking an image) and those from specifications of other tools used (entered by typing the tool names manually). The latter are quite probably an underestimation and thus cannot be readily compared with the former. [Update 20160501: This is inherent to the differences between open and closed questions, of which ease of answering the question is one aspect. Specifications of 'others' can be seen as an open question]. This is why we present them separately in the dashboard. Integrated lists of these two types of results, if made at all, should be accompanied by the necessary caveats.


Frequency distribution of 7 preset answers (dark blue) and the first 7 ‘other’ tools (light blue) per survey question

Second, basic statistics tells us that when you apply filters, the absolute numbers can in some cases become so low as to render the results unfit for any generalization. And the other way around: when not filtering, please note that usage patterns will vary according to research role, field, country etc. Also, our sample was self-selected and thus not necessarily representative.

Now that we are aware of these two limitations, nothing stops you (and us) from diving in.

Our own priorities, time permitting, are to look at which tools are used together across research activities and why that is, at concentration ratios of tools used for the various research activities, and at combining these usage data with data on the tools themselves, like age, origin, business model etc. More generally, we want to investigate what tool usage says about the way researchers shape their workflow: do they choose tools to make their work more efficient, open and/or reproducible? We also plan to do a more qualitative analysis of the thousands of answers people gave to the question of what they see as the most important development in scholarly communication.

By the way, we’d love to get your feedback and learn what you are using these data for, whether it is research, evaluation and planning of services or something else still. Just mail us, or leave your reply here or on any open commenting/peer review platform!

Support for Open Science in EU member states

In preparation for the EU Open Science Conference on April 4-5 in Amsterdam, we looked at what our survey data reveal about declared support for Open Access and Open Science among researchers in the EU.

Support for Open Access and Open Science

Of the 20,663 survey respondents, 10,297 were from the EU, of which 7,358 were researchers (from PhD-students to faculty). Most respondents provided an answer to the two multiple-choice questions on whether or not they support the goals of Open Access and Open Science, respectively. A large majority expressed support for Open Access (87%) and Open Science (79%) (see Fig 1).


Fig. 1 Responses from EU researchers to survey questions on support for Open Access and Open Science

Even though support for Open Science is lower than support for Open Access, this does not mean that many more people actively state they do NOT support Open Science, as compared to Open Access (see Fig 1). Rather, more people indicate 'I don't know' in answer to the question on Open Science. This could mean they have not yet formed an opinion on Open Science, that they perhaps support some aspects of Open Science and not others, or simply that they found the wording of the question confusing.

It is interesting to note that the Open Access support figure roughly corresponds with results from the Taylor & Francis Open Access surveys of 2013 and 2014, in which only 16 and 11 percent of respondents, respectively, agreed with the statement that there are no fundamental benefits to Open Access publication.


Differences between member states

When we look at the differences in professed support for Open Access and Open Science in the various EU member states (see Fig 2, Table 1), we see that support for Open Access is relatively high in many Western European countries. Here, more funding opportunities for Open Access are often available, either through institutional funds or, increasingly, through negotiations with publishers in which APCs are included in institutional subscriptions for hybrid Open Access journals. Perhaps many researchers in Southern and Eastern member states associate Open Access with either expensive APCs or with "free" or nationally oriented journals they wish to avoid because they are required to publish in "international, highly ranked" venues.

Conversely, support for Open Science is higher in many countries in Southern and Eastern Europe. As pure conjecture, might it be that in these regions, with sometimes less developed research infrastructures, the benefits of Open Science, e.g. for collaboration, are more apparent? The observed outliers to this general pattern (e.g. Belgium and Italy) illustrate both the limitations of these survey data (number of responses and possible bias) and the fact that the whole picture is likely to be more complicated.


Fig. 2 Level of support for Open Access (left panel) and Open Science (right panel) in individual EU member states. Scale is based on non-weighted country averages. Results for states with fewer than 20 individual responses are omitted (see Table 1).

In general, the above differences between member states come into even clearer focus when support for Open Science is compared to that for Open Access for each country. Fig 3 shows whether support for Open Science in a given country is higher or lower than support for Open Access. Again, in most Western European countries Open Access is readily embraced, while Open Science, perhaps because it goes further and is a more recent development, meets more doubt or even resistance. In many Southern and Eastern European countries, the pattern is reversed. Clearly though, this cannot be the full story. Finding out what is behind these differences may valuably inform discussions on how to proceed with Open Access/Open Science policies and implementation.
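The country-level comparison in Fig 3 boils down to a simple ratio of the two 'yes' percentages from Table 1 below; a small sketch with a few values taken from that table:

```python
# 'Yes' percentages for Open Access and Open Science support, from Table 1 below
support = {
    "Netherlands": (89, 75),
    "Germany":     (87, 76),
    "Greece":      (81, 85),
    "Croatia":     (85, 94),
}

for country, (oa_yes, os_yes) in support.items():
    ratio = os_yes / oa_yes
    leaning = "relatively more support for OS" if ratio > 1 else "relatively more support for OA"
    print(f"{country}: OS/OA = {ratio:.2f} ({leaning})")
```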


Fig. 3 Ratio of support for Open Science (OS) and Open Access (OA) in individual EU member states (red = relatively more support for OA than for OS, green = relatively more support for OS than OA). Scale is based on non-weighted country ratios. Results for states with fewer than 20 individual responses were omitted (see Table 1).

Irrespective of differences between countries, the large overall majority support for Open Access as well as Open Science among European researchers is perhaps the most striking result. Of course, support does not automatically imply that one puts ideas into practice. For this, it will be interesting to look at the actual research workflows of the researchers that took our survey, to see to what extent their practices align with their stated support for Open Access and Open Science. Also, since our survey used a self-selected sample (though distribution was very broad), care should be taken in interpreting the results, as they might be influenced by self-selection bias.

Data

The aggregated data underlying this post are shown in Table 1. For this analysis, we did not yet look at differences between scientific disciplines or career stage. Full (anonymized) data on this and all other survey questions will be made public on April 15th.

Country | Do you support the goal of Open Access? (Yes / No / I don't know / # responses) | Do you support the goals of Open Science? (Yes / No / I don't know / # responses)
Austria 95% 2% 3% 60 83% 3% 14% 66
Belgium 89% 5% 6% 103 88% 3% 9% 102
Bulgaria 81% 14% 5% 21 72% 0% 28% 18
Croatia 85% 12% 3% 33 94% 0% 6% 31
Cyprus 69% 8% 23% 13 69% 8% 23% 13
Czech Republic 73% 13% 13% 75 69% 13% 18% 78
Denmark 90% 1% 9% 80 84% 0% 16% 82
Estonia 85% 8% 8% 13 92% 8% 0% 13
Finland 84% 4% 12% 92 83% 3% 14% 95
France 87% 5% 8% 686 79% 5% 16% 699
Germany 87% 3% 9% 1165 76% 7% 18% 1179
Greece 81% 7% 12% 214 85% 4% 12% 222
Hungary 89% 9% 2% 45 83% 10% 7% 41
Ireland 81% 5% 15% 62 82% 5% 13% 62
Italy 79% 7% 14% 407 77% 4% 18% 413
Latvia 86% 0% 14% 7 83% 0% 17% 6
Lithuania 88% 0% 13% 8 75% 13% 13% 8
Luxembourg 86% 0% 14% 7 57% 0% 43% 7
Malta 100% 0% 0% 8 75% 0% 25% 8
Netherlands 89% 2% 9% 1610 75% 5% 20% 1627
Poland 86% 7% 7% 85 88% 5% 7% 83
Portugal 88% 5% 8% 129 84% 5% 11% 133
Romania 80% 5% 15% 82 85% 5% 10% 82
Slovakia 70% 5% 25% 20 82% 6% 12% 17
Slovenia 96% 0% 4% 27 96% 0% 4% 28
Spain 87% 3% 10% 537 88% 2% 10% 542
Sweden 90% 3% 6% 146 76% 6% 19% 145
United Kingdom 88% 3% 9% 1113 79% 4% 17% 1123
Total 87% 4% 9% 6848 79% 5% 17% 6923

Table 1 Aggregated data on support of Open Access and Open Science per EU member state.

Rising stars: fastest growing tools on Twitter

To gain some insight into the popularity of online tools for scholarly communication, we have been tracking the number of Twitter followers monthly for each of the 600 tools in our growing database.

Twitter followers as a measure
The number of Twitter followers is one of many possible measures of interest/popularity, each with their own limitations (see, for example, this blogpost by Bob Muenchen). Twitter follower data are freely available, allow for semi-automatic collection and are relatively transparent, as the accounts of followers can be checked. On the other hand, Twitter data have some limitations: only about two thirds of the tools in our database have their own Twitter account, and following a tool's tweets does not equal usage (in timing or volume). The measure is also by definition restricted to people with Twitter accounts, which will favour younger generations and people more oriented to online communication. Finally, expression of interest by following happens at one moment in time, so a further rise or fall of a person's interest in the tool is not reflected in this measure. Our global survey on research tool usage, which is currently running, will provide more substantiated data on actual tool usage.

Given these limitations, we consider the number of Twitter followers to be an indication of potential interest in a tool. Looking over time, the rate of growth in Twitter followers can give an indication of tools that are most rapidly gaining interest.

Rising stars
The following tables show the tools in our database with the largest relative increase in Twitter followers over the last six months (July 1, 2015 - January 1, 2016), both for tools (n=207) that had over 1000 followers (Table 1) and for tools (n=137) with between 100 and 1000 followers (Table 2) on July 1, 2015. We used these thresholds to filter out very early-stage Twitter accounts, which often see high growth rates at very low absolute numbers (e.g. a five-fold increase from 10 to 50 followers).
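The 'relative increase' column in the tables below is simply the ratio of the two follower counts, with the July 2015 count also serving as the eligibility threshold; a quick sketch using the GitLab.com numbers from Table 1:

```python
def relative_increase(followers_now: int, followers_then: int) -> float:
    """Ratio of follower counts at the two measurement dates."""
    return followers_now / followers_then

# GitLab.com: 12.9K followers on July 1, 2015 -> 22.9K on January 1, 2016
print(round(relative_increase(22_900, 12_900), 2))  # -> 1.78

# Only tools above a follower threshold on July 1, 2015 are ranked, to filter out
# very young accounts whose high relative growth reflects tiny absolute numbers
def eligible(followers_then: int, threshold: int = 1000) -> bool:
    return followers_then >= threshold
```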

Rank | Tool / site | Year of launch | Research phase | Twitter followers Jan 1, 2016 | Twitter followers July 1, 2015 | Relative increase
1 | GitLab.com | 2014 | Publication | 22.9K | 12.9K | 1.78
2 | Jupyter | 2015 | Analysis | 5519 | 3362 | 1.64
3 | Open Library of Humanities | 2014 | Publication | 4964 | 3102 | 1.60
4 | Reddit Science | 2008 | Outreach | 2183 | 1417 | 1.54
5 | Qualtrics | 2002 | Analysis | 11.7K | 7634 | 1.53
6 | BioRxiv | 2013 | Publication | 3281 | 2235 | 1.47
7 | Open Science Framework | 2013 | Preparation | 4524 | 3127 | 1.45
8 | Kaggle | 2010 | Preparation | 42.3K | 29.9K | 1.41
9 | Import.io | 2013 | Analysis | 14.0K | 10.0K | 1.40
10 | The Conversation | 2011 | Outreach | 44.5K | 33.3K | 1.34

Table 1. Tools with the largest relative increase in Twitter followers – July 2015-January 2016 (> 1000 followers on July 1, 2015)

Rank | Tool / site | Year of launch | Research phase | Twitter followers Jan 1, 2016 | Twitter followers July 1, 2015 | Relative increase
1 | Benchling | 2013 | Analysis | 1419 | 689 | 2.06
2 | Piirus | 2014 | Outreach | 1877 | 915 | 2.05
3 | Sciforum | 2009 | Outreach | 242 | 120 | 2.02
4 | Before the abstract | 2014 | Outreach | 464 | 236 | 1.97
5 | Mark2Cure | 2014 | Discovery | 751 | 425 | 1.77
6 | ManyLabs | 2014 | Analysis | 196 | 111 | 1.77
7 | SciVal | 2009 | Assessment | 397 | 225 | 1.76
8 | Elsevier Atlas | 2014 | Outreach | 611 | 372 | 1.64
9 | BookMetrix | 2015 | Assessment | 286 | 175 | 1.63
10 | Prolific Academic | 2015 | Analysis | 783 | 490 | 1.60

Table 2. Tools with the largest relative increase in Twitter followers – July 2015-January 2016 (100-1000 followers on July 1, 2015)

Some observations
The two groups of tools with fast-growing popularity on Twitter distinguished here likely represent different phenomena: established tools with continuously rising popularity, and new tools that are quickly gaining popularity. Assuming a more or less linear growth in Twitter followers, it will take longer to acquire (tens of) thousands of new followers than it will to gain a couple of hundred. This is reflected by the fact that the relative increase in Twitter followers is lower for the tools that had over 1000 followers in July (Table 1) than for the tools that had between 100 and 1000 followers (Table 2). Similarly, the tools in Table 2 are somewhat more recent than those in Table 1.

Some notable exceptions are Open Library of Humanities and GitLab, which have quickly gained a very substantial following, and Jupyter, which might have seen a lot of followers 'transfer' from @IPythonDev when IPython Notebooks continued as Project Jupyter. Not all 'smaller' tools are recent, either: both Sciforum and SciVal have been around for more than 5 years but have only recently become active on Twitter. Also, SciVal may have been mainly interesting to university administrators at first, but has been made more accessible and interesting for 'end user' researchers.

Apparently, in scholarly communication tools do not go ‘viral’. Even for this group of fastest growers, the number of followers rarely doubles over the 6 month period.

Looking at the research phase the tools in the tables are aimed at, we predominantly see tools for Analysis, Publication and Outreach represented. For Analysis and Outreach, this might reflect the fact that potential users of these specific tools are relatively active on Twitter (perhaps more so than users of popular tools for e.g. Writing or Discovery). For Publication, it might also be a reflection of a growing interest in new publication models among various stakeholder groups in scholarly communication.

Of course, these are all post hoc explanations that have not been tested, e.g. against a comparable set of tools with a lower relative increase in Twitter followers, or substantiated by more in-depth analysis, e.g. of the characteristics of the people following these tools and sites on Twitter.

Tools per research activity
To further drill down into tools that are fast gaining popularity for specific research activities across the research cycle, the polar bar chart (Figure 1) below shows the tools with the highest relative increase in Twitter followers over the past six months (July 2015-January 2016) for each of 30 distinct research activities. Again, we focused on tools that had over 100 followers on July 1, 2015.


Figure 1. Tools with the largest relative increase in Twitter followers per research activity – July 2015-January 2016 (> 100 followers on July 1, 2015)

It is interesting to note that almost all tools in the polar bar graph are generic tools that can be used across fields and disciplines. Of the 30 tools in this figure, only BioRxiv, DH Commons, OLH, Flypapers and Benchling are field-specific. It is too early to say that a tool needs to be broadly applicable for people to flock to its Twitter account, but this is in line with a trend towards generic solutions. This could be something to dive into further once we have the tool usage data from our own survey.

Timeline of tools


The number and variety of online tools and platforms for all phases of the research cycle has grown tremendously over the years. We have been charting this 'supply side' of the scholarly communication landscape, first in our figure of 101 innovative tools in six different phases of the research cycle (Fig 1), and subsequently in our growing database of tools for 30 distinct research activities within these phases.


Fig 1. 101 Innovative tools in six research phases

To get a visual impression of the development of tools over time, we plotted the 600 tools currently in our database against the year they were created (Fig 2, also available on Plot.ly).


Fig 2. New tools by research phase, 1994-2015

New online tool development rose sharply at the end of the 1990s and again at the end of the 2000s. The recent rise in 2013 and 2014 may be an artefact of the way we have been collecting these tools since 2013: through (social) media mentions, reviews in journals and crowdsourcing. All three sources focus on tools that have just been launched. Apart from special circumstances in higher education and research, there may also be effects here of the dot-com bubble at the end of the 1990s and the web 2.0 explosion in the second half of the 2000s. The interesting peak of new outreach tools in 2008 lacks a clear explanation. The slump in 2015 for all types of tools is due to the fact that it easily takes 6-12 months before new tools attract media attention.

A more detailed view of tool development emerges when tools are plotted separately for the different activities within research phases, against the year (and month, where possible) they were created (Fig 3, also available on plot.ly). As an extra layer of information, we added the current number of Twitter followers (where available) as a proxy for the interest a tool has generated.

In interpreting this plot, there are some important considerations regarding the underlying data that should be taken into account:

  • We have limited inclusion to online tools specifically, excluding tools that are available as downloads only. Also, browser extensions are not included.
  • The picture for the early years (up to c. 2006) is less complete than that for more recent years. For example, early traditional publisher platforms are not included, and tools may have risen and fallen during this period and thus not have been added to our list.
  • We have assigned each tool to only one research activity. This affects the picture for tools that can be used for multiple research activities (like Mendeley, Figshare and ResearchGate).
  • The number of tools for the publication phase is distributed over many separate research activities, resulting in a seemingly less densely populated plot area.
  • It takes a while for (new) tools to accrue Twitter followers, which is why tools created in 2015 have relatively few Twitter followers.
  • The number of Twitter followers is one of many possible measures of interest/popularity, each with their own limitations (see, for example, this blogpost by Bob Muenchen). Our global survey on research tool usage, which is currently running, will provide more substantiated data on actual tool usage.

Taking these considerations into account, some interesting observations can be made:

  • The slow start of the development of online tools for academia is remarkable, given that the internet was developed (at universities!) decades ago. The first website was launched in 1991 and the first graphical browser (Mosaic) was introduced two years later, but it took until 1997 before PubMed became available and another four years before the first web-based reference management tool (RefWorks) was launched.
  • Of the 30 research activities we identify, search (nr. 3), experiment (nr. 9), publish (nr. 24) and outreach (nr. 25) have had the longest continuous period of (online) tool development. For these activities, many online tools were developed prior to 2008. Activities for which tools have only become available more recently are getting access (nr. 4) and post-publication peer review and commenting (nr. 27/28).
  • Relatively few tools exist for sharing posters and presentations specifically, and no new ones have been developed recently. However, there are many other tools (like FigShare and ScienceOpen) that enable archiving posters as one of their functionalities.
  • Tools for the writing, outreach and assessment phases often have many followers, perhaps because these tools are often relevant for all research disciplines (and, for writing and outreach, even beyond academia). Tools for discovery and publication are more often discipline-specific, which might reflect persistent differences in publication cultures and the desire for selectivity.

Fig 3. Tools per research phase and year – bubble size: Twitter followers (logarithmic)

101 days to go for 101 innovations survey

With 101 days to go before the 101 Innovations in Scholarly Communication survey closes it seems a good moment to let you know how far we have come and what’s still ahead.

Response volume

On October 31st the (English) survey had garnered a total of 5373 responses. Daily responses are steady and sometimes show a peak due to the distribution efforts of partnering institutions. It is good to see that many respondents also take the effort to answer the open question.

Breakdown by research roles and disciplines

Faculty, PhD students and postdocs are the three biggest groups of respondents. Most respondents are from the life sciences. Other disciplines are also well represented, with only law lagging. But these are absolute figures and should of course be compared to population sizes. The initial bias towards librarians has weakened, but they are still overrepresented. If only every librarian who takes the survey would pass it on to three researchers…


Translation into 6 world languages

In October the survey was translated into 6 languages, after we saw that responses in some countries were relatively low. Next to English, it is now available in Spanish, French, Russian and Chinese, while Japanese and Arabic will follow suit. Any help reaching out to research communities in these language areas is appreciated.

Custom URL partners

Some 60 institutions have partnered with us so far. In exchange for distributing the survey, they get the resulting data for their institution. We hope to find still more partners, especially in the language areas now served by the translations. Reaching out to libraries, which often act as intermediaries, also involves convincing them that this is not a sales pitch, but an opportunity to gain insight into their patrons' research practices, at no cost.

Press & presentations

Generally, the survey has been very well received, because of its timeliness and graphical layout. The initial poster from which the survey grew was featured on the InsideHigherED blog, and the survey and broader project were the subject of a podcast on the Scholarly Kitchen blog. We presented the ideas behind it at the OAI9 conference in Geneva and at the Open Access Week 2015 meeting in Brussels, and showed an example of some preliminary results at the 2:AM altmetrics conference in Amsterdam.

Still ahead

In the next 101 days we hope to see the number of responses and partnering institutions double. But we will also work on the next steps: preparing the data for release, preparing some scripts for our own analyses, finding a way to offer the data in a friendly dashboard style for anyone to work with, and interesting other researchers (you?) in using the data to test all kinds of hypotheses.