Six paths to a global Open Access Repository

repositories: gathering and opening (img: Design Loft concept for Kent State's College of Architecture and Environmental Design building, by Weiss/Manfredi architects)

The Web was created for scientific communication, but 20 years after its launch, only a small percentage of scientific/scholarly publications are freely Web-accessible/reusable. Only about 12% of publications are self-archived compared to an estimated 81% that could be[1]. Open Access publishing is growing but it covers only some articles, and comparatively little older, humanities, or book/other content. Repositories, run by parties other than publishers, containing possibly preprint or alternate forms of content, may offer much of the low-hanging fruit in expanding access to research literature.

A late-breaking related development is the CHORUS proposal (ClearingHouse for the Open Research of the United States) from a group of mostly-US scientific publishers and publishing organizations. This proposes a publisher-run system, hosting content on publishers’ sites, to fulfill the new US government mandate of public access to new federally-funded research results, as expressed in the White House Office of Science and Technology Policy (OSTP) policy memorandum. I’ll just say that upon initial inspection I see this as an effort by incumbent players to control and contain the emerging open-access environment, and as much narrower in scope than the type of global, open system I would envision and advocate.

I’d also note Ross Mounce’s recent, excellent guide to researcher self-archiving, “Easy steps towards open scholarship” (LSE Impact of Social Sciences blog, May 24).

Expanding on, and in contrast with, both CHORUS and Mounce’s guide, here I present six ideas to broaden, accelerate, and amplify the impact and uptake of repositories, from a global perspective: researchers and users, academic and not, and all types of work in all fields and all countries. What might we do globally to make the biggest difference the soonest with the least resources?

  1. “OpenRef”: analogous to CrossRef, a single service point to look up, request, or submit materials, offering the simplest, most user-centered possible interface for all needs.
  2. “Best available version” concept: recognize preprints, drafts, outlines, alternate articles, book summaries, etc., as legitimate versions for many purposes.
  3. Identifier assignment and association/clustering (e.g. of DOIs) for all materials.
  4. 80/20 approach: discover and focus on the content that is most needed.
  5. Crowdsource the identification, prioritization, discovery/archiving, and creation of archivable works, e.g. with the #paywall hashtag.
  6. Global scope: not limited by institution, discipline, country, educational level, or genre of work.

Also, in section 7, “Current and possible players,” I discuss how various organizations and projects partly already do, or might in future, take these approaches: e.g. Google Scholar, arXiv, and JISC. In section 8, “Frequent objections,” I discuss questions like “doesn’t Google Scholar do this already?” and “don’t most fields lack the necessary preprint culture?” Finally, in section 9, “A Lean Startup approach,” I ask how we might usefully think of this project like a “lean” or agile startup.


1. “OpenRef”

We have an excellent precedent, in the form of the DOI handle system and CrossRef, for a global lookup hub that connects users to publishers’ versions of content, and manages metadata. Now, what if we built such a thing, but centered on users’ needs and not bounded by the interests of publishers?

Imagine a single hub at which one could easily a) submit any materials to a default Open Access repository (say, Zenodo); b) request items not yet available; and c) resolve requests, i.e. automatically check whether there is an available resource for a given DOI, ISBN, etc.
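To make the three functions concrete, here is a minimal sketch in Python. Everything in it is my own illustration, not an existing service: the `OpenRefHub` class, its method names, the in-memory storage, and the example identifiers and URL are all hypothetical. It only shows how submit, request, and resolve could share one identifier-keyed lookup.

```python
from dataclasses import dataclass, field


def normalize_id(identifier: str) -> str:
    """Lower-case an identifier and strip common DOI/ISBN prefixes."""
    ident = identifier.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:", "isbn:"):
        if ident.startswith(prefix):
            ident = ident[len(prefix):]
    return ident


@dataclass
class OpenRefHub:
    """Hypothetical in-memory hub: identifier -> open copies, plus a request log."""
    open_copies: dict = field(default_factory=dict)
    requests: dict = field(default_factory=dict)

    def submit(self, identifier: str, url: str) -> None:
        """(a) Register an openly accessible copy for an identifier."""
        self.open_copies.setdefault(normalize_id(identifier), []).append(url)

    def request(self, identifier: str) -> int:
        """(b) Record demand for an item; returns the running request count."""
        key = normalize_id(identifier)
        self.requests[key] = self.requests.get(key, 0) + 1
        return self.requests[key]

    def resolve(self, identifier: str) -> list:
        """(c) Return known open copies; log a request when none exist."""
        key = normalize_id(identifier)
        copies = self.open_copies.get(key, [])
        if not copies:
            self.request(key)
        return list(copies)


# Made-up identifiers and URL, for illustration only.
hub = OpenRefHub()
hub.submit("doi:10.1234/example", "https://repository.example.org/record/1")
found = hub.resolve("https://doi.org/10.1234/EXAMPLE")  # same work, different form
missing = hub.resolve("isbn:9780000000002")             # nothing deposited yet
```

Note that an unresolved lookup is itself recorded as a request, so the hub accumulates demand data as a side effect of ordinary use.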

Building on the #PDFtribute movement that happened after Aaron Swartz’s death, in which academics worldwide posted links to their self-archived papers with the hashtag #PDFtribute, we might think about how to build the simplest and most universally accessible repository interfaces possible: for example, submitting just by tweeting, emailing, or using a single simple Web form.

CHORUS appears to be proposing a version of this, whereby a CrossRef-like hub resolves requests to publisher-hosted versions of publicly-accessible work when applicable, or to repositories when certain trigger conditions are met.

2. “Best available version” concept

I believe we should recognize preprints, drafts, outlines, alternate articles, book summaries, etc., as legitimate versions for many purposes.

In scholarship, the concept of “version of record” describes a single authoritative expression of some body of work. The implication is that a single summation is needed, and other versions are less legitimate. There may be some case for the evidentiary or logical necessity of such a single version, but I’d note several obvious issues:

a) it doesn’t account for the wide range of ways works are actually used, both by a single user at different times, and by the variety of audiences;
b) it tends to grant a monopoly power (via copyright) to those who control that single “version of record,” and this power may be unaligned with public interest or access; and
c) even with a “version of record,” there are different facets for different purposes: e.g., full text, abstract, metadata, and how it’s represented in Google Scholar or Web of Science indexes.

If you consider the various ways a piece of research literature is used, you can see that this so-called “version of record” is frequently not necessary, or perhaps even suitable.

First, the most common use of an academic article is a very brief (< 10 secs) inspection by a researcher scanning it to see if it should be fully downloaded or bookmarked for later attention. This inspection is typically based on the abstract and metadata such as author information. It doesn’t require the complete or original work.

Secondly, the user is usually interested in the ideas or factual content, not in the exact expression of them. The original publication may not be the best exposition of that material. For example, let’s say that in studying some area of economics, you wish to review or even cite a number of classic papers in the area, including, say, Coase’s often-cited “The Nature of the Firm.” You might prefer to read a concise treatment that has been widely endorsed as a clear statement of Coase’s ideas; it may even be more useful, if the restatement includes analytic clarification, counterargument, other references, etc.

The original expression may be, for various reasons, unsuitable for current use: for example, something expressed at book length, which a professor wishes to assign in a class that would not have time to read a long work. Or the original may simply be badly written or poorly argued. Why reify the “version of record” if it just isn’t that good an expression of the ideas?

In all of these scenarios, U.S. copyright law at least would allow the creation of alternate, “derivative” works restating or summarizing the original, if they are “transformative” for any of several reasons. Creating new utility by summary or restatement has been explicitly upheld by courts as such a transformative use.

I’d propose a large-scale project to create summary versions of all often-cited papers and books (including non-scholarly books), which could be openly accessible to all users. Not only would access be expanded, but these versions might often be preferable to researchers who *do* have access to the original forms.


3. Identifier assignment and association/clustering

I’d suggest, particularly for the humanities where this is least done now, that all works, current and retrospective, be assigned a standard identifier such as a DOI. This is increasingly easy and inexpensive, as with services such as Figshare (data oriented, but not limited to it) or the recently-launched Zenodo from CERN.

It costs about $275 to be a CrossRef member and assign DOIs, plus $1 per DOI for material from the last two years and $0.15 per DOI for “backfile” materials older than that (see CrossRef fees).
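As a worked example of what those fees add up to, here is a small calculation using only the figures quoted above. The volumes are my own assumption: a hypothetical small journal registering 200 recent articles and a 5,000-article backfile in one year.

```python
from decimal import Decimal

# Fee figures as quoted in the text above.
MEMBERSHIP_FEE = Decimal("275.00")    # membership
CURRENT_DOI_FEE = Decimal("1.00")     # per DOI, material from the last two years
BACKFILE_DOI_FEE = Decimal("0.15")    # per DOI, older "backfile" material


def doi_cost(current_items: int, backfile_items: int) -> Decimal:
    """Total cost of membership plus DOI registration for the given volumes."""
    return (MEMBERSHIP_FEE
            + current_items * CURRENT_DOI_FEE
            + backfile_items * BACKFILE_DOI_FEE)


# Hypothetical volumes, chosen only for illustration.
cost = doi_cost(current_items=200, backfile_items=5000)
print(f"Total: ${cost:.2f}")  # 275.00 + 200.00 + 750.00 = 1225.00
```

Under these assumptions, the backfile of 5,000 items costs less than four times as much as the 200 recent ones, which is why retrospective assignment looks feasible.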

What really opens up the system, however, is for the “version of record” identifiers (which are typically what’s cited and indexed) to get associated with identifiers for preprints or other alternate versions. Currently such preprint identifier association is done in some areas, notably on arXiv, the best-known repository, which collaborates with Inspire (formerly SPIRES) to automatically update arXiv metadata with the DOI and journal references of published versions.

Even more intriguingly, Chris Jack at Mendeley reported that they are successfully running clustering algorithms across their entire corpus of papers, believed to be the most complete scholarly paper archive existing. While this is usually done to group different files referencing the same final published papers, Jack suggested [personal correspondence] that it could feasibly be done to group preprints with final-form works as well.
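To illustrate the general idea of association by clustering, here is a deliberately crude sketch. It is not Mendeley’s method, which I haven’t seen; it simply groups records whose titles match after normalization. The function names and all identifiers below are made up, and a real system would need fuzzier matching on authors, dates, and text, as well as handling of non-ASCII titles.

```python
import re
from collections import defaultdict


def title_key(title: str) -> str:
    """Crude clustering key: lower-case, drop punctuation, collapse spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def cluster_versions(records):
    """Group (identifier, title) pairs whose normalized titles are identical."""
    clusters = defaultdict(list)
    for identifier, title in records:
        clusters[title_key(title)].append(identifier)
    return dict(clusters)


# Hypothetical records: a published version, its preprint, and an unrelated paper.
records = [
    ("doi:10.1234/journal.5678", "An Example Study of Widgets"),
    ("arXiv:0000.00000", "An example study of widgets."),
    ("doi:10.1234/journal.9999", "A Different Paper"),
]
clusters = cluster_versions(records)
for key, identifiers in clusters.items():
    print(key, "->", identifiers)
```

Each cluster with more than one identifier is a candidate association between a “version of record” and an alternate version.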


4. 80/20 approach

The 80–20 rule, also known as the Pareto principle, states that, for many events, roughly 80% of the effects come from 20% of the causes. Another way to put it is that in many situations, solving just part of the problem achieves much of the purpose.

There are an estimated 2M scholarly articles published per year, and perhaps 50M in existence; and let’s say 50,000 books published per year likely to be cited in scholarship. However, use and citation among these works is highly concentrated, probably close to a power-law distribution.

What if we had good data about exactly which works were wanted, for which audiences and purposes, and invested resources exactly where the greatest benefit could be created? Perhaps in some fields, a few hundred works make up the majority of discussion and citation, and so discovering or creating open-access versions of them could be transformative.
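Given any such usage or citation data, the question “how many works would we need to open up to cover most of the demand?” is easy to compute. Here is a minimal sketch; the function name is my own, and the citation counts are synthetic (a Zipf-like distribution invented for illustration), not real data.

```python
def works_needed(citation_counts, target_share=0.8):
    """Smallest number of top-cited works whose citations reach
    target_share of all citations in the list."""
    ranked = sorted(citation_counts, reverse=True)
    total = sum(ranked)
    running = 0
    for n, count in enumerate(ranked, start=1):
        running += count
        if running >= target_share * total:
            return n
    return len(ranked)


# Synthetic field of 1,000 works: the work at rank r gets 1000 // r citations.
counts = [1000 // rank for rank in range(1, 1001)]
needed = works_needed(counts)
print(f"{needed} of {len(counts)} works account for 80% of citations")
```

The more concentrated the real distribution turns out to be, the smaller that number, and the stronger the case for targeting effort at a short list of works.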

There are various ways such data is being or could be gathered:

  • Thomson’s Journal Citation Reports;
  • academic libraries’ COUNTER reports;
  • analyzing the citations in Wikipedia;
  • analyzing the requests to “OpenRef” (see #1 above);
  • tracking usage of a “#paywall” hashtag (see #5 below);
  • analyzing curricula/syllabi (the Mellon Foundation funded such a study a few years ago; citation needed).

Starting with the most-requested resources, we might consider crowdfunding public access, translation, organizing a search for preprint or manuscript versions, or creation of an analytical summary, or just a public placeholder page to collect related links and comments.  (various projects have proposed such a “one page per work”, e.g. Open Library, [Berkeley project - citation needed], Bibliopedia).


5. Crowdsourcing & social media

Crowdsourcing and social media might be used to assist the identification, prioritization, discovery/archiving, and creation of archivable works. Here’s an example of how this could be done in social media:

1) Encourage a practice of using the tag #paywall whenever tweeting or mentioning a scholarly work that isn’t publicly accessible. If you see a mention that doesn’t use the tag, retweet it *with* the tag.

2) Monitor for that tag, and crowdsource effort to find any Open Access version (e.g. preprint, author’s last version) or “best available version” (perhaps a Wikipedia page, synopsis, precis, or alternate article by the same author).

3) Minimally, just reply/retweet with the Open Access version. Better, put the OA version into a repository, perhaps assign it a DOI, and create discoverable associations between it and any other identifiers related to the work, e.g. the DOI or ISBN of the non-publicly-accessible version.

4) Use #paywall tag occurrences to study patterns in demand, and focus efforts where the payoff is highest.
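Steps 2 and 4 could start from something as simple as the following sketch, which tallies the DOIs mentioned in tagged posts. It assumes the posts have already been collected as plain strings (how to fetch them from any particular social network is out of scope here), and the DOI pattern is a rough heuristic of mine rather than a complete DOI grammar. The sample posts and DOIs are made up.

```python
import re
from collections import Counter

# Rough heuristic for DOIs: "10.", a 4-9 digit prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def paywall_demand(posts):
    """Count DOI mentions in posts tagged #paywall (case-insensitive)."""
    demand = Counter()
    for text in posts:
        if "#paywall" not in text.lower():
            continue
        for doi in DOI_PATTERN.findall(text):
            # Trim trailing sentence punctuation and normalize case.
            demand[doi.rstrip(".,;:?!)").lower()] += 1
    return demand


# Hypothetical posts, for illustration only.
posts = [
    "Great paper but #paywall https://doi.org/10.1234/abc.1",
    "Can anyone share 10.1234/ABC.1? #Paywall",
    "Open version here: https://doi.org/10.5555/open.2",
    "#paywall again: doi:10.9999/xyz.7.",
]
ranked = paywall_demand(posts).most_common()
print(ranked)  # → [('10.1234/abc.1', 2), ('10.9999/xyz.7', 1)]
```

The ranked list is exactly the kind of demand signal that the 80/20 approach in section 4 needs.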


6. Global scope

The effort should not be limited by institution, discipline, country, educational level, or genre of work.

7. Current and possible players

I’m well aware that many smart people have been thinking about and building digital archive/repository systems for 20+ years. The approaches I’ve described may partly be, or could be, employed by many existing parties, alone or collaborating.

  1. Google Scholar
  2. arXiv
  3. NEH
  4. DPLA
  5. British Library
  6. JISC
  7. Open Scholarship Project
  8. Projects using repository software such as DSpace or Hydra

8. Frequent objections

a) Many smart people have been thinking about and building digital archive/repository systems for 20+ years.

See sec. 7 above. I mean just to lay out six pathways to change, without exhaustively reviewing the huge field of prior work or suggesting that these ideas haven’t been considered or tried at all.

b) Google / Google Scholar already does this well enough.

Google Scholar is probably the best current means to discover openly-accessible scholarly materials, but it is unsatisfactory in many ways.
1) It has long been a low-priority project at Google, and its survival is not assured.
2) Its algorithms, criteria, and scope of coverage are non-transparent and unstable. Coverage varies widely between fields, within fields, even within journals, in a seemingly arbitrary way.
3) Google Scholar returns results in a high-noise, unstructured, search-results form. There is no API, explicitly no intention to produce one, and there are strict limits on what you can do with the data. So there is very limited ability to build new services upon Google Scholar, or to incorporate it into other services.

c) Many repositories and repository search services already exist.

Yes, great. Let’s keep developing them and bringing them closer to the “OpenRef” concept I describe. They’d be much more useful with more universal identifier assignment and clustering/association, as I describe in sec. 3 above. Also, I’d suggest that ultra-easy repository deposit be integrated with the same service point for searching, which I’ve never seen done.

Whenever the idea of new or global repositories comes up, a normal response is that there are hundreds of repositories already, embedded in institutions which fund and manage them, and we should be making unified service interfaces such as search tools on top of them.

I quite agree. There are various ways materials could be stored and managed under the hood. I think we should focus on the service we wish to deliver and the quality of the user experience, and then examine how that service might be implemented by weaving together possibly many underlying infrastructures.

d) Repositories have only succeeded in disciplines that had existing preprint cultures.

The adoption of repositories such as arXiv has long been attributed to prior disciplinary “preprint cultures” in some fields, which don’t exist in others. Fine. I say, let’s do what we can anyway. Lots of things are changing, such as the ease of assigning DOIs to arbitrary new content, and national/institutional open-access mandates. We see with the #PDFtribute movement that there may be large latent interest in sharing/archiving preprints that current practices are not activating. And as Ross Mounce’s guide “Easy steps towards open scholarship” observes, the problem is cultural, and it can be steadily changed by education and advocacy.

e) Why not just add disciplines to arXiv?

Good question. It’s possible arXiv or clones/branches of it could be the best candidate to do much of what I propose.

f) Mike Taylor says that “Institutional repositories have work to do if they’re going to solve the access problem”.

Those are all good points, well stated. Generally, he observes that repositories haven’t been all that great so far, some years in; I suggest ways to make them more open, more usable, faster, hopefully getting sooner to what Taylor would prefer.


9.  Afterword: A Lean Startup approach

One possibly new angle would be to look at scholarly repositories the way a Silicon Valley-style startup might look at a consumer service. In that case, you might ask:

1) Is there compelling value or problem solved, conveyable in 1 sentence?
2) Is there an addressable market/value of at least $100Ms?
3) Do we have some competitive advantage(s) that make our entry plausible, such as a) first-mover advantage; b) unique skill, technology, or IP; c) brand, distribution channel, or partners?
4) Can we put together a team and resources to execute on this?
5) Can we validate and prototype the offering with fast, lean, learning iterations?
6) Can we develop a Minimum Viable Product soon enough that addresses a sufficient part of the problem to get significant customer/user adoption, and traction in terms of revenue/engagement/feedback?



[1] Björk et al. “Anatomy of Green Open Access” (forthcoming in JASIST). Preprint.