What did 9 million scientific publications teach us?

On January 10th 1963, John F. Kennedy wrote in the foreword of a landmark report:

“One of the major opportunities … lies in the improvement of our ability to communicate information about current research efforts and the results of past efforts”.

In the midst of the Cold War, this report kicked off an effort to get American and Soviet officers together in order to agree on a plan that would provide access to critical scientific outputs from all over the world. And this is how AGRIS, the International System for Agricultural Science and Technology, was born.

Almost 50 years later, the team responsible for managing AGRIS at the Food and Agriculture Organization (FAO) of the United Nations decided that they should address a major challenge: they had built an online search engine looking into millions of publications related to agricultural science and technology; but the way that information was fed into the service was cumbersome and outdated. Their daily data collection and management routine involved plenty of e-mail exchanges, several phone calls, a couple of CD ROMs delivered by regular post, and many FTP links providing access to files in different formats.

OLYMPUS DIGITAL CAMERA

This was when an international task force was set up, bringing together FAO, the Chinese Academy of Agricultural Sciences (CAAS), and Agroknow. As part of a series of projects (including some EU-funded ones such as agINFRA and SemaGrow), the task force worked for almost 5 years on revamping and extending all aspects of technology, content and workflows powering the AGRIS service.

During this collaboration, our team devoted a significant part of its R&D resources into building a state-of-art technological backbone that could meet two major requirements:

Manage information coming from more than 250 research institutions and publishers globally in different forms, with frequent updates.
Combine this information to continuously power a search engine of almost 9 million publications in a large number of languages.

In early 2017, the operational support of the service was handed over again to the team at FAO. And we had become better and wiser — from plenty of achievements and failures.

There is one thing that, I think, we did extremely well: to put together some really elegant pieces of technology. We have developed and deployed information indexing technologies that can support very fast and large scale data indexing. We have automated a major part of the data transformation process, with very efficient algorithms that could transform a variety of data formats into the harmonized one powering the search engine. And we have thoroughly tested the efficiency of state-of-art semantic technologies in a demanding operational environment that required linking and aligning different ontologies and thesauri in order to describe scientific information, across a very large and distributed information network.

There is one area where we could have done much more: elaborating on the mechanisms and workflows through which intellectual property, terms of use, and re-use licenses were managed. The service is dealing with a very large number of data providers and consumers, each of which has its own data sharing or usage policy (or none). And the data flow involved storing and managing information at multiple locations — the providers’ side, at our servers, at FAO’s servers, as well as at the servers of collaborating institutions that were offering additional storage or computing resources.

If we had to do this project again, I think that I would start from mapping all data flows, so that such legal implications are well understood and addressed. Focusing on the technological problems is natural for a computer engineer like me. But let us first understand very well where (and how, and by whom) the data is going to be used, before we start deploying the fancy tech stuff.

What did 9 million scientific publications teach us?

Share The Food Safety Intelligence Blog!