Launched on February 2, 2022, Rapido has just celebrated its first anniversary! A look back at the first phase of this open science–oriented project.
Winner of the 2021 FNSO call for projects, Rapido (short for “Rendre Accessibles des Publications scientifiques Indexées et liées à des DOnnées certifiées”, or Making Indexed Scientific Publications Accessible and Linked to Certified Data) is led by ENS de Lyon on behalf of Persée, in partnership with Inist, the French School of Rome, the French School at Athens, and Abes.
The goal of the project is to implement an automated protocol that links publications to research data via IdRef toponym records, starting with a defined corpus of journals from the French Schools abroad.
What sets Rapido apart is its strong methodological ambition. The project requires defining a method for named entity recognition (NER) and for the automated annotation of Persée’s corpora, using tools developed for Istex and working in close collaboration with researchers. The aim is a new joint service that connects the Persée platform with Inist’s tools.
Persée’s Role
Persée is in charge of coordinating the Rapido project among long-standing partners brought together for this shared objective. It also serves as a provider of structured data resulting from the processing of archaeological journals from the French Schools, which are available on the Persée portal. Moreover, the portal will benefit from new navigation features thanks to the work carried out as part of Rapido.
Development work to integrate the concept of “toponym” into Persée’s tools is already underway. Adding this new type of information to existing documents will enable new forms of content exploration. To ensure the quality of the automatically detected and aligned toponyms, validation interfaces have been set up. A documentation specialist can accept or reject a “candidate” (a term identified as a potential toponym by Inist’s tools). A first round of validation on a training corpus has already been completed by experts from the French School at Athens, who reviewed many of the suggestions. Their feedback was essential for assessing the scientific relevance of the candidate toponyms.
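The accept-or-reject step described above can be pictured with a minimal Python sketch. It is only an illustration and does not reflect how Persée's validation interfaces are actually built. The class and function names (`Candidate`, `Status`, `review`, `validated`) and the record identifiers are hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class Candidate:
    term: str        # surface form detected in the text
    record_id: str   # placeholder identifier of the proposed authority record
    status: Status = Status.PENDING


def review(candidate: Candidate, accept: bool) -> Candidate:
    """Record a reviewer's decision on a single candidate."""
    candidate.status = Status.ACCEPTED if accept else Status.REJECTED
    return candidate


def validated(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that a reviewer has accepted."""
    return [c for c in candidates if c.status is Status.ACCEPTED]


# Hypothetical example: one genuine place name, one false positive.
candidates = [
    Candidate("Delphes", "placeholder-001"),
    Candidate("Apollon", "placeholder-002"),
]
review(candidates[0], accept=True)
review(candidates[1], accept=False)
kept = validated(candidates)
```

Keeping rejected candidates in the list, rather than deleting them, means the reviewers' decisions remain available as feedback for the detection tools.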
All actions undertaken by the various partners are thoroughly documented in a shared space to ensure maximum reproducibility of the procedures.
Inist’s Role
While Abes provides the IdRef corpus, Inist is responsible for detecting named entities related to archaeology and aligning them with the appropriate records. To achieve this, the corpus is automatically annotated by a program developed by a data analysis and text mining engineer at Inist. The annotations are then reviewed and, where necessary, corrected. The corrections feed back into the system to refine the program, incorporating archaeologists’ expertise.
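To make the annotate, review, and refine loop more concrete, here is a deliberately naive Python sketch. The article does not describe how Inist's program works, so this example swaps in a simple dictionary lookup as a stand-in. The gazetteer contents, the record identifiers, and the function names `annotate` and `apply_corrections` are all assumptions made for illustration.

```python
import re

# Hypothetical gazetteer: place name -> placeholder record identifier.
GAZETTEER = {
    "Delphes": "placeholder-001",
    "Délos": "placeholder-002",
}


def annotate(text: str, gazetteer: dict[str, str]) -> list[dict]:
    """Return one annotation per occurrence of a gazetteer name in the text."""
    annotations = []
    for name, record_id in gazetteer.items():
        # \b prevents a name from matching inside a longer word.
        for match in re.finditer(rf"\b{re.escape(name)}\b", text):
            annotations.append({
                "term": name,
                "start": match.start(),
                "end": match.end(),
                "record_id": record_id,
            })
    return sorted(annotations, key=lambda a: a["start"])


def apply_corrections(gazetteer: dict[str, str], rejected: set[str]) -> dict[str, str]:
    """Drop names that reviewers rejected so the next run no longer proposes them."""
    return {name: rid for name, rid in gazetteer.items() if name not in rejected}


text = "Les fouilles de Delphes et de Délos."
found = annotate(text, GAZETTEER)
refined = apply_corrections(GAZETTEER, rejected={"Délos"})
```

A real system would probably feed the corrections into a statistical or rule-based model rather than simply pruning a dictionary. The sketch only shows the general shape of the loop: annotations are produced, experts review them, and their decisions change what the next run proposes.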
The collaboration currently focuses on a defined corpus of around 4,000 archaeological documents, but the idea is to extend the methodology to other domains. Ultimately, the tools developed could be offered as web services via the Objectif TDM platform.
What’s Next for Rapido?
The coming months will focus on an evaluation phase. On the technical side, this means validating the annotations that make up the training corpus, running the tools across the full dataset, and reintegrating the data into Persée’s information system for deployment on the portal. On the methodological side, the goal is to consolidate and assess the procedures so that they can be replicated on other datasets.
Article written in collaboration with Inist.