The Persée document processing line

Persée digitizes scientific publications to make them available on its www.persee.fr portal.
To help you find a document and browse a collection by issue number, book title or year, Persée designed and uses a dedicated processing tool: jGalith. It manages the successive stages of digitization, documentation, distribution and permanent archiving, and complies with several metadata standards (TEI, METS, DC, MODS, MADS, MARCXML). This blog post looks at the back office and the front office side by side to show how the 700,000 documents currently online on the portal have been described, structured and enriched.

 

The documentation processing line runs in five successive steps:

 

  • Document pre-processing: the table of contents of the issue is retrieved by OCR (optical character recognition).

 

  • Initial document processing: verification and enrichment of the data thus recovered (titles of the documentary units; names and responsibilities of the authors, and their link to the internal database, which is itself linked to the ABES database, IdRef; indication of the language; pagination; document type, such as article or report).

 

These two steps produce tables of contents that are at least identical to the print version and most often enhanced (headings and detailed reports). You can find them on the portal at the issue level.
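To make the data checked during initial processing more concrete, here is a minimal sketch in Python of what a documentary unit record might look like. It is not jGalith's actual data model: the class names, field names and the `table_of_contents` helper are all hypothetical, chosen only to mirror the elements listed above (title, authors and their responsibility, link to IdRef, language, pagination, document type).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Author:
    # All field names here are illustrative assumptions, not jGalith's schema.
    name: str
    role: str = "author"              # responsibility, e.g. author or translator
    idref_id: Optional[str] = None    # hypothetical link to an IdRef record


@dataclass
class DocumentaryUnit:
    title: str
    doc_type: str                     # e.g. "article" or "report"
    language: str                     # e.g. "fr" or "en"
    first_page: int
    last_page: int
    authors: List[Author] = field(default_factory=list)


def table_of_contents(units: List[DocumentaryUnit]) -> List[str]:
    """Return one line per documentary unit, ordered by first page."""
    lines = []
    for unit in sorted(units, key=lambda u: u.first_page):
        names = ", ".join(a.name for a in unit.authors)
        entry = f"{unit.first_page}-{unit.last_page}  {unit.title} [{unit.doc_type}]"
        if names:
            entry += f" ({names})"
        lines.append(entry)
    return lines
```

Sorting by first page is one simple way to rebuild the order of the printed table of contents from records entered in any order.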

 

 

 

  • Page-level processing: indexing of the elements initially identified as structuring the document (title levels, annexes, illustrations). Each digitized page is reviewed to locate these structuring elements, assign types to the OCR text, and check and correct it.

 

 

For illustrations, the same procedure applies, with the addition of any captions and any specific rights holders.

 

 

As a result of this step, you can use the navigation tools available in the Outline and Figures tabs.

 

 

Further enhancements can be made: summaries in text format, links between documents available on the portal (articles that cite or are cited by the document being viewed) and links to external repositories.

 

  • Document validation: internal verification of the data produced, via the jGalith interface.

 

  • Editorial validation: verification by the publication managers, also via the jGalith interface, through a dedicated access that lets them approve distribution.
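The five steps above can be pictured as an ordered sequence in which a document is distributed only once every step has been completed. The sketch below is an illustration under that assumption, not a description of how jGalith is implemented. The `Stage` enum and both helper functions are hypothetical names.

```python
from enum import IntEnum
from typing import Set


class Stage(IntEnum):
    # Order follows the five steps described in this post.
    PRE_PROCESSING = 1
    INITIAL_PROCESSING = 2
    PAGE_LEVEL_PROCESSING = 3
    DOCUMENT_VALIDATION = 4
    EDITORIAL_VALIDATION = 5


def next_stage(current: Stage) -> Stage:
    """Return the step that follows `current` in the processing line."""
    if current is Stage.EDITORIAL_VALIDATION:
        raise ValueError("editorial validation is the last step")
    return Stage(current + 1)


def ready_for_distribution(completed: Set[Stage]) -> bool:
    """A document can go online only when all five steps are done."""
    return all(stage in completed for stage in Stage)
```

For example, `next_stage(Stage.PRE_PROCESSING)` returns `Stage.INITIAL_PROCESSING`, and `ready_for_distribution` stays false until editorial validation has been recorded.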