Production and monitoring tools
Persée has developed a production and monitoring tool that supports every phase of preparing documents for the website. It is operated through the jGalith interface, which we have already discussed in other posts.
Thanks to jGalith, we can manage both the quality of the generated images and the optical character recognition (OCR).
Material preparation: this takes place before digitization and consists of describing each document by building an outline (fig 1). The outline records physical information such as the number of pages, covers and delicate pages, as well as documentary data such as the location of tables of contents, blank pages, colour pages, etc. It is then used throughout production and serves as the reference for all subsequent operations.
Fig 1. Example of outline produced with covers, blank pages, color, dummies, double-pages, inserts
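To make the idea of an outline concrete, here is a minimal Python sketch of how such a description could be represented. It is not jGalith's actual data model: the class names, the field names and the list of page types are all hypothetical, chosen only to mirror the kinds of information mentioned above.

```python
from dataclasses import dataclass, field
from enum import Enum


class PageType(Enum):
    # Hypothetical categories, taken from the kinds of pages the post mentions.
    TEXT = "text"
    COVER = "cover"
    BLANK = "blank"
    COLOUR = "colour"
    DUMMY = "dummy"
    INSERT = "insert"
    TABLE_OF_CONTENTS = "table_of_contents"


@dataclass
class OutlinePage:
    position: int            # 1-based position in the physical document
    page_type: PageType
    delicate: bool = False   # page that needs careful handling


@dataclass
class Outline:
    document_id: str         # hypothetical identifier
    pages: list[OutlinePage] = field(default_factory=list)

    def count_by_type(self) -> dict[PageType, int]:
        """Number of pages of each type described in the outline."""
        counts: dict[PageType, int] = {}
        for page in self.pages:
            counts[page.page_type] = counts.get(page.page_type, 0) + 1
        return counts


# A made-up six-page document.
outline = Outline(
    document_id="example-issue",
    pages=[
        OutlinePage(1, PageType.COVER),
        OutlinePage(2, PageType.BLANK),
        OutlinePage(3, PageType.TABLE_OF_CONTENTS),
        OutlinePage(4, PageType.TEXT),
        OutlinePage(5, PageType.TEXT, delicate=True),
        OutlinePage(6, PageType.COVER),
    ],
)
```

Keeping the page type and position together in one record is what lets later steps compare the scanned result against the description.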
Once described, the documents are scanned and the images are checked via DPUScan, one of the scanner drivers.
Digitization recovery (fig 2): this is the first quality-control step after digitization. The number of pages of each type (text, covers, dummies, colour…) is checked automatically.
Fig 2 The digitization recovery
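A per-type page count check of this kind can be sketched in a few lines of Python. This is only an illustration: it assumes the expected counts come from the outline built during material preparation, and the function name and the string labels for page types are hypothetical.

```python
from collections import Counter


def check_page_counts(expected_types, scanned_types):
    """Return {page_type: (expected, scanned)} for every type whose counts differ."""
    expected = Counter(expected_types)
    scanned = Counter(scanned_types)
    mismatches = {}
    for page_type in sorted(set(expected) | set(scanned)):
        # A Counter returns 0 for a type it has never seen.
        if expected[page_type] != scanned[page_type]:
            mismatches[page_type] = (expected[page_type], scanned[page_type])
    return mismatches


# Made-up example: one text page is missing from the scan.
outline_types = ["cover", "blank", "text", "text", "text", "colour", "cover"]
scanned_types = ["cover", "blank", "text", "text", "colour", "cover"]

mismatches = check_page_counts(outline_types, scanned_types)
```

An empty result means every page type has the expected number of pages; otherwise the result names the types to look at first.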
Validation (fig 3): this is the second quality-control step. It checks that the digitized pages correspond to those described in the outline created earlier, essentially to make sure that the pages are in the right place and of the intended type. Pages may be rejected if their quality is insufficient.
At least 10 pages must be checked during this operation.
Fig 3 The validation phase of the digitization
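The position and type check performed during validation can be illustrated with a short Python sketch. It is a simplified model, not the jGalith implementation: the function name and the string labels are hypothetical, and the rule that a document shorter than the minimum is checked in full is an assumption made for the example.

```python
MIN_PAGES_TO_CHECK = 10  # minimum number of pages to check, as stated in the post


def validate_pages(outline_types, scanned_types, checked_positions):
    """Compare the checked pages (1-based positions) against the outline.

    Returns a list of (position, expected_type, scanned_type) for every
    checked page whose type differs from the outline.
    """
    positions = sorted(set(checked_positions))
    # Assumption: a document shorter than the minimum is checked in full.
    required = min(MIN_PAGES_TO_CHECK, len(outline_types))
    if len(positions) < required:
        raise ValueError(f"only {len(positions)} pages checked, {required} required")
    problems = []
    for position in positions:
        if not 1 <= position <= len(outline_types):
            raise ValueError(f"position {position} is not in the outline")
        expected = outline_types[position - 1]
        # A page missing from the scan is reported with scanned type None.
        scanned = scanned_types[position - 1] if position <= len(scanned_types) else None
        if scanned != expected:
            problems.append((position, expected, scanned))
    return problems


# Made-up 12-page document in which page 4 was scanned as a blank page.
outline_types = ["cover"] + ["text"] * 10 + ["cover"]
scanned_types = list(outline_types)
scanned_types[3] = "blank"

problems = validate_pages(outline_types, scanned_types, range(1, 13))
```

Image quality, the other reason for rejecting a page, is a visual judgement and is deliberately left out of this sketch.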
Flow management:
This very important phase is implemented as a workflow. A series of software robots (fig 4) automatically process the pages to straighten them, clean them, mark the margins around the text and illustrations, etc.
Fig 4 A software robot, the basic component of the workflow
We use two main types of software robots:
– OCR (Optical Character Recognition): an operation that extracts words and characters from a digitized image and converts them into digital text of the kind used in word processors. Each word is identified and located on its page, then indexed. Thanks to this operation, the words matching a search can be highlighted directly on the page images (fig 5).
Fig 5 Here the word “vaudou” is sought. Thanks to the localization of the words during the OCR phase, the word is highlighted in all the images containing it.
The robot automatically straightens the pages, both to make them more pleasant to look at and to improve the OCR results. Margins are also set automatically, the text is framed, and scanning imperfections along the edges of the pages are removed (fig 6).
Fig 6 Example of conversion of the text in the image into a digital text (in the text box on the right) – Adjustment of margins that cleans the imperfections on the left and bottom edges of the page – Correction of the angle of the page which can be seen by following the black line on the left edge.
– Cleaning of the image and insertion into a database. This step makes the produced pages homogeneous, whatever their publication date. A page from 1821 must look like a page from 2016 and be usable in the same way, without the marks of time that could hinder reading (fig 7).
Fig 7 Cleaning and insertion of images into the database
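The word localization described for the OCR robot above is what makes search highlighting possible. As a rough illustration, the following Python sketch builds a small index from recognized words to their page and bounding box. The record layout (`text`, `page`, `box`), the function names and the sample coordinates are all hypothetical; the post does not describe the format actually used.

```python
from collections import defaultdict


def build_word_index(ocr_words):
    """Map each lower-cased word to the list of (page, box) where it occurs."""
    index = defaultdict(list)
    for word in ocr_words:
        index[word["text"].lower()].append((word["page"], word["box"]))
    return index


def find_highlights(index, query):
    """Return the (page, box) locations to highlight for a search term."""
    return index.get(query.lower(), [])


# Made-up OCR output: box is (left, top, right, bottom) in pixels.
ocr_words = [
    {"text": "Le", "page": 12, "box": (100, 200, 130, 225)},
    {"text": "vaudou", "page": 12, "box": (140, 200, 230, 225)},
    {"text": "Vaudou", "page": 47, "box": (80, 510, 172, 536)},
]

index = build_word_index(ocr_words)
hits = find_highlights(index, "vaudou")
```

Each hit carries enough information for a viewer to draw a highlight rectangle over the right page image.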
The workflow is managed by a Supervisor (fig 8), which collects processing information from the various robots in operation. The workflow is massively parallel: we can run as many software robots as we need, for example 5 robots working on the OCR and 8 robots on the cleaning and insertion of the images. It is a very flexible tool that allows us to process an entire collection, several collections simultaneously, a defined number of scattered documents, or ranges of pages. The granularity goes from the collection down to the individual page. Processing can be carried out either from the offices of Persée or from any place with an internet connection.
Fig 8 The workflow supervisor
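The parallel organisation described above can be sketched with Python's standard `concurrent.futures` module. This is only an illustration under stated assumptions: the two robot functions are placeholders that do no real work, the two stages are treated as independent of each other, and `run_workflow` plays the role of a very simplified supervisor. None of these names come from jGalith.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_robot(page_id):
    # Placeholder for the real OCR processing of one page.
    return {"page": page_id, "stage": "ocr", "status": "done"}


def cleaning_robot(page_id):
    # Placeholder for the real cleaning and database insertion of one page.
    return {"page": page_id, "stage": "cleaning", "status": "done"}


def run_workflow(page_ids, ocr_workers=5, cleaning_workers=8):
    """Run both stages over the given pages and collect every robot's report."""
    page_ids = list(page_ids)
    with ThreadPoolExecutor(max_workers=ocr_workers) as ocr_pool, \
            ThreadPoolExecutor(max_workers=cleaning_workers) as cleaning_pool:
        # map() submits all tasks immediately, so both pools work at the same time.
        ocr_reports = ocr_pool.map(ocr_robot, page_ids)
        cleaning_reports = cleaning_pool.map(cleaning_robot, page_ids)
        # Results come back in the order of page_ids.
        return list(ocr_reports) + list(cleaning_reports)


# The unit of work is a single page; here, a made-up range of 20 pages.
reports = run_workflow(range(1, 21))
```

Because the unit of work is the page, the same function could be fed a whole collection, a handful of scattered documents or a page range simply by changing the list of page identifiers.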
Eric Astier, Production Manager