Research corpora – Persée UAR

The Persée platform supports the production, use and dissemination of digital corpora across disciplines (the humanities and social sciences, but also the environmental and earth sciences). Our model is characterized by close attention to the standardized description and structuring of documents, so that the resulting data can be used and later reused. It rests on three elements: a workflow, a software and document-processing pipeline, and a shared information system. The workflow offered to researchers covers the entire life cycle of the digital corpus:

Digitization of corpora, which may include publications, grey literature, archives, and iconographic and cartographic sources, and/or integration of already digitized documents
OCR (optical character recognition) adapted to the document type
XML structure encoding and semantic enrichment
Distribution through a dedicated Perséides website and editorialization of digitized content
Open access to metadata and controlled access to documents, respecting the rights of third parties
Data exposure on the web (RDF triplestore) and through an interoperability protocol (OAI-PMH)
Referencing in search engines and reporting tools
Association of search, visualization and alignment tools with scientific and documentary repositories
Data hosting and backup
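
To illustrate the OAI-PMH step in the list above, here is a minimal sketch of how a harvester typically builds its requests. The verbs (ListRecords, GetRecord) and the oai_dc metadata prefix come from the OAI-PMH 2.0 specification. The endpoint address, the record identifier and the helper name build_oai_request are assumptions made for the example and do not describe Persée's actual service.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration; not Persée's actual address.
BASE_URL = "https://example.org/oai"


def build_oai_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments.

    Hypothetical helper: it only assembles the query string and does not
    contact any server.
    """
    params = {"verb": verb}
    params.update(arguments)
    return f"{base_url}?{urlencode(params)}"


# Ask for all records, described in unqualified Dublin Core.
list_records_url = build_oai_request(
    BASE_URL, "ListRecords", metadataPrefix="oai_dc"
)

# Ask for a single record by its (made-up) identifier.
get_record_url = build_oai_request(
    BASE_URL,
    "GetRecord",
    identifier="oai:example.org:doc-1",
    metadataPrefix="oai_dc",
)

print(list_records_url)
print(get_record_url)
```

A harvester would send these URLs over HTTP and parse the XML responses. The sketch stops at building the requests so that it runs without network access.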

Persée’s information system consists of the Persée portal and interoperable data repositories derived from the Perséides collections as they are produced. Its value lies not only in the progressive aggregation of content but, above all, in the interlinking of data and metadata according to the principles of the semantic web.

If you are a researcher planning to build a digital corpus, contact us to find out more about this service and how to access it.

See our first achievements: the Perséides collections