Automating Data Imports in a DSpace-CRIS Institutional Repository
The migration of Infoscience, EPFL’s institutional repository, to DSpace-CRIS required a custom Python-based pipeline to automate the ingestion of research outputs and datasets. Limitations in default DSpace-CRIS import tools, such as insufficient query controls, incomplete metadata mappings, and a lack of deduplication mechanisms, necessitated a tailored approach.
The pipeline leverages the DSpace REST API to enable precise queries, metadata reconciliation, and robust deduplication. It incorporates fallback mechanisms, such as publisher-specific APIs, for full-text retrieval when standard tools like Unpaywall and CrossRef prove insufficient. Key challenges included reconciling authorship with EPFL directories, aligning metadata across diverse collections, and maintaining data consistency during imports.
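The deduplication and fallback ideas described above can be illustrated with a minimal sketch. It is not the Infoscience pipeline itself: the function names (`normalize_doi`, `deduplicate`, `first_fulltext_url`), the use of the DOI as the matching key, and the record layout (dictionaries with a `doi` field) are all assumptions made for illustration. Full-text sources are passed in as plain callables, so no specific Unpaywall, CrossRef, or publisher API call is implied.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical list of prefixes under which the same DOI may appear.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Lowercase a DOI and strip a resolver prefix so variants compare equal."""
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def deduplicate(candidates: Iterable[Dict], existing_dois: Iterable[str]) -> List[Dict]:
    """Keep candidates whose DOI is neither in the repository nor repeated in the batch."""
    seen = {normalize_doi(d) for d in existing_dois}
    unique = []
    for record in candidates:
        doi = normalize_doi(record.get("doi", ""))
        if not doi or doi in seen:
            # Assumption: records without a DOI are set aside here; a real
            # pipeline would need another matching key (e.g. title and year).
            continue
        seen.add(doi)
        unique.append(record)
    return unique


# A resolver takes a DOI and returns a full-text URL, or None if it has none.
Resolver = Callable[[str], Optional[str]]


def first_fulltext_url(doi: str, resolvers: Iterable[Resolver]) -> Optional[str]:
    """Try each source in order and return the first full-text URL found."""
    for resolve in resolvers:
        try:
            url = resolve(doi)
        except Exception:
            # A failing source should not abort the import; move to the next one.
            continue
        if url:
            return url
    return None
```

In use, the resolver list would be ordered from the most general source to the most specific, for example a function wrapping an open-access lookup first and publisher-specific functions last, so that the fallbacks are consulted only when the earlier sources return nothing.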
The developer track presentation will provide a visual breakdown of the pipeline’s architecture, highlight key challenges, and illustrate the solutions implemented. The presentation will complement this by delving deeper into the technical details and lessons learned. Both formats will offer practical insights for repository managers and developers seeking to automate data imports and optimize workflows in institutional repositories.
EPFL_Infoscience-Imports_OR2025_with_videos.pptx | Presentation | Not Applicable (or Unknown) | openaccess | CC BY | 314.44 MB | Microsoft Powerpoint XML | 8877efbbf64809e4af359c7d5b6a1996
EPFL_Infoscience-Imports_OR2025.pdf | Main Document | Not Applicable (or Unknown) | openaccess | CC BY | 2.99 MB | Adobe PDF | a5503a1349e798003c370349d07927f7