Automating Data Imports in a DSpace-CRIS’s Institutional Repository

Rodrigues de Matos, Jorge; Sicot, Julien

doi:10.5075/epfl.20.500.14299/251483

conference presentation

June 17, 2025

The 20th International Conference on Open Repositories

The migration of Infoscience, EPFL’s institutional repository, to DSpace-CRIS required a custom Python-based pipeline to automate the ingestion of research outputs and datasets. Limitations in default DSpace-CRIS import tools, such as insufficient query controls, incomplete metadata mappings, and a lack of deduplication mechanisms, necessitated a tailored approach.

The pipeline leverages the DSpace REST API to enable precise queries, metadata reconciliation, and robust deduplication. It incorporates fallback mechanisms, such as publisher-specific APIs, for full-text retrieval when standard tools like Unpaywall and CrossRef prove insufficient. Key challenges included reconciling authorship with EPFL directories, aligning metadata across diverse collections, and maintaining data consistency during imports.
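The deduplication and fallback ideas described above can be illustrated with a minimal Python sketch. It is not the Infoscience pipeline itself: the record layout (dictionaries with `doi` and `title` keys), the function names, and the matching rules are all assumptions made for illustration, and the resolvers are plain callables standing in for whatever lookups (Unpaywall, Crossref, publisher-specific APIs) a real pipeline would query.

```python
import re
import unicodedata
from typing import Callable, Iterable, Optional

# Hypothetical helpers; none of these names come from the presentation.
DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common resolver or scheme prefixes."""
    return DOI_PREFIX.sub("", doi.strip()).lower()


def title_key(title: str) -> str:
    """Reduce a title to lowercase ASCII alphanumerics for loose matching."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_title = decomposed.encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "", ascii_title.lower())


def is_duplicate(candidate: dict, existing: Iterable[dict]) -> bool:
    """Compare by DOI when both records have one, otherwise by title key."""
    cand_doi = normalize_doi(candidate.get("doi") or "")
    cand_title = title_key(candidate.get("title") or "")
    for record in existing:
        rec_doi = normalize_doi(record.get("doi") or "")
        if cand_doi and rec_doi:
            if cand_doi == rec_doi:
                return True
            continue  # two different DOIs: treat as distinct records
        if cand_title and cand_title == title_key(record.get("title") or ""):
            return True
    return False


# A resolver takes a DOI and returns a full-text URL, or None if it has none.
Resolver = Callable[[str], Optional[str]]


def first_fulltext_url(doi: str, resolvers: Iterable[Resolver]) -> Optional[str]:
    """Try each resolver in order and return the first URL found."""
    for resolve in resolvers:
        url = resolve(doi)
        if url:
            return url
    return None
```

In this sketch the order of the resolver list encodes the fallback policy: general-purpose sources come first and publisher-specific ones last, so the more specialised lookups are only reached when the earlier ones return nothing.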

The developer track presentation will provide a visual breakdown of the pipeline’s architecture, highlight key challenges, and illustrate the solutions implemented. The presentation will complement this by delving deeper into the technical details and lessons learned. Both formats will offer practical insights for repository managers and developers seeking to automate data imports and optimize workflows in institutional repositories.

Name

EPFL_Infoscience-Imports_OR2025_with_videos.pptx

Type

Presentation

Version

Not Applicable (or Unknown)

Access type

openaccess

License Condition

CC BY

Size

314.44 MB

Format

Microsoft Powerpoint XML

Checksum (MD5)

8877efbbf64809e4af359c7d5b6a1996

Name

EPFL_Infoscience-Imports_OR2025.pdf