Graph integration of structured, semistructured and unstructured data for data journalism

Authors: Anadiotis, Angelos Christos; Balalau, Oana; Conceicao, Catarina; Galhardas, Helena; Haddad, Mhd Yamen; Manolescu, Ioana; Merabti, Tayeb; You, Jingmao
Dates: 2022-01-31, 2022-02-01
DOI: 10.1016/j.is.2021.101846
URL: https://infoscience.epfl.ch/handle/20.500.14299/184973
Web of Science ID: WOS:000727723200001

Digital data is a gold mine for modern journalism. However, the datasets which interest journalists are extremely heterogeneous, ranging from highly structured data (relational databases) through semi-structured data (JSON, XML, HTML) and graphs (e.g., RDF) to text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows, especially for dynamically varying sets of data sources.

We describe a complete approach for integrating dynamic sets of heterogeneous datasets into graphs along the lines described above: the challenges we faced in making such graphs useful and in allowing their integration to scale, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments. © 2021 Elsevier Ltd. All rights reserved.

Subjects: Computer Science, Information Systems; Computer Science
Keywords: data journalism; heterogeneous data integration; information extraction; named entity recognition
Type: text::journal::journal article::research article