Infoscience

  • English
  • French
Log In
Logo EPFL, École polytechnique fédérale de Lausanne

Infoscience

  • English
  • French
Log In
  1. Home
  2. Academic and Research Output
  3. Datasets and Code
  4. The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms
 
dataset

The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms

Orlandic, Lara • Teijeiro, Tomas • Atienza Alonso, David
2021
Zenodo

Overview

Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 30,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. Furthermore, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence, which can be used for a wide range of cough audio classification tasks. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world’s most urgent health crises.

Private Set and Testing Protocol

Researchers interested in testing their models on the private test dataset should contact us at coughvid@epfl.ch, briefly explaining the type of validation they wish to perform and the results they obtained through cross-validation with the public data. Access to the unlabeled recordings will then be provided, and the researchers should send back the predictions of their models on these recordings. Finally, the performance metrics of the predictions will be sent to the researchers. The private testing data is not included in any file within our Zenodo record and can only be accessed by contacting the COUGHVID team at the aforementioned e-mail address.

New Semi-Supervised Labeling

The third version of the COUGHVID dataset contains thousands of additional recordings obtained through October 2021. Additionally, the recordings containing coughs were re-labeled according to a semi-supervised learning algorithm that combined the user labels with those of the expert physicians, which were modeled using ML and extended to the previously unlabeled data. These labels can be found in the "status_SSL" column of the "metadata_compiled.csv" file.

For more information about the data collection, pre-processing, validation, and data structure, please refer to the following publication: https://www.nature.com/articles/s41597-021-00937-4

The cough pre-processing and feature extraction code is available from the following c4science repository: https://c4science.ch/diffusion/10770/
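As a rough illustration of how the "status_SSL" column of "metadata_compiled.csv" might be inspected, the following minimal Python sketch counts how many recordings carry each semi-supervised label. It uses only the standard library. The sample rows, the "uuid" and "status" column names, and the label values shown are assumptions made for the example and are not taken from this record; only the file name and the "status_SSL" column name come from the description above.

```python
import csv
import io
from collections import Counter


def count_ssl_labels(csv_file, column="status_SSL"):
    """Count how often each label appears in `column`; blank cells are skipped."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    counts = Counter()
    for row in reader:
        # A short row yields None for missing fields, so fall back to "".
        label = (row[column] or "").strip()
        if label:
            counts[label] += 1
    return counts


# Hypothetical excerpt: rows, column names other than status_SSL, and label
# values are invented for illustration only.
SAMPLE = """uuid,status,status_SSL
rec-001,healthy,healthy
rec-002,,COVID-19
rec-003,symptomatic,symptomatic
rec-004,,
rec-005,healthy,healthy
"""

if __name__ == "__main__":
    print(count_ssl_labels(io.StringIO(SAMPLE)))
    # With the downloaded dataset, the same function could be applied as:
    # with open("metadata_compiled.csv", newline="", encoding="utf-8") as f:
    #     print(count_ssl_labels(f))
```

Recordings with an empty "status_SSL" cell are left out of the counts in this sketch, on the assumption that a blank cell means no semi-supervised label was assigned.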

Type: dataset
DOI: 10.5281/zenodo.4048311
Author(s): Orlandic, Lara • Teijeiro, Tomas • Atienza Alonso, David
Date Issued: 2021
Publisher: Zenodo
Subjects: Diagnostic markers • Respiratory signs and symptoms
EPFL units: ESL
Funder (Grant No.): EU funding (825111) • FNS (200020_182009)
Relation (URL/DOI): IsSupplementTo https://infoscience.epfl.ch/record/286867
Available on Infoscience: January 25, 2023
Use this identifier to reference this record: https://infoscience.epfl.ch/handle/20.500.14299/194272