United-Syn-Med

Abstract

United-MedSyn Dataset

Description

The United-MedSyn dataset is a specialized medical speech dataset designed to evaluate and improve Automatic Speech Recognition (ASR) systems in the healthcare domain. It comprises English medical speech recordings, with a particular focus on medical terminology and clinical conversations. The dataset is suited to speech tasks such as recognition, transcription, and classification, and supports the development of models tailored to medical contexts.

This dataset supports a broad range of applications, including medical documentation automation, transcription of doctor-patient conversations, and medical knowledge extraction from audio data.

Key Features

Language: English (en)
Task Categories:
- Automatic Speech Recognition (ASR)
- Audio Transcription
- Speech Classification
Dataset Size: 100K < n < 1M
License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Usage

This dataset can be used for a wide range of speech-related tasks, particularly within the healthcare domain:

- Training and evaluating ASR models specific to medical terminology.
- Improving speech-to-text accuracy for medical conversations.
- Medical speech classification tasks.

Example Use Cases:

- ASR Model Training: Use the dataset to train models that convert spoken medical language into accurate text transcriptions.
- Medical Documentation: Automate the transcription of clinical conversations or medical dictations.
- Speech Classification: Classify various medical terms or conversation types from audio data.
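
Transcription accuracy for the use cases above is commonly reported as word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal pure-Python sketch of that metric, not part of the dataset's own tooling. The function names `normalize` and `word_error_rate`, the normalization rules, and the example sentences are illustrative assumptions rather than anything shipped with United-MedSyn.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcript pair, not taken from the dataset.
reference = "The patient was prescribed metformin for type 2 diabetes."
hypothesis = "the patient was prescribed met forming for type 2 diabetes"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")  # 2 errors over 9 reference words
```

In this example the drug name is split into two words, which counts as one substitution and one insertion. Errors of this kind on medical terms are what a domain-specific evaluation set is meant to surface, so it may be worth tracking them separately from overall WER.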

License

The dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. You are free to:

- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:

- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Citation

If you use the United-MedSyn dataset in your research, please cite it as follows:

@dataset{united_medsyn,
  title={United-MedSyn: Medical Speech Dataset for ASR},
  author={United We Care},
  year={2024},
  license={CC BY-SA 4.0}
}

For any questions or issues regarding the dataset, please contact us at ayushi@unitedwecare.com.