Dataset Description
The VOiCES corpus is a collaboration between SRI International and Lab41, In-Q-Tel, presenting audio recorded in acoustically challenging conditions. Recordings took place in real rooms of various sizes, capturing different background and reverberation profiles for each room. Various types of distractor noise were simultaneously played with clean speech. Audio was recorded at a distance using various microphones placed throughout the room. To imitate human behavior during conversation, the foreground loudspeaker was placed on a motorized platform that rotated over a range of angles during recordings.
Three hundred distinct speakers from LibriSpeech’s “clean” data subset were selected as the source audio, ensuring a 50-50 female-male split. In preparation for upcoming data challenges, the first release of the VOiCES corpus will include 200 speakers only. The remaining 100 speakers will be reserved for model validation; the full corpus (300 speakers) will be released once the data challenge is closed.
Description of Files
VOiCES_competition
was a release designed for a special competition workshop at InterSpeech 2019. See the README
inside the archive for more information on the structure and arrangement of the dataset.
VOiCES_release
is the full VOiCES dataset with a general purpose directory structure. VOiCES_devkit
is a subset of the data (detailed below) designed for easier experimentation and development. Both VOiCES_release
and VOiCES_devkit
have the same directory structure.
recording_data
contains two files, distances.csv
and quality_metrics.csv
with useful information about the recordings. Both files have a row for every recording.
VOiCES_release
VOiCES_release
contains all recordings from every room and microphone, under all distractor types. Rooms 1 and 2 have 12 mics each, and rooms 3 and 4 have 20 mics each, giving 64 room-microphone combinations. With four distractor conditions (including no distractor), there are 256 VOiCES recordings per source recording. All source recordings have unique transcripts.
Subset | # Examples |
---|---|
Train | 661,248 |
Test | 337,920 |
Total | 999,168 |
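The recording count above can be checked with a few lines of arithmetic. This is only a sketch: the dictionary and variable names are illustrative, and the distractor codes are taken from the file-code table later in this document.

```python
# Microphone counts per room, as stated in the description above.
MICS_PER_ROOM = {"rm1": 12, "rm2": 12, "rm3": 20, "rm4": 20}
# The four distractor conditions, counting "none" as one of them.
DISTRACTOR_TYPES = ["none", "babb", "tele", "musi"]

# One VOiCES recording exists per (room, mic, distractor) combination.
recordings_per_source = sum(MICS_PER_ROOM.values()) * len(DISTRACTOR_TYPES)

# Example counts copied from the table above.
EXAMPLES = {"train": 661_248, "test": 337_920}

# Number of source recordings implied by each subset size.
sources = {name: count // recordings_per_source for name, count in EXAMPLES.items()}

print(recordings_per_source)  # 256
print(sources)  # {'train': 2583, 'test': 1320}
```

Both subset sizes divide evenly by 256, which is consistent with every source recording being captured under every condition.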
VOiCES_devkit
VOiCES_devkit
is a subsample of the full VOiCES_release
dataset. All speakers of the full dataset are retained. For each speaker we randomly selected two LibriSpeech source recordings. For each source recording we retained all of the VOiCES recordings from microphones 1 and 5, the nearest and furthest mics, respectively, from the speaker. All rooms and background distractor types are included, for a total of 32 VOiCES recordings per source recording.
Subset | # Examples |
---|---|
Train | 12,800 |
Test | 6,400 |
Total | 19,200 |
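The devkit sizes follow from the same kind of counting. The sketch below assumes every speaker contributes exactly two source recordings to its subset, as described above; the names are illustrative.

```python
ROOMS = ["rm1", "rm2", "rm3", "rm4"]
DISTRACTOR_TYPES = ["none", "babb", "tele", "musi"]
DEVKIT_MICS = [1, 5]  # nearest and furthest microphones
SOURCES_PER_SPEAKER = 2

# Recordings kept for each source recording in the devkit.
recordings_per_source = len(ROOMS) * len(DISTRACTOR_TYPES) * len(DEVKIT_MICS)

# Example counts copied from the table above.
EXAMPLES = {"train": 12_800, "test": 6_400}

# Number of speakers implied by each subset size.
speakers = {
    name: count // (recordings_per_source * SOURCES_PER_SPEAKER)
    for name, count in EXAMPLES.items()
}

print(recordings_per_source)  # 32
print(speakers)  # {'train': 200, 'test': 100}
```

The implied split of 200 and 100 speakers matches the 300 speakers mentioned in the dataset description.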
recording_data.tar.gz
recording_data.tar.gz
contains two files, distances.csv
and quality_metrics.csv
, with auxiliary information about the VOiCES recordings. Each file has one row per recording.
distances.csv
Each row in this file contains the distance (in inches) from the foreground speaker, each of the distractor speakers, and the floor to the microphone for a given recording. Specifically, it has the following columns:
Column | Datatype | Description |
---|---|---|
distractor 1 | integer | Inches from mic to 1st distractor speaker |
distractor 2 | integer | Inches from mic to 2nd distractor speaker |
distractor 3 | integer | Inches from mic to 3rd distractor speaker |
floor | integer | Inches from mic to floor |
foreground | integer | Inches from mic to source/foreground speaker |
query_name | string | The recording filename without directory path or extension, useful as a key to join with other tables (e.g. index files) |
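A minimal sketch of reading this file with the standard library and converting inches to meters. It assumes the CSV header uses the column names exactly as listed above; the sample row is illustrative only (values taken from the Room-1 distance table for mic 01, paired with the example filename from the directory listing), and `load_distances` is a hypothetical helper.

```python
import csv
import io

INCHES_TO_METERS = 0.0254

# Illustrative sample in the documented column layout (assumed header names).
SAMPLE = """distractor 1,distractor 2,distractor 3,floor,foreground,query_name
71,71,53,42,38,Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150
"""


def load_distances(handle):
    """Return {query_name: {column: distance in meters}}."""
    table = {}
    for row in csv.DictReader(handle):
        name = row.pop("query_name")
        table[name] = {col: int(val) * INCHES_TO_METERS for col, val in row.items()}
    return table


distances = load_distances(io.StringIO(SAMPLE))
for name, columns in distances.items():
    print(name, round(columns["foreground"], 4))
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle for `distances.csv`.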
quality_metrics.csv
This file contains a number of precomputed measures of speech quality or intelligibility for each recording. Intrusive methods use the original LibriSpeech source audio as the ground truth. For recordings where the VOiCES audio does not fully contain the source audio (detectable by comparing source_length and noisy_length in the index files), all quality metrics are set to -1.
Column | Datatype | Description |
---|---|---|
query_name | string | The recording filename without directory path or extension, useful as a key to join with other tables (e.g. index files) |
pesq nb | float | Perceptual Evaluation of Speech Quality, computed using python-pesq with narrow band setting |
pesq wb | float | Perceptual Evaluation of Speech Quality, computed using python-pesq with wide band setting |
STOI | float | Short Time Objective Intelligibility Measure, computed using pystoi |
SIIB | float | Speech Intelligibility in Bits computed using pySIIB with all default settings. |
SRMR | float | Normalized speech-to-reverberation modulation energy ratio, computed using SRMRpy with norm=True. |
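Because -1 is a sentinel rather than a measurement, rows carrying it should be dropped before averaging or plotting. The sketch below assumes the CSV header matches the column names above; the sample values and the `load_valid_metrics` helper are made up for illustration.

```python
import csv
import io

METRIC_COLUMNS = ["pesq nb", "pesq wb", "STOI", "SIIB", "SRMR"]

# Made-up sample rows; rec_b uses the -1 sentinel described above.
SAMPLE = """query_name,pesq nb,pesq wb,STOI,SIIB,SRMR
rec_a,2.1,1.6,0.71,55.0,0.42
rec_b,-1,-1,-1,-1,-1
"""


def load_valid_metrics(handle):
    """Return {query_name: metrics}, skipping rows where every metric is -1."""
    valid = {}
    for row in csv.DictReader(handle):
        values = {col: float(row[col]) for col in METRIC_COLUMNS}
        if all(value == -1 for value in values.values()):
            continue  # sentinel row: metrics were not computed
        valid[row["query_name"]] = values
    return valid


metrics = load_valid_metrics(io.StringIO(SAMPLE))
print(sorted(metrics))
```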
Source audio references
Source audio references, per LibriSpeech, are provided in three different tables as follows:
Information on the speaker ID, book ID, and chapter ID
Lab41-SRI-VOiCES-speaker-book-chapter.tbl
Speaker ID, gender, and LibriSpeech data subset
Lab41-SRI-VOiCES-speaker-gender-dataset.tbl
Orthographic transcription of all audio files
Lab41-SRI-VOiCES.refs
Data format
Audio files are provided in WAV format at a 16 kHz sample rate with 16-bit precision. All filenames begin with the corpus name Lab41-SRI-VOiCES. Source audio filenames specify the speaker, chapter, and chapter segment identification numbers. The naming format is shown below:
Lab41-SRI-VOiCES-src-sp< speaker_ID >-ch< chapter_ID >-sg< segment_ID >.wav
The naming convention for audio recorded at a distance includes all the above information, with additional descriptors for room, distractor noise, microphone type, microphone location, and position of foreground loudspeaker in degrees. The file naming format is shown below:
Lab41-SRI-VOiCES-< room >-< distractor_noise >-sp< speaker_ID >-ch< chapter_ID >-sg< segment_ID >-mc< mic_ID >-< mic_type >-< mic_location >-dg< degree >.wav
Audio files to characterize the room response are also available:
Lab41-SRI-VOiCES-< room >-< signal >-mc< mic_ID >-< mic_type >-< mic_location >.wav
As are recordings of distractor noise only or ambient room background only:
Lab41-SRI-VOiCES-< room >-< distractor_noise >-mc< mic_ID >-< mic_type >-< mic_location >.wav
Possible descriptors for room, distractor noise, microphone type, and microphone location are shown in the table below.
File Code | Type | Definition |
---|---|---|
rm1 | Room | Room-1: dimensions 146” x 107” (x 107” height) |
rm2 | Room | Room-2: dimensions 225” x 158” (x 109” height) |
src | Source audio | Source audio for foreground speaker |
none | Distractor noise | No distractor noise played |
musi | Distractor noise | Music distractor noise played |
tele | Distractor noise | Television distractor noise played |
babb | Distractor noise | Babble distractor noise played |
stu | Mic type | Cardioid dynamic studio microphone |
lav | Mic type | Omnidirectional condenser lavalier microphone |
clo | Mic location | Closest to foreground speaker, on table |
mid | Mic location | Mid-distance to foreground speaker, on table |
far | Mic location | Farthest from foreground speaker, on stand |
beh | Mic location | Behind foreground speaker, on stand |
cec | Mic location | Overhead on ceiling, clear |
ceo | Mic location | Overhead on ceiling, fully obstructed |
tbo | Mic location | Partially obstructed - table |
wal | Mic location | Fully obstructed - wall |
ds1 | Mic location | Near distractor 1 |
ds2 | Mic location | Near distractor 2 |
ds3 | Mic location | Near distractor 3 |
tbc | Mic location | Mid-distance, on table |
sho | Mic location | In cupboard, fully obstructed |
ref | Mic location | Across the room, near refrigerator |
obs | Mic location | Fully/partially obstructed in wall/ceiling |
impulse | Signal | Two seconds with transient sound in middle, for room response |
swoop | Signal | Rising tone for 20 seconds, for room response |
tone | Signal | Steady tone for 15 seconds, for room response |
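The naming convention for distant speech recordings can be unpacked with a regular expression. This is a sketch rather than an official parser: it assumes the `sg` segment prefix seen in the example filenames, numeric IDs, and an optional `.wav` extension. `SPEECH_NAME` and `parse_speech_name` are illustrative names.

```python
import re

# Pattern for distant speech recordings, one named group per descriptor.
SPEECH_NAME = re.compile(
    r"^Lab41-SRI-VOiCES"
    r"-(?P<room>rm\d)"
    r"-(?P<distractor>none|musi|tele|babb)"
    r"-sp(?P<speaker>\d+)"
    r"-ch(?P<chapter>\d+)"
    r"-sg(?P<segment>\d+)"
    r"-mc(?P<mic>\d+)"
    r"-(?P<mic_type>[a-z]+)"
    r"-(?P<mic_location>[a-z0-9]+)"
    r"-dg(?P<degree>\d+)"
    r"(?:\.wav)?$"
)


def parse_speech_name(name):
    """Split a distant speech filename into its descriptors."""
    match = SPEECH_NAME.match(name)
    if match is None:
        raise ValueError(f"not a distant speech filename: {name!r}")
    fields = match.groupdict()
    for key in ("speaker", "chapter", "segment", "mic", "degree"):
        fields[key] = int(fields[key])
    return fields


info = parse_speech_name(
    "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150.wav"
)
print(info["room"], info["distractor"], info["speaker"], info["mic_location"], info["degree"])
```

Source audio, room-response, and distractor-only files use shorter patterns and would need their own expressions.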
All the data is contained in two main folders: distant-16k, containing all the audio recordings, and source-16k, containing the audio files used from LibriSpeech, corrected for DC offset and normalized to each file’s peak amplitude. The WAV files for the source audio are organized in subdirectories by speaker ID. The distant-16k has three main subdirectories:
- distractors : distractor noise recordings with no foreground audio for all rooms
- room-response : recorded sound to determine room-response for all rooms
- speech : for each room, recordings of foreground audio with babble, music, television or no distractor noise, arranged by speaker ID in each subfolder.
Directory Structure
There are three top-level directories in the root folder of VOiCES_release
and VOiCES_devkit
: references
, source-16k
, and distant-16k
. The contents of these directories are detailed below.
references
contains a number of files with information about the dataset. All necessary information is gathered in test_index.csv
and train_index.csv
, and the other files are redundant.
- references/
  - filename_transcripts
  - Lab41-SRI-VOiCES-speaker-book-chapter.tbl
  - Lab41-SRI-VOiCES-speaker-gender-dataset.tbl
  - test_index.csv
  - Test_Set_Speakers.csv
  - time_values.csv
  - train_index.csv
source-16k
contains the original LibriSpeech source audio. The train-test split is the same as for the VOiCES data. The audio files are separated by speaker ID. For example, all audio from speaker 115 is stored in sp0115/
.
- source-16k/
  - test/
    - sp0115/
      - Lab41-SRI-VOiCES-src-sp0115-ch121720-sg0008.wav
      - ...
    - ...
  - train/
distant-16k
contains the VOiCES data. There are subdirectories for the audio files used to create the distractor sounds, as well as for room responses when test sounds (impulse, swoop, and tone) are played.
- distant-16k/
  - distractors/
    - rm1/
      - babb/
        - Lab41-SRI-VOiCES-rm1-babb-mc01-stu-clo.wav
        - ...
      - ...
    - ...
  - room_response/
    - rm1/
      - impulse/
        - Lab41-SRI-VOiCES-rm1-impulse-mc01-stu-clo.wav
        - ...
      - swoop/
      - tone/
    - ...
  - speech/
    - test/
    - train/
      - rm1/
        - babb/
          - sp0032/
            - Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150.wav
            - ...
          - ...
        - ...
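The layout above makes it possible to build the relative path of a speech recording from its descriptors. The sketch below assumes speaker directories are zero-padded to four digits, as in the sp0032 and sp0115 examples; `speech_path` is an illustrative helper, not part of the dataset tooling.

```python
from pathlib import PurePosixPath


def speech_path(subset, room, distractor, speaker_id, filename):
    """Return the path of a speech recording relative to the dataset root."""
    return PurePosixPath(
        "distant-16k",
        "speech",
        subset,                  # "train" or "test"
        room,                    # e.g. "rm1"
        distractor,              # e.g. "babb"
        f"sp{speaker_id:04d}",   # assumed four-digit zero padding
        filename,
    )


path = speech_path(
    "train",
    "rm1",
    "babb",
    32,
    "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150.wav",
)
print(path)
```

In practice the `filename` column of the index files already stores this path, so a helper like this is mainly useful for sanity checks.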
Index Files
In the references
directory there are two csv files, train_index.csv
and test_index.csv
, that serve as index files for the training and test sets, respectively. Each file contains a single row for each recording in the given subset of the data. Both files have the following columns.
Column | Datatype | Description |
---|---|---|
index | integer | Unique index for recording |
chapter | integer | LibriSpeech chapter ID |
degrees | integer | Angle (in degrees) between source speaker and mic |
distractor | string | Distractor type, options are ‘none’, ‘babb’, ‘tele’, ‘musi’ |
filename | string | Path to recording .wav, relative to root directory |
gender | string | Speaker gender, options are ‘M’ and ‘F’ |
mic | integer | The mic used for this recording |
query_name | string | The filename without directory path or extension |
room | string | The room recorded in, options are ‘rm1’, ‘rm2’, ‘rm3’, ‘rm4’ |
segment | integer | LibriSpeech segment ID |
source | string | Path to .wav file for LibriSpeech source audio for this recording |
speaker | integer | LibriSpeech speaker ID |
transcript | string | Orthographic transcript of the LibriSpeech source audio |
noisy_length | integer | Length of recording in samples |
noisy_sr | integer | Sample rate (Hz) of recording |
noisy_time | float | Duration of recording in seconds |
source_length | integer | Length of LibriSpeech source audio in samples |
source_sr | integer | Sample rate (Hz) of LibriSpeech source audio |
source_time | float | Duration of LibriSpeech source audio in seconds |
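The length columns can be used to flag recordings that are shorter than their source audio, which the quality metrics description ties to -1 entries. The sketch below assumes a recording "fully contains" its source when it is at least as long in seconds; that criterion, the sample rows, and `split_by_coverage` are illustrative assumptions.

```python
import csv
import io

# Made-up sample rows using a subset of the documented index columns.
SAMPLE = """index,query_name,noisy_length,noisy_sr,source_length,source_sr
0,rec_a,240000,16000,236800,16000
1,rec_b,200000,16000,236800,16000
"""


def split_by_coverage(handle):
    """Return (complete, truncated) lists of query names."""
    complete, truncated = [], []
    for row in csv.DictReader(handle):
        # Durations in seconds, derived from sample counts and sample rates.
        noisy_time = int(row["noisy_length"]) / int(row["noisy_sr"])
        source_time = int(row["source_length"]) / int(row["source_sr"])
        if noisy_time >= source_time:
            complete.append(row["query_name"])
        else:
            truncated.append(row["query_name"])
    return complete, truncated


complete, truncated = split_by_coverage(io.StringIO(SAMPLE))
print(complete, truncated)
```

The resulting names can be joined with `quality_metrics.csv` or `distances.csv` through the shared `query_name` key.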
Rooms 1 and 2
Microphone Details
Microphone identification numbers are unique to a specific microphone location and type, defined below.
Mic_ID | Location | Model | Type |
---|---|---|---|
01 | clo | SHURE SM58 | stu |
02 | clo | AKG 417L | lav |
03 | mid | SHURE SM58 | stu |
04 | mid | AKG 417L | lav |
05 | far | SHURE SM58 | stu |
06 | far | AKG 417L | lav |
07 | beh | SHURE SM58 | stu |
08 | beh | AKG 417L | lav |
09 | tbo | AKG 417L | lav |
10 | cec | AKG 417L | lav |
11 | ceo | AKG 417L | lav |
12 | wal | SHURE SM11 | lav |
Distance (inches) between microphones and loudspeakers or floor, for Room-1 and Room-2 recordings.
Mic_ID | Foreground (rm-1) | Foreground (rm-2) | Distractor 1 (rm-1) | Distractor 1 (rm-2) | Distractor 2 (rm-1) | Distractor 2 (rm-2) | Distractor 3 (rm-1) | Distractor 3 (rm-2) | Floor (rm-1) | Floor (rm-2) |
---|---|---|---|---|---|---|---|---|---|---|
01 | 38 | 80 | 71 | 112 | 71 | 84 | 53 | 64 | 42 | 39 |
02 | 38 | 80 | 71 | 112 | 71 | 84 | 53 | 64 | 42 | 39 |
03 | 72 | 131 | 35 | 81 | 56 | 58 | 52 | 95 | 42 | 39 |
04 | 72 | 131 | 35 | 81 | 56 | 58 | 52 | 95 | 42 | 39 |
05 | 119 | 228 | 72 | 101 | 33 | 104 | 83 | 186 | 70 | 70 |
06 | 119 | 228 | 72 | 101 | 33 | 104 | 83 | 186 | 70 | 70 |
07 | 29 | 29 | 115 | 193 | 133 | 170 | 94 | 94 | 70 | 70 |
08 | 29 | 29 | 115 | 193 | 133 | 170 | 94 | 94 | 70 | 70 |
09 | 58 | 109 | 64 | 98 | 60 | 65 | 49 | 82 | 28 | 25 |
10 | 75 | 128 | 90 | 107 | 108 | 103 | 106 | 104 | 105 | 105 |
11 | 75 | 128 | 90 | 107 | 108 | 103 | 106 | 104 | 106 | 106 |
12 | 130 | 116 | 861 | 116 | 40 | 115 | 81 | 164 | 12 | 10 |
Rooms 3 and 4
Microphone Details
Microphone identification numbers are unique to a specific microphone location and type, defined below.
Mic_ID | Location | Model | Type |
---|---|---|---|
01 | clo | SHURE SM58 | stu |
02 | clo | AKG 417L | lav |
03 | mid | SHURE SM58 | stu |
04 | mid | AKG 417L | lav |
05 | far | SHURE SM58 | stu |
06 | far | AKG 417L | lav |
07 | beh | SHURE SM58 | stu |
08 | beh | AKG 417L | lav |
09 | tbo | AKG 417L | lav |
10 | cec | AKG 417L | lav |
11 | ceo | AKG 417L | lav |
12 | wal | SHURE SM11 | lav |
13 | ds1 | ATR1500 | stu |
14 | ds2 | ATR1500 | stu |
15 | ds3 | ATR1500 | stu |
16 | tbc | ATR4697 | bar |
17 | sho | L41 | mem |
18 | clo | ADA I2S | mem |
19 | ref | ADA I2S | mem |
20 | obs | ADA I2S | mem |
Distance (inches) between microphones and loudspeakers or floor, for Room-3 and Room-4 recordings. For microphones 13, 14 and 15 the distances are reported first for non-babble sessions, and then for babble sessions.
Mic_ID | Foreground (rm-3) | Foreground (rm-4) | Distractor 1 (rm-3) | Distractor 1 (rm-4) | Distractor 2 (rm-3) | Distractor 2 (rm-4) | Distractor 3 (rm-3) | Distractor 3 (rm-4) | Floor (rm-3) | Floor (rm-4) |
---|---|---|---|---|---|---|---|---|---|---|
01 | 67 | 72 | 179 | 291 | 170 | 222 | 141 | 79 | 41 | 41 |
02 | 67 | 72 | 179 | 291 | 170 | 222 | 141 | 79 | 41 | 41 |
03 | 146 | 167 | 117 | 200 | 157 | 150 | 106 | 126 | 41 | 41 |
04 | 146 | 167 | 117 | 200 | 157 | 150 | 106 | 126 | 41 | 41 |
05 | 281 | 387 | 103 | 76 | 207 | 106 | 165 | 306 | 67 | 71 |
06 | 281 | 387 | 103 | 76 | 207 | 106 | 165 | 306 | 67 | 71 |
07 | 58 | 71 | 292 | 367 | 252 | 420 | 232 | 167 | 67 | 70 |
08 | 58 | 71 | 292 | 367 | 252 | 420 | 232 | 167 | 67 | 67 |
09 | 88 | 128 | 163 | 241 | 165 | 183 | 129 | 101 | 27 | 27 |
10 | 176 | 173 | 135 | 208 | 174 | 159 | 135 | 128 | 126 | 95 |
11 | 176 | 175 | 135 | 210 | 174 | 155 | 135 | 130 | 125 | 97 |
12 | 262 | 380 | 54 | 39 | 157 | 131 | 188 | 128 | 10 | 12 |
13 | 230/224 | 259/337 | 10/30 | 10/30 | 107/121 | 42/92 | 196/175 | 277/259 | 42 | 42 |
14 | 219/178 | 344/305 | 23/61 | 23/61 | 108/95 | 42/62 | 185/176 | 263/224 | 42 | 42 |
15 | 201/195 | 334/331 | 40/60 | 40/61 | 102/55 | 42/51 | 176/213 | 151/242 | 42 | 42 |
16 | 108 | 128 | 144 | 241 | 156 | 179 | 120 | 128 | 28 | 28 |
17 | 151 | 353 | 145 | 40 | 72 | 115 | 238 | 281 | 22 | 14 |
18 | 75 | 78 | 176 | 287 | 175 | 216 | 132 | 74 | 30 | 30 |
19 | 236 | 286 | 87 | 208 | 36 | 232 | 267 | 273 | 38 | 1 |
20 | 250 | 407 | 173 | 68 | 261 | 130 | 80 | 320 | 45 | 97 |
Licensing
VOiCES is publicly available under the Creative Commons BY 4.0 license, free for commercial, academic, and government use. Please reference VOiCES when using the data in publications.