Detailed VOiCES Readme

Voices Obscured in Complex Environmental Settings

Dataset Description

The VOiCES corpus is a collaboration between SRI International and Lab41, In-Q-Tel, presenting audio recorded in acoustically challenging conditions. Recordings took place in real rooms of various sizes, capturing different background and reverberation profiles for each room. Various types of distractor noise were simultaneously played with clean speech. Audio was recorded at a distance using various microphones placed throughout the room. To imitate human behavior during conversation, the foreground loudspeaker was placed on a motorized platform that rotated over a range of angles during recordings.

Three hundred distinct speakers from LibriSpeech’s “clean” data subset were selected as the source audio, ensuring a 50-50 female-male split. In preparation for upcoming data challenges, the first release of the VOiCES corpus will include 200 speakers only. The remaining 100 speakers will be reserved for model validation; the full corpus (300 speakers) will be released once the data challenge is closed.

Description of Files

VOiCES_competition was a release designed for a special competition workshop at InterSpeech 2019. See the README inside the archive for more information on the structure and arrangement of the dataset.

VOiCES_release is the full VOiCES dataset with a general purpose directory structure. VOiCES_devkit is a subset of the data (detailed below) designed for easier experimentation and development. Both VOiCES_release and VOiCES_devkit have the same directory structure.

recording_data contains two files, distances.csv and quality_metrics.csv with useful information about the recordings. Both files have a row for every recording.

VOiCES_release

VOiCES_release contains all recordings from each room, mic, and under all distractor types. Rooms 1 and 2 have 12 mics, and rooms 3 and 4 have 20 mics. As there are four types of distractor noises, there are 256 VOiCES recordings per source recording. All source recordings have unique transcripts.

Subset # Examples
Train 661,248
Test 337,920
Total 999,168


VOiCES_devkit

VOiCES_devkit is a subsample of the full VOiCES_release dataset. All of the speakers of the full dataset are retained. For each speaker we randomly selected two LibriSpeech source recordings. For each source recording we retained all of the VOiCES recordings from microphones 1 and 5, the nearest and furthest mics, respectively, from the speaker. All rooms and background distractor types are included, for a total of 32 VOiCES recordings per source recording.

Subset # Examples
Train 12,800
Test 6,400
Total 19,200
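
The per-source counts quoted for VOiCES_release (256) and VOiCES_devkit (32) follow directly from the number of mics, rooms, and distractor types. The short sketch below reproduces that arithmetic and derives the implied number of source recordings in each subset; the variable and function names are illustrative only.

```python
# Microphones per room, as stated in this README.
MICS_PER_ROOM = {"rm1": 12, "rm2": 12, "rm3": 20, "rm4": 20}
DISTRACTORS = ("none", "musi", "tele", "babb")


def recordings_per_source(mics_per_room, n_distractors):
    """Number of VOiCES recordings made from one source recording."""
    return sum(mics_per_room.values()) * n_distractors


release_per_source = recordings_per_source(MICS_PER_ROOM, len(DISTRACTORS))
# The devkit keeps only mics 1 and 5 in every room.
devkit_mics = {room: 2 for room in MICS_PER_ROOM}
devkit_per_source = recordings_per_source(devkit_mics, len(DISTRACTORS))

print(release_per_source)  # 256
print(devkit_per_source)   # 32

# Implied number of source recordings in each subset (train, test).
print(661_248 // release_per_source, 337_920 // release_per_source)  # 2583 1320
print(12_800 // devkit_per_source, 6_400 // devkit_per_source)       # 400 200
```

The devkit figures add up to 600 source recordings, which is consistent with two source recordings for each of the 300 speakers.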

recording_data.tar.gz

recording_data.tar.gz contains two files, distances.csv and quality_metrics.csv, with auxiliary information for each recording in the VOiCES dataset, with a row for each recording.

distances.csv

Each row in this file contains the distance (in inches) from the foreground speaker, each of the distractor speakers, and the floor to the microphone for a given recording. Specifically, it has the following columns:

Column Datatype Description
distractor 1 integer Inches from mic to 1st distractor speaker
distractor 2 integer Inches from mic to 2nd distractor speaker
distractor 3 integer Inches from mic to 3rd distractor speaker
floor integer Inches from mic to floor
foreground integer Inches from mic to source/foreground speaker
query_name string The recording filename without directory path or extension, useful as a key to join with other tables (e.g. index files)
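
As a minimal sketch of how distances.csv might be read, the example below loads the file with the standard csv module, keys each row by query_name, and converts inches to centimetres. The header spellings are assumed to match the column names listed above and may differ in the actual file; the one-row sample uses the Room-1, mic 01 distances from the distance table in this README.

```python
import csv
import io

INCH_TO_CM = 2.54

# Assumed header names, taken from the column list above; the header in the
# real distances.csv may be spelled differently.
DISTANCE_COLUMNS = ("distractor 1", "distractor 2", "distractor 3",
                    "floor", "foreground")


def load_distances_cm(csv_file):
    """Map query_name -> {column: distance in centimetres}."""
    table = {}
    for row in csv.DictReader(csv_file):
        table[row["query_name"]] = {
            col: int(row[col]) * INCH_TO_CM for col in DISTANCE_COLUMNS
        }
    return table


# Illustrative one-row file standing in for distances.csv.
sample = io.StringIO(
    "distractor 1,distractor 2,distractor 3,floor,foreground,query_name\n"
    "71,71,53,42,38,"
    "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150\n"
)
distances = load_distances_cm(sample)
```

Because query_name is shared with the index files, the resulting dictionary can be used to look up distances for any row of train_index.csv or test_index.csv.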

quality_metrics.csv

This file contains a number of precomputed measures of speech quality or intelligibility for each recording. Intrusive methods use the original LibriSpeech source audio as the ground truth. For recordings where the VOiCES audio does not fully contain the source audio (detectable by comparing source_length and noisy_length in the index files), all quality metrics are set to -1.

Column Datatype Description
query_name string The recording filename without directory path or extension, useful as a key to join with other tables (e.g. index files)
pesq nb float Perceptual Evaluation of Speech Quality, computed using python-pesq with narrow band setting
pesq wb float Perceptual Evaluation of Speech Quality, computed using python-pesq with wide band setting
STOI float Short Time Objective Intelligibility Measure, computed using pystoi
SIIB float Speech Intelligibility in Bits computed using pySIIB with all default settings.
SRMR float Normalized speech-to-reverberation modulation energy ratio, computed using SRMRpy with norm=True.
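
Since rows with all metrics set to -1 are placeholders rather than measurements, they should usually be dropped before computing statistics. The sketch below shows one way to do that with the standard csv module. The header spellings are assumed to match the column names above, and the sample rows contain made-up placeholder numbers, not real measurements.

```python
import csv
import io

# Assumed header names, taken from the column list above.
METRIC_COLUMNS = ("pesq nb", "pesq wb", "STOI", "SIIB", "SRMR")


def load_valid_metrics(csv_file):
    """Return {query_name: metrics}, skipping rows whose metrics are all -1."""
    valid = {}
    for row in csv.DictReader(csv_file):
        metrics = {col: float(row[col]) for col in METRIC_COLUMNS}
        if all(value == -1.0 for value in metrics.values()):
            continue  # sentinel row: source audio not fully contained
        valid[row["query_name"]] = metrics
    return valid


# Made-up rows for illustration only; names and values are hypothetical.
sample = io.StringIO(
    "query_name,pesq nb,pesq wb,STOI,SIIB,SRMR\n"
    "example_ok,2.5,2.0,0.8,40.0,3.0\n"
    "example_truncated,-1,-1,-1,-1,-1\n"
)
valid = load_valid_metrics(sample)
print(sorted(valid))  # ['example_ok']
```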

Source audio references

Source audio references, per LibriSpeech, are provided in three different tables as follows:

Information on the speaker ID, book ID, and chapter ID

Lab41-SRI-VOiCES-speaker-book-chapter.tbl

Speaker ID, gender, and LibriSpeech data subset

Lab41-SRI-VOiCES-speaker-gender-dataset.tbl

Orthographic transcription of all audio files

Lab41-SRI-VOiCES.refs

Data format

Audio files are available in WAV format with 16 kHz sample rate with 16-bit precision. All files begin with the corpus name Lab41-SRI-VOiCES. Source audio files specify speaker, chapter, and chapter segment identification number. The file naming format sample is shown below:

Lab41-SRI-VOiCES-src-sp< speaker_ID >-ch< chapter_ID >-sg< segment_ID >.wav
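
The stated audio format (16 kHz, 16-bit WAV) can be checked from a file's header with Python's standard wave module. The sketch below is self-contained: it writes one second of silence to memory and checks it, so no dataset file is needed. The helper name check_wav_format is illustrative, and the mono channel count in the demo is an assumption, since this README does not state the number of channels.

```python
import io
import wave


def check_wav_format(wav_file, expected_rate=16_000, expected_bytes=2):
    """Return True if the WAV header reports 16 kHz, 16-bit samples."""
    with wave.open(wav_file, "rb") as wav:
        return (wav.getframerate() == expected_rate
                and wav.getsampwidth() == expected_bytes)


# Demo input: one second of silence written to an in-memory buffer.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)      # assumption: mono
    wav.setsampwidth(2)      # 2 bytes = 16-bit samples
    wav.setframerate(16_000)
    wav.writeframes(b"\x00\x00" * 16_000)
buffer.seek(0)

print(check_wav_format(buffer))  # True
```

To check a file from the corpus, pass its path instead of the in-memory buffer.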

The naming convention for audio recorded at a distance includes all the above information, with additional descriptors for room, distractor noise, microphone type, microphone location, and position of foreground loudspeaker in degrees. The file naming format is shown below:

Lab41-SRI-VOiCES-< room >-< distractor_noise >-sp< speaker_ID >-ch< chapter_ID >-sg< segment_ID >-mc< mic_ID >-< mic_type >-< mic_location >-dg< degree >.wav

Audio files to characterize the room response are also available:

Lab41-SRI-VOiCES-< room >-< signal >-mc< mic_ID >-< mic_type >-< mic_location >.wav

As are recordings of distractor noise only or ambient room background only:

Lab41-SRI-VOiCES-< room >-< distractor_noise >-mc< mic_ID >-< mic_type >-< mic_location >.wav
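
The naming convention for distant speech recordings can be unpacked with a regular expression. The sketch below is one possible parser, not an official tool: the pattern and the function name parse_speech_name are illustrative, and it assumes the sg segment prefix and the zero-padded numeric IDs seen in the example filenames of the directory listing.

```python
import re

# Pattern for distant speech recordings; the ".wav" suffix is optional so the
# same parser also accepts query_name values from the CSV files.
SPEECH_NAME = re.compile(
    r"Lab41-SRI-VOiCES-(?P<room>rm\d)-(?P<distractor>[a-z]+)"
    r"-sp(?P<speaker>\d+)-ch(?P<chapter>\d+)-sg(?P<segment>\d+)"
    r"-mc(?P<mic>\d+)-(?P<mic_type>[a-z]+)-(?P<mic_location>[a-z]+)"
    r"-dg(?P<degrees>\d+)(?:\.wav)?$"
)

NUMERIC_FIELDS = ("speaker", "chapter", "segment", "mic", "degrees")


def parse_speech_name(name):
    """Split a distant-speech filename into its descriptor fields."""
    match = SPEECH_NAME.match(name)
    if match is None:
        raise ValueError(f"not a VOiCES speech filename: {name!r}")
    fields = match.groupdict()
    for key in NUMERIC_FIELDS:
        fields[key] = int(fields[key])
    return fields


info = parse_speech_name(
    "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150.wav"
)
print(info["room"], info["distractor"], info["speaker"], info["degrees"])
# rm1 babb 32 150
```

Room-response and distractor-only files follow shorter patterns and would need their own expressions.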


Possible descriptors for room, distractor noise, microphone type, and microphone location are shown in the table below.

File Code Type Definition
rm1 Room Room-1: dimensions 146” x 107” (x 107” height)
rm2 Room Room-2: dimensions 225” x 158” (x 109” height)
src Source audio Source audio for foreground speaker
none Distractor noise No distractor noise played
musi Distractor noise Music distractor noise played
tele Distractor noise Television distractor noise played
babb Distractor noise Babble distractor noise played
stu Mic type Cardioid dynamic studio microphone
lav Mic type Omnidirectional condenser lavalier microphone
clo Mic location Closest to foreground speaker - on table
mid Mic location Mid-distance to foreground speaker - on table
far Mic location Farthest from foreground speaker - on stand
beh Mic location Behind foreground speaker - on stand
cec Mic location Overhead on ceiling, clear
ceo Mic location Overhead on ceiling, fully obstructed
tbo Mic location Partially obstructed - table
wal Mic location Fully obstructed - wall
ds1 Mic location Near distractor 1
ds2 Mic location Near distractor 2
ds3 Mic location Near distractor 3
tbc Mic location Mid-distance, on table
sho Mic location In cupboard, fully obstructed
ref Mic location Across the room, near refrigerator
obs Mic location Fully/partially obstructed in wall/ceiling
impulse Signal Two seconds with transient sound in middle, for room response
swoop Signal Rising tone for 20 seconds, for room response
tone Signal Steady tone for 15 seconds, for room response


All the data is contained in two main folders: distant-16k, containing all the audio recordings, and source-16k, containing the audio files used from LibriSpeech, corrected for DC offset and normalized to each file’s peak amplitude. The WAV files for the source audio are organized in subdirectories by speaker ID. The distant-16k folder has three main subdirectories: distractors, room_response, and speech.


Directory Structure

There are three top-level directories in the root folder of VOiCES_release and VOiCES_devkit: references, source-16k, and distant-16k. The contents of these directories are detailed below.

references contains a number of files with information about the dataset. All necessary information is gathered in test_index.csv and train_index.csv, and other files are redundant.

- references/
  - filename_transcripts
  - Lab41-SRI-VOiCES-speaker-book-chapter.tbl
  - Lab41-SRI-VOiCES-speaker-gender-dataset.tbl
  - test_index.csv
  - Test_Set_Speakers.csv
  - time_values.csv
  - train_index.csv

source-16k contains the original LibriSpeech source audio. The train-test split is the same as for the VOiCES data. The audio files are separated by speaker ID. For example, all audio from speaker 115 is stored in sp0115/.

- source-16k/
  - test/
    - sp0115/
      - Lab41-SRI-VOiCES-src-sp0115-ch121720-sg0008.wav
      - ...
    - ...
  - train/

distant-16k contains the VOiCES data. There are subdirectories for the audio files used to create the distractor sounds, as well as for room responses when test sounds (impulse, swoop, and tone) are played.

- distant-16k/
  - distractors/
    - rm1/
      - babb/
        - Lab41-SRI-VOiCES-rm1-babb-mc01-stu-clo.wav
        - ...
      - ...
    - ...
  - room_response/
    - rm1/
      - impulse/
        - Lab41-SRI-VOiCES-rm1-impulse-mc01-stu-clo.wav
        - ...
      - swoop/
      - tone/
    - ...
  - speech/
    - test/
    - train/
      - rm1/
        - babb/
          - sp0032/
            - Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150.wav
          - ...
        - ...
      - ...
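
The layout above implies that the location of a speech recording can be assembled from its subset, room, distractor type, and speaker ID. The following sketch builds such a relative path; the function name speech_path is illustrative, and the four-digit zero padding of the speaker directory is an assumption based on the sp0032 and sp0115 examples.

```python
from pathlib import PurePosixPath


def speech_path(subset, room, distractor, speaker_id, filename):
    """Build the path of a speech recording relative to the dataset root."""
    return PurePosixPath(
        "distant-16k", "speech", subset, room, distractor,
        f"sp{speaker_id:04d}",  # assumed 4-digit zero padding
        filename,
    )


path = speech_path(
    "train", "rm1", "babb", 32,
    "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150.wav",
)
print(path.parent)  # distant-16k/speech/train/rm1/babb/sp0032
```

In practice the filename column of the index files already gives this path, so a helper like this is mainly useful for sanity checks.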

Index Files

In the references directory there are two csv files, train_index.csv and test_index.csv, that serve as index files for the training and test sets, respectively. Each file contains a single row for each recording in the given subset of the data. Both files have the following columns.

Column Datatype Description
index integer Unique index for recording
chapter integer LibriSpeech chapter ID
degrees integer Angle (in degrees) between source speaker and mic
distractor string Distractor type, options are ‘none’, ‘babb’, ‘tele’, ‘musi’
filename string Path to recording .wav, relative to root directory
gender string Speaker gender, options are ‘M’ and ‘F’
mic integer The mic used for this recording
query_name string The filename without directory path or extension
room string The room recorded in, options are ‘rm1’, ‘rm2’, ‘rm3’, ‘rm4’
segment integer LibriSpeech segment ID
source string Path to .wav file for LibriSpeech source audio for this recording
speaker integer LibriSpeech speaker ID
transcript string Orthographic transcript of the LibriSpeech source audio
noisy_length integer Length of recording in samples
noisy_sr integer Sample rate (Hz) of recording
noisy_time float Duration of recording in seconds
source_length integer Length of LibriSpeech source audio in samples
source_sr integer Sample rate (Hz) of LibriSpeech source audio
source_time float Duration of LibriSpeech source audio in seconds
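
The length columns make it possible to flag recordings that do not fully contain their source audio. The sketch below treats a recording as truncated when noisy_length is smaller than source_length; that exact criterion is an assumption, as this README only says the two columns should be compared. The sample rows are made up for illustration and include only the columns the function reads.

```python
import csv
import io


def find_truncated(index_file):
    """Return query_names whose recording is shorter than its source audio."""
    truncated = []
    for row in csv.DictReader(index_file):
        # Assumed criterion: fewer samples in the recording than in the source.
        if int(row["noisy_length"]) < int(row["source_length"]):
            truncated.append(row["query_name"])
    return truncated


# Made-up rows standing in for train_index.csv or test_index.csv.
sample = io.StringIO(
    "query_name,noisy_length,noisy_sr,source_length,source_sr\n"
    "example_full,240000,16000,236000,16000\n"
    "example_cut,200000,16000,236000,16000\n"
)
print(find_truncated(sample))  # ['example_cut']
```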

Rooms 1 and 2

Microphone Details

Microphone identification numbers are unique to a specific microphone location and type, defined below.

Mic_ID Location Model Type
01 clo SHURE SM58 stu
02 clo AKG 417L lav
03 mid SHURE SM58 stu
04 mid AKG 417L lav
05 far SHURE SM58 stu
06 far AKG 417L lav
07 beh SHURE SM58 stu
08 beh AKG 417L lav
09 tbo AKG 417L lav
10 cec AKG 417L lav
11 ceo AKG 417L lav
12 wal SHURE SM11 lav


Distance (inches) between microphones and loudspeakers or floor, for Room-1 and Room-2 recordings.

Foreground Distractor 1 Distractor 2 Distractor 3 Floor
Mic_ID rm-1 rm-2 rm-1 rm-2 rm-1 rm-2 rm-1 rm-2 rm-1 rm-2
01 38 80 71 112 71 84 53 64 42 39
02 38 80 71 112 71 84 53 64 42 39
03 72 131 35 81 56 58 52 95 42 39
04 72 131 35 81 56 58 52 95 42 39
05 119 228 72 101 33 104 83 186 70 70
06 119 228 72 101 33 104 83 186 70 70
07 29 29 115 193 133 170 94 94 70 70
08 29 29 115 193 133 170 94 94 70 70
09 58 109 64 98 60 65 49 82 28 25
10 75 128 90 107 108 103 106 104 105 105
11 75 128 90 107 108 103 106 104 106 106
12 130 116 861 116 40 115 81 164 12 10

Rooms 3 and 4

Microphone Details

Microphone identification numbers are unique to a specific microphone location and type, defined below.

Mic_ID Location Model Type
01 clo SHURE SM58 stu
02 clo AKG 417L lav
03 mid SHURE SM58 stu
04 mid AKG 417L lav
05 far SHURE SM58 stu
06 far AKG 417L lav
07 beh SHURE SM58 stu
08 beh AKG 417L lav
09 tbo AKG 417L lav
10 cec AKG 417L lav
11 ceo AKG 417L lav
12 wal SHURE SM58 lav
13 ds1 ATR1500 stu
14 ds2 ATR1500 stu
15 ds3 ATR1500 stu
16 tbc ATR4697 bar
17 sho L41 mem
18 clo ADA I2S mem
19 ref ADA I2S mem
20 obs ADA I2S mem


Distance (inches) between microphones and loudspeakers or floor, for Room-3 and Room-4 recordings. For microphones 13, 14 and 15 the distances are reported first for non-babble sessions, and then for babble sessions.

Foreground Distractor 1 Distractor 2 Distractor 3 Floor
Mic_ID rm-3 rm-4 rm-3 rm-4 rm-3 rm-4 rm-3 rm-4 rm-3 rm-4
01 67 72 179 291 170 222 141 79 41 41
02 67 72 179 291 170 222 141 79 41 41
03 146 167 117 200 157 150 106 126 41 41
04 146 167 117 200 157 150 106 126 41 41
05 281 387 103 76 207 106 165 306 67 71
06 281 387 103 76 207 106 165 306 67 71
07 58 71 292 367 252 420 232 167 67 70
08 58 71 292 367 252 420 232 167 67 67
09 88 128 163 241 165 183 129 101 27 27
10 176 173 135 208 174 159 135 128 126 95
11 176 175 135 210 174 155 135 130 125 97
12 262 380 54 39 157 131 188 128 10 12
13 230/224 259/337 10/30 10/30 107/121 42/92 196/175 277/259 42 42
14 219/178 344/305 23/61 23/61 108/95 42/62 185/176 263/224 42 42
15 201/195 334/331 40/60 40/61 102/55 42/51 176/213 151/242 42 42
16 108 128 144 241 156 179 120 128 28 28
17 151 353 145 40 72 115 238 281 22 14
18 75 78 176 287 175 216 132 74 30 30
19 236 286 87 208 36 232 267 273 38 1
20 250 407 173 68 261 130 80 320 45 97
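
Cells for microphones 13, 14 and 15 hold two values separated by a slash (non-babble first, then babble), while all other cells hold a single value. A small helper such as the hypothetical parse_distance below can normalize both forms into a pair of integers when the table is processed programmatically.

```python
def parse_distance(cell):
    """Return (non_babble, babble) distances in inches for one table cell."""
    if "/" in cell:
        non_babble, babble = cell.split("/")
        return int(non_babble), int(babble)
    value = int(cell)
    return value, value  # same distance in both session types


print(parse_distance("230/224"))  # (230, 224)
print(parse_distance("42"))       # (42, 42)
```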

Licensing

VOiCES is publicly available under the Creative Commons BY 4.0 license, free for commercial, academic, and government use. Please cite VOiCES if you use the data in publications.