Readme

Dataset Description

The VOiCES corpus is a collaboration between SRI International and Lab41, In-Q-Tel, presenting audio recorded in acoustically challenging conditions. Recordings took place in real rooms of various sizes, capturing different background and reverberation profiles for each room. Various types of distractor noise were simultaneously played with clean speech. Audio was recorded at a distance using various microphones placed throughout the room. To imitate human behavior during conversation, the foreground loudspeaker was placed on a motorized platform that rotated over a range of angles during recordings.

Three hundred distinct speakers from LibriSpeech’s “clean” data subset were selected as the source audio, ensuring a 50-50 female-male split. In preparation for upcoming data challenges, the first release of the VOiCES corpus will include 200 speakers only. The remaining 100 speakers will be reserved for model validation; the full corpus (300 speakers) will be released once the data challenge is closed.

Description of Files

VOiCES_competition was a release designed for a special competition workshop at InterSpeech 2019. See the README inside the archive for more information on the structure and arrangement of the dataset.

VOiCES_release is the full VOiCES dataset with a general purpose directory structure. VOiCES_devkit is a subset of the data (detailed below) designed for easier experimentation and development. Both VOiCES_release and VOiCES_devkit have the same directory structure.

recording_data contains two files, distances.csv and quality_metrics.csv, with useful information about the recordings. Both files have a row for every recording.

VOiCES_release

VOiCES_release contains all recordings from each room, mic, and under all distractor types. Rooms 1 and 2 have 12 mics, and rooms 3 and 4 have 20 mics. As there are four types of distractor noises, there are 256 VOiCES recordings per source recording. All source recordings have unique transcripts.

Subset	# Examples
Train	661,248
Test	337,920
Total	999,168
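
The per-source count above follows from the number of mics in each room and the number of distractor conditions. A minimal sketch of that arithmetic, assuming the "none" condition counts as one of the four distractor types (the constant and function names are illustrative, not part of the dataset):

```python
# Microphones per room, as stated above.
MICS_PER_ROOM = {"rm1": 12, "rm2": 12, "rm3": 20, "rm4": 20}
# Distractor conditions; "none" is assumed to count as one of the four types.
DISTRACTORS = ["none", "babb", "musi", "tele"]


def recordings_per_source(mics_per_room=MICS_PER_ROOM, distractors=DISTRACTORS):
    """Number of VOiCES recordings made from one source recording."""
    return sum(mics_per_room.values()) * len(distractors)


per_source = recordings_per_source()
print(per_source)  # 256

# Number of source recordings implied by the example counts in the table.
implied_sources = {}
for subset, n_examples in {"train": 661_248, "test": 337_920}.items():
    n_sources, remainder = divmod(n_examples, per_source)
    implied_sources[subset] = (n_sources, remainder)
    print(subset, n_sources, remainder)
```

Both subset sizes divide evenly by 256, which is consistent with every source recording being present under all room, mic, and distractor combinations.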

VOiCES_devkit

VOiCES_devkit is a subsample of the full VOiCES_release dataset. All of the speakers of the full dataset are retained. For each speaker we randomly selected two LibriSpeech source recordings. For each source recording we retained all of the VOiCES recordings from microphones 1 and 5, the nearest and farthest mics, respectively, from the speaker. Otherwise all rooms and background distractor types are included, for a total of 32 VOiCES recordings per source recording.

Subset	# Examples
Train	12,800
Test	6,400
Total	19,200
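
The selection described above can be sketched on a list of index-like rows. This is an illustration only: the function name, the fixed seed, and the exact sampling procedure are assumptions, since this README only states that two source recordings per speaker were chosen at random.

```python
import random
from collections import defaultdict


def devkit_subsample(rows, mics=(1, 5), sources_per_speaker=2, seed=0):
    """Keep recordings from `mics` for a random pick of sources per speaker.

    `rows` is an iterable of dicts with 'speaker', 'source', and 'mic' keys,
    mirroring columns of the index files.
    """
    rows = list(rows)
    sources_by_speaker = defaultdict(set)
    for row in rows:
        sources_by_speaker[row["speaker"]].add(row["source"])

    rng = random.Random(seed)
    kept_sources = set()
    for speaker in sorted(sources_by_speaker):
        sources = sorted(sources_by_speaker[speaker])
        k = min(sources_per_speaker, len(sources))
        kept_sources.update(rng.sample(sources, k))

    return [r for r in rows if r["source"] in kept_sources and r["mic"] in mics]


# Toy index: 1 speaker, 3 sources, 4 rooms, 4 distractors, 12 mics per room.
toy = [
    {"speaker": 32, "source": f"src{s}", "room": room, "distractor": d, "mic": m}
    for s in range(3)
    for room in ("rm1", "rm2", "rm3", "rm4")
    for d in ("none", "babb", "musi", "tele")
    for m in range(1, 13)
]
subset = devkit_subsample(toy)
print(len(subset))  # 2 sources * 4 rooms * 4 distractors * 2 mics = 64
```

With two sources kept per speaker, each source contributes 4 rooms x 4 distractors x 2 mics = 32 recordings, matching the figure quoted above.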

recording_data.tar.gz

recording_data.tar.gz contains two files, distances.csv and quality_metrics.csv, with auxiliary information about the VOiCES dataset. Each file has one row per recording.

distances.csv

Each row in this file contains the distance (in inches) from the foreground speaker, each of the distractor speakers, and the floor to the microphone for a given recording. Specifically, it has the following columns:

Column	Datatype	Description
distractor 1	integer	Inches from mic to 1st distractor speaker
distractor 2	integer	Inches from mic to 2nd distractor speaker
distractor 3	integer	Inches from mic to 3rd distractor speaker
floor	integer	Inches from mic to floor
foreground	integer	Inches from mic to source/foreground speaker
query_name	string	The recording filename without directory path or extension, useful as a key to join with other tables (e.g. index files)
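
As a hedged example of using query_name as a join key, the sketch below merges a distances row with an index row using only the standard library. The in-memory CSV text stands in for the real files: the header layout is assumed to match the column names in the table, the index row is reduced to a few columns, and the distance values are copied from the Room-1, mic 01 row of the distance table in this README.

```python
import csv
import io

INCH_TO_METER = 0.0254

# Illustrative stand-ins for distances.csv and an index file (assumed layout).
distances_csv = io.StringIO(
    "distractor 1,distractor 2,distractor 3,floor,foreground,query_name\n"
    "71,71,53,42,38,"
    "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150\n"
)
index_csv = io.StringIO(
    "query_name,room,mic,distractor\n"
    "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150,"
    "rm1,1,babb\n"
)


def join_on_query_name(index_rows, distance_rows):
    """Attach the distance columns to each index row with a matching key."""
    distances_by_name = {row["query_name"]: row for row in distance_rows}
    joined = []
    for row in index_rows:
        match = distances_by_name.get(row["query_name"])
        if match is not None:
            joined.append({**row, **match})
    return joined


joined = join_on_query_name(csv.DictReader(index_csv), csv.DictReader(distances_csv))
first = joined[0]
# csv yields strings, so convert before doing arithmetic.
foreground_m = int(first["foreground"]) * INCH_TO_METER
print(first["room"], first["mic"], round(foreground_m, 3))
```

Index rows without a matching distances row are dropped here (an inner join); keep them instead if missing distances should be visible.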

quality_metrics.csv

This file contains a number of precomputed measures of speech quality or intelligibility for each recording. Intrusive methods use the original LibriSpeech source audio as the ground truth. For recordings where the VOiCES audio does not fully contain the source audio (detectable by comparing source_length and noisy_length in the index files), all quality metrics are set to -1.

Column	Datatype	Description
query_name	string	The recording filename without directory path or extension, useful as a key to join with other tables (e.g. index files)
pesq nb	float	Perceptual Evaluation of Speech Quality, computed using python-pesq with narrow band setting
pesq wb	float	Perceptual Evaluation of Speech Quality, computed using python-pesq with wide band setting
STOI	float	Short Time Objective Intelligibility Measure, computed using pystoi
SIIB	float	Speech Intelligibility in Bits computed using pySIIB with all default settings.
SRMR	float	Normalized speech-to-reverberation modulation energy ratio, computed using SRMRpy with norm=True.
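
Because -1 is used as a placeholder, rows carrying it should be excluded before averaging or plotting the metrics. A minimal sketch, assuming the CSV header uses exactly the column names listed above; the sample rows and their values are invented for illustration:

```python
import csv
import io

SENTINEL = -1.0
METRICS = ["pesq nb", "pesq wb", "STOI", "SIIB", "SRMR"]

# Invented sample rows; real values come from quality_metrics.csv.
sample_csv = io.StringIO(
    "query_name,pesq nb,pesq wb,STOI,SIIB,SRMR\n"
    "rec_a,2.1,1.4,0.71,35.0,0.62\n"
    "rec_b,-1,-1,-1,-1,-1\n"
)


def load_valid_metrics(fileobj):
    """Return metric rows as floats, skipping rows where every metric is -1."""
    valid = []
    for row in csv.DictReader(fileobj):
        values = {name: float(row[name]) for name in METRICS}
        if all(v == SENTINEL for v in values.values()):
            continue  # placeholder row: source audio not fully contained
        valid.append({"query_name": row["query_name"], **values})
    return valid


valid_rows = load_valid_metrics(sample_csv)
mean_stoi = sum(r["STOI"] for r in valid_rows) / len(valid_rows)
print(len(valid_rows), mean_stoi)
```

The check requires all five metrics to equal -1, which matches the description above and avoids discarding a row because of a single unusual value.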

Source audio references

Source audio references, per LibriSpeech, are provided in three different tables as follows:

Information on the speaker ID, book ID, and chapter ID

Lab41-SRI-VOiCES-speaker-book-chapter.tbl

Speaker ID, gender, and LibriSpeech data subset

Lab41-SRI-VOiCES-speaker-gender-dataset.tbl

Orthographic transcription of all audio files

Lab41-SRI-VOiCES.refs

Data format

Audio files are available in WAV format with 16 kHz sample rate with 16-bit precision. All files begin with the corpus name Lab41-SRI-VOiCES. Source audio files specify speaker, chapter, and chapter segment identification number. The file naming format sample is shown below:

Lab41-SRI-VOiCES-src-sp< speaker_ID >-ch< chapter_ID >-sg< segment_ID >.wav

The naming convention for audio recorded at a distance includes all the above information, with additional descriptors for room, distractor noise, microphone type, microphone location, and position of foreground loudspeaker in degrees. The file naming format is shown below:

Lab41-SRI-VOiCES-< room >-< distractor_noise >-sp< speaker_ID >-ch< chapter_ID >-sg< segment_ID >-mc< mic_ID >-< mic_type >-< mic_location >-dg< degree >.wav

Audio files to characterize the room response are also available:

Lab41-SRI-VOiCES-< room >-< signal >-mc< mic_ID >-< mic_type >-< mic_location >.wav

As are recordings of distractor noise only or ambient room background only:

Lab41-SRI-VOiCES-< room >-< distractor_noise >-mc< mic_ID >-< mic_type >-< mic_location >.wav

Possible descriptors for room, distractor noise, microphone type, and microphone location are shown in the table below.

File Code	Type	Definition
rm1	Room	Room-1: dimensions 146” x 107” (x 107” height)
rm2	Room	Room-2: dimensions 225” x 158” (x 109” height)
src	Source audio	Source audio for foreground speaker
none	Distractor noise	No distractor noise played
musi	Distractor noise	Music distractor noise played
tele	Distractor noise	Television distractor noise played
babb	Distractor noise	Babble distractor noise played
stu	Mic type	Cardioid dynamic studio microphone
lav	Mic type	Omnidirectional condenser lavalier microphone
clo	Mic location	Closest to foreground speaker, on table
mid	Mic location	Mid-distance to foreground speaker, on table
far	Mic location	Farthest to foreground speaker, on stand
beh	Mic location	Behind foreground speaker, on stand
cec	Mic location	Overhead on ceiling, clear
ceo	Mic location	Overhead on ceiling, fully obstructed
tbo	Mic location	Partially obstructed - table
wal	Mic location	Fully obstructed - wall
ds1	Mic location	Near distractor 1
ds2	Mic location	Near distractor 2
ds3	Mic location	Near distractor 3
tbc	Mic location	Mid-distance, on table
sho	Mic location	In cupboard, fully obstructed
ref	Mic location	Across the room, near refrigerator
obs	Mic location	Fully/partially obstructed in wall/ceiling
impulse	Signal	Two seconds with transient sound in middle, for room response
swoop	Signal	Rising tone for 20 seconds, for room response
tone	Signal	Steady tone for 15 seconds, for room response
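
The naming convention for distant speech recordings can be unpacked with a regular expression. The sketch below is one possible parser, not an official tool; the pattern and function name are assumptions built from the format and codes described above, and the test name is the example filename shown in the directory listing of this README.

```python
import re

SPEECH_NAME = re.compile(
    r"^Lab41-SRI-VOiCES"
    r"-(?P<room>rm\d+)"
    r"-(?P<distractor>none|babb|musi|tele)"
    r"-sp(?P<speaker>\d+)"
    r"-ch(?P<chapter>\d+)"
    # Segment tag: the example filenames use 'sg'; 'seg' is tolerated too.
    r"-se?g(?P<segment>\d+)"
    r"-mc(?P<mic>\d+)"
    r"-(?P<mic_type>[a-z]+)"
    r"-(?P<mic_location>[a-z0-9]+)"
    r"-dg(?P<degrees>\d+)"
    r"(?:\.wav)?$"
)

INT_FIELDS = ("speaker", "chapter", "segment", "mic", "degrees")


def parse_recording_name(name):
    """Split a distant speech filename (or query_name) into its fields."""
    match = SPEECH_NAME.match(name)
    if match is None:
        raise ValueError(f"not a VOiCES speech recording name: {name!r}")
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


fields = parse_recording_name(
    "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150"
)
print(fields["room"], fields["speaker"], fields["mic"], fields["degrees"])
# rm1 32 1 150
```

Source, room-response, and distractor-only files follow different patterns and are rejected by this parser with a ValueError.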

All the data is contained in two main folders: distant-16k, containing all the audio recordings, and source-16k, containing the audio files used from LibriSpeech, corrected for DC offset and normalized to each file’s peak amplitude. The WAV files for the source audio are organized in subdirectories by speaker ID. The distant-16k has three main subdirectories:

distractors : distractor noise recordings with no foreground audio for all rooms
room-response : recorded sound to determine room-response for all rooms
speech : for each room, recordings of foreground audio with babble, music, television or no distractor noise, arranged by speaker ID in each subfolder.

Directory Structure

There are three top-level directories in the root folder of VOiCES_release and VOiCES_devkit: references, source-16k, and distant-16k. The contents of these directories are detailed below.

references contains a number of files with information about the dataset. All necessary information is gathered in test_index.csv and train_index.csv; the other files are redundant.

- references/
  - filename_transcripts
  - Lab41-SRI-VOiCES-speaker-book-chapter.tbl
  - Lab41-SRI-VOiCES-speaker-gender-dataset.tbl
  - test_index.csv
  - Test_Set_Speakers.csv
  - time_values.csv
  - train_index.csv

source-16k contains the original LibriSpeech source audio. The train-test split is the same as the VOiCES data. The audio files are separated by speaker ID. For example, all audio from speaker 115 is stored in sp0115/.

- source-16k/
  - test/
    - sp0115/
      - Lab41-SRI-VOiCES-src-sp0115-ch121720-sg0008.wav
      - ...
    - ...
  - train/

distant-16k contains the VOiCES data. There are subdirectories for the audio files used to create the distractor sounds, as well as for room responses when test sounds (impulse, swoop, and tone) are played.

- distant-16k/
  - distractors/
    - rm1/
      - babb/
        - Lab41-SRI-VOiCES-rm1-babb-mc01-stu-clo.wav
        - ...
      - ...
    - ...
  - room_response/
    - rm1/
      - impulse/
        - Lab41-SRI-VOiCES-rm1-impulse-mc01-stu-clo.wav
        - ...
      - swoop/
      - tone/
    - ...
  - speech/
    - test/
    - train/
      - rm1/
        - babb/
          - sp0032/
            - Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150.wav
          - ...
        - ...
      - ...
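
Given the layout above, the location of a speech recording can be derived from its name and subset. The helper below is a hypothetical convenience, assuming the room, distractor, and speaker fields sit at fixed positions in the dash-separated name and that the root folder is called VOiCES_devkit. The index files already provide this path in their filename column, so this is mainly useful as a consistency check.

```python
from pathlib import PurePosixPath


def speech_wav_path(query_name, subset, root="VOiCES_devkit"):
    """Build the expected path of a distant speech recording.

    `subset` is 'train' or 'test'; `root` is an assumed extraction folder.
    """
    fields = query_name.split("-")
    # Fields 0-2 are the corpus prefix 'Lab41', 'SRI', 'VOiCES'.
    room, distractor, speaker_dir = fields[3], fields[4], fields[5]
    return PurePosixPath(
        root, "distant-16k", "speech", subset,
        room, distractor, speaker_dir, query_name + ".wav",
    )


example_path = speech_wav_path(
    "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch004137-sg0007-mc01-stu-clo-dg150",
    subset="train",
)
print(example_path)
```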

Index Files

In the references directory there are two csv files, train_index.csv and test_index.csv, that serve as index files for the training and test sets, respectively. Each file contains a single row for each recording in the given subset of the data. Both files have the following columns.

Column	Datatype	Description
index	integer	Unique index for recording
chapter	integer	LibriSpeech chapter ID
degrees	integer	Angle (in degrees) between source speaker and mic
distractor	string	Distractor type, options are ‘none’, ‘babb’, ‘tele’, ‘musi’
filename	string	Path to recording .wav, relative to root directory
gender	string	Speaker gender, options are ‘M’ and ‘F’
mic	integer	The mic used for this recording
query_name	string	The filename without directory path or extension
room	string	The room recorded in, options are ‘rm1’, ‘rm2’, ‘rm3’, ‘rm4’
segment	integer	LibriSpeech segment ID
source	string	Path to .wav file for LibriSpeech source audio for this recording
speaker	integer	LibriSpeech speaker ID
transcript	string	Orthographic transcript of the LibriSpeech source audio
noisy_length	integer	Sample length of recording
noisy_sr	integer	Sample rate (Hz) of recording
noisy_time	float	Duration of recording in seconds
source_length	integer	Sample length of LibriSpeech source audio
source_sr	integer	Sample rate (Hz) of LibriSpeech source audio
source_time	float	Duration of LibriSpeech source audio in seconds
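
The length and sample-rate columns are related by duration = length / sample rate, and comparing noisy_length with source_length flags recordings that do not fully contain the source audio. The sketch below assumes that "does not fully contain" means noisy_length is smaller than source_length; the rows, names, and lengths are invented for illustration and only a few index columns are included.

```python
import csv
import io

# Invented rows with a reduced set of index columns.
index_csv = io.StringIO(
    "query_name,noisy_length,noisy_sr,source_length,source_sr\n"
    "rec_full,240000,16000,236800,16000\n"
    "rec_short,160000,16000,176000,16000\n"
)


def summarize_row(row):
    """Compute durations and flag recordings shorter than their source."""
    noisy_len, noisy_sr = int(row["noisy_length"]), int(row["noisy_sr"])
    source_len, source_sr = int(row["source_length"]), int(row["source_sr"])
    return {
        "query_name": row["query_name"],
        "noisy_seconds": noisy_len / noisy_sr,
        "source_seconds": source_len / source_sr,
        # Assumption: a shorter recording cannot contain the whole source.
        "truncated": noisy_len < source_len,
    }


summaries = [summarize_row(row) for row in csv.DictReader(index_csv)]
for summary in summaries:
    print(summary)
```

Recordings flagged as truncated here are the ones expected to carry -1 placeholders in quality_metrics.csv.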

Rooms 1 and 2

Microphone Details

Microphone identification numbers are unique to a specific microphone location and type, defined below.

Mic_ID	Location	Model	Type
01	clo	SHURE SM58	stu
02	clo	AKG 417L	lav
03	mid	SHURE SM58	stu
04	mid	AKG 417L	lav
05	far	SHURE SM58	stu
06	far	AKG 417L	lav
07	beh	SHURE SM58	stu
08	beh	AKG 417L	lav
09	tbo	AKG 417L	lav
10	cec	AKG 417L	lav
11	ceo	AKG 417L	lav
12	wal	SHURE SM11	lav

Distance (inches) between microphones and loudspeakers or floor, for Room-1 and Room-2 recordings.

	Foreground		Distractor 1		Distractor 2		Distractor 3		Floor
Mic_ID	rm-1	rm-2	rm-1	rm-2	rm-1	rm-2	rm-1	rm-2	rm-1	rm-2
01	38	80	71	112	71	84	53	64	42	39
02	38	80	71	112	71	84	53	64	42	39
03	72	131	35	81	56	58	52	95	42	39
04	72	131	35	81	56	58	52	95	42	39
05	119	228	72	101	33	104	83	186	70	70
06	119	228	72	101	33	104	83	186	70	70
07	29	29	115	193	133	170	94	94	70	70
08	29	29	115	193	133	170	94	94	70	70
09	58	109	64	98	60	65	49	82	28	25
10	75	128	90	107	108	103	106	104	105	105
11	75	128	90	107	108	103	106	104	106	106
12	130	116	861	116	40	115	81	164	12	10

Rooms 3 and 4

Microphone Details

Microphone identification numbers are unique to a specific microphone location and type, defined below.

Mic_ID	Location	Model	Type
01	clo	SHURE SM58	stu
02	clo	AKG 417L	lav
03	mid	SHURE SM58	stu
04	mid	AKG 417L	lav
05	far	SHURE SM58	stu
06	far	AKG 417L	lav
07	beh	SHURE SM58	stu
08	beh	AKG 417L	lav
09	tbo	AKG 417L	lav
10	cec	AKG 417L	lav
11	ceo	AKG 417L	lav
12	wal	SHURE SM11	lav
13	ds1	ATR1500	stu
14	ds2	ATR1500	stu
15	ds3	ATR1500	stu
16	tbc	ATR4697	bar
17	sho	L41	mem
18	clo	ADA I2S	mem
19	ref	ADA I2S	mem
20	obs	ADA I2S	mem

Distance (inches) between microphones and loudspeakers or floor, for Room-3 and Room-4 recordings. For microphones 13, 14 and 15 the distances are reported first for non-babble sessions, and then for babble sessions.

	Foreground		Distractor 1		Distractor 2		Distractor 3		Floor
Mic_ID	rm-3	rm-4	rm-3	rm-4	rm-3	rm-4	rm-3	rm-4	rm-3	rm-4
01	67	72	179	291	170	222	141	79	41	41
02	67	72	179	291	170	222	141	79	41	41
03	146	167	117	200	157	150	106	126	41	41
04	146	167	117	200	157	150	106	126	41	41
05	281	387	103	76	207	106	165	306	67	71
06	281	387	103	76	207	106	165	306	67	71
07	58	71	292	367	252	420	232	167	67	70
08	58	71	292	367	252	420	232	167	67	67
09	88	128	163	241	165	183	129	101	27	27
10	176	173	135	208	174	159	135	128	126	95
11	176	175	135	210	174	155	135	130	125	97
12	262	380	54	39	157	131	188	128	10	12
13	230/224	259/337	10/30	10/30	107/121	42/92	196/175	277/259	42	42
14	219/178	344/305	23/61	23/61	108/95	42/62	185/176	263/224	42	42
15	201/195	334/331	40/60	40/61	102/55	42/51	176/213	151/242	42	42
16	108	128	144	241	156	179	120	128	28	28
17	151	353	145	40	72	115	238	281	22	14
18	75	78	176	287	175	216	132	74	30	30
19	236	286	87	208	36	232	267	273	38	1
20	250	407	173	68	261	130	80	320	45	97
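
Cells of the form a/b in the table above hold two distances, the first for non-babble sessions and the second for babble sessions. A small sketch for reading such cells (function names are illustrative):

```python
def parse_distance(cell):
    """Return (non_babble, babble) distances in inches for one table cell."""
    if "/" in cell:
        non_babble, babble = cell.split("/")
        return int(non_babble), int(babble)
    value = int(cell)
    return value, value  # same distance in every session


def distance_for(cell, distractor):
    """Pick the distance that applies to a given distractor code."""
    non_babble, babble = parse_distance(cell)
    return babble if distractor == "babb" else non_babble


print(parse_distance("230/224"))      # (230, 224)
print(distance_for("10/30", "babb"))  # 30
print(distance_for("10/30", "tele"))  # 10
print(distance_for("67", "none"))     # 67
```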

Licensing

VOiCES is publicly available under Creative Commons BY 4.0, free for commercial, academic, and government use. Please reference VOiCES if using the data in publications.