Harvesting speech datasets

Distinctions of prosody (rhythm, stress, and intonation) are ubiquitous in spoken language. It often seems obvious to a native speaker of English what prosody is most appropriate in a given sentence and context, and researchers in linguistics and related fields have proposed numerous formalized hypotheses about it. But establishing the validity of these hypotheses has proved remarkably elusive. Much of the problem is that it is difficult to observe enough examples of a given phenomenon to evaluate a hypothesis. The project aims to address this dearth of data by collecting or "harvesting" examples of specific word sequences or word patterns from web sources. It is often possible to find hundreds or thousands of examples of people using the very same word pattern. If these examples are collected into a dataset and made available to the research community, it will be possible to evaluate theories about the form and meaning of prosody on an unprecedented scale.

This site provides web distribution of data in the form of audio snippets, accompanied by transcriptions of the surrounding utterance context. Follow the links on the left to examine individual targets.

The project was one of eight winners in the first round of the Digging into Data competition, which challenged international teams to apply data-intensive methods in the humanities and social sciences. Our project teams Cornell University (USA) with McGill University (Canada). Follow the links in the left column for information about the research team, project sites, and the Digging into Data Challenge.

Funded by the National Science Foundation (USA) and the Social Sciences and Humanities Research Council of Canada/Conseil de recherches en sciences humaines du Canada under the following grants:

NSF 1035151 "RAPID: Harvesting Speech Datasets for Linguistic Research on the Web (Digging into Data Challenge)". PI Mats Rooth.
SSHRC Digging into Data Challenge Grant 869-2009-0004. PI Michael Wagner.

