Project

Project data

Phylomilia - Phylogenetic linguistic inference from acoustic speech data

Initiative: Pioniervorhaben Exploration
Call: Humanities and Social Sciences
Approved: 4 July 2024
Duration: 3 years

Project information

Computational comparative linguistics traditionally relies heavily on manual data preprocessing, which limits progress, scalability, and reproducibility. This project aims to transform the field by using deep learning methods for automatic speech processing to perform phylogenetic analysis directly from acoustic speech data, without manual intervention. Using speech as the primary data source marks a significant shift away from writing-based analyses and allows more direct and nuanced insights into language evolution. The proposed methodology reduces the traditional multi-step workflow of linguistic analysis to two core processes: 1) transforming speech data into vector space representations using self-supervised deep learning models, which capture linguistic features directly from the audio; 2) conducting phylogenetic inference on these vector representations to construct language family trees and deduce historical relationships between languages. In preparation for the pre-training required by the first step, the project team will devise a language-independent, end-to-end automatic speech recognition tool that transcribes spoken language into the International Phonetic Alphabet. Using autoencoder techniques, the project will furthermore probabilistically reconstruct aspects of the vocabulary and phonotactics of earlier language stages. Finally, because the deep learning methods used in the project have a black-box character, the project will devote attention to post-hoc explainability of the trained model in linguistic terms.
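
The second core process, going from vector representations to a family tree, can be illustrated with a minimal sketch. It assumes that each language has already been summarized as a single embedding vector. The language labels, the toy vectors, the cosine distance, and the UPGMA (average-linkage) clustering are all illustrative assumptions; the project description does not state which distance measure or inference algorithm will be used.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def upgma(vectors):
    """Build a tree topology (Newick string) from a dict of name -> vector.

    Repeatedly merges the two closest clusters; the distance from a merged
    cluster to any other cluster is the size-weighted average of the
    distances of its two parts (average linkage).
    """
    sizes = {name: 1 for name in vectors}  # cluster label -> number of leaves
    dist = {
        frozenset((a, b)): cosine_distance(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }
    while len(sizes) > 1:
        # Pick the closest pair of clusters; sort labels for stable output.
        a, b = sorted(min(dist, key=dist.get))
        merged = f"({a},{b})"
        for c in sizes:
            if c in (a, b):
                continue
            dist[frozenset((merged, c))] = (
                sizes[a] * dist[frozenset((a, c))]
                + sizes[b] * dist[frozenset((b, c))]
            ) / (sizes[a] + sizes[b])
        # Drop every distance that still refers to one of the merged parts.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"


# Hypothetical embeddings for four unnamed languages (not real data).
toy_embeddings = {
    "lang_A": [1.0, 0.1, 0.0],
    "lang_B": [0.9, 0.2, 0.0],
    "lang_C": [0.0, 1.0, 0.2],
    "lang_D": [0.1, 0.9, 0.3],
}

tree = upgma(toy_embeddings)
print(tree)
```

On the toy data the sketch groups lang_A with lang_B and lang_C with lang_D, because those pairs point in nearly the same direction in the vector space. A distance-based method is used here only because it is compact; character-based or Bayesian phylogenetic inference would be separate design choices.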

Project participants