Project data
Initiative: | Pioniervorhaben Exploration |
---|---|
Call: | Humanities and Social Sciences |
Approval date: | 4 July 2024 |
Duration: | 3 years |
Project information
Computational comparative linguistics traditionally relies heavily on manual data preprocessing, which limits progress, scalability, and reproducibility. This project aims to transform the field by applying deep learning methods from automatic speech processing to perform phylogenetic analysis directly on acoustic speech data, without manual intervention. Using speech as the primary data source marks a significant shift from writing-based analyses and allows more direct and nuanced insights into language evolution. The proposed methodology reduces the traditional multi-step workflow of linguistic analysis to two core processes: 1) transforming speech data into vector space representations using self-supervised deep learning models, which capture linguistic features directly from audio data; 2) conducting phylogenetic inference on these vector representations to construct language family trees and deduce historical language relationships. In preparation for pre-training in the first step, the project team will devise a language-independent, end-to-end automatic speech recognition tool that transcribes spoken language into the International Phonetic Alphabet. Furthermore, the project will use autoencoder techniques to probabilistically reconstruct aspects of the vocabulary and phonotactics of earlier language stages. Finally, because the deep learning methods used in the project have a black-box character, the project will devote attention to post-hoc explainability of the trained model in linguistic terms.
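To make the second core process more concrete, the following is a minimal sketch, not the project's actual method. It assumes that each language has already been reduced to a single embedding vector (the `toy_embeddings` values below are invented placeholders, not real data) and uses cosine distance with UPGMA clustering as a simple stand-in for phylogenetic inference. The names `cosine_distance`, `upgma`, and the `lang_*` labels are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def upgma(embeddings):
    """Cluster {language: vector} into a binary tree of nested tuples (UPGMA)."""
    # Each cluster maps an id to (subtree, number of leaves in it).
    clusters = {name: (name, 1) for name in embeddings}
    dist = {
        frozenset((a, b)): cosine_distance(embeddings[a], embeddings[b])
        for a, b in combinations(embeddings, 2)
    }
    next_id = 0
    while len(clusters) > 1:
        # Pick the closest pair among the clusters that are still active.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merged = f"#{next_id}"
        next_id += 1
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d_ac = dist[frozenset((a, c))]
            d_bc = dist[frozenset((b, c))]
            dist[frozenset((merged, c))] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        clusters[merged] = ((tree_a, tree_b), n_a + n_b)
    tree, _ = next(iter(clusters.values()))
    return tree


# Invented placeholder vectors; in the described workflow these would come
# from a self-supervised speech model (an assumption of this sketch).
toy_embeddings = {
    "lang_A": [1.0, 0.1, 0.0],
    "lang_B": [0.9, 0.2, 0.0],
    "lang_C": [0.0, 1.0, 0.1],
    "lang_D": [0.1, 0.9, 0.2],
}

tree = upgma(toy_embeddings)
print(tree)
```

UPGMA is chosen here only because it fits in a few lines; it assumes a constant rate of change, which real phylogenetic inference methods typically relax.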
Project participants
- Prof. Dr. Gerhard Jäger
Universität Tübingen
Philosophische Fakultät
Seminar für Sprachwissenschaft
Tübingen