Gnowsis: Multimodal Multitask Learning for Oral Proficiency Assessments
Computer Speech & Language
Abstract
Although oral proficiency assessments are crucial for understanding second language (L2) learners' progress, they are resource-intensive. Herein, we propose a multimodal multitask learning model that assesses L2 proficiency levels along multiple dimensions from multimodal dialogue data. To construct the model, we first created a dataset of speech samples collected through oral proficiency interviews between Japanese learners of English and a conversational virtual agent. Expert human raters then categorized the samples into the six proficiency levels defined by the rating scales of the Common European Framework of Reference for Languages (CEFR) with respect to one holistic and five analytic assessment criteria (vocabulary richness, grammatical accuracy, fluency, goodness of pronunciation, and coherence). Using this dataset, the model was trained via a multitask learning approach to simultaneously predict the proficiency levels of these language competences from diverse linguistic features. These features were extracted by multiple encoder modules composed of feature extractors pretrained on natural language and speech processing tasks such as grammatical error correction, coreference resolution, discourse marker prediction, and pronunciation scoring. In experiments, the proposed model outperformed baseline models whose feature extractors were pretrained on a single modality (textual or acoustic). In particular, because it draws on this rich feature set, the proposed model remained robust even with limited training data and with short dialogues covering fewer topics.
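For illustration, the following is a minimal PyTorch sketch of the kind of multitask architecture the abstract describes: pretrained textual and acoustic encoders whose features are fused and passed to one classification head per assessment criterion. All names, dimensions, the fusion by concatenation, and the equal task weighting are assumptions for exposition, not the paper's actual implementation.

import torch
import torch.nn as nn

# Hypothetical dimensions; the paper does not specify them.
TEXT_DIM, ACOUSTIC_DIM, HIDDEN_DIM = 768, 512, 256
NUM_LEVELS = 6  # the six CEFR levels (A1-C2)
CRITERIA = ["holistic", "vocabulary", "grammar",
            "fluency", "pronunciation", "coherence"]

class MultitaskProficiencyModel(nn.Module):
    """Sketch of a multimodal multitask rater: pretrained encoders
    feed a shared layer and one head per assessment criterion."""

    def __init__(self, text_encoder, acoustic_encoder):
        super().__init__()
        # Encoders are assumed to be pretrained modules mapping a
        # dialogue to fixed-size feature vectors (e.g., extractors
        # pretrained on grammatical error correction, coreference
        # resolution, discourse marker prediction, or pronunciation
        # scoring, as in the abstract).
        self.text_encoder = text_encoder
        self.acoustic_encoder = acoustic_encoder
        self.shared = nn.Sequential(
            nn.Linear(TEXT_DIM + ACOUSTIC_DIM, HIDDEN_DIM), nn.ReLU())
        # One classification head per criterion.
        self.heads = nn.ModuleDict(
            {c: nn.Linear(HIDDEN_DIM, NUM_LEVELS) for c in CRITERIA})

    def forward(self, text_input, audio_input):
        # Fuse modality features by concatenation (an assumption).
        fused = torch.cat([self.text_encoder(text_input),
                           self.acoustic_encoder(audio_input)], dim=-1)
        h = self.shared(fused)
        return {c: head(h) for c, head in self.heads.items()}

def multitask_loss(logits, labels):
    """Sum of per-criterion cross-entropy losses; equal weighting
    is an assumption, and the paper may weight tasks differently."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits[c], labels[c]) for c in CRITERIA)

In a design like this, the shared hidden layer lets correlated criteria (e.g., fluency and coherence) regularize one another, which is one plausible reason a multitask model could remain robust with limited training data, as the abstract reports.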