Gnowsis: Multimodal Multitask Learning for Oral Proficiency Assessments

Computer Speech & Language

Automated Scoring
Lexical Sophistication
CEFR Assessment
Rater-mediated Assessment
Authors

Takatsu, H., Suzuki, S., Eguchi, M., Matsuura, R., Saeki, M., Matsuyama, Y.

Published

January 1, 2026

DOI

10.1016/j.csl.2025.101860

Abstract

Although oral proficiency assessments are crucial for understanding second language (L2) learners’ progress, they are resource-intensive. Herein we propose a multimodal multitask learning model that assesses L2 proficiency levels from multiple aspects on the basis of multimodal dialogue data. To construct the model, we first created a dataset of speech samples collected through oral proficiency interviews between Japanese learners of English and a conversational virtual agent. Expert human raters subsequently categorized the samples into six levels based on the rating scales defined in the Common European Framework of Reference for Languages, with respect to one holistic and five analytic assessment criteria (vocabulary richness, grammatical accuracy, fluency, goodness of pronunciation, and coherence). The model was trained on this dataset via a multitask learning approach to simultaneously predict the proficiency levels of these language competences from various linguistic features. These features were extracted by multiple encoder modules composed of feature extractors pretrained on various natural language processing tasks such as grammatical error correction, coreference resolution, discourse marker prediction, and pronunciation scoring. In experiments comparing the proposed model to baseline models whose feature extractor was pretrained on single-modality (textual or acoustic) features, the proposed model outperformed the baselines. In particular, the proposed model remained robust even with limited training data or short dialogues covering fewer topics, because it considers rich features.
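To make the multitask setup concrete, the following is a minimal sketch (not the authors’ implementation) of how several pretrained encoder outputs can be fused and scored with one classification head per CEFR criterion. The module names, dimensions, and fusion scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative criteria and level count from the abstract: one holistic and
# five analytic criteria, each rated on six CEFR levels (A1-C2).
CRITERIA = ["holistic", "vocabulary", "grammar", "fluency", "pronunciation", "coherence"]
NUM_LEVELS = 6


class MultimodalMultitaskScorer(nn.Module):
    """Hypothetical multitask head over multiple pretrained encoders."""

    def __init__(self, encoder_dims, fused_dim=256):
        super().__init__()
        # One projection per pretrained encoder output (e.g., text, acoustic,
        # discourse features); the encoders themselves are assumed external.
        self.projections = nn.ModuleList(
            [nn.Linear(d, fused_dim) for d in encoder_dims]
        )
        self.fusion = nn.Sequential(
            nn.Linear(fused_dim * len(encoder_dims), fused_dim),
            nn.ReLU(),
        )
        # One classification head per assessment criterion (multitask learning).
        self.heads = nn.ModuleDict(
            {name: nn.Linear(fused_dim, NUM_LEVELS) for name in CRITERIA}
        )

    def forward(self, encoder_outputs):
        # encoder_outputs: list of pooled feature vectors, one per encoder.
        projected = [proj(x) for proj, x in zip(self.projections, encoder_outputs)]
        fused = self.fusion(torch.cat(projected, dim=-1))
        return {name: head(fused) for name, head in self.heads.items()}


# Toy usage: three encoders with different feature sizes, a batch of two samples.
model = MultimodalMultitaskScorer(encoder_dims=[768, 512, 128])
features = [torch.randn(2, 768), torch.randn(2, 512), torch.randn(2, 128)]
logits = model(features)

# A common multitask objective is a (possibly weighted) sum of per-criterion losses.
targets = {name: torch.randint(0, NUM_LEVELS, (2,)) for name in CRITERIA}
loss = sum(nn.functional.cross_entropy(logits[n], targets[n]) for n in CRITERIA)
```

The key design choice illustrated here is that all criteria share the fused representation while each keeps its own output head, which is one standard way to let related scoring tasks regularize one another.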

APA Reference

Takatsu, H., Suzuki, S., Eguchi, M., Matsuura, R., Saeki, M., & Matsuyama, Y. (2026). Gnowsis: Multimodal Multitask Learning for Oral Proficiency Assessments. Computer Speech & Language, 95, 101860. https://doi.org/10.1016/j.csl.2025.101860