Corpus Lab 3

Mini-research on vocabulary and multiword units

Author

Masaki EGUCHI, Ph.D.

Modified

August 6, 2025

Assignment Overview

This assignment aims to help you practice the following skills:

~~Constructing lists of formulaic language through concordance software~~
Planning and conducting a small-scale corpus study using single- and multi-word indices

Assignment Details

Task 1: Describing statistical characteristics of collocations (4 points)

~~In the first task, I would like you to calculate major Strengths Of Association (SOA) measures to quantify the association between two words (node words and their collocates.)~~

~~The frequency of node words, their collocates and entire corpus size will be given to you.~~

~~Your task is to calculate T-score, MI, MI^2, and LogDice.~~

~~Submission~~ (This task was canceled)

~~A spreadsheet file with SOA values.~~
~~A word file (.docx) for plots and prose descriptions.~~

Success Criteria

~~Your submission …~~

~~contains accurate T-score, MI, MI^2 and LogDice scores~~
~~provides visualization of the relations between SOA indices~~
~~describe the relationships among SOA indices and typical collocations~~

Tasks 2 & 3: Mini-research project (~~8 points altogether~~ 12 points altogether)

Tasks 2 and 3 both belong to the mini-research project.

In this part of the assignment, you will conduct a mini-research project to describe uses of single- and multi-word units in a corpus you choose.

Specifically, you will:

select lexical richness or phraseological sophistication indices to answer a set of research questions
analyze the chosen corpus with the selected indices
present the results and your interpretation in written prose

Submission

The final report should be one to two pages long and include the following sections:

Short background and Research Questions (one paragraph)
Method section
- Corpus descriptions (one paragraph)
- Index descriptions (one paragraph)
- Analysis (one paragraph)
- Research hypothesis (one paragraph)
Results (data interpretation and commentary)
- Figures or statistical report
Conclusion

Assignment Guideline

Step 1: Construct research questions

In this type of research, researchers typically pose RQs about the relationship between lexical characteristics and variables that define subsections of the corpus (e.g., grade, genre, or proficiency score).

The following information is available through the GiG corpus:

The following information is available through the ICNALE corpus:

Ratings performed by external raters

Step 2: Understand and choose the corpus

In this assignment, please choose one of the following corpora:

Growth in Grammar (GiG) corpus (Durrant, 2023)
ICNALE corpus (Edited Essay OR GRA)
A Japanese corpus (ask Masaki about availability).

Step 3: Construct hypothesis

Based on what you’ve learned about learners’ vocabulary use, state several hypotheses about the findings you expect for each research question.

In other words, what do you expect as the relationship between lexical characteristics X and external variable Y?

Step 4: Select index

Based on the RQs and hypotheses, you will select indices that can capture the lexical characteristics X in your corpus.

Step 5: Compute the index

You will now use the tools we have covered in this course to derive lexical richness scores for the text.
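As a rough illustration of what "deriving a lexical richness score" involves, the sketch below computes three common diversity indices (TTR, root TTR, and a moving-average TTR) with the Python standard library only. It is not the course tool: the function names, the regex tokenizer, and the default window size of 50 are assumptions made for this example, and the tools covered in class will tokenize and lemmatize differently.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and return word tokens (crude regex tokenizer, assumed for illustration)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(tokens):
    """Number of types divided by number of tokens; sensitive to text length."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def root_ttr(tokens):
    """Number of types divided by the square root of the number of tokens."""
    return len(set(tokens)) / math.sqrt(len(tokens)) if tokens else 0.0


def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over every run of `window` consecutive tokens."""
    if len(tokens) < window:
        # Text shorter than one window: fall back to plain TTR.
        return type_token_ratio(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    sample = tokenize("The cat saw the dog.")
    print(type_token_ratio(sample), root_ttr(sample), mattr(sample, window=2))
```

Because plain TTR drops as texts get longer, a length-corrected index such as the moving-average version is usually the safer choice when the texts in your corpus differ in length.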

Step 6: Conduct analysis

To answer the research questions, you may want to do the following:
- Obtain descriptive statistics of the lexical richness indices
- Visualize the relationship between variables
- Optionally run statistical analyses
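The descriptive and optional statistical steps above can be sketched as follows. This is a minimal example under stated assumptions: each text is represented as a hypothetical `(group label, index score)` pair, the helper names `describe_by_group` and `pearson_r` are invented for this sketch, and the numbers in the demo are placeholders, not values from GiG or ICNALE.

```python
import math
import statistics
from collections import defaultdict


def describe_by_group(records):
    """Return n, mean, and sample SD of the index score for each group label."""
    groups = defaultdict(list)
    for label, score in records:
        groups[label].append(score)
    return {
        label: {
            "n": len(scores),
            "mean": statistics.mean(scores),
            # Sample SD needs at least two observations.
            "sd": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        }
        for label, scores in groups.items()
    }


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values invented for illustration only; not real corpus data.
    records = [
        ("group_a", 0.52), ("group_a", 0.58),
        ("group_b", 0.61), ("group_b", 0.67),
    ]
    for label, summary in describe_by_group(records).items():
        print(label, summary)
```

A per-group summary like this suits categorical variables such as grade or genre, while a correlation is the more natural choice when the external variable is a continuous score such as a rating. The same grouped scores can be passed to a plotting library of your choice for the visualization step.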

Step 7: Interpret and write up the results

You will write up what you found in your mini-research project in a short report of one to two pages.

Success Criteria

Your submission …

outlines research questions and hypotheses
provides a description of the lexical richness measures you used and how you calculated them
provides analysis results and their interpretations in relation to the research questions
