Corpus Lab 1 Due 8/5 (Tue) 10:30 am
Any questions?
A strand of research investigating:
In this course, we will mostly focus on questions 3 and 4.
To make a somewhat bold statement, many SLA researchers are interested in describing:
Goal: Revealing the pace and patterns of development
We can use corpus linguistic methods to identify features of language use that help us understand the constructs of interest.
→ More and more SLA researchers rely on corpus methods.
Benefits:
Caveats:
By the end of this session, students will be able to:
- Discuss measurement issues in SLA
- Explain the purposes of linguistic measures
- List commonly used lexical measures in second language acquisition research
- Explain sub-constructs of lexical richness measures
  - Lexical Diversity
  - Lexical Sophistication

The Measurement process
Conceptual work
Construct definition: delineates the theoretical interpretation that can be attached to the observation data (or a measure).
Behavior identification: What kind of behavior do you need to observe?
Task specification: What conditions do you need to set up to observe the intended behavior?
Procedural work
Behavior elicitation: Target behavior is elicited, observed and recorded.
Observation scoring: Classifying or coding observed behavior into categories or values to link them to theoretically meaningful interpretation.
Data analysis: Scores are summarized and patterns are described to provide a probabilistic evaluation of the data.


Assessing Lexical Richness in Learner Language
Which one do you think reflects “better” vocabulary use?
Is important for college students to have a part-time job? I think that has much opinion to answer it. The part-time job is a job that can do in partial time. So the college student can do part-time job when they has spare time (if they want). There are many reasons why the college student do part-time job (if they do).
I find it hard to make a generalisation on whether it’s important or not for college students to have a part-time job, because this seems like something very individual and highly dependent on the individual student and their circumstances. Jobs serve a few main functions: to earn money, to gain experience, to get a head-start in a career, and to have something to do.
Assumption
Skehan (2009) distinguished:
| Type | Description | Example |
|---|---|---|
| Text-internal | The (learner-produced) text is sufficient for calculation | Lexical Diversity, Lexical Density |
| Text-external | Reference corpus is needed to derive score | Lexical Sophistication, Phraseological Sophistication |
Skehan, P. (2009). Modelling Second Language Performance: Integrating Complexity, Accuracy, Fluency, and Lexis. Applied Linguistics, 30(4), 510–532. https://doi.org/10.1093/applin/amp047
Jarvis, S. (2013). Capturing the Diversity in Lexical Diversity. Language Learning, 63(s1), 87–106. https://doi.org/10.1111/j.1467-9922.2012.00739.x
Old LD measures include:
\(TTR = {nType \over nToken}\)
\(RootTTR (Guiraud Index) = {nType \over \sqrt{nToken}}\)
\(LogTTR = {\log(nType) \over \log(nToken)}\)
\(Maas = {\log(nToken) - \log(nType) \over (\log(nToken))^2}\)
→ Never use TTR, RootTTR, LogTTR
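The four formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `lexical_diversity_indices` is made up for this example, tokens are assumed to be already tokenized and lowercased, and the natural logarithm is assumed (LogTTR does not depend on the log base, but Maas does).

```python
import math

def lexical_diversity_indices(tokens):
    """Compute TTR, RootTTR, LogTTR, and Maas from a list of tokens."""
    n_token = len(tokens)
    n_type = len(set(tokens))
    if n_token < 2:
        # log(1) = 0 would cause division by zero in LogTTR and Maas
        raise ValueError("Need at least two tokens.")
    log_token = math.log(n_token)  # assumption: natural log
    log_type = math.log(n_type)
    return {
        "TTR": n_type / n_token,
        "RootTTR": n_type / math.sqrt(n_token),
        "LogTTR": log_type / log_token,
        "Maas": (log_token - log_type) / log_token ** 2,
    }

# Toy text: 13 tokens, 8 types
tokens = "the cat sat on the mat and the dog sat on the rug".split()
scores = lexical_diversity_indices(tokens)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Running the function on the first half of a text and then on the whole text is a quick way to see how much each index moves with text length.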
Relationship between text-lengths and LD measures
Kyle, K., Sung, H., Eguchi, M., & Zenker, F. (2024). Evaluating evidence for the reliability and validity of lexical diversity indices in L2 oral task responses. Studies in Second Language Acquisition, 46(1), 278–299. https://doi.org/10.1017/S0272263123000402
Zenker & Kyle
Process:
Text with 100 words:
Window 1: [1-50] → TTR = 0.68
Window 2: [2-51] → TTR = 0.69
Window 3: [3-52] → TTR = 0.67
...
Window 51: [51-100] → TTR = 0.70
Then take the average across all windows: 0.685
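The windowing process above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `window_ttrs` and `moving_average_ttr` are hypothetical, the window size of 50 follows the example, and falling back to plain TTR for texts shorter than the window is a choice made for this sketch only.

```python
def window_ttrs(tokens, window=50):
    """Return the TTR of every overlapping window of `window` tokens."""
    n = len(tokens)
    return [
        len(set(tokens[i:i + window])) / window
        for i in range(n - window + 1)
    ]

def moving_average_ttr(tokens, window=50):
    """Average the TTRs of all overlapping windows."""
    if not tokens:
        raise ValueError("Empty text.")
    if len(tokens) < window:
        # Assumption for this sketch: use plain TTR when the text is too short
        return len(set(tokens)) / len(tokens)
    ttrs = window_ttrs(tokens, window)
    return sum(ttrs) / len(ttrs)

# A 100-token text yields 51 windows of 50 tokens each
toy_text = [f"w{i % 40}" for i in range(100)]
print(len(window_ttrs(toy_text, window=50)))
print(moving_average_ttr(toy_text, window=50))
```

Because every window has the same length, the averaged score is not driven by the total length of the text in the way plain TTR is.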
Measures how many words it takes for lexical diversity to “stabilize”
Process:
Score: Average factor length (higher = more diverse)
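A rough sketch of the factor-counting idea is given below. It follows the commonly described MTLD procedure as I understand it, and several details are assumptions rather than something stated on the slide: a factor is closed when the running TTR falls below 0.72 (some implementations use "at or below"), the leftover tokens at the end count as a partial factor, and the forward and backward passes are averaged. The function names are hypothetical.

```python
def mtld_one_direction(tokens, threshold=0.72):
    """Number of tokens divided by number of factors, reading left to right."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count < threshold:
            # Running TTR dropped below the threshold: close this factor
            factors += 1
            types = set()
            count = 0
    if count > 0:
        # Assumption: leftover tokens contribute a partial factor
        remainder_ttr = len(types) / count
        factors += (1 - remainder_ttr) / (1 - threshold)
    if factors == 0:
        raise ValueError("No factor completed; score undefined for this text.")
    return len(tokens) / factors

def mtld(tokens, threshold=0.72):
    """Average of the forward and backward passes."""
    forward = mtld_one_direction(tokens, threshold)
    backward = mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2

# A maximally repetitive text closes a factor every two tokens
print(mtld(["a"] * 10))
```

The returned value is the average factor length, so a text that takes longer to fall below the threshold receives a higher score.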
In the following example, MTLD is calculated as:
Text: "The cat sat on the mat... The cat was happy. ... The bird flew by."
|--- Factor 1: 32 words ----------||---- Factor 2: 44 words-----|
(TTR drops to 0.72) (TTR drops to 0.72)
Word frequency information in reference corpora has been used to operationalize LS.
Example 1: The big company bought the small business.
Example 2: The major corporation acquired the diminutive enterprise.
In LFP, words in learner text are categorized into different frequency bands in the reference corpus.
| Band | Proportion |
|---|---|
| First 1000 | 75% |
| Second 1000 | 10% |
| Academic Word List | 10% |
| Others | 5% |
Beyond 2000 (Laufer, 1995) is used more often
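The band-based categorization can be sketched as below. Everything in the example data is an assumption for illustration: the three tiny word sets are toy stand-ins invented here, not the actual frequency lists, and the function name `lexical_frequency_profile` is hypothetical. Real profiles typically match word families or lemmas rather than raw tokens.

```python
from collections import Counter

def lexical_frequency_profile(tokens, bands):
    """Return the proportion of tokens in each band (first matching band wins)."""
    if not tokens:
        raise ValueError("Empty text.")
    counts = Counter()
    for token in tokens:
        for name, words in bands.items():
            if token in words:
                counts[name] += 1
                break
        else:
            # Token not found in any band list
            counts["Others"] += 1
    total = len(tokens)
    return {name: counts[name] / total for name in [*bands, "Others"]}

# Toy band lists, invented purely for illustration
toy_bands = {
    "First 1000": {"the", "big", "small"},
    "Second 1000": {"company", "business", "bought"},
    "Academic Word List": {"major", "corporation", "acquired"},
}

example_1 = "the big company bought the small business".split()
example_2 = "the major corporation acquired the diminutive enterprise".split()

profile_1 = lexical_frequency_profile(example_1, toy_bands)
profile_2 = lexical_frequency_profile(example_2, toy_bands)

# "Beyond 2000": everything outside the first two bands
beyond_2000 = profile_2["Academic Word List"] + profile_2["Others"]
print(profile_1)
print(profile_2)
print(beyond_2000)
```

With these toy lists, the second sentence places a much larger share of its tokens outside the first two bands than the first sentence does, which is the contrast the Beyond 2000 score is meant to capture.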
More recent research recognizes multidimensionality in conceptualizing and operationalizing LS.
Kyle & Crossley (2015) proposed the Tool for the Automatic Analysis of Lexical Sophistication (TAALES).
Kim, Crossley & Kyle (2018) proposed multidimensional LS for writing.
Eguchi & Kyle (2020) followed up on the concept in L2 speaking.
Durrant (2023) highlighted several categories:
LS should also tap into the extent to which the text uses items that are characteristic of the target register.
A word used in a narrower context (specific situation) should be considered more sophisticated.
A word that invokes a more concrete concept receives a higher concreteness score.
| | Concrete | Abstract |
|---|---|---|
| Frequent | dog, car | love, idea |
| Infrequent | charger, helmet | empathy, hypothesis |

| Hypernymy value | Example words |
|---|---|
| 3 – 4 | part, group |
| 5 – 6 | way, thing |
| 7 – 8 | house, site |
| 9 – 11 | car, city |
| 12 – 14 | dog, cancer |
| 15 – 17 | buffalo, stallion |
| 18 – 19 | bulls, gaur |
Goal: Investigate the multidimensional nature of lexical sophistication in L2 oral proficiency interviews (OPIs)
Corpus: NICT JLE corpus (1,281 Japanese L2 English OPIs)
Method: Exploratory Factor Analysis (EFA) + regression analysis
Factor 1
Factor 4
final regression to predict OPI
The measurement process
Lexical Diversity (LD)
Lexical Sophistication (LS)
You’ll learn to:
TagAnt
TAALED
Linguistic Data Analysis I