By the end of this session, students will be able to:
- Search for window-based collocations and n-grams in AntConc
- Calculate commonly used strength-of-association measures by hand using spreadsheet software
- Discuss the benefits and drawbacks of different strength-of-association measures
Instruction
Compile a Japanese frequency list based on a corpus.
Corpus: Aozora 500 from Google Drive.
File format: .tsv or .txt.
Success Criteria
Your submission …
Goal: to replicate analysis on GiG.
You will need access to both the corpus data and the metadata file.
The corpus data is here.
GiG Metadata
Success Criteria
Your submission …
quality of word use in each text
We can also compare two texts in the simple text analyzer.
two-text
Success Criteria
Your submission …
Expected frequency estimates the number of times two words would occur together by chance if they were truly independent.
Expected frequency is usually calculated as follows: \[E_{11} = {(\text{freq of node word} \times \text{freq of collocate}) \over \text{Corpus size}}\]
If word 1 and word 2 occur 500 times each in a million-word corpus…
Expected frequency and joint probability are mathematical conversions of each other.
\(\text{Joint Probability} = {500 \over 1000000} \times {500 \over 1000000}\)
This is a probability. To convert it back to a count over the whole corpus, multiply it by the corpus size.
\(\text{Expected frequency} = {500 \over 1000000} \times {500 \over 1000000} \times 1000000\)
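The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the numbers from the example (500 occurrences each, a one-million-word corpus); the variable names are illustrative and not part of the original slides.

```python
# Worked example from the text: two words, 500 occurrences each,
# in a corpus of 1,000,000 words.
freq_word1 = 500
freq_word2 = 500
corpus_size = 1_000_000

# Joint probability under independence: P(word1) * P(word2)
joint_probability = (freq_word1 / corpus_size) * (freq_word2 / corpus_size)

# Convert the probability back to a count by multiplying by the corpus size
expected_from_probability = joint_probability * corpus_size

# The direct formula: (freq of node word * freq of collocate) / corpus size
expected_direct = (freq_word1 * freq_word2) / corpus_size

print(joint_probability)
print(expected_from_probability)
print(expected_direct)
```

Both routes give an expected frequency of about 0.25: under independence, the two words would be expected to co-occur roughly once in every four corpora of this size.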
Let’s generate a list of n-grams.
You can use either English or Japanese.
For Japanese, you can use the Aozora 500 data we used.
For English, you can use the AmE06 or BE06 data in AntConc.
Trigram in BE06
Quintgram in BE06
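To see what an n-gram list is made of, here is a minimal Python sketch that counts trigrams in a toy sentence. The sentence, the whitespace tokenization, and the helper name `ngrams` are assumptions for illustration only; AntConc applies its own token definition, and Japanese text would need a word segmenter before this step.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy text (hypothetical), tokenized by splitting on whitespace
text = "the cat sat on the mat and the cat sat on the sofa"
tokens = text.lower().split()

# Count trigrams (n = 3); use n = 5 for five-word sequences
trigram_counts = Counter(ngrams(tokens, 3))

for gram, count in trigram_counts.most_common(3):
    print(" ".join(gram), count)
```

Changing the second argument of `ngrams` changes the sequence length, which corresponds to choosing the n-gram size in the tool.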
In AntConc, we can jump from an item in the list to its KWIC (keyword-in-context) view.
KWIC
P-frames
play: role, game, sports, etc.
You can search for collocations by entering a node word.
- Load the BROWN corpus.
- Open the Collocate tab.
- Enter play in the search window and hit Start.
Collocation search in AntConc
| Option name | Description |
|---|---|
| Window Span | How many words to the left and right of the node word are considered as candidate collocates. |
| Min. Freq | The minimum number of times the collocation must occur. |
| Min. Range | The minimum number of documents the collocation must occur in. |
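The three options can be illustrated with a small Python sketch of a window-based collocate search. This is not AntConc's implementation: the function name `collocates`, the symmetric window, and the two toy documents are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def collocates(documents, node, span=5, min_freq=1, min_range=1):
    """Count words within `span` tokens to the left or right of `node`.

    Returns {word: (frequency, range)}, keeping only words that meet
    both the minimum frequency and the minimum document range.
    """
    freq = Counter()
    doc_range = defaultdict(set)
    for doc_id, tokens in enumerate(documents):
        for i, token in enumerate(tokens):
            if token != node:
                continue
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            for word in left + right:
                freq[word] += 1
                doc_range[word].add(doc_id)
    return {
        word: (count, len(doc_range[word]))
        for word, count in freq.items()
        if count >= min_freq and len(doc_range[word]) >= min_range
    }


# Two toy documents (hypothetical), already tokenized
docs = [
    "children play a game and play a role".split(),
    "they play a game every day".split(),
]

result = collocates(docs, "play", span=2, min_freq=2, min_range=2)
print(result)
```

With a span of 2, only `a` and `game` occur at least twice and in both documents, so the other candidates are filtered out. Raising Min. Freq or Min. Range shortens the list in the same way.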
frequency tab
frequency list
vlookup.
retrieve frequency
Observed frequency
Enter O11
Now we will enter the formula for the expected frequency.
\[E_{11} = {(\text{freq}_{\text{node}} \times \text{freq}_{\text{collocate}}) \over \text{Corpus size}}\]
expected-frequency
Finally, we will enter the following formula.
\[MI = \log_2 {\text{Observed frequency} \over \text{Expected frequency}}\]
Calculating MI
AntConc MI
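The expected frequency also needs to account for the window size.
\(E_{11} = {(\text{freq}_{\text{node}} \times \text{freq}_{\text{collocate}} \times \color{red}{\text{window size}}) \over \text{Corpus size}}\)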
Let’s fix the expected frequency count.
Fixed expected frequency
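The effect of the window-size correction on MI can be sketched in Python. All counts below are hypothetical, and treating the window size as the total number of positions searched (5 left + 5 right = 10) is an assumption; check how your tool defines it before comparing scores.

```python
import math


def expected_frequency(freq_node, freq_collocate, corpus_size, window_size=1):
    """Expected co-occurrence count; window_size=1 gives the uncorrected formula."""
    return freq_node * freq_collocate * window_size / corpus_size


def mutual_information(observed, expected):
    """MI = log2(observed / expected)."""
    return math.log2(observed / expected)


# Hypothetical counts for illustration
freq_node = 500
freq_collocate = 500
corpus_size = 1_000_000
observed = 20
window_size = 10  # assumption: 5 words left + 5 words right

e_simple = expected_frequency(freq_node, freq_collocate, corpus_size)
e_window = expected_frequency(freq_node, freq_collocate, corpus_size, window_size)

mi_simple = mutual_information(observed, e_simple)
mi_window = mutual_information(observed, e_window)

print(e_simple, round(mi_simple, 2))
print(e_window, round(mi_window, 2))
```

A larger window raises the expected frequency, so the corrected MI is lower than the uncorrected one for the same observed count. This is one reason a hand calculation can disagree with the tool's score.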
Collocation
Replicability is key to science.
For example, you might say:
e.g., I used AntConc version 4.2.x
e.g., MI was calculated using the following formula:
R1 = Frequency of node word
C1 = Frequency of collocate
\(MI = \log_2 {\text{Observed frequency} \over \text{Expected frequency}}\)
\(\text{T-score} = {\text{Observed} - \text{Expected} \over \sqrt{\text{Observed}}}\)
\(\text{log Dice} = 14 + \log_2( {{2 \times \text{Observed}} \over {R_1 + C_1}})\)
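The T-score and log Dice formulas above translate directly into Python. This is a minimal sketch with hypothetical counts; the function names are illustrative, and `r1` and `c1` stand for R1 (frequency of the node word) and C1 (frequency of the collocate) as defined above.

```python
import math


def t_score(observed, expected):
    """T-score = (observed - expected) / sqrt(observed)."""
    return (observed - expected) / math.sqrt(observed)


def log_dice(observed, r1, c1):
    """log Dice = 14 + log2(2 * observed / (R1 + C1))."""
    return 14 + math.log2(2 * observed / (r1 + c1))


# Hypothetical counts for illustration
observed = 20
expected = 2.5
r1 = 500  # frequency of node word
c1 = 500  # frequency of collocate

t = t_score(observed, expected)
ld = log_dice(observed, r1, c1)

print(round(t, 3))
print(round(ld, 3))
```

Note that log Dice uses no expected frequency and no corpus size, only the observed co-occurrence count and the two word frequencies, whereas the T-score depends on how the expected frequency was calculated.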
Linguistic Data Analysis I