Frequency list
By the end of this session, you will be able to:
- Compute the frequency of a single-word lexical item in reference corpora
- Derive a vocabulary frequency list using concordancing software (e.g., AntConc)
- Apply tokenization to a Japanese-language corpus for frequency analysis
- Conduct lexical profiling using a web or desktop application (e.g., AntWordProfiler)
AntConc
Now, let’s load a corpus.
Let’s now create a word frequency list from a corpus.
Select the Word analysis option.
Set Min. Freq and Min. Range.
Click Start. To save the current results as a list, use the File menu.
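The steps above can be sketched in code. This is a minimal illustration of what a word-frequency tool computes: frequency (total count across the corpus) and range (the number of files a word appears in), filtered by thresholds like AntConc’s Min. Freq and Min. Range. The corpus texts and threshold values are made-up stand-ins, not real settings.

```python
from collections import Counter

# Stand-in corpus: filename -> text (real corpora would be loaded from disk)
corpus = {
    "file1.txt": "the cat sat on the mat",
    "file2.txt": "the dog chased the cat",
}

freq, rng = Counter(), Counter()
for text in corpus.values():
    tokens = text.split()      # naive whitespace tokenization
    freq.update(tokens)        # frequency: total count across all files
    rng.update(set(tokens))    # range: at most one count per file

MIN_FREQ, MIN_RANGE = 2, 2     # illustrative thresholds
word_list = sorted(
    (w for w in freq if freq[w] >= MIN_FREQ and rng[w] >= MIN_RANGE),
    key=lambda w: -freq[w],
)
print(word_list)  # ['the', 'cat']
```

Note that frequency and range answer different questions: a word can be very frequent in a single file yet have a range of only 1, which is why both filters exist.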
Example: word frequencies in the BROWN corpus
Before counting words, the software must first tokenize texts into words.
English text
I am planning to eat Oysters after this intensive course.
Japanese text
この短期集中講座が終わったら、カキを食べたいと思っています。
Tokenization = segmenting running text into words
Tokenizing Asian languages, which do not mark word boundaries with spaces, requires more advanced statistical algorithms.
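A minimal illustration of why this is so: whitespace splitting yields usable tokens for the English example sentence, but returns the entire Japanese sentence as a single “token”, since Japanese writes no spaces between words.

```python
# English: word boundaries are (mostly) marked by spaces,
# so a simple split already yields usable tokens.
english = "I am planning to eat Oysters after this intensive course."
print(english.split())
# ['I', 'am', 'planning', 'to', 'eat', 'Oysters', 'after', 'this', 'intensive', 'course.']

# Japanese: no spaces between words, so splitting on whitespace fails;
# the whole sentence comes back as one string.
japanese = "この短期集中講座が終わったら、カキを食べたいと思っています。"
print(japanese.split())
# ['この短期集中講座が終わったら、カキを食べたいと思っています。']

# Proper segmentation requires a morphological analyzer
# (such as the one used by TagAnt), not plain string splitting.
```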
To tokenize Japanese text such as this, we can use TagAnt. It offers a choice of output formats.
TagAnt can do more than tokenization.
It allows you to automatically annotate each token for Part-Of-Speech (POS).
POS = Grammatical category of lexical items (NOUN, VERB, etc.)
You can ask TagAnt for different output formats.
For now, let’s choose word+POS.
この短期集中講座が終わったら、カキを食べたいと思っています。
この_DET 短期_NOUN 集中_NOUN 講座_NOUN が_ADP 終わっ_VERB たら_AUX 、_PUNCT カキ_NOUN を_ADP 食べ_VERB たい_AUX と_ADP 思っ_VERB て_SCONJ い_VERB ます_AUX 。_PUNCT
| Display Information | Example |
|---|---|
| word | カキ を 食べ たい |
| word+pos | カキ_NOUN を_ADP 食べ_VERB たい_AUX |
| word+pos_tag | カキ_名詞-普通名詞-一般 を_助詞-格助詞 食べ_動詞-一般 たい_助動詞 |
| word+lemma | カキ_カキ を_を 食べ_食べる たい_たい |
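One practical use of the word+pos format is filtering a frequency count by grammatical category. The sketch below splits each TagAnt-style `word_POS` token on its final underscore and keeps only the nouns; the tagged sentence is the example from above.

```python
# TagAnt-style word_POS output for the example sentence
tagged = ("この_DET 短期_NOUN 集中_NOUN 講座_NOUN が_ADP 終わっ_VERB "
          "たら_AUX 、_PUNCT カキ_NOUN を_ADP 食べ_VERB たい_AUX "
          "と_ADP 思っ_VERB て_SCONJ い_VERB ます_AUX 。_PUNCT")

# Split each token on the LAST underscore, so words containing
# underscores would still be handled correctly.
pairs = [tok.rsplit("_", 1) for tok in tagged.split()]
nouns = [word for word, pos in pairs if pos == "NOUN"]
print(nouns)  # ['短期', '集中', '講座', 'カキ']
```

Filtering like this also lets you exclude punctuation (`PUNCT`) and particles (`ADP`) before compiling a frequency list.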
Instruction
Compile a Japanese frequency list based on a corpus.
Use the Aozora 500 corpus from Google Drive. Submit your list as a .tsv or .txt file.
Success Criteria
Your submission …
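For the submission format, here is a minimal sketch of writing a frequency list to a .tsv file with Python’s standard `csv` module. The token list is a made-up stand-in for real tokenized output, not actual Aozora 500 data.

```python
import csv
from collections import Counter

# Hypothetical tokenized text (stand-in for real tokenizer output)
tokens = ["カキ", "を", "食べ", "たい", "カキ", "を"]
freq = Counter(tokens)

with open("freq_list.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["rank", "word", "frequency"])
    for rank, (word, n) in enumerate(freq.most_common(), start=1):
        writer.writerow([rank, word, n])
```

Using `encoding="utf-8"` matters here: Japanese text saved in a legacy encoding may not open correctly in other tools.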
Because time may be limited, we will come back to this at the end of Session 6 if time allows.
Linguistic Data Analysis I