Session 11: Hands-on Activity

Masaki EGUCHI, Ph.D.

Housekeeping

Preparation

  • Create corpus_data under MyDrive
  • Put ICNALE corpus under MyDrive/corpus_file

🎯 Learning Objectives

By the end of this session, students will be able to:

  • Understand NLP tasks such as POS tagging and dependency parsing
  • Understand how automated parsing works
  • Conduct multi-lingual Part-Of-Speech (POS) tagging using TagAnt
  • Conduct POS tagging using spaCy library in Python (through Google Colab)
  • Conduct Dependency parsing using spaCy library in Python (through Google Colab)

Hands-on Activity

Task 1: POS tagging with TagAnt

Task 2: POS-sensitive frequency list

Task 3: Understanding dependency grammar through visualization

POS tagging with TagAnt

Tagging with TagAnt

  1. Open TagAnt
  2. Select Input Files
  3. Select Language
  4. Select Display information (see next)

Display setting info in TagAnt

Followings are basic selection in TagAnt.

Menu Function Example
word tokenization dogs, ran
pos POS tag (simple) NOUN, VERB
pos_tag POS tag (detailed) NNS, VBD
lemma lemmatized word dog, run

Now try different options…

  • Let’s analyze your own example with the following:
    • word+pos
    • word+pos_tag
    • word+lemma
    • word+pos+lemma
  • What do these choice give you? Share it with your neighbor.

Other choices and expected results

Display Information Example
word カキ を 食べ たい
word+pos カキ_NOUN を_ADP 食べ_VERB たい_AUX
word+pos_tag カキ_名詞-普通名詞-一般 を_助詞-格助詞 食べ_動詞-一般 たい_助動詞
word+lemma カキ_カキ を_を 食べ_食べる たい_たい

Task 1: Annotating Japanese text (10 mins)

  • Annotate 50 Japanese text files with TagAnt.
  • Create frequency list for aozora_50

Task 1: Answer

Before

「大溝」

 僕は本所界隈のことをスケツチしろといふ社命を受け、同じ社のO君と一しよに久振りに本所へ出かけて行つた。

After

「_補助記号-括弧開_「 大溝_名詞-固有名詞-人名-姓_大溝 」_補助記号-括弧閉_」

 _SPACE_  僕_代名詞_僕 は_助詞-係助詞_は 本所_名詞-固有名詞-地名-一般_本所 界隈_名詞-普通名詞-一般_界隈 の_助詞-格助詞_の こと_名詞-普通名詞-一般_こと を_助詞-格助詞_を スケツチ_名詞-普通名詞-一般_スケツチ しろ_動詞-非自立可能_する と_助詞-格助詞_と いふ_動詞-一般_いふ 社命_名詞-普通名詞-一般_社命 を_助詞-格助詞_を 受け_動詞-一般_受ける 、_補助記号-読点_、 同じ_連体詞_同じ 社_名詞-普通名詞-助数詞可能_社 の_助詞-格助詞_の O_名詞-普通名詞-一般_o 君_接尾辞-名詞的-一般_君 と_助詞-格助詞_と 一しよ_名詞-普通名詞-サ変可能_一しよ に_助詞-格助詞_に 久_形容詞-一般_久い 振り_接尾辞-名詞的-一般_振り に_助詞-格助詞_に 本所_名詞-固有名詞-地名-一般_本所 へ_助詞-格助詞_へ 出_動詞-一般_出る かけ_動詞-非自立可能_かける て_助詞-接続助詞_て 行つ_動詞-一般_行ふ た_助動詞_た 。_補助記号-句点_。

Task 2: Frequency-list by POS tags (10 mins)

  • Using AntConc, create following frequency lists:
    • Create a frequency list of 動詞-非自立可能
  • If you are done, please create another frequency list with different search terms.

Task 2: Key

*_動詞-非自立可能

非自立可能動詞

Advanced options in TagAnt

  • In TagAnt, you can download models for other languages.

loading other models

Any questions?

  • Now you can parse multilingual text with TagAnt.

Understanding dependency grammar

Goals

  • Describe grammatical structure of a simple sentence using terminology such as ROOT,head, dependency type, and dependent.

Dependency grammar (係受け)

  • Dependency grammar is particular type of syntactic tree.
  • forms a tree by defining binary relations between running tokens.
  • Each token in the sentence is governed by one token (i.e., head)
  • The highest in the syntactic tree is termed as ROOT

Dependency grammar (係受け) - 2

  • There are a few different approaches to formalize dependency
    • Universal Dependency
    • Stanford Dependency
    • ClearNLP
    • etc.

Simple example

The following is a dependency for I play baseball.

simple-dependency

In table format

The same sentence, I play baseball can be expressed in the following format

tid token dep head
1 I nsubj 2
2 play ROOT
3 baseball dobj 2
4 . punct 2

This type of vertical format is often used to represent multi-layered token information.

Typical dependency labels

Note that spaCy English model is trained on ClearNLP tag set

Core Dependency relations

Nominals Clauses Modifier words Function Words
Core arguments nsubj, obj, iobj csubj, ccomp, xcomp
Non-core dependents obl, vocative, expl, dislocated advcl advmod, discourse aux, cop, mark
Nominal dependents nmod, appos, nummod acl amod det, clf, case

Additional Relations

Coordination Headless Loose Special Other
conj, cc fixed, flat list, parataxis compound, orphan, goeswith, reparandum punct, root, dep

Example 1

xcomp

Example 2

conj

Some excercise - Problem 1

Try filling in the gap in the following table.

tid token dep head
1 I
2 love ROOT
3 beef
4 tongue
5 . punct 2

Some excercise - Problem 2

Try filling in the gap in the following table.

tid token dep head
1 The
2 cat
3 sleeps ROOT
4 on
5 the
6 mat
7 . punct 3

Some excercise - Problem 3

Try filling in the gap in the following table.

tid token dep head
1 She
2 quickly
3 reads ROOT
4 interesting
5 books
6 . punct 3

Using dependency parsers to identify grammatical dependency

About Google Colab (5 mins)

Python basics (15 mins)

  1. First thing is to load the package.
import spacy
nlp = spacy.load("en_core_web_sm")
  1. Then you will define a variable example_text
example_text = "Hi. This is my first awesome sentence to analyze."
  1. Analyze this using spacy
doc = nlp(example_text)

Result of your first text analysis

  1. Let’s print analysis results
for token in doc:
    print(token.text, token.pos_, token.tag_, sep="\t")

In what way are these sentence complex?

Describe complexification strategies:

  • She hopes to join an international research team after graduation.
  • Experts agree that collaboration improves problem-solving efficiency.
  • Students often struggle because they lack sufficient guidance.
  • He succeeded in the most demanding and competitive program at the university.
  • The growing influence of social media on youth behavior is concerning.
  • Policies that encourage innovation are essential for economic growth.

Let’s parse the sentence.

  • Visit our webapp

  • Try the sentences above and analyze their dependencies

Dependency-based collocation