Session 12: Hands-on Activity

Masaki EGUCHI, Ph.D.

Session overview

🎯 Learning Objectives

By the end of this session, students will be able to:

  • Define extraction rules to identify fine-grained grammatical features in language
  • Conduct analysis using a template Python code or web application provided by the instructor.

In what ways are these sentences complex?

Describe complexification strategies:

  • She hopes to join an international research team after graduation.
  • Experts agree that collaboration improves problem-solving efficiency.
  • Students often struggle because they lack sufficient guidance.
  • He succeeded in the most demanding and competitive program at the university.
  • The growing influence of social media on youth behavior is concerning.
  • Policies that encourage innovation are essential for economic growth.

Let’s parse the sentence.

  • Visit our webapp

  • Try the sentences above and analyze their dependencies

Dependency collocation

Dependency-based collocation

Assignment 4

In assignment 4, you will conduct a grammatical analysis on a corpus, combining a POS tagger and a dependency parser.

You will be able to:

  • extract fine-grained grammatical features from either a Japanese or an English corpus.
  • write a short report describing the results and your interpretation of the analysis.

Let’s start

Colab Notebook

Introduction

  • The notebook is a very basic version of what TAASSC can do for you (Kyle & Crossley, 2018).
  • It is meant for educational use; for research, a more rigorous approach may be needed to identify T-units more consistently.

Algorithm used in the notebook

In this notebook, the following analysis pipeline is implemented for you.

  • Your input is the file path to your corpus files.
  • The current code loads the corpus files onto Colab.
  • It then iterates through the corpus files one by one.

Algorithm used in the notebook (cont’d)

  • Parse each sentence using spaCy
  • Conduct basic analysis (such as calculating the number of tokens, sentences, etc.)
  • Count the number of specific grammatical structures (MAIN FEATURE)
  • Store the results in a Python dictionary
  • After every corpus file is processed, it creates a dataset to export
  • You can export the results for further analysis

Let’s now work on the notebook

  • Let’s first identify adjectival modifiers (amod)

Coming up with extraction rules

From Table 5.1 in Durrant (2023, p. 102), pick one or two sentences.

  • Using the POS/dependency parsing option in the simple text analyzer, try to describe how to identify those structures.
  • For example, to identify a that-clause complement, you would first look for a ccomp and then check whether the ccomp has a mark child whose form is that.

spaCy token information

Some useful token attributes are the following:

code           what it does                                example
token.lemma_   lemmatized form                             be, child
token.pos_     simple POS (Universal Dependencies)         NOUN, VERB
token.tag_     fine-grained POS (Penn Treebank tagset)     NN, JJ, VB, VBZ
token.dep_     dependency label                            amod, advmod
token.head     the head token of the dependency            (a Token object)

Thinking grammatically

In pairs, brainstorm 3–5 grammatical constructions you would like to identify in your Corpus Lab.

  • Describe the grammatical feature
  • Give some examples that fall under the grammatical construction.
  • Explain why you are interested in it.

Corpus Lab 4

Overview

The final corpus lab is about syntactic features.

  • Using the Colab notebook and Python, you will extract fine-grained grammatical features that might distinguish scores in ICNALE.
  • You will extract the features in the Colab notebook.
  • Then you will export the dataset and plot the relationships with the ICNALE score.
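As an illustration of the export-and-plot step, assuming pandas is available. The file names and score values below are made-up toy numbers, not real ICNALE data:

```python
import pandas as pd

# Toy results: feature counts per file plus a hypothetical score column
rows = [
    {"file": "essay_a.txt", "n_amod": 12, "n_tokens": 240, "score": 2},
    {"file": "essay_b.txt", "n_amod": 30, "n_tokens": 310, "score": 4},
]
df = pd.DataFrame(rows)

# Normalize the raw count to a rate per 100 tokens before comparing scores
df["amod_per_100"] = df["n_amod"] / df["n_tokens"] * 100

df.to_csv("results.csv", index=False)            # export for further analysis
# df.plot.scatter(x="score", y="amod_per_100")   # plot against the score
print(df[["file", "amod_per_100"]])
```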

Task 1: Research questions, Hypotheses and Methods

In this task you will describe your research questions, hypotheses, and methods.

Research questions

  • Research questions should include:
    • type of features you are looking at (e.g., adverbial clauses)
    • situational variables that define your sub-corpora (e.g., grade, genre, proficiency)

Hypothesis

  • Your research hypothesis should:
    • describe your predictions in terms of:
      • quantitative trends of the feature in relation to the factor you are interested in.

Task 2: Definitions and operationalization of grammatical features to extract

  • You must describe the specific grammatical features that you plan to extract.
  • For example, for clausal features you need to specify whether you are interested in:
    • subordinate clauses or embedded clauses
    • a particular type of clause
  • Describe the rules used to identify the desired linguistic feature.
    • For example, you would specify amod as the dependency label to extract adjective + noun phrases.

Fine-grained Descriptive grammatical features

  • Once you have articulated the information above, you will conduct a search over the corpus.

  • You should use either the simple text analyzer or your own Colab notebook.

    • I will specify which option should be used by the time we start working on this assignment (this depends on your progress as a group).

Task 3: Results and interpretation

  • Provide the results of your corpus analysis in the way you think most effectively addresses your research questions. Make effective use of tables, plots, or other data-presentation techniques as you see fit.
  • Provide descriptive paragraphs that walk the reader through the results and how to interpret them.

Submission

  • A Word file (.docx) that addresses the requirements in written form (one or two pages, depending on your analysis results).
  • If you use Colab, a Google Colab notebook (.ipynb file) with your extraction code and results.

Success Criteria

Your submission …