Session 13: Special Session on LLM

Masaki EGUCHI, Ph.D.

Housekeeping

  • We will spend the afternoon focusing on the projects.

🎯 Learning Objectives

By the end of this session, students will be able to:

  • Describe, in general terms, how LLMs are trained and how they produce language.
  • Explain the benefits and drawbacks of using LLMs for linguistic annotation.
  • Demonstrate/discuss the potential impact of prompts on LLMs’ performance in linguistic annotation.
  • Design an experiment to investigate the accuracy of LLM output on a given annotation task.

Large Language Models

Chatting about ChatGPT

  • Have you ever used ChatGPT or any other kind of LLM?
  • What do you ask it to do?

Question!

What word would be suitable in the context?

Fill in the blank

How does it work?

  • (Large) Language Models are statistical models.

Predicting next word

How does it work?

  • Essentially, what it does is predict the next word given the preceding context.
  • They are gigantic models (e.g., 3-70B parameters; parameters are roughly analogous to neurons).

The picture is taken from the following NVIDIA blog.
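To make the “statistical model” idea concrete, here is a minimal sketch (not from the session materials) of a toy count-based language model that predicts the next word from the preceding one; real LLMs replace the counting with a gigantic neural network, but the prediction task is the same.

from collections import Counter, defaultdict

# Tiny illustrative corpus (invented for this example)
corpus = "the dog chased the cat and the dog ate the chocolate".split()

# Count which word follows which (bigram counts)
following = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    following[prev_word][next_word] += 1

def predict_next(word):
    """Return the most likely next word given the preceding word."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))   # 'dog' (the most frequent continuation of 'the')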

How is it trained?

Why is it so intelligent…?

  • Gigantic neural networks (1B-70B parameters)
  • Gigantic corpora (over 500 GB-2 TB of text; roughly 35,000 times bigger than the Brown corpus)
    • meaning the model sees roughly 300,000 different contexts of chocolate.
    • dog: 79 occurrences × 35,000 ≈ 2,765,000 occurrences
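Making the scaling explicit (the per-corpus counts are the approximate figures used on this slide, not exact corpus statistics):

\(\text{occurrences in LLM training data} \approx \text{occurrences in the Brown corpus} \times 35{,}000\)

\(\textit{dog}: \; 79 \times 35{,}000 \approx 2{,}765{,}000\)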

Prompt strategies

  • Essentially, an LLM produces the next token given the preceding context.
  • Looking at LLMs this way, the prompt (the written instruction) conditions the LLM to perform certain tasks.

→ Effective prompting approaches may help LLMs achieve higher performance

In-context learning

In-context learning refers to the use of examples embedded in the prompt as a way to augment the LLM’s responses.

  • Zero-shot: prompt without examples
  • One-shot: prompt with one example
  • Few-shot: prompt with a few examples

→ In-context learning, or few-shot prompting, can help us achieve better results without further “training” of the model.
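A minimal sketch of what zero-shot vs. few-shot prompts can look like when built programmatically; the task, labels, and example sentences below are invented for illustration.

# Hypothetical annotation task: label a sentence's rhetorical move (Move 1-3)
instruction = "Label the rhetorical move (Move 1, Move 2, or Move 3) of the sentence."

# Zero-shot: instruction only, no examples
zero_shot = f"{instruction}\n\nSentence: {{target_sentence}}\nMove:"

# Few-shot: the same instruction plus a few labeled examples embedded in the prompt
examples = [
    ("Second language writing has attracted growing attention.", "Move 1"),
    ("However, little is known about rhetorical moves in student texts.", "Move 2"),
    ("This study therefore examines move structure in 50 essays.", "Move 3"),
]
example_block = "\n".join(f"Sentence: {s}\nMove: {m}" for s, m in examples)
few_shot = f"{instruction}\n\n{example_block}\n\nSentence: {{target_sentence}}\nMove:"

print(few_shot.format(target_sentence="Previous studies have focused mainly on expert writing."))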

LLMs’ strengths and weaknesses in the applied linguistics domain

  • Scoring may not be a strength (Mizumoto & Eguchi, 2023)
  • However, LLMs may do better at annotation:
    • Move-step analysis (Kim & Lu, 2024)
    • Linguistic accuracy analysis (Mizumoto et al., 2024)

Annotating move-steps in academic paper introductions

Kim & Lu (2024) examined the use of GPT to annotate discourse moves.

What are discourse moves and steps?

  • Conventional rhetorical stages in writing:
Move     Description
-------  -------------------------------------------
Move 1   Establishing a research territory
Move 2   Establishing a niche
Move 3   Presenting the present work via some steps

Kim & Lu’s validation process


Precision, Recall, and F1

Remember COVID test…

  • You have True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
                    \(Yes_{pred}\)    \(No_{pred}\)
------------------  ----------------  ---------------
\(Yes_{true}\)      TP                FN
\(No_{true}\)       FP                TN

Precision, Recall, and F1

  • Precision = \(\frac{TP}{TP + FP}\)
    • (How many of the predicted YESes are actually YES?)
  • Recall = \(\frac{TP}{TP + FN}\)
    • (How many of the actual YESes are retrieved?)
  • F1 score = \(\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
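A minimal sketch of these definitions in code; the counts in the example call are made up.

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 40 true positives, 10 false positives, 20 false negatives
print(precision_recall_f1(tp=40, fp=10, fn=20))   # (0.8, 0.666..., 0.727...)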

Result - Move

Almost perfect when fine-tuned

Result - Step

Step analysis can be difficult

Mizumoto et al. (2024)

RQ1: To what extent is the accuracy evaluation by ChatGPT comparable to that conducted by human evaluators?

RQ2: How does the accuracy evaluation by ChatGPT compare to Grammarly?

Corpus

  • the Cambridge Learner Corpus First Certificate in English (CLC FCE) dataset (Yannakoudakis et al., 2011).
  • Freely available corpus with error tags and metadata:
    • L1 (first language)
    • age
    • the overall score

Why this corpus?

  • One of the few openly available error-tagged corpora.
  • This makes it a “gold standard” for evaluating accuracy.

Error tagged corpus

  • 80 types of errors were tagged
No.   Code    Error Type
----  ------  ---------------------
6     <AGV>   Verb agreement
7     <AS>    Argument structure
10    <CL>    Collocation
11    <CN>    Noun countability
28    <FV>    Verb form
31    <ID>    Idiom
37    <L>     Register
64    <TV>    Verb tense
76    <W>     Word order
61    <S>     Spelling (non-word)

XML format

  • The corpus is provided in XML (Extensible Markup Language) format
<?xml version="1.0" encoding="UTF-8"?>
<learner><head sortkey="TR160*0100*2000*01">
  <candidate><personnel><language>Chinese</language><age>21-25</age></personnel><score>17.0</score></candidate>
  <text>
     <answer1>
       <question_number>1</question_number>
       <exam_score>2.2</exam_score>
         <coded_answer>
          <p>Dear <NS type="RN"><i>Madam</i><c>Ms</c></NS> Helen Ryan,</p>
          ... (More text here)
         </coded_answer>
     </answer1>
  </text>
</head></learner>
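A minimal sketch (not the authors’ code) of how such a file could be read with Python’s standard xml.etree.ElementTree, using a trimmed-down version of the structure shown above; the element names follow this excerpt, and the full CLC FCE schema may differ.

import xml.etree.ElementTree as ET

# Trimmed-down example mirroring the structure in the excerpt above
xml_string = """<learner><head sortkey="TR160*0100*2000*01">
  <candidate><personnel><language>Chinese</language><age>21-25</age></personnel><score>17.0</score></candidate>
  <text><answer1><question_number>1</question_number><exam_score>2.2</exam_score>
    <coded_answer><p>Dear <NS type="RN"><i>Madam</i><c>Ms</c></NS> Helen Ryan,</p></coded_answer>
  </answer1></text>
</head></learner>"""

root = ET.fromstring(xml_string)

# Learner metadata
print(root.find(".//language").text)   # Chinese
print(root.find(".//age").text)        # 21-25
print(root.find(".//score").text)      # 17.0

# Error annotations: each <NS> holds the original (<i>) and corrected (<c>) form
for ns in root.iter("NS"):
    print(ns.get("type"), ns.findtext("i"), "->", ns.findtext("c"))   # RN Madam -> Ms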

Participants

  • 232 learners
    • 86 Korean
    • 80 Japanese
    • 66 Chinese

Text cleaning

  • They focused on Task 1 (letter writing)
  • Extracted each sentence as one row in the dataset
  • Removed metadata and error tags (see the sketch below)
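One way the error tags could be stripped while keeping the learner’s original wording; this is a sketch of the cleaning step, not the authors’ actual pipeline, and it assumes un-nested <NS> annotations as in the excerpt above.

import xml.etree.ElementTree as ET

def learner_text(p_element):
    """Rebuild the learner's original sentence from a coded <p> element."""
    parts = [p_element.text or ""]
    for ns in p_element:                       # each <NS> error annotation
        parts.append(ns.findtext("i") or "")   # keep the original (uncorrected) form, drop <c>
        parts.append(ns.tail or "")            # text that follows the annotation
    return "".join(parts)

p = ET.fromstring('<p>Dear <NS type="RN"><i>Madam</i><c>Ms</c></NS> Helen Ryan,</p>')
print(learner_text(p))   # Dear Madam Helen Ryan,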

The dataset

The dataset looked as follows:

  • a hand-tagged (error-annotated) dataset
  • a clean dataset without any corrections

Dataset view

Measures for accuracy

  • Counted errors and words from the human-annotated corpus
  • Accuracy is operationally defined as errors per 100 words
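As a formula, the operational definition above is:

\(\text{errors per 100 words} = \frac{\text{number of errors}}{\text{number of words}} \times 100\)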

Evaluation of GPT-4 correction

  • Sent each sentence to GPT-4 with the following prompt:
Reply with a corrected version of the input sentence with all grammatical, spelling, and punctuation errors fixed. Be strict about the possible errors. If there are no errors, reply with a copy of the original sentence. Then, provide the number of errors. 

Input sentence: {each line of sentence} 

Corrected sentence: 
Number of errors:
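A minimal sketch of how each sentence could be sent to the model with this prompt, using the OpenAI Python client; the model name, temperature, and response handling here are assumptions, not details reported in the paper.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "Reply with a corrected version of the input sentence with all grammatical, "
    "spelling, and punctuation errors fixed. Be strict about the possible errors. "
    "If there are no errors, reply with a copy of the original sentence. "
    "Then, provide the number of errors.\n\n"
    "Input sentence: {sentence}\n\n"
    "Corrected sentence:\nNumber of errors:"
)

def correct_sentence(sentence, model="gpt-4"):
    """Send one learner sentence to the model and return its raw reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # assumption: deterministic output for evaluation
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(sentence=sentence)}],
    )
    return response.choices[0].message.content

print(correct_sentence("I look forward to hear from you."))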

Distribution of errors per 100 words

Errors per 100 words

Correlations among human, GPT, and Grammarly

Correlations

Summary of findings

  • Correlation with manual error tag: GPT > Grammarly
  • Correlation with proficiency score: GPT > Grammarly

Take-away

  • Corpus research requires careful cleaning of the corpus data.

Let’s try