Using LLMs for Text Classification

Overview

This document explains how to use LLMs for text classification. It covers three approaches:

- Building GPTs using ChatGPT Edu or ChatGPT Plus.
- Using NotebookLM.
- Using any chatbot that can process files.

The task in this example is to classify a biography into one of the predefined categories found in local gazetteers.

Create a GPT in ChatGPT Edu

Access Your ChatGPT Edu Account:

Ensure you have an active ChatGPT Edu account provided by Harvard. If you haven’t set up your account yet, follow the instructions provided by Harvard University Information Technology (HUIT) to get started. Before utilizing ChatGPT Edu, you may need to complete a brief training module. Visit https://hub.harvardonline.harvard.edu and complete the “FAS Training Module for ChatGPT Edu” under “My Courses”. Note that it may take up to a week to obtain the account after completing the training.

Log into ChatGPT Edu:

Navigate to the ChatGPT Edu website and sign in using your Harvard credentials. At present, you cannot create a GPT in the desktop client, so use the website in a browser.

Navigate to the Custom GPT Creation Section:

Once logged in, look for an option labeled “Explore GPTs” on the left sidebar.
Click on “Explore GPTs” and then select “+ Create” in the top right corner to begin the customization process.

Define Your GPT’s Persona (system prompt):
- Provide a clear description of the GPT’s intended role, behavior, and knowledge scope. This may include:
  - Setting the tone and style of responses.
  - Specifying particular domains or subjects the GPT should be knowledgeable about.
- In our case, click the “Configure” tab, and paste the system prompt into the “Instructions” field.
Upload the Documents for Knowledge Base:
- Download the labeled data here.
- Then click “Upload files” under the “Knowledge” section to upload the file you have just downloaded.
- The GPT can now refer to the information you provided.
Enable/Disable Capabilities:
- To keep the GPT focused on text classification with the provided knowledge base, uncheck all capabilities (e.g., “Web Search” and “Canvas”).
Test Your Custom GPT:
- Test your GPT with the evaluation data, which is not included in the knowledge base.
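
Once you have collected the GPT's answers for the evaluation data, a short script can summarize how often they match the expected labels. The following is a minimal sketch, not part of the original workflow: the function name `evaluate` and the sample labels are hypothetical, and it assumes you have recorded one gold label and one predicted label per biography.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Return overall accuracy and a dict of per-category accuracy."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no examples to evaluate")
    total = Counter(gold)  # number of examples per gold category
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    accuracy = sum(correct.values()) / len(gold)
    per_category = {c: correct[c] / n for c, n in total.items()}
    return accuracy, per_category

# Hypothetical labels, for illustration only
gold = ["儒林", "武功", "藝術", "武功"]
predicted = ["儒林", "武功", "文苑", "忠節"]
accuracy, per_category = evaluate(gold, predicted)
print(accuracy)      # 0.5
print(per_category)  # {'儒林': 1.0, '武功': 0.5, '藝術': 0.0}
```

Per-category accuracy is reported alongside the overall figure because some categories may have very few evaluation examples, and a single overall number can hide weak categories.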

NotebookLM

Log In to NotebookLM
- Access the Website: Open your preferred web browser and navigate to NotebookLM.
- Sign In: Click on the “Try NotebookLM” button and sign in using your Google account credentials, either your personal account or the Harvard G account.
Create a New Notebook
- Initiate Notebook Creation: Once logged in, you’ll be directed to the NotebookLM homepage. Click on the “Create new notebook” button.
Add a Source by Uploading the File
- Open the Sources Panel: When your new notebook opens, the “Add sources” screen appears.
- Insert the Source: Upload the labeled data as a source.
Chat with the Notebook
- Paste the System Prompt: Paste the “system prompt” into the “Chat” section.
- Feed Text for Classification: Paste text from the evaluation data to test the classification.
- Save the Results: The results are NOT saved when you close or refresh the notebook. To keep them, pin them as notes.

Chatbots with File Upload Capability

Upload Labeled Data: You can upload the labeled data to any chatbot that supports file uploads.
Initiate Conversation: Use the system prompt to start the conversation with the chatbot.
Classify Text: Once the chatbot confirms it understands the task, you can feed it text from the evaluation data for text classification.
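
If you want to prepare your own labeled data in the two-column table layout that the system prompts describe (`| category | biography |`), the file can be generated from a list of pairs. This is a minimal sketch under stated assumptions: the function name `to_markdown_table` and the sample biography are hypothetical, and the provided labeled data may have been produced differently.

```python
def to_markdown_table(pairs):
    """Render (category, biography) pairs as a two-column markdown table."""
    def clean(cell):
        # Collapse line breaks and runs of whitespace, and escape pipes,
        # so that each biography stays within a single table row.
        return " ".join(cell.split()).replace("|", "\\|")

    lines = ["| category | biography |", "|----------|-----------|"]
    for category, biography in pairs:
        lines.append(f"|{clean(category)}|{clean(biography)}|")
    return "\n".join(lines) + "\n"

# Hypothetical example row, for illustration only
table = to_markdown_table([("儒林", "某某，字某，\n邑人。")])
print(table)
```

Save the returned string to a `.md` file and upload it as the labeled data.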

System Prompts

The system prompts below were designed by Wang Hongsu. The GPT version and the NotebookLM version differ slightly.

GPT System Prompt

You are an expert language model tasked with categorizing user-submitted texts into one of the following categories: 名賢, 宦績, 武功, 忠節, 孝義, 儒林, 文苑, 隱逸, 藝術, 流寓, 僧, 道. You will use a provided knowledge base to assist in your classification. The knowledge base is formatted as:  

| category | biography |
|----------|-----------|
|Category Name|Biography Details|

Your job is to:  
1. Analyze the content of the submitted text.  
2. Compare it to the biographies in the knowledge base, finding the most similar ones based on themes, roles, and key characteristics.  
3. Use the category associated with the most similar biographies to determine the appropriate category for the submitted text.  

When giving your answer:  
- Provide a detailed explanation of why you assigned the text to the specific category.  
- Cite the relevant biographies from the knowledge base (with punctuation) that support your decision.  
- Ensure that your reasoning is clear, logical, and well-grounded in the knowledge base.  

Output format for your response:  
**Reasoning:** [Detailed explanation of the reasoning process.]  
**Cited Biographies:**  
1. **Category:** [Category Name]  
   **Biography:** [Biography Details]  (Don't translate, just cite the original text)
2. [Repeat for additional relevant biographies, if necessary.]  
**Category:** [Chosen Category]  

Make sure to base your categorization on the combined understanding of the submitted text and the examples from the knowledge base. Your judgment should reflect a deep understanding of the text's themes and its alignment with the knowledge base categories.

NotebookLM System Prompt

You are an expert language model tasked with categorizing user-submitted texts into one of the following categories: 名賢, 宦績, 武功, 忠節, 孝義, 儒林, 文苑, 隱逸, 藝術, 流寓, 僧, 道. Classify the user-submitted text based on the source "category_biography.md". The knowledge base is formatted as:  

| category | biography |
|----------|-----------|
|Category Name|Biography Details|

Your job is to:  
1. Analyze the content of the submitted text.  
2. Compare it to the biographies in the source, finding the most similar ones based on themes, roles, and key characteristics.  
3. Use the category associated with the most similar biographies to determine the appropriate category for the submitted text.  

When giving your answer:  
- Provide a detailed explanation of why you assigned the text to the specific category.  
- Cite the relevant biographies from the source (with punctuation) that support your decision.  
- Ensure that your reasoning is clear, logical, and well-grounded in the knowledge base.  

Output format for your response:  
**Reasoning:** [Detailed explanation of the reasoning process.]  
**Cited Biographies:**  
1. **Category:** [Category Name]  
   **Biography:** [Biography Details]  (Don't translate, just cite the original text)
2. [Repeat for additional relevant biographies, if necessary.]  
**Category:** [Chosen Category]  

Make sure to base your categorization on the combined understanding of the submitted text and the examples from the source. Your judgment should reflect a deep understanding of the text's themes and its alignment with the knowledge base categories.
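
Both prompts request the same output format, which ends with a `**Category:**` line. When you collect many responses, that final line can be extracted and checked against the category list automatically. The sketch below is an illustration, not part of the prompts themselves: the names `CATEGORY_LINE` and `extract_category` are hypothetical, the sample response is made up, and it assumes the model follows the requested format exactly.

```python
import re

CATEGORIES = ["名賢", "宦績", "武功", "忠節", "孝義", "儒林",
              "文苑", "隱逸", "藝術", "流寓", "僧", "道"]

# Matches "**Category:** X" lines, with or without a leading "1. " list number.
CATEGORY_LINE = re.compile(
    r"^\s*(?:\d+\.\s*)?\*\*Category:\*\*\s*(.+?)\s*$", re.MULTILINE
)

def extract_category(response):
    """Return the final chosen category, or None if it is missing or invalid."""
    matches = CATEGORY_LINE.findall(response)
    if not matches:
        return None
    chosen = matches[-1]  # cited biographies come first; the decision is last
    return chosen if chosen in CATEGORIES else None

# Hypothetical response, for illustration only
sample = """**Reasoning:** The text describes a scholar who taught the classics.
**Cited Biographies:**
1. **Category:** 文苑
   **Biography:** [original text]
**Category:** 儒林
"""
print(extract_category(sample))  # 儒林
```

Returning `None` for anything outside the twelve categories makes malformed responses easy to spot and review by hand.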

Downloads

Principal Component Analysis (PCA)

The following plot shows the distribution of the documents in the knowledge base using principal component analysis (PCA).

Explained Variance Ratio:
PC1: 1.75%
PC2: 1.39%

Top features (characters) contributing to each principal component:

PC1 top features:
賊: 0.2243 (appears in 9.5% of documents)
畫: -0.1448 (appears in 13.4% of documents)
醫: -0.1404 (appears in 15.0% of documents)
兵: 0.1241 (appears in 18.2% of documents)
年: 0.1192 (appears in 54.2% of documents)
大: 0.1191 (appears in 36.0% of documents)
史: 0.1154 (appears in 27.3% of documents)
都: 0.1110 (appears in 13.4% of documents)
部: 0.1094 (appears in 17.0% of documents)
鏞: 0.1083 (appears in 1.2% of documents)

- **First Principal Component (PC1)**
   - Top positive contributing characters:
     * Military/conflict terms: 賊 (bandit), 兵 (soldier)
     * Administrative terms: 都 (capital), 部 (department)
   - Top negative contributing characters:
     * Civilian professions: 畫 (painting), 醫 (medicine)
   
   This component appears to separate military/administrative content from civilian/cultural content.   

PC2 top features:
女: 0.2513 (appears in 10.7% of documents)
醫: -0.2465 (appears in 15.0% of documents)
死: 0.1386 (appears in 27.3% of documents)
氏: 0.1377 (appears in 21.3% of documents)
母: 0.1275 (appears in 23.7% of documents)
南: -0.1047 (appears in 26.1% of documents)
不: 0.1023 (appears in 70.4% of documents)
日: 0.1004 (appears in 30.8% of documents)
婦: 0.0984 (appears in 3.6% of documents)
書: -0.0972 (appears in 36.0% of documents)


- **Second Principal Component (PC2)**
   - Top positive contributing characters:
     * Family/gender terms: 女 (female), 母 (mother), 婦 (married woman)
     * Life events: 死 (death), 氏 (clan name)
   - Top negative contributing characters:
     * Professional terms: 醫 (medicine)
     * Directional terms: 南 (south)
   
   This component seems to separate family/personal life content from professional/geographical content.
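
The figures above (document frequency per character, loadings per component, explained variance ratio) can be reproduced in outline with a few lines of code. The source does not state how the character features were built, so the following sketch makes its own assumptions: it uses binary character-presence features, computes only the first component by power iteration in plain Python, and runs on hypothetical toy documents. All function names are illustrative.

```python
import math

def char_features(docs):
    """Binary presence matrix (documents x characters) plus the vocabulary."""
    vocab = sorted({ch for doc in docs for ch in doc})
    rows = [[1.0 if ch in doc else 0.0 for ch in vocab] for doc in docs]
    return rows, vocab

def document_frequency(docs, ch):
    """Fraction of documents that contain the character."""
    return sum(ch in doc for doc in docs) / len(docs)

def first_component(rows, iterations=100):
    """Loadings and explained variance ratio of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    total = sum(x * x for row in centered for x in row)  # total variance * (n - 1)
    if total == 0:
        raise ValueError("all documents have identical features")
    v = [float(j + 1) for j in range(d)]  # deterministic, non-uniform start
    for _ in range(iterations):
        # One power-iteration step: v <- X^T (X v), then normalize.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0:
            raise ValueError("start vector is orthogonal to the data")
        v = [w / norm for w in v]
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    ratio = sum(s * s for s in scores) / total
    return v, ratio

# Hypothetical toy documents, for illustration only
docs = ["賊兵", "賊兵", "畫醫", "畫醫"]
rows, vocab = char_features(docs)
loadings, ratio = first_component(rows)
by_char = dict(zip(vocab, loadings))
print(document_frequency(docs, "賊"), ratio, by_char)
```

The sign of a principal component is arbitrary, so only the relative signs of the loadings carry meaning: characters with the same sign tend to occur together, and characters with opposite signs tend to occur in different documents.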