Vitoria Lima

GPT-2 Fine-Tuning

Fine-tuning GPT-2 for dictionary-based language generation (Fall 2021)

Figure 1: A transformer layer

Overview

This project explores the fine-tuning of the GPT-2 model using Wiktionary data to enhance its ability to generate dictionary-style content. The goal is to adapt GPT-2 to model the relationship between words, definitions, and example usages effectively.

Introduction to GPT-2

GPT-2 is a unidirectional, causal language model pre-trained on a diverse dataset obtained from the internet. The core architecture of GPT-2 is based on the Transformer model, which employs self-attention mechanisms to process text sequences.

Key Components

Input Representation:

Each input token is mapped to the sum of a learned token embedding and a learned positional embedding, with dropout applied to the result.

Mathematically: \(h_0 = \text{Dropout}(U W_e + W_p)\)

where \(W_e\) is the token embedding matrix, \(W_p\) is the positional embedding matrix, and \(U\) is the one-hot encoded sequence.
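A minimal PyTorch sketch of this step is shown below. The hyperparameter values and variable names are illustrative (they follow GPT-2 small's published sizes), not taken from the project's code; note that indexing an embedding table is equivalent to multiplying the one-hot matrix \(U\) by \(W_e\).

```python
import torch
import torch.nn as nn

# Illustrative sizes: GPT-2 small uses 50257 tokens, 1024 positions, 768 dims
vocab_size, n_positions, n_embd = 50257, 1024, 768

token_emb = nn.Embedding(vocab_size, n_embd)   # rows of W_e
pos_emb = nn.Embedding(n_positions, n_embd)    # rows of W_p
dropout = nn.Dropout(p=0.1)

input_ids = torch.randint(0, vocab_size, (1, 16))          # a batch of 16 token ids
positions = torch.arange(input_ids.size(1)).unsqueeze(0)   # positions 0..15

# h_0 = Dropout(U W_e + W_p): embedding lookup plays the role of U W_e
h0 = dropout(token_emb(input_ids) + pos_emb(positions))
print(h0.shape)  # torch.Size([1, 16, 768])
```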

Transformer Blocks:

Each block applies masked multi-head self-attention followed by a position-wise feed-forward network, with layer normalization and residual connections around each sub-layer.

Self-Attention Mechanism:

The self-attention operation is defined as:

\[\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V\]

where \(Q\), \(K\), and \(V\) are the query, key, and value matrices, and \(d\) is the dimensionality of the key vectors.
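The sketch below implements this formula directly, with the causal mask that makes GPT-2 unidirectional: each position can only attend to itself and earlier positions. It is a single-head illustration of scaled dot-product attention, not the project's implementation.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with an optional causal mask."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # (..., seq_q, seq_k)
    if causal:
        # Mask out future positions so each token attends only to itself and the past.
        seq_q, seq_k = scores.shape[-2:]
        mask = torch.triu(torch.ones(seq_q, seq_k, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V

# Example: a single head with d = 64 over a sequence of 10 tokens
Q = torch.randn(1, 10, 64)
K = torch.randn(1, 10, 64)
V = torch.randn(1, 10, 64)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([1, 10, 64])
```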

Model Tuning and Evaluation

Dataset Preparation
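
The project's exact preprocessing pipeline is not reproduced here. As a hypothetical sketch, Wiktionary entries (word, definition, example usage) could be flattened into prompt-style training strings and tokenized with the GPT-2 tokenizer; the field labels and entry below are invented for illustration.

```python
from transformers import GPT2Tokenizer

# Hypothetical entry format; the actual Wiktionary preprocessing may differ.
entries = [
    {"word": "ephemeral",
     "definition": "lasting for a very short time",
     "example": "The ephemeral bloom faded by nightfall."},
]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def format_entry(e):
    # Flatten word / definition / example into a single training string.
    return (f"Word: {e['word']}\nDefinition: {e['definition']}\n"
            f"Example: {e['example']}{tokenizer.eos_token}")

texts = [format_entry(e) for e in entries]
encodings = tokenizer(texts, truncation=True, max_length=128,
                      padding="max_length", return_tensors="pt")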

Tuning Procedure
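
A minimal causal-language-modeling fine-tuning loop is sketched below using Hugging Face's `GPT2LMHeadModel`. The learning rate, epoch count, and training text are placeholders rather than the values used in the project.

```python
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

optimizer = AdamW(model.parameters(), lr=5e-5)

texts = ["Word: ephemeral\nDefinition: lasting for a very short time\n"
         "Example: The ephemeral bloom faded by nightfall."]

for epoch in range(3):
    for text in texts:
        batch = tokenizer(text, return_tensors="pt")
        # For causal LM fine-tuning the labels are the input ids themselves;
        # the model shifts them internally when computing the loss.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```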

Cosine Similarity for Evaluation

Cosine similarity was used to evaluate how closely the embeddings of generated text match the embeddings of the reference text:

\[\text{cosine similarity} = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|_2 \|\vec{y}\|_2}\]
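
In code, this is a one-line computation; the snippet below shows the formula explicitly and checks it against PyTorch's built-in. The 768-dimensional random vectors stand in for actual GPT-2 embeddings.

```python
import torch
import torch.nn.functional as F

def cosine_similarity(x, y):
    # cos(x, y) = (x . y) / (||x||_2 * ||y||_2)
    return torch.dot(x, y) / (torch.norm(x) * torch.norm(y))

x = torch.randn(768)  # e.g. an embedding of generated text
y = torch.randn(768)  # e.g. an embedding of reference text
print(cosine_similarity(x, y).item())
print(F.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0)).item())  # same value via the built-in
```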

Figure 2: Word embeddings plotted across 6 principal components

Figure 3: Word embeddings capturing semantic meanings in latent space

Findings

Conclusion

This project successfully fine-tuned GPT-2 for dictionary-based language generation. The Transformer architecture's self-attention mechanisms effectively captured the relationships between words, definitions, and example usages.

Resources

GitHub Repository
Full Analysis (PDF)