NLP From Scratch

NLP From Scratch: Classifying Names with a Character-Level RNN

我们将构建并训练一个基础的字符级循环神经网络（RNN）来实现词汇分类
字符级RNN将单词作为字符序列进行读取，在每一步输出预测结果和hidden state，并将其前一步的hidden state输入到下一个步骤。以最终的预测结果作为输出，即判断该词汇属于哪个类别
我们将使用来自18种语源的数千个姓氏进行训练，然后根据拼写来预测名字的来源语言

Preparing Torch

设置 torch 默认使用对应硬件所支持的设备（CPU 或 CUDA）以实现 GPU 加速

import torch

# Check if CUDA is available
device = torch.device('cpu')
if torch.cuda.is_available():
    device = torch.device('cuda')

torch.set_default_device(device)
print(f"Using device = {torch.get_default_device()}")

"""
out:
Using device = cuda:0
"""

Preparing the Data

data/names 目录中包含 18 个名为 [Language].txt 的文本文件。每个文件包含若干姓名，每行一个姓名，大多已罗马化（但我们仍需将 Unicode 转换为 ASCII）
第一步是定义和清理数据。首先需要将 Unicode 转换为纯 ASCII 以限制 RNN 输入层维度。通过将 Unicode 字符串转换为 ASCII 并仅允许保留特定字符集来实现

import string
import unicodedata

# We can use "_" to represent an out-of-vocabulary character, that is, any character we are not handling in our model
allowed_characters = string.ascii_letters + " .,;'" + "_"
n_letters = len(allowed_characters)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in allowed_characters
    )

print (f"converting 'Ślusàrski' to {unicodeToAscii('Ślusàrski')}")

"""
out:converting 'Ślusàrski' to Slusarski
"""

Turning Names into Tensors

现在我们已经整理好所有姓名，需要将其转换为 Tensor 才能进行后续处理
表示单个字母时，我们使用大小为 <1 x n_letters> 的 “one-hot vector”。one-hot vector 除当前字母索引位置为 1 外其余全为 0，例如 “b” = <0 1 0 0 0 …>
对于单词而言，我们将多个 one-hot vector 组合成二维矩阵 <line_length x 1 x n_letters>
额外的维度 1 是因为 PyTorch 默认所有输入都包含 batch 维度——这里我们使用的 batch size 为 1

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    # return our out-of-vocabulary character if we encounter a letter unknown to our model
    if letter not in allowed_characters:
        return allowed_characters.find("_")
    else:
        return allowed_characters.find(letter)

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print (f"The letter 'a' becomes {lineToTensor('a')}") #notice that the first position in the tensor = 1
print (f"The name 'Ahn' becomes {lineToTensor('Ahn')}") #notice 'A' sets the 27th index to 1

"""
out:
The letter 'a' becomes tensor([[[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]]], device='cuda:0')
The name 'Ahn' becomes tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]]], device='cuda:0')
"""

可以使用类似的方法处理其他基于文本的 RNN 任务
接下来，我们需要将所有样本整合为 dataset 以便训练、测试和验证模型
为此我们将使用 Dataset 和 DataLoader 类来存储数据。每个 Dataset 需要实现三个函数：init、len 和 getitem

from io import open
import glob
import os
import time

import torch
from torch.utils.data import Dataset

class NamesDataset(Dataset):

    def __init__(self, data_dir):
        self.data_dir = data_dir #for provenance of the dataset
        self.load_time = time.localtime #for provenance of the dataset
        labels_set = set() #set of all classes

        self.data = []
        self.data_tensors = []
        self.labels = []
        self.labels_tensors = []

        #read all the ``.txt`` files in the specified directory
        text_files = glob.glob(os.path.join(data_dir, '*.txt'))
        for filename in text_files:
            label = os.path.splitext(os.path.basename(filename))[0]
            labels_set.add(label)
            lines = open(filename, encoding='utf-8').read().strip().split('\n')
            for name in lines:
                self.data.append(name)
                self.data_tensors.append(lineToTensor(name))
                self.labels.append(label)

        #Cache the tensor representation of the labels
        self.labels_uniq = list(labels_set)
        for idx in range(len(self.labels)):
            temp_tensor = torch.tensor([self.labels_uniq.index(self.labels[idx])], dtype=torch.long)
            self.labels_tensors.append(temp_tensor)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data_item = self.data[idx]
        data_label = self.labels[idx]
        data_tensor = self.data_tensors[idx]
        label_tensor = self.labels_tensors[idx]

        return label_tensor, data_tensor, data_label, data_item

alldata = NamesDataset("data/names")
print(f"loaded {len(alldata)} items of data")
print(f"example = {alldata[0]}")

"""
out:
loaded 20074 items of data
example = (tensor([11], device='cuda:0'), tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]]], device='cuda:0'), 'Arabic', 'Khoury')
"""

使用 dataset 对象可以轻松将数据划分为 train 和 test sets。这里我们采用 80/20 的比例进行分割，但 torch.utils.data 还提供了更多实用工具
此处我们指定 generator 是因为需要使其与 PyTorch 默认使用的设备保持一致

train_set, test_set = torch.utils.data.random_split(alldata, [.85, .15], generator=torch.Generator(device=device).manual_seed(2024))

print(f"train examples = {len(train_set)}, validation examples = {len(test_set)}")

"""
out:
train examples = 17063, validation examples = 3011
"""

现在我们已构建包含 20074 个样本的基础 dataset，每个样本都是 label 和 name 的配对组合。同时已完成 training 和 testing 的数据集划分，以便后续验证所构建的模型

Creating the Network

在 autograd 之前，在 Torch 中构建 recurrent neural network 需要通过多个 timesteps 复制 layer 的参数。这些层原本需要维护的 hidden state 和 gradients 现在完全由 graph 自身处理
这意味着可以用非常”纯粹”的方式实现 RNN，就像常规的 feed-forward layers 一样
该 CharRNN 类实现了包含三个组件的 RNN。首先使用 nn.RNN 实现，接着定义将 RNN hidden layers 映射到输出的 layer，最后应用 softmax function
与将每个层实现为 nn.Linear 相比，使用 nn.RNN 能带来显著的性能提升（例如使用 cuDNN-accelerated kernels），同时简化了 forward() 的实现

import torch.nn as nn
import torch.nn.functional as F

class CharRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(CharRNN, self).__init__()

        self.rnn = nn.RNN(input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, line_tensor):
        rnn_out, hidden = self.rnn(line_tensor)
        output = self.h2o(hidden[0])
        output = self.softmax(output)

        return output

n_hidden = 128
rnn = CharRNN(n_letters, n_hidden, len(alldata.labels_uniq))
print(rnn)

"""
out:
CharRNN(
  (rnn): RNN(58, 128)
  (h2o): Linear(in_features=128, out_features=18, bias=True)
  (softmax): LogSoftmax(dim=1)
)
"""

此后我们可以将 Tensor 传递给 RNN 以获得 predicted output。随后使用辅助函数 label_from_output 来推导出对应类别的文本 label

def label_from_output(output, output_labels):
    top_n, top_i = output.topk(1)
    label_i = top_i[0].item()
    return output_labels[label_i], label_i

input = lineToTensor('Albert')
output = rnn(input) #this is equivalent to ``output = rnn.forward(input)``
print(output)
print(label_from_output(output, alldata.labels_uniq))

"""
out:
tensor([[-2.9734, -2.8501, -2.8939, -2.7652, -2.8237, -2.8254, -2.9046, -2.8312,
         -2.9726, -2.9509, -2.8488, -2.8630, -2.9266, -3.0637, -3.0039, -2.8271,
         -2.8252, -2.9285]], device='cuda:0', grad_fn=<LogSoftmaxBackward0>)
('Polish', 3)
"""