The LLaMA3 Model

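  • LLaMA3 was pretrained on a curated corpus of more than 15T tokens, about seven times the size of LLaMA2's. The resulting jump in performance shows how much the data matters.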

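  • The corpus contains a large amount of code to strengthen logical reasoning, and more than 5% of it is non-English data covering over 30 languages, which markedly broadens the model's cross-lingual ability.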

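  • Like LLaMA2, LLaMA3 was also tuned with reinforcement learning from human feedback (RLHF), a strategy that has been shown to improve model performance markedly.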

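  • Architecturally, LLaMA3 is almost identical to LLaMA2. The main change is in the tokenizer, which moves from sentencepiece to tiktoken and grows the vocabulary from 32K to about 128K tokens (roughly four times as large). Because the same text now needs fewer tokens, inference becomes more efficient.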

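  • This change reduces how often characters and other language units are split into several tokens, which lowers the total token count and helps the model process text more coherently and accurately.

  • A larger vocabulary also makes it less likely that a semantically complete unit is cut apart, so the model can capture word meaning and context more accurately and generate more fluent, coherent text.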

  • All LLaMA3 models use grouped-query attention (GQA).
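
The grouped-query attention mentioned in the last bullet can be pictured as several query heads reading from one shared key/value head. The sketch below is only an illustration of that grouping, not LLaMA3's attention code: the helper names are hypothetical, and the configuration of 32 query heads sharing 8 key/value heads is an assumption chosen for the example.

```python
def kv_head_for_query_head(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key/value head that a given query head reads from."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads  # number of query heads per shared KV head
    return q_head // group_size


def kv_cache_ratio(n_heads: int, n_kv_heads: int) -> float:
    """KV-cache size relative to multi-head attention with one KV head per query head."""
    return n_kv_heads / n_heads


# Assumed configuration for illustration: 32 query heads, 8 key/value heads.
mapping = [kv_head_for_query_head(q, 32, 8) for q in range(32)]
print(mapping[:8])            # the first two groups of four query heads
print(kv_cache_ratio(32, 8))  # fraction of the KV cache that remains
```

With these assumed numbers, every four consecutive query heads share one key/value head, so the key/value cache is a quarter of what full multi-head attention would need.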

Tokenizer improvements

A different tokenization library

  • LLaMA2 uses SentencePiece:
self.sp_model = SentencePieceProcessor(model_file=model_path)
...
t = self.sp_model.encode(s)
  • LLaMA2 relies on SentencePieceProcessor to load the model and calls its encode and decode methods directly to tokenize and detokenize.

  • LLaMA3 uses tiktoken instead:

mergeable_ranks = load_tiktoken_bpe(model_path)
...
self.model = tiktoken.Encoding(
    name=Path(model_path).name,
    pat_str=self.pat_str,
    mergeable_ranks=mergeable_ranks,
    special_tokens=self.special_tokens,
)
...
t.extend(
    self.model.encode(
        substr,
        allowed_special=allowed_special,
        disallowed_special=disallowed_special,
    )
)
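  • LLaMA3 loads the BPE merge table with tiktoken and uses a custom regular expression, pat_str, to pre-split the text, which gives finer-grained and more flexible tokenization.

  • This makes tokenization more efficient and, in particular, more accurate on multilingual text and special characters.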
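
To make the role of pat_str more concrete, here is a rough sketch of regex pre-splitting. It does not use the real pattern: the standard-library `re` module has no `\p{L}` or `\p{N}` classes, so `SIMPLIFIED_PAT` below is a hand-simplified approximation, and `pre_tokenize` is a hypothetical helper name. It only illustrates the idea that text is cut into contractions, words, short digit groups, punctuation, and whitespace before BPE merges are applied within each piece.

```python
import re
from typing import List

# Simplified stand-in for pat_str (an approximation, not the real pattern):
# letters are approximated by [^\W\d_] and digits by \d.
SIMPLIFIED_PAT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"  # English contractions
    r"| ?[^\W\d_]+"                  # optional leading space, then letters
    r"|\d{1,3}"                      # digits in groups of at most three
    r"| ?(?:[^\s\w]|_)+"             # optional leading space, then punctuation
    r"|\s+"                          # any remaining whitespace
)


def pre_tokenize(text: str) -> List[str]:
    """Split text into the pieces that BPE would then encode independently."""
    return SIMPLIFIED_PAT.findall(text)


pieces = pre_tokenize("I've got 12345 apples!")
print(pieces)
```

Note how the number is cut into groups of at most three digits and how a leading space stays attached to the word that follows it; both behaviours are also present in the real pattern.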

A larger vocabulary (dictionary)

  • LLaMA2's vocabulary:
self.n_words: int = self.sp_model.vocab_size()
  • The vocabulary size is whatever the SentencePiece model's vocab_size() returns, so it is fixed by the model file.

  • LLaMA3 extends the vocabulary:

num_base_tokens = len(mergeable_ranks)
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",  # end of turn
] + [
    f"<|reserved_special_token_{i}|>"
    for i in range(5, self.num_reserved_special_tokens - 5)
]
self.special_tokens = {
    token: num_base_tokens + i for i, token in enumerate(special_tokens)
}
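  • After loading the base BPE table, LLaMA3 appends num_reserved_special_tokens (256) special tokens and gives them IDs that start right after the last base token. The roughly fourfold growth in vocabulary size comes from the much larger base BPE table, not from these special tokens.

  • A larger vocabulary lets common words and phrases be merged into a single token, which reduces the number of tokens generated and improves inference efficiency.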
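
The ID arithmetic in the snippet above can be checked without a model file. The sketch below repeats the same list construction in a standalone function; `build_special_tokens` is a hypothetical helper, and `NUM_BASE_TOKENS = 128_000` is an assumption standing in for `len(mergeable_ranks)`, which in the real code depends on the tokenizer file that is loaded.

```python
from typing import Dict

NUM_RESERVED_SPECIAL_TOKENS = 256
NUM_BASE_TOKENS = 128_000  # assumption: stands in for len(mergeable_ranks)


def build_special_tokens(
    num_base_tokens: int, num_reserved: int = NUM_RESERVED_SPECIAL_TOKENS
) -> Dict[str, int]:
    """Assign consecutive IDs to special tokens, starting after the base vocabulary."""
    names = [
        "<|begin_of_text|>",
        "<|end_of_text|>",
        "<|reserved_special_token_0|>",
        "<|reserved_special_token_1|>",
        "<|reserved_special_token_2|>",
        "<|reserved_special_token_3|>",
        "<|start_header_id|>",
        "<|end_header_id|>",
        "<|reserved_special_token_4|>",
        "<|eot_id|>",  # end of turn
    ] + [f"<|reserved_special_token_{i}|>" for i in range(5, num_reserved - 5)]
    return {name: num_base_tokens + i for i, name in enumerate(names)}


special_tokens = build_special_tokens(NUM_BASE_TOKENS)
print(len(special_tokens))                    # number of special tokens
print(special_tokens["<|begin_of_text|>"])    # first ID after the base vocabulary
print(special_tokens["<|eot_id|>"])           # tenth special token
print(NUM_BASE_TOKENS + len(special_tokens))  # total vocabulary size
```

Under this assumption the ten named tokens plus the 246 generated reserved tokens give exactly 256 special tokens, a small addition on top of the base vocabulary.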

Long-text handling and splitting strategy

  • LLaMA2's encode implementation:
t = self.sp_model.encode(s)
  • It calls the SentencePiece model on the whole string, with no extra chunking.

  • LLaMA3's encode implementation:

TIKTOKEN_MAX_ENCODE_CHARS = 400_000
MAX_NO_WHITESPACES_CHARS = 25_000

substrs = (
    substr
    for i in range(0, len(s), TIKTOKEN_MAX_ENCODE_CHARS)
    for substr in self._split_whitespaces_or_nonwhitespaces(
        s[i : i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
    )
)
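  • LLaMA3 chunks very long text. It first cuts the string into blocks of at most 400,000 characters, then passes each block to _split_whitespaces_or_nonwhitespaces, which splits again wherever a run of consecutive whitespace or consecutive non-whitespace characters exceeds 25,000.

  • This keeps tokenization from failing on extremely long inputs or on very long runs without whitespace, and makes encoding more robust overall.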
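
The two-level splitting is easier to follow with tiny limits. The sketch below restates the run-splitting logic as a standalone function and wraps it in the same outer chunking loop; `split_runs` and `iter_substrings` are hypothetical names, and the limits 4 and 2 are chosen only so that the effect is visible on a short string.

```python
from typing import Iterator


def split_runs(s: str, max_run: int) -> Iterator[str]:
    """Yield pieces of s in which no run of whitespace or non-whitespace exceeds max_run."""
    run_len = 0
    run_is_space = s[0].isspace() if s else False
    start = 0
    for i, ch in enumerate(s):
        is_space = ch.isspace()
        if run_is_space ^ is_space:
            # The character class changed, so a new run starts here.
            run_len = 1
            run_is_space = is_space
        else:
            run_len += 1
            if run_len > max_run:
                # The run became too long: cut just before this character.
                yield s[start:i]
                start = i
                run_len = 1
    yield s[start:]


def iter_substrings(s: str, max_chars: int, max_run: int) -> Iterator[str]:
    """First cut s into blocks of max_chars, then split long runs inside each block."""
    for i in range(0, len(s), max_chars):
        yield from split_runs(s[i : i + max_chars], max_run)


parts = list(iter_substrings("aaaaa bb", max_chars=4, max_run=2))
print(parts)
```

Concatenating the pieces always gives back the original string, so chunking changes only where the text is cut before encoding.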

More flexible special-token handling

  • LLaMA2's encode takes few parameters:
def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
  • Only the bos (beginning of sequence) and eos (end of sequence) flags decide whether special tokens are added.

  • LLaMA3's encode adds more options:

def encode(
    self,
    s: str,
    *,
    bos: bool,
    eos: bool,
    allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
    disallowed_special: Union[Literal["all"], Collection[str]] = (),
) -> List[int]:
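  • Besides bos and eos, which are now keyword-only, LLaMA3 adds allowed_special and disallowed_special for finer control over how special-token text in the input is encoded. This gives the model more flexibility for complex dialogs and varied text formats.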
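
The interplay of the two parameters can be modelled without tiktoken. The sketch below is a simplified model of the decision only, written from the behaviour described in the encode docstring; `special_token_policy` and the three-entry `SPECIAL_TOKENS` set are hypothetical, and no real encoding takes place.

```python
from typing import AbstractSet, Collection, List, Union

# Small illustrative subset of special tokens (an assumption for this sketch).
SPECIAL_TOKENS = frozenset({"<|begin_of_text|>", "<|end_of_text|>", "<|eot_id|>"})


def special_token_policy(
    text: str,
    allowed_special: Union[str, AbstractSet[str]] = frozenset(),
    disallowed_special: Union[str, Collection[str]] = (),
) -> List[str]:
    """Return the special-token strings in text that would be encoded as special tokens.

    Raises ValueError if text contains a disallowed special token. Special-token
    text that is neither allowed nor disallowed is treated as ordinary text.
    """
    allowed = SPECIAL_TOKENS if allowed_special == "all" else frozenset(allowed_special)
    if disallowed_special == "all":
        disallowed = SPECIAL_TOKENS - allowed
    else:
        disallowed = frozenset(disallowed_special)
    for token in sorted(disallowed):
        if token in text:
            raise ValueError(f"disallowed special token in input: {token!r}")
    return sorted(token for token in allowed if token in text)


text = "hi<|eot_id|>"
print(special_token_policy(text))                         # defaults: plain text
print(special_token_policy(text, allowed_special="all"))  # encoded as a special token
```

With the defaults used by LLaMA3's encode, special-token text typed by a user is therefore encoded as ordinary text, which keeps user input from injecting control tokens unless the caller opts in.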

LLaMA3 tokenizer.py

# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed in accordance with the terms of the Llama 3 Community License Agreement.

import os
from logging import getLogger
from pathlib import Path
from typing import (
    AbstractSet,
    cast,
    Collection,
    Dict,
    Iterator,
    List,
    Literal,
    Sequence,
    TypedDict,
    Union,
)

import tiktoken
from tiktoken.load import load_tiktoken_bpe


logger = getLogger(__name__)


Role = Literal["system", "user", "assistant"]


class Message(TypedDict):
    role: Role
    content: str


Dialog = Sequence[Message]


class Tokenizer:
    """
    Tokenizing and encoding/decoding text using the Tiktoken tokenizer.
    """

    special_tokens: Dict[str, int]

    num_reserved_special_tokens = 256

    pat_str = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"  # noqa: E501

    def __init__(self, model_path: str):
        """
        Initializes the Tokenizer with a Tiktoken model.

        Args:
            model_path (str): The path to the Tiktoken model file.
        """
        assert os.path.isfile(model_path), model_path

        mergeable_ranks = load_tiktoken_bpe(model_path)
        num_base_tokens = len(mergeable_ranks)
        special_tokens = [
            "<|begin_of_text|>",
            "<|end_of_text|>",
            "<|reserved_special_token_0|>",
            "<|reserved_special_token_1|>",
            "<|reserved_special_token_2|>",
            "<|reserved_special_token_3|>",
            "<|start_header_id|>",
            "<|end_header_id|>",
            "<|reserved_special_token_4|>",
            "<|eot_id|>",  # end of turn
        ] + [
            f"<|reserved_special_token_{i}|>"
            for i in range(5, self.num_reserved_special_tokens - 5)
        ]
        self.special_tokens = {
            token: num_base_tokens + i for i, token in enumerate(special_tokens)
        }
        self.model = tiktoken.Encoding(
            name=Path(model_path).name,
            pat_str=self.pat_str,
            mergeable_ranks=mergeable_ranks,
            special_tokens=self.special_tokens,
        )
        logger.info(f"Reloaded tiktoken model from {model_path}")

        self.n_words: int = self.model.n_vocab
        # BOS / EOS token IDs
        self.bos_id: int = self.special_tokens["<|begin_of_text|>"]
        self.eos_id: int = self.special_tokens["<|end_of_text|>"]
        self.pad_id: int = -1
        self.stop_tokens = {
            self.special_tokens["<|end_of_text|>"],
            self.special_tokens["<|eot_id|>"],
        }
        logger.info(
            f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
        )

    def encode(
        self,
        s: str,
        *,
        bos: bool,
        eos: bool,
        allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = (),
    ) -> List[int]:
        """
        Encodes a string into a list of token IDs.

        Args:
            s (str): The input string to be encoded.
            bos (bool): Whether to prepend the beginning-of-sequence token.
            eos (bool): Whether to append the end-of-sequence token.
            allowed_tokens ("all"|set[str]): allowed special tokens in string
            disallowed_tokens ("all"|set[str]): special tokens that raise an error when in string

        Returns:
            list[int]: A list of token IDs.

        By default, setting disallowed_special=() encodes a string by ignoring
        special tokens. Specifically:
        - Setting `disallowed_special` to () will cause all text corresponding
          to special tokens to be encoded as natural text (insteading of raising
          an error).
        - Setting `allowed_special` to "all" will treat all text corresponding
          to special tokens to be encoded as special tokens.
        """
        assert type(s) is str

        # The tiktoken tokenizer can handle <=400k chars without
        # pyo3_runtime.PanicException.
        TIKTOKEN_MAX_ENCODE_CHARS = 400_000

        # https://github.com/openai/tiktoken/issues/195
        # Here we iterate over subsequences and split if we exceed the limit
        # of max consecutive non-whitespace or whitespace characters.
        MAX_NO_WHITESPACES_CHARS = 25_000

        substrs = (
            substr
            for i in range(0, len(s), TIKTOKEN_MAX_ENCODE_CHARS)
            for substr in self._split_whitespaces_or_nonwhitespaces(
                s[i : i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
            )
        )
        t: List[int] = []
        for substr in substrs:
            t.extend(
                self.model.encode(
                    substr,
                    allowed_special=allowed_special,
                    disallowed_special=disallowed_special,
                )
            )
        if bos:
            t.insert(0, self.bos_id)
        if eos:
            t.append(self.eos_id)
        return t

    def decode(self, t: Sequence[int]) -> str:
        """
        Decodes a list of token IDs into a string.

        Args:
            t (List[int]): The list of token IDs to be decoded.

        Returns:
            str: The decoded string.
        """
        # Typecast is safe here. Tiktoken doesn't do anything list-related with the sequence.
        return self.model.decode(cast(List[int], t))

    @staticmethod
    def _split_whitespaces_or_nonwhitespaces(
        s: str, max_consecutive_slice_len: int
    ) -> Iterator[str]:
        """
        Splits the string `s` so that each substring contains no more than `max_consecutive_slice_len`
        consecutive whitespaces or consecutive non-whitespaces.
        """
        current_slice_len = 0
        current_slice_is_space = s[0].isspace() if len(s) > 0 else False
        slice_start = 0

        for i in range(len(s)):
            is_now_space = s[i].isspace()

            if current_slice_is_space ^ is_now_space:
                current_slice_len = 1
                current_slice_is_space = is_now_space
            else:
                current_slice_len += 1
                if current_slice_len > max_consecutive_slice_len:
                    yield s[slice_start:i]
                    slice_start = i
                    current_slice_len = 1
        yield s[slice_start:]


class ChatFormat:
    def __init__(self, tokenizer: Tokenizer):
        self.tokenizer = tokenizer

    def encode_header(self, message: Message) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.special_tokens["<|start_header_id|>"])
        tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))
        tokens.append(self.tokenizer.special_tokens["<|end_header_id|>"])
        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
        return tokens

    def encode_message(self, message: Message) -> List[int]:
        tokens = self.encode_header(message)
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
        )
        tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])
        return tokens

    def encode_dialog_prompt(self, dialog: Dialog) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.special_tokens["<|begin_of_text|>"])
        for message in dialog:
            tokens.extend(self.encode_message(message))
        # Add the start of an assistant message for the model to complete.
        tokens.extend(self.encode_header({"role": "assistant", "content": ""}))
        return tokens
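
The ChatFormat class above works on token IDs, which makes the resulting prompt layout hard to see. The sketch below renders the same layout as a plain string, following the order of operations in encode_header, encode_message, and encode_dialog_prompt. `render_dialog_prompt` is a hypothetical helper for inspection only; it does not replace encoding through the tokenizer, where the special tokens are inserted as IDs rather than parsed from text.

```python
from typing import Dict, List


def render_dialog_prompt(dialog: List[Dict[str, str]]) -> str:
    """Render a dialog as text in the order ChatFormat.encode_dialog_prompt emits tokens."""
    parts = ["<|begin_of_text|>"]
    for message in dialog:
        # Header: role name wrapped in header markers, followed by a blank line.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        # Body: stripped content, closed by the end-of-turn marker.
        parts.append(message["content"].strip())
        parts.append("<|eot_id|>")
    # Open an assistant header for the model to complete.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_dialog_prompt([{"role": "user", "content": " Hello "}])
print(prompt)
```

The rendered string ends with an open assistant header, which is why generation is expected to stop at one of the stop_tokens defined in Tokenizer.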