The LLaMA3 Model

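LLaMA3 was pretrained on a curated corpus of more than 15T tokens, roughly seven times the size of LLaMA2's, and the resulting jump in performance shows how much data matters.
The corpus contains a large share of code, which is intended to strengthen the model's logical reasoning, and more than 5% of it is non-English data covering over 30 languages, which broadens the model's cross-lingual ability.
Like LLaMA2, LLaMA3 was also tuned with reinforcement learning from human feedback (RLHF), a strategy that has been shown to improve model performance noticeably.
Architecturally, LLaMA3 is almost identical to LLaMA2. The main change is in the tokenizer, which moves from SentencePiece to tiktoken and grows the vocabulary roughly fourfold (from 32K to about 128K tokens), so the same text is encoded in fewer tokens and inference becomes more efficient.
This change reduces how often characters and other linguistic units are split into several tokens and lowers the total token count, which helps the model process text more coherently and accurately.
The larger vocabulary also splits fewer semantically complete units, so the model can capture word meaning and context more accurately and generate more fluent, coherent text.
All LLaMA3 models use grouped-query attention (GQA), whereas LLaMA2 used it only in its larger variants.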
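The effect of a larger vocabulary on token count can be illustrated with a toy example. The sketch below is a greedy longest-match segmenter, not BPE and not either model's tokenizer; `greedy_tokenize` and both vocabularies are made up purely for illustration.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unknown spans fall back to single characters."""
    max_len = max(len(entry) for entry in vocab)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: the large one is a superset of the small one.
small_vocab = {"token", "iz"}
large_vocab = small_vocab | {"ation", "tokenizer", "tokenization"}

text = "tokenization"
print(greedy_tokenize(text, small_vocab))  # 7 pieces: token, iz, a, t, i, o, n
print(greedy_tokenize(text, large_vocab))  # 1 piece: tokenization
```

The same string costs seven tokens with the small vocabulary and one with the large one, which is the mechanism behind the shorter sequences described above.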

Tokenizer improvements

Switching the tokenization library

LLaMA2 uses SentencePiece:

self.sp_model = SentencePieceProcessor(model_file=model_path)
...
t = self.sp_model.encode(s)

LLaMA2 relies on SentencePieceProcessor to load the model and calls its encode and decode methods directly to tokenize and detokenize.
LLaMA3 uses tiktoken instead:

mergeable_ranks = load_tiktoken_bpe(model_path)
...
self.model = tiktoken.Encoding(
    name=Path(model_path).name,
    pat_str=self.pat_str,
    mergeable_ranks=mergeable_ranks,
    special_tokens=self.special_tokens,
)
...
t.extend(
    self.model.encode(
        substr,
        allowed_special=allowed_special,
        disallowed_special=disallowed_special,
    )
)

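LLaMA3 uses tiktoken to load the BPE merge table and a custom regular expression, pat_str, to pre-split the text before BPE merges are applied, which gives finer and more flexible control over token boundaries.
This makes tokenization more efficient and, in particular, more accurate on multilingual text and special characters.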
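To make the pre-splitting step concrete, here is a minimal sketch using only the standard library. The real pat_str relies on Unicode property classes (`\p{L}`, `\p{N}`) that Python's built-in `re` does not support, so `ASCII_PAT` below is an ASCII-only approximation (letters as `A-Za-z`, digits as `0-9`). The names `ASCII_PAT` and `pre_split` are hypothetical, and the output only illustrates the splitting rules, not the final token IDs.

```python
import re
from typing import List

# ASCII-only stand-in for pat_str: \p{L} -> A-Za-z, \p{N} -> 0-9.
ASCII_PAT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"   # common English contractions
    r"|[^\r\nA-Za-z0-9]?[A-Za-z]+"    # a word, optionally with one leading symbol or space
    r"|[0-9]{1,3}"                    # digits in groups of at most three
    r"| ?[^\sA-Za-z0-9]+[\r\n]*"      # punctuation runs, optionally followed by newlines
    r"|\s*[\r\n]+"                    # newlines with any leading whitespace
    r"|\s+(?!\S)"                     # trailing whitespace
    r"|\s+"                           # any other whitespace
)


def pre_split(text: str) -> List[str]:
    """Return the pieces that would each be handed to the BPE merge step."""
    return ASCII_PAT.findall(text)


print(pre_split("Hello world 12345!"))
# ['Hello', ' world', ' ', '123', '45', '!']
print(pre_split("don't stop"))
# ['don', "'t", ' stop']
```

Note how a leading space stays attached to the following word and how long numbers are cut into groups of at most three digits; BPE merges never cross these piece boundaries.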

Expanding the vocabulary

LLaMA2's vocabulary:

self.n_words: int = self.sp_model.vocab_size()

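The vocabulary size is whatever the SentencePiece model's vocab_size() reports (32,000 for LLaMA2), so it is fixed by the model file.
LLaMA3 expands the vocabulary: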

num_base_tokens = len(mergeable_ranks)
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",  # end of turn
] + [
    f"<|reserved_special_token_{i}|>"
    for i in range(5, self.num_reserved_special_tokens - 5)
]
self.special_tokens = {
    token: num_base_tokens + i for i, token in enumerate(special_tokens)
}

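After loading the base BPE table, LLaMA3 appends its special tokens (num_reserved_special_tokens = 256 in total) and assigns them IDs that start right after the last base token. The roughly fourfold growth of the vocabulary comes from the much larger base BPE table, not from these special tokens.
A larger vocabulary lets common words and phrases be merged into a single token, which reduces the number of tokens to generate and improves inference efficiency.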
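The ID assignment for special tokens can be checked in isolation. The sketch below repeats the list construction from the excerpt without loading a model file; `build_special_tokens` is a hypothetical helper, and `NUM_BASE_TOKENS = 128_000` is an assumption standing in for `len(mergeable_ranks)`.

```python
from typing import Dict

NUM_BASE_TOKENS = 128_000            # assumption: size of the base BPE table
NUM_RESERVED_SPECIAL_TOKENS = 256    # same value as in the Tokenizer class


def build_special_tokens(
    num_base_tokens: int, num_reserved: int = NUM_RESERVED_SPECIAL_TOKENS
) -> Dict[str, int]:
    """Map each special token to an ID placed directly after the base vocabulary."""
    names = [
        "<|begin_of_text|>",
        "<|end_of_text|>",
        "<|reserved_special_token_0|>",
        "<|reserved_special_token_1|>",
        "<|reserved_special_token_2|>",
        "<|reserved_special_token_3|>",
        "<|start_header_id|>",
        "<|end_header_id|>",
        "<|reserved_special_token_4|>",
        "<|eot_id|>",
    ] + [f"<|reserved_special_token_{i}|>" for i in range(5, num_reserved - 5)]
    return {name: num_base_tokens + i for i, name in enumerate(names)}


special = build_special_tokens(NUM_BASE_TOKENS)
print(len(special))                    # 256 special tokens in total
print(special["<|begin_of_text|>"])    # 128000
print(special["<|eot_id|>"])           # 128009
print(NUM_BASE_TOKENS + len(special))  # 128256 entries overall
```

Ten named slots plus reserved tokens 5 through 250 give exactly 256 special tokens, so under the assumed base size the special tokens account for only a small fraction of the vocabulary.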

Long-text handling and splitting strategy

LLaMA2's encode implementation:

t = self.sp_model.encode(s)

It calls the SentencePiece model on the whole string at once, with no additional segmentation.
LLaMA3's encode implementation:

TIKTOKEN_MAX_ENCODE_CHARS = 400_000
MAX_NO_WHITESPACES_CHARS = 25_000

substrs = (
    substr
    for i in range(0, len(s), TIKTOKEN_MAX_ENCODE_CHARS)
    for substr in self._split_whitespaces_or_nonwhitespaces(
        s[i : i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
    )
)

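LLaMA3 splits very long text in two stages: it first cuts the input into chunks of at most 400,000 characters, then passes each chunk through _split_whitespaces_or_nonwhitespaces so that no run of consecutive whitespace or consecutive non-whitespace characters exceeds 25,000 characters.
This keeps tokenization from failing on extremely long inputs or long runs without whitespace, and makes encoding more robust overall.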
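The two-stage splitting is easier to follow with small limits. The sketch below copies the logic of `_split_whitespaces_or_nonwhitespaces` into a standalone function; the names `split_runs` and `chunked_substrings` and the tiny limits are chosen for illustration only.

```python
from typing import Iterator, List


def split_runs(s: str, max_run: int) -> Iterator[str]:
    """Yield slices of s so that no slice holds more than max_run consecutive
    whitespace characters or more than max_run consecutive non-whitespace characters."""
    run_len = 0
    run_is_space = s[0].isspace() if len(s) > 0 else False
    slice_start = 0
    for i in range(len(s)):
        is_space = s[i].isspace()
        if run_is_space ^ is_space:
            # The run type changed, so start counting a new run.
            run_len = 1
            run_is_space = is_space
        else:
            run_len += 1
            if run_len > max_run:
                # The current run is too long: cut just before this character.
                yield s[slice_start:i]
                slice_start = i
                run_len = 1
    yield s[slice_start:]


def chunked_substrings(s: str, max_chars: int, max_run: int) -> List[str]:
    """First cut s into chunks of max_chars, then split long runs inside each chunk."""
    return [
        sub
        for i in range(0, len(s), max_chars)
        for sub in split_runs(s[i : i + max_chars], max_run)
    ]


text = "aaaaaa bb"
print(chunked_substrings(text, max_chars=100, max_run=3))  # ['aaa', 'aaa bb']
print(chunked_substrings(text, max_chars=4, max_run=3))    # ['aaa', 'a', 'aa b', 'b']
```

In both cases the pieces concatenate back to the original string; only the boundaries handed to the underlying encoder change.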

Flexible handling of special tokens

LLaMA2's encode takes only a few parameters:

def encode(self, s: str, bos: bool, eos: bool) -> List[int]:

Only the bos (beginning-of-sequence) and eos (end-of-sequence) flags decide whether special tokens are added.
LLaMA3's encode adds more options:

def encode(
    self,
    s: str,
    *,
    bos: bool,
    eos: bool,
    allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
    disallowed_special: Union[Literal["all"], Collection[str]] = (),
) -> List[int]:

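In addition to bos and eos, LLaMA3 introduces allowed_special and disallowed_special, which control how text that looks like a special token is encoded. This gives callers more flexibility when handling complex dialogs or mixed text formats.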
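The semantics of the two parameters can be sketched without tiktoken. The function below is a standard-library approximation of the behaviour as described in the encode docstring, not tiktoken's implementation; `split_special` and the three-entry `SPECIAL_TOKENS` set are hypothetical, and the function returns labelled text segments instead of token IDs.

```python
import re
from typing import AbstractSet, Collection, List, Literal, Tuple, Union

# Hypothetical subset of the special tokens, enough for the demonstration.
SPECIAL_TOKENS = frozenset({"<|begin_of_text|>", "<|end_of_text|>", "<|eot_id|>"})


def split_special(
    text: str,
    allowed_special: Union[Literal["all"], AbstractSet[str]] = frozenset(),
    disallowed_special: Union[Literal["all"], Collection[str]] = (),
) -> List[Tuple[str, str]]:
    """Label each segment of text as 'special' or 'text' under the given policy."""
    if allowed_special == "all":
        allowed_special = SPECIAL_TOKENS
    if disallowed_special == "all":
        disallowed_special = SPECIAL_TOKENS - set(allowed_special)
    # Disallowed special tokens found in the input are an error.
    for token in disallowed_special:
        if token in text:
            raise ValueError(f"disallowed special token in text: {token!r}")
    if not allowed_special:
        # Nothing is allowed: everything is treated as ordinary text.
        return [("text", text)] if text else []
    pattern = "(" + "|".join(re.escape(t) for t in sorted(allowed_special)) + ")"
    parts = re.split(pattern, text)
    return [("special" if p in allowed_special else "text", p) for p in parts if p]


sample = "hi<|eot_id|>"
print(split_special(sample))                         # one plain-text segment
print(split_special(sample, allowed_special="all"))  # text 'hi', then special '<|eot_id|>'
```

With the defaults used by LLaMA3's wrapper, special-token text in user input is encoded as ordinary text, which is why ChatFormat appends the special-token IDs itself instead of embedding them in the string.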

LLaMA3 tokenizer.py

# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed in accordance with the terms of the Llama 3 Community License Agreement.

import os
from logging import getLogger
from pathlib import Path
from typing import (
    AbstractSet,
    cast,
    Collection,
    Dict,
    Iterator,
    List,
    Literal,
    Sequence,
    TypedDict,
    Union,
)

import tiktoken
from tiktoken.load import load_tiktoken_bpe


logger = getLogger(__name__)


Role = Literal["system", "user", "assistant"]


class Message(TypedDict):
    role: Role
    content: str


Dialog = Sequence[Message]


class Tokenizer:
    """
    Tokenizing and encoding/decoding text using the Tiktoken tokenizer.
    """

    special_tokens: Dict[str, int]

    num_reserved_special_tokens = 256

    pat_str = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"  # noqa: E501

    def __init__(self, model_path: str):
        """
        Initializes the Tokenizer with a Tiktoken model.

        Args:
            model_path (str): The path to the Tiktoken model file.
        """
        assert os.path.isfile(model_path), model_path

        mergeable_ranks = load_tiktoken_bpe(model_path)
        num_base_tokens = len(mergeable_ranks)
        special_tokens = [
            "<|begin_of_text|>",
            "<|end_of_text|>",
            "<|reserved_special_token_0|>",
            "<|reserved_special_token_1|>",
            "<|reserved_special_token_2|>",
            "<|reserved_special_token_3|>",
            "<|start_header_id|>",
            "<|end_header_id|>",
            "<|reserved_special_token_4|>",
            "<|eot_id|>",  # end of turn
        ] + [
            f"<|reserved_special_token_{i}|>"
            for i in range(5, self.num_reserved_special_tokens - 5)
        ]
        self.special_tokens = {
            token: num_base_tokens + i for i, token in enumerate(special_tokens)
        }
        self.model = tiktoken.Encoding(
            name=Path(model_path).name,
            pat_str=self.pat_str,
            mergeable_ranks=mergeable_ranks,
            special_tokens=self.special_tokens,
        )
        logger.info(f"Reloaded tiktoken model from {model_path}")

        self.n_words: int = self.model.n_vocab
        # BOS / EOS token IDs
        self.bos_id: int = self.special_tokens["<|begin_of_text|>"]
        self.eos_id: int = self.special_tokens["<|end_of_text|>"]
        self.pad_id: int = -1
        self.stop_tokens = {
            self.special_tokens["<|end_of_text|>"],
            self.special_tokens["<|eot_id|>"],
        }
        logger.info(
            f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
        )

    def encode(
        self,
        s: str,
        *,
        bos: bool,
        eos: bool,
        allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = (),
    ) -> List[int]:
        """
        Encodes a string into a list of token IDs.

        Args:
            s (str): The input string to be encoded.
            bos (bool): Whether to prepend the beginning-of-sequence token.
            eos (bool): Whether to append the end-of-sequence token.
            allowed_special ("all"|set[str]): allowed special tokens in string
            disallowed_special ("all"|set[str]): special tokens that raise an error when in string

        Returns:
            list[int]: A list of token IDs.

        By default, setting disallowed_special=() encodes a string by ignoring
        special tokens. Specifically:
        - Setting `disallowed_special` to () will cause all text corresponding
          to special tokens to be encoded as natural text (instead of raising
          an error).
        - Setting `allowed_special` to "all" will treat all text corresponding
          to special tokens to be encoded as special tokens.
        """
        assert type(s) is str

        # The tiktoken tokenizer can handle <=400k chars without
        # pyo3_runtime.PanicException.
        TIKTOKEN_MAX_ENCODE_CHARS = 400_000

        # https://github.com/openai/tiktoken/issues/195
        # Here we iterate over subsequences and split if we exceed the limit
        # of max consecutive non-whitespace or whitespace characters.
        MAX_NO_WHITESPACES_CHARS = 25_000

        substrs = (
            substr
            for i in range(0, len(s), TIKTOKEN_MAX_ENCODE_CHARS)
            for substr in self._split_whitespaces_or_nonwhitespaces(
                s[i : i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
            )
        )
        t: List[int] = []
        for substr in substrs:
            t.extend(
                self.model.encode(
                    substr,
                    allowed_special=allowed_special,
                    disallowed_special=disallowed_special,
                )
            )
        if bos:
            t.insert(0, self.bos_id)
        if eos:
            t.append(self.eos_id)
        return t

    def decode(self, t: Sequence[int]) -> str:
        """
        Decodes a list of token IDs into a string.

        Args:
            t (List[int]): The list of token IDs to be decoded.

        Returns:
            str: The decoded string.
        """
        # Typecast is safe here. Tiktoken doesn't do anything list-related with the sequence.
        return self.model.decode(cast(List[int], t))

    @staticmethod
    def _split_whitespaces_or_nonwhitespaces(
        s: str, max_consecutive_slice_len: int
    ) -> Iterator[str]:
        """
        Splits the string `s` so that each substring contains no more than `max_consecutive_slice_len`
        consecutive whitespaces or consecutive non-whitespaces.
        """
        current_slice_len = 0
        current_slice_is_space = s[0].isspace() if len(s) > 0 else False
        slice_start = 0

        for i in range(len(s)):
            is_now_space = s[i].isspace()

            if current_slice_is_space ^ is_now_space:
                current_slice_len = 1
                current_slice_is_space = is_now_space
            else:
                current_slice_len += 1
                if current_slice_len > max_consecutive_slice_len:
                    yield s[slice_start:i]
                    slice_start = i
                    current_slice_len = 1
        yield s[slice_start:]


class ChatFormat:
    def __init__(self, tokenizer: Tokenizer):
        self.tokenizer = tokenizer

    def encode_header(self, message: Message) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.special_tokens["<|start_header_id|>"])
        tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))
        tokens.append(self.tokenizer.special_tokens["<|end_header_id|>"])
        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
        return tokens

    def encode_message(self, message: Message) -> List[int]:
        tokens = self.encode_header(message)
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
        )
        tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])
        return tokens

    def encode_dialog_prompt(self, dialog: Dialog) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.special_tokens["<|begin_of_text|>"])
        for message in dialog:
            tokens.extend(self.encode_message(message))
        # Add the start of an assistant message for the model to complete.
        tokens.extend(self.encode_header({"role": "assistant", "content": ""}))
        return tokens
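The prompt layout that ChatFormat produces can be visualised as text. The sketch below mirrors encode_header, encode_message and encode_dialog_prompt but builds a string with the special-token names written out instead of a list of token IDs; `render_header` and `render_dialog_prompt` are hypothetical helpers for illustration. Encoding this string in one pass is not guaranteed to give the same IDs as ChatFormat, which encodes each piece separately.

```python
from typing import Dict, List


def render_header(role: str) -> str:
    """Text equivalent of encode_header: role wrapped in header markers plus a blank line."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"


def render_dialog_prompt(dialog: List[Dict[str, str]]) -> str:
    """Text equivalent of encode_dialog_prompt for inspecting the layout."""
    parts = ["<|begin_of_text|>"]
    for message in dialog:
        parts.append(render_header(message["role"]))
        parts.append(message["content"].strip())  # content is stripped, as in encode_message
        parts.append("<|eot_id|>")                # end of turn
    # Open an assistant turn for the model to complete.
    parts.append(render_header("assistant"))
    return "".join(parts)


dialog = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": " Hello "},
]
print(render_dialog_prompt(dialog))
```

Each turn is closed by `<|eot_id|>`, and the prompt ends with an open assistant header, so generation stops when the model emits one of the stop tokens defined in the Tokenizer.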