Tokenization
Tokenization is the process of breaking text into smaller units called tokens, which serve as the basic input units for language models. Tokens typically represent word fragments, whole words, or punctuation.
Understanding Tokenization
Before a language model can process text, that text must be converted into tokens. Modern LLMs use subword tokenization algorithms such as Byte Pair Encoding (BPE) or SentencePiece, which balance vocabulary size against coverage: common words get a single token, while rare words are split into multiple subword tokens. On average, one token corresponds to roughly four characters, or about three-quarters of an English word.

Tokenization matters for three practical reasons. First, the context window is measured in tokens, not words or characters; a 128,000-token context window holds roughly 96,000 English words. Second, API usage is priced per token, for both input and output. Third, tokenization affects how models handle different languages: text in languages underrepresented in a tokenizer's training data tends to split into more tokens, raising cost and consuming context faster.

Tokenizers are model-specific. OpenAI's tiktoken library, Hugging Face tokenizers, and Anthropic's tokenizer all use different vocabularies, so the same text tokenizes differently across models. This affects context window calculations and cost estimates.

Special tokens mark the start and end of sequences, separate system prompts from user messages, and indicate tool call boundaries. These structural tokens are part of every LLM interaction, even when invisible to the user.
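The core BPE idea above can be sketched in a few lines of pure Python. This is a simplified illustration of the merge loop (real tokenizers learn merges from a large corpus and apply them with a fixed vocabulary), not any specific library's implementation:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, num_merges):
    """Greedy BPE: start from characters, repeatedly merge the most frequent pair."""
    tokens = list(text)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens
```

After a few merges, frequent character runs collapse into single subword tokens while rare characters stay separate, which is exactly the vocabulary-versus-coverage tradeoff described above.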
How GAIA Uses Tokenization
GAIA manages token budgets carefully across its agent workflows. Long emails and documents are chunked into token-sized segments before embedding or summarization. When constructing prompts, GAIA balances the amount of retrieved context against the LLM's context window limit to maximize information density while staying within model constraints. Token-aware chunking also ensures GAIA's semantic search operates on coherent units of meaning.
Related Concepts
Context Window
The context window is the maximum number of tokens a language model can process in a single inference call, encompassing the system prompt, conversation history, retrieved documents, and generated output.
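Because the window must cover both input and generated output, prompt construction is a budgeting exercise. A hypothetical helper (the names are illustrative, not from any particular API) that computes how many tokens remain for retrieved documents:

```python
def remaining_context_tokens(window, system_tokens, history_tokens, reserved_output):
    """Tokens left for retrieved documents after fixed costs are accounted for.

    The context window spans input AND output, so space for the model's
    generated response is reserved up front.
    """
    remaining = window - system_tokens - history_tokens - reserved_output
    if remaining < 0:
        raise ValueError("fixed prompt components already exceed the window")
    return remaining
```

For example, a 128,000-token window with a 1,000-token system prompt, 20,000 tokens of history, and 4,000 tokens reserved for output leaves 103,000 tokens for retrieved context.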
Large Language Model (LLM)
A Large Language Model (LLM) is a deep learning model trained on massive text datasets that can understand, generate, and reason about human language across a wide range of tasks.
Embeddings
Embeddings are dense numerical vector representations of data, such as text, images, or audio, that capture semantic meaning and relationships in a high-dimensional space.
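The "semantic relationships" between embedding vectors are typically measured with cosine similarity, which compares vector direction rather than magnitude. A minimal sketch:

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Semantic search ranks stored chunks by their cosine similarity to the query's embedding.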


