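Based on the list compiled by Ogawa-san, I examined the tokenizers below (sorted by vocabulary size). The environment was Google Colab. "lighttransport/japanese-tokenizer-cc100" does not appear to be meant for standalone use, so it was excluded. The data used for the evaluation is as follows.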
Note: CC100-ja (6), 3.28 GB. The evaluation text is the output of head -10000 on "train_6.parquetresults.filtering.jsonl" from range3/cc100-ja (main branch), converted to txt.
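
The conversion described above (first 10,000 JSONL lines to plain text) could be done roughly as follows. This is a minimal sketch, not the script actually used: the function name `jsonl_head_to_txt` is hypothetical, and the field name `text` is an assumption about the JSONL schema that should be checked against the real file.

```python
import json
from itertools import islice


def jsonl_head_to_txt(src, dst, n_lines=10_000, text_key="text"):
    """Write the `text_key` field of the first `n_lines` JSONL lines to `dst`.

    `text_key="text"` is an assumed field name, not confirmed by the source.
    Returns the number of records actually written.
    """
    written = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        # islice mirrors `head -n`: only the first n_lines lines are read.
        for line in islice(fin, n_lines):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            text = record.get(text_key)
            if text is None:
                continue  # skip records without the assumed field
            fout.write(f"{text}\n")
            written += 1
    return written
```

Calling `jsonl_head_to_txt("train_6.parquetresults.filtering.jsonl", "cc100_ja_head.txt")` would then produce the evaluation text; the output filename here is only an illustration.
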
Tokenizer Name | Vocabulary Size | Average Token Length | Coverage | Unknown Token Count | Hex Token ('<0x') Count | Total Tokens | Processing Time (seconds) |
---|---|---|---|---|---|---|---|
rinna/nekomata-14b | 151,643 | 4.77 | 100.00% | 0 | 0 | 358,440 | 4.85 |
lightblue/qarasu-14B-chat-plus-unleashed | 151,643 | 4.77 | 100.00% | 0 | 0 | 358,440 | 4.35 |
llm-jp/hf-fast-tokenizer-v22b2 | 96,867 | 1.58 | 99.95% | 0 | 187 (0.05%) | 373,243 | 4.40 |
novelai/nerdstash-tokenizer-v1 | 65,538 | 2.23 | 96.52% | 0 | 10,006 (3.48%) | 287,312 | 5.11 |
cyberagent/calm2-7b | 65,000 | 6.41 | 100.00% | 0 | 0 | 266,427 | 5.06 |
pfnet/plamo-13b | 50,432 | 1.90 | 99.60% | 0 | 1,259 (0.40%) | 312,047 | 2.42 |
Rakuten/RakutenAI-7B-chat | 48,000 | 2.00 | 94.71% | 0 | 18,106 (5.29%) | 342,093 | 2.96 |
karakuri-ai/karakuri-lm-70b-chat-v0.1 | 45,416 | 2.35 | 95.20% | 0 | 13,410 (4.80%) | 279,489 | 2.64 |
tokyotech-llm/Swallow-7b-instruct-hf | 43,176 | 1.68 | 97.16% | 0 | 10,827 (2.84%) | 381,860 | 4.43 |
tokyotech-llm/Swallow-13b-hf | 43,176 | 1.68 | 97.16% | 0 | 10,827 (2.84%) | 381,860 | 3.68 |
nitky/Superswallow-70b-v0.1 | 43,176 | 1.68 | 97.16% | 0 | 10,827 (2.84%) | 381,860 | 3.08 |
lightblue/karasu-7B-chat-plus-unleashed | 32,055 | 1.78 | 84.61% | 0 | 98,974 (15.39%) | 642,926 | 5.53 |
SakanaAI/EvoLLM-JP-A-v1-7B | 32,000 | 1.78 | 84.61% | 0 | 98,974 (15.39%) | 642,926 | 4.88 |
geniacllm/ja-tokenizer-unigram-v1 | 32,000 | 1.94 | 100.00% | 0 | 0 | 301,976 | 3.73 |
elyza/ELYZA-japanese-Llama-2-7b | 32,000 | 2.21 | 75.93% | 0 | 165,456 (24.07%) | 687,350 | 5.78 |
moneyforward/houou-instruction-7b-v2 | 32,000 | 2.21 | 75.93% | 0 | 165,456 (24.07%) | 687,350 | 5.67 |
rinna/youri-7b | 32,000 | 2.21 | 75.93% | 0 | 165,456 (24.07%) | 687,350 | 5.76 |
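
The figures in the table are consistent with Coverage being 1 minus the share of hex (byte-fallback) tokens among all tokens. The sketch below shows one way such per-tokenizer statistics could be computed from a list of token strings. It is an illustration under assumptions, not the script used for the table: the function name `summarize_tokens`, the `<0xNN>` pattern for hex tokens, the `<unk>` default, and measuring token length in characters of the token string are all assumptions.

```python
import re

# Assumption: hex tokens look like SentencePiece byte-fallback pieces, e.g. "<0xE3>".
HEX_TOKEN = re.compile(r"^<0x[0-9A-Fa-f]{2}>$")


def summarize_tokens(tokens, unk_token="<unk>"):
    """Compute table-style statistics for one tokenizer's output.

    `tokens` is a list of token strings; `unk_token` is an assumed default.
    """
    total = len(tokens)
    if total == 0:
        raise ValueError("tokens must not be empty")
    hex_count = sum(1 for t in tokens if HEX_TOKEN.match(t))
    unk_count = sum(1 for t in tokens if t == unk_token)
    return {
        "total_tokens": total,
        # Mean length of the token strings, in characters.
        "average_token_length": sum(len(t) for t in tokens) / total,
        "unknown_token_count": unk_count,
        "hex_token_count": hex_count,
        "hex_token_ratio": hex_count / total,
        # Share of tokens that are not byte-fallback pieces.
        "coverage": 1 - hex_count / total,
    }
```

As a check against the table, the pfnet/plamo-13b row has 1,259 hex tokens out of 312,047, which is about 0.40%, giving the listed coverage of 99.60%. With Hugging Face Transformers, the token list could come from `tokenizer.tokenize(text)`.
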
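
The source code used to produce the table above is as follows.
Note: after the table above was produced, processing was added to write the data obtained from each tokenizer to files.
Output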