Reports

| Tokenizer Name | Vocabulary Size | Average Token Length | Coverage | Unknown Token Count | Hex Token ('<0x') Count | Total Tokens | Processing Time (seconds) |
|---|---|---|---|---|---|---|---|
| pfnet/plamo-13b | 50,432 | 1.90 | 99.60% | 0 | 1,259 (0.40%) | 312,047 | 2.42 |
| llm-jp/hf-fast-tokenizer-v22b2 | 96,867 | 1.58 | 99.95% | 0 | 187 (0.05%) | 373,243 | 4.40 |
| rinna/nekomata-14b | 151,643 | 4.77 | 100.00% | 0 | 0 | 358,440 | 4.85 |
| lightblue/qarasu-14B-chat-plus-unleashed | 151,643 | 4.77 | 100.00% | 0 | 0 | 358,440 | 4.35 |
| novelai/nerdstash-tokenizer-v1 | 65,538 | 2.23 | 96.52% | 0 | 10,006 (3.48%) | 287,312 | 5.11 |
| cyberagent/calm2-7b | 65,000 | 6.41 | 100.00% | 0 | 0 | 266,427 | 5.06 |
| Rakuten/RakutenAI-7B-chat | 48,000 | 2.00 | 94.71% | 0 | 18,106 (5.29%) | 342,093 | 2.96 |
| karakuri-ai/karakuri-lm-70b-chat-v0.1 | 45,416 | 2.35 | 95.20% | 0 | 13,410 (4.80%) | 279,489 | 2.64 |
| tokyotech-llm/Swallow-7b-instruct-hf | 43,176 | 1.68 | 97.16% | 0 | 10,827 (2.84%) | 381,860 | 4.43 |
| tokyotech-llm/Swallow-13b-hf | 43,176 | 1.68 | 97.16% | 0 | 10,827 (2.84%) | 381,860 | 3.68 |
| nitky/Superswallow-70b-v0.1 | 43,176 | 1.68 | 97.16% | 0 | 10,827 (2.84%) | 381,860 | 3.08 |
| lightblue/karasu-7B-chat-plus-unleashed | 32,055 | 1.78 | 84.61% | 0 | 98,974 (15.39%) | 642,926 | 5.53 |
| SakanaAI/EvoLLM-JP-A-v1-7B | 32,000 | 1.78 | 84.61% | 0 | 98,974 (15.39%) | 642,926 | 4.88 |
| geniacllm/ja-tokenizer-unigram-v1 | 32,000 | 1.94 | 100.00% | 0 | 0 | 301,976 | 3.73 |
| elyza/ELYZA-japanese-Llama-2-7b | 32,000 | 2.21 | 75.93% | 0 | 165,456 (24.07%) | 687,350 | 5.78 |
| moneyforward/houou-instruction-7b-v2 | 32,000 | 2.21 | 75.93% | 0 | 165,456 (24.07%) | 687,350 | 5.67 |
| rinna/youri-7b | 32,000 | 2.21 | 75.93% | 0 | 165,456 (24.07%) | 687,350 | 5.76 |
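
For reference, a minimal sketch of how the per-tokenizer columns could be derived from a list of token strings. It is not taken from the notebook. In every row above, Coverage plus the hex-token percentage equals 100%, so the sketch assumes coverage is one minus the share of byte-fallback tokens (those starting with `<0x`). The names `TokenStats` and `summarize_tokens`, the `<unk>` default, and measuring token length in characters of the token string are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    total_tokens: int
    average_token_length: float  # mean length of the token strings, in characters
    unknown_token_count: int
    hex_token_count: int
    coverage: float  # fraction in [0, 1]


def summarize_tokens(tokens, unk_token="<unk>"):
    """Compute table-style statistics from a list of token strings.

    Assumption: coverage = 1 - (hex tokens / total tokens), which matches
    the relationship between the Coverage and Hex Token columns.
    """
    total = len(tokens)
    if total == 0:
        return TokenStats(0, 0.0, 0, 0, 0.0)
    # Byte-fallback tokens are rendered like "<0xE3>" by SentencePiece-style tokenizers.
    hex_count = sum(1 for t in tokens if t.startswith("<0x"))
    unk_count = sum(1 for t in tokens if t == unk_token)
    avg_len = sum(len(t) for t in tokens) / total
    coverage = 1.0 - hex_count / total
    return TokenStats(total, avg_len, unk_count, hex_count, coverage)


# Hypothetical token sequence; a real run would obtain it from a tokenizer,
# for example via its tokenize() method.
sample = ["▁日本", "語", "<0xE3>", "<0x81>", "<0x82>", "です"]
stats = summarize_tokens(sample)
print(f"{stats.hex_token_count} hex tokens, coverage {stats.coverage:.2%}")
```

Keeping the statistics separate from the tokenizer call means the same function can be applied to any tokenizer's output, which is what makes the rows comparable.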

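The source used to produce the table above is given below.

Note: after the table above was generated, a step was added that writes the data obtained from the tokenizers to files.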

Torkenizer_eval.ipynb

Output results

tokenize_results.zip
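
A minimal sketch of what a file-output step of this kind might look like. The actual layout of tokenize_results.zip is not described here, so the one-JSON-file-per-tokenizer structure, the helper names `safe_name` and `write_results_zip`, and the example tokenizer name are all assumptions.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def safe_name(tokenizer_name):
    """Turn 'org/model' into a name that is safe to use as a file name."""
    return tokenizer_name.replace("/", "__")


def write_results_zip(results, zip_path):
    """Write one JSON file per tokenizer into a zip archive.

    `results` maps a tokenizer name to its list of token strings.
    The archive layout is hypothetical, not the notebook's actual format.
    """
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, tokens in results.items():
            # ensure_ascii=False keeps Japanese tokens readable in the output.
            payload = json.dumps(
                {"tokenizer": name, "tokens": tokens},
                ensure_ascii=False,
                indent=2,
            )
            zf.writestr(f"{safe_name(name)}.json", payload)
    return zip_path


# Usage with a placeholder tokenizer name, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out = write_results_zip(
        {"example-org/example-tokenizer": ["▁日本", "語"]},
        Path(tmp) / "tokenize_results.zip",
    )
    with zipfile.ZipFile(out) as zf:
        print(zf.namelist())
```

Storing the raw tokens rather than only the summary numbers lets the statistics be recomputed later without rerunning the tokenizers.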

Next actions

Memo