I Tried Generating Synthetic Data! (Self-Instruct)

This is my Notion debut.

During a competition I wanted to try generating synthetic data, and while looking into it I found the following article.

LLMによる合成データ(Synthetic Data)生成のテクニック [Techniques for Generating Synthetic Data with LLMs] (https://note.com/hatti8/n/n193430331561)

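Since this was my first time generating synthetic data, I decided to try Self-Instruct, which the article describes as "the basic form of synthetic data generation." Luckily, the article also links to the author's GitHub repository, so I used that source code as a reference. The author's code uses Mixtral-8x22B, but I used Qwen/Qwen2.5.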

About Self-Instruct

The article describes Self-Instruct as follows.

The basic flow is to start from human-written data called seed tasks and ask the LLM to "generate tasks like these to give to an LLM," building up a dataset in the process.
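To make that flow concrete, here is a minimal sketch of the loop as I understand it. It is not the author's code: `self_instruct_loop`, `generate`, and `fake_llm` are hypothetical names, and the LLM call is replaced by a stub so the example runs on its own. Demonstrations are drawn only from the seed tasks, which is how this post uses them.

```python
import random
from typing import Callable, List, Optional


def self_instruct_loop(
    seed_tasks: List[str],
    generate: Callable[[List[str]], List[str]],
    rounds: int,
    num_examples: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Repeatedly show a few seed tasks to a generator and collect new tasks."""
    rng = rng or random.Random()
    generated: List[str] = []
    for _ in range(rounds):
        # Pick a few human-written tasks to act as demonstrations.
        examples = rng.sample(seed_tasks, num_examples)
        # `generate` stands in for "build a prompt, call the LLM, parse the reply".
        for task in generate(examples):
            # Keep only tasks that are not exact duplicates.
            if task not in seed_tasks and task not in generated:
                generated.append(task)
    return generated


def fake_llm(examples: List[str]) -> List[str]:
    """Stub generator (assumption): a real run would call the model here."""
    return [f"variation of: {e}" for e in examples]


seeds = [f"seed instruction {i}" for i in range(10)]
new_tasks = self_instruct_loop(seeds, fake_llm, rounds=2, rng=random.Random(0))
print(len(new_tasks))
```

In a real run, `generate` would build the prompt from the sampled tasks, send it to the model, and parse the numbered tasks out of the reply.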

For the seed tasks I used elyza-tasks-100. Following the source code, I converted each task into a triple of instruction, input, and output, in the format shown below.

{"id": "seed_task_10", "instruction": "以下の製品について、Facebookで流す広告を書いてください。子供のいる親をターゲットにして、キャッチーにしてください。\n", "instances": [{"input": "【製品概要】\n子供のためのGPSトラッカー。音声通話ができる。", "output": "トークできるGPS 新登場！\n\n遊んでいる子供の様子が心配なこと、ありませんか？💦\nこのGPSトラッカーなら、子供の位置がすぐ分かります！\nしかも、通話で声が聞けるから安心😆\n\n#喋れるGPS #キッズGPS"}]}
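A record in this shape can be written out with the standard `json` module. The sketch below is only an illustration: `make_seed_task` is a hypothetical helper, and it assumes the instruction and input have already been separated, since how elyza-tasks-100 entries were split into the two fields is not shown here.

```python
import json


def make_seed_task(index: int, instruction: str, input_text: str, output_text: str) -> dict:
    """Build one seed task record in the instruction/instances layout shown above."""
    return {
        "id": f"seed_task_{index}",
        "instruction": instruction,
        "instances": [{"input": input_text, "output": output_text}],
    }


record = make_seed_task(
    10,
    "以下の製品について、Facebookで流す広告を書いてください。\n",
    "【製品概要】\n子供のためのGPSトラッカー。音声通話ができる。",
    "このGPSトラッカーなら、子供の位置がすぐ分かります！",
)

# ensure_ascii=False keeps the Japanese text readable in the file;
# newlines inside strings are written as the two characters \n.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing one such line per task gives a JSON Lines file that is easy to load back with `json.loads`.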

Because they come from elyza-tasks-100, there are 100 seed tasks. The source code picks 3 seed tasks at random, so I did the same and sampled 3 of the 100 at random.
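The sampling step can be sketched with `random.sample`, which draws without replacement, so the three tasks are always distinct. The 100 records below are placeholders generated in memory (an assumption made so the example is self-contained); in practice they would be read from the seed task file.

```python
import json
import random

# Placeholder JSON Lines text standing in for the real seed task file (assumption).
jsonl_text = "\n".join(
    json.dumps(
        {
            "id": f"seed_task_{i}",
            "instruction": f"指示{i}",
            "instances": [{"input": "", "output": f"出力{i}"}],
        },
        ensure_ascii=False,
    )
    for i in range(100)
)

# Parse one JSON object per non-empty line.
seed_tasks = [json.loads(row) for row in jsonl_text.splitlines() if row.strip()]

# A seeded generator makes the draw reproducible; drop the seed for fresh draws.
rng = random.Random(42)
sampled = rng.sample(seed_tasks, 3)
print([task["id"] for task in sampled])
```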

Prompt

I reused the prompt file from the author's GitHub repository as-is. Its contents are shown below.

You are asked to come up with a set of 10 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.

Here are the requirements:
1. Avoid using the same phrases for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
5. The instructions, inputs and outputs mast be in Japanese. English must not be used.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
7. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging.
8. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "<noinput>" in the input field.
9. The output should be an appropriate response to the instruction and the input.Please output within 100 words.
10. The instructions must not be a translation task.

List of 10 tasks:

In Japanese, it reads roughly as follows.

あなたには、10の多様なタスク指示のセットを考えてもらいます。これらのタスク指示はLLMに与えられ、我々はその指示を完了したLLMを評価します。

以下がその条件です。:
1. 多様性を最大化するために、各指示について動詞を繰り返さないようにしてください。
2. 指示に使われる言葉も多様であるべきです。例えば、質問と命令形を組み合わせます。
3. 指示の種類も多様であるべきです。リストには、自由形式の生成、分類、編集など、多様なタイプのタスクを含めるべきです。
4. LLMは、命令を完了できなければなりません。例えば、アシスタントに視覚的または音声的な出力を作成するよう求めてはなりません。別の例では、午後5時にあなたを起こしたり、リマインダーを設定するようにアシスタントに依頼してはいけません。
5. 指示は日本語で書いてください。
6. 指示は1～2文の長さであること。命令文でも疑問文でも構いません。
7. 指示に対して適切な入力を生成すること。入力フィールドには、指示のために提供された具体的な例が含まれていなければなりません。現実的なデータを含むべきで、単純なプレースホルダーを含んではいけません。入力は、指示をやりがいのあるものにするために実質的な内容を提供する必要がありますが、理想的には100語を超えないようにしてください。
8. すべての指示に入力が必要なわけではありません。例えば、「世界で最も高い山はどこか」というような一般的な情報を尋ねる指示の場合、具体的な文脈を提供する必要はありません。この場合、入力フィールドに"<noinput>"と書くだけでよいです。
9. 出力は指示と入力に対する適切な応答でなければなりません。出力は100語以内にしてください。

10のタスクのリスト:

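Put together, the full prompt looks like this. The three sampled seed tasks are appended as examples 1 to 3, and the prompt ends with "4. Instruction:" so that the model continues from the fourth task.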

'You are asked to come up with a set of 10 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.\n\nHere are the requirements:\n1. Avoid using the same phrases for each instruction to maximize diversity.\n2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.\n3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.\n4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.\n5. The instructions, inputs and outputs mast be in Japanese. English must not be used.\n6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.\n7. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging.\n8. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "<noinput>" in the input field.\n9. The output should be an appropriate response to the instruction and the input.Please output within 100 words.\n10. The instructions must not be a translation task.\n\nList of 10 tasks:\n###\n1. Instruction: 以下のシチュエーションでの適切な発言をいくつか考えてください。\n1. Input:\nシチュエーション: 誰かが無事に到着したとき\n1. Output:\n誰かが無事に到着したときの適切な発言としては以下のものが考えられます。\n\n- お疲れ様です。無事に着いて良かったです。\n- ようこそ！長い旅でしたね。ゆっくり休んでください。\n- お久しぶりです。無事に到着されたようで何よりです。\n###\n2. Instruction: 次の文章では、どこかの時点である記事から別の記事へと変わります。あなたのタスクはこの境界を推測し、別の記事に変わった最初の文を記述することです。\n2. Input:\nキャットフードの種類が多くて迷いますが、毎日の食事選びの基本は、栄養過不足にならないよう、「総合栄養食」で「ライフステージに合ったもの」を選ぶことです。\n基本は「総合栄養食」のドライフードとお水だけでOK\nキャットフードには、ドライフードとウェットフードがあります。ドライフード（通称カリカリ）のほとんどは「総合栄養食」なので、フードとお水だけで、猫ちゃんに必要な栄養素を摂取できます。\n猫は狩猟動物ですから、狩猟本能が満たされない欲求不満や運動不足も大きなストレス源となります。\nキャットタワーやキャットウォークなど、猫が運動しやすい環境を整えてあげたいですね。\n猫がストレスを感じると、便秘や下痢、食欲不振、過度なグルーミングといったストレスサインが現れます。時には皮膚炎や膀胱炎といった病気を引き起こす場合もあります。\n長期的なストレスは免疫の低下にも繋がるので、「たかがストレス」と思わずに早めに原因を取り除くようにしましょう。\n2. Output:\n文章全般では猫について記述されていますが、前半ではキャットフードについて、後半では猫のストレスについて述べられています。初めて猫のストレスについて述べられた文が境界であると考えられます。\n\nよって別の記事に変わった最初の文は「猫は狩猟動物ですから、狩猟本能が満たされない欲求不満や運動不足も大きなストレス源となる。」です。\n###\n3. Instruction: ガラスを使い捨てライターで炙ったら燃えますか？\n3. Input:\n<noinput>\n3. Output:\n一般的にガラスは不燃物であり燃えません。\n\n燃焼とは物質が熱と光を発生しながら酸化することであり、ガラスは酸化しにくい物質であるため燃えません。\n\n一方でガラスを炙ると一部のみが熱により膨張し、割れることがあります。\n###\n4. Instruction:'
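A prompt with this layout can be assembled from the header text and the sampled seed tasks roughly as follows. This is a sketch that reproduces the layout above, not the author's code: `encode_prompt` is a hypothetical name, the header is abbreviated to its last line, and mapping an empty input to "<noinput>" is my assumption based on requirement 8.

```python
from typing import Dict, List


def encode_prompt(header: str, seed_tasks: List[Dict]) -> str:
    """Append numbered few-shot examples to the header and open the next task."""
    prompt = header + "\n"
    idx = 0
    for idx, task in enumerate(seed_tasks, start=1):
        instance = task["instances"][0]
        # Assumption: an empty input is shown to the model as "<noinput>".
        input_text = instance["input"].strip() or "<noinput>"
        prompt += "###\n"
        prompt += f"{idx}. Instruction: {task['instruction'].strip()}\n"
        prompt += f"{idx}. Input:\n{input_text}\n"
        prompt += f"{idx}. Output:\n{instance['output'].strip()}\n"
    # Leave the next instruction open so the model continues the list.
    prompt += "###\n"
    prompt += f"{idx + 1}. Instruction:"
    return prompt


header = "List of 10 tasks:"  # abbreviated; the real header is the full prompt file
tasks = [
    {"instruction": "指示A\n", "instances": [{"input": "入力A", "output": "出力A"}]},
    {"instruction": "指示B", "instances": [{"input": "入力B", "output": "出力B"}]},
    {"instruction": "指示C", "instances": [{"input": "", "output": "出力C"}]},
]
prompt = encode_prompt(header, tasks)
print(prompt)
```

Ending on an open "4. Instruction:" line nudges the model to keep writing tasks in the same numbered Instruction/Input/Output pattern, which also makes the reply easier to split on "###" afterwards.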

Results