Create Function-Calling datasets with APIGen

This example introduces APIGen (Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets), a data generation pipeline designed to synthesize verifiable, high-quality datasets for function-calling applications.

Replication

The following figure showcases the APIGen framework:

Now, let's walk through the key steps illustrated in the figure:

DataSampler: Together with the original Salesforce/xlam-function-calling-60k dataset, this step plays the role of the Seed QA Data Sampler that feeds the prompt template.
APIGenGenerator: This step acts as the Query-Answer Generator. Because it relies on structured output generation, it also covers Stage 1: Format Checker.
APIGenExecutionChecker: This step is in charge of Stage 2: Execution Checker.
APIGenSemanticChecker: This step runs Stage 3: Semantic Checker. It can use the same LLM as the generator or a different one; here we use the same one as in the APIGenGenerator step.

The current implementation does not yet use the Diverse Prompt Library. To incorporate it, one could either adjust the prompt template within the APIGenGenerator or develop a new sampler specifically for this purpose. As for the API Sampler, while no specific data is shared here, we've created illustrative examples to demonstrate the pipeline's functionality. These examples represent a mix of data that could be used to replicate the sampler's output.

Data preparation

The original paper describes the data the authors used and gives some hints, but the data itself was not shared. In this example, we will write a handful of examples by hand to showcase how this pipeline can be built.

Assume we have the following function names, and corresponding descriptions of their behaviour:

data = [
    {
        "func_name": "final_velocity",
        "func_desc": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
    },
    {
        "func_name": "permutation_count",
        "func_desc": "Calculates the number of permutations of k elements from a set of n elements.",
    },
    {
        "func_name": "getdivision",
        "func_desc": "Divides two numbers by making an API call to a division service.",
    },
    {
        "func_name": "binary_addition",
        "func_desc": "Adds two binary numbers and returns the result as a binary string.",
    },
    {
        "func_name": "swapi_planet_resource",
        "func_desc": "get a specific planets resource",
    },
    {
        "func_name": "disney_character",
        "func_desc": "Find a specific character using this endpoint",
    }
]

The original paper refers to both Python functions and APIs, but we will use Python functions exclusively for simplicity. To execute and check these functions/APIs, we need access to the code, which we have moved to a Python file: lib_apigen.py. All these functions are executable, but we also need access to their tool representation. For this, we will use transformers' get_json_schema function¹.
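The contents of lib_apigen.py are not reproduced in this page, so the following is only a minimal sketch of what one of its functions and its tool representation could look like. The `simple_json_schema` helper is a hypothetical, hand-rolled stand-in written with the standard library; it is not transformers' get_json_schema, and the body of `final_velocity` is an assumption based on the function description above.

```python
import inspect


def final_velocity(initial_velocity: float, acceleration: float, time: float) -> float:
    """Calculates the final velocity of an object given its initial velocity, acceleration, and time.

    Args:
        initial_velocity: The initial velocity of the object.
        acceleration: The acceleration of the object.
        time: The time elapsed.
    """
    # Assumed implementation: v = v0 + a * t
    return initial_velocity + acceleration * time


# Mapping from Python annotations to JSON Schema type names (simplified).
_JSON_TYPES = {float: "number", int: "integer", str: "string", bool: "boolean"}


def simple_json_schema(func):
    """Hypothetical stand-in for a schema generator, NOT transformers' get_json_schema."""
    doc = inspect.getdoc(func) or ""
    description = doc.split("\n\n")[0].strip()

    # Collect "name: description" pairs listed after the "Args:" line.
    arg_docs, in_args = {}, False
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped == "Args:":
            in_args = True
        elif in_args and ":" in stripped:
            name, _, text = stripped.partition(":")
            arg_docs[name.strip()] = text.strip()

    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {
            "type": _JSON_TYPES.get(param.annotation, "string"),
            "description": arg_docs.get(name, ""),
        }
        if param.default is inspect.Parameter.empty:
            required.append(name)

    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": description,
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }


def get_tools():
    """Return a mapping from function name to its tool representation."""
    return {func.__name__: simple_json_schema(func) for func in (final_velocity,)}
```

The resulting dictionary has the same overall shape as the "tools" entry in the example row at the end of this page: a "function" object with a name, a description, and a JSON Schema for the parameters.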

We have all the machinery prepared in our libpath, except for the tool definitions. With the helper function load_module_from_path we will load this Python module, collect all the tools, and add them to each row in our data variable.

from pathlib import Path

from distilabel.steps.tasks.apigen.utils import load_module_from_path

# Path to the file holding the functions, as defined in the full script below
libpath = Path(__file__).parent / "lib_apigen.py"

libpath_module = load_module_from_path(libpath)
tools = libpath_module.get_tools()  # call get_tools()

for row in data:
    # The tools should have a mix where both the correct and irrelevant tools are present.
    row.update({"tools": [tools[row["func_name"]]]})
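The loop above attaches only the correct tool to each row, while its comment asks for a mix of correct and irrelevant tools. One possible way to get that mix is sketched below. The `add_distractor_tools` helper, its `max_extra` and `seed` parameters, and the toy data are all hypothetical and are not part of distilabel.

```python
import random


def add_distractor_tools(rows, tools, max_extra=2, seed=42):
    """Return copies of `rows` whose "tools" hold the correct tool plus 0..max_extra others."""
    rng = random.Random(seed)  # seeded so the selection is reproducible
    enriched = []
    for row in rows:
        correct = row["func_name"]
        others = [name for name in tools if name != correct]
        n_extra = rng.randint(0, min(max_extra, len(others)))
        selected = [correct] + rng.sample(others, n_extra)
        rng.shuffle(selected)  # avoid always placing the correct tool first
        enriched.append({**row, "tools": [tools[name] for name in selected]})
    return enriched


# Placeholder tool schemas, standing in for the output of get_tools().
toy_tools = {
    name: {"type": "function", "function": {"name": name}}
    for name in ("final_velocity", "permutation_count", "binary_addition")
}
toy_rows = [{"func_name": "final_velocity"}, {"func_name": "binary_addition"}]
enriched_rows = add_distractor_tools(toy_rows, toy_tools)
```

Shuffling the selected tools keeps the model from learning that the first tool in the list is always the right one.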

Now we have all the necessary data for our prompt. Additionally, we will make use of the original dataset as few-shot examples to enhance the model:

from datasets import load_dataset

ds_og = (
    load_dataset("Salesforce/xlam-function-calling-60k", split="train")
    .shuffle(seed=42)
    .select(range(500))
    .to_list()
)

We load only a subset and convert it to a list of dictionaries because the DataSampler GeneratorStep will draw random examples from it.
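To make the few-shot idea concrete, here is a rough standard-library sketch of drawing random rows and rendering them in the "## Query: / ## Answers:" layout that appears in the examples column of the example row at the end of this page. The helper names are hypothetical, the "query" and "answers" keys are assumed to match the dataset columns, and this is not how DataSampler or PrepareExamples are implemented internally.

```python
import random


def sample_examples(dataset, size, seed=None):
    """Draw `size` distinct rows at random from a list of dictionaries."""
    rng = random.Random(seed)
    return rng.sample(dataset, size)


def format_examples(rows):
    """Render rows as few-shot text, one "## Query / ## Answers" block per row."""
    return "\n\n".join(
        f"## Query:\n{row['query']}\n## Answers:\n{row['answers']}" for row in rows
    )


# Toy rows taken from the example output shown later on this page.
toy_dataset = [
    {
        "query": "Retrieve the detailed recipe for the cake with ID 'cake101'.",
        "answers": '[{"name": "detailed_cake_recipe_by_id", "arguments": {"is_id": "cake101"}}]',
    },
    {
        "query": "Retrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.",
        "answers": '[{"name": "v1_post_post_id_comments", "arguments": {"post_id": "12345", "count": 15}}]',
    },
]

few_shot_text = format_examples(sample_examples(toy_dataset, size=2, seed=0))
```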

Building the Pipeline

Now that we've walked through each component, it's time to see how it all comes together. Here's the Pipeline code:

with Pipeline(name="apigen-example") as pipeline:
    loader_seeds = LoadDataFromDicts(data=data)  # (1)

    sampler = DataSampler(  # (2)
        data=ds_og,
        size=2,
        samples=len(data),
        batch_size=8,
    )

    prep_examples = PrepareExamples()  # This step will add the 'examples' column

    combine_steps = CombineOutputs()  # (3)

    model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
    llm = InferenceEndpointsLLM(  # (4)
        model_id=model_id,
        tokenizer_id=model_id,
        generation_kwargs={
            "temperature": 0.7,
            "max_new_tokens": 2048,
        },
    )
    apigen = APIGenGenerator(  # (5)
        llm=llm,
        use_default_structured_output=True,
    )

    execution_checker = APIGenExecutionChecker(libpath=str(libpath))  # (6)
    semantic_checker = APIGenSemanticChecker(llm=llm)  # (7)

    sampler >> prep_examples
    (
        [loader_seeds, prep_examples]
        >> combine_steps
        >> apigen
        >> execution_checker
        >> semantic_checker
    )

(1) Load the seed data we will use to generate our function-calling dataset.
(2) The DataSampler, together with PrepareExamples, creates the few-shot examples from the original dataset that are fed into our prompt.
(3) Combine the outputs of both branches into a single stream of data.
(4) The same LLM is reused for the generation and for the semantic checks.
(5) Creates the queries and answers that will be used, together with the tools, to fine-tune a new model. Structured outputs are generated to ensure the answers are valid JSON.
(6) Adds the columns keep_row_after_execution_check and execution_result.
(7) Adds the columns keep_row_after_semantic_check and thought.
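To illustrate what an execution check involves, the sketch below parses the generated answers, calls the named function with the given arguments, and fills in the two columns mentioned above. It is a simplified, hypothetical stand-in: the `execution_check` helper and the `REGISTRY` dictionary are assumptions for illustration, and APIGenExecutionChecker may differ in how it loads functions and reports errors.

```python
import json


def final_velocity(initial_velocity, acceleration, time):
    # Assumed implementation of one library function: v = v0 + a * t
    return initial_velocity + acceleration * time


# Hypothetical registry mapping tool names to callables.
REGISTRY = {"final_velocity": final_velocity}


def execution_check(answers_json, registry):
    """Run every function call in `answers_json` and report whether all succeeded."""
    try:
        calls = json.loads(answers_json)
    except json.JSONDecodeError as exc:
        return {"keep_row_after_execution_check": False, "execution_result": [f"Invalid JSON: {exc}"]}
    if not isinstance(calls, list):
        return {"keep_row_after_execution_check": False, "execution_result": ["Expected a list of calls"]}

    keep, results = True, []
    for call in calls:
        func = registry.get(call.get("name"))
        if func is None:
            keep = False
            results.append(f"Unknown function: {call.get('name')}")
            continue
        try:
            results.append(str(func(**call.get("arguments", {}))))
        except Exception as exc:  # any failure marks the row as not executable
            keep = False
            results.append(f"{type(exc).__name__}: {exc}")
    return {"keep_row_after_execution_check": keep, "execution_result": results}


ok = execution_check(
    '[{"name": "final_velocity", "arguments": {"initial_velocity": 0, "acceleration": 2, "time": 5}}]',
    REGISTRY,
)
bad = execution_check('[{"name": "does_not_exist", "arguments": {}}]', REGISTRY)
```

Rows are flagged rather than dropped, so the decision to filter can be taken after the pipeline has run.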

Script and final dataset

To see all the pieces in place, take a look at the full pipeline, as well as an example row that would be generated from this pipeline.

Run

python examples/pipeline_apigen.py

pipeline_apigen.py

# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from pathlib import Path

from datasets import load_dataset

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineOutputs, DataSampler, LoadDataFromDicts
from distilabel.steps.tasks import (
    APIGenExecutionChecker,
    APIGenGenerator,
    APIGenSemanticChecker,
)
from distilabel.steps.tasks.apigen.utils import PrepareExamples, load_module_from_path

libpath = Path(__file__).parent / "lib_apigen.py"

data = [
    {
        "func_name": "final_velocity",
        "func_desc": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
    },
    {
        "func_name": "permutation_count",
        "func_desc": "Calculates the number of permutations of k elements from a set of n elements.",
    },
    {
        "func_name": "getdivision",
        "func_desc": "Divides two numbers by making an API call to a division service.",
    },
    {
        "func_name": "binary_addition",
        "func_desc": "Adds two binary numbers and returns the result as a binary string.",
    },
    {
        "func_name": "swapi_planet_resource",
        "func_desc": "get a specific planets resource",
    },
    {
        "func_name": "disney_character",
        "func_desc": "Find a specific character using this endpoint",
    },
]

libpath_module = load_module_from_path(libpath)
tools = libpath_module.get_tools()  # call get_tools()

# TODO: Add in the tools between 0 and 2 extra tools to make the task more challenging.
for row in data:
    # The tools should have a mix where both the correct and irrelevant tools are present.
    row.update({"tools": [tools[row["func_name"]]]})


ds_og = (
    load_dataset("Salesforce/xlam-function-calling-60k", split="train")
    .shuffle(seed=42)
    .select(range(500))
    .to_list()
)


with Pipeline(name="APIGenPipeline") as pipeline:
    loader_seeds = LoadDataFromDicts(data=data)
    sampler = DataSampler(
        data=ds_og,
        size=2,
        samples=len(data),
        batch_size=8,
    )

    prep_examples = PrepareExamples()

    model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
    llm = InferenceEndpointsLLM(
        model_id=model_id,
        tokenizer_id=model_id,
        generation_kwargs={
            "temperature": 0.7,
            "max_new_tokens": 2048,
        },
    )
    apigen = APIGenGenerator(
        llm=llm,
        use_default_structured_output=True,
    )
    combine_steps = CombineOutputs()

    execution_checker = APIGenExecutionChecker(libpath=str(libpath))
    semantic_checker = APIGenSemanticChecker(llm=llm)

    sampler >> prep_examples
    (
        [loader_seeds, prep_examples]
        >> combine_steps
        >> apigen
        >> execution_checker
        >> semantic_checker
    )


if __name__ == "__main__":
    distiset = pipeline.run()
    print(distiset["default"]["train"][0])

Example row:

{
  "func_name": "final_velocity",
  "func_desc": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
  "tools": [
    {
      "function": {
        "description": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
        "name": "final_velocity",
        "parameters": {
          "properties": {
            "acceleration": {
              "description": "The acceleration of the object.",
              "type": "number"
            },
            "initial_velocity": {
              "description": "The initial velocity of the object.",
              "type": "number"
            },
            "time": {
              "description": "The time elapsed.",
              "type": "number"
            }
          },
          "required": [
            "initial_velocity",
            "acceleration",
            "time"
          ],
          "type": "object"
        }
      },
      "type": "function"
    }
  ],
  "examples": "## Query:\nRetrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.\n## Answers:\n[{\"name\": \"v1_post_post_id_comments\", \"arguments\": {\"post_id\": \"12345\", \"count\": 15}}]\n\n## Query:\nRetrieve the detailed recipe for the cake with ID 'cake101'.\n## Answers:\n[{\"name\": \"detailed_cake_recipe_by_id\", \"arguments\": {\"is_id\": \"cake101\"}}]\n\n## Query:\nWhat are the frequently asked questions and their answers for Coca-Cola Company? Also, what are the suggested tickers based on Coca-Cola Company?\n## Answers:\n[{\"name\": \"symbols_faq\", \"arguments\": {\"ticker_slug\": \"KO\"}}, {\"name\": \"symbols_suggested\", \"arguments\": {\"ticker_slug\": \"KO\"}}]",
  "query": "What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.",
  "answers": "[{\"arguments\": {\"acceleration\": \"9.8\", \"initial_velocity\": \"0\", \"time\": \"10\"}, \"name\": \"final_velocity\"}]",
  "distilabel_metadata": {
    "raw_input_a_p_i_gen_generator_0": [
      {
        "content": "You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively",
        "role": "system"
      },
      {
        "content": "Here are examples of queries and the corresponding answers for similar functions:\n## Query:\nRetrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.\n## Answers:\n[{\"name\": \"v1_post_post_id_comments\", \"arguments\": {\"post_id\": \"12345\", \"count\": 15}}]\n\n## Query:\nRetrieve the detailed recipe for the cake with ID 'cake101'.\n## Answers:\n[{\"name\": \"detailed_cake_recipe_by_id\", \"arguments\": {\"is_id\": \"cake101\"}}]\n\n## Query:\nWhat are the frequently asked questions and their answers for Coca-Cola Company? Also, what are the suggested tickers based on Coca-Cola Company?\n## Answers:\n[{\"name\": \"symbols_faq\", \"arguments\": {\"ticker_slug\": \"KO\"}}, {\"name\": \"symbols_suggested\", \"arguments\": {\"ticker_slug\": \"KO\"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\n\nBased on these examples, generate 1 diverse query and answer pairs for the function `final_velocity`.\nThe detailed function description is the following:\nCalculates the final velocity of an object given its initial velocity, acceleration, and time.\n\nThese are the available tools to help you:\n[{'type': 'function', 'function': {'name': 'final_velocity', 'description': 'Calculates the final velocity of an object given its initial velocity, acceleration, and time.', 'parameters': {'type': 'object', 'properties': {'initial_velocity': {'type': 'number', 'description': 'The initial velocity of the object.'}, 'acceleration': {'type': 'number', 'description': 'The acceleration of the object.'}, 'time': {'type': 'number', 'description': 'The time elapsed.'}}, 'required': ['initial_velocity', 'acceleration', 'time']}}}]\n\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\n```json\n[\n   {\n       \"query\": \"The generated query.\",\n       \"answers\": [\n           {\n               \"name\": \"api_name\",\n               \"arguments\": {\n                   \"arg_name\": \"value\"\n                   ... (more arguments as required)\n               }\n           },\n           ... (more API calls as required)\n       ]\n   }\n]\n```\n\nNow please generate 1 diverse query and answer pairs following the above format.",
        "role": "user"
      }
    ],
    "raw_input_a_p_i_gen_semantic_checker_0": [
      {
        "content": "As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\n\nDo not pass if:\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\n2. The function call and arguments are not properly chosen from the available functions.\n3. The number of function calls does not correspond to the user\u2019s intentions.\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\n5. The execution results contain errors or reflect that the function calls were not executed successfully.",
        "role": "system"
      },
      {
        "content": "Given Information:\n- All Available Functions:\nCalculates the final velocity of an object given its initial velocity, acceleration, and time.\n- User Query: What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\n- Generated Function Calls: [{\"arguments\": {\"acceleration\": \"9.8\", \"initial_velocity\": \"0\", \"time\": \"10\"}, \"name\": \"final_velocity\"}]\n- Execution Results: ['9.8']\n\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\n\nThe main decision factor is wheather the function calls accurately reflect the query's intentions and the function descriptions.\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\n\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\n```\n{\n   \"thought\": \"Concisely describe your reasoning here\",\n   \"passes\": \"yes\" or \"no\"\n}\n```\n",
        "role": "user"
      }
    ],
    "raw_output_a_p_i_gen_generator_0": "{\"pairs\": [\n   {\n       \"answers\": [\n           {\n               \"arguments\": {\n                   \"acceleration\": \"9.8\",\n                   \"initial_velocity\": \"0\",\n                   \"time\": \"10\"\n               },\n               \"name\": \"final_velocity\"\n           }\n       ],\n       \"query\": \"What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\"\n   }\n]}",
    "raw_output_a_p_i_gen_semantic_checker_0": "{\n   \"thought\": \"\",\n   \"passes\": \"yes\"\n}"
  },
  "model_name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
  "keep_row_after_execution_check": true,
  "execution_result": [
    "9.8"
  ],
  "thought": "",
  "keep_row_after_semantic_check": true
}
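Once the dataset is generated, the two keep_row_* columns can be used to retain only verified rows. The sketch below shows one way to do this on plain dictionaries, together with a possible way of turning the semantic checker's raw JSON output into the thought and keep_row_after_semantic_check fields. Both helpers are hypothetical, and returning None for unparsable output is an assumption of this sketch rather than documented distilabel behaviour.

```python
import json


def parse_semantic_output(raw_output):
    """Map the checker's raw JSON string to the `thought` and keep-row fields."""
    try:
        parsed = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        parsed = None
    if not isinstance(parsed, dict):
        # Assumption: unparsable output leaves both fields undefined.
        return {"thought": None, "keep_row_after_semantic_check": None}
    passes = str(parsed.get("passes", "")).strip().lower() == "yes"
    return {"thought": parsed.get("thought", ""), "keep_row_after_semantic_check": passes}


def keep_verified(rows):
    """Keep only the rows that passed both the execution and the semantic check."""
    return [
        row
        for row in rows
        if row.get("keep_row_after_execution_check") and row.get("keep_row_after_semantic_check")
    ]


# Raw output copied from the example row above.
raw = '{\n   "thought": "",\n   "passes": "yes"\n}'
checked = parse_semantic_output(raw)

toy_rows = [
    {"query": "a", "keep_row_after_execution_check": True, "keep_row_after_semantic_check": True},
    {"query": "b", "keep_row_after_execution_check": True, "keep_row_after_semantic_check": False},
    {"query": "c", "keep_row_after_execution_check": False, "keep_row_after_semantic_check": None},
]
verified = keep_verified(toy_rows)
```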

¹ Read this nice blog post for more information on tools and the reasoning behind get_json_schema: Tool Use, Unified.