DataSampler

Step to sample from a dataset.

GeneratorStep that samples from a dataset and yields the samples in batches. This step is useful when a pipeline can benefit from injecting examples into its prompts, for instance as few-shot examples that change on each row. You can pass a list of dictionaries with N entries and generate M samples from it (if used together with another step that loads data, M should match the number of rows loaded by that step). The size argument S is the number of samples drawn per generated row, so each row contains S entries to be used as examples.

Attributes

  • data: The list of dictionaries to sample from.

  • size: Number of samples drawn per generated row. For example, in a few-shot learning scenario, the number of few-shot examples generated per row. Defaults to 2.

  • samples: Total number of rows that will be generated by the step. If used with another loader step, this should match the number of rows produced by that loader. Defaults to 100.

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph New columns
            OCOL0[dynamic]
        end
    end

    subgraph DataSampler
        StepOutput[Output Columns: dynamic]
    end

    StepOutput --> OCOL0

Outputs

  • dynamic (based on the keys found in the first dictionary of the list): The columns of the dataset.

Examples

Sample data from a list of dictionaries

from distilabel.steps import DataSampler

sampler = DataSampler(
    data=[{"sample": f"sample {i}"} for i in range(30)],
    samples=10,
    size=2,
    batch_size=4
)
sampler.load()

result = next(sampler.process())
# >>> result
# ([{'sample': ['sample 7', 'sample 0']}, {'sample': ['sample 2', 'sample 21']}, {'sample': ['sample 17', 'sample 12']}, {'sample': ['sample 2', 'sample 14']}], False)
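As a usage sketch building on the sampler created above, the second element of each yielded tuple flags the last batch, so the generator can be drained with a plain loop (the variable names here are only illustrative):

for batch, is_last_batch in sampler.process():
    # `batch` is a list of dictionaries; with samples=10 and batch_size=4
    # the batches are expected to hold 4, 4 and 2 rows respectively.
    print(len(batch), is_last_batch)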

Pipeline with a loader and a sampler combined in a single stream

from datasets import load_dataset

from distilabel.steps import LoadDataFromDicts, DataSampler, CombineOutputs
from distilabel.steps.tasks.apigen.utils import PrepareExamples
from distilabel.pipeline import Pipeline

ds = (
    load_dataset("Salesforce/xlam-function-calling-60k", split="train")
    .shuffle(seed=42)
    .select(range(500))
    .to_list()
)
data = [
    {
        "func_name": "final_velocity",
        "func_desc": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
    },
    {
        "func_name": "permutation_count",
        "func_desc": "Calculates the number of permutations of k elements from a set of n elements.",
    },
    {
        "func_name": "getdivision",
        "func_desc": "Divides two numbers by making an API call to a division service.",
    },
]
with Pipeline(name="APIGenPipeline") as pipeline:
    loader_seeds = LoadDataFromDicts(data=data)
    sampler = DataSampler(
        data=ds,
        size=2,
        samples=len(data),
        batch_size=8,
    )
    prep_examples = PrepareExamples()
    # Combines the rows coming from the loader and the prepared examples
    # into a single stream.
    combine_steps = CombineOutputs()

    sampler >> prep_examples
    (
        [loader_seeds, prep_examples]
        >> combine_steps
    )
# Now we have a single stream of data with the loader and the sampler data
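A minimal sketch of running the pipeline defined above (the use_cache flag is optional and only illustrative):

distiset = pipeline.run(use_cache=False)
# `distiset` holds the combined rows produced by the loader and the sampler.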