Skip to content

Types

This section contains the different types used accross the distilabel codebase.

base

ChatType = List[ChatItem] module-attribute

ChatType is a type alias for a list of dicts following the OpenAI conversational format.

ImageUrl

Bases: TypedDict

Source code in src/distilabel/typing/base.py
class ImageUrl(TypedDict):
    url: Required[str]
    """Either a URL of the image or the base64 encoded image data."""
url instance-attribute

Either a URL of the image or the base64 encoded image data.

ImageContent

Bases: TypedDict

Type alias for the user's message in a conversation that can include text or an image. It's the standard type for vision language models: https://platform.openai.com/docs/guides/vision

Source code in src/distilabel/typing/base.py
class ImageContent(TypedDict, total=False):
    """Type alias for the user's message in a conversation that can include text or an image.
    It's the standard type for vision language models:
    https://platform.openai.com/docs/guides/vision
    """

    type: Required[Literal["image_url"]]
    image_url: Required[ImageUrl]

steps

StepOutput = Iterator[List[Dict[str, Any]]] module-attribute

StepOutput is an alias of the typing Iterator[List[Dict[str, Any]]]

GeneratorStepOutput = Iterator[Tuple[List[Dict[str, Any]], bool]] module-attribute

GeneratorStepOutput is an alias of the typing Iterator[Tuple[List[Dict[str, Any]], bool]]

StepColumns = Union[List[str], Dict[str, bool]] module-attribute

StepColumns is an alias of the typing Union[List[str], Dict[str, bool]] used by the inputs and outputs properties of an Step. In the case of a List[str], it is a list with the required columns. In the case of a Dict[str, bool], it is a dictionary where the keys are the columns and the values are booleans indicating whether the column is required or not.

models

LLMLogprobs = List[List[List[Logprob]]] module-attribute

A type alias representing the probability distributions output by an LLM.

Structure
  • Outermost list: contains multiple generation choices when sampling (n sequences)
  • Middle list: represents each position in the generated sequence
  • Innermost list: contains the log probabilities for each token in the vocabulary at that position

LLMStatistics = Union[TokenCount, Dict[str, Any]] module-attribute

Initially the LLMStatistics will contain the token count, but can have more variables. They can be added once we have them defined for every LLM.

StructuredOutputType = Union[OutlinesStructuredOutputType, InstructorStructuredOutputType] module-attribute

StructuredOutputType is an alias for the union of OutlinesStructuredOutputType and InstructorStructuredOutputType.

StandardInput = ChatType module-attribute

StandardInput is an alias for ChatType that defines the default / standard input produced by format_input.

StructuredInput = Tuple[StandardInput, Union[StructuredOutputType, None]] module-attribute

StructuredInput defines a type produced by format_input when using either StructuredGeneration or a subclass of it.

FormattedInput = Union[StandardInput, StructuredInput, str] module-attribute

FormattedInput is an alias for the union of StandardInput and StructuredInput as generated by format_input and expected by the LLMs, as well as ConversationType for the vision language models.

OutlinesStructuredOutputType

Bases: TypedDict

TypedDict to represent the structured output configuration from outlines.

Source code in src/distilabel/typing/models.py
class OutlinesStructuredOutputType(TypedDict, total=False):
    """TypedDict to represent the structured output configuration from `outlines`."""

    format: Literal["json", "regex"]
    """One of "json" or "regex"."""
    schema: Union[str, Type[BaseModel], Dict[str, Any]]
    """The schema to use for the structured output. If "json", it
    can be a pydantic.BaseModel class, or the schema as a string,
    as obtained from `model_to_schema(BaseModel)`, if "regex", it
    should be a regex pattern as a string.
    """
    whitespace_pattern: Optional[Union[str, List[str]]]
    """If "json" corresponds to a string or a list of
    strings with a pattern (doesn't impact string literals).
    For example, to allow only a single space or newline with
    `whitespace_pattern=r"[\n ]?"`
    """
format instance-attribute

One of "json" or "regex".

schema instance-attribute

The schema to use for the structured output. If "json", it can be a pydantic.BaseModel class, or the schema as a string, as obtained from model_to_schema(BaseModel), if "regex", it should be a regex pattern as a string.

whitespace_pattern instance-attribute

If "json" corresponds to a string or a list of strings with a pattern (doesn't impact string literals). For example, to allow only a single space or newline with whitespace_pattern=r"[ ]?"

InstructorStructuredOutputType

Bases: TypedDict

TypedDict to represent the structured output configuration from instructor.

Source code in src/distilabel/typing/models.py
class InstructorStructuredOutputType(TypedDict, total=False):
    """TypedDict to represent the structured output configuration from `instructor`."""

    format: Optional[Literal["json"]]
    """One of "json"."""
    schema: Union[Type[BaseModel], Dict[str, Any]]
    """The schema to use for the structured output, a `pydantic.BaseModel` class. """
    mode: Optional[str]
    """Generation mode. Take a look at `instructor.Mode` for more information, if not informed it will
    be determined automatically. """
    max_retries: int
    """Number of times to reask the model in case of error, if not set will default to the model's default. """
format instance-attribute

One of "json".

schema instance-attribute

The schema to use for the structured output, a pydantic.BaseModel class.

mode instance-attribute

Generation mode. Take a look at instructor.Mode for more information, if not informed it will be determined automatically.

max_retries instance-attribute

Number of times to reask the model in case of error, if not set will default to the model's default.

pipeline

DownstreamConnectable = Union['Step', 'GlobalStep'] module-attribute

Alias for the Step types that can be connected as downstream steps.

UpstreamConnectableSteps = TypeVar('UpstreamConnectableSteps', bound=Union['Step', 'GlobalStep', 'GeneratorStep']) module-attribute

Type for the Step types that can be connected as upstream steps.

DownstreamConnectableSteps = TypeVar('DownstreamConnectableSteps', bound=DownstreamConnectable, covariant=True) module-attribute

Type for the Step types that can be connected as downstream steps.

PipelineRuntimeParametersInfo = Dict[str, Union[List['RuntimeParameterInfo'], Dict[str, 'RuntimeParameterInfo']]] module-attribute

Alias for the information of the runtime parameters of a Pipeline.

InputDataset = Union['Dataset', 'pd.DataFrame', List[Dict[str, str]]] module-attribute

Alias for the types we can process as input dataset.

LoadGroups = Union[List[List[Any]], Literal['sequential_step_execution']] module-attribute

Alias for the types that can be used as load groups.

  • if List[List[Any]], it's a list containing lists of steps that have to be loaded in isolation.
  • if "sequential_step_execution", each step will be loaded in a different stage i.e. only one step will be executed at a time.

StepLoadStatus

Bases: TypedDict

Dict containing information about if one step was loaded/unloaded or if it's load failed

Source code in src/distilabel/typing/pipeline.py
class StepLoadStatus(TypedDict):
    """Dict containing information about if one step was loaded/unloaded or if it's load
    failed"""

    name: str
    status: Literal["loaded", "unloaded", "load_failed"]