Types¶

This section contains the different types used accross the distilabel codebase.

`base` ¶

`ChatType = List[ChatItem]` `module-attribute` ¶

ChatType is a type alias for a list of dicts following the OpenAI conversational format.

`ImageUrl` ¶

Bases: TypedDict

Source code in src/distilabel/typing/base.py

class ImageUrl(TypedDict):
    url: Required[str]
    """Either a URL of the image or the base64 encoded image data."""

`url` `instance-attribute` ¶

Either a URL of the image or the base64 encoded image data.

`ImageContent` ¶

Bases: TypedDict

Type alias for the user's message in a conversation that can include text or an image. It's the standard type for vision language models: https://platform.openai.com/docs/guides/vision

Source code in src/distilabel/typing/base.py

class ImageContent(TypedDict, total=False):
    """Type alias for the user's message in a conversation that can include text or an image.
    It's the standard type for vision language models:
    https://platform.openai.com/docs/guides/vision
    """

    type: Required[Literal["image_url"]]
    image_url: Required[ImageUrl]

`steps` ¶

`StepOutput = Iterator[List[Dict[str, Any]]]` `module-attribute` ¶

StepOutput is an alias of the typing Iterator[List[Dict[str, Any]]]

`GeneratorStepOutput = Iterator[Tuple[List[Dict[str, Any]], bool]]` `module-attribute` ¶

GeneratorStepOutput is an alias of the typing Iterator[Tuple[List[Dict[str, Any]], bool]]

`StepColumns = Union[List[str], Dict[str, bool]]` `module-attribute` ¶

StepColumns is an alias of the typing Union[List[str], Dict[str, bool]] used by the inputs and outputs properties of an Step. In the case of a List[str], it is a list with the required columns. In the case of a Dict[str, bool], it is a dictionary where the keys are the columns and the values are booleans indicating whether the column is required or not.

`models` ¶

`LLMLogprobs = List[List[List[Logprob]]]` `module-attribute` ¶

A type alias representing the probability distributions output by an LLM.

Structure

Outermost list: contains multiple generation choices when sampling (n sequences)
Middle list: represents each position in the generated sequence
Innermost list: contains the log probabilities for each token in the vocabulary at that position

`LLMStatistics = Union[TokenCount, Dict[str, Any]]` `module-attribute` ¶

Initially the LLMStatistics will contain the token count, but can have more variables. They can be added once we have them defined for every LLM.

`StructuredOutputType = Union[OutlinesStructuredOutputType, InstructorStructuredOutputType]` `module-attribute` ¶

StructuredOutputType is an alias for the union of OutlinesStructuredOutputType and InstructorStructuredOutputType.

`StandardInput = ChatType` `module-attribute` ¶

StandardInput is an alias for ChatType that defines the default / standard input produced by format_input.

`StructuredInput = Tuple[StandardInput, Union[StructuredOutputType, None]]` `module-attribute` ¶

StructuredInput defines a type produced by format_input when using either StructuredGeneration or a subclass of it.

`FormattedInput = Union[StandardInput, StructuredInput, str]` `module-attribute` ¶

FormattedInput is an alias for the union of StandardInput and StructuredInput as generated by format_input and expected by the LLMs, as well as ConversationType for the vision language models.

`OutlinesStructuredOutputType` ¶

Bases: TypedDict

TypedDict to represent the structured output configuration from outlines.

Source code in src/distilabel/typing/models.py

class OutlinesStructuredOutputType(TypedDict, total=False):
    """TypedDict to represent the structured output configuration from `outlines`."""

    format: Literal["json", "regex"]
    """One of "json" or "regex"."""
    schema: Union[str, Type[BaseModel], Dict[str, Any]]
    """The schema to use for the structured output. If "json", it
    can be a pydantic.BaseModel class, or the schema as a string,
    as obtained from `model_to_schema(BaseModel)`, if "regex", it
    should be a regex pattern as a string.
    """
    whitespace_pattern: Optional[Union[str, List[str]]]
    """If "json" corresponds to a string or a list of
    strings with a pattern (doesn't impact string literals).
    For example, to allow only a single space or newline with
    `whitespace_pattern=r"[\n ]?"`
    """

`format` `instance-attribute` ¶

One of "json" or "regex".

`schema` `instance-attribute` ¶

The schema to use for the structured output. If "json", it can be a pydantic.BaseModel class, or the schema as a string, as obtained from model_to_schema(BaseModel), if "regex", it should be a regex pattern as a string.

`whitespace_pattern` `instance-attribute` ¶

If "json" corresponds to a string or a list of strings with a pattern (doesn't impact string literals). For example, to allow only a single space or newline with whitespace_pattern=r"[ ]?"

`InstructorStructuredOutputType` ¶

Bases: TypedDict

TypedDict to represent the structured output configuration from instructor.

Source code in src/distilabel/typing/models.py

class InstructorStructuredOutputType(TypedDict, total=False):
    """TypedDict to represent the structured output configuration from `instructor`."""

    format: Optional[Literal["json"]]
    """One of "json"."""
    schema: Union[Type[BaseModel], Dict[str, Any]]
    """The schema to use for the structured output, a `pydantic.BaseModel` class. """
    mode: Optional[str]
    """Generation mode. Take a look at `instructor.Mode` for more information, if not informed it will
    be determined automatically. """
    max_retries: int
    """Number of times to reask the model in case of error, if not set will default to the model's default. """

`format` `instance-attribute` ¶

One of "json".

`schema` `instance-attribute` ¶

The schema to use for the structured output, a pydantic.BaseModel class.

`mode` `instance-attribute` ¶

Generation mode. Take a look at instructor.Mode for more information, if not informed it will be determined automatically.

`max_retries` `instance-attribute` ¶

Number of times to reask the model in case of error, if not set will default to the model's default.

`pipeline` ¶

`DownstreamConnectable = Union['Step', 'GlobalStep']` `module-attribute` ¶

Alias for the Step types that can be connected as downstream steps.

`UpstreamConnectableSteps = TypeVar('UpstreamConnectableSteps', bound=(Union['Step', 'GlobalStep', 'GeneratorStep']))` `module-attribute` ¶

Type for the Step types that can be connected as upstream steps.

`DownstreamConnectableSteps = TypeVar('DownstreamConnectableSteps', bound=DownstreamConnectable, covariant=True)` `module-attribute` ¶

Type for the Step types that can be connected as downstream steps.

`PipelineRuntimeParametersInfo = Dict[str, Union[List['RuntimeParameterInfo'], Dict[str, 'RuntimeParameterInfo']]]` `module-attribute` ¶

Alias for the information of the runtime parameters of a Pipeline.

`InputDataset = Union['Dataset', 'pd.DataFrame', List[Dict[str, str]]]` `module-attribute` ¶

Alias for the types we can process as input dataset.

`LoadGroups = Union[List[List[Any]], Literal['sequential_step_execution']]` `module-attribute` ¶

Alias for the types that can be used as load groups.

if List[List[Any]], it's a list containing lists of steps that have to be loaded in isolation.
if "sequential_step_execution", each step will be loaded in a different stage i.e. only one step will be executed at a time.

`StepLoadStatus` ¶

Bases: TypedDict

Dict containing information about if one step was loaded/unloaded or if it's load failed

Source code in src/distilabel/typing/pipeline.py

class StepLoadStatus(TypedDict):
    """Dict containing information about if one step was loaded/unloaded or if it's load
    failed"""

    name: str
    status: Literal["loaded", "unloaded", "load_failed"]

Types¶

base ¶

ChatType = List[ChatItem] module-attribute ¶

ImageUrl ¶

url instance-attribute ¶

ImageContent ¶

steps ¶

StepOutput = Iterator[List[Dict[str, Any]]] module-attribute ¶

GeneratorStepOutput = Iterator[Tuple[List[Dict[str, Any]], bool]] module-attribute ¶

StepColumns = Union[List[str], Dict[str, bool]] module-attribute ¶

models ¶

LLMLogprobs = List[List[List[Logprob]]] module-attribute ¶

LLMStatistics = Union[TokenCount, Dict[str, Any]] module-attribute ¶

StructuredOutputType = Union[OutlinesStructuredOutputType, InstructorStructuredOutputType] module-attribute ¶

StandardInput = ChatType module-attribute ¶

StructuredInput = Tuple[StandardInput, Union[StructuredOutputType, None]] module-attribute ¶

FormattedInput = Union[StandardInput, StructuredInput, str] module-attribute ¶

OutlinesStructuredOutputType ¶

format instance-attribute ¶

schema instance-attribute ¶

whitespace_pattern instance-attribute ¶

InstructorStructuredOutputType ¶

format instance-attribute ¶

schema instance-attribute ¶

mode instance-attribute ¶

max_retries instance-attribute ¶

pipeline ¶

DownstreamConnectable = Union['Step', 'GlobalStep'] module-attribute ¶

UpstreamConnectableSteps = TypeVar('UpstreamConnectableSteps', bound=(Union['Step', 'GlobalStep', 'GeneratorStep'])) module-attribute ¶

DownstreamConnectableSteps = TypeVar('DownstreamConnectableSteps', bound=DownstreamConnectable, covariant=True) module-attribute ¶

PipelineRuntimeParametersInfo = Dict[str, Union[List['RuntimeParameterInfo'], Dict[str, 'RuntimeParameterInfo']]] module-attribute ¶

InputDataset = Union['Dataset', 'pd.DataFrame', List[Dict[str, str]]] module-attribute ¶

LoadGroups = Union[List[List[Any]], Literal['sequential_step_execution']] module-attribute ¶

StepLoadStatus ¶

`base` ¶

`ChatType = List[ChatItem]` `module-attribute` ¶

`ImageUrl` ¶

`url` `instance-attribute` ¶

`ImageContent` ¶

`steps` ¶

`StepOutput = Iterator[List[Dict[str, Any]]]` `module-attribute` ¶

`GeneratorStepOutput = Iterator[Tuple[List[Dict[str, Any]], bool]]` `module-attribute` ¶

`StepColumns = Union[List[str], Dict[str, bool]]` `module-attribute` ¶

`models` ¶

`LLMLogprobs = List[List[List[Logprob]]]` `module-attribute` ¶

`LLMStatistics = Union[TokenCount, Dict[str, Any]]` `module-attribute` ¶

`StructuredOutputType = Union[OutlinesStructuredOutputType, InstructorStructuredOutputType]` `module-attribute` ¶

`StandardInput = ChatType` `module-attribute` ¶

`StructuredInput = Tuple[StandardInput, Union[StructuredOutputType, None]]` `module-attribute` ¶

`FormattedInput = Union[StandardInput, StructuredInput, str]` `module-attribute` ¶

`OutlinesStructuredOutputType` ¶

`format` `instance-attribute` ¶

`schema` `instance-attribute` ¶

`whitespace_pattern` `instance-attribute` ¶

`InstructorStructuredOutputType` ¶

`format` `instance-attribute` ¶

`schema` `instance-attribute` ¶

`mode` `instance-attribute` ¶

`max_retries` `instance-attribute` ¶

`pipeline` ¶

`DownstreamConnectable = Union['Step', 'GlobalStep']` `module-attribute` ¶

`UpstreamConnectableSteps = TypeVar('UpstreamConnectableSteps', bound=(Union['Step', 'GlobalStep', 'GeneratorStep']))` `module-attribute` ¶

`DownstreamConnectableSteps = TypeVar('DownstreamConnectableSteps', bound=DownstreamConnectable, covariant=True)` `module-attribute` ¶

`PipelineRuntimeParametersInfo = Dict[str, Union[List['RuntimeParameterInfo'], Dict[str, 'RuntimeParameterInfo']]]` `module-attribute` ¶

`InputDataset = Union['Dataset', 'pd.DataFrame', List[Dict[str, str]]]` `module-attribute` ¶

`LoadGroups = Union[List[List[Any]], Literal['sequential_step_execution']]` `module-attribute` ¶

`StepLoadStatus` ¶