Task¶
The `Task` is an implementation on top of `Step` that includes the `LLM` as a mandatory argument: the `Task` defines both the input and output formats via the `format_input` and `format_output` abstract methods, respectively, and calls the `LLM` to generate the text. We can see the `Task` as an `LLM`-powered `Step`.
Working with Tasks¶
The subclasses of `Task` are intended to be used within the scope of a `Pipeline`, which will orchestrate the different tasks defined; nonetheless, they can also be used standalone if needed.

For example, the most basic task is the `TextGeneration` task, which generates text based on a given instruction, and it can be used standalone as well as within a `Pipeline`.
```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
    name="text-generation",
    llm=OpenAILLM(model="gpt-4"),
)
task.load()

next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [{'instruction': "What's the capital of Spain?", 'generation': 'The capital of Spain is Madrid.', 'model_name': 'gpt-4'}]
```
Note
The `load` method ALWAYS needs to be called when using the tasks standalone; when the `Pipeline` context manager is used instead, there's no need to call it, since it will be called automatically on `Pipeline.run`. In any other case, the `load` method needs to be called from the parent class; e.g., a `Task` with an `LLM` will need to call `Task.load` to load both the task and the LLM.
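For reference, this is what the `Pipeline` path looks like, where `load` is triggered by `Pipeline.run` rather than called by hand. This is a minimal sketch, assuming the `LoadDataFromDicts` step and the `>>` connector are available in the installed distilabel version:

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-pipeline") as pipeline:
    # Feed the task from a couple of in-memory dictionaries
    load_data = LoadDataFromDicts(
        name="load_data",
        data=[{"instruction": "What's the capital of Spain?"}],
    )
    task = TextGeneration(
        name="text-generation",
        llm=OpenAILLM(model="gpt-4"),
    )
    load_data >> task

# `Pipeline.run` calls `load` on every step, including the task's LLM
distiset = pipeline.run()
```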
Defining custom Tasks¶
In order to define custom tasks, we need to inherit from the `Task` class and implement the `format_input` and `format_output` methods, as well as set the `inputs` and `outputs` properties, as for the `Step` subclasses.

So, the following will need to be defined:
- `inputs`: a property that returns a list of strings with the names of the required input fields.

- `format_input`: a method that receives a dictionary with the input data and returns a `ChatType`, which is basically a list of dictionaries with the input data formatted for the `LLM` following the OpenAI chat-completion format. Each dictionary in the list represents a turn in the conversation and must contain the keys `role` and `content`; this format is used because it's the most standard one, and the `LLM` subclasses will then format it according to the LLM actually used.

- `outputs`: a property that returns a list of strings with the names of the output fields. Note that since all the `Task` subclasses are designed to work with a single `LLM`, this property should always include `model_name` as one of the outputs, since that's automatically injected from the LLM.

- `format_output`: a method that receives the output from the `LLM` and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed, i.e. with the values for the columns in `outputs`. Note that there's no need to include the `model_name` in the output, since that's automatically injected from the LLM in the `process` method of the `Task`.
Once those methods have been implemented, the task can be used as any other task, and it will be able to generate text based on the input data.
```python
from typing import Any, Dict, List, Union

from distilabel.steps.tasks.base import Task
from distilabel.steps.tasks.typing import ChatType


class MyCustomTask(Task):
    @property
    def inputs(self) -> List[str]:
        return ["input_field"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        # A single user turn in the OpenAI chat-completion format
        return [
            {
                "role": "user",
                "content": input["input_field"],
            },
        ]

    @property
    def outputs(self) -> List[str]:
        # `model_name` is injected automatically by the `process` method
        return ["output_field", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        return {"output_field": output}
```