Skip to content

Pipeline

This section contains the API reference for the distilabel pipelines. For an example on how to use the pipelines, see the Tutorial - Pipeline.

BasePipeline

Bases: ABC, RequirementsMixin, _Serializable

Base class for a distilabel pipeline.

Attributes:

Name Type Description
name

The name of the pipeline.

description

A description of the pipeline.

dag

The DAG instance that represents the pipeline.

_cache_dir

The directory where the pipeline will be cached.

_logger

The logger instance that will be used by the pipeline.

_batch_manager Optional[_BatchManager]

The batch manager that will manage the batches received from the steps while running the pipeline. It will be created when the pipeline is run, from scratch or from cache. Defaults to None.

_write_buffer Optional[_WriteBuffer]

The buffer that will store the data of the leaf steps of the pipeline while running, so the Distiset can be created at the end. It will be created when the pipeline is run. Defaults to None.

_fs Optional[AbstractFileSystem]

The fsspec filesystem to be used to store the data of the _Batches passed between the steps. It will be set when the pipeline is run. Defaults to None.

_storage_base_path Optional[str]

The base path where the data of the _Batches passed between the steps will be stored. It will be set then the pipeline is run. Defaults to None.

_use_fs_to_pass_data bool

Whether to use the file system to pass the data of the _Batches between the steps. Even if this parameter is False, the Batches received by GlobalSteps will always use the file system to pass the data. Defaults to False.

_dry_run

A flag to indicate if the pipeline is running in dry run mode. Defaults to False.

output_queue

A queue to store the output of the steps while running the pipeline.

load_queue

A queue used by each Step to notify the main process it has finished loading or it the step has been unloaded.

Source code in src/distilabel/pipeline/base.py
 132
 133
 134
 135
 136
 137
 138
 139
 140
 141
 142
 143
 144
 145
 146
 147
 148
 149
 150
 151
 152
 153
 154
 155
 156
 157
 158
 159
 160
 161
 162
 163
 164
 165
 166
 167
 168
 169
 170
 171
 172
 173
 174
 175
 176
 177
 178
 179
 180
 181
 182
 183
 184
 185
 186
 187
 188
 189
 190
 191
 192
 193
 194
 195
 196
 197
 198
 199
 200
 201
 202
 203
 204
 205
 206
 207
 208
 209
 210
 211
 212
 213
 214
 215
 216
 217
 218
 219
 220
 221
 222
 223
 224
 225
 226
 227
 228
 229
 230
 231
 232
 233
 234
 235
 236
 237
 238
 239
 240
 241
 242
 243
 244
 245
 246
 247
 248
 249
 250
 251
 252
 253
 254
 255
 256
 257
 258
 259
 260
 261
 262
 263
 264
 265
 266
 267
 268
 269
 270
 271
 272
 273
 274
 275
 276
 277
 278
 279
 280
 281
 282
 283
 284
 285
 286
 287
 288
 289
 290
 291
 292
 293
 294
 295
 296
 297
 298
 299
 300
 301
 302
 303
 304
 305
 306
 307
 308
 309
 310
 311
 312
 313
 314
 315
 316
 317
 318
 319
 320
 321
 322
 323
 324
 325
 326
 327
 328
 329
 330
 331
 332
 333
 334
 335
 336
 337
 338
 339
 340
 341
 342
 343
 344
 345
 346
 347
 348
 349
 350
 351
 352
 353
 354
 355
 356
 357
 358
 359
 360
 361
 362
 363
 364
 365
 366
 367
 368
 369
 370
 371
 372
 373
 374
 375
 376
 377
 378
 379
 380
 381
 382
 383
 384
 385
 386
 387
 388
 389
 390
 391
 392
 393
 394
 395
 396
 397
 398
 399
 400
 401
 402
 403
 404
 405
 406
 407
 408
 409
 410
 411
 412
 413
 414
 415
 416
 417
 418
 419
 420
 421
 422
 423
 424
 425
 426
 427
 428
 429
 430
 431
 432
 433
 434
 435
 436
 437
 438
 439
 440
 441
 442
 443
 444
 445
 446
 447
 448
 449
 450
 451
 452
 453
 454
 455
 456
 457
 458
 459
 460
 461
 462
 463
 464
 465
 466
 467
 468
 469
 470
 471
 472
 473
 474
 475
 476
 477
 478
 479
 480
 481
 482
 483
 484
 485
 486
 487
 488
 489
 490
 491
 492
 493
 494
 495
 496
 497
 498
 499
 500
 501
 502
 503
 504
 505
 506
 507
 508
 509
 510
 511
 512
 513
 514
 515
 516
 517
 518
 519
 520
 521
 522
 523
 524
 525
 526
 527
 528
 529
 530
 531
 532
 533
 534
 535
 536
 537
 538
 539
 540
 541
 542
 543
 544
 545
 546
 547
 548
 549
 550
 551
 552
 553
 554
 555
 556
 557
 558
 559
 560
 561
 562
 563
 564
 565
 566
 567
 568
 569
 570
 571
 572
 573
 574
 575
 576
 577
 578
 579
 580
 581
 582
 583
 584
 585
 586
 587
 588
 589
 590
 591
 592
 593
 594
 595
 596
 597
 598
 599
 600
 601
 602
 603
 604
 605
 606
 607
 608
 609
 610
 611
 612
 613
 614
 615
 616
 617
 618
 619
 620
 621
 622
 623
 624
 625
 626
 627
 628
 629
 630
 631
 632
 633
 634
 635
 636
 637
 638
 639
 640
 641
 642
 643
 644
 645
 646
 647
 648
 649
 650
 651
 652
 653
 654
 655
 656
 657
 658
 659
 660
 661
 662
 663
 664
 665
 666
 667
 668
 669
 670
 671
 672
 673
 674
 675
 676
 677
 678
 679
 680
 681
 682
 683
 684
 685
 686
 687
 688
 689
 690
 691
 692
 693
 694
 695
 696
 697
 698
 699
 700
 701
 702
 703
 704
 705
 706
 707
 708
 709
 710
 711
 712
 713
 714
 715
 716
 717
 718
 719
 720
 721
 722
 723
 724
 725
 726
 727
 728
 729
 730
 731
 732
 733
 734
 735
 736
 737
 738
 739
 740
 741
 742
 743
 744
 745
 746
 747
 748
 749
 750
 751
 752
 753
 754
 755
 756
 757
 758
 759
 760
 761
 762
 763
 764
 765
 766
 767
 768
 769
 770
 771
 772
 773
 774
 775
 776
 777
 778
 779
 780
 781
 782
 783
 784
 785
 786
 787
 788
 789
 790
 791
 792
 793
 794
 795
 796
 797
 798
 799
 800
 801
 802
 803
 804
 805
 806
 807
 808
 809
 810
 811
 812
 813
 814
 815
 816
 817
 818
 819
 820
 821
 822
 823
 824
 825
 826
 827
 828
 829
 830
 831
 832
 833
 834
 835
 836
 837
 838
 839
 840
 841
 842
 843
 844
 845
 846
 847
 848
 849
 850
 851
 852
 853
 854
 855
 856
 857
 858
 859
 860
 861
 862
 863
 864
 865
 866
 867
 868
 869
 870
 871
 872
 873
 874
 875
 876
 877
 878
 879
 880
 881
 882
 883
 884
 885
 886
 887
 888
 889
 890
 891
 892
 893
 894
 895
 896
 897
 898
 899
 900
 901
 902
 903
 904
 905
 906
 907
 908
 909
 910
 911
 912
 913
 914
 915
 916
 917
 918
 919
 920
 921
 922
 923
 924
 925
 926
 927
 928
 929
 930
 931
 932
 933
 934
 935
 936
 937
 938
 939
 940
 941
 942
 943
 944
 945
 946
 947
 948
 949
 950
 951
 952
 953
 954
 955
 956
 957
 958
 959
 960
 961
 962
 963
 964
 965
 966
 967
 968
 969
 970
 971
 972
 973
 974
 975
 976
 977
 978
 979
 980
 981
 982
 983
 984
 985
 986
 987
 988
 989
 990
 991
 992
 993
 994
 995
 996
 997
 998
 999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
class BasePipeline(ABC, RequirementsMixin, _Serializable):
    """Base class for a `distilabel` pipeline.

    Attributes:
        name: The name of the pipeline.
        description: A description of the pipeline.
        dag: The `DAG` instance that represents the pipeline.
        _cache_dir: The directory where the pipeline will be cached.
        _logger: The logger instance that will be used by the pipeline.
        _batch_manager: The batch manager that will manage the batches received from the
            steps while running the pipeline. It will be created when the pipeline is run,
            from scratch or from cache. Defaults to `None`.
        _write_buffer: The buffer that will store the data of the leaf steps of the pipeline
            while running, so the `Distiset` can be created at the end. It will be created
            when the pipeline is run. Defaults to `None`.
        _fs: The `fsspec` filesystem to be used to store the data of the `_Batch`es passed
            between the steps. It will be set when the pipeline is run. Defaults to `None`.
        _storage_base_path: The base path where the data of the `_Batch`es passed between
            the steps will be stored. It will be set then the pipeline is run. Defaults
            to `None`.
        _use_fs_to_pass_data: Whether to use the file system to pass the data of the
            `_Batch`es between the steps. Even if this parameter is `False`, the `Batch`es
            received by `GlobalStep`s will always use the file system to pass the data.
            Defaults to `False`.
        _dry_run: A flag to indicate if the pipeline is running in dry run mode. Defaults
            to `False`.
        output_queue: A queue to store the output of the steps while running the pipeline.
        load_queue: A queue used by each `Step` to notify the main process it has finished
            loading or it the step has been unloaded.
    """

    _output_queue: "Queue[Any]"
    _load_queue: "Queue[Union[StepLoadStatus, None]]"

    def __init__(
        self,
        name: Optional[str] = None,
        description: Optional[str] = None,
        cache_dir: Optional[Union[str, "PathLike"]] = None,
        enable_metadata: bool = False,
        requirements: Optional[List[str]] = None,
    ) -> None:
        """Initialize the `BasePipeline` instance.

        Args:
            name: The name of the pipeline. If not generated, a random one will be generated by default.
            description: A description of the pipeline. Defaults to `None`.
            cache_dir: A directory where the pipeline will be cached. Defaults to `None`.
            enable_metadata: Whether to include the distilabel metadata column for the pipeline
                in the final `Distiset`. It contains metadata used by distilabel, for example
                the raw outputs of the `LLM` without processing would be here, inside `raw_output_...`
                field. Defaults to `False`.
            requirements: List of requirements that must be installed to run the pipeline.
                Defaults to `None`, but can be helpful to inform in a pipeline to be shared
                that this requirements must be installed.
        """
        self.name = name or f"pipeline_{str(uuid.uuid4())[:8]}"
        self.description = description
        self._enable_metadata = enable_metadata
        self.dag = DAG()

        if cache_dir:
            self._cache_dir = Path(cache_dir)
        elif env_cache_dir := os.getenv("DISTILABEL_CACHE_DIR"):
            self._cache_dir = Path(env_cache_dir)
        else:
            self._cache_dir = BASE_CACHE_DIR

        self._logger = logging.getLogger("distilabel.pipeline")

        self._batch_manager: Optional["_BatchManager"] = None
        self._write_buffer: Optional["_WriteBuffer"] = None

        self._steps_load_status: Dict[str, int] = {}
        self._steps_load_status_lock = threading.Lock()

        self._stop_called = False
        self._stop_called_lock = threading.Lock()
        self._stop_calls = 0

        self._fs: Optional[fsspec.AbstractFileSystem] = None
        self._storage_base_path: Optional[str] = None
        self._use_fs_to_pass_data: bool = False
        self._dry_run = False

        self._current_stage = 0
        self._stages_last_batch: List[List[str]] = []

        self.requirements = requirements or []

        self._exception: Union[Exception, None] = None

        self._log_queue: Union["Queue[Any]", None] = None

    def __enter__(self) -> Self:
        """Set the global pipeline instance when entering a pipeline context."""
        _GlobalPipelineManager.set_pipeline(self)
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        """Unset the global pipeline instance when exiting a pipeline context."""
        _GlobalPipelineManager.set_pipeline(None)

    def _create_signature(self) -> str:
        """Makes a signature (hash) of a pipeline, using the step ids and the adjacency between them.

        The main use is to find the pipeline in the cache folder.

        Returns:
            int: Signature of the pipeline.
        """
        hasher = hashlib.sha1()

        steps_info = []
        pipeline_dump = self.dump()["pipeline"]

        for step in pipeline_dump["steps"]:
            step_info = step["name"]
            for argument, value in sorted(step[STEP_ATTR_NAME].items()):
                if (argument == TYPE_INFO_KEY) or (value is None):
                    continue

                if isinstance(value, dict):
                    # input_mappings/output_mappings
                    step_info += "-".join(
                        [
                            f"{str(k)}={str(v)}"
                            for k, v in value.items()
                            if k not in ("disable_cuda_device_placement",)
                        ]
                    )
                elif isinstance(value, (list, tuple)):
                    # runtime_parameters_info
                    step_info += "-".join([str(v) for v in value])
                elif isinstance(value, (int, str, float, bool)):
                    if argument != "disable_cuda_device_placement":
                        # batch_size/name
                        step_info += str(value)
                else:
                    raise ValueError(
                        f"Field '{argument}' in step '{step['name']}' has type {type(value)}, explicitly cast the type to 'str'."
                    )

            steps_info.append(step_info)

        connections_info = [
            f"{c['from']}-{'-'.join(c['to'])}" for c in pipeline_dump["connections"]
        ]

        routing_batch_functions_info = []
        for function in pipeline_dump["routing_batch_functions"]:
            step = function["step"]
            routing_batch_function: "RoutingBatchFunction" = self.dag.get_step(step)[
                ROUTING_BATCH_FUNCTION_ATTR_NAME
            ]
            if type_info := routing_batch_function._get_type_info():
                step += f"-{type_info}"

        hasher.update(
            ",".join(
                steps_info + connections_info + routing_batch_functions_info
            ).encode()
        )

        return hasher.hexdigest()

    def run(
        self,
        parameters: Optional[Dict[str, Dict[str, Any]]] = None,
        use_cache: bool = True,
        storage_parameters: Optional[Dict[str, Any]] = None,
        use_fs_to_pass_data: bool = False,
        dataset: Optional["InputDataset"] = None,
    ) -> "Distiset":  # type: ignore
        """Run the pipeline. It will set the runtime parameters for the steps and validate
        the pipeline.

        This method should be extended by the specific pipeline implementation,
        adding the logic to run the pipeline.

        Args:
            parameters: A dictionary with the step name as the key and a dictionary with
                the runtime parameters for the step as the value. Defaults to `None`.
            use_cache: Whether to use the cache from previous pipeline runs. Defaults to
                `True`.
            storage_parameters: A dictionary with the storage parameters (`fsspec` and path)
                that will be used to store the data of the `_Batch`es passed between the
                steps if `use_fs_to_pass_data` is `True` (for the batches received by a
                `GlobalStep` it will be always used). It must have at least the "path" key,
                and it can contain additional keys depending on the protocol. By default,
                it will use the local file system and a directory in the cache directory.
                Defaults to `None`.
            use_fs_to_pass_data: Whether to use the file system to pass the data of
                the `_Batch`es between the steps. Even if this parameter is `False`, the
                `Batch`es received by `GlobalStep`s will always use the file system to
                pass the data. Defaults to `False`.
            dataset: If given, it will be used to create a `GeneratorStep` and put it as the
                root step. Convenient method when you have already processed the dataset in
                your script and just want to pass it already processed. Defaults to `None`.

        Returns:
            The `Distiset` created by the pipeline.
        """

        self._exception: Union[Exception, None] = None

        # Set the runtime parameters that will be used during the pipeline execution.
        # They are used to generate the signature of the pipeline that is used to hit the
        # cache when the pipeline is run, so it's important to do it first.
        self._set_runtime_parameters(parameters or {})

        setup_logging(
            log_queue=self._log_queue, filename=str(self._cache_location["log_file"])
        )

        if dataset is not None:
            self._add_dataset_generator_step(dataset)

        # Validate the pipeline DAG to check that all the steps are chainable, there are
        # no missing runtime parameters, batch sizes are correct, etc.
        self.dag.validate()

        # Set the initial load status for all the steps
        self._init_steps_load_status()

        # Load the stages status or initialize it
        self._load_stages_status(use_cache)

        # Load the `_BatchManager` from cache or create one from scratch
        self._load_batch_manager(use_cache)

        if to_install := self.requirements_to_install():
            # Print the list of requirements like they would appear in a requirements.txt
            to_install_list = "\n" + "\n".join(to_install)
            msg = f"Please install the following requirements to run the pipeline: {to_install_list}"
            self._logger.error(msg)
            raise ModuleNotFoundError(msg)

        # Setup the filesystem that will be used to pass the data of the `_Batch`es
        self._setup_fsspec(storage_parameters)
        self._use_fs_to_pass_data = use_fs_to_pass_data

        if self._dry_run:
            self._logger.info("🌵 Dry run mode")

        # If the batch manager is not able to generate batches, that means that the loaded
        # `_BatchManager` from cache didn't have any remaining batches to process i.e.
        # the previous pipeline execution was completed successfully.
        if not self._batch_manager.can_generate():  # type: ignore
            self._logger.info(
                "💾 Loaded batch manager from cache doesn't contain any remaining data."
                " Returning `Distiset` from cache data..."
            )
            distiset = create_distiset(
                self._cache_location["data"],
                pipeline_path=self._cache_location["pipeline"],
                log_filename_path=self._cache_location["log_file"],
                enable_metadata=self._enable_metadata,
                dag=self.dag,
            )
            stop_logging()
            return distiset

        self._setup_write_buffer()

        self._print_load_stages_info()

    def dry_run(
        self,
        parameters: Optional[Dict[str, Dict[str, Any]]] = None,
        batch_size: int = 1,
    ) -> "Distiset":
        """Do a dry run to test the pipeline runs as expected.

        Running a `Pipeline` in dry run mode will set all the `batch_size` of generator steps
        to the specified `batch_size`, and run just with a single batch, effectively
        running the whole pipeline with a single example. The cache will be set to `False`.

        Args:
            parameters: A dictionary with the step name as the key and a dictionary with
                the runtime parameters for the step as the value. Defaults to `None`.
            batch_size: The batch size of the unique batch generated by the generators
                steps of the pipeline. Defaults to `1`.

        Returns:
            Will return the `Distiset` as the main run method would do.
        """
        self._dry_run = True

        for step_name in self.dag:
            step = self.dag.get_step(step_name)[STEP_ATTR_NAME]

            if step.is_generator:
                if not parameters:
                    parameters = {}
                parameters[step_name] = {"batch_size": batch_size}

        distiset = self.run(parameters=parameters, use_cache=False)

        self._dry_run = False
        return distiset

    def _add_dataset_generator_step(self, dataset: "InputDataset") -> None:
        """Create a root step to work as the `GeneratorStep` for the pipeline using a
        dataset.

        Args:
            dataset: A dataset that will be used to create a `GeneratorStep` and
                placed in the DAG as the root step.

        Raises:
            ValueError: If there's already a `GeneratorStep` in the pipeline.
        """
        for step_name in self.dag:
            step = self.dag.get_step(step_name)[STEP_ATTR_NAME]
            if isinstance(step_name, GeneratorStep):
                raise ValueError(
                    "There is already a `GeneratorStep` in the pipeline, you can either pass a `dataset` to the "
                    f"run method, or create a `GeneratorStep` explictly. `GeneratorStep`: {step}"
                )
        loader = make_generator_step(dataset)
        self.dag.add_root_step(loader)

    def get_runtime_parameters_info(self) -> "PipelineRuntimeParametersInfo":
        """Get the runtime parameters for the steps in the pipeline.

        Returns:
            A dictionary with the step name as the key and a list of dictionaries with
            the parameter name and the parameter info as the value.
        """
        runtime_parameters = {}
        for step_name in self.dag:
            step: "_Step" = self.dag.get_step(step_name)[STEP_ATTR_NAME]
            runtime_parameters[step_name] = step.get_runtime_parameters_info()
        return runtime_parameters

    def _init_steps_load_status(self) -> None:
        """Initialize the `_steps_load_status` dictionary assigning 0 to every step of
        the pipeline."""
        for step_name in self.dag:
            self._steps_load_status[step_name] = _STEP_NOT_LOADED_CODE

    def _setup_fsspec(
        self, storage_parameters: Optional[Dict[str, Any]] = None
    ) -> None:
        """Setups the `fsspec` filesystem to be used to store the data of the `_Batch`es
        passed between the steps.

        Args:
            storage_parameters: A dictionary with the storage parameters (`fsspec` and path)
                that will be used to store the data of the `_Batch`es passed between the
                steps if `use_fs_to_pass_data` is `True` (for the batches received by a
                `GlobalStep` it will be always used). It must have at least the "path" key,
                and it can contain additional keys depending on the protocol. By default,
                it will use the local file system and a directory in the cache directory.
                Defaults to `None`.
        """
        if not storage_parameters:
            self._fs = fsspec.filesystem("file")
            self._storage_base_path = (
                f"file://{self._cache_location['batch_input_data']}"
            )
            return

        if "path" not in storage_parameters:
            raise ValueError(
                "The 'path' key must be present in the `storage_parameters` dictionary"
                " if it's not `None`."
            )

        path = storage_parameters.pop("path")
        protocol = UPath(path).protocol

        self._fs = fsspec.filesystem(protocol, **storage_parameters)
        self._storage_base_path = path

    def _add_step(self, step: "_Step") -> None:
        """Add a step to the pipeline.

        Args:
            step: The step to be added to the pipeline.
        """
        self.dag.add_step(step)

    def _add_edge(self, from_step: str, to_step: str) -> None:
        """Add an edge between two steps in the pipeline.

        Args:
            from_step: The name of the step that will generate the input for `to_step`.
            to_step: The name of the step that will receive the input from `from_step`.
        """
        self.dag.add_edge(from_step, to_step)

        # Check if `from_step` has a `routing_batch_function`. If it does, then mark
        # `to_step` as a step that will receive a routed batch.
        node = self.dag.get_step(from_step)  # type: ignore
        routing_batch_function = node.get(ROUTING_BATCH_FUNCTION_ATTR_NAME, None)
        self.dag.set_step_attr(
            name=to_step,
            attr=RECEIVES_ROUTED_BATCHES_ATTR_NAME,
            value=routing_batch_function is not None,
        )

    def _is_convergence_step(self, step_name: str) -> None:
        """Checks if a step is a convergence step.

        Args:
            step_name: The name of the step.
        """
        return self.dag.get_step(step_name).get(CONVERGENCE_STEP_ATTR_NAME)

    def _add_routing_batch_function(
        self, step_name: str, routing_batch_function: "RoutingBatchFunction"
    ) -> None:
        """Add a routing batch function to a step.

        Args:
            step_name: The name of the step that will receive the routed batch.
            routing_batch_function: The function that will route the batch to the step.
        """
        self.dag.set_step_attr(
            name=step_name,
            attr=ROUTING_BATCH_FUNCTION_ATTR_NAME,
            value=routing_batch_function,
        )

    def _set_runtime_parameters(self, parameters: Dict[str, Dict[str, Any]]) -> None:
        """Set the runtime parameters for the steps in the pipeline.

        Args:
            parameters: A dictionary with the step name as the key and a dictionary with
            the parameter name as the key and the parameter value as the value.
        """
        step_names = set(self.dag.G)
        for step_name, step_parameters in parameters.items():
            if step_name not in step_names:
                self._logger.warning(
                    f"❓ Step '{step_name}' provided in `Pipeline.run(parameters={{...}})` not found in the pipeline."
                    f" Available steps are: {step_names}."
                )
            else:
                step: "_Step" = self.dag.get_step(step_name)[STEP_ATTR_NAME]
                step.set_runtime_parameters(step_parameters)

    def _model_dump(self, obj: Any, **kwargs: Any) -> Dict[str, Any]:
        """Dumps the DAG content to a dict.

        Args:
            obj (Any): Unused, just kept to match the signature of the parent method.
            kwargs (Any): Unused, just kept to match the signature of the parent method.

        Returns:
            Dict[str, Any]: Internal representation of the DAG from networkx in a serializable format.
        """
        return self.dag.dump()

    def dump(self, **kwargs: Any) -> Dict[str, Any]:
        return {
            "distilabel": {"version": __version__},
            "pipeline": {
                "name": self.name,
                "description": self.description,
                **super().dump(),
            },
            "requirements": self.requirements,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> Self:
        """Create a Pipeline from a dict containing the serialized data.

        Note:
            It's intended for internal use.

        Args:
            data (Dict[str, Any]): Dictionary containing the serialized data from a Pipeline.

        Returns:
            BasePipeline: Pipeline recreated from the dictionary info.
        """
        name = data["pipeline"]["name"]
        description = data["pipeline"].get("description")
        requirements = data.get("requirements", [])
        with cls(name=name, description=description, requirements=requirements) as pipe:
            pipe.dag = DAG.from_dict(data["pipeline"])
        return pipe

    @property
    def _cache_location(self) -> "_CacheLocation":
        """Dictionary containing the object that will stored and the location,
        whether it is a filename or a folder.

        Returns:
            Path: Filenames where the pipeline content will be serialized.
        """
        folder = self._cache_dir / self.name / self._create_signature()
        return {
            "pipeline": folder / "pipeline.yaml",
            "batch_manager": folder / "batch_manager.json",
            "data": folder / "data",
            "batch_input_data": folder / "batch_input_data",
            "log_file": folder / "pipeline.log",
            "stages_file": folder / "stages.json",
        }

    def _cache(self) -> None:
        """Saves the `BasePipeline` using the `_cache_filename`."""
        if self._dry_run:
            return

        self.save(
            path=self._cache_location["pipeline"],
            format=self._cache_location["pipeline"].suffix.replace(".", ""),  # type: ignore
        )

        if self._batch_manager is not None:
            self._batch_manager.cache(self._cache_location["batch_manager"])

        self._save_stages_status()

        self._logger.debug("Pipeline and batch manager saved to cache.")

    def _save_stages_status(self) -> None:
        """Saves the stages status to cache."""
        self.save(
            path=self._cache_location["stages_file"],
            format="json",
            dump={
                "current_stage": self._current_stage,
                "stages_last_batch": self._stages_last_batch,
            },
        )

    def _load_stages_status(self, use_cache: bool = True) -> None:
        """Try to load the stages status from cache, or initialize it if cache file doesn't
        exist or cache is not going to be used."""
        if use_cache and self._cache_location["stages_file"].exists():
            stages_status = read_json(self._cache_location["stages_file"])
            self._current_stage = stages_status["current_stage"]
            self._stages_last_batch = stages_status["stages_last_batch"]
        else:
            self._current_stage = 0
            self._stages_last_batch = [
                [] for _ in range(len(self.dag.get_steps_load_stages()[0]))
            ]

    def _load_batch_manager(self, use_cache: bool = True) -> None:
        """Will try to load the `_BatchManager` from the cache dir if found. Otherwise,
        it will create one from scratch.
        """
        batch_manager_cache_loc = self._cache_location["batch_manager"]
        if use_cache and batch_manager_cache_loc.exists():
            self._logger.info(
                f"💾 Loading `_BatchManager` from cache: '{batch_manager_cache_loc}'"
            )
            self._batch_manager = _BatchManager.load_from_cache(batch_manager_cache_loc)
        else:
            self._batch_manager = _BatchManager.from_dag(self.dag)

    def _setup_write_buffer(self) -> None:
        """Setups the `_WriteBuffer` that will store the data of the leaf steps of the
        pipeline while running, so the `Distiset` can be created at the end.
        """
        buffer_data_path = self._cache_location["data"]
        self._logger.info(f"📝 Pipeline data will be written to '{buffer_data_path}'")
        self._write_buffer = _WriteBuffer(buffer_data_path, self.dag.leaf_steps)

    def _print_load_stages_info(self) -> None:
        """Prints the information about the load stages."""
        stages, _ = self.dag.get_steps_load_stages()
        msg = ""
        for stage, steps in enumerate(stages):
            msg += f"\n * Stage {stage}: {steps}"
        self._logger.info(
            f"⌛ The steps of the pipeline will be loaded in stages:{msg}"
        )

    def _run_output_queue_loop_in_thread(self) -> threading.Thread:
        """Runs the output queue loop in a separate thread to receive the output batches
        from the steps. This is done to avoid the signal handler to block the loop, which
        would prevent the pipeline from stopping correctly."""
        thread = threading.Thread(target=self._output_queue_loop)
        thread.start()
        return thread

    def _output_queue_loop(self) -> None:
        """Loop to receive the output batches from the steps and manage the flow of the
        batches through the pipeline."""
        if not self._initialize_pipeline_execution():
            return

        while self._should_continue_processing():  # type: ignore
            self._logger.debug("Waiting for output batch from step...")
            if (batch := self._output_queue.get()) is None:
                self._logger.debug("Received `None` from output queue. Breaking loop.")
                break

            self._logger.debug(
                f"Received batch with seq_no {batch.seq_no} from step '{batch.step_name}'"
                f" from output queue: {batch}"
            )

            self._process_batch(batch)

            # If `_stop_called` was set to `True` while waiting for the output queue, then
            # we need to handle the stop of the pipeline and break the loop to avoid
            # propagating the batches through the pipeline and making the stop process
            # slower.
            if self._stop_called:
                self._handle_batch_on_stop(batch)
                break

            # If there is another load stage and all the `last_batch`es from the stage
            # have been received, then load the next stage.
            if self._should_load_next_stage():
                if not self._update_stage():
                    break

            self._manage_batch_flow(batch)

        self._finalize_pipeline_execution()

    def _initialize_pipeline_execution(self) -> bool:
        """Load the steps of the required stage to initialize the pipeline execution,
        and requests the initial batches to trigger the batch flowing in the pipeline.

        Returns:
            `True` if initialization went OK, `False` otherwise.
        """
        # Wait for all the steps to be loaded correctly
        if not self._run_stage_steps_and_wait(stage=self._current_stage):
            self._set_steps_not_loaded_exception()
            return False

        # Send the "first" batches to the steps so the batches starts flowing through
        # the input queues and output queue
        self._request_initial_batches()

        return True

    def _should_continue_processing(self) -> bool:
        """Condition for the consume batches from the `output_queue` loop.

        Returns:
            `True` if should continue consuming batches, `False` otherwise and the pipeline
            should stop.
        """
        return self._batch_manager.can_generate() and not self._stop_called  # type: ignore

    def _process_batch(self, batch: "_Batch") -> None:
        """Process a batch consumed from the `output_queue`.

        Args:
            batch: the batch to be processed.
        """
        if batch.data_path:
            self._logger.debug(
                f"Reading {batch.seq_no} batch data from '{batch.step_name}': '{batch.data_path}'"
            )
            batch.read_batch_data_from_fs()

        if batch.step_name in self.dag.leaf_steps:
            self._write_buffer.add_batch(batch)  # type: ignore

        if batch.last_batch:
            _, stages_last_steps = self.dag.get_steps_load_stages()
            stage_last_steps = stages_last_steps[self._current_stage]
            if batch.step_name in stage_last_steps:
                self._stages_last_batch[self._current_stage].append(batch.step_name)
                self._stages_last_batch[self._current_stage].sort()

            # Make sure to send the `LAST_BATCH_SENT_FLAG` to the predecessors of the step
            # if the batch is the last one, so they stop their processing loop even if they
            # haven't received the last batch because of the routing function.
            for step_name in self.dag.get_step_predecessors(batch.step_name):
                if self._is_step_running(step_name):
                    self._send_last_batch_flag_to_step(step_name)

    def _update_stage(self) -> bool:
        """Checks if the steps of next stage should be loaded and updates `_current_stage`
        attribute.

        Returns:
            `True` if updating the stage went OK, `False` otherwise.
        """
        self._current_stage += 1
        if not self._run_stage_steps_and_wait(stage=self._current_stage):
            self._set_steps_not_loaded_exception()
            return False

        return True

    def _should_load_next_stage(self) -> bool:
        """Returns if the next stage should be loaded.

        Returns:
            `True` if the next stage should be loaded, `False` otherwise.
        """
        _, stage_last_steps = self.dag.get_steps_load_stages()
        there_is_next_stage = self._current_stage + 1 < len(stage_last_steps)
        stage_last_batches_received = (
            self._stages_last_batch[self._current_stage]
            == stage_last_steps[self._current_stage]
        )
        return there_is_next_stage and stage_last_batches_received

    def _finalize_pipeline_execution(self) -> None:
        """Finalizes the pipeline execution handling the prematurely stop of the pipeline
        if required, caching the data and ensuring that all the steps finish its execution."""
        if self._stop_called:
            self._handle_stop()

        self._cache()

        # Send `None` to steps `input_queue`s just in case some step is still waiting
        self._notify_steps_to_stop()

        # Reset flag state
        self._stop_called = False

    def _run_load_queue_loop_in_thread(self) -> threading.Thread:
        """Runs a background thread that reads from the `load_queue` to update the status
        of the number of replicas loaded for each step.

        Returns:
            The thread that was started.
        """
        thread = threading.Thread(target=self._run_load_queue_loop)
        thread.start()
        return thread

    def _run_load_queue_loop(self) -> None:
        """Runs a loop that reads from the `load_queue` to update the status of the number
        of replicas loaded for each step."""

        while True:
            if (load_info := self._load_queue.get()) is None:
                self._logger.debug("Received `None` from load queue. Breaking loop.")
                break

            with self._steps_load_status_lock:
                step_name, status = load_info["name"], load_info["status"]
                if status == "loaded":
                    if self._steps_load_status[step_name] == _STEP_NOT_LOADED_CODE:
                        self._steps_load_status[step_name] = 1
                    else:
                        self._steps_load_status[step_name] += 1
                elif status == "unloaded":
                    self._steps_load_status[step_name] -= 1
                else:
                    # load failed
                    self._steps_load_status[step_name] = _STEP_LOAD_FAILED_CODE

                self._logger.debug(
                    f"Step '{step_name}' loaded replicas: {self._steps_load_status[step_name]}"
                )

    def _is_step_running(self, step_name: str) -> bool:
        """Checks if the step is running (at least one replica is running).

        Args:
            step_name: The step to be check if running.

        Returns:
            `True` if the step is running, `False` otherwise.
        """
        with self._steps_load_status_lock:
            return self._steps_load_status[step_name] >= 1

    def _run_stage_steps_and_wait(self, stage: int) -> bool:
        """Runs the steps of the specified stage and waits for them to be ready.

        Args:
            stage: the stage from which the steps have to be loaded.

        Returns:
            `True` if all the steps have been loaded correctly, `False` otherwise.
        """

        steps_stages, _ = self.dag.get_steps_load_stages()
        steps = steps_stages[stage]

        # Run the steps of the stage
        self._run_steps(steps=steps)

        # Wait for them to be ready
        self._logger.info(f"⏳ Waiting for all the steps of stage {stage} to load...")
        previous_message = None
        while not self._stop_called:
            with self._steps_load_status_lock:
                filtered_steps_load_status = {
                    step_name: replicas
                    for step_name, replicas in self._steps_load_status.items()
                    if step_name in steps
                }
                self._logger.debug(
                    f"Steps from stage {stage} loaded: {filtered_steps_load_status}"
                )

                if any(
                    replicas_loaded == _STEP_LOAD_FAILED_CODE
                    for replicas_loaded in filtered_steps_load_status.values()
                ):
                    self._logger.error(
                        f"❌ Failed to load all the steps of stage {stage}"
                    )
                    return False

                num_steps_loaded = 0
                replicas_message = ""
                for step_name, replicas in filtered_steps_load_status.items():
                    step_replica_count = self.dag.get_step_replica_count(step_name)
                    if replicas == step_replica_count:
                        num_steps_loaded += 1
                    replicas_message += f"\n * '{step_name}' replicas: {max(0, replicas)}/{step_replica_count}"

                message = f"⏳ Steps from stage {stage} loaded: {num_steps_loaded}/{len(filtered_steps_load_status)}{replicas_message}"
                if num_steps_loaded > 0 and message != previous_message:
                    self._logger.info(message)
                    previous_message = message

                if num_steps_loaded == len(filtered_steps_load_status):
                    self._logger.info(
                        f"✅ All the steps from stage {stage} have been loaded!"
                    )
                    return True

            time.sleep(2.5)

        return not self._stop_called

    def _handle_stop(self) -> None:
        """Handles the stop of the pipeline execution, which will stop the steps from
        processing more batches and wait for the output queue to be empty, to not lose
        any data that was already processed by the steps before the stop was called."""
        self._logger.debug("Handling stop of the pipeline execution...")

        self._add_batches_back_to_batch_manager()

        # Wait for the input queue to be empty, which means that all the steps finished
        # processing the batches that were sent before the stop flag.
        for step_name in self.dag:
            self._wait_step_input_queue_empty(step_name)

        self._consume_output_queue()

    def _wait_step_input_queue_empty(self, step_name: str) -> Union["Queue[Any]", None]:
        """Waits for the input queue of a step to be empty.

        Args:
            step_name: The name of the step.

        Returns:
            The input queue of the step if it's not loaded or finished, `None` otherwise.
        """
        if self._check_step_not_loaded_or_finished(step_name):
            return None

        if input_queue := self.dag.get_step(step_name).get(INPUT_QUEUE_ATTR_NAME):
            while input_queue.qsize() != 0:
                pass
            return input_queue

    def _check_step_not_loaded_or_finished(self, step_name: str) -> bool:
        """Checks if a step is not loaded or already finished.

        Args:
            step_name: The name of the step.

        Returns:
            `True` if the step is not loaded or already finished, `False` otherwise.
        """
        with self._steps_load_status_lock:
            num_replicas = self._steps_load_status[step_name]

            # The step has finished (replicas = 0) or it has failed to load
            if num_replicas in [0, _STEP_LOAD_FAILED_CODE]:
                return True

        return False

    @property
    @abstractmethod
    def QueueClass(self) -> Callable:
        """The class of the queue to use in the pipeline."""
        pass

    def _create_step_input_queue(self, step_name: str) -> "Queue[Any]":
        """Creates an input queue for a step.

        Args:
            step_name: The name of the step.

        Returns:
            The input queue created.
        """
        input_queue = self.QueueClass()
        self.dag.set_step_attr(step_name, INPUT_QUEUE_ATTR_NAME, input_queue)
        return input_queue

    @abstractmethod
    def _run_step(self, step: "_Step", input_queue: "Queue[Any]", replica: int) -> None:
        """Runs the `Step` instance.

        Args:
            step: The `Step` instance to run.
            input_queue: The input queue where the step will receive the batches.
            replica: The replica ID assigned.
        """
        pass

    def _run_steps(self, steps: Iterable[str]) -> None:
        """Runs the `Step`s of the pipeline, creating first an input queue for each step
        that will be used to send the batches.

        Args:
            steps:
        """
        for step_name in steps:
            step: "Step" = self.dag.get_step(step_name)[STEP_ATTR_NAME]
            input_queue = self._create_step_input_queue(step_name=step_name)

            # Set `pipeline` to `None` as in some Python environments the pipeline is not
            # picklable and it will raise an error when trying to send the step to the process.
            # `TypeError: cannot pickle 'code' object`
            step.pipeline = None

            if not step.is_normal and step.resources.replicas > 1:  # type: ignore
                self._logger.warning(
                    f"Step '{step_name}' is a `GeneratorStep` or `GlobalStep` and has more"
                    " than 1 replica. Only `Step` instances can have more than 1 replica."
                    " The number of replicas for the step will be set to 1."
                )

            step_num_replicas: int = step.resources.replicas if step.is_normal else 1  # type: ignore
            for replica in range(step_num_replicas):
                self._logger.debug(f"Running 1 replica of step '{step.name}'...")
                self._run_step(step=step, input_queue=input_queue, replica=replica)

    def _add_batches_back_to_batch_manager(self) -> None:
        """Add the `Batch`es that were sent to a `Step` back to the `_BatchManager`. This
        method should be used when the pipeline has been stopped prematurely."""
        for step_name in self.dag:
            node = self.dag.get_step(step_name)
            step: "_Step" = node[STEP_ATTR_NAME]
            if step.is_generator:
                continue
            if input_queue := node.get(INPUT_QUEUE_ATTR_NAME):
                while not input_queue.empty():
                    batch = input_queue.get()
                    if not isinstance(batch, _Batch):
                        continue
                    self._batch_manager.add_batch(  # type: ignore
                        to_step=step_name, batch=batch, prepend=True
                    )
                    self._logger.debug(
                        f"Adding batch back to the batch manager: {batch}"
                    )
                input_queue.put(None)

    def _consume_output_queue(self) -> None:
        """Consumes the `Batch`es from the output queue until it's empty. This method should
        be used when the pipeline has been stopped prematurely to consume and to not lose
        the `Batch`es that were processed by the leaf `Step`s before stopping the pipeline."""
        while not self._output_queue.empty():
            batch = self._output_queue.get()
            if batch is None:
                continue

            if batch.step_name in self.dag.leaf_steps:
                self._write_buffer.add_batch(batch)  # type: ignore

            self._handle_batch_on_stop(batch)

    def _manage_batch_flow(self, batch: "_Batch") -> None:
        """Checks if the step that generated the batch has more data in its buffer to
        generate a new batch. If there's data, then a new batch is sent to the step. If
        the step has no data in its buffer, then the predecessors generator steps are
        requested to send a new batch.

        Args:
            batch: The batch that was processed.
        """
        assert self._batch_manager, "Batch manager is not set"

        self._register_batch(batch)

        route_to, do_not_route_to, routed = self._get_successors(batch)

        # Keep track of the steps that the batch was routed to
        if routed:
            batch.batch_routed_to = route_to

        self._set_next_expected_seq_no(
            steps=do_not_route_to,
            from_step=batch.step_name,
            next_expected_seq_no=batch.seq_no + 1,
        )

        step = self._get_step_from_batch(batch)

        # Add the batch to the successors input buffers
        for successor in route_to:
            # Copy batch to avoid modifying the same reference in the batch manager
            batch_to_add = batch.copy() if len(route_to) > 1 else batch

            self._batch_manager.add_batch(successor, batch_to_add)

            # Check if the step is a generator and if there are successors that need data
            # from this step. This usually happens when the generator `batch_size` is smaller
            # than the `input_batch_size` of the successor steps.
            if (
                step.is_generator
                and step.name in self._batch_manager.step_empty_buffers(successor)
            ):
                last_batch_sent = self._batch_manager.get_last_batch_sent(step.name)
                self._send_batch_to_step(last_batch_sent.next_batch())  # type: ignore

            # If successor step has enough data in its buffer to create a new batch, then
            # send the batch to the step.
            if new_batch := self._batch_manager.get_batch(successor):
                self._send_batch_to_step(new_batch)

        if not step.is_generator:
            # Step ("this", the one from which the batch was received) has enough data on its
            # buffers to create a new batch
            if new_batch := self._batch_manager.get_batch(step.name):  # type: ignore
                self._send_batch_to_step(new_batch)
            else:
                self._request_more_batches_if_needed(step)
        else:
            if len(self.dag) == 1:
                self._request_batch_from_generator(step.name)  # type: ignore

        self._cache()

    def _send_to_step(self, step_name: str, to_send: Any) -> None:
        """Sends something to the input queue of a step.

        Args:
            step_name: The name of the step.
            to_send: The object to send.
        """
        input_queue = self.dag.get_step(step_name)[INPUT_QUEUE_ATTR_NAME]
        input_queue.put(to_send)

    def _send_batch_to_step(self, batch: "_Batch") -> None:
        """Sends a batch to the input queue of a step, writing the data of the batch
        to the filesystem and setting `batch.data_path` with the path where the data
        was written (if requiered i.e. the step is a global step or `use_fs_to_pass_data`)

        This method should be extended by the specific pipeline implementation, adding
        the logic to send the batch to the step.

        Args:
            batch: The batch to send.
        """
        self._logger.debug(
            f"Setting batch {batch.seq_no} as last batch sent to '{batch.step_name}': {batch}"
        )
        self._batch_manager.set_last_batch_sent(batch)  # type: ignore

        step: "_Step" = self.dag.get_step(batch.step_name)[STEP_ATTR_NAME]
        if not step.is_generator and (step.is_global or self._use_fs_to_pass_data):
            base_path = UPath(self._storage_base_path) / step.name  # type: ignore
            self._logger.debug(
                f"Writing {batch.seq_no} batch for '{batch.step_name}' step to filesystem: {base_path}"
            )
            batch.write_batch_data_to_fs(self._fs, base_path)  # type: ignore

        self._logger.debug(
            f"Sending batch {batch.seq_no} to step '{batch.step_name}': {batch}"
        )
        self._send_to_step(batch.step_name, batch)

    def _gather_requirements(self) -> List[str]:
        """Extracts the requirements from the steps to be used in the pipeline.

        Returns:
            List of requirements gathered from the steps.
        """
        steps_requirements = []
        for step in self.dag:
            step_req = self.dag.get_step(step)[STEP_ATTR_NAME].requirements
            steps_requirements.extend(step_req)

        return steps_requirements

    def _register_batch(self, batch: "_Batch") -> None:
        """Registers a batch in the batch manager.

        Args:
            batch: The batch to register.
        """
        self._batch_manager.register_batch(batch)  # type: ignore
        self._logger.debug(
            f"Batch {batch.seq_no} from step '{batch.step_name}' registered in batch"
            " manager"
        )

    def _send_last_batch_flag_to_step(self, step_name: str) -> None:
        """Sends the `LAST_BATCH_SENT_FLAG` to a step to stop processing batches.

        Args:
            step_name: The name of the step.
        """
        self._logger.debug(
            f"Sending `LAST_BATCH_SENT_FLAG` to '{step_name}' step to stop processing"
            " batches..."
        )

        for _ in range(self.dag.get_step_replica_count(step_name)):
            self._send_to_step(step_name, LAST_BATCH_SENT_FLAG)
        self._batch_manager.set_last_batch_flag_sent_to(step_name)  # type: ignore

    def _request_initial_batches(self) -> None:
        """Requests the initial batches to the generator steps."""
        assert self._batch_manager, "Batch manager is not set"

        for step in self._batch_manager._steps.values():
            if not self._is_step_running(step.step_name):
                continue
            if batch := step.get_batch():
                self._logger.debug(
                    f"Sending initial batch to '{step.step_name}' step: {batch}"
                )
                self._send_batch_to_step(batch)

        for step_name in self.dag.root_steps:
            if not self._is_step_running(step_name):
                continue
            seq_no = 0
            if last_batch := self._batch_manager.get_last_batch(step_name):
                seq_no = last_batch.seq_no + 1
            batch = _Batch(seq_no=seq_no, step_name=step_name, last_batch=self._dry_run)
            self._logger.debug(
                f"Requesting initial batch to '{step_name}' generator step: {batch}"
            )
            self._send_batch_to_step(batch)

    def _request_batch_from_generator(self, step_name: str) -> None:
        """Request a new batch to a `GeneratorStep`.

        Args:
            step_name: the name of the `GeneratorStep` to which a batch has to be requested.
        """
        # Get the last batch that the previous step sent to generate the next batch
        # (next `seq_no`).
        last_batch = self._batch_manager.get_last_batch_sent(step_name)  # type: ignore
        if last_batch is None:
            return
        self._send_batch_to_step(last_batch.next_batch())

    def _request_more_batches_if_needed(self, step: "Step") -> None:
        """Request more batches to the predecessors steps of `step` if needed.

        Args:
            step: The step of which it has to be checked if more batches are needed from
                its predecessors.
        """
        empty_buffers = self._batch_manager.step_empty_buffers(step.name)  # type: ignore
        for previous_step_name in empty_buffers:
            # Only more batches can be requested to the `GeneratorStep`s as they are the
            # only kind of steps that lazily generate batches.
            if previous_step_name not in self.dag.root_steps:
                continue

            self._request_batch_from_generator(previous_step_name)

    def _handle_batch_on_stop(self, batch: "_Batch") -> None:
        """Handles a batch that was received from the output queue when the pipeline was
        stopped. It will add and register the batch in the batch manager.

        Args:
            batch: The batch to handle.
        """
        assert self._batch_manager, "Batch manager is not set"

        self._batch_manager.register_batch(batch)
        step: "Step" = self.dag.get_step(batch.step_name)[STEP_ATTR_NAME]
        for successor in self.dag.get_step_successors(step.name):  # type: ignore
            self._batch_manager.add_batch(successor, batch)

    def _get_step_from_batch(self, batch: "_Batch") -> "Step":
        """Gets the `Step` instance from a batch.

        Args:
            batch: The batch to get the step from.

        Returns:
            The `Step` instance.
        """
        return self.dag.get_step(batch.step_name)[STEP_ATTR_NAME]

    def _notify_steps_to_stop(self) -> None:
        """Notifies the steps to stop their infinite running loop by sending `None` to
        their input queues."""
        with self._steps_load_status_lock:
            for step_name, replicas in self._steps_load_status.items():
                if replicas > 0:
                    self._send_to_step(step_name, None)

    def _get_successors(self, batch: "_Batch") -> Tuple[List[str], List[str], bool]:
        """Gets the successors and the successors to which the batch has to be routed.

        Args:
            batch: The batch to which the successors will be determined.

        Returns:
            The successors to route the batch to and whether the batch was routed using
            a routing function.
        """
        node = self.dag.get_step(batch.step_name)
        step: "Step" = node[STEP_ATTR_NAME]
        successors = list(self.dag.get_step_successors(step.name))  # type: ignore
        route_to = successors

        # Check if the step has a routing function to send the batch to specific steps
        if routing_batch_function := node.get(ROUTING_BATCH_FUNCTION_ATTR_NAME):
            route_to = routing_batch_function(batch, successors)
            successors_str = ", ".join(f"'{successor}'" for successor in route_to)
            self._logger.info(
                f"🚏 Using '{step.name}' routing function to send batch {batch.seq_no} to steps: {successors_str}"
            )

        return route_to, list(set(successors) - set(route_to)), route_to != successors

    def _set_next_expected_seq_no(
        self, steps: List[str], from_step: str, next_expected_seq_no: int
    ) -> None:
        """Sets the next expected sequence number of a `_Batch` received by `step` from
        `from_step`. This is necessary as some `Step`s might not receive all the batches
        comming from the previous steps because there is a routing batch function.

        Args:
            steps: list of steps to which the next expected sequence number of a `_Batch`
                from `from_step` has to be updated in the `_BatchManager`.
            from_step: the name of the step from which the next expected sequence number
                of a `_Batch` has to be updated in `steps`.
            next_expected_seq_no: the number of the next expected sequence number of a `Batch`
                from `from_step`.
        """
        assert self._batch_manager, "Batch manager is not set"

        for step in steps:
            self._batch_manager.set_next_expected_seq_no(
                step_name=step,
                from_step=from_step,
                next_expected_seq_no=next_expected_seq_no,
            )

    @abstractmethod
    def _teardown(self) -> None:
        """Clean/release/stop resources reserved to run the pipeline."""
        pass

    @abstractmethod
    def _set_steps_not_loaded_exception(self) -> None:
        """Used to raise `RuntimeError` when the load of the steps failed.

        Raises:
            RuntimeError: containing the information and why a step failed to be loaded.
        """
        pass

    @abstractmethod
    def _stop(self) -> None:
        """Stops the pipeline in a controlled way."""
        pass

    def _stop_load_queue_loop(self) -> None:
        """Stops the `_load_queue` loop sending a `None`."""
        self._logger.debug("Sending `None` to the load queue to notify stop...")
        self._load_queue.put(None)

    def _stop_output_queue_loop(self) -> None:
        """Stops the `_output_queue` loop sending a `None`."""
        self._logger.debug("Sending `None` to the output queue to notify stop...")
        self._output_queue.put(None)

    def _handle_keyboard_interrupt(self) -> Any:
        """Handles KeyboardInterrupt signal sent during the Pipeline.run method.

        It will try to call self._stop (if the pipeline didn't started yet, it won't
        have any effect), and if the pool is already started, will close it before exiting
        the program.

        Returns:
            The original `signal.SIGINT` handler.
        """

        def signal_handler(signumber: int, frame: Any) -> None:
            self._stop()

        return signal.signal(signal.SIGINT, signal_handler)

QueueClass: Callable abstractmethod property

The class of the queue to use in the pipeline.

__enter__()

Set the global pipeline instance when entering a pipeline context.

Source code in src/distilabel/pipeline/base.py
def __enter__(self) -> Self:
    """Set the global pipeline instance when entering a pipeline context."""
    _GlobalPipelineManager.set_pipeline(self)
    return self

__exit__(exc_type, exc_value, traceback)

Unset the global pipeline instance when exiting a pipeline context.

Source code in src/distilabel/pipeline/base.py
def __exit__(self, exc_type, exc_value, traceback) -> None:
    """Unset the global pipeline instance when exiting a pipeline context."""
    _GlobalPipelineManager.set_pipeline(None)

__init__(name=None, description=None, cache_dir=None, enable_metadata=False, requirements=None)

Initialize the BasePipeline instance.

Parameters:

Name Type Description Default
name Optional[str]

The name of the pipeline. If not generated, a random one will be generated by default.

None
description Optional[str]

A description of the pipeline. Defaults to None.

None
cache_dir Optional[Union[str, PathLike]]

A directory where the pipeline will be cached. Defaults to None.

None
enable_metadata bool

Whether to include the distilabel metadata column for the pipeline in the final Distiset. It contains metadata used by distilabel, for example the raw outputs of the LLM without processing would be here, inside raw_output_... field. Defaults to False.

False
requirements Optional[List[str]]

List of requirements that must be installed to run the pipeline. Defaults to None, but can be helpful to inform in a pipeline to be shared that this requirements must be installed.

None
Source code in src/distilabel/pipeline/base.py
def __init__(
    self,
    name: Optional[str] = None,
    description: Optional[str] = None,
    cache_dir: Optional[Union[str, "PathLike"]] = None,
    enable_metadata: bool = False,
    requirements: Optional[List[str]] = None,
) -> None:
    """Initialize the `BasePipeline` instance.

    Args:
        name: The name of the pipeline. If not generated, a random one will be generated by default.
        description: A description of the pipeline. Defaults to `None`.
        cache_dir: A directory where the pipeline will be cached. Defaults to `None`.
        enable_metadata: Whether to include the distilabel metadata column for the pipeline
            in the final `Distiset`. It contains metadata used by distilabel, for example
            the raw outputs of the `LLM` without processing would be here, inside `raw_output_...`
            field. Defaults to `False`.
        requirements: List of requirements that must be installed to run the pipeline.
            Defaults to `None`, but can be helpful to inform in a pipeline to be shared
            that this requirements must be installed.
    """
    self.name = name or f"pipeline_{str(uuid.uuid4())[:8]}"
    self.description = description
    self._enable_metadata = enable_metadata
    self.dag = DAG()

    if cache_dir:
        self._cache_dir = Path(cache_dir)
    elif env_cache_dir := os.getenv("DISTILABEL_CACHE_DIR"):
        self._cache_dir = Path(env_cache_dir)
    else:
        self._cache_dir = BASE_CACHE_DIR

    self._logger = logging.getLogger("distilabel.pipeline")

    self._batch_manager: Optional["_BatchManager"] = None
    self._write_buffer: Optional["_WriteBuffer"] = None

    self._steps_load_status: Dict[str, int] = {}
    self._steps_load_status_lock = threading.Lock()

    self._stop_called = False
    self._stop_called_lock = threading.Lock()
    self._stop_calls = 0

    self._fs: Optional[fsspec.AbstractFileSystem] = None
    self._storage_base_path: Optional[str] = None
    self._use_fs_to_pass_data: bool = False
    self._dry_run = False

    self._current_stage = 0
    self._stages_last_batch: List[List[str]] = []

    self.requirements = requirements or []

    self._exception: Union[Exception, None] = None

    self._log_queue: Union["Queue[Any]", None] = None

dry_run(parameters=None, batch_size=1)

Do a dry run to test the pipeline runs as expected.

Running a Pipeline in dry run mode will set all the batch_size of generator steps to the specified batch_size, and run just with a single batch, effectively running the whole pipeline with a single example. The cache will be set to False.

Parameters:

Name Type Description Default
parameters Optional[Dict[str, Dict[str, Any]]]

A dictionary with the step name as the key and a dictionary with the runtime parameters for the step as the value. Defaults to None.

None
batch_size int

The batch size of the unique batch generated by the generators steps of the pipeline. Defaults to 1.

1

Returns:

Type Description
Distiset

Will return the Distiset as the main run method would do.

Source code in src/distilabel/pipeline/base.py
def dry_run(
    self,
    parameters: Optional[Dict[str, Dict[str, Any]]] = None,
    batch_size: int = 1,
) -> "Distiset":
    """Do a dry run to test the pipeline runs as expected.

    Running a `Pipeline` in dry run mode will set all the `batch_size` of generator steps
    to the specified `batch_size`, and run just with a single batch, effectively
    running the whole pipeline with a single example. The cache will be set to `False`.

    Args:
        parameters: A dictionary with the step name as the key and a dictionary with
            the runtime parameters for the step as the value. Defaults to `None`.
        batch_size: The batch size of the unique batch generated by the generators
            steps of the pipeline. Defaults to `1`.

    Returns:
        Will return the `Distiset` as the main run method would do.
    """
    self._dry_run = True

    for step_name in self.dag:
        step = self.dag.get_step(step_name)[STEP_ATTR_NAME]

        if step.is_generator:
            if not parameters:
                parameters = {}
            parameters[step_name] = {"batch_size": batch_size}

    distiset = self.run(parameters=parameters, use_cache=False)

    self._dry_run = False
    return distiset

from_dict(data) classmethod

Create a Pipeline from a dict containing the serialized data.

Note

It's intended for internal use.

Parameters:

Name Type Description Default
data Dict[str, Any]

Dictionary containing the serialized data from a Pipeline.

required

Returns:

Name Type Description
BasePipeline Self

Pipeline recreated from the dictionary info.

Source code in src/distilabel/pipeline/base.py
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> Self:
    """Create a Pipeline from a dict containing the serialized data.

    Note:
        It's intended for internal use.

    Args:
        data (Dict[str, Any]): Dictionary containing the serialized data from a Pipeline.

    Returns:
        BasePipeline: Pipeline recreated from the dictionary info.
    """
    name = data["pipeline"]["name"]
    description = data["pipeline"].get("description")
    requirements = data.get("requirements", [])
    with cls(name=name, description=description, requirements=requirements) as pipe:
        pipe.dag = DAG.from_dict(data["pipeline"])
    return pipe

get_runtime_parameters_info()

Get the runtime parameters for the steps in the pipeline.

Returns:

Type Description
PipelineRuntimeParametersInfo

A dictionary with the step name as the key and a list of dictionaries with

PipelineRuntimeParametersInfo

the parameter name and the parameter info as the value.

Source code in src/distilabel/pipeline/base.py
def get_runtime_parameters_info(self) -> "PipelineRuntimeParametersInfo":
    """Get the runtime parameters for the steps in the pipeline.

    Returns:
        A dictionary with the step name as the key and a list of dictionaries with
        the parameter name and the parameter info as the value.
    """
    runtime_parameters = {}
    for step_name in self.dag:
        step: "_Step" = self.dag.get_step(step_name)[STEP_ATTR_NAME]
        runtime_parameters[step_name] = step.get_runtime_parameters_info()
    return runtime_parameters

run(parameters=None, use_cache=True, storage_parameters=None, use_fs_to_pass_data=False, dataset=None)

Run the pipeline. It will set the runtime parameters for the steps and validate the pipeline.

This method should be extended by the specific pipeline implementation, adding the logic to run the pipeline.

Parameters:

Name Type Description Default
parameters Optional[Dict[str, Dict[str, Any]]]

A dictionary with the step name as the key and a dictionary with the runtime parameters for the step as the value. Defaults to None.

None
use_cache bool

Whether to use the cache from previous pipeline runs. Defaults to True.

True
storage_parameters Optional[Dict[str, Any]]

A dictionary with the storage parameters (fsspec and path) that will be used to store the data of the _Batches passed between the steps if use_fs_to_pass_data is True (for the batches received by a GlobalStep it will be always used). It must have at least the "path" key, and it can contain additional keys depending on the protocol. By default, it will use the local file system and a directory in the cache directory. Defaults to None.

None
use_fs_to_pass_data bool

Whether to use the file system to pass the data of the _Batches between the steps. Even if this parameter is False, the Batches received by GlobalSteps will always use the file system to pass the data. Defaults to False.

False
dataset Optional[InputDataset]

If given, it will be used to create a GeneratorStep and put it as the root step. Convenient method when you have already processed the dataset in your script and just want to pass it already processed. Defaults to None.

None

Returns:

Type Description
Distiset

The Distiset created by the pipeline.

Source code in src/distilabel/pipeline/base.py
def run(
    self,
    parameters: Optional[Dict[str, Dict[str, Any]]] = None,
    use_cache: bool = True,
    storage_parameters: Optional[Dict[str, Any]] = None,
    use_fs_to_pass_data: bool = False,
    dataset: Optional["InputDataset"] = None,
) -> "Distiset":  # type: ignore
    """Run the pipeline. It will set the runtime parameters for the steps and validate
    the pipeline.

    This method should be extended by the specific pipeline implementation,
    adding the logic to run the pipeline.

    Args:
        parameters: A dictionary with the step name as the key and a dictionary with
            the runtime parameters for the step as the value. Defaults to `None`.
        use_cache: Whether to use the cache from previous pipeline runs. Defaults to
            `True`.
        storage_parameters: A dictionary with the storage parameters (`fsspec` and path)
            that will be used to store the data of the `_Batch`es passed between the
            steps if `use_fs_to_pass_data` is `True` (for the batches received by a
            `GlobalStep` it will be always used). It must have at least the "path" key,
            and it can contain additional keys depending on the protocol. By default,
            it will use the local file system and a directory in the cache directory.
            Defaults to `None`.
        use_fs_to_pass_data: Whether to use the file system to pass the data of
            the `_Batch`es between the steps. Even if this parameter is `False`, the
            `Batch`es received by `GlobalStep`s will always use the file system to
            pass the data. Defaults to `False`.
        dataset: If given, it will be used to create a `GeneratorStep` and put it as the
            root step. Convenient method when you have already processed the dataset in
            your script and just want to pass it already processed. Defaults to `None`.

    Returns:
        The `Distiset` created by the pipeline.
    """

    self._exception: Union[Exception, None] = None

    # Set the runtime parameters that will be used during the pipeline execution.
    # They are used to generate the signature of the pipeline that is used to hit the
    # cache when the pipeline is run, so it's important to do it first.
    self._set_runtime_parameters(parameters or {})

    setup_logging(
        log_queue=self._log_queue, filename=str(self._cache_location["log_file"])
    )

    if dataset is not None:
        self._add_dataset_generator_step(dataset)

    # Validate the pipeline DAG to check that all the steps are chainable, there are
    # no missing runtime parameters, batch sizes are correct, etc.
    self.dag.validate()

    # Set the initial load status for all the steps
    self._init_steps_load_status()

    # Load the stages status or initialize it
    self._load_stages_status(use_cache)

    # Load the `_BatchManager` from cache or create one from scratch
    self._load_batch_manager(use_cache)

    if to_install := self.requirements_to_install():
        # Print the list of requirements like they would appear in a requirements.txt
        to_install_list = "\n" + "\n".join(to_install)
        msg = f"Please install the following requirements to run the pipeline: {to_install_list}"
        self._logger.error(msg)
        raise ModuleNotFoundError(msg)

    # Setup the filesystem that will be used to pass the data of the `_Batch`es
    self._setup_fsspec(storage_parameters)
    self._use_fs_to_pass_data = use_fs_to_pass_data

    if self._dry_run:
        self._logger.info("🌵 Dry run mode")

    # If the batch manager is not able to generate batches, that means that the loaded
    # `_BatchManager` from cache didn't have any remaining batches to process i.e.
    # the previous pipeline execution was completed successfully.
    if not self._batch_manager.can_generate():  # type: ignore
        self._logger.info(
            "💾 Loaded batch manager from cache doesn't contain any remaining data."
            " Returning `Distiset` from cache data..."
        )
        distiset = create_distiset(
            self._cache_location["data"],
            pipeline_path=self._cache_location["pipeline"],
            log_filename_path=self._cache_location["log_file"],
            enable_metadata=self._enable_metadata,
            dag=self.dag,
        )
        stop_logging()
        return distiset

    self._setup_write_buffer()

    self._print_load_stages_info()

Pipeline

Bases: BasePipeline

Local pipeline implementation using multiprocessing.

Source code in src/distilabel/pipeline/local.py
class Pipeline(BasePipeline):
    """Local pipeline implementation using `multiprocessing`."""

    def ray(
        self,
        ray_head_node_url: Optional[str] = None,
        ray_init_kwargs: Optional[Dict[str, Any]] = None,
    ) -> RayPipeline:
        """Creates a `RayPipeline` using the init parameters of this pipeline. This is a
        convenient method that can be used to "transform" one common `Pipeline` to a `RayPipeline`
        and it's mainly used by the CLI.

        Args:
            ray_head_node_url: The URL that can be used to connect to the head node of
                the Ray cluster. Normally, you won't want to use this argument as the
                recommended way to submit a job to a Ray cluster is using the [Ray Jobs
                CLI](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#ray-jobs-overview).
                Defaults to `None`.
            ray_init_kwargs: kwargs that will be passed to the `ray.init` method. Defaults
                to `None`.

        Returns:
            A `RayPipeline` instance.
        """
        pipeline = RayPipeline(
            name=self.name,
            description=self.description,
            cache_dir=self._cache_dir,
            enable_metadata=self._enable_metadata,
            requirements=self.requirements,
            ray_head_node_url=ray_head_node_url,
            ray_init_kwargs=ray_init_kwargs,
        )
        pipeline.dag = self.dag
        return pipeline

    def run(
        self,
        parameters: Optional[Dict[str, Dict[str, Any]]] = None,
        use_cache: bool = True,
        storage_parameters: Optional[Dict[str, Any]] = None,
        use_fs_to_pass_data: bool = False,
        dataset: Optional["InputDataset"] = None,
    ) -> "Distiset":
        """Runs the pipeline.

        Args:
            parameters: A dictionary with the step name as the key and a dictionary with
                the runtime parameters for the step as the value. Defaults to `None`.
            use_cache: Whether to use the cache from previous pipeline runs. Defaults to
                `True`.
            storage_parameters: A dictionary with the storage parameters (`fsspec` and path)
                that will be used to store the data of the `_Batch`es passed between the
                steps if `use_fs_to_pass_data` is `True` (for the batches received by a
                `GlobalStep` it will be always used). It must have at least the "path" key,
                and it can contain additional keys depending on the protocol. By default,
                it will use the local file system and a directory in the cache directory.
                Defaults to `None`.
            use_fs_to_pass_data: Whether to use the file system to pass the data of
                the `_Batch`es between the steps. Even if this parameter is `False`, the
                `Batch`es received by `GlobalStep`s will always use the file system to
                pass the data. Defaults to `False`.
            dataset: If given, it will be used to create a `GeneratorStep` and put it as the
                root step. Convenient method when you have already processed the dataset in
                your script and just want to pass it already processed. Defaults to `None`.

        Returns:
            The `Distiset` created by the pipeline.

        Raises:
            RuntimeError: If the pipeline fails to load all the steps.
        """
        if script_executed_in_ray_cluster():
            print("Script running in Ray cluster... Using `RayPipeline`...")
            return self.ray().run(
                parameters=parameters,
                use_cache=use_cache,
                storage_parameters=storage_parameters,
                use_fs_to_pass_data=use_fs_to_pass_data,
                dataset=dataset,
            )

        self._log_queue = cast("Queue[Any]", mp.Queue())

        if distiset := super().run(
            parameters,
            use_cache,
            storage_parameters,
            use_fs_to_pass_data,
            dataset=dataset,
        ):
            return distiset

        num_processes = self.dag.get_total_replica_count()
        with (
            mp.Manager() as manager,
            _NoDaemonPool(
                num_processes,
                initializer=_init_worker,
                initargs=(self._log_queue,),
            ) as pool,
        ):
            self._manager = manager
            self._pool = pool
            self._output_queue = self.QueueClass()
            self._load_queue = self.QueueClass()
            self._handle_keyboard_interrupt()

            # Run the loop for receiving the load status of each step
            self._load_steps_thread = self._run_load_queue_loop_in_thread()

            # Start a loop to receive the output batches from the steps
            self._output_queue_thread = self._run_output_queue_loop_in_thread()
            self._output_queue_thread.join()

            self._teardown()

            if self._exception:
                raise self._exception

        distiset = create_distiset(
            self._cache_location["data"],
            pipeline_path=self._cache_location["pipeline"],
            log_filename_path=self._cache_location["log_file"],
            enable_metadata=self._enable_metadata,
            dag=self.dag,
        )

        stop_logging()

        return distiset

    @property
    def QueueClass(self) -> Callable:
        """The callable used to create the input and output queues.

        Returns:
            The callable to create a `Queue`.
        """
        assert self._manager, "Manager is not initialized"
        return self._manager.Queue

    def _run_step(self, step: "_Step", input_queue: "Queue[Any]", replica: int) -> None:
        """Runs the `Step` wrapped in a `_ProcessWrapper` in a separate process of the
        `Pool`.

        Args:
            step: The step to run.
            input_queue: The input queue to send the data to the step.
            replica: The replica ID assigned.
        """
        assert self._pool, "Pool is not initialized"

        step_wrapper = _StepWrapper(
            step=step,  # type: ignore
            replica=replica,
            input_queue=input_queue,
            output_queue=self._output_queue,
            load_queue=self._load_queue,
            dry_run=self._dry_run,
            ray_pipeline=False,
        )

        self._pool.apply_async(step_wrapper.run, error_callback=self._error_callback)

    def _error_callback(self, e: BaseException) -> None:
        """Error callback that will be called when an error occurs in a `Step` process.

        Args:
            e: The exception raised by the process.
        """
        global _SUBPROCESS_EXCEPTION

        # First we check that the exception is a `_StepWrapperException`, otherwise, we
        # print it out and stop the pipeline, since some errors may be unhandled
        if not isinstance(e, _StepWrapperException):
            self._logger.error(f"❌ Failed with an unhandled exception: {e}")
            self._stop()
            return

        if e.is_load_error:
            self._logger.error(f"❌ Failed to load step '{e.step.name}': {e.message}")
            _SUBPROCESS_EXCEPTION = e.subprocess_exception
            _SUBPROCESS_EXCEPTION.__traceback__ = tblib.Traceback.from_string(  # type: ignore
                e.formatted_traceback
            ).as_traceback()
            return

        # If the step is global, is not in the last trophic level and has no successors,
        # then we can ignore the error and continue executing the pipeline
        step_name: str = e.step.name  # type: ignore
        if (
            e.step.is_global
            and not self.dag.step_in_last_trophic_level(step_name)
            and list(self.dag.get_step_successors(step_name)) == []
        ):
            self._logger.error(
                f"✋ An error occurred when running global step '{step_name}' with no"
                " successors and not in the last trophic level. Pipeline execution can"
                f" continue. Error will be ignored."
            )
            self._logger.error(f"Subprocess traceback:\n\n{e.formatted_traceback}")
            return

        # Global step with successors failed
        self._logger.error(f"An error occurred in global step '{step_name}'")
        self._logger.error(f"Subprocess traceback:\n\n{e.formatted_traceback}")

        self._stop()

    def _teardown(self) -> None:
        """Clean/release/stop resources reserved to run the pipeline."""
        if self._write_buffer:
            self._write_buffer.close()

        if self._batch_manager:
            self._batch_manager = None

        self._stop_load_queue_loop()
        self._load_steps_thread.join()

        if self._pool:
            self._pool.terminate()
            self._pool.join()

        if self._manager:
            self._manager.shutdown()
            self._manager.join()

    def _set_steps_not_loaded_exception(self) -> None:
        """Raises a `RuntimeError` notifying that the steps load has failed.

        Raises:
            RuntimeError: containing the information and why a step failed to be loaded.
        """
        self._exception = RuntimeError(
            "Failed to load all the steps. Could not run pipeline."
        )
        self._exception.__cause__ = _SUBPROCESS_EXCEPTION

    def _stop(self) -> None:
        """Stops the pipeline execution. It will first send `None` to the input queues
        of all the steps and then wait until the output queue is empty i.e. all the steps
        finished processing the batches that were sent before the stop flag. Then it will
        send `None` to the output queue to notify the pipeline to stop."""

        with self._stop_called_lock:
            if self._stop_called:
                self._stop_calls += 1
                if self._stop_calls == 1:
                    self._logger.warning(
                        "🛑 Press again to force the pipeline to stop."
                    )
                elif self._stop_calls > 1:
                    self._logger.warning("🛑 Forcing pipeline interruption.")

                    if self._pool:
                        self._pool.terminate()
                        self._pool.join()
                        self._pool = None

                    if self._manager:
                        self._manager.shutdown()
                        self._manager.join()
                        self._manager = None

                    stop_logging()

                    sys.exit(1)

                return
            self._stop_called = True

        self._logger.debug(
            f"Steps loaded before calling `stop`: {self._steps_load_status}"
        )
        self._logger.info(
            "🛑 Stopping pipeline. Waiting for steps to finish processing batches..."
        )

        self._stop_load_queue_loop()
        self._stop_output_queue_loop()

QueueClass: Callable property

The callable used to create the input and output queues.

Returns:

Type Description
Callable

The callable to create a Queue.

ray(ray_head_node_url=None, ray_init_kwargs=None)

Creates a RayPipeline using the init parameters of this pipeline. This is a convenient method that can be used to "transform" one common Pipeline to a RayPipeline and it's mainly used by the CLI.

Parameters:

Name Type Description Default
ray_head_node_url Optional[str]

The URL that can be used to connect to the head node of the Ray cluster. Normally, you won't want to use this argument as the recommended way to submit a job to a Ray cluster is using the Ray Jobs CLI. Defaults to None.

None
ray_init_kwargs Optional[Dict[str, Any]]

kwargs that will be passed to the ray.init method. Defaults to None.

None

Returns:

Type Description
RayPipeline

A RayPipeline instance.

Source code in src/distilabel/pipeline/local.py
def ray(
    self,
    ray_head_node_url: Optional[str] = None,
    ray_init_kwargs: Optional[Dict[str, Any]] = None,
) -> RayPipeline:
    """Creates a `RayPipeline` using the init parameters of this pipeline. This is a
    convenient method that can be used to "transform" one common `Pipeline` to a `RayPipeline`
    and it's mainly used by the CLI.

    Args:
        ray_head_node_url: The URL that can be used to connect to the head node of
            the Ray cluster. Normally, you won't want to use this argument as the
            recommended way to submit a job to a Ray cluster is using the [Ray Jobs
            CLI](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#ray-jobs-overview).
            Defaults to `None`.
        ray_init_kwargs: kwargs that will be passed to the `ray.init` method. Defaults
            to `None`.

    Returns:
        A `RayPipeline` instance.
    """
    pipeline = RayPipeline(
        name=self.name,
        description=self.description,
        cache_dir=self._cache_dir,
        enable_metadata=self._enable_metadata,
        requirements=self.requirements,
        ray_head_node_url=ray_head_node_url,
        ray_init_kwargs=ray_init_kwargs,
    )
    pipeline.dag = self.dag
    return pipeline

run(parameters=None, use_cache=True, storage_parameters=None, use_fs_to_pass_data=False, dataset=None)

Runs the pipeline.

Parameters:

Name Type Description Default
parameters Optional[Dict[str, Dict[str, Any]]]

A dictionary with the step name as the key and a dictionary with the runtime parameters for the step as the value. Defaults to None.

None
use_cache bool

Whether to use the cache from previous pipeline runs. Defaults to True.

True
storage_parameters Optional[Dict[str, Any]]

A dictionary with the storage parameters (fsspec and path) that will be used to store the data of the _Batches passed between the steps if use_fs_to_pass_data is True (for the batches received by a GlobalStep it will be always used). It must have at least the "path" key, and it can contain additional keys depending on the protocol. By default, it will use the local file system and a directory in the cache directory. Defaults to None.

None
use_fs_to_pass_data bool

Whether to use the file system to pass the data of the _Batches between the steps. Even if this parameter is False, the Batches received by GlobalSteps will always use the file system to pass the data. Defaults to False.

False
dataset Optional[InputDataset]

If given, it will be used to create a GeneratorStep and put it as the root step. Convenient method when you have already processed the dataset in your script and just want to pass it already processed. Defaults to None.

None

Returns:

Type Description
Distiset

The Distiset created by the pipeline.

Raises:

Type Description
RuntimeError

If the pipeline fails to load all the steps.

Source code in src/distilabel/pipeline/local.py
def run(
    self,
    parameters: Optional[Dict[str, Dict[str, Any]]] = None,
    use_cache: bool = True,
    storage_parameters: Optional[Dict[str, Any]] = None,
    use_fs_to_pass_data: bool = False,
    dataset: Optional["InputDataset"] = None,
) -> "Distiset":
    """Runs the pipeline.

    Args:
        parameters: A dictionary with the step name as the key and a dictionary with
            the runtime parameters for the step as the value. Defaults to `None`.
        use_cache: Whether to use the cache from previous pipeline runs. Defaults to
            `True`.
        storage_parameters: A dictionary with the storage parameters (`fsspec` and path)
            that will be used to store the data of the `_Batch`es passed between the
            steps if `use_fs_to_pass_data` is `True` (for the batches received by a
            `GlobalStep` it will be always used). It must have at least the "path" key,
            and it can contain additional keys depending on the protocol. By default,
            it will use the local file system and a directory in the cache directory.
            Defaults to `None`.
        use_fs_to_pass_data: Whether to use the file system to pass the data of
            the `_Batch`es between the steps. Even if this parameter is `False`, the
            `Batch`es received by `GlobalStep`s will always use the file system to
            pass the data. Defaults to `False`.
        dataset: If given, it will be used to create a `GeneratorStep` and put it as the
            root step. Convenient method when you have already processed the dataset in
            your script and just want to pass it already processed. Defaults to `None`.

    Returns:
        The `Distiset` created by the pipeline.

    Raises:
        RuntimeError: If the pipeline fails to load all the steps.
    """
    if script_executed_in_ray_cluster():
        print("Script running in Ray cluster... Using `RayPipeline`...")
        return self.ray().run(
            parameters=parameters,
            use_cache=use_cache,
            storage_parameters=storage_parameters,
            use_fs_to_pass_data=use_fs_to_pass_data,
            dataset=dataset,
        )

    self._log_queue = cast("Queue[Any]", mp.Queue())

    if distiset := super().run(
        parameters,
        use_cache,
        storage_parameters,
        use_fs_to_pass_data,
        dataset=dataset,
    ):
        return distiset

    num_processes = self.dag.get_total_replica_count()
    with (
        mp.Manager() as manager,
        _NoDaemonPool(
            num_processes,
            initializer=_init_worker,
            initargs=(self._log_queue,),
        ) as pool,
    ):
        self._manager = manager
        self._pool = pool
        self._output_queue = self.QueueClass()
        self._load_queue = self.QueueClass()
        self._handle_keyboard_interrupt()

        # Run the loop for receiving the load status of each step
        self._load_steps_thread = self._run_load_queue_loop_in_thread()

        # Start a loop to receive the output batches from the steps
        self._output_queue_thread = self._run_output_queue_loop_in_thread()
        self._output_queue_thread.join()

        self._teardown()

        if self._exception:
            raise self._exception

    distiset = create_distiset(
        self._cache_location["data"],
        pipeline_path=self._cache_location["pipeline"],
        log_filename_path=self._cache_location["log_file"],
        enable_metadata=self._enable_metadata,
        dag=self.dag,
    )

    stop_logging()

    return distiset