UMAP¶

UMAP is a general purpose manifold learning and dimension reduction algorithm.

This is a GlobalStep that reduces the dimensionality of the embeddings using UMAP. Visit the TextClustering step for an example of use. The trained model is saved as an artifact when creating a distiset and pushing it to the Hugging Face Hub.

Attributes¶

n_components: The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.
metric: The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to euclidean.
n_jobs: The number of parallel jobs to run. Defaults to 8.
random_state: The random state to use for the UMAP algorithm.

Runtime Parameters¶

n_components: The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.
metric: The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to euclidean.
n_jobs: The number of parallel jobs to run. Defaults to 8.
random_state: The random state to use for the UMAP algorithm.

Input & Output Columns¶

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[embedding]
        end
        subgraph New columns
            OCOL0[projection]
        end
    end

    subgraph UMAP
        StepInput[Input Columns: embedding]
        StepOutput[Output Columns: projection]
    end

    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepInput --> StepOutput

Inputs¶

embedding (List[float]): The original embeddings whose dimensionality will be reduced.

Outputs¶

projection (List[float]): The embedding reduced to the specified number of components. The length of each projection equals n_components.
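
The column contract can be sketched with plain Python dictionaries: each row keeps its `embedding` and gains a `projection` whose length equals `n_components`. The helper `check_projection` and the sample values below are hypothetical, written only to illustrate the expected row shape. They are not part of the library.

```python
from typing import Dict, List

Row = Dict[str, List[float]]


def check_projection(rows: List[Row], n_components: int) -> None:
    """Raise ValueError if a row breaks the embedding/projection contract."""
    for index, row in enumerate(rows):
        if "embedding" not in row:
            raise ValueError(f"row {index} has no 'embedding' column")
        projection = row.get("projection")
        if projection is None or len(projection) != n_components:
            raise ValueError(
                f"row {index} needs a 'projection' of length {n_components}"
            )


# Illustrative rows: 4-dimensional embeddings reduced to 2 components.
rows = [
    {"embedding": [0.1, 0.4, 0.3, 0.9], "projection": [1.5, -0.2]},
    {"embedding": [0.7, 0.2, 0.8, 0.1], "projection": [0.3, 2.1]},
]

check_projection(rows, n_components=2)  # passes silently for valid rows
```

A check like this can be handy after loading a dataset back from the Hub, when the value of `n_components` used at creation time is no longer at hand.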