UMAP¶
UMAP is a general purpose manifold learning and dimension reduction algorithm.
This is a GlobalStep that reduces the dimensionality of the embeddings using UMAP. Visit the TextClustering step for an example of use. The trained model is saved as an artifact when creating a distiset and pushing it to the Hugging Face Hub.
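As a rough illustration of standalone use, the sketch below builds synthetic rows and runs the step on them outside a pipeline. It assumes that `distilabel` and `umap-learn` are installed, that `UMAP` is importable from `distilabel.steps`, and that the step is driven with `load()` followed by `process()`. The helpers `make_rows` and `run_umap_step` are hypothetical names introduced only for this example.

```python
import random
from typing import Dict, List


def make_rows(n_rows: int, dim: int, seed: int = 0) -> List[Dict[str, List[float]]]:
    """Build synthetic input rows, each carrying an `embedding` column."""
    rng = random.Random(seed)
    return [{"embedding": [rng.random() for _ in range(dim)]} for _ in range(n_rows)]


def run_umap_step(rows: List[Dict[str, List[float]]], n_components: int = 2):
    """Run the UMAP step on all rows at once (it is a global step).

    Assumption: the import path and the load()/process() calls below match
    the installed distilabel version.
    """
    from distilabel.steps import UMAP  # imported lazily: optional dependency

    step = UMAP(n_components=n_components, metric="cosine", random_state=42)
    step.load()
    # process() is a generator; a global step yields a single batch
    # containing every row, now with a `projection` column added.
    return next(step.process(rows))
```

Calling `run_umap_step(make_rows(100, 32))` should return the same rows with a `projection` entry of length `n_components` each. A few dozen rows or more is a sensible minimum, since UMAP builds a nearest-neighbor graph over the whole input.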
Attributes¶
- n_components: The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.
- metric: The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to euclidean.
- n_jobs: The number of parallel jobs to run. Defaults to 8.
- random_state: The random state to use for the UMAP algorithm.
Runtime Parameters¶
- n_components: The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.
- metric: The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to euclidean.
- n_jobs: The number of parallel jobs to run. Defaults to 8.
- random_state: The random state to use for the UMAP algorithm.
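Runtime parameters are typically supplied when the pipeline is run rather than when the step is constructed. The sketch below only assembles the dictionary of overrides, using the defaults listed above. The step name `umap_0`, the helper `umap_runtime_parameters`, and the idea of passing the result as `pipeline.run(parameters=...)` are assumptions for illustration, not something this page confirms.

```python
from typing import Any, Dict, Optional


def umap_runtime_parameters(
    step_name: str,
    n_components: int = 2,
    metric: str = "euclidean",
    n_jobs: int = 8,
    random_state: Optional[int] = None,
) -> Dict[str, Dict[str, Any]]:
    """Build a {step_name: {parameter: value}} mapping for the UMAP step."""
    params: Dict[str, Any] = {
        "n_components": n_components,
        "metric": metric,
        "n_jobs": n_jobs,
    }
    # Only include random_state when the caller sets it explicitly.
    if random_state is not None:
        params["random_state"] = random_state
    return {step_name: params}


# "umap_0" is a hypothetical step name; use the name given to your step.
parameters = umap_runtime_parameters(
    "umap_0", n_components=3, metric="cosine", random_state=42
)
# Assumed usage: pipeline.run(parameters=parameters)
```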
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[embedding]
end
subgraph New columns
OCOL0[projection]
end
end
subgraph UMAP
StepInput[Input Columns: embedding]
StepOutput[Output Columns: projection]
end
ICOL0 --> StepInput
StepOutput --> OCOL0
StepInput --> StepOutput
Inputs¶
- embedding (List[float]): The original embeddings whose dimensionality we want to reduce.
Outputs¶
- projection (List[float]): The embedding reduced to the specified number of components. The size of each new embedding is determined by n_components.
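The column contract can be sketched without any dependencies: every row comes in with an `embedding` and leaves with an additional `projection` whose length equals `n_components`. The function below is a stand-in that uses a fixed random linear projection, not UMAP, so the values it produces are meaningless; only the shape of the input and output mirrors the step. The name `add_projection` is hypothetical.

```python
import random
from typing import Dict, List

Row = Dict[str, List[float]]


def add_projection(rows: List[Row], n_components: int = 2, seed: int = 0) -> List[Row]:
    """Stand-in for the step: NOT UMAP, just a random linear projection."""
    dim = len(rows[0]["embedding"])
    rng = random.Random(seed)
    # One random direction per output component.
    directions = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_components)]
    out = []
    for row in rows:
        emb = row["embedding"]
        # Dot product of the embedding with each direction.
        projection = [sum(w * x for w, x in zip(d, emb)) for d in directions]
        # Keep the original columns and add the new one.
        out.append({**row, "projection": projection})
    return out


rows = [
    {"embedding": [0.1, 0.2, 0.3, 0.4]},
    {"embedding": [0.5, 0.6, 0.7, 0.8]},
]
result = add_projection(rows, n_components=2)
print(len(result[0]["projection"]))  # length equals n_components
```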