DBSCAN¶
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core
samples in regions of high density and expands clusters from them. The algorithm works well on data whose clusters have similar density.
This is a `GlobalStep` that clusters the embeddings using the DBSCAN algorithm
from `sklearn`. See the `TextClustering` step for a usage example.
The trained model is saved as an artifact when creating a distiset
and pushing it to the Hugging Face Hub.
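To make "finds core samples and expands clusters from them" concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not the `sklearn` implementation this step relies on; the function names `region_query` and `dbscan` and the sample points are made up for the example.

```python
import math
from collections import deque

NOISE = -1


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Illustrative DBSCAN: returns one integer label per point, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            # Not a core sample; it may still be claimed later as a border point.
            labels[i] = NOISE
            continue
        # points[i] is a core sample: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core sample
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core, keep expanding
    return labels


# Hypothetical 2-D points: two dense groups and one isolated point.
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point never has enough neighbors within `eps`, so it keeps the noise label -1.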
Attributes¶
- eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
- min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If `min_samples` is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse.
- metric: The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by `sklearn.metrics.pairwise_distances` for its metric parameter.
- n_jobs: The number of parallel jobs to run.
Runtime Parameters¶
- eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
- min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If `min_samples` is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse.
- metric: The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by `sklearn.metrics.pairwise_distances` for its metric parameter.
- n_jobs: The number of parallel jobs to run.
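The effect of `eps` can be seen by calling the underlying `sklearn.cluster.DBSCAN` estimator directly. The sketch below assumes `numpy` and `scikit-learn` are installed; the sample projections and the variable names are hypothetical, and it bypasses the step itself to keep the example small.

```python
import numpy as np
from sklearn.cluster import DBSCAN as SklearnDBSCAN

# Hypothetical 2-D projections: two tight groups and one isolated point.
projections = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
    [10.0, 10.0],
])

# Small eps: only very close points count as neighbors.
tight = SklearnDBSCAN(eps=0.5, min_samples=3, metric="euclidean", n_jobs=1)
tight_labels = tight.fit_predict(projections).tolist()

# Large eps: every point is a neighbor of every other point.
loose = SklearnDBSCAN(eps=20.0, min_samples=3, metric="euclidean", n_jobs=1)
loose_labels = loose.fit_predict(projections).tolist()

print(tight_labels)  # [0, 0, 0, 1, 1, 1, -1]
print(loose_labels)  # [0, 0, 0, 0, 0, 0, 0]
```

With the small `eps` the isolated point is labeled -1, while the large `eps` merges everything into a single cluster, which is why `eps` needs to be tuned to the scale of the projections and the chosen metric.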
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[projection]
end
subgraph New columns
OCOL0[cluster_label]
end
end
subgraph DBSCAN
StepInput[Input Columns: projection]
StepOutput[Output Columns: cluster_label]
end
ICOL0 --> StepInput
StepOutput --> OCOL0
StepInput --> StepOutput
Inputs¶
- projection (`List[float]`): Vector representation of the text to cluster, normally the output from the `UMAP` step.
Outputs¶
- cluster_label (`int`): Integer label of the cluster the sample was assigned to. -1 means the sample was not assigned to any cluster (noise).
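Downstream code usually needs to treat the -1 label separately from real clusters. A minimal sketch of that, assuming each row is a dict carrying the `cluster_label` column; the `text` column, the sample rows, and the helper `group_by_cluster` are hypothetical:

```python
from collections import defaultdict


def group_by_cluster(rows):
    """Split rows into {label: [rows]} for real clusters and a list of noise rows."""
    clusters = defaultdict(list)
    noise = []
    for row in rows:
        label = row["cluster_label"]
        if label == -1:
            noise.append(row)  # not assigned to any cluster
        else:
            clusters[label].append(row)
    return dict(clusters), noise


# Hypothetical rows as they might look after the step has run.
rows = [
    {"text": "how to bake bread", "cluster_label": 0},
    {"text": "sourdough starter tips", "cluster_label": 0},
    {"text": "fix a flat bike tire", "cluster_label": 1},
    {"text": "random unrelated note", "cluster_label": -1},
]

clusters, noise = group_by_cluster(rows)
print(sorted(clusters))  # [0, 1]
print(len(noise))        # 1
```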