FaissNearestNeighbour¶
Create a faiss index to get the nearest neighbours.
FaissNearestNeighbour is a GlobalStep that creates a faiss index using the Hugging Face datasets library integration, and then gets the nearest neighbours and the scores or distances of the nearest neighbours for each input row.
Attributes¶
- device: the CUDA device ID or a list of IDs to be used. If a negative integer is passed, all the available GPUs are used. Defaults to None.
- string_factory: the name of the factory to be used to build the faiss index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None.
- metric_type: the metric used to measure the distance between the points. It is an integer, and the recommended way to pass it is to import faiss and then pass one of the faiss.METRIC_x variables. Defaults to None.
- k: the number of nearest neighbours to search for each input row. Defaults to 1.
- search_batch_size: the number of rows to include in a search batch. The value can be adjusted to maximize resource usage or to avoid out-of-memory (OOM) issues. Defaults to 50.
- train_size: if the index needs a training step, specifies how many vectors will be used to train the index.
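The choice of metric_type changes how the returned scores should be read. The following is a minimal pure-Python sketch, not the faiss implementation. It assumes the two most common faiss metrics: faiss.METRIC_L2, which reports the squared Euclidean distance (lower means closer), and faiss.METRIC_INNER_PRODUCT (higher means closer). The helper names squared_l2 and inner_product are hypothetical and exist only for illustration.

```python
from typing import List


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance, the quantity reported for an L2 metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a: List[float], b: List[float]) -> float:
    """Dot product, the quantity reported for an inner-product metric."""
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.0]
near = [0.9, 0.1]
far = [0.0, 1.0]

# With an L2 metric, the closer vector has the smaller value.
l2_near, l2_far = squared_l2(query, near), squared_l2(query, far)

# With an inner-product metric, the closer vector has the larger value.
ip_near, ip_far = inner_product(query, near), inner_product(query, far)
```

When sorting or thresholding nn_scores downstream, it is worth checking which of the two conventions applies to the metric that was configured.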
Runtime Parameters¶
- device: the CUDA device ID or a list of IDs to be used. If a negative integer is passed, all the available GPUs are used. Defaults to None.
- string_factory: the name of the factory to be used to build the faiss index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None.
- metric_type: the metric used to measure the distance between the points. It is an integer, and the recommended way to pass it is to import faiss and then pass one of the faiss.METRIC_x variables. Defaults to None.
- k: the number of nearest neighbours to search for each input row. Defaults to 1.
- search_batch_size: the number of rows to include in a search batch. The value can be adjusted to maximize resource usage or to avoid out-of-memory (OOM) issues. Defaults to 50.
- train_size: if the index needs a training step, specifies how many vectors will be used to train the index.
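To illustrate the trade-off behind search_batch_size, here is a small sketch of how a set of rows might be split into search batches. How the step batches rows internally is an assumption; the function name iter_search_batches is hypothetical and only shows the chunking idea. Larger batches mean fewer search calls but more memory per call.

```python
from typing import Iterator, List, Sequence


def iter_search_batches(
    rows: Sequence[List[float]], search_batch_size: int = 50
) -> Iterator[Sequence[List[float]]]:
    """Yield consecutive slices of at most `search_batch_size` rows."""
    if search_batch_size < 1:
        raise ValueError("search_batch_size must be a positive integer")
    for start in range(0, len(rows), search_batch_size):
        yield rows[start : start + search_batch_size]


# 120 dummy two-dimensional embeddings, used only to show the batch sizes.
embeddings = [[float(i), 0.0] for i in range(120)]
batch_sizes = [len(batch) for batch in iter_search_batches(embeddings)]
```

With the default of 50, the 120 rows above are processed as batches of 50, 50, and 20 rows.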
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[embedding]
        end
        subgraph New columns
            OCOL0[nn_indices]
            OCOL1[nn_scores]
        end
    end
    subgraph FaissNearestNeighbour
        StepInput[Input Columns: embedding]
        StepOutput[Output Columns: nn_indices, nn_scores]
    end
    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepInput --> StepOutput
Inputs¶
- embedding (List[Union[float, int]]): a sentence embedding.
Outputs¶
- nn_indices (List[int]): a list containing the indices of the k nearest neighbours in the inputs for the row.
- nn_scores (List[float]): a list containing the score or distance to each of the k nearest neighbours in the inputs.
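The shape of the two output columns can be illustrated with a brute-force stand-in for the index. This is a sketch under stated assumptions, not the step's implementation: it uses squared L2 distance instead of a faiss index, and it assumes that a row is not counted as its own neighbour. The function name brute_force_neighbours is hypothetical.

```python
from typing import List, Tuple


def brute_force_neighbours(
    embeddings: List[List[float]], k: int = 1
) -> Tuple[List[List[int]], List[List[float]]]:
    """Return, for every row, the indices and squared L2 distances of its k nearest rows."""
    nn_indices: List[List[int]] = []
    nn_scores: List[List[float]] = []
    for i, query in enumerate(embeddings):
        candidates = []
        for j, other in enumerate(embeddings):
            if i == j:
                continue  # Assumption: a row is not its own neighbour.
            distance = sum((x - y) ** 2 for x, y in zip(query, other))
            candidates.append((distance, j))
        candidates.sort()  # Smallest distance first.
        top = candidates[:k]
        nn_indices.append([j for _, j in top])
        nn_scores.append([d for d, _ in top])
    return nn_indices, nn_scores


embeddings = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
nn_indices, nn_scores = brute_force_neighbours(embeddings, k=2)
```

Each input row gets one list of k indices and one list of k scores, aligned position by position, which mirrors the nn_indices and nn_scores columns described above.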
Examples¶
Generating embeddings and getting the nearest neighbours¶
from distilabel.embeddings.sentence_transformers import SentenceTransformerEmbeddings
from distilabel.pipeline import Pipeline
from distilabel.steps import EmbeddingGeneration, FaissNearestNeighbour, LoadDataFromHub

with Pipeline(name="hello") as pipeline:
    # Load the dataset and rename its "prompt" column to "text".
    load_data = LoadDataFromHub(output_mappings={"prompt": "text"})

    # Compute one embedding per row with a Sentence Transformers model.
    embeddings = EmbeddingGeneration(
        embeddings=SentenceTransformerEmbeddings(
            model="mixedbread-ai/mxbai-embed-large-v1"
        )
    )

    # Build the faiss index and look up the nearest neighbours of each row.
    nearest_neighbours = FaissNearestNeighbour()

    load_data >> embeddings >> nearest_neighbours

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            load_data.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
        },
        use_cache=False,
    )