FaissNearestNeighbour
Create a faiss index to get the nearest neighbours.
FaissNearestNeighbour is a GlobalStep that builds a faiss index through the Hugging
Face datasets library integration. For each input row, it then returns the indices of
the nearest neighbours together with their scores or distances.
Attributes
- device: the CUDA device ID or a list of IDs to be used. If it is a negative integer, all the available GPUs will be used. Defaults to None.
- string_factory: the name of the factory to be used to build the faiss index. The available string factories are listed at https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None.
- metric_type: the metric used to measure the distance between points. It is an integer, and the recommended way to pass it is to import faiss and pass one of the faiss.METRIC_x variables. Defaults to None.
- k: the number of nearest neighbours to search for each input row. Defaults to 1.
- search_batch_size: the number of rows to include in a search batch. The value can be adjusted to maximize resource usage or to avoid OOM issues. Defaults to 50.
- train_size: if the index needs a training step, the number of vectors used to train the index.
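To make the role of search_batch_size more concrete, the following is a minimal sketch of how rows could be split into search batches. It is an illustration only, not the step's actual implementation; the helper name `iter_search_batches` is hypothetical.

```python
from typing import Iterator, List, Sequence


def iter_search_batches(
    rows: Sequence[List[float]], search_batch_size: int = 50
) -> Iterator[Sequence[List[float]]]:
    """Yield consecutive slices holding at most `search_batch_size` rows."""
    if search_batch_size < 1:
        raise ValueError("search_batch_size must be a positive integer")
    for start in range(0, len(rows), search_batch_size):
        yield rows[start : start + search_batch_size]


# 120 toy two-dimensional embeddings (made-up data for illustration)
rows = [[float(i), float(i) + 0.5] for i in range(120)]

batch_sizes = [len(batch) for batch in iter_search_batches(rows, search_batch_size=50)]
print(batch_sizes)  # [50, 50, 20]
```

A smaller batch size lowers the peak memory needed per search call at the cost of more calls, which is the trade-off behind the OOM note above.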
Runtime Parameters
- device: the CUDA device ID or a list of IDs to be used. If it is a negative integer, all the available GPUs will be used. Defaults to None.
- string_factory: the name of the factory to be used to build the faiss index. The available string factories are listed at https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None.
- metric_type: the metric used to measure the distance between points. It is an integer, and the recommended way to pass it is to import faiss and pass one of the faiss.METRIC_x variables. Defaults to None.
- k: the number of nearest neighbours to search for each input row. Defaults to 1.
- search_batch_size: the number of rows to include in a search batch. The value can be adjusted to maximize resource usage or to avoid OOM issues. Defaults to 50.
- train_size: if the index needs a training step, the number of vectors used to train the index.
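As a rough sketch, runtime parameters can be collected in a dictionary keyed by step name, in the same shape the example on this page uses for the data loading step. The step name below is a hypothetical placeholder, and the chosen values are arbitrary; in a real pipeline the key would be `nearest_neighbours.name`.

```python
# Hypothetical step name used only for illustration; in a real pipeline,
# use the `name` attribute of the FaissNearestNeighbour instance instead.
step_name = "faiss_nearest_neighbour_0"

runtime_parameters = {
    step_name: {
        # Return the 5 nearest neighbours instead of the default 1
        "k": 5,
        # Search 100 rows per batch instead of the default 50
        "search_batch_size": 100,
        # "Flat" is one of the factory strings described in the faiss wiki
        "string_factory": "Flat",
    }
}

# The dictionary would then be passed as `pipeline.run(parameters=runtime_parameters)`.
print(runtime_parameters[step_name])
```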
Input & Output Columns
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[embedding]
        end
        subgraph New columns
            OCOL0[nn_indices]
            OCOL1[nn_scores]
        end
    end
    subgraph FaissNearestNeighbour
        StepInput[Input Columns: embedding]
        StepOutput[Output Columns: nn_indices, nn_scores]
    end
    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepInput --> StepOutput
Inputs
- embedding (List[Union[float, int]]): a sentence embedding.
Outputs
- nn_indices (List[int]): a list containing the indices of the k nearest neighbours in the inputs for the row.
- nn_scores (List[float]): a list containing the score or distance to each of the k nearest neighbours in the inputs.
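The shape of the two output columns can be illustrated with a brute-force search in plain Python. This is a sketch under stated assumptions, not how faiss works internally: the function name `brute_force_neighbours` is hypothetical, it uses Euclidean distance, and it skips the query row itself. Actual score values from faiss depend on the configured metric_type (for instance, faiss reports squared distances for its L2 metric).

```python
import math
from typing import List, Tuple


def brute_force_neighbours(
    embeddings: List[List[float]], k: int = 1
) -> Tuple[List[List[int]], List[List[float]]]:
    """Return, for every row, the indices and distances of its k closest rows."""
    nn_indices: List[List[int]] = []
    nn_scores: List[List[float]] = []
    for i, query in enumerate(embeddings):
        # Distance to every other row; skipping the row itself is an
        # assumption made by this sketch.
        candidates = [
            (math.dist(query, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        candidates.sort()  # closest first
        top = candidates[:k]
        nn_indices.append([j for _, j in top])
        nn_scores.append([distance for distance, _ in top])
    return nn_indices, nn_scores


# Four toy two-dimensional embeddings (made-up data for illustration)
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
nn_indices, nn_scores = brute_force_neighbours(embeddings, k=2)

print(nn_indices)    # [[1, 2], [0, 2], [0, 1], [2, 1]]
print(nn_scores[0])  # [1.0, 3.0]
```

Each row ends up with one list of k indices and one list of k scores, which matches the List[int] and List[float] types of nn_indices and nn_scores.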
Examples
Generating embeddings and getting the nearest neighbours
from distilabel.models import SentenceTransformerEmbeddings
from distilabel.pipeline import Pipeline
from distilabel.steps import EmbeddingGeneration, FaissNearestNeighbour, LoadDataFromHub

with Pipeline(name="hello") as pipeline:
    # Load the dataset and rename its "prompt" column to "text"
    load_data = LoadDataFromHub(output_mappings={"prompt": "text"})

    # Compute an embedding for each row
    embeddings = EmbeddingGeneration(
        embeddings=SentenceTransformerEmbeddings(
            model="mixedbread-ai/mxbai-embed-large-v1"
        )
    )

    # Build the faiss index and look up the nearest neighbours (k defaults to 1)
    nearest_neighbours = FaissNearestNeighbour()

    load_data >> embeddings >> nearest_neighbours

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            load_data.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
        },
        use_cache=False,
    )