File HNSW.h

namespace faiss

Implementation of k-means clustering with many variants.

Copyright (c) Facebook, Inc. and its affiliates.

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

IDSelector is intended to define a subset of vectors to handle (for removal or as subset to search)

PQ4 SIMD packing and accumulation functions

The basic kernel accumulates nq query vectors with bbs = nb * 2 * 16 vectors and produces an output matrix for that. It is interesting for nq * nb <= 4, otherwise register spilling becomes too large.

The implementation of these functions is spread over 3 cpp files to reduce parallel compile times. Templates are instantiated explicitly.

This file contains callbacks for kernels that compute distances.

Throughout the library, vectors are provided as float * pointers. Most algorithms can be optimized when several vectors are processed (added/searched) together in a batch. In this case, they are passed in as a matrix. When n vectors of size d are provided as float * x, component j of vector i is

x[ i * d + j ]

where 0 <= i < n and 0 <= j < d. In other words, matrices are always compact. When specifying the size of the matrix, we call it an n*d matrix, which implies a row-major storage.

I/O functions can read/write to a filename, a file handle or to an object that abstracts the medium.

The read functions return objects that should be deallocated with delete. All references within these objectes are owned by the object.

Definition of inverted lists + a few common classes that implement the interface.

Since IVF (inverted file) indexes are of so much use for large-scale use cases, we group a few functions related to them in this small library. Most functions work both on IndexIVFs and IndexIVFs embedded within an IndexPreTransform.

In this file are the implementations of extra metrics beyond L2 and inner product

Implements a few neural net layers, mainly to support QINCo

Defines a few objects that apply transformations to a set of vectors Often these are pre-processing steps.

Functions

int search_from_candidates(const HNSW &hnsw, DistanceComputer &qdis, ResultHandler<HNSW::C> &res, HNSW::MinimaxHeap &candidates, VisitedTable &vt, HNSWStats &stats, int level, int nres_in = 0, const SearchParametersHNSW *params = nullptr)
HNSWStats greedy_update_nearest(const HNSW &hnsw, DistanceComputer &qdis, int level, HNSW::storage_idx_t &nearest, float &d_nearest)
std::priority_queue<HNSW::Node> search_from_candidate_unbounded(const HNSW &hnsw, const HNSW::Node &node, DistanceComputer &qdis, int ef, VisitedTable *vt, HNSWStats &stats)
void search_neighbors_to_add(HNSW &hnsw, DistanceComputer &qdis, std::priority_queue<HNSW::NodeDistCloser> &results, int entry_point, float d_entry_point, int level, VisitedTable &vt, bool reference_version = false)

Variables

FAISS_API HNSWStats hnsw_stats
struct SearchParametersHNSW : public faiss::SearchParameters

Public Functions

inline ~SearchParametersHNSW()

Public Members

int efSearch = 16
bool check_relative_distance = true
bool bounded_queue = true
struct HNSW

Public Types

using storage_idx_t = int32_t

internal storage of vectors (32 bits: this is expensive)

using C = CMax<float, int64_t>
typedef std::pair<float, storage_idx_t> Node

Public Functions

void set_default_probas(int M, float levelMult)

initialize the assign_probas and cum_nneighbor_per_level to have 2*M links on level 0 and M links on levels > 0

void set_nb_neighbors(int level_no, int n)

set nb of neighbors for this level (before adding anything)

int nb_neighbors(int layer_no) const

nb of neighbors for this level

int cum_nb_neighbors(int layer_no) const

cumumlative nb up to (and excluding) this level

void neighbor_range(idx_t no, int layer_no, size_t *begin, size_t *end) const

range of entries in the neighbors table of vertex no at layer_no

explicit HNSW(int M = 32)

only mandatory parameter: nb of neighbors

int random_level()

pick a random level for a new point

void fill_with_random_links(size_t n)

add n random levels to table (for debugging…)

void add_with_locks(DistanceComputer &ptdis, int pt_level, int pt_id, std::vector<omp_lock_t> &locks, VisitedTable &vt, bool keep_max_size_level0 = false)

add point pt_id on all levels <= pt_level and build the link structure for them.

HNSWStats search(DistanceComputer &qdis, ResultHandler<C> &res, VisitedTable &vt, const SearchParametersHNSW *params = nullptr) const

search interface for 1 point, single thread

void search_level_0(DistanceComputer &qdis, ResultHandler<C> &res, idx_t nprobe, const storage_idx_t *nearest_i, const float *nearest_d, int search_type, HNSWStats &search_stats, VisitedTable &vt, const SearchParametersHNSW *params = nullptr) const

search only in level 0 from a given vertex

void reset()
void clear_neighbor_tables(int level)
void print_neighbor_stats(int level) const
int prepare_level_tab(size_t n, bool preset_levels = false)
void permute_entries(const idx_t *map)

Public Members

std::vector<double> assign_probas

assignment probability to each layer (sum=1)

std::vector<int> cum_nneighbor_per_level

number of neighbors stored per layer (cumulative), should not be changed after first add

std::vector<int> levels

level of each vector (base level = 1), size = ntotal

std::vector<size_t> offsets

offsets[i] is the offset in the neighbors array where vector i is stored size ntotal + 1

std::vector<storage_idx_t> neighbors

neighbors[offsets[i]:offsets[i+1]] is the list of neighbors of vector i for all levels. this is where all storage goes.

storage_idx_t entry_point = -1

entry point in the search structure (one of the points with maximum level

faiss::RandomGenerator rng
int max_level = -1

maximum level

int efConstruction = 40

expansion factor at construction time

int efSearch = 16

expansion factor at search time

bool check_relative_distance = true

during search: do we check whether the next best distance is good enough?

bool search_bounded_queue = true

use bounded queue during exploration

Public Static Functions

static void shrink_neighbor_list(DistanceComputer &qdis, std::priority_queue<NodeDistFarther> &input, std::vector<NodeDistFarther> &output, int max_size, bool keep_max_size_level0 = false)
struct MinimaxHeap
#include <HNSW.h>

Heap structure that allows fast

Public Types

typedef faiss::CMax<float, storage_idx_t> HC

Public Functions

inline explicit MinimaxHeap(int n)
void push(storage_idx_t i, float v)
float max() const
int size() const
void clear()
int pop_min(float *vmin_out = nullptr)
int count_below(float thresh)

Public Members

int n
int k
int nvalid
std::vector<storage_idx_t> ids
std::vector<float> dis
struct NodeDistCloser
#include <HNSW.h>

to sort pairs of (id, distance) from nearest to fathest or the reverse

Public Functions

inline NodeDistCloser(float d, int id)
inline bool operator<(const NodeDistCloser &obj1) const

Public Members

float d
int id
struct NodeDistFarther

Public Functions

inline NodeDistFarther(float d, int id)
inline bool operator<(const NodeDistFarther &obj1) const

Public Members

float d
int id
struct HNSWStats

Public Functions

inline void reset()

number of hops aka number of edges traversed

inline void combine(const HNSWStats &other)

Public Members

size_t n1 = 0
size_t n2 = 0

number of vectors searched

size_t ndis = 0

number of queries for which the candidate list is exhausted

size_t nhops = 0

number of distances computed