File utils.h

namespace faiss

Implementation of k-means clustering with many variants.

Copyright (c) Facebook, Inc. and its affiliates.

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

IDSelector is intended to define a subset of vectors to handle (for removal or as subset to search)

PQ4 SIMD packing and accumulation functions

The basic kernel accumulates nq query vectors with bbs = nb * 2 * 16 vectors and produces an output matrix for that. It is interesting for nq * nb <= 4, otherwise register spilling becomes too large.

The implementation of these functions is spread over 3 cpp files to reduce parallel compile times. Templates are instantiated explicitly.

This file contains callbacks for kernels that compute distances.

Throughout the library, vectors are provided as float * pointers. Most algorithms can be optimized when several vectors are processed (added/searched) together in a batch. In this case, they are passed in as a matrix. When n vectors of size d are provided as float * x, component j of vector i is

x[ i * d + j ]

where 0 <= i < n and 0 <= j < d. In other words, matrices are always compact. When specifying the size of the matrix, we call it an n*d matrix, which implies a row-major storage.

I/O functions can read/write to a filename, a file handle or to an object that abstracts the medium.

The read functions return objects that should be deallocated with delete. All references within these objectes are owned by the object.

Definition of inverted lists + a few common classes that implement the interface.

Since IVF (inverted file) indexes are of so much use for large-scale use cases, we group a few functions related to them in this small library. Most functions work both on IndexIVFs and IndexIVFs embedded within an IndexPreTransform.

In this file are the implementations of extra metrics beyond L2 and inner product

Implements a few neural net layers, mainly to support QINCo

Defines a few objects that apply transformations to a set of vectors Often these are pre-processing steps.

Functions

std::string get_compile_options(): get compile options

std::string get_version()

double getmillisecs(): ms elapsed since some arbitrary epoch

size_t get_mem_usage_kb(): get current RSS usage in kB

uint64_t get_cycles()

void reflection(const float *u, float *x, size_t n, size_t d, size_t nu)

void matrix_qr(int m, int n, float *a)

compute the Q of the QR decomposition for m > n

Parameters:: a – size n * m: input matrix and output Q

void ranklist_handle_ties(int k, int64_t *idx, const float *dis): distances are supposed to be sorted. Sorts indices with same distance

size_t ranklist_intersection_size(size_t k1, const int64_t *v1, size_t k2, const int64_t *v2): count the number of common elements between v1 and v2 algorithm = sorting + bissection to avoid double-counting duplicates

size_t merge_result_table_with(size_t n, size_t k, int64_t *I0, float *D0, const int64_t *I1, const float *D1, bool keep_min = true, int64_t translation = 0)

merge a result table into another one

Parameters:

I0, D0 – first result table, size (n, k)
I1, D1 – second result table, size (n, k)
keep_min – if true, keep min values, otherwise keep max
translation – add this value to all I1’s indexes

Returns:

nb of values that were taken from the second table

double imbalance_factor(int n, int k, const int64_t *assign): a balanced assignment has a IF of 1

double imbalance_factor(int k, const int *hist): same, takes a histogram as input

int ivec_hist(size_t n, const int *v, int vmax, int *hist): compute histogram on v

void bincode_hist(size_t n, size_t nbits, const uint8_t *codes, int *hist)

Compute histogram of bits on a code array

Parameters:

codes – size(n, nbits / 8)
hist – size(nbits): nb of 1s in the array of codes

uint64_t ivec_checksum(size_t n, const int32_t *a): compute a checksum on a table.

uint64_t bvec_checksum(size_t n, const uint8_t *a): compute a checksum on a table.

void bvecs_checksum(size_t n, size_t d, const uint8_t *a, uint64_t *cs)

compute checksums for the rows of a matrix

Parameters:

n – number of rows
d – size per row
a – matrix to handle, size n * d
cs – output checksums, size n

const float *fvecs_maybe_subsample(size_t d, size_t *n, size_t nmax, const float *x, bool verbose = false, int64_t seed = 1234)

random subsamples a set of vectors if there are too many of them

Parameters:

d – dimension of the vectors
n – on input: nb of input vectors, output: nb of output vectors
nmax – max nb of vectors to keep
x – input array, size *n-by-d
seed – random seed to use for sampling

Returns:

x or an array allocated with new [] with *n vectors

void binary_to_real(size_t d, const uint8_t *x_in, float *x_out)

Convert binary vector to +1/-1 valued float vector.

Parameters:

d – dimension of the vector (multiple of 8)
x_in – input binary vector (uint8_t table of size d / 8)
x_out – output float vector (float table of size d)

void real_to_binary(size_t d, const float *x_in, uint8_t *x_out)

Convert float vector to binary vector. Components > 0 are converted to 1, others to 0.

Parameters:

d – dimension of the vector (multiple of 8)
x_in – input float vector (float table of size d)
x_out – output binary vector (uint8_t table of size d / 8)

uint64_t hash_bytes(const uint8_t *bytes, int64_t n): A reasonable hashing function

bool check_openmp(): Whether OpenMP annotations were respected.

template<typename T> struct CombinerRangeKNN

#include <utils.h>

This class is used to combine range and knn search results in contrib.exhaustive_search.range_search_gpu

Public Functions

inline CombinerRangeKNN(int64_t nq, size_t k, T r2, bool keep_max): whether to keep max values instead of min.

void compute_sizes(int64_t *L_res): size nq + 1

void write_result(T *D_res, int64_t *I_res): Phase 2: caller allocates D_res and I_res (size L_res[nq]) Phase 3: fill in D_res and I_res

Public Members

int64_t nq

size_t k: nb of queries

T r2: number of neighbors for the knn search part

bool keep_max: range search radius

const int64_t *I = nullptr: Knn search results.

const T *D = nullptr: size nq * k

const bool *mask = nullptr

size nq * k

optional: range search results (ignored if mask is NULL)

const int64_t *lim_remain = nullptr: mask for where knn results are valid, size nq

const T *D_remain = nullptr: size nrange + 1

const int64_t *I_remain = nullptr: size lim_remain[nrange]

const int64_t *L_res = nullptr: size lim_remain[nrange]

struct CodeSet

Public Functions

inline explicit CodeSet(size_t d)

void insert(size_t n, const uint8_t *codes, bool *inserted)

Public Members

size_t d

std::set<std::vector<uint8_t>> s