Struct faiss::ProgressiveDimClustering

struct ProgressiveDimClustering : public faiss::ProgressiveDimClusteringParameters

K-means clustering that uses progressively increasing dimensions.

Clustering first runs in dimension 1, then in exponentially increasing dimensions until it reaches d (I steps in total). It is typically applied after an optional PCA transformation. Reference:

“Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search”

Shicong Liu, Hongtao Lu, Junru Shao, AAAI’15

https://arxiv.org/abs/1509.05195
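The exponentially increasing dimension schedule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `progressive_dims` is a hypothetical helper, and the formula d^((i+1)/steps) with truncation is only one plausible geometric schedule. With that formula the first stage works in dimension 1 only when d^(1/steps) is below 2.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not part of the Faiss API): returns the working
// dimension of each of `steps` stages, growing geometrically up to d.
std::vector<int> progressive_dims(int d, int steps) {
    assert(d >= 1 && steps >= 1);
    std::vector<int> dims;
    for (int i = 0; i < steps; i++) {
        // Assumed schedule: d^((i + 1) / steps), truncated to an integer.
        int di = static_cast<int>(
                std::pow(static_cast<double>(d), (1.0 + i) / steps));
        if (di < 1) di = 1; // clamp to the valid range [1, d]
        if (di > d) di = d;
        dims.push_back(di);
    }
    // Force the last stage to the full dimension, guarding against
    // floating-point truncation in std::pow.
    dims.back() = d;
    return dims;
}
```

For example, `progressive_dims(128, 10)` starts at dimension 1 and ends at dimension 128, with non-decreasing values in between.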

Public Functions

ProgressiveDimClustering(int d, int k)

ProgressiveDimClustering(int d, int k, const ProgressiveDimClusteringParameters &cp)

void train(idx_t n, const float *x, ProgressiveDimIndexFactory &factory)

inline virtual ~ProgressiveDimClustering()

Public Members

size_t d: dimension of the vectors

size_t k: number of centroids

std::vector<float> centroids: centroids (k * d)

std::vector<ClusteringIterationStats> iteration_stats: stats at every iteration of clustering

int progressive_dim_steps: number of incremental steps

bool apply_pca: apply PCA on input

int niter = 25: number of clustering iterations

int nredo = 1: redo clustering this many times and keep the clusters with the best objective

bool verbose = false

bool spherical = false: whether to normalize centroids after each iteration (useful for inner product clustering)

bool int_centroids = false: whether to round centroid coordinates to integers after each iteration

bool update_index = false: whether to re-train the index after each iteration

bool frozen_centroids = false: Use the subset of centroids provided as input and do not change them during iterations

int min_points_per_centroid = 39: if fewer than this many training vectors per centroid are provided, a warning is written. Note that fewer than 1 point per centroid raises an exception.

int max_points_per_centroid = 256: maximum number of training vectors per centroid; if the training set exceeds this limit, it is subsampled

int seed = 1234: seed for the random number generator. Negative values seed an internal RNG from std::chrono::high_resolution_clock.

size_t decode_block_size = 32768: when the training set is encoded, batch size of the codec decoder

bool check_input_data_for_NaNs = true: whether to check for NaNs in the input data

bool use_faster_subsampling = false: Whether to use splitmix64-based random number generator for subsampling, which is faster, but may pick duplicate points.
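The way min_points_per_centroid and max_points_per_centroid bound the training set can be illustrated with a little arithmetic. The sketch below is an assumption-laden illustration, not Faiss code: `TrainingSetPlan` and `plan_training_set` are hypothetical names, and the thresholds are taken to be k times the per-centroid values as the member descriptions suggest.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary type (not a Faiss type) describing how a training
// set of n vectors relates to the per-centroid bounds.
struct TrainingSetPlan {
    std::size_t n_used; // number of vectors kept for training
    bool subsampled;    // true if n exceeded k * max_points_per_centroid
    bool warn_too_few;  // true if n is below k * min_points_per_centroid
};

// Hypothetical helper: applies the documented defaults (39 and 256).
TrainingSetPlan plan_training_set(
        std::size_t n,
        std::size_t k,
        int min_points_per_centroid = 39,
        int max_points_per_centroid = 256) {
    // Fewer than 1 point per centroid is documented as an error.
    assert(k > 0 && n >= k);
    TrainingSetPlan plan{n, false, false};
    std::size_t max_n = k * static_cast<std::size_t>(max_points_per_centroid);
    std::size_t min_n = k * static_cast<std::size_t>(min_points_per_centroid);
    if (n > max_n) {
        plan.n_used = max_n; // training set would be subsampled to this size
        plan.subsampled = true;
    } else if (n < min_n) {
        plan.warn_too_few = true; // clustering proceeds, with a warning
    }
    return plan;
}
```

With k = 100 and the default bounds, a training set of 1,000,000 vectors would be cut to 25,600, while a set of 2,000 vectors falls below the 3,900 needed to avoid the warning.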