Struct faiss::Clustering1D

struct Clustering1D : public faiss::Clustering

Exact 1D clustering algorithm

Since it does not use an index, it does not override the train() function
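To illustrate what "exact" means in one dimension, here is a minimal sketch. It assumes the objective is the usual k-means sum of squared distances. The helper `exact_kmeans_1d` is hypothetical and is not the Faiss implementation, which may use a faster algorithm. In 1D an optimal clustering always consists of contiguous runs of the sorted points, so a simple O(k n^2) dynamic program over cut positions finds the global optimum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of Faiss): exact 1D k-means by dynamic
// programming. Returns the k centroids in increasing order; needs 1 <= k <= n.
std::vector<float> exact_kmeans_1d(std::vector<float> x, std::size_t k) {
    const std::size_t n = x.size();
    assert(k >= 1 && k <= n);
    std::sort(x.begin(), x.end());

    // Prefix sums give the cost of any contiguous cluster in O(1).
    std::vector<double> s1(n + 1, 0.0), s2(n + 1, 0.0);
    for (std::size_t i = 0; i < n; i++) {
        s1[i + 1] = s1[i] + x[i];
        s2[i + 1] = s2[i] + double(x[i]) * x[i];
    }
    // Sum of squared distances to the mean for the sorted points [i, j).
    auto cost = [&](std::size_t i, std::size_t j) {
        double sum = s1[j] - s1[i];
        return (s2[j] - s2[i]) - sum * sum / double(j - i);
    };

    const double inf = std::numeric_limits<double>::infinity();
    // best[c][j]: minimal cost of splitting the first j points into c clusters.
    std::vector<std::vector<double>> best(k + 1, std::vector<double>(n + 1, inf));
    std::vector<std::vector<std::size_t>> cut(k + 1, std::vector<std::size_t>(n + 1, 0));
    best[0][0] = 0.0;
    for (std::size_t c = 1; c <= k; c++) {
        for (std::size_t j = c; j <= n; j++) {
            for (std::size_t i = c - 1; i < j; i++) {
                double v = best[c - 1][i] + cost(i, j);
                if (v < best[c][j]) {
                    best[c][j] = v;
                    cut[c][j] = i;  // last cluster covers [i, j)
                }
            }
        }
    }

    // Walk the cut points backwards to recover each cluster's mean.
    std::vector<float> centroids(k);
    std::size_t j = n;
    for (std::size_t c = k; c >= 1; c--) {
        std::size_t i = cut[c][j];
        centroids[c - 1] = float((s1[j] - s1[i]) / double(j - i));
        j = i;
    }
    return centroids;
}
```

For the points 1, 2, 10, 11 and k = 2, the sketch splits the sorted data between 2 and 10 and returns the two cluster means.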

Public Functions

explicit Clustering1D(int k)
Clustering1D(int k, const ClusteringParameters &cp)
void train_exact(idx_t n, const float *x)
inline virtual ~Clustering1D()
virtual void train(idx_t n, const float *x, faiss::Index &index, const float *x_weights = nullptr)

run k-means training

Parameters:
  • x – training vectors, size n * d

  • index – index used for assignment

  • x_weights – weight associated to each vector: NULL or size n
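The parameter list above does not spell out how x_weights enters the computation. A common interpretation, assumed here, is that each centroid becomes the weighted mean of the vectors assigned to it. The sketch below shows that update for one iteration; `update_centroids` and its arguments are hypothetical names, not Faiss functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: weighted centroid update for one k-means iteration.
// x is n * d in row-major order, assign[i] is the centroid of vector i,
// weights is either empty (all weights equal to 1) or of size n.
std::vector<float> update_centroids(
        std::size_t n, std::size_t d, std::size_t k,
        const std::vector<float>& x,
        const std::vector<std::size_t>& assign,
        const std::vector<float>& weights) {
    assert(x.size() == n * d && assign.size() == n);
    assert(weights.empty() || weights.size() == n);

    std::vector<float> centroids(k * d, 0.0f);
    std::vector<float> mass(k, 0.0f);  // total weight assigned to each centroid
    for (std::size_t i = 0; i < n; i++) {
        const float w = weights.empty() ? 1.0f : weights[i];
        const std::size_t c = assign[i];
        assert(c < k);
        mass[c] += w;
        for (std::size_t j = 0; j < d; j++) {
            centroids[c * d + j] += w * x[i * d + j];
        }
    }
    // Divide by the accumulated weight; empty clusters are left at zero here.
    for (std::size_t c = 0; c < k; c++) {
        if (mass[c] > 0.0f) {
            for (std::size_t j = 0; j < d; j++) {
                centroids[c * d + j] /= mass[c];
            }
        }
    }
    return centroids;
}
```

How empty clusters are handled is a separate design decision that this sketch does not address.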

void train_encoded(idx_t nx, const uint8_t *x_in, const Index *codec, Index &index, const float *weights = nullptr)

run with encoded vectors

In addition to train()'s parameters, takes a codec used to decode the input vectors.

Parameters:

  • codec – codec used to decode the vectors (nullptr = vectors are in fact floats)

void post_process_centroids()

Post-process the centroids after each centroid update. Includes optional L2 normalization and rounding to the nearest integer.
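As a rough illustration of the two optional steps, the sketch below normalizes each centroid to unit L2 norm and then rounds each coordinate to the nearest integer. The function name and the order of the two steps are assumptions; the flags mirror the spherical and int_centroids members documented on this page.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of centroid post-processing (not the Faiss code).
// centroids holds k * d floats in row-major order.
void post_process_sketch(
        std::vector<float>& centroids, std::size_t d,
        bool spherical, bool int_centroids) {
    assert(d > 0 && centroids.size() % d == 0);
    const std::size_t k = centroids.size() / d;

    if (spherical) {
        for (std::size_t c = 0; c < k; c++) {
            float norm2 = 0.0f;
            for (std::size_t j = 0; j < d; j++) {
                norm2 += centroids[c * d + j] * centroids[c * d + j];
            }
            if (norm2 > 0.0f) {  // leave all-zero centroids untouched
                const float inv = 1.0f / std::sqrt(norm2);
                for (std::size_t j = 0; j < d; j++) {
                    centroids[c * d + j] *= inv;
                }
            }
        }
    }
    if (int_centroids) {
        for (float& v : centroids) {
            v = std::round(v);  // nearest integer, halfway cases away from zero
        }
    }
}
```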

Public Members

size_t d

dimension of the vectors

size_t k

number of centroids

std::vector<float> centroids

centroids (k * d). If centroids are set on input to train, they are used as the initialization.

std::vector<ClusteringIterationStats> iteration_stats

stats at every iteration of clustering

int niter = 25

number of clustering iterations

int nredo = 1

redo clustering this many times and keep the clusters with the best objective

bool verbose = false
bool spherical = false

whether to normalize centroids after each iteration (useful for inner product clustering)

bool int_centroids = false

round centroid coordinates to integers after each iteration?

bool update_index = false

re-train index after each iteration?

bool frozen_centroids = false

Use the subset of centroids provided as input and do not change them during iterations

int min_points_per_centroid = 39

If fewer than this number of training vectors per centroid are provided, writes a warning. Note that fewer than 1 point per centroid raises an exception.

int max_points_per_centroid = 256

Upper limit on the number of training points per centroid. If the training set is larger than this allows, it is subsampled.
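The two per-centroid limits can be read as a simple size check on the training set. The sketch below encodes that reading: fewer than k points is an error, fewer than k * min_points_per_centroid triggers a warning, and more than k * max_points_per_centroid triggers subsampling down to that bound. `TrainingSetPlan` and `plan_training_set` are hypothetical names, and the exact comparisons are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical summary of what happens to a training set of n points.
struct TrainingSetPlan {
    std::size_t n_used;   // number of points actually used for training
    bool subsampled;      // true if the set was cut down to k * max_ppc
    bool warn_too_few;    // true if n is below k * min_ppc
};

TrainingSetPlan plan_training_set(
        std::size_t n, std::size_t k,
        std::size_t min_ppc, std::size_t max_ppc) {
    assert(k > 0);
    if (n < k) {
        // Fewer than 1 point per centroid cannot be clustered.
        throw std::invalid_argument("fewer training points than centroids");
    }
    TrainingSetPlan plan{n, false, false};
    if (n > k * max_ppc) {
        plan.n_used = k * max_ppc;
        plan.subsampled = true;
    } else if (n < k * min_ppc) {
        plan.warn_too_few = true;
    }
    return plan;
}
```

With the defaults shown on this page (39 and 256) and k = 2, a set of 1000 points would be subsampled to 512 under this reading, while a set of 50 points would be used in full with a warning.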

int seed = 1234

Seed for the random number generator. Negative values cause an internal RNG to be seeded from std::high_resolution_clock.
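The seeding rule can be sketched with the standard library. The example below uses std::mt19937 purely for illustration; Faiss has its own random number generator, so `make_rng` and `random_subset` are hypothetical stand-ins. A non-negative seed gives a reproducible subset, while a negative seed takes its value from the clock and generally differs between runs.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: build an RNG following the seeding rule described above.
std::mt19937 make_rng(int seed) {
    if (seed >= 0) {
        return std::mt19937(static_cast<std::uint32_t>(seed));
    }
    // Negative seed: derive a seed from the high-resolution clock.
    auto ticks = std::chrono::high_resolution_clock::now()
                         .time_since_epoch()
                         .count();
    return std::mt19937(static_cast<std::uint32_t>(ticks));
}

// Hypothetical helper: choose m distinct indices out of n.
std::vector<std::size_t> random_subset(std::size_t n, std::size_t m, int seed) {
    assert(m <= n);
    std::vector<std::size_t> perm(n);
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::mt19937 rng = make_rng(seed);
    std::shuffle(perm.begin(), perm.end(), rng);
    perm.resize(m);
    return perm;
}
```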

size_t decode_block_size = 32768

when the training set is encoded, batch size of the codec decoder
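A block size matters because only one block of decoded floats needs to be held in memory at a time. The sketch below shows the batching pattern by computing the mean of an encoded training set. `DecodeFn` is a hypothetical stand-in for a codec's decode call and `mean_of_encoded` is not a Faiss function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical decoder: turns n codes of code_size bytes each into n * d floats.
using DecodeFn =
        std::function<void(std::size_t n, const std::uint8_t* codes, float* out)>;

// Mean of nx encoded vectors, decoding at most block_size vectors at a time.
std::vector<float> mean_of_encoded(
        std::size_t nx, std::size_t code_size, std::size_t d,
        const std::uint8_t* codes, std::size_t block_size,
        const DecodeFn& decode) {
    assert(nx > 0 && block_size > 0);
    std::vector<double> sum(d, 0.0);
    // The buffer is sized for one block, not for the whole training set.
    std::vector<float> buf(std::min(block_size, nx) * d);
    for (std::size_t i0 = 0; i0 < nx; i0 += block_size) {
        const std::size_t nb = std::min(block_size, nx - i0);
        decode(nb, codes + i0 * code_size, buf.data());
        for (std::size_t i = 0; i < nb; i++) {
            for (std::size_t j = 0; j < d; j++) {
                sum[j] += buf[i * d + j];
            }
        }
    }
    std::vector<float> mean(d);
    for (std::size_t j = 0; j < d; j++) {
        mean[j] = float(sum[j] / double(nx));
    }
    return mean;
}
```

A larger block size means fewer decode calls at the cost of a larger temporary buffer.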

bool check_input_data_for_NaNs = true

whether to check the input data for NaNs
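A check of this kind amounts to one linear scan over the n * d input floats. The sketch below uses a hypothetical `has_nan` helper; what the library does when it finds a NaN is not stated here, so the sketch only reports the result.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: returns true if any of the count floats is a NaN.
bool has_nan(const float* x, std::size_t count) {
    assert(x != nullptr || count == 0);
    for (std::size_t i = 0; i < count; i++) {
        if (std::isnan(x[i])) {
            return true;
        }
    }
    return false;
}
```

The scan touches every input value once, which is why a flag to skip it can be useful when the data is already known to be clean.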

bool use_faster_subsampling = false

Whether to use splitmix64-based random number generator for subsampling, which is faster, but may pick duplicate points.
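The trade-off can be seen in a small sketch. SplitMix64 is a public-domain generator with a single 64-bit state word. Drawing each index independently is cheap, but nothing prevents the same index from being drawn twice. `sample_with_replacement` is a hypothetical helper and may differ from how Faiss maps the generator output to indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// SplitMix64 generator (standard public-domain constants).
struct SplitMix64 {
    std::uint64_t state;
    explicit SplitMix64(std::uint64_t seed) : state(seed) {}
    std::uint64_t next() {
        std::uint64_t z = (state += 0x9e3779b97f4a7c15ULL);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        return z ^ (z >> 31);
    }
};

// Hypothetical helper: draw n_sample indices in [0, n), each independently.
// Duplicates are possible because no draw knows about the previous ones.
std::vector<std::size_t> sample_with_replacement(
        std::size_t n, std::size_t n_sample, std::uint64_t seed) {
    assert(n > 0);
    SplitMix64 rng(seed);
    std::vector<std::size_t> idx(n_sample);
    for (std::size_t& i : idx) {
        // The modulo reduction has a slight bias, ignored in this sketch.
        i = static_cast<std::size_t>(rng.next() % n);
    }
    return idx;
}
```

Avoiding duplicates would require a permutation or a set of already drawn indices, which is the extra work the faster path skips.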