vsmlib.model

The model module that implements embedding loading.

Functions

load_from_dir(path) Automatically detects the embeddings format and loads the embeddings.

Classes

Model() Basic model class to define interface.
ModelDense() Stores dense embeddings.
ModelLevy() This is deprecated and will be removed soon.
ModelNumbered() Extends the dense model by numbering dimensions.
ModelSparse() Sparse (usually count-based) embeddings.
ModelW2V() Extends ModelDense to support loading the original binary format of Mikolov’s word2vec (w2v).
Model_svd_scipy(original, …)

class vsmlib.model.Model

Bases: object

Basic model class to define interface.

Usually you would not use this class directly, but rather one of the classes that inherit from Model.

get_most_similar_words(w, cnt=10)

Returns a list of words sorted by cosine proximity to a target word.

Parameters:
  • w – target word
  • cnt – how many similar words are needed
Returns:

list of words and corresponding similarities
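To make "sorted by cosine proximity" concrete, here is a minimal pure-Python sketch of such a ranking. It is not vsmlib's implementation: the helper names `cosine` and `most_similar`, the dictionary-of-lists storage, and the (word, similarity) pair layout of the result are all assumptions made for illustration. Whether the library includes the target word itself in the result is not stated here; this sketch does.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(vectors, w, cnt=10):
    """Rank every word in `vectors` by cosine similarity to `w`, best first.

    Hypothetical helper: `vectors` maps word -> list of floats.
    Returns at most `cnt` (word, similarity) pairs.
    """
    target = vectors[w]
    scored = [(word, cosine(target, vec)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:cnt]


# Toy embeddings, made up for illustration only.
toy = {
    "apple": [1.0, 0.0],
    "pear": [0.9, 0.1],
    "car": [0.0, 1.0],
}
ranking = most_similar(toy, "apple", cnt=2)
```

In this toy data "apple" ranks first (it is compared with itself) and "pear" second, because its vector points in nearly the same direction.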

class vsmlib.model.ModelDense

Bases: vsmlib.model.Model

Stores dense embeddings.

filter_by_vocab(words)

Reduces embeddings to the provided list of words (which can be empty).

Parameters: words – set or list of words to keep
Returns: Instance of Dense class
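The idea of reducing embeddings to a vocabulary can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `restrict_to_words` is a hypothetical helper, the embeddings are modelled as a word-to-vector dictionary rather than a ModelDense instance, and an empty word list is assumed to yield an empty result.

```python
def restrict_to_words(vectors, words):
    """Return a new mapping holding only the entries whose word is in `words`.

    Hypothetical helper: `vectors` maps word -> vector, `words` is any
    iterable of words to keep. The input mapping is left unchanged.
    """
    keep = set(words)  # set membership tests are O(1) on average
    return {word: vec for word, vec in vectors.items() if word in keep}


# Toy embeddings, made up for illustration only.
toy = {
    "apple": [1.0, 0.0],
    "pear": [0.9, 0.1],
    "car": [0.0, 1.0],
}
subset = restrict_to_words(toy, ["apple", "car", "unknown"])
nothing = restrict_to_words(toy, [])
```

Words that are requested but absent from the embeddings ("unknown" above) are silently skipped in this sketch.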
get_most_similar_words(w, cnt=10)

Returns a list of words sorted by cosine proximity to a target word.

Parameters:
  • w – target word
  • cnt – how many similar words are needed
Returns:

list of words and corresponding similarities

load_hdf5(path)

Loads embeddings from HDF5 format.

load_npy(path)

Loads embeddings from NumPy format.

class vsmlib.model.ModelLevy

Bases: vsmlib.model.ModelNumbered

This is deprecated and will be removed soon.

filter_by_vocab(words)

Reduces embeddings to the provided list of words (which can be empty).

Parameters: words – set or list of words to keep
Returns: Instance of Dense class

get_most_similar_words(w, cnt=10)

Returns a list of words sorted by cosine proximity to a target word.

Parameters:
  • w – target word
  • cnt – how many similar words are needed
Returns:

list of words and corresponding similarities

load_hdf5(path)

Loads embeddings from HDF5 format.

load_npy(path)

Loads embeddings from NumPy format.

class vsmlib.model.ModelNumbered

Bases: vsmlib.model.ModelDense

Extends the dense model by numbering dimensions.

filter_by_vocab(words)

Reduces embeddings to the provided list of words (which can be empty).

Parameters: words – set or list of words to keep
Returns: Instance of Dense class

get_most_similar_words(w, cnt=10)

Returns a list of words sorted by cosine proximity to a target word.

Parameters:
  • w – target word
  • cnt – how many similar words are needed
Returns:

list of words and corresponding similarities

load_hdf5(path)

Loads embeddings from HDF5 format.

load_npy(path)

Loads embeddings from NumPy format.

class vsmlib.model.ModelSparse

Bases: vsmlib.model.Model

Sparse (usually count-based) embeddings.

get_most_similar_words(w, cnt=10)

Returns a list of words sorted by cosine proximity to a target word.

Parameters:
  • w – target word
  • cnt – how many similar words are needed
Returns:

list of words and corresponding similarities

load_from_hdf5(path)

Loads a model in compressed sparse row (CSR) format from an HDF5 file.

The HDF5 file should contain the row_ptr, col_ind and data arrays.

Parameters: path – path to the embeddings folder
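For readers unfamiliar with compressed sparse row storage, the sketch below shows how the three arrays named above describe a matrix under the standard CSR convention: the non-zeros of row i sit at positions row_ptr[i] up to (but not including) row_ptr[i + 1] in both col_ind and data. The helper `csr_row` and the toy arrays are made up for illustration. How vsmlib maps rows and columns to words is not specified here.

```python
def csr_row(row_ptr, col_ind, data, i):
    """Return row i of a CSR matrix as a {column: value} dict of its non-zeros."""
    start, end = row_ptr[i], row_ptr[i + 1]
    return {col_ind[k]: data[k] for k in range(start, end)}


# Toy 3 x 4 matrix, made up for illustration:
# [[5, 0, 0, 2],
#  [0, 0, 0, 0],
#  [0, 3, 0, 0]]
row_ptr = [0, 2, 2, 3]   # one entry per row, plus a final end marker
col_ind = [0, 3, 1]      # column index of each stored value
data = [5.0, 2.0, 3.0]   # the stored non-zero values

# The number of rows is len(row_ptr) - 1.
rows = [csr_row(row_ptr, col_ind, data, i) for i in range(len(row_ptr) - 1)]
```

An all-zero row, such as the second one here, shows up as two equal consecutive entries in row_ptr and therefore yields an empty dict.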
class vsmlib.model.ModelW2V

Bases: vsmlib.model.ModelNumbered

Extends ModelDense to support loading the original binary format of Mikolov’s word2vec (w2v).

filter_by_vocab(words)

Reduces embeddings to the provided list of words (which can be empty).

Parameters: words – set or list of words to keep
Returns: Instance of Dense class

get_most_similar_words(w, cnt=10)

Returns a list of words sorted by cosine proximity to a target word.

Parameters:
  • w – target word
  • cnt – how many similar words are needed
Returns:

list of words and corresponding similarities

load_hdf5(path)

Loads embeddings from HDF5 format.

load_npy(path)

Loads embeddings from NumPy format.

class vsmlib.model.Model_svd_scipy(original, cnt_singular_vectors, power)

Bases: vsmlib.model.ModelNumbered

filter_by_vocab(words)

Reduces embeddings to the provided list of words (which can be empty).

Parameters: words – set or list of words to keep
Returns: Instance of Dense class

get_most_similar_words(w, cnt=10)

Returns a list of words sorted by cosine proximity to a target word.

Parameters:
  • w – target word
  • cnt – how many similar words are needed
Returns:

list of words and corresponding similarities

load_hdf5(path)

Loads embeddings from HDF5 format.

load_npy(path)

Loads embeddings from NumPy format.

vsmlib.model.load_from_dir(path)

Automatically detects the embeddings format and loads the embeddings.

Parameters: path – directory where embeddings are stored
Returns: Instance of the appropriate Model-based class
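Because load_from_dir may return any Model subclass, calling code is probably best written against the shared interface rather than a concrete class. The sketch below illustrates that idea. `format_neighbours` and `StubModel` are hypothetical names, the stub exists only so the example runs without vsmlib or embedding files, and the (word, similarity) pair layout of the result is an assumption.

```python
def format_neighbours(model, word, cnt=5):
    """Return one 'word<TAB>similarity' line per neighbour of `word`.

    Works with any object exposing get_most_similar_words(w, cnt).
    Assumes the result is a list of (word, similarity) pairs.
    """
    pairs = model.get_most_similar_words(word, cnt=cnt)
    return ["{}\t{:.3f}".format(w, sim) for w, sim in pairs]


class StubModel:
    """Stand-in model with canned, made-up similarities (not real data)."""

    def get_most_similar_words(self, w, cnt=10):
        canned = [("pear", 0.91), ("plum", 0.85), ("car", 0.12)]
        return canned[:cnt]


lines = format_neighbours(StubModel(), "apple", cnt=2)

# With vsmlib installed, the stub would be replaced by a loaded model:
# from vsmlib.model import load_from_dir
# model = load_from_dir("/path/to/embeddings")  # placeholder path
# lines = format_neighbours(model, "apple")
```

Keeping the formatting code independent of the concrete class means it does not need to change when the embeddings on disk switch between dense and sparse formats.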