Introduction to VSMlib

This is the tutorial for VSMlib. It describes:

  • what it is, and why we are developing it.
  • what you can do with VSMlib.
  • the roadmap of the project.

Both the library and the documentation are under active development, so check back for more! If you have questions or would like to contribute, feel free to get in touch on GitHub.

What is VSMlib?

VSMlib is an open-source Python library for working with vector space models (VSMs), including word embeddings such as word2vec. VSMlib can load several popular VSM formats and retrieve the nearest neighbors of a given vector. It includes a growing list of the benchmarks used to evaluate VSMs in most current research, and a few visualization tools. It also includes a growing list of modules for creating VSMs, both explicit and based on neural networks.
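
To make "retrieving nearest neighbors" concrete, here is a minimal sketch of the underlying idea using plain NumPy and cosine similarity. It does not use or represent VSMlib's own API: the function name nearest_neighbors, the toy vocabulary, and the vector values are all made up for illustration.

```python
import numpy as np

# Toy vocabulary and vectors; the values are invented for illustration only.
vocab = ["cat", "dog", "car", "truck"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
])


def nearest_neighbors(query, vectors, vocab, k=2):
    """Return the k words whose vectors are closest to `query` by cosine similarity.

    Hypothetical helper for illustration; not part of VSMlib.
    """
    # Normalize the rows and the query so a dot product equals cosine similarity.
    unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    similarities = unit_rows @ unit_query
    # Sort indices by descending similarity and keep the top k.
    top = np.argsort(-similarities)[:k]
    return [(vocab[i], float(similarities[i])) for i in top]


neighbors = nearest_neighbors(vectors[vocab.index("cat")], vectors, vocab, k=2)
print(neighbors)
```

Because the query here is itself a row of the matrix, the first neighbor returned is the query word; a real lookup would typically skip it.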

Why do you bother?

There are a few other libraries for working with VSMs, including gensim and spaCy. VSMlib differs from them in that its primary goal is to facilitate principled, systematic research by providing a framework for reproducible experiments on VSMs.

From the academic perspective, this matters because systematic, reproducible experiments are the only way to understand more about what VSMs are and what kind of meaning representation they offer.

From the practical perspective, this matters because otherwise we cannot tell which VSM would be best for which task. Existing intrinsic evaluations of VSMs, such as the popular word similarity, relatedness, analogy and intrusion tasks, have methodological problems and do not correlate well with performance on all extrinsic tasks. As a result, to pick the best representation for a task you have to try different kinds of VSMs until you find the best-performing one.

Furthermore, there is the important and unpleasant part of parameter tuning and optimizing for a particular task. Levy et al. (2015) showed that the choice of hyperparameters may make more of a difference than the choice of the model itself. Even more frustratingly, when you have a relatively comprehensive task covering a wide range of linguistic relations, you may find that parameters beneficial for one part of the task are detrimental to another (Gladkova et al., 2016).

The neural parts of VSMlib are implemented in Chainer, a new deep learning framework that is friendly to high-performance multi-GPU environments. This should make VSMlib useful in both academic and industrial settings.