
A Python module to generate word embeddings from tiny data


nonce2vec

Welcome to Nonce2Vec!

This is the repo accompanying the paper "High-risk learning: acquiring new word vectors from tiny data" (Herbelot & Baroni, 2017). If you use this code, please cite the following:

@InProceedings{herbelot-baroni:2017:EMNLP2017,
  author    = {Herbelot, Aur\'{e}lie  and  Baroni, Marco},
  title     = {High-risk learning: acquiring new word vectors from tiny data},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {304--309},
  url       = {https://www.aclweb.org/anthology/D17-1030}
}

NEW! We have now released v2.0 of Nonce2Vec, which is packaged via pip and runs on gensim v3.4.0. This should make it much easier for you to replicate our experiments.

Install

pip3 install nonce2vec

Download and extract the required resources

To download the nonces, chimeras and MEN datasets:

wget http://129.194.21.122/~kabbach/noncedef.chimeras.men.7z

To use the pretrained gensim model from Herbelot and Baroni (2017):

wget http://129.194.21.122/~kabbach/wiki_all.model.7z
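
Both resources are shipped as .7z archives; extract them first, for example with p7zip (7z x <archive>.7z). The extracted background model should then load with gensim's standard API. A minimal sketch, assuming the extracted file is named wiki_all.model after the archive:

from gensim.models import Word2Vec

# load the pretrained background model released by Herbelot and Baroni (2017)
model = Word2Vec.load('wiki_all.model')  # filename assumed from the archive name
print(model.wv.most_similar('cat', topn=5))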

Generate a pre-trained word2vec model

To generate a gensim Word2Vec model from scratch with n2v:

Use a Wikipedia dump

To use the same Wikipedia dump as Herbelot and Baroni (2017):

wget http://129.194.21.122/~kabbach/wiki.all.utf8.sent.split.lower.7z

Otherwise, to create a new Wikipedia dump from an earlier archive, check out WiToKit.

Train the background model

n2v train \
  --data /absolute/path/to/wikipedia/dump \
  --outputdir /absolute/path/to/dir/where/to/store/w2v/model \
  --alpha 0.025 \
  --neg 5 \
  --window 5 \
  --sample 1e-3 \
  --epochs 5 \
  --min-count 50 \
  --size 400 \
  --num-threads number_of_cpu_threads_to_use \
  --train-mode skipgram
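
n2v train wraps gensim's Word2Vec training. Purely to clarify how the flags above map onto gensim v3.4.0 parameters, here is a rough plain-gensim equivalent (paths are placeholders; this is a sketch, not the actual n2v implementation):

from gensim.models.word2vec import Word2Vec, LineSentence

# the dump is expected to be tokenised and lowercased, one sentence per line
sentences = LineSentence('/absolute/path/to/wikipedia/dump')

model = Word2Vec(
    sentences,
    alpha=0.025,   # --alpha
    negative=5,    # --neg
    window=5,      # --window
    sample=1e-3,   # --sample
    iter=5,        # --epochs (gensim 3.x names this 'iter')
    min_count=50,  # --min-count
    size=400,      # --size
    workers=4,     # --num-threads
    sg=1,          # --train-mode skipgram
)
model.save('/absolute/path/to/w2v/model')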

Check the correlation with the MEN dataset

n2v check \
  --data /absolute/path/to/MEN/MEN_dataset_natural_form_full \
  --model /absolute/path/to/gensim/word2vec/model
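
This check measures how well the background model's cosine similarities agree with the MEN human ratings, typically as a Spearman correlation. A minimal sketch of the same measurement in plain gensim/scipy, assuming MEN's natural-form file holds one word1 word2 score triple per line (check_men is an illustrative helper, not the n2v code):

from gensim.models import Word2Vec
from scipy.stats import spearmanr

def check_men(model_path, men_path):
    model = Word2Vec.load(model_path)
    system, gold = [], []
    with open(men_path) as f:
        for line in f:
            w1, w2, score = line.split()
            if w1 in model.wv and w2 in model.wv:
                system.append(model.wv.similarity(w1, w2))
                gold.append(float(score))
    return spearmanr(system, gold)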

Test nonce2vec on the nonce definitional dataset

n2v test \
  --on nonces \
  --model /absolute/path/to/pretrained/w2v/model \
  --data /absolute/path/to/nonce.definitions.299.test \
  --alpha 1 \
  --neg 3 \
  --window 15 \
  --sample 10000 \
  --epochs 1 \
  --min-count 1 \
  --lambda 70 \
  --sample-decay 1.9 \
  --window-decay 5 \
  --sum-filter random \
  --sum-over-set \
  --replication
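
On the definitional dataset, the paper scores each learned nonce vector by the rank of the word's gold background vector among the nonce's nearest neighbours, reported as Mean Reciprocal Rank (MRR). A sketch of that ranking metric, with gold_rank as a hypothetical helper and gensim 3.x attribute names assumed:

import numpy as np

def gold_rank(model, nonce_vec, gold_word):
    # 1-based rank of gold_word among cosine neighbours of nonce_vec (lower is better)
    mat = model.wv.vectors  # background embedding matrix (gensim 3.x)
    sims = mat @ nonce_vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(nonce_vec))
    gold_idx = model.wv.vocab[gold_word].index
    return int((sims > sims[gold_idx]).sum()) + 1

# MRR over the test set is then the mean of 1.0 / gold_rank(...) across all nonces.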

Test nonce2vec on the chimeras dataset

n2v test \
  --on chimeras \
  --model /absolute/path/to/pretrained/w2v/model \
  --data /absolute/path/to/chimeras.dataset.lx.tokenised.test.txt \
  --alpha 1 \
  --neg 3 \
  --window 15 \
  --sample 10000 \
  --epochs 1 \
  --min-count 1 \
  --lambda 70 \
  --sample-decay 1.9 \
  --window-decay 5 \
  --sum-filter random \
  --sum-over-set \
  --replication
