Quick start: Similarity experiments
This tutorial explains how to run similarity experiments. It assumes that a vector space is already built.
SimLex-999
The [SimLex-999] data set consists of 999 word pairs judged by humans for similarity. You can download the whole data set from here.
These are some of the records. The similarity score is in the SimLex999 column; SD(SimLex) is the standard deviation of the human ratings:
word1 | word2 | POS | SimLex999 | conc(w1) | conc(w2) | concQ | Assoc(USF) | SimAssoc333 | SD(SimLex) |
---|---|---|---|---|---|---|---|---|---|
old | new | A | 1.58 | 2.72 | 2.81 | 2 | 7.25 | 1 | 0.41 |
smart | intelligent | A | 9.2 | 1.75 | 2.46 | 1 | 7.11 | 1 | 0.67 |
hard | difficult | A | 8.77 | 3.76 | 2.21 | 2 | 5.94 | 1 | 1.19 |
happy | cheerful | A | 9.55 | 2.56 | 2.34 | 1 | 5.85 | 1 | 2.18 |
hard | easy | A | 0.95 | 3.76 | 2.07 | 2 | 5.82 | 1 | 0.93 |
fast | rapid | A | 8.75 | 3.32 | 3.07 | 2 | 5.66 | 1 | 1.68 |
happy | glad | A | 9.17 | 2.56 | 2.36 | 1 | 5.49 | 1 | 1.59 |
short | long | A | 1.23 | 3.61 | 3.18 | 2 | 5.36 | 1 | 1.58 |
Our task is to predict the human judgment given a pair of words from the dataset.
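To make the record layout concrete, here is a minimal sketch that reads word pairs and their gold scores. It assumes the distributed SimLex-999.txt is tab-separated with the header shown above. `read_pairs` is a hypothetical helper and is not part of the `corpora` tool. The sample holds three records copied from the table, so the sketch runs without the download.

```python
import csv
import io

# Three records copied from the table above.
# Assumption: the real file is tab-separated and uses this header.
SAMPLE = (
    "word1\tword2\tPOS\tSimLex999\tconc(w1)\tconc(w2)\tconcQ\t"
    "Assoc(USF)\tSimAssoc333\tSD(SimLex)\n"
    "old\tnew\tA\t1.58\t2.72\t2.81\t2\t7.25\t1\t0.41\n"
    "smart\tintelligent\tA\t9.2\t1.75\t2.46\t1\t7.11\t1\t0.67\n"
    "hard\tdifficult\tA\t8.77\t3.76\t2.21\t2\t5.94\t1\t1.19\n"
)


def read_pairs(handle):
    """Return (word1, word2, pos, gold_score) tuples from a SimLex-style file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["word1"], row["word2"], row["POS"], float(row["SimLex999"]))
        for row in reader
    ]


pairs = read_pairs(io.StringIO(SAMPLE))
for word1, word2, pos, gold in pairs:
    print(f"{word1:>6} {word2:<12} {pos} {gold:.2f}")
```

With the real file, `read_pairs(open("SimLex-999.txt", encoding="utf-8"))` would replace the in-memory sample.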
# Download the dataset
wget http://www.eecs.qmul.ac.uk/~dm303/static/data/SimLex-999/SimLex-999.txt
# Download the vector space
wget http://www.eecs.qmul.ac.uk/~dm303/t/space_corpus.ukwac_wackypedia-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.SimLex-999.reduction.raw.cds.nan.h5
# Run an experiment
corpora wsd similarity \
--space space_corpus.ukwac_wackypedia-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.SimLex-999.reduction.raw.cds.nan.h5 \
--dataset simlex999://$PWD/SimLex-999.txt?tagset=ukwac \
--output simlex.h5
Similarity |################################| 999/999, elapsed: 0:00:03
Spearman correlation (head), cosine): rho=0.359, p=0.00000, support=999
For this space (weighting: PPMI; corpus: ukWaC + Wackypedia; context: the 2000 most frequent POS-tagged words), the Spearman correlation between the cosine similarities and the human judgments is rho = 0.359.
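The reported figure is a Spearman rank correlation, which compares the ordering of the model's similarities with the ordering of the human scores. The sketch below is a plain-Python illustration of that statistic, not the tool's own code. It uses only five pairs: the gold scores of the first five records above, paired with the cosine values the tool stores for those pairs in simlex.h5. Its result therefore does not reproduce the rho computed over all 999 pairs.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# old/new, smart/intelligent, hard/difficult, happy/cheerful, hard/easy
gold = [1.58, 9.20, 8.77, 9.55, 0.95]
cosine = [0.137955, 0.504371, 0.483636, 0.469045, 0.436153]

rho = spearman(gold, cosine)
print(f"rho={rho:.3f}")  # rho=0.600 on these five pairs
```

On real data, `scipy.stats.spearmanr` computes the same statistic together with a p-value.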
The output file stores the similarity of every pair, computed not only with cosine but also with correlation and inner product:
>>> import pandas as pd
>>> pd.read_hdf('simlex.h5', key='dataset').head()
unit1 unit2 eucliedean cos correlation inner_product score
0 (old, J, ()) (new, J, ()) 0.044349 0.137955 -0.009947 36.827502 1.58
1 (smart, J, ()) (intelligent, J, ()) 0.050431 0.504371 0.418022 180.106291 9.20
2 (hard, J, ()) (difficult, J, ()) 0.054279 0.483636 0.419472 142.161781 8.77
3 (happy, J, ()) (cheerful, J, ()) 0.050953 0.469045 0.403718 149.552762 9.55
4 (hard, J, ()) (easy, J, ()) 0.050773 0.436153 0.356595 134.926696 0.95
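For orientation, the measure names in the columns above correspond to standard vector operations. The sketch below gives their textbook definitions on two hypothetical four-dimensional vectors. It is an illustration only: this tutorial does not state how the tool scales its `eucliedean` column, so the plain distance shown here may not match that column.

```python
import math


def inner_product(u, v):
    """Sum of component-wise products."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Inner product normalized by the lengths of both vectors."""
    return inner_product(u, v) / math.sqrt(inner_product(u, u) * inner_product(v, v))


def correlation(u, v):
    """Pearson correlation: the cosine of the mean-centred vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


def euclidean_distance(u, v):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical vectors, not taken from the downloaded space; v is 2 * u.
u = [1.0, 2.0, 0.0, 3.0]
v = [2.0, 4.0, 0.0, 6.0]

print(inner_product(u, v))       # 28.0
print(cosine(u, v))              # 1.0, because the vectors point the same way
print(correlation(u, v))         # 1.0
print(euclidean_distance(u, v))  # sqrt(14)
```

Cosine and correlation ignore vector length, whereas the inner product grows with it, which is why the `inner_product` column is on a very different scale from `cos`.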
MEN
MEN is a word similarity and relatedness dataset [MEN]:
# Download the datasets
wget http://www.eecs.qmul.ac.uk/~dm303/t/MEN_dataset_lemma_form_full
wget http://www.eecs.qmul.ac.uk/~dm303/t/MEN_dataset_lemma_form.dev
wget http://www.eecs.qmul.ac.uk/~dm303/t/MEN_dataset_lemma_form.test
# The space is the same file that was downloaded in the SimLex-999 section
# Run an experiment on the full dataset
corpora wsd similarity \
--space space_corpus.ukwac_wackypedia-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.SimLex-999.reduction.raw.cds.nan.h5 \
--dataset men://$PWD/MEN_dataset_lemma_form_full \
--output men_full.h5
Similarity |################################| 3000/3000, elapsed: 0:00:09
Spearman correlation (head), cosine): rho=0.699, p=0.00000, support=3000
# Dev
corpora wsd similarity \
--space space_corpus.ukwac_wackypedia-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.SimLex-999.reduction.raw.cds.nan.h5 \
--dataset men://$PWD/MEN_dataset_lemma_form.dev \
--output men_dev.h5
Similarity |################################| 2000/2000, elapsed: 0:00:06
Spearman correlation (head), cosine): rho=0.698, p=0.00000, support=2000
# Test
corpora wsd similarity \
--space space_corpus.ukwac_wackypedia-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.SimLex-999.reduction.raw.cds.nan.h5 \
--dataset men://$PWD/MEN_dataset_lemma_form.test \
--output men_test.h5
Similarity |################################| 1000/1000, elapsed: 0:00:03
Spearman correlation (head), cosine): rho=0.701, p=0.00000, support=1000
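The three MEN runs differ only in the dataset file and the output name, so they can be generated from a small table. The sketch below builds the argument lists using only the flags shown above (`--space`, `--dataset`, `--output`). `build_command` is a hypothetical helper, and the sketch only prints the commands rather than running them.

```python
from pathlib import Path

SPACE = (
    "space_corpus.ukwac_wackypedia-weighting.ppmi.neg.1.base.e-context."
    "nvaa2000_dataset.SimLex-999.reduction.raw.cds.nan.h5"
)

# Dataset file -> output file, as used in the three runs above.
RUNS = {
    "MEN_dataset_lemma_form_full": "men_full.h5",
    "MEN_dataset_lemma_form.dev": "men_dev.h5",
    "MEN_dataset_lemma_form.test": "men_test.h5",
}


def build_command(dataset_file, output_file, directory):
    """Return the argument list for one `corpora wsd similarity` run."""
    return [
        "corpora", "wsd", "similarity",
        "--space", SPACE,
        # Mirrors men://$PWD/<file> from the shell commands above.
        "--dataset", f"men://{directory}/{dataset_file}",
        "--output", output_file,
    ]


commands = [
    build_command(dataset_file, output_file, Path.cwd())
    for dataset_file, output_file in RUNS.items()
]
for command in commands:
    # Print only; pass a list to subprocess.run(command, check=True) to execute it.
    print(" ".join(command))
```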
KS14
# Download the dataset
wget http://compling.eecs.qmul.ac.uk/wp-content/uploads/2015/07/KS2014.txt
# Download the spaces
wget http://www.eecs.qmul.ac.uk/~dm303/t/space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5
# Addition
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--dataset ks13://$PWD/KS2014.txt \
--composition_operator add \
--output ks14_add.h5
Similarity |################################| 108/108, elapsed: 0:00:01
Spearman correlation (add), cosine): rho=0.780, p=0.00000, support=108
# Head
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--dataset ks13://$PWD/KS2014.txt \
--composition_operator head \
--output ks14_head.h5
Similarity |################################| 108/108, elapsed: 0:00:00
Spearman correlation (head), cosine): rho=0.697, p=0.00000, support=108
# Multiplication
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--dataset ks13://$PWD/KS2014.txt \
--composition_operator mult \
--output ks14_mult.h5
Similarity |################################| 108/108, elapsed: 0:00:01
Spearman correlation (mult), cosine): rho=0.721, p=0.00000, support=108
# Kronecker
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--dataset ks13://$PWD/KS2014.txt \
--composition_operator kron \
--output ks14_kron.h5
Similarity |################################| 108/108, elapsed: 0:01:04
Spearman correlation (kron), cosine): rho=0.805, p=0.00000, support=108
# Relational
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--verb_space out/verb_space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--dataset ks13://$PWD/KS2014.txt \
--composition_operator relational \
--output ks14_relational.h5
Similarity |################################| 108/108, elapsed: 0:01:04
Spearman correlation (relational), cosine): rho=0.522, p=0.00000, support=108
# copy-object
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--verb_space out/verb_space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--dataset ks13://$PWD/KS2014.txt \
--composition_operator copy-object \
--output ks14_copy-object.h5
Similarity |################################| 108/108, elapsed: 0:00:38
Spearman correlation (copy-object), cosine): rho=0.346, p=0.00025, support=108
# copy-subject
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--verb_space out/verb_space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--dataset ks13://$PWD/KS2014.txt \
--composition_operator copy-subject \
--output ks14_copy-subject.h5
Similarity |################################| 108/108, elapsed: 0:00:35
Spearman correlation (copy-subject), cosine): rho=0.446, p=0.00000, support=108
# frobenious-add
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--verb_space out/verb_space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--dataset ks13://$PWD/KS2014.txt \
--composition_operator frobenious-add \
--output ks14_frobenious-add.h5
Similarity |################################| 108/108, elapsed: 0:00:39
Spearman correlation (frobenious-add), cosine): rho=0.486, p=0.00000, support=108
# frobenious-mult
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--verb_space out/verb_space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--dataset ks13://$PWD/KS2014.txt \
--composition_operator frobenious-mult \
--output ks14_frobenious-mult.h5
Similarity |################################| 108/108, elapsed: 0:00:39
Spearman correlation (frobenious-mult), cosine): rho=0.354, p=0.00017, support=108
# frobenious-outer
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--verb_space out/verb_space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.emnlp2013_turk.reduction.raw.cds.nan.h5 \
--dataset ks13://$PWD/KS2014.txt \
--composition_operator frobenious-outer \
--output ks14_frobenious-outer.h5
Similarity |################################| 108/108, elapsed: 0:01:37
Spearman correlation (frobenious-outer), cosine): rho=0.522, p=0.00000, support=108
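The operators that need no verb space (`add`, `mult`, `head`, `kron`) can be read as simple vector operations on the words of a subject-verb-object sentence. The sketch below follows the definitions commonly given in the compositional distributional semantics literature. It assumes that `head` keeps only the verb vector and that `kron` is the Kronecker model, (verb outer verb) multiplied element-wise by (subject outer object). The tool's own implementation is not shown in this tutorial and may differ. All vectors are hypothetical.

```python
import math


def add(*vectors):
    """Component-wise sum of the word vectors."""
    return [sum(components) for components in zip(*vectors)]


def mult(*vectors):
    """Component-wise product of the word vectors."""
    return [math.prod(components) for components in zip(*vectors)]


def head(subject, verb, obj):
    """Assumption: the sentence is represented by its verb alone."""
    return list(verb)


def outer(u, v):
    """Outer product of two vectors, as a list of rows."""
    return [[a * b for b in v] for a in u]


def kron(subject, verb, obj):
    """Kronecker model: (verb outer verb) element-wise times (subject outer object)."""
    verb_matrix = outer(verb, verb)
    arguments = outer(subject, obj)
    return [
        [a * b for a, b in zip(verb_row, argument_row)]
        for verb_row, argument_row in zip(verb_matrix, arguments)
    ]


# Hypothetical two-dimensional word vectors.
subject = [1.0, 2.0]
verb = [3.0, 1.0]
obj = [2.0, 0.5]

print(add(subject, verb, obj))   # [6.0, 3.5]
print(mult(subject, verb, obj))  # [6.0, 1.0]
print(head(subject, verb, obj))  # [3.0, 1.0]
print(kron(subject, verb, obj))  # [[18.0, 1.5], [12.0, 1.0]]
```

The Kronecker result has one entry per pair of dimensions, so its size grows quadratically with the dimensionality of the space. This is consistent with the `kron` run above taking about a minute while `add` and `mult` finish in about a second.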
GS11
# Download the dataset
wget http://compling.eecs.qmul.ac.uk/wp-content/uploads/2015/07/GS2011data.txt
# Download the spaces
wget http://www.eecs.qmul.ac.uk/~dm303/t/space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.gs2011.reduction.raw.cds.nan.h5
wget http://www.eecs.qmul.ac.uk/~dm303/t/verb_space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.gs2011.reduction.raw.cds.nan.h5
# Add
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.gs2011.reduction.raw.cds.nan.h5 \
--dataset gs11://$PWD/GS2011data.txt \
--composition_operator add \
--output gs11-add.h5
Similarity |################################| 199/199, elapsed: 0:00:01
Spearman correlation (add), cosine): rho=0.192, p=0.00670, support=199
# copy-object
corpora wsd similarity \
--space space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.gs2011.reduction.raw.cds.nan.h5 \
--verb_space verb_space_corpus.ukwac-weighting.ppmi.neg.1.base.e-context.nvaa2000_dataset.gs2011.reduction.raw.cds.nan.h5 \
--dataset gs11://$PWD/GS2011data.txt \
--composition_operator copy-object \
--output gs11-copy-object.h5
Similarity |################################| 199/199, elapsed: 0:01:15
Spearman correlation (copy-object), cosine): rho=0.024, p=0.73779, support=199
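The operators that take `--verb_space` treat the verb as a matrix rather than a vector. The sketch below shows copy-subject, copy-object and their sum as they are usually defined in the categorical compositional literature. It assumes the verb space supplies one square matrix per verb, and the tool's actual formulas are not documented here. The vectors and the matrix are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Swap the rows and columns of a matrix."""
    return [list(column) for column in zip(*matrix)]


def hadamard(u, v):
    """Component-wise product of two vectors."""
    return [a * b for a, b in zip(u, v)]


def copy_subject(subject, verb_matrix, obj):
    """subject element-wise times (verb_matrix applied to object)."""
    return hadamard(subject, matvec(verb_matrix, obj))


def copy_object(subject, verb_matrix, obj):
    """object element-wise times (transposed verb_matrix applied to subject)."""
    return hadamard(obj, matvec(transpose(verb_matrix), subject))


def frobenius_add(subject, verb_matrix, obj):
    """Sum of the copy-subject and copy-object vectors."""
    return [
        a + b
        for a, b in zip(
            copy_subject(subject, verb_matrix, obj),
            copy_object(subject, verb_matrix, obj),
        )
    ]


# Hypothetical two-dimensional example.
subject = [1.0, 2.0]
obj = [3.0, 1.0]
verb_matrix = [[1.0, 0.0], [2.0, 1.0]]

print(copy_subject(subject, verb_matrix, obj))   # [3.0, 14.0]
print(copy_object(subject, verb_matrix, obj))    # [15.0, 2.0]
print(frobenius_add(subject, verb_matrix, obj))  # [18.0, 16.0]
```

Under these definitions the two copy operators differ in which argument the verb matrix is applied to, so they generally give different sentence vectors for the same words.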
References
[SimLex-999] Felix Hill, Roi Reichart and Anna Korhonen. SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computational Linguistics, 2015.
[MEN] E. Bruni, N. K. Tran and M. Baroni. Multimodal Distributional Semantics. Journal of Artificial Intelligence Research 49: 1-47, 2014.