Most synonym generation strategies I have seen in user-facing websites/apps treat it as a two-step problem:

1. Candidate Generation

In this step, given a word, you generate all possible candidates that could be synonyms for it.

But what is a synonym?

  1. a word having the same or nearly the same meaning as another in the language, as happy, joyful, elated. A dictionary of synonyms and antonyms (or opposites) is called a thesaurus.
  2. a word or expression accepted as another name for something, as Arcadia for pastoral simplicity or Wall Street for U.S. financial markets; metonym.
  3. Biology. one of two or more scientific names applied to a single taxon

For a search engine, your interpretation of synonyms can be much broader and also include any other words that help you get better search results for a query. E.g., acronym expansions (“CS” expands to “computer science”), synonyms for proper nouns (“the big apple” expands to “new york city”), or even whole-phrase replacements all make great “synonyms” for a search engine.
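The query-expansion idea above can be sketched with a hand-built expansion table. Everything here (the `EXPANSIONS` table, the `expand_query` helper, and the naive substring matching) is illustrative; a real system would learn these mappings from data and match on token boundaries:

```python
# A minimal sketch of query-time synonym expansion, assuming a
# hand-built expansion table (a real system would learn these mappings).
EXPANSIONS = {
    "cs": ["computer science"],
    "the big apple": ["new york city"],
    "nyc": ["new york city"],
}

def expand_query(query):
    """Return the original query plus any known expansions as alternatives."""
    variants = [query]
    q = query.lower()
    for phrase, alts in EXPANSIONS.items():
        # Naive substring matching for illustration only; a production
        # system would match whole tokens to avoid false hits.
        if phrase in q:
            for alt in alts:
                variants.append(q.replace(phrase, alt))
    return variants
```

All returned variants can then be searched and their results merged or re-ranked.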

Depending on your application, here are some of the sources you can use for synonym generation:

Word Embeddings

You can train word vectors on your corpus and then find candidate synonyms for a given word using nearest neighbors, or by defining some other notion of “similarity.”
See more on this here: What kinds of related words does Word2Vec output?
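The nearest-neighbor step can be sketched with a toy embedding table and cosine similarity. The vectors below are hand-made for illustration, not trained embeddings:

```python
import math

# Toy embedding table (hand-made illustrative values, not trained vectors).
embeddings = {
    "happy":  [0.9, 0.1, 0.0],
    "joyful": [0.85, 0.15, 0.05],
    "sad":    [-0.8, 0.2, 0.1],
    "table":  [0.0, 0.9, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def candidate_synonyms(word, k=2):
    """Return the k nearest neighbors of `word` by cosine similarity."""
    query = embeddings[word]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]
```

With real trained vectors (e.g. from gensim's `Word2Vec`), the same nearest-neighbor lookup produces candidate synonyms at scale, though the neighbors include related words (antonyms, co-hyponyms) and not only true synonyms, which is why the classification step later is needed.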

Historical user data

You can also look at past user behavior, such as query reformulations, and generate synonym candidates from that. The simple version of this ignores context, but to get high-quality synonyms you do want to use the previous and next n words as context (e.g., whether the previous word was “get”). For word processors, you can similarly look at the word and phrase substitutions that users make.
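Mining candidates from query reformulations can be sketched as follows. The log format and the `substitution_candidates` helper are assumptions for illustration; real logs would be keyed by user session and timestamp:

```python
from collections import Counter

def substitution_candidates(query_pairs):
    """Given (query, reformulated_query) pairs from hypothetical search logs,
    count single-word substitutions made at the same position."""
    counts = Counter()
    for before, after in query_pairs:
        a, b = before.split(), after.split()
        if len(a) != len(b):
            continue  # only consider same-length reformulations here
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        if len(diffs) == 1:  # exactly one word was swapped
            counts[diffs[0]] += 1
    return counts

# Toy log: each pair is (original query, user's follow-up query).
logs = [
    ("get flight tickets", "buy flight tickets"),
    ("get flight tickets", "buy plane tickets"),
    ("cheap hotels nyc", "cheap hotels new york"),
    ("get concert tickets", "buy concert tickets"),
]
counts = substitution_candidates(logs)
```

High-count pairs such as ("get", "buy") become synonym candidates; the surrounding words in the query supply the context mentioned above.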

Lexical synonyms

These are lexical synonyms as defined by the language itself. WordNet is a popular resource for these, among others.

You can use other synonym sources beyond these as well, if they are better suited to your application.

You can use any of these methods in isolation to solve your problem. For example, a synonym generation algorithm using word2vec vectors alone might be enough for you. However, by using just one source, you will miss out on the strengths that the other sources offer.


2. Synonym Detection

Now that you have a set of synonym candidates for a given word, you need to find out which of those are actually synonyms. This can be solved as a classic supervised learning problem.

Given your candidate set, you can generate ground-truth training data using either human judges or past user engagement. As I mentioned above, the definition of a synonym differs across applications, so if you are using human judges, you will need to come up with clear guidelines on what makes a good synonym for them to use.

Once you have a labeled training set, you can generate various lexical and statistical features for your data and train a supervised ML model of your choice on it. In practice, I have seen that features related to past user behavior, such as word substitution frequency, perform best for synonym detection in certain domains such as search engines.
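The feature-and-score step can be sketched as below. The feature names and the hand-set weights are purely illustrative assumptions; in a real system the weights would come from a trained model (e.g. logistic regression) fit on the labeled data:

```python
import difflib

def features(word, candidate, substitution_counts):
    """Hypothetical features for a (word, candidate) synonym pair."""
    return {
        # String similarity between the two words.
        "string_sim": difflib.SequenceMatcher(None, word, candidate).ratio(),
        # How often users substituted one for the other (behavioral signal).
        "sub_freq": substitution_counts.get((word, candidate), 0),
        # Shared prefix as a crude morphological signal.
        "same_prefix": 1.0 if word[:3] == candidate[:3] else 0.0,
    }

# Hand-set weights for illustration only; a trained classifier learns these.
WEIGHTS = {"string_sim": 0.5, "sub_freq": 1.0, "same_prefix": 0.3}

def score(word, candidate, substitution_counts):
    """Linear score over the features; a stand-in for a trained model."""
    f = features(word, candidate, substitution_counts)
    return sum(WEIGHTS[k] * v for k, v in f.items())

# Toy behavioral data: users replaced "get" with "buy" three times.
subs = {("get", "buy"): 3}
ranked = sorted(["buy", "gem"], key=lambda c: score("get", c, subs), reverse=True)
```

Note how the behavioral feature dominates here: "buy" outranks the string-similar "gem", matching the observation that user-behavior features perform best.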

Antonym generation

You can find antonyms using a similar strategy as for synonyms. You may need to use different sources (e.g., you might not have historical user data, or you may need a different notion of “relation” between word vectors for antonyms), but the basic framework of candidate generation followed by classification remains the same.

Here is a paper describing an embedding variant that is useful for finding synonyms of a word.

Dependency-Based Word Embeddings.
Omer Levy and Yoav Goldberg. Short paper in ACL 2014.


While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts. In particular, we perform experiments with dependency-based contexts and show that they produce markedly different embeddings. The dependency-based embeddings are less topical and exhibit more functional similarity than the original skip-gram embeddings.


Source code

The implementation described in the paper can be found on Bitbucket.

The original word2vec can be found on Google Code.
