Implementation of NCE-Loss in TensorFlow and Word2Vec

I’ve been reading the source code of word2vec recently and noticed that its loss function is not multi-class cross-entropy but NCE. While looking into it I found this blog post, which I’m sharing here.

First, let’s take a look at the API for TensorFlow’s NCE loss:

def nce_loss(weights, biases, inputs, labels, num_sampled, num_classes,
             num_true=1,
             sampled_values=None,
             remove_accidental_hits=False,
             partition_strategy="mod",
             name="nce_loss")

Assuming the inputs fed into nce_loss are K-dimensional and there are N classes in total, we have the following (see the call sketch after this list):

  • weights.shape = (N, K)
  • biases.shape = (N,)
  • inputs.shape = (batch_size, K)
  • labels.shape = (batch_size, num_true)
  • num_true: the number of true (positive) classes per example; this is 1 in the usual word2vec setting
  • num_sampled: the number of negative classes to sample
  • num_classes = N
  • sampled_values: the sampled negative classes; if None, a default sampler is used. What a sampler is will be explained later.
  • remove_accidental_hits: whether to discard sampled negatives that happen to coincide with the positive classes
  • partition_strategy: the strategy for partitioning the weights for parallel lookup in embedding_lookup. TensorFlow’s embedding_lookup runs on the CPU, so multi-threaded lookups have to take locking into account.
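
To make these shapes concrete, here is a minimal call sketch following the positional argument order of the signature quoted above (an older TensorFlow API; newer versions put labels before inputs). All sizes and variable names are made up for illustration:

import math
import tensorflow as tf

N = 50000          # num_classes: total number of classes (e.g. vocabulary size)
K = 128            # dimensionality of the inputs
batch_size = 64
num_sampled = 16   # negative samples drawn per batch

weights = tf.Variable(tf.truncated_normal([N, K], stddev=1.0 / math.sqrt(K)))  # (N, K)
biases = tf.Variable(tf.zeros([N]))                                            # (N,)
inputs = tf.placeholder(tf.float32, shape=[batch_size, K])                     # (batch_size, K)
labels = tf.placeholder(tf.int64, shape=[batch_size, 1])                       # (batch_size, num_true)

# One NCE loss value per example, averaged over the batch.
loss = tf.reduce_mean(tf.nn.nce_loss(weights, biases, inputs, labels,
                                     num_sampled, N))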

The implementation logic of nce_loss is as follows:

  • _compute_sampled_logits: computes the outputs (logits) and labels for the positive classes together with the sampled negative classes.
  • sigmoid_cross_entropy_with_logits: computes a sigmoid cross-entropy loss between those outputs and labels for backpropagation. This turns the problem into num_sampled + num_true binary classification problems, each using the cross-entropy loss familiar from logistic regression. TensorFlow also provides softmax_cross_entropy_with_logits, which is a different function. A simplified sketch of these two steps follows this list.
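
Roughly, the two steps combine as in the sketch below. This is a simplified, unofficial illustration: it ignores remove_accidental_hits and the subtraction of the sampler’s log expected counts that the real implementation performs, and the helper name is made up.

import tensorflow as tf

def nce_loss_sketch(weights, biases, inputs, labels, sampled_ids):
    # sampled_ids: int vector of length num_sampled, e.g. produced by a candidate sampler.
    # Step 1 (roughly what _compute_sampled_logits does): logits for the
    # true class and for each sampled negative class.
    true_ids = tf.reshape(labels, [-1])
    true_w = tf.nn.embedding_lookup(weights, true_ids)                 # (batch, K)
    true_b = tf.nn.embedding_lookup(biases, true_ids)                  # (batch,)
    true_logits = tf.reduce_sum(inputs * true_w, axis=1) + true_b      # (batch,)

    sampled_w = tf.nn.embedding_lookup(weights, sampled_ids)           # (num_sampled, K)
    sampled_b = tf.nn.embedding_lookup(biases, sampled_ids)            # (num_sampled,)
    sampled_logits = tf.matmul(inputs, sampled_w, transpose_b=True) + sampled_b  # (batch, num_sampled)

    logits = tf.concat([tf.expand_dims(true_logits, 1), sampled_logits], axis=1)

    # Step 2: one sigmoid cross-entropy (binary classification) per column,
    # with label 1 for the true class and 0 for every sampled negative.
    out_labels = tf.concat([tf.ones_like(tf.expand_dims(true_logits, 1)),
                            tf.zeros_like(sampled_logits)], axis=1)
    per_class = tf.nn.sigmoid_cross_entropy_with_logits(labels=out_labels, logits=logits)
    return tf.reduce_sum(per_class, axis=1)   # per-example NCE loss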

Next, let’s look at the implementation of word2vec in TensorFlow, which uses nce_loss as follows:

  loss = tf.reduce_mean(
      tf.nn.nce_loss(nce_weights, nce_biases, embed, train_labels,
                     num_sampled, vocabulary_size))

As you can see, it does not pass sampled_values here. So how are its negative samples obtained? Continuing to look at the implementation of nce_loss, we can see the code handling sampled_values=None as follows:

    if sampled_values is None:
      sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
          true_classes=labels,
          num_true=num_true,
          num_sampled=num_sampled,
          unique=True,
          range_max=num_classes)

Therefore, by default it uses log_uniform_candidate_sampler for sampling. How does log_uniform_candidate_sampler sample? Its implementation works as follows (a quick numeric check comes after the list):

  • It samples an integer k from [0, range_max)
  • P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1)
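
A quick numeric check of this distribution in plain Python (range_max is made up here) shows how fast the probability decays with k. Note that the terms telescope, so the probabilities sum to exactly 1 over k = 0 .. range_max - 1.

import math

range_max = 50000  # e.g. the vocabulary size

def p(k):
    # Probability that log_uniform_candidate_sampler draws class id k.
    return (math.log(k + 2) - math.log(k + 1)) / math.log(range_max + 1)

for k in [0, 1, 10, 100, 1000]:
    print(k, round(p(k), 4))
# prints roughly: 0 0.0641, 1 0.0375, 10 0.008, 100 0.0009, 1000 0.0001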

As you can see, the larger k is, the lower the probability of being sampled. So does the category number have any significance in TensorFlow’s word2vec? Check the code below:

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))  # sorted by descending frequency
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reverse_dictionary

As you can see, in TensorFlow’s word2vec implementation, the higher a word’s frequency, the smaller its category number (the most frequent words get the smallest ids after UNK). Since the log-uniform sampler assigns higher probability to smaller ids, the negative sampling in TensorFlow’s word2vec effectively favors high-frequency words as negative samples.
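
A quick check with a toy word list (the corpus below is made up) confirms that ordering:

import collections

words = ['the'] * 5 + ['cat'] * 3 + ['sat'] * 2 + ['mat']
count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(3))   # vocabulary_size - 1 = 3
dictionary = {}
for word, _ in count:
    dictionary[word] = len(dictionary)
print(dictionary)
# {'UNK': 0, 'the': 1, 'cat': 2, 'sat': 3} -- the most frequent word gets the smallest id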

In the paper that proposed negative sampling, and in the original C++ implementation of word2vec, negative samples are drawn with probability proportional to the 0.75 power of word frequency, which differs from TensorFlow’s implementation. The overall idea is the same, though: the more frequent a word, the more likely it is to be drawn as a negative sample.
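
For comparison, here is a small sketch of that frequency^0.75 sampling (a simplified stand-in for the unigram table used by the C++ code; the counts are made up):

import numpy as np

word_counts = np.array([120, 60, 30, 10, 5], dtype=np.float64)  # made-up word frequencies

# Sampling distribution proportional to count^0.75, as in the original word2vec.
probs = word_counts ** 0.75
probs /= probs.sum()

# Draw 16 negative-sample word ids from this distribution.
negatives = np.random.choice(len(word_counts), size=16, p=probs)
print(probs.round(3), negatives)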

Author: xlvector
Link: https://www.jianshu.com/p/fab82fa53e16