tf.keras.preprocessing.sequence.make_sampling_table
Generates a word rank-based probabilistic sampling table.
Compat aliases for migration
See the Migration guide for more details.
`tf.compat.v1.keras.preprocessing.sequence.make_sampling_table`
tf.keras.preprocessing.sequence.make_sampling_table(
size, sampling_factor=1e-05
)
Used for generating the `sampling_table` argument for `skipgrams`.
`sampling_table[i]` is the probability of sampling the i-th most common word in a dataset
(more common words should be sampled less frequently, for balance).
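As a rough illustration of what "probability of sampling" means here, the sketch below keeps each word with the probability stored at its rank. The helper name `subsample_by_rank` and the toy table are hypothetical, and this page does not describe how `skipgrams` applies the table internally, so treat this as one plausible reading rather than the library's behavior.

```python
import random


def subsample_by_rank(word_ranks, sampling_table, seed=None):
    """Keep each word with probability sampling_table[rank] (hypothetical helper)."""
    rng = random.Random(seed)
    kept = []
    for rank in word_ranks:
        # rng.random() is uniform in [0, 1), so the word is kept with
        # probability equal to its entry in the table.
        if rng.random() < sampling_table[rank]:
            kept.append(rank)
    return kept


# Toy table (made up for illustration): rank 0 is never kept, rank 3 always is.
toy_table = [0.0, 0.1, 0.5, 1.0]
kept = subsample_by_rank([0, 1, 2, 3, 0, 3], toy_table, seed=0)
```

Entries of 1.0 mean the word is always kept, while smaller entries thin out the most common words.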
The sampling probabilities are generated according to the sampling distribution used in word2vec:
p(word) = min(1, sqrt(word_frequency / sampling_factor) / (word_frequency / sampling_factor))
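The ratio inside `min` can be simplified, because sqrt(x) / x = 1 / sqrt(x) for any x > 0. Writing f for `word_frequency` and t for `sampling_factor` (shorthand introduced here, not used elsewhere on this page):

```latex
p(\text{word})
  = \min\!\left(1,\ \frac{\sqrt{f/t}}{f/t}\right)
  = \min\!\left(1,\ \sqrt{\frac{t}{f}}\right)
```

Read this way, a word whose frequency is at most the sampling factor gets probability 1, and a more frequent word gets probability sqrt(t / f), which shrinks as its frequency grows.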
We assume that the word frequencies follow Zipf's law (s=1) to derive a numerical approximation of frequency(rank):
frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))
where gamma is the Euler-Mascheroni constant.
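The two formulas above can be combined into a short pure-Python sketch. The function name `sampling_table_sketch` is hypothetical, and the sketch assumes ranks run from 1 to `size`; how the library maps index 0 to a rank is not stated on this page, so the real function's output may differ at the start of the array.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def sampling_table_sketch(size, sampling_factor=1e-5):
    """Approximate sampling probabilities for ranks 1..size (assumed indexing)."""
    table = []
    for rank in range(1, size + 1):
        # Zipf-based approximation of 1 / frequency(rank).
        inv_freq = rank * (math.log(rank) + EULER_GAMMA) + 0.5 - 1.0 / (12.0 * rank)
        # word_frequency / sampling_factor from the word2vec formula.
        ratio = (1.0 / inv_freq) / sampling_factor
        table.append(min(1.0, math.sqrt(ratio) / ratio))
    return table


table = sampling_table_sketch(10)
```

With the default `sampling_factor`, the entries are small for the most common ranks and grow toward 1 as rank increases, which matches the stated goal of sampling common words less often.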
Args
`size` | Int, number of possible words to sample.
`sampling_factor` | The sampling factor in the word2vec formula.

Returns
A 1D NumPy array of length `size` where the i-th entry is the probability that a word of rank i should be sampled.
Last updated 2023-10-06 UTC.