tf.keras.optimizers.Adadelta
Optimizer that implements the Adadelta algorithm.
Inherits From: Optimizer
tf.keras.optimizers.Adadelta(
learning_rate=0.001, rho=0.95, epsilon=1e-07, name='Adadelta',
**kwargs
)
Adadelta optimization is a stochastic gradient descent method that is based on
an adaptive learning rate per dimension, addressing two drawbacks:
- The continual decay of learning rates throughout training
- The need for a manually selected global learning rate
Adadelta is a more robust extension of Adagrad that adapts learning rates
based on a moving window of gradient updates, instead of accumulating all
past gradients. This way, Adadelta continues learning even when many updates
have been done. Compared to Adagrad, the original version of Adadelta does not
require an initial learning rate to be set. In this version, the initial
learning rate can be set, as in most other Keras optimizers.
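As an illustration (not part of the original page), a typical way to use this optimizer is to pass it to model.compile; the tiny model below is an arbitrary placeholder:

import tensorflow as tf

# Illustrative placeholder model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])

# Default arguments; pass learning_rate=1.0 to match the original paper.
optimizer = tf.keras.optimizers.Adadelta(learning_rate=0.001, rho=0.95, epsilon=1e-07)
model.compile(optimizer=optimizer, loss="mse")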
According to section 4.3 ("Effective Learning rates"), near the end of
training, step sizes converge to 1, which is effectively a high learning
rate that would cause divergence. This occurs only near the end of
training, when gradients and step sizes are small: the epsilon constant
in the numerator and denominator then dominates past gradients and
parameter updates, driving the effective learning rate toward 1.
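To make this effect concrete, here is a minimal NumPy sketch of the Adadelta update rule from the referenced paper; the variable names are illustrative and do not correspond to the Keras internals:

import numpy as np

def adadelta_step(param, grad, acc_grad, acc_delta, rho=0.95, epsilon=1e-07, lr=1.0):
    # Decaying average of squared gradients (denominator accumulator).
    acc_grad = rho * acc_grad + (1.0 - rho) * grad ** 2
    # The per-dimension step is the ratio of the two RMS terms times the gradient.
    # When both accumulators are tiny late in training, epsilon dominates the
    # numerator and the denominator, so the ratio tends toward 1 (section 4.3).
    delta = -np.sqrt(acc_delta + epsilon) / np.sqrt(acc_grad + epsilon) * grad
    # Decaying average of squared parameter updates (numerator accumulator).
    acc_delta = rho * acc_delta + (1.0 - rho) * delta ** 2
    return param + lr * delta, acc_grad, acc_delta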
According to section 4.4 ("Speech Data"), where a large neural network with
4 hidden layers was trained on a corpus of US English data, ADADELTA was
used with 100 network replicas. The epsilon used was 1e-6 with rho=0.95,
which converged faster than ADAGRAD, by the following construction:
def __init__(self, lr=1.0, rho=0.95, epsilon=1e-6, decay=0., **kwargs):
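In the current API, those speech-data settings correspond roughly to the following sketch (the 100-replica training setup itself is not reproduced here):

import tensorflow as tf

# Approximate the section 4.4 settings: lr=1.0 (paper form), rho=0.95, epsilon=1e-6.
optimizer = tf.keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95, epsilon=1e-6)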
Args

learning_rate: A Tensor, floating point value, or a schedule that is a
  tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.
  To match the exact form in the original paper, use 1.0.
rho: A Tensor or a floating point value. The decay rate.
epsilon: A Tensor or a floating point value. A constant epsilon used to
  better condition the gradient update.
name: Optional name prefix for the operations created when applying
  gradients. Defaults to "Adadelta".
**kwargs: Keyword arguments. Allowed to be one of "clipnorm" or "clipvalue".
  "clipnorm" (float) clips gradients by norm; "clipvalue" (float) clips
  gradients by value.
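For example (a sketch, not taken from the page itself), gradient clipping can be requested through these keyword arguments:

import tensorflow as tf

# Clip each gradient tensor so its norm does not exceed 1.0 before the update.
optimizer = tf.keras.optimizers.Adadelta(learning_rate=1.0, clipnorm=1.0)

# Alternatively, clip each gradient element to the range [-0.5, 0.5].
# optimizer = tf.keras.optimizers.Adadelta(learning_rate=1.0, clipvalue=0.5)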
Reference:
- Zeiler, 2012 (http://arxiv.org/abs/1212.5701)
Raises

ValueError: in case of any invalid argument.