tf.distribute.experimental.PreemptionWatcher
Watch preemption signal and store it.
tf.distribute.experimental.PreemptionWatcher()
Note: Currently only supports the Borg TPU environment with TPUClusterResolver.
This class provides a way to monitor the preemption signal during training on
TPU. It starts a background thread that watches the training process and tries
to fetch the preemption message from the coordination service. When a
preemption happens, the preempted worker writes the preemption message to the
coordination service. Thus, a non-empty preemption message indicates that a
preemption has occurred.
Users can treat the preemption message as a reliable preemption indicator and
have the coordinator reconnect to the TPU workers instead of relying on a full
restart triggered by Borg. For example, a training process with preemption
recovery looks like:
keep_running = True
preemption_watcher = None
while keep_running:
  try:
    # Initialize the TPU cluster and strategy.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    # The PreemptionWatcher must be created after connecting to the cluster.
    preemption_watcher = tf.distribute.experimental.PreemptionWatcher()
    train_model(strategy)
    keep_running = False
  except Exception as e:
    if preemption_watcher and preemption_watcher.preemption_message:
      keep_running = True
    else:
      raise e
Attributes

`preemption_message`: A variable that stores the preemption message fetched from the coordination service. If it is not None, a preemption has happened.
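Besides the try/except pattern above, a training loop can also poll the attribute between steps to react before an error surfaces. A minimal sketch, assuming `watcher` was created after connecting to the cluster; `train_step`, `num_steps`, and `manager` are hypothetical placeholders for your own step function, step count, and checkpoint manager.

for step in range(num_steps):
  train_step()
  if watcher.preemption_message is not None:
    # A worker has been preempted; checkpoint before the coordinator reconnects.
    manager.save()
    break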