tf.distribute.experimental.PreemptionWatcher
Watch preemption signal and store it.
tf.distribute.experimental.PreemptionWatcher()
Notice: Currently, only the Borg TPU environment with TPUClusterResolver is supported.
This class provides a way to monitor the preemption signal during training on
TPU. It starts a background thread that watches the training process and tries
to fetch the preemption message from the coordination service. When a
preemption happens, the preempted worker writes the preemption message to the
coordination service, so a non-empty preemption message means a preemption has
happened.
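As a minimal sketch of this polling mechanism (distinct from the full recovery loop shown below), the watcher can be created once the cluster connection is established and its preemption_message attribute checked between training steps. Here train_one_epoch, ckpt_manager, strategy, and num_epochs are hypothetical placeholders, not part of this API:

# Minimal polling sketch; assumes the TPU cluster connection is already set up.
# `train_one_epoch`, `ckpt_manager`, `strategy`, and `num_epochs` are
# hypothetical placeholders.
watcher = tf.distribute.experimental.PreemptionWatcher()

for epoch in range(num_epochs):
  train_one_epoch(strategy)
  # A non-empty preemption message means a worker has been preempted.
  if watcher.preemption_message:
    ckpt_manager.save()  # Save progress before the worker goes away.
    break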
Users can treat the preemption message as a reliable preemption indicator and
have the coordinator reconnect to the TPU workers instead of going through a
full restart triggered by Borg. For example, a training process with
preemption recovery looks like:
keep_running = True
preemption_watcher = None
while keep_running:
  try:
    # Initialize the TPU cluster and strategy.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    # PreemptionWatcher must be created after connecting to the cluster.
    preemption_watcher = tf.distribute.experimental.PreemptionWatcher()
    train_model(strategy)
    keep_running = False
  except Exception as e:
    if preemption_watcher and preemption_watcher.preemption_message:
      preemption_watcher.block_until_worker_exit()
      keep_running = True
    else:
      raise e
Attributes

preemption_message
  A variable that stores the preemption message fetched from the coordination
  service. If it is not None, a preemption has happened.

platform
  A PlatformDevice indicating the current job's platform. Refer to
  failure_handling_util.py for the definition of the enum class PlatformDevice.
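A minimal sketch of inspecting these attributes right after the watcher is created; the watcher creation and the two attribute names come from this page, while the print statements are illustrative assumptions:

# Sketch: inspect the watcher's attributes after creation.
watcher = tf.distribute.experimental.PreemptionWatcher()

# `platform` reports the current job's platform as a PlatformDevice enum value.
print("Current platform:", watcher.platform)

# A non-None `preemption_message` indicates a preemption has happened.
if watcher.preemption_message is not None:
  print("Preemption detected:", watcher.preemption_message)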
Methods
block_until_worker_exit
block_until_worker_exit()
Block coordinator until workers exit.
In some rare cases, another error could be raised during the preemption grace
period. This causes the coordinator to reconnect to the same TPU workers,
which will be killed later, and prevents the coordinator from reconnecting to
new TPU workers, falling back to a hard restart. To avoid this situation, this
method blocks the coordinator from reconnecting until the workers exit. This
method is a no-op on non-TPU platforms.