Linear Thompson Sampling Policy.
Inherits From: LinearBanditPolicy, TFPolicy
tf_agents.bandits.policies.linear_thompson_sampling_policy.LinearThompsonSamplingPolicy(
    action_spec: tf_agents.typing.types.BoundedTensorSpec,
    cov_matrix: Sequence[tf_agents.typing.types.Float],
    data_vector: Sequence[tf_agents.typing.types.Float],
    num_samples: Sequence[tf_agents.typing.types.Int],
    time_step_spec: Optional[tf_agents.typing.types.TimeStep] = None,
    alpha: float = 1.0,
    eig_vals: Sequence[tf_agents.typing.types.Float] = (),
    eig_matrix: Sequence[tf_agents.typing.types.Float] = (),
    tikhonov_weight: float = 1.0,
    add_bias: bool = False,
    emit_policy_info: Sequence[Text] = (),
    observation_and_action_constraint_splitter: Optional[types.Splitter] = None,
    name: Optional[Text] = None
)
Implements the Linear Thompson Sampling Policy from the following paper:
"Thompson Sampling for Contextual Bandits with Linear Payoffs",
Shipra Agrawal, Navin Goyal, ICML 2013. The actual algorithm implemented is
Algorithm 3 from the supplementary material of the paper, available at
http://proceedings.mlr.press/v28/agrawal13-supp.pdf.
In a nutshell, the algorithm estimates reward distributions based on
parameters B_inv and f for every action. Then for each
action we sample a reward and take the argmax.
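
The sample-then-argmax step described above can be sketched in plain NumPy. This is an illustrative sketch only, not the TF-Agents implementation: the function names `sample_rewards` and `choose_action` are hypothetical, and it assumes that each arm's matrix is regularized by adding `tikhonov_weight` times the identity and that `alpha` scales the standard deviation of the sampled reward.

```python
import numpy as np


def sample_rewards(x, cov_matrices, data_vectors, alpha=1.0,
                   tikhonov_weight=1.0, rng=None):
    """Draws one sampled reward per arm for a single context vector x.

    Hypothetical helper: assumes the per-arm reward estimate is Gaussian with
    mean x^T A^{-1} b and standard deviation alpha * sqrt(x^T A^{-1} x).
    """
    rng = np.random.default_rng() if rng is None else rng
    dim = x.shape[0]
    sampled = []
    for a, b in zip(cov_matrices, data_vectors):
        # Assumption: Tikhonov regularization keeps the matrix invertible.
        a_reg = a + tikhonov_weight * np.eye(dim)
        a_inv_x = np.linalg.solve(a_reg, x)
        mean = b @ a_inv_x  # equals x^T A^{-1} b because A is symmetric
        std = alpha * np.sqrt(x @ a_inv_x)
        sampled.append(rng.normal(mean, std))
    return np.array(sampled)


def choose_action(x, cov_matrices, data_vectors, **kwargs):
    """Returns the index of the arm with the largest sampled reward."""
    return int(np.argmax(sample_rewards(x, cov_matrices, data_vectors, **kwargs)))


# Two arms, two-dimensional context, no data observed in the A matrices yet.
context = np.array([1.0, 1.0])
covs = [np.zeros((2, 2)), np.zeros((2, 2))]
vecs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
action = choose_action(context, covs, vecs, rng=np.random.default_rng(0))
```

Setting `alpha=0.0` in this sketch removes the sampling noise, so the choice reduces to a greedy argmax over the estimated mean rewards.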
| Args | 
|---|
| action_spec | TensorSpec containing action specification. | 
| cov_matrix | list of the covariance matrices A in the paper. There exists
one A matrix per arm. | 
| data_vector | list of the b vectors in the paper. The b vector is a
weighted sum of the observations, where the weight is the corresponding
reward. Each arm has its own vector b. | 
| num_samples | list of number of samples per arm. | 
| time_step_spec | A TimeStep spec of the expected time_steps. | 
| alpha | a float value used to scale the confidence intervals. | 
| eig_vals | list of eigenvalues for each covariance matrix (one per arm). | 
| eig_matrix | list of eigenvectors for each covariance matrix (one per arm). | 
| tikhonov_weight | (float) Tikhonov regularization term. | 
| add_bias | If true, a bias term will be added to the linear reward
estimation. | 
| emit_policy_info | (tuple of strings) what side information we want to get
as part of the policy info. Allowed values can be found in policy_utilities.PolicyInfo. | 
| observation_and_action_constraint_splitter | A function used for masking
valid/invalid actions with each state of the environment. The function
takes in a full observation and returns a tuple consisting of 1) the
part of the observation intended as input to the bandit policy and 2)
the mask. The mask should be a 0-1 Tensor of shape [batch_size,
num_actions]. This function should also work with a TensorSpec as
input, and should output TensorSpec objects for the observation and
mask. | 
| name | The name of this policy. | 
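
As a rough illustration of the observation_and_action_constraint_splitter argument described above, the following NumPy sketch splits a dict observation into a context and a mask, then applies the mask before taking the argmax. The dictionary keys `"context"` and `"mask"` and the helper `masked_argmax` are hypothetical names chosen for this example. How the policy applies the mask internally is not specified here, so the masking step is only one plausible approach.

```python
import numpy as np


def split_observation(observation):
    """Hypothetical splitter: returns (policy input, action mask).

    Plain dict indexing also works when `observation` is a dict of specs,
    which matches the requirement that the splitter accept specs as input.
    """
    return observation["context"], observation["mask"]


def masked_argmax(rewards, mask):
    """Picks the best allowed action per batch entry.

    `rewards` and `mask` both have shape [batch_size, num_actions]; entries
    where the mask is 0 are excluded by setting their reward to -inf.
    """
    masked = np.where(mask.astype(bool), rewards, -np.inf)
    return np.argmax(masked, axis=-1)


full_observation = {
    "context": np.array([[0.2, 0.7]]),  # batch of one 2-d context
    "mask": np.array([[1, 0, 1]]),      # action 1 is not allowed
}
context, mask = split_observation(full_observation)
sampled = np.array([[0.1, 0.9, 0.5]])   # made-up per-action sampled rewards
best = masked_argmax(sampled, mask)
```

Although action 1 has the highest sampled reward in this example, it is masked out, so action 2 is selected.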
| Attributes | 
|---|
| action_spec | Describes the TensorSpecs of the Tensors expected by step(action). action can be a single Tensor, or a nested dict, list or tuple of Tensors. | 
| collect_data_spec | Describes the Tensors written when using this policy with an environment. | 
| emit_log_probability | Whether this policy instance emits log probabilities or not. | 
| info_spec | Describes the Tensors emitted as info by action and distribution. info can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors. | 
| observation_and_action_constraint_splitter |  | 
| policy_state_spec | Describes the Tensors expected by step(_, policy_state). policy_state can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors. | 
| policy_step_spec | Describes the output of action(). | 
| time_step_spec | Describes the TimeStep tensors returned by step(). | 
| trajectory_spec | Describes the Tensors written when using this policy with an environment. | 
| validate_args | Whether action and distribution validate input and output args. | 
Methods
action
action(
    time_step: tf_agents.trajectories.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = (),
    seed: Optional[types.Seed] = None
) -> tf_agents.trajectories.PolicyStep
Generates next action given the time_step and policy_state.
| Args | 
|---|
| time_step | A TimeStep tuple corresponding to time_step_spec(). | 
| policy_state | A Tensor, or a nested dict, list or tuple of Tensors
representing the previous policy_state. | 
| seed | Seed to use if action performs sampling (optional). | 
| Returns | 
|---|
| A PolicyStep named tuple containing: action: An action Tensor matching the action_spec. state: A policy state tensor to be fed into the next call to action. info: Optional side information such as action log probabilities. | 
| Raises | 
|---|
| RuntimeError | If subclass __init__ didn't call super().__init__. | 
| ValueError or TypeError | If validate_args is True and inputs or outputs do not match time_step_spec, policy_state_spec, or policy_step_spec. | 
distribution
distribution(
    time_step: tf_agents.trajectories.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = ()
) -> tf_agents.trajectories.PolicyStep
Generates the distribution over next actions given the time_step.
| Args | 
|---|
| time_step | A TimeStep tuple corresponding to time_step_spec(). | 
| policy_state | A Tensor, or a nested dict, list or tuple of Tensors
representing the previous policy_state. | 
| Returns | 
|---|
| A PolicyStep named tuple containing: action: A tf.distribution capturing the distribution of next actions. state: A policy state tensor for the next call to distribution. info: Optional side information such as action log probabilities. | 
| Raises | 
|---|
| ValueError or TypeError | If validate_args is True and inputs or outputs do not match time_step_spec, policy_state_spec, or policy_step_spec. | 
get_initial_state
get_initial_state(
    batch_size: Optional[types.Int]
) -> tf_agents.typing.types.NestedTensor
Returns an initial state usable by the policy.
| Args | 
|---|
| batch_size | Tensor or constant: size of the batch dimension. Can be None,
in which case no batch dimension is added. | 
| Returns | 
|---|
| A nested object of type policy_state containing properly
initialized Tensors. | 
update
update(
    policy,
    tau: float = 1.0,
    tau_non_trainable: Optional[float] = None,
    sort_variables_by_name: bool = False
) -> tf.Operation
Updates the current policy from another policy.
This includes copying the variables from the other policy.
| Args | 
|---|
| policy | Another policy it can update from. | 
| tau | A float scalar in [0, 1]. When tau is 1.0 (the default), we do a hard
update. This is used for trainable variables. | 
| tau_non_trainable | A float scalar in [0, 1] for non_trainable variables.
If None, will copy from tau. | 
| sort_variables_by_name | A bool. When True, the variables are sorted by name
before the update. | 
| Returns | 
|---|
| A TF op that performs the update. |
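
To illustrate the role of tau in update, here is a minimal NumPy sketch of a soft variable update. It assumes the common convention that each target variable becomes (1 - tau) * target + tau * source, which is consistent with tau = 1.0 being a hard copy but is not spelled out in the reference above. The function `soft_update` is a hypothetical name, not part of the TF-Agents API.

```python
import numpy as np


def soft_update(target_vars, source_vars, tau=1.0):
    """Blends source variables into target variables.

    Assumed convention: new_target = (1 - tau) * target + tau * source,
    so tau = 1.0 copies the source and tau = 0.0 leaves the target as is.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_vars, source_vars)]


target = [np.array([0.0, 0.0])]
source = [np.array([4.0, 8.0])]
hard = soft_update(target, source, tau=1.0)   # hard update: copy of source
soft = soft_update(target, source, tau=0.25)  # moves a quarter of the way
```

In this sketch a separate tau_non_trainable would simply mean calling `soft_update` a second time on the non-trainable variables with a different tau.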