reward_shaper.RewardShaper

class reward_shaper.RewardShaper(goal_symbolic_set, shift_bonus, depth_penalty)

Bases: object

Computes rewards for intervention sequences.

This class evaluates the quality of intervention sequences by computing: 1. Step rewards: Immediate rewards after each intervention 2. Final rewards: Overall score based on goal achievement minus penalties

Attributes:

goal: Set of goal symbolic relationships to achieve. shift_bonus: Reward bonus for each shift intervention. depth_penalty: Penalty factor per intervention (encourages shorter sequences).

__init__(goal_symbolic_set, shift_bonus, depth_penalty)

Initialize the reward shaper.

Args:

goal_symbolic_set: Set of goal relationships (e.g., {‘On(2,1)’, ‘On(3,2)’}). shift_bonus: Reward bonus for shift interventions (default: 0.25). depth_penalty: Penalty per intervention to encourage brevity (default: 0.05).

Methods

__init__(goal_symbolic_set, shift_bonus, ...)

Initialize the reward shaper.

final_reward(final_rels, interventions)

Compute final reward for the complete intervention sequence.

step_reward(prev_rels, curr_rels, action)

Compute immediate reward after applying one intervention.

final_reward(final_rels, interventions)

Compute final reward for the complete intervention sequence.

This is called at the end of a rollout to evaluate the overall quality of the intervention sequence. Combines goal achievement with intervention cost.

Args:

final_rels: Final set of symbolic relationships after all interventions. interventions: Complete list of (object, action) tuples applied.

Returns:

Final reward score (higher is better).

step_reward(prev_rels, curr_rels, action)

Compute immediate reward after applying one intervention.

This is called after each intervention to provide immediate feedback. Can reward based on action type or changes in relationships.

Args:

prev_rels: Symbolic relationships before the intervention. curr_rels: Symbolic relationships after the intervention. action: The action string (e.g., ‘left,0.005’, ‘pick-top’).

Returns:

Immediate reward for this step.