reward_shaper.RewardShaper

class reward_shaper.RewardShaper(goal_symbolic_set, shift_bonus, depth_penalty)

Bases: object

Computes rewards for intervention sequences.

This class evaluates the quality of intervention sequences by computing:

1. Step rewards: immediate rewards after each intervention.
2. Final rewards: an overall score based on goal achievement minus penalties.

Attributes:
    goal: Set of goal symbolic relationships to achieve.
    shift_bonus: Reward bonus for each shift intervention.
    depth_penalty: Penalty factor per intervention (encourages shorter sequences).

__init__(goal_symbolic_set, shift_bonus, depth_penalty)

Initialize the reward shaper.

Args:
    goal_symbolic_set: Set of goal relationships (e.g., {'On(2,1)', 'On(3,2)'}).
    shift_bonus: Reward bonus for shift interventions (default: 0.25).
    depth_penalty: Penalty per intervention to encourage brevity (default: 0.05).
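As a rough illustration of the constructor contract described above, the following sketch defines a hypothetical stand-in class, `SketchRewardShaper`, which is not the real `reward_shaper.RewardShaper`. It assumes the documented defaults (0.25 and 0.05) are keyword defaults and that the goal set is stored under the `goal` attribute, as the attribute list suggests.

```python
class SketchRewardShaper:
    """Hypothetical stand-in mirroring the documented constructor only.

    This is NOT the real reward_shaper.RewardShaper implementation.
    """

    def __init__(self, goal_symbolic_set, shift_bonus=0.25, depth_penalty=0.05):
        # Assumption: the goal is stored as a set under the name `goal`.
        self.goal = set(goal_symbolic_set)
        # Assumption: the documented defaults are keyword defaults.
        self.shift_bonus = shift_bonus
        self.depth_penalty = depth_penalty


# Build a shaper whose goal is a three-block stack: 2 on 1, and 3 on 2.
shaper = SketchRewardShaper({"On(2,1)", "On(3,2)"})
```

If the real constructor has no defaults, as the class signature at the top of this page suggests, all three arguments would need to be passed explicitly.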

Methods

`__init__`(goal_symbolic_set, shift_bonus, ...)	Initialize the reward shaper.
`final_reward`(final_rels, interventions)	Compute final reward for the complete intervention sequence.
`step_reward`(prev_rels, curr_rels, action)	Compute immediate reward after applying one intervention.

final_reward(final_rels, interventions)

Compute final reward for the complete intervention sequence.

This method is called at the end of a rollout to evaluate the overall quality of the intervention sequence. It combines goal achievement with intervention cost.

Args:
    final_rels: Final set of symbolic relationships after all interventions.
    interventions: Complete list of (object, action) tuples applied.

Returns:
    Final reward score (higher is better).
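The exact formula is not given here, so the following is only one plausible reading of "goal achievement minus penalties". The hypothetical function `sketch_final_reward` assumes that goal achievement is the fraction of goal relationships present in `final_rels`, and that the cost is `depth_penalty` times the number of interventions. The object identifiers in the tuples are also illustrative.

```python
def sketch_final_reward(goal, final_rels, interventions, depth_penalty=0.05):
    """Hypothetical scoring rule; the real final_reward may differ."""
    goal = set(goal)
    # Assumption: achievement is the fraction of goal relationships satisfied.
    achieved = len(goal & set(final_rels)) / len(goal) if goal else 0.0
    # Assumption: every intervention costs the same flat penalty.
    cost = depth_penalty * len(interventions)
    return achieved - cost


goal = {"On(2,1)", "On(3,2)"}
final_rels = {"On(2,1)", "On(1,0)"}  # one of the two goal relationships holds
interventions = [(2, "pick-top"), (2, "left,0.005")]  # (object, action) tuples
score = sketch_final_reward(goal, final_rels, interventions)
```

Under these assumptions, reaching the same final relationships with fewer interventions always yields a higher score, which matches the stated purpose of the depth penalty.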

step_reward(prev_rels, curr_rels, action)

Compute immediate reward after applying one intervention.

This method is called after each intervention to provide immediate feedback. The reward can depend on the action type or on changes in the relationships.

Args:
    prev_rels: Symbolic relationships before the intervention.
    curr_rels: Symbolic relationships after the intervention.
    action: The action string (e.g., 'left,0.005', 'pick-top').

Returns:
    Immediate reward for this step.
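To make the two possible signals concrete (action type and relationship changes), here is a minimal sketch. Both helpers, `is_shift` and `sketch_step_reward`, are hypothetical. The sketch assumes that a shift action has the form 'direction,magnitude' (as in 'left,0.005'), that a shift earns `shift_bonus`, and that each goal relationship newly present in `curr_rels` earns 1.0. None of this is confirmed by the documentation above.

```python
def is_shift(action):
    """Assumed convention: a shift action looks like 'direction,magnitude'."""
    direction, sep, magnitude = action.partition(",")
    if not sep or not direction:
        return False  # no comma or no direction, e.g. 'pick-top'
    try:
        float(magnitude)
    except ValueError:
        return False  # text after the comma is not a number
    return True


def sketch_step_reward(goal, prev_rels, curr_rels, action, shift_bonus=0.25):
    """Hypothetical step reward; the real step_reward may differ."""
    # Goal relationships that appeared because of this intervention.
    newly_met = (set(curr_rels) - set(prev_rels)) & set(goal)
    reward = float(len(newly_met))  # assumption: 1.0 per newly met goal
    if is_shift(action):
        reward += shift_bonus
    return reward


goal = {"On(2,1)", "On(3,2)"}
after_shift = sketch_step_reward(goal, set(), {"On(2,1)"}, "left,0.005")
after_pick = sketch_step_reward(goal, {"On(2,1)"}, {"On(2,1)"}, "pick-top")
```

Counting only newly met relationships, rather than all satisfied ones, keeps a rollout from collecting the same reward at every step.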