Python Data Science Jobs & Interviews
Your go-to hub for Python and Data Science—featuring questions, answers, quizzes, and interview tips to sharpen your skills and boost your career in the data-driven world.

Admin: @Hussein_Sheikho
Question 3 (Advanced):
In reinforcement learning, what does the term “policy” refer to?

A) The sequence of rewards the agent receives
B) The model’s loss function
C) The strategy used by the agent to decide actions
D) The environment's set of rules

#ReinforcementLearning #AI #DeepRL #PolicyLearning #ML
Q: How can reinforcement learning be used to simulate human-like decision-making in dynamic environments? Provide a detailed, advanced-level code example.

In reinforcement learning (RL), agents learn optimal behaviors through trial and error by interacting with an environment. To simulate human-like decision-making, we use deep reinforcement learning models like Proximal Policy Optimization (PPO), which balances exploration and exploitation while adapting to complex, real-time scenarios.
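
As a minimal sketch of how that balance is exposed in stable-baselines3 (the "CartPole-v1" environment id is only a placeholder here), PPO's clip_range limits how aggressively each update exploits recent experience, while ent_coef adds an entropy bonus that encourages exploration:

from stable_baselines3 import PPO

# clip_range caps how far one policy update can move the action probabilities
# (stable exploitation); ent_coef adds an entropy bonus (sustained exploration).
model = PPO("MlpPolicy", "CartPole-v1", clip_range=0.2, ent_coef=0.01, verbose=0)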

Human behavior involves not just reward maximization but also risk aversion, social cues, and emotional responses. We can model these using:
- State representation: Include contextual features (e.g., stress level, past rewards).
- Action space: Discrete or continuous actions mimicking human choices.
- Reward shaping: Incorporate intrinsic motivation (e.g., curiosity) alongside extrinsic rewards (see the curiosity-bonus sketch after this list).
- Policy networks: Use neural networks to approximate policies that mimic human reasoning.
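
For the reward-shaping point, one common approach is to add an intrinsic curiosity bonus proportional to the prediction error of a forward model. The sketch below is illustrative only: the forward_model callable and the beta coefficient are assumptions, not part of any particular library.

import numpy as np

def shaped_reward(extrinsic_reward, state, next_state, forward_model, beta=0.1):
    # forward_model is a hypothetical function predicting the next state from the current one
    predicted_next = forward_model(state)
    prediction_error = float(np.mean((predicted_next - next_state) ** 2))
    intrinsic_bonus = beta * prediction_error  # surprising transitions earn a curiosity bonus
    return extrinsic_reward + intrinsic_bonus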

Here’s a Python example using stable-baselines3 for PPO in a custom environment simulating human decision-making under uncertainty:

import numpy as np
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

# Define custom environment
class HumanLikeDecisionEnv(gym.Env):
    def __init__(self):
        super().__init__()
        # Actions: 0 = cautious, 1 = neutral, 2 = bold
        self.action_space = gym.spaces.Discrete(3)
        # Observation: [current reward, risk tolerance, social influence, emotion factor]
        self.observation_space = gym.spaces.Box(low=-100, high=100, shape=(4,), dtype=np.float32)
        self.state = None
        self.reset()

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.array([np.random.uniform(-50, 50),   # current reward
                               np.random.uniform(0, 10),     # risk tolerance
                               np.random.uniform(0, 1),      # social influence
                               np.random.uniform(-1, 1)],    # emotion factor
                              dtype=np.float32)
        return self.state, {}

    def step(self, action):
        # Simulate a human-like response based on the chosen action
        reward = 0.0
        if action == 0:    # Cautious: discounted gain, penalized by perceived risk
            reward += self.state[0] * 0.8 - np.abs(self.state[1]) * 0.5
        elif action == 1:  # Neutral: slightly discounted gain
            reward += self.state[0] * 0.9
        else:              # Bold: amplified gain plus random variability
            reward += self.state[0] * 1.2 + np.random.normal(0, 5)

        # Update the state with noise and simple dynamics
        self.state[0] = np.clip(self.state[0] + np.random.normal(0, 2), -100, 100)
        self.state[1] = np.clip(self.state[1] + np.random.uniform(-0.5, 0.5), 0, 10)
        self.state[2] = np.clip(self.state[2] + np.random.uniform(-0.1, 0.1), 0, 1)
        self.state[3] = np.clip(self.state[3] + np.random.normal(0, 0.2), -1, 1)

        terminated = np.random.rand() > 0.95  # Random episode termination
        return self.state, float(reward), terminated, False, {}

# Create environment
env = DummyVecEnv([lambda: HumanLikeDecisionEnv()])

# Train PPO agent
model = PPO("MlpPolicy", env, verbose=1, n_steps=128)
model.learn(total_timesteps=10000)

# Evaluate policy
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")

This toy simulation illustrates how an agent can trade off risk, emotion, and social context when making decisions. The learned policy adapts its strategy over time, loosely mimicking cognitive flexibility.
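
As a quick usage sketch (assuming the env and model objects from the code above are still in scope), you can query the learned policy for a single observation:

obs = env.reset()  # a VecEnv reset returns only the batched observations
action, _ = model.predict(obs, deterministic=True)
print(f"Chosen action: {action}")  # 0 = cautious, 1 = neutral, 2 = bold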

#ReinforcementLearning #DeepLearning #HumanBehaviorSimulation #AI #MachineLearning #PPO #Python #AdvancedAI #RL #NeuralNetworks

By: @DataScienceQ 🚀