Python Data Science Jobs & Interviews
Your go-to hub for Python and Data Science—featuring questions, answers, quizzes, and interview tips to sharpen your skills and boost your career in the data-driven world.

Admin: @Hussein_Sheikho
Q: How can reinforcement learning be used to simulate human-like decision-making in dynamic environments? Provide a detailed, advanced-level code example.

In reinforcement learning (RL), agents learn optimal behaviors through trial and error by interacting with an environment. To simulate human-like decision-making, we can use a deep reinforcement learning algorithm such as Proximal Policy Optimization (PPO), which balances exploration and exploitation while adapting to complex, real-time scenarios.
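
To make that "balance" concrete, PPO limits how far each policy update can move by clipping the probability ratio between the new and old policy. Here is a minimal NumPy sketch of that clipped surrogate objective (the function name and toy numbers are illustrative, not part of the example further below):

import numpy as np

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    # ratio: pi_new(a|s) / pi_old(a|s) for the sampled actions
    # advantage: estimated advantage of those actions
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # Taking the element-wise minimum discourages updates that move the policy too far at once
    return np.minimum(unclipped, clipped).mean()

# Toy check: the large ratio (1.5) is capped at 1 + clip_eps before it can dominate the update
print(ppo_clipped_objective(np.array([1.5, 0.9]), np.array([1.0, -0.5])))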

Human behavior involves not just reward maximization but also risk aversion, social cues, and emotional responses. We can model these using:
- State representation: Include contextual features (e.g., stress level, past rewards).
- Action space: Discrete or continuous actions mimicking human choices.
- Reward shaping: Incorporate intrinsic motivation (e.g., curiosity) and extrinsic rewards (see the sketch after this list).
- Policy networks: Use neural networks to approximate policies that mimic human reasoning.
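
A quick sketch of the reward-shaping idea from the list above: combine the environment's extrinsic reward with a small curiosity bonus based on prediction error. The shaped_reward helper and its weight beta are illustrative assumptions, not part of stable-baselines3:

import numpy as np

def shaped_reward(extrinsic, predicted_next_state, actual_next_state, beta=0.1):
    # Intrinsic "curiosity" term: reward the agent for reaching states that its
    # internal forward model predicts poorly (a simple prediction-error proxy)
    curiosity = np.linalg.norm(actual_next_state - predicted_next_state)
    return extrinsic + beta * curiosity

# Example: a modest extrinsic reward plus a bonus for a surprising transition
print(shaped_reward(1.0,
                    predicted_next_state=np.array([0.0, 0.5]),
                    actual_next_state=np.array([0.3, 0.9])))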

Here’s a Python example using stable-baselines3 for PPO in a custom environment simulating human decision-making under uncertainty:

import numpy as np
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

# Define custom environment
class HumanLikeDecisionEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.action_space = gym.spaces.Discrete(3)  # [0: cautious, 1: neutral, 2: bold]
        self.observation_space = gym.spaces.Box(low=-100, high=100, shape=(4,), dtype=np.float32)
        self.state = None
        self.reset()

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.array([np.random.uniform(-50, 50),  # current reward
                               np.random.uniform(0, 10),    # risk tolerance
                               np.random.uniform(0, 1),     # social influence
                               np.random.uniform(-1, 1)],   # emotion factor
                              dtype=np.float32)
        return self.state, {}

    def step(self, action):
        # Simulate human-like response based on action
        reward = 0.0
        if action == 0:    # Cautious
            reward += self.state[0] * 0.8 - np.abs(self.state[1]) * 0.5
        elif action == 1:  # Neutral
            reward += self.state[0] * 0.9
        else:              # Bold
            reward += self.state[0] * 1.2 + np.random.normal(0, 5)

        # Update state with noise and dynamics
        self.state[0] = np.clip(self.state[0] + np.random.normal(0, 2), -100, 100)
        self.state[1] = np.clip(self.state[1] + np.random.uniform(-0.5, 0.5), 0, 10)
        self.state[2] = np.clip(self.state[2] + np.random.uniform(-0.1, 0.1), 0, 1)
        self.state[3] = np.clip(self.state[3] + np.random.normal(0, 0.2), -1, 1)

        terminated = np.random.rand() > 0.95  # Random episode termination
        return self.state, float(reward), terminated, False, {}

# Create environment
env = DummyVecEnv([lambda: HumanLikeDecisionEnv()])  # instantiate the env inside the lambda

# Train PPO agent
model = PPO("MlpPolicy", env, verbose=1, n_steps=128)
model.learn(total_timesteps=10000)

# Evaluate policy
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")

This simulation captures how humans balance risk, emotion, and social context in decisions. The model learns to adapt its strategy over time—mimicking cognitive flexibility.

#ReinforcementLearning #DeepLearning #HumanBehaviorSimulation #AI #MachineLearning #PPO #Python #AdvancedAI #RL #NeuralNetworks

By: @DataScienceQ 🚀