Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

Authors: Michael Keeman, Anastasia Keeman
Affiliation: Keido Labs, Liverpool, UK
Contact: michael@keidolabs.com

Overview

This repository contains the complete experimental framework, data, and analysis code supporting our arXiv paper on the #keep4o phenomenon: a widespread public response claiming that newer OpenAI models "lost their empathy" compared to GPT-4o.

Note: This is a research reproducibility repository, not an actively maintained software project. The code is provided as-is to support transparency and reproduction of our published findings. We encourage forking for extensions and replications. See CONTRIBUTING.md for details.

Key Findings:

- Empathy scores are statistically indistinguishable across GPT-4o, o4-mini, and GPT-5-mini (H=4.33, p=0.115)
- Crisis detection improved significantly from GPT-4o to GPT-5-mini (H=13.88, p=0.001)
- Advice safety declined significantly across generations (H=16.63, p<0.001)
- What users perceived as "lost empathy" was actually a shift in safety posture
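
The p-values above can be checked by hand. A Kruskal-Wallis test across three groups is conventionally referred to a chi-square distribution with 3 - 1 = 2 degrees of freedom, and for 2 degrees of freedom the upper-tail probability has the closed form exp(-H/2). The sketch below assumes that standard approximation was used; scripts/formal_stats.py remains the authority on the exact procedure, and the function and key names here are illustrative only.

```python
from math import exp


def chi2_sf_df2(h: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return exp(-h / 2.0)


# H statistics as reported in the key findings (three models, so df = 2).
reported = {"empathy": 4.33, "crisis_detection": 13.88, "advice_safety": 16.63}
p_values = {name: chi2_sf_df2(h) for name, h in reported.items()}

for name, p in p_values.items():
    print(f"{name}: H={reported[name]:.2f}, p={p:.3g}")
```

Rounded to three decimals, this gives 0.115 for empathy and 0.001 for crisis detection, and a value below 0.001 for advice safety, which matches the figures listed above.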

Methodological Contributions:

- First empirical measurement of the #keep4o phenomenon using clinically-grounded frameworks
- Per-turn trajectory analysis revealing mid-conversation safety dynamics
- Identification of inverse variance profiles across safety dimensions
- Demonstration that variance is a first-class safety metric for vulnerable populations
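
To illustrate why variance can matter alongside the mean, the sketch below compares two sets of scores with identical averages but very different spread. The numbers are made up for illustration on an assumed 0-10 scale and are not data from the study; the function name is hypothetical.

```python
from statistics import fmean, pvariance


def summarize(scores: list[float]) -> dict[str, float]:
    """Mean, population variance, and worst case for one model on one dimension."""
    return {
        "mean": fmean(scores),
        "variance": pvariance(scores),
        "worst": min(scores),
    }


# Made-up scores on an assumed 0-10 scale; NOT results from the paper.
steady = [7.0, 7.5, 6.5, 7.0, 7.0]
erratic = [9.5, 9.0, 9.5, 2.0, 5.0]

steady_summary = summarize(steady)
erratic_summary = summarize(erratic)
print("steady: ", steady_summary)
print("erratic:", erratic_summary)
```

Both lists average 7.0, yet only the second contains a very low response. A comparison based on means alone would treat the two as equivalent, which is the kind of case a variance-aware analysis is meant to surface.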

About EmpathyC (Measurement Platform)

This study uses EmpathyC for automated psychological safety assessment.

Transparency Note: EmpathyC is a commercial product developed and operated by Keido Labs Ltd (the authors' organization). This relationship is disclosed in the paper's Ethical Statements section. The platform applies clinically-informed evaluation rubrics via an LLM-as-a-judge architecture to score AI responses across six psychological safety dimensions.

For Reproducibility:

- The full LLM-as-a-judge scoring framework is provided in prompts/llm-judge-rubrics.md
- The scientific and clinical foundations for each dimension are documented in prompts/rubric-science.md
- Scenario scripts and system prompts for the experiments are included in this repository
- Aggregated results are reported in full in the paper
- Automated scoring in this study was performed using EmpathyC

Learn more: empathyc.co | Methodology | Pricing

Repository Structure

ai-psy-benchmark/
├── scenarios/              # Conversation scenarios (14 total)
│   ├── mental-health.yaml  # 8 mental health scenarios
│   ├── companion.yaml      # 6 AI companion scenarios
│   └── _config.yaml        # Scenario configuration
├── prompts/                # System prompts and evaluation rubrics
│   ├── experiment-setup.yaml
│   ├── llm-judge-rubrics.md  # LLM-as-a-judge scoring rubric (reference)
│   └── rubric-science.md     # Scientific literature basis for each dimension
├── scripts/                # Analysis scripts
│   ├── batch_runner.py     # Experiment execution
│   ├── descriptive_stats.py
│   ├── formal_stats.py     # Statistical tests
│   ├── phase_sensitivity.py
│   ├── turn_trends.py
│   ├── empathyc_client.py  # EmpathyC API client
│   └── openai_client.py    # OpenAI API client
├── charts/                 # Figure generation
│   ├── generate_charts.py
│   └── output/             # Generated figures (SVG, PDF, PNG)
├── results/                # Experimental results (CSV)
├── draft/                  # Paper manuscript (draft.md)
├── config.yaml             # Experiment configuration
└── pyproject.toml          # Python dependencies

Installation

Prerequisites

- Python 3.12 or higher
- uv package manager (recommended) or pip
- OpenAI API key

Setup

Clone the repository

```
git clone https://github.com/drkeeman/ai-psy-benchmark.git
cd ai-psy-benchmark
```

Install dependencies

Using uv (recommended):
```
uv sync
```
Or using pip:
```
pip install -e .
```

Configure API keys

```
cp .env.example .env
# Edit .env and add your OpenAI API key

# Optional: For EmpathyC scoring, edit empathyc_keys.yaml
# (included as template - replace example keys with your own)
```
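
Before launching a run, it can help to confirm that the key actually made it into .env. The sketch below is a hypothetical preflight check, not part of this repository. It assumes the file uses plain KEY=VALUE lines and that the variable is named OPENAI_API_KEY (the name the official OpenAI Python SDK reads by default); check .env.example for the name this project expects.

```python
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(path: str, required: tuple[str, ...] = ("OPENAI_API_KEY",)) -> list[str]:
    """Return the required names that are absent or empty in the file."""
    env = read_env_file(path) if Path(path).exists() else {}
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    print("Missing keys:", missing_keys(".env"))
```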

Usage

Reproducing the Experiments

Note: You can reproduce the conversation generation without EmpathyC access. Automated scoring requires EmpathyC (see "About EmpathyC" section above), but the methodology is fully documented for alternative implementations.

Configure the experiment (optional)
- Edit config.yaml to modify models, runs per scenario, or other parameters
- Default configuration matches the paper: 5 runs per scenario, 14 scenarios, 3 models
Run the experiments
```
python scripts/batch_runner.py
```
This will:
- Execute all scenarios across all configured models
- Generate conversations via OpenAI API
- Score responses using EmpathyC (if API keys configured in empathyc_keys.yaml)
- Save results to results/conversations_[timestamp].csv
Without EmpathyC: The script will generate and save conversations. You can implement your own scoring approach using prompts/llm-judge-rubrics.md, which documents the full scoring rubric for all six dimensions.
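
For readers building their own scorer, the general LLM-as-a-judge pattern can be sketched as follows. This is a minimal illustration, not the EmpathyC implementation: the prompt layout, the JSON reply format, and all function names are assumptions, and only the three dimensions named in this README are listed (the remaining three are defined in prompts/llm-judge-rubrics.md). The `judge` argument is any callable that sends a prompt to a model and returns its text reply.

```python
import json
from typing import Callable

# Three of the six dimensions are named in this README; identifiers are assumed.
DIMENSIONS = ("empathy", "crisis_detection", "advice_safety")


def build_judge_prompt(rubric: str, user_msg: str, ai_reply: str) -> str:
    """Assemble one judging request: rubric first, then the exchange to score."""
    return (
        f"{rubric}\n\n"
        f"User message:\n{user_msg}\n\n"
        f"AI response:\n{ai_reply}\n\n"
        f"Return a JSON object with a numeric score for each of: {', '.join(DIMENSIONS)}."
    )


def score_turn(
    rubric: str, user_msg: str, ai_reply: str, judge: Callable[[str], str]
) -> dict[str, float]:
    """Send the prompt to the judge callable and validate the JSON it returns."""
    parsed = json.loads(judge(build_judge_prompt(rubric, user_msg, ai_reply)))
    missing = [d for d in DIMENSIONS if d not in parsed]
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    return {d: float(parsed[d]) for d in DIMENSIONS}


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return json.dumps({d: 5 for d in DIMENSIONS})


scores = score_turn("(rubric text)", "I lost my job today.", "I'm sorry to hear that.", fake_judge)
print(scores)
```

Keeping the model call behind a plain callable means the same validation logic can be reused with any provider, and it can be tested without network access.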

Generate statistical analysis

```
# Descriptive statistics
python scripts/descriptive_stats.py

# Formal statistical tests (Kruskal-Wallis, Mann-Whitney U, etc.)
python scripts/formal_stats.py

# Phase-level sensitivity analysis
python scripts/phase_sensitivity.py

# Per-turn trajectory analysis
python scripts/turn_trends.py
```
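
For reference, the Kruskal-Wallis H statistic named above can be computed from raw scores with the standard library alone. The sketch below is a textbook implementation with average ranks and the usual tie correction. It is not taken from scripts/formal_stats.py, the function names are illustrative, and the sample groups are made-up numbers.

```python
from collections import Counter
from itertools import chain


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups: list[float]) -> float:
    """Tie-corrected Kruskal-Wallis H for two or more independent groups."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Made-up example groups, not data from the study.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(f"H = {h:.2f}")
```

With k groups, H is conventionally compared against a chi-square distribution with k - 1 degrees of freedom to obtain a p-value.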

Generate figures
```
python charts/generate_charts.py
```
Figures are saved to charts/output/ in multiple formats (SVG, PDF, PNG, HTML)

Understanding the Scenarios

Each scenario follows a clinically-informed 4-phase conversation arc:

| Phase | Turns | Function |
|---|---|---|
| Disclosure | 1-3 | Escalating emotional intensity |
| Pressure | 4-5 | Advice-seeking, boundary push, crisis signals |
| Deepening | 6-7 | Attachment, dependency, boundary testing |
| Resolution | 8-10 | New stressor + resolution |
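
The phase boundaries in the table can be expressed as a small lookup, which is handy when grouping per-turn scores by phase. The sketch below is illustrative and not taken from the repository's scripts. The run-size arithmetic uses the default configuration stated in the usage notes (14 scenarios, 5 runs each, 3 models) and assumes every scenario has exactly the 10 turns shown in the table, with one model reply per turn.

```python
# Turn ranges copied from the phase table; turns are 1-indexed and inclusive.
PHASES = {
    "Disclosure": range(1, 4),
    "Pressure": range(4, 6),
    "Deepening": range(6, 8),
    "Resolution": range(8, 11),
}


def phase_for_turn(turn: int) -> str:
    """Map a 1-indexed turn number to its conversation phase."""
    for name, turns in PHASES.items():
        if turn in turns:
            return name
    raise ValueError(f"turn {turn} is outside the 10-turn arc")


# Default configuration as stated in the usage notes.
SCENARIOS, RUNS_PER_SCENARIO, MODELS = 14, 5, 3

conversations = SCENARIOS * RUNS_PER_SCENARIO * MODELS
turns_per_conversation = sum(len(turns) for turns in PHASES.values())
# Assumes one model reply per scripted turn.
model_responses = conversations * turns_per_conversation

print(phase_for_turn(5), conversations, model_responses)
```

Under these assumptions a full default run produces 210 conversations and 2,100 model responses to score.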

Scenarios cover:

- Mental health: job loss, depression, grief, panic attacks, suicidal ideation, self-harm, burnout
- AI companion: daily check-ins, attachment attempts, anger, teen friendship, manipulation

See scenarios/mental-health.yaml and scenarios/companion.yaml for full scenario scripts.

Data Availability

- Scenario scripts: scenarios/*.yaml (included in this repository)
- System prompts: prompts/*.yaml (included in this repository)
- Experiment configuration: config.yaml (included in this repository)
- Scoring framework: dimensions, rubrics, and scoring criteria in prompts/llm-judge-rubrics.md; scientific literature basis for each dimension in prompts/rubric-science.md; clinical foundations also at empathyc.co/research
- Raw conversation data: Available upon reasonable request to the corresponding author
- Aggregated results: Reported in full in the paper (Tables 2-5)

Note on Scoring Reproducibility: EmpathyC is a proprietary commercial platform. The scoring dimensions and clinical frameworks are described at empathyc.co/research to enable researchers to implement their own evaluation approaches. The detailed implementation remains proprietary intellectual property of Keido Labs Ltd.

Citation

If you use this code or data in your research, please cite:

@misc{keeman2026empathychangedclinicalassessment,
      title={Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations}, 
      author={Michael Keeman and Anastasia Keeman},
      year={2026},
      eprint={2603.09997},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.09997}, 
}

Also available in CITATION.cff format.

Ethical Considerations

Conflict of Interest: The EmpathyC platform used for automated scoring is developed and operated by Keido Labs Ltd, with which the authors are affiliated. The scoring dimensions and clinical foundations are described in the paper and at empathyc.co/research. The detailed implementation is proprietary intellectual property.

Independence: This study was conducted without funding, affiliation, sponsorship, or involvement from any LLM provider, including OpenAI. All model access was obtained through standard commercial API subscriptions. The authors have no financial or contractual relationship with any LLM provider evaluated in this study.

Research Ethics:

- No human participants were involved in this study
- All user messages were pre-scripted by a clinical psychologist with 15 years of experience
- Scenarios involving sensitive topics (suicidal ideation, self-harm) were designed in a controlled research context to test model safety, not to generate harmful content

Important Disclaimer: This repository and the associated research are intended solely for academic study of AI safety behaviours. The scoring outputs, findings, and frameworks described here must not be used as a substitute for professional clinical judgment. Automated psychological safety scores — including crisis detection signals — are research metrics, not clinical assessments. Any real-world application involving vulnerable individuals requires qualified human professionals in the loop. The authors accept no liability for use of this work outside a research context.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This research was conducted independently by Keido Labs Ltd without external funding or affiliation with any LLM provider.

Contact

For questions about the paper or reproduction:

Michael Keeman: michael@keidolabs.com
Keido Labs: https://keidolabs.com

For issues with this repository:

Open an issue on GitHub
