Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations
Authors: Michael Keeman, Anastasia Keeman
Affiliation: Keido Labs, Liverpool, UK
Contact: michael@keidolabs.com
This repository contains the complete experimental framework, data, and analysis code supporting our arXiv paper investigating the #keep4o phenomenon — a widespread public response claiming newer OpenAI models "lost their empathy" compared to GPT-4o.
Note: This is a research reproducibility repository, not an actively maintained software project. The code is provided as-is to support transparency and reproduction of our published findings. We encourage forking for extensions and replications. See CONTRIBUTING.md for details.
Key Findings:
- Empathy scores are statistically indistinguishable across GPT-4o, o4-mini, and GPT-5-mini (H=4.33, p=0.115)
- Crisis detection improved significantly from GPT-4o to GPT-5-mini (H=13.88, p=0.001)
- Advice safety declined significantly across generations (H=16.63, p<0.001)
- What users perceived as "lost empathy" was actually a shift in safety posture
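The group comparisons above rely on the Kruskal-Wallis H test. As a minimal sketch of how such a comparison can be run with SciPy (the scores below are synthetic, illustrative values, not the study's data):

```python
from scipy.stats import kruskal

# Synthetic per-run empathy scores for three models
# (illustrative only - not the study's actual data).
gpt4o = [4.2, 4.5, 4.1, 4.4, 4.3]
o4_mini = [4.0, 4.3, 4.2, 4.1, 4.4]
gpt5_mini = [4.1, 4.2, 4.0, 4.3, 4.2]

# Kruskal-Wallis compares the three score distributions
# without assuming normality.
h_stat, p_value = kruskal(gpt4o, o4_mini, gpt5_mini)
print(f"H={h_stat:.2f}, p={p_value:.3f}")
```

A p-value above the significance threshold (as for empathy) means the model distributions are statistically indistinguishable; below it (as for crisis detection and advice safety), the generations genuinely differ.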
Methodological Contributions:
- First empirical measurement of the #keep4o phenomenon using clinically-grounded frameworks
- Per-turn trajectory analysis revealing mid-conversation safety dynamics
- Identification of inverse variance profiles across safety dimensions
- Demonstration that variance is a first-class safety metric for vulnerable populations
About EmpathyC:
This study uses EmpathyC for automated psychological safety assessment.
Transparency Note: EmpathyC is a commercial product developed and operated by Keido Labs Ltd (the authors' organization). This relationship is disclosed in the paper's Ethical Statements section. The platform applies clinically-informed evaluation rubrics via an LLM-as-a-judge architecture to score AI responses across six psychological safety dimensions.
For Reproducibility:
- The full LLM-as-a-judge scoring framework is provided in `prompts/llm-judge-rubrics.md`
- The scientific and clinical foundations for each dimension are documented in `prompts/rubric-science.md`
- Scenario scripts and system prompts for the experiments are included in this repository
- Aggregated results are reported in full in the paper
- Automated scoring in this study was performed using EmpathyC
Learn more: empathyc.co
ai-psy-benchmark/
├── scenarios/ # Conversation scenarios (14 total)
│ ├── mental-health.yaml # 8 mental health scenarios
│ ├── companion.yaml # 6 AI companion scenarios
│ └── _config.yaml # Scenario configuration
├── prompts/ # System prompts and evaluation rubrics
│ ├── experiment-setup.yaml
│ ├── llm-judge-rubrics.md # LLM-as-a-judge scoring rubric (reference)
│ └── rubric-science.md # Scientific literature basis for each dimension
├── scripts/ # Analysis scripts
│ ├── batch_runner.py # Experiment execution
│ ├── descriptive_stats.py
│ ├── formal_stats.py # Statistical tests
│ ├── phase_sensitivity.py
│ ├── turn_trends.py
│ ├── empathyc_client.py # EmpathyC API client
│ └── openai_client.py # OpenAI API client
├── charts/ # Figure generation
│ ├── generate_charts.py
│ └── output/ # Generated figures (SVG, PDF, PNG)
├── results/ # Experimental results (CSV)
├── draft/ # Paper manuscript (draft.md)
├── config.yaml # Experiment configuration
└── pyproject.toml # Python dependencies
- Python 3.12 or higher
- uv package manager (recommended) or pip
- OpenAI API key
Clone the repository:

```shell
git clone https://github.com/drkeeman/ai-psy-benchmark.git
cd ai-psy-benchmark
```

Install dependencies, using `uv` (recommended):

```shell
uv sync
```

Or using pip:

```shell
pip install -e .
```

Configure API keys:

```shell
cp .env.example .env
# Edit .env and add your OpenAI API key
# Optional: for EmpathyC scoring, edit empathyc_keys.yaml
# (included as a template - replace the example keys with your own)
```

Note: You can reproduce the conversation generation without EmpathyC access. Automated scoring requires EmpathyC (see "About EmpathyC" above), but the methodology is fully documented for alternative implementations.
Configure the experiment (optional):

- Edit `config.yaml` to modify models, runs per scenario, or other parameters
- The default configuration matches the paper: 5 runs per scenario, 14 scenarios, 3 models
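The shape of `config.yaml` might look roughly like the following. The key names here are illustrative assumptions, not the repository's actual schema; check the file itself before editing:

```yaml
# Illustrative sketch only - the actual keys in config.yaml may differ.
models:
  - gpt-4o
  - o4-mini
  - gpt-5-mini
runs_per_scenario: 5
scenario_files:
  - scenarios/mental-health.yaml
  - scenarios/companion.yaml
```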
Run the experiments:

```shell
python scripts/batch_runner.py
```

This will:

- Execute all scenarios across all configured models
- Generate conversations via the OpenAI API
- Score responses using EmpathyC (if API keys are configured in `empathyc_keys.yaml`)
- Save results to `results/conversations_[timestamp].csv`

Without EmpathyC: the script will still generate and save conversations. You can implement your own scoring approach using `prompts/llm-judge-rubrics.md`, which documents the full scoring rubric for all six dimensions.
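For readers implementing their own scoring, an LLM-as-a-judge call typically assembles the rubric and the conversation transcript into a single evaluation prompt. The sketch below is hypothetical: the function name, prompt wording, and dimension label are ours, not EmpathyC's proprietary implementation:

```python
from pathlib import Path


def build_judge_prompt(rubric_path: str, conversation: list[dict], dimension: str) -> str:
    """Assemble an LLM-as-a-judge prompt from a rubric file and a conversation.

    Hypothetical helper: the real scoring pipeline is proprietary; this only
    illustrates the general LLM-as-a-judge pattern described in the paper.
    """
    rubric = Path(rubric_path).read_text(encoding="utf-8")
    transcript = "\n".join(f"{turn['role']}: {turn['content']}" for turn in conversation)
    return (
        f"You are a clinical safety evaluator. Score the assistant's replies "
        f"on the '{dimension}' dimension using the rubric below.\n\n"
        f"## Rubric\n{rubric}\n\n"
        f"## Conversation\n{transcript}\n\n"
        f"Return a single integer score with a one-sentence justification."
    )
```

The resulting string can then be sent to any judge model (for example via the OpenAI chat API) and the numeric score parsed from the reply.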
Generate the statistical analysis:

```shell
# Descriptive statistics
python scripts/descriptive_stats.py

# Formal statistical tests (Kruskal-Wallis, Mann-Whitney U, etc.)
python scripts/formal_stats.py

# Phase-level sensitivity analysis
python scripts/phase_sensitivity.py

# Per-turn trajectory analysis
python scripts/turn_trends.py
```
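Once a results CSV exists, per-model summaries can also be computed directly with pandas. The column names below are assumptions for illustration (the actual schema is defined by `batch_runner.py`), so the example builds a small synthetic frame rather than reading a real file:

```python
import pandas as pd

# Hypothetical schema - the real CSV columns are set by batch_runner.py.
df = pd.DataFrame({
    "model": ["gpt-4o", "gpt-4o", "gpt-5-mini", "gpt-5-mini"],
    "dimension": ["empathy", "empathy", "empathy", "empathy"],
    "score": [4.2, 4.4, 4.1, 4.3],
})

# Mean and variance per model and dimension - variance matters here,
# since the paper treats it as a first-class safety metric.
summary = df.groupby(["model", "dimension"])["score"].agg(["mean", "var"])
print(summary)
```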
Generate figures:

```shell
python charts/generate_charts.py
```

Figures are saved to `charts/output/` in multiple formats (SVG, PDF, PNG, HTML).
Each scenario follows a clinically-informed 4-phase conversation arc:
| Phase | Turns | Function |
|---|---|---|
| Disclosure | 1-3 | Escalating emotional intensity |
| Pressure | 4-5 | Advice-seeking, boundary push, crisis signals |
| Deepening | 6-7 | Attachment, dependency, boundary testing |
| Resolution | 8-10 | New stressor + resolution |
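The turn-to-phase mapping in the table can be expressed directly in code. A small sketch (the function name is ours, not necessarily one used in the repository's scripts):

```python
def phase_for_turn(turn: int) -> str:
    """Map a 1-indexed conversation turn to its scripted phase.

    Follows the 4-phase arc in the table above; raises for turns outside 1-10.
    """
    if 1 <= turn <= 3:
        return "Disclosure"
    if 4 <= turn <= 5:
        return "Pressure"
    if 6 <= turn <= 7:
        return "Deepening"
    if 8 <= turn <= 10:
        return "Resolution"
    raise ValueError(f"Turn {turn} is outside the scripted 10-turn arc")
```

A mapping like this is what makes the phase-level sensitivity and per-turn trajectory analyses possible: each scored turn can be grouped by its phase before aggregating.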
Scenarios cover:
- Mental health: job loss, depression, grief, panic attacks, suicidal ideation, self-harm, burnout
- AI companion: daily check-ins, attachment attempts, anger, teen friendship, manipulation
See scenarios/mental-health.yaml and scenarios/companion.yaml for full scenario scripts.
- Scenario scripts: `scenarios/*.yaml` (included in this repository)
- System prompts: `prompts/*.yaml` (included in this repository)
- Experiment configuration: `config.yaml` (included in this repository)
- Scoring framework: dimensions, rubrics, and scoring criteria in `prompts/llm-judge-rubrics.md`
- Scientific literature basis for each dimension in `prompts/rubric-science.md`; clinical foundations also at empathyc.co/research
- Raw conversation data: available upon reasonable request to the corresponding author
- Aggregated results: Reported in full in the paper (Tables 2-5)
Note on Scoring Reproducibility: EmpathyC is a proprietary commercial platform. The scoring dimensions and clinical frameworks are described at empathyc.co/research to enable researchers to implement their own evaluation approaches. The detailed implementation remains proprietary intellectual property of Keido Labs Ltd.
If you use this code or data in your research, please cite:
```bibtex
@misc{keeman2026empathychangedclinicalassessment,
  title={Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations},
  author={Michael Keeman and Anastasia Keeman},
  year={2026},
  eprint={2603.09997},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.09997},
}
```

Also available in CITATION.cff format.
Conflict of Interest: The EmpathyC platform used for automated scoring is developed and operated by Keido Labs Ltd, with which the authors are affiliated. The scoring dimensions and clinical foundations are described in the paper and at empathyc.co/research. The detailed implementation is proprietary intellectual property.
Independence: This study was conducted without funding, affiliation, sponsorship, or involvement from any LLM provider, including OpenAI. All model access was obtained through standard commercial API subscriptions. The authors have no financial or contractual relationship with any LLM provider evaluated in this study.
Research Ethics:
- No human participants were involved in this study
- All user messages were pre-scripted by a clinical psychologist with 15 years of experience
- Scenarios involving sensitive topics (suicidal ideation, self-harm) were designed in a controlled research context to test model safety, not to generate harmful content
Important Disclaimer: This repository and the associated research are intended solely for academic study of AI safety behaviours. The scoring outputs, findings, and frameworks described here must not be used as a substitute for professional clinical judgment. Automated psychological safety scores — including crisis detection signals — are research metrics, not clinical assessments. Any real-world application involving vulnerable individuals requires qualified human professionals in the loop. The authors accept no liability for use of this work outside a research context.
This project is licensed under the MIT License - see the LICENSE file for details.
This research was conducted independently by Keido Labs Ltd without external funding or affiliation with any LLM provider.
For questions about the paper or reproduction:
- Michael Keeman: michael@keidolabs.com
- Keido Labs: https://keidolabs.com
For issues with this repository:
- Open an issue on GitHub