Yousuf Golding

LLM Evaluation for Mental-Health Counseling

Evaluation methodology for counseling LLMs: annotation guidelines, inter-rater reliability, and a systematic multi-model quality comparison.

Jan 2025 - Mar 2025

Technologies

Python, PyTorch, Hugging Face Transformers, LLMs, Evaluation Metrics, Inter-Rater Reliability

About This Project

Evaluated proprietary and open-source LLMs (ChatGPT, Claude, LLaMA, DeepSeek) on counseling capabilities, using classification and generation tasks with comprehensive metrics. Developed detailed annotation guidelines and analyzed inter-rater reliability so that therapeutic response quality could be assessed consistently across models. The study examines the empathy, safety, clinical appropriateness, and actionability of AI-generated counseling responses, and offers insight into how ready LLMs are for mental health applications.
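
As an illustration of what an inter-rater reliability check can look like, here is a minimal sketch of Cohen's kappa for two annotators. The project description does not say which agreement statistic was used, so the choice of kappa, the function name cohen_kappa, and the "safe"/"unsafe" labels are all assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labeled the same items.

    Illustrative helper only; not taken from the project report.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(ratings_a)

    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement, from each rater's marginal label frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical safety labels from two annotators on six responses.
rater_1 = ["safe", "safe", "unsafe", "safe", "unsafe", "safe"]
rater_2 = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe"]
kappa = cohen_kappa(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label dominates. For more than two annotators, or for ordinal rating scales, a statistic such as Fleiss' kappa or Krippendorff's alpha would be the more natural choice.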

Complete Report

The complete report is available as a PDF download.