Yousuf Golding
LLM Evaluation for Mental-Health Counseling
Evaluation methodology for counseling LLMs: annotation guidelines, inter-rater reliability, and a systematic multi-model quality comparison.
Jan 2025 - Mar 2025
Technologies
Python, PyTorch, Hugging Face Transformers, LLMs, Evaluation Metrics, Inter-Rater Reliability
About This Project
Evaluated proprietary and open-source LLMs (ChatGPT, Claude, LLaMA, DeepSeek) on counseling capabilities, using both classification and generation tasks. Developed detailed annotation guidelines and analyzed inter-rater reliability so that therapeutic response quality could be assessed consistently across models. The study examines the empathy, safety, clinical appropriateness, and actionability of AI-generated counseling responses, and offers insight into how ready LLMs are for mental health applications.
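As an illustration of the inter-rater reliability analysis mentioned above, here is a minimal sketch of Cohen's kappa for two annotators. The project description does not say which agreement statistic was used, so the choice of kappa is an assumption, and the function name `cohen_kappa`, the 1-to-3 empathy scale, and the ratings themselves are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical empathy ratings (1 = low, 3 = high) for six model responses.
annotator_1 = [3, 2, 3, 1, 2, 3]
annotator_2 = [3, 2, 2, 1, 2, 3]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is usually more informative than percent agreement when some labels are much more common than others. With more than two annotators, a statistic such as Fleiss' kappa or Krippendorff's alpha would be the analogous choice.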
Complete Report
View the complete report below or download PDF
Download Report PDF