Routine evaluation of mental health care systems requires a quick assessment of progress and outcome. This study was designed to determine the value of the Global Assessment of Functioning (GAF) scale in such applications. We asked 104 raters from six therapeutic centres to rate five clinical case vignettes. Interrater reliability was nearly the same across professional categories. For each case vignette, the highest and lowest scores differed by 39 to 45 points. Individual raters' biases ranged from -23 to +30 points, and their random deviations from 1 to 20 points. Systematic differences between centres were as large as 6 points. Our main finding is that the reliability of GAF scores in routine settings was unsatisfactory with untrained raters.
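
As an illustration of how the quantities reported above (score range per vignette, rater bias, random deviation) could be derived from a ratings table, the following is a minimal sketch. It is not the study's procedure: the abstract does not state the estimators used, so the definitions here are assumptions (bias as a rater's mean deviation from the per-vignette mean across raters, random deviation as the standard deviation of those deviations). The ratings, rater names, and function names are hypothetical and are not data from the study.

```python
from statistics import mean, stdev

# Hypothetical GAF ratings (NOT study data): rater -> one score per vignette.
ratings = {
    "rater_a": [55, 40, 70, 30, 62],
    "rater_b": [65, 48, 81, 41, 70],
    "rater_c": [50, 35, 60, 28, 52],
}


def score_ranges(ratings):
    """Highest minus lowest score for each vignette."""
    per_vignette = list(zip(*ratings.values()))
    return [max(scores) - min(scores) for scores in per_vignette]


def bias_and_random_deviation(ratings):
    """Per-rater (bias, random deviation) under the assumed definitions.

    Assumption: the reference score for a vignette is the mean over all raters.
    """
    per_vignette = list(zip(*ratings.values()))
    reference = [mean(scores) for scores in per_vignette]
    result = {}
    for rater, scores in ratings.items():
        deviations = [s - ref for s, ref in zip(scores, reference)]
        # Bias: systematic tendency to rate high or low.
        # Random deviation: scatter around that rater's own bias.
        result[rater] = (mean(deviations), stdev(deviations))
    return result


if __name__ == "__main__":
    print("range per vignette:", score_ranges(ratings))
    for rater, (bias, spread) in bias_and_random_deviation(ratings).items():
        print(f"{rater}: bias={bias:+.1f}, random deviation={spread:.1f}")
```

With these definitions the biases sum to zero across raters, because each reference score is itself the raters' mean; a design with expert consensus scores as the reference would not have that property.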