How useful are teaching evaluation scores?

In recent weeks I have taken part in meetings to discuss: student feedback on MFin teachers; teaching procedures across all programmes at Cambridge Judge Business School; the selection of winners of annual teaching prizes; and the results of a Cambridge University teaching and learning review of CJBS. In all four, the main information used to assess teaching quality is student feedback, usually given on a five-point scale from 1 (bad) to 5 (good). But there are good reasons to doubt how useful these scores are, and we should be very careful about making important decisions on the basis of small changes in them.

Cambridge is not alone in using this sort of feedback, which is usually collected after each course module ends, or in some cases at the end of term for all modules taught that term. Here are some problems with teaching feedback in general:

  1. students are not in a good position to judge whether the course content is appropriate
  2. students may give higher scores for content that is easier, more entertaining or more likely to lead to a higher grade
  3. the personal characteristics of a teacher (sex, age, appearance, nationality) may all influence the teaching score (see UPDATE at the end)
  4. not all students give feedback (participation is often 50-60%) so we don’t know whether it’s representative
  5. scores can vary by term, year, time of day, season (the weather) and for other reasons that are unconnected to the teacher.
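To put a rough number on point 4, here is a minimal sketch in Python. The class size and response counts are invented for illustration, and `favourable_share_bounds` is a hypothetical helper, not anything CJBS uses. It asks: if only some of the class responds, how low or high could the share of satisfied students in the whole class be, assuming nothing at all about the students who stayed silent?

```python
def favourable_share_bounds(class_size, responses, favourable):
    """Worst-case and best-case share of the WHOLE class rating favourably.

    Assumes nothing about non-respondents: in the worst case none of them
    would have rated favourably, in the best case all of them would have.
    """
    if class_size <= 0 or not (0 <= favourable <= responses <= class_size):
        raise ValueError("need 0 <= favourable <= responses <= class_size")
    missing = class_size - responses
    low = favourable / class_size
    high = (favourable + missing) / class_size
    return low, high


# Hypothetical module: 40 students, 22 respond (55%), 18 of those rate it good or better.
observed = 18 / 22
low, high = favourable_share_bounds(class_size=40, responses=22, favourable=18)

print(f"share among respondents: {observed:.0%}")
print(f"possible share in whole class: {low:.0%} to {high:.0%}")
```

With these made-up numbers the respondents look very satisfied, yet the data alone cannot rule out that fewer than half the class would say the same. The true figure is unlikely to sit at either extreme, but the width of the range shows how much rests on the assumption that respondents are representative.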

In most commercial activities it’s normal, indeed essential, to ask your customers if they’re happy with the service. But in areas of expert opinion such as medicine, law, accountancy and (I think) university teaching, the “customer” isn’t able to judge the quality. They are paying for the expert to tell them. If I’m taking a college course in quantum mechanics I might have an informed opinion about the food, wi-fi reliability and turnaround times for essay marking, but I would not consider myself well qualified to judge the quality of the content. An article in the Wall Street Journal (“When Students Rate Teachers, Standards Drop”) puts it this way: if health inspectors were evaluated and paid according to the grades they received from restaurant owners, how confident would you be about the quality of hygiene in restaurants?

As to point 2 above, I recall a conversation with a group of MBA alumni who were looking back on one particular course to which they had given high teaching scores at the time. On reflection they thought it had been rather too easy and undemanding, and in hindsight they would have given it rather lower scores.

I was pleased to come across a blog article by Philip Stark, a professor of statistics at the University of California, Berkeley. Prof Stark is critical of reliance on numerical teaching scores, especially for comparisons across different people and teaching groups. He points out that the scoring system that Berkeley (and Cambridge) uses is an ordinal categorical variable. An ordinal scale expresses only the ranking or order of the data. By contrast, a cardinal scale is one in which the size of the difference between values means something (like temperature). (Economists are familiar with this because long ago they gave up trying to measure something called utility that could be put on a cardinal scale. They found it wasn’t necessary; so long as consumers can order their preferences, i.e. put things in an order from most to least preferred, then the theory of rational consumer choice works.) Berkeley’s 1 to 7 scale could equally be replaced with words: very poor, poor, satisfactory, good etc. That shows how questionable it is to attach meaning to comparing scores such as 4.7 with 3.8. You can’t meaningfully average the categories. In a general sense a higher score is better than a lower one, but small variations lack statistical meaning. (*)
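A small Python sketch may make the ordinal point concrete. Everything in it is invented for illustration: the two teachers, their response counts and both numeric codings are my assumptions, not data from Berkeley or Cambridge. The categories have an order, but the numbers attached to them are arbitrary, so any order-preserving relabelling is as defensible as 1 to 5. The sketch shows that the "better" teacher by average score can change when the labels change.

```python
# Five ordered categories; only their order is given by the questionnaire.
labels = ["very poor", "poor", "satisfactory", "good", "very good"]

# Hypothetical response counts per category for two teachers (10 responses each).
teacher_a = [0, 0, 10, 0, 0]  # everyone says "satisfactory"
teacher_b = [4, 0, 0, 0, 6]   # polarised: 4 "very poor", 6 "very good"


def mean_score(counts, coding):
    """Average score after assigning each category the number in `coding`."""
    return sum(c * v for c, v in zip(counts, coding)) / sum(counts)


# Two codings that both respect the order of the categories.
coding_1 = [1, 2, 3, 4, 5]  # the usual equally spaced labels
coding_2 = [1, 4, 5, 6, 7]  # "very poor" treated as much worse than "poor"

a1, b1 = mean_score(teacher_a, coding_1), mean_score(teacher_b, coding_1)
a2, b2 = mean_score(teacher_a, coding_2), mean_score(teacher_b, coding_2)

print(f"coding 1: A = {a1:.1f}, B = {b1:.1f}")  # B ahead
print(f"coding 2: A = {a2:.1f}, B = {b2:.1f}")  # A ahead
```

Under the first coding teacher B has the higher average, and under the second teacher A does, although not a single student response has changed. Reporting the full distribution of responses for each teacher avoids the problem, because it does not depend on which numbers are attached to the categories.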

Stark also provides evidence and arguments that my points 1 to 5 above are all real. Students’ feedback is most valid and useful on questions such as whether the course is too difficult or too fast. But overall teaching scores are not much use, at least without additional information to corroborate them.

Underneath these evaluations is an implicit assumption that there is something called teaching effectiveness that we are trying to measure. Presumably, given a representative student and some standard teaching material, higher effectiveness means the student learns more, all else being equal. We can’t use exam grades to assess this though, because courses vary in difficulty and student motivation, assessment isn’t always consistent from year to year and student characteristics vary from class to class. The only way to get at teaching effectiveness would be to use randomised trials. Remarkably there are a couple of examples of this, which Stark reports. They suggest that teaching evaluations are negatively correlated with teaching effectiveness.

And yet we want to be sure that students are being taught effectively and we want to assess the teachers, in particular to identify when they need help to improve. Very low teaching scores are almost certainly an indicator that something is wrong but middling scores are not a reliable measure of effectiveness, and high scores are not necessarily evidence of high effectiveness, for all the reasons above.

I don’t think there’s much we can do about the data problem. We hold teaching committees that involve class representatives to shed more light on apparent problems. These can be useful if the reps are diligent in reflecting the views of the whole class, which is of course hard to do.

We could borrow from secondary school practice and have regular peer review of teaching, combined with professional development and senior teaching faculty help. But the emphasis in most universities is on research, not teaching, and there is nothing like this at Cambridge, so far as I’m aware. Nor is it clear who would be competent to do it, given that very few, if any, university teachers have ever had any training in teaching, which is quite different from secondary schools.

(*) He illustrates what my CJBS management science colleagues call “the flaw of averages” with a statistics joke. Three statisticians go on a deer hunt. They find a deer. The first shoots and misses to the left. The second shoots and misses to the right. The third shouts “Got it!”

UPDATE: Lecturers’ looks matter

Thank you to my colleague, econometrician Paul Kattuman, who both corrected an error in the original text and sent me a fascinating and slightly disturbing piece of research from 2003 titled “Beauty in the Classroom: Professors’ Pulchritude and Putative Pedagogical Productivity” by Daniel Hamermesh and Amy Parker, issued as a National Bureau of Economic Research working paper. They find that the perceived beauty of university teachers has a significant impact on their teaching scores (positively related of course). The effect is greater for men than for women teachers. The research is thorough, for example controlling for the difference between “beauty” and “well groomed”, when the latter might be a signal of professionalism or competence.

UPDATE 2: One of the very few controlled tests of student evaluations finds them inversely related to the actual academic progress of the students. Crudely, professors who push their students more, which results in higher achievement, get lower scores because the students don’t like it. Results here.

 

5 Responses to How useful are teaching evaluation scores?

  1. Not sure I can ever agree that teacher evaluations as a metric can be dismissed, or even taken at less than face value, even with inherent data issues and whatever academic arguments can be made around preference ranking.

    Personal characteristics/appearance of lecturers are important, as I would have thought teaching is not just about the content but also the manner in which it is taught. Can’t imagine rocking into a PE negotiation in my flip flops and expecting to remain credible.

    Students are increasingly adopting consumer attitudes to postgraduate education and tend to have fairly detailed ideas about what they want to study. I’ve tended to find that US universities respond better to this than European ones which tend to wear a “professor knows better” hat.

    Someone has got to be clutching at straws if they argue that a lower teaching score is due to:

    – Variances in weather, term, time of day or year
    – Non representative due to 50% participation

    • I don’t mean to dismiss teacher evaluation scores, nor to justify low scores. We have very few low scores. My concern is that the high scores are not necessarily evidence that the teaching is effective. As to the weather, I would hazard a guess that teaching evaluations on a cold, grey February day are on average poorer than on a warm, sunny May day. General morale appears to follow a pronounced seasonal cycle. But it would be very hard to show that with any statistical confidence.

  2. Sorry, but it is incorrect to compare the delivery of education with restaurant inspection. Restaurant inspectors are not delivering education or information, merely a summative assessment of a restaurant’s performance. Educating students is a teaching and learning process and is a formative activity.
    Learner feedback is vital to improve the quality of education.

  3. Thanks for explaining that teaching is in the same category as things like medicine and accounting because it isn’t normal to ask the customer if they’re happy with the service. My son will be starting grade school soon, so I’ve been doing some research online about how to identify a good teacher for him. I’m glad I read your article because now I can see the importance of choosing a school that has the software for regular classroom walkthroughs and evaluations.
