Thursday, December 20, 2012

Can we trust educational research? ("Visible Learning": Problems with the evidence)

I've been reading several books about education, trying to figure out what education research can tell me about how to teach high school English. I was initially impressed by the thoroughness and thoughtfulness of John Hattie's book, Visible Learning, and I can understand why the view of Hattie and others has been so influential in recent years.  That said, I'm not ready to say, as Hattie does, that we must make all learning visible, and in particular that "practice at reading" is "minimally" associated with reading gains.  I discussed a couple of conceptual issues I have with Hattie's take in an earlier post--I worry that Visible Learning might be too short-term, too simplistic, and less well-suited to English than to other disciplines.  Those arguments, however, are not aimed at Hattie's apparent strength, which is the sweep and heft of his empirical data.  Today, then, I want to address a couple of the statistical weaknesses in Hattie's work.  These weaknesses, and the fact that they seem to have been largely unnoticed by the many educational researchers around the world who have read Hattie's book, only strengthen my doubts about the trustworthiness of educational research.  I agree with Hattie that education is an unscientific field, perhaps analogous to what medicine was like a hundred and fifty years ago, but while Hattie blames this on teachers, whom he characterizes as "the devil in this story" because we ignore the great scientific work of people like him, I would ask him to look in the mirror first.  Visible Learning is just not good science.

Hattie's data
Visible Learning attempts to be both encyclopedia and synthesis. The book categorizes and describes over 800 meta-analyses of educational research (altogether, those 800 meta-analyses included over 50,000 separate studies), and it puts the results of those meta-analyses onto a single scale, so that we can compare the effectiveness of the very different approaches.  After categorizing the meta-analyses into, for instance, "Vocabulary Programs", "Exposure to Reading", "Outdoor Programs", or "Use of Calculator", Hattie then determines the average effect that the constituent meta-analyses show for that educational approach.  By these measures, exposure to reading seems to make more of a difference than the use of calculators, but less of a difference than outdoor programs, and much less of a difference than vocabulary programs. (There are some odd results: "Direct Instruction," according to Hattie's rank-ordering, makes more of a difference than "Socioeconomic Status.")
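
To make that procedure concrete, here is a minimal sketch of the kind of averaging the book describes.  The category names echo Hattie's, but the numbers are invented purely for illustration; they are not figures from Visible Learning.

```python
# A rough sketch of the averaging described above: each category of teaching
# practice gets the mean of the effect sizes reported by its constituent
# meta-analyses, and the categories are then rank-ordered by that mean.
# The effect sizes below are invented for illustration, not Hattie's figures.
meta_analysis_effects = {
    "Vocabulary Programs": [0.85, 0.61, 0.79],
    "Exposure to Reading": [0.42, 0.31, 0.39],
    "Use of Calculators":  [0.22, 0.30],
}

category_effects = {
    category: sum(effects) / len(effects)
    for category, effects in meta_analysis_effects.items()
}

# Rank-order the categories by average effect size, as the book's appendix does.
for category, d in sorted(category_effects.items(), key=lambda item: -item[1]):
    print(f"{category}: average d = {d:.2f}")
```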

Like other teaching gurus and meta-meta-analyzers (for instance, Robert Marzano, whose 2000 monograph, A New Era of School Reform, makes the case very explicitly), Hattie believes that good teaching can be codified and taught (that sounds partly true to me), that good teaching involves having very clear and specific learning objectives (I'm somewhat doubtful about that), and that good teaching can overcome, at the school level, the effects of poverty and inequality (I don't believe that).  Hattie uses a fair amount of data to back up his argument, but the data and his use of it are somewhat problematic.

First, questions about the statistical competence of Hattie in particular
I am not sure whether we can trust education research, and I am not alone.  John Hattie seems to be a leading figure in the field, and while he seems to be a decent fellow, and while most of his recommendations seem somewhat reasonable, his magnum opus, Visible Learning, has such significant issues that the one friend of mine who is a professional statistician believes, after reading my copy of the book, that Hattie is incompetent.

The most blatant errors in Hattie's book have to do with something called "CLE" (Common Language Effect size), which is the probability that a random kid in a "treatment group" will outperform a random kid in a control group.  The CLEs in Hattie's book are wrong pretty much throughout.  He seems to have written a computer program to calculate them, and the computer program was poorly written.  This might be understandable (all programming has bugs), and it might not have meant that Hattie was statistically incompetent, except that the CLEs Hattie cites are dramatically wrong.  For instance, the CLE for homework, which Hattie uses prominently (page 9) as an example to explain what CLE means, is given as .21.  This would imply that a student who did not have homework was much more likely to do well than a student who did have homework.  This is ridiculous, and Hattie should have noticed it.  But even more egregious is that Hattie proposes CLEs that are less than 0.  Hattie has defined the CLE as a probability.  A probability cannot be less than 0.  There cannot be a less-than-zero chance of something happening (except perhaps in the language of hyperbolic seventh graders).

As my statistician friend wrote me in an email, "People who think probabilities can be negative shouldn't write books about statistics."
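
For readers who want to see what a sensible CLE looks like, here is a minimal sketch of the standard calculation -- McGraw and Wong's formulation, which assumes two normally distributed groups with equal variance.  Hattie's own program is not public, so this is emphatically not his code; it just shows the kind of values the book should have contained.

```python
# Common Language Effect size: the probability that a randomly chosen
# treatment-group student outscores a randomly chosen control-group student,
# derived from the effect size d under the usual normal, equal-variance
# assumptions.  A sketch for illustration, not Hattie's actual program.
from math import sqrt
from statistics import NormalDist

def cle_from_d(d: float) -> float:
    return NormalDist().cdf(d / sqrt(2))

print(cle_from_d(0.29))   # ~0.58 for the homework effect size of .29 -- not the .21 the book reports
print(cle_from_d(0.0))    # 0.50: if there is no effect at all, it's a coin flip
print(cle_from_d(-3.0))   # ~0.02: even a strongly negative effect gives a probability above zero
```

Whatever effect size you feed it, the result stays between 0 and 1, which is exactly why a table full of negative CLEs should have jumped off the page.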

Second, doubts about the trustworthiness of educational researchers in general
My statistician friend is not the first to have noticed the probabilities of less than zero. A year and a half ago a Norwegian researcher wrote an article called "Can We Trust the Use of Statistics in Educational Research?" in which he raised questions about Hattie's statistical competence, and in follow-up correspondence with Hattie the Norwegian was not reassured.  (Hattie seems, understandably, not to want to admit that his errors were anything more than minor technical details.  In an exchange of comments on an earlier post on this blog, as well, Hattie seems to ignore the CLE/negative-probability problem.)

For me, the really interesting thing about Hattie's exchange with the Norwegians was that he seemed genuinely surprised, two years after his book had come out, by the fact that his calculations of CLE were wrong.  In his correspondence with the Norwegians, Hattie wrote, "Thanks for Arne Kåre Topphol for noting this error and it will be corrected in any update of Visible Learning."  This seems to imply that Hattie hadn't realized that there was any error in his calculations of CLE until it was pointed out by the Norwegians--which means, if I'm right, that no one in the world of education research noticed the CLE errors between 2009 and 2011.

If it is true that the most prominent book on education to use statistical analysis (when I google "book meta-analysis education", Hattie's book is the first three results) was in print for two years, and not a single education researcher looked at it closely enough and had enough basic statistical sense to notice that a prominent example on page 9 of the book didn't make sense, or that the book was apparently proposing negative probabilities, then education research is in a sorry state.  Hattie suggests that the "devil" in education is the "average" teacher, who has "no idea of the damage he or she is doing," and Hattie approvingly quotes someone who calls teaching "an immature profession, one that lacks a solid scientific base and has less respect for evidence than for opinions and ideology" (258).  He essentially blames teachers for the fact that teaching is not more evidence-based, implying that if we hidebound practitioners would only do what the data-gurus like him suggest, then schools could educate all students to a very high standard.  There is no doubt that there is room for improvement in the practice of many teachers, as there is in the practice of just about everyone, but it is pretty galling to get preachy advice about science from a guy and a field who can't get their own house in order.

Another potential problem with Hattie's data
Aside from the CLE issue, I am troubled by the way Hattie presents his data.  He uses a "barometer" that is supposed to show how effective the curricular program or pedagogical practice under consideration is.  This is the central graphic tool in Hattie's book, the gauge by which he measures every curricular program, pedagogical practice, and administrative shift:

[Image: Hattie's "barometer," a dial marked with zones for reverse effects, developmental effects, teacher effects, and the "zone of desired effects," with an arrow indicating the effect size of the practice in question.]

Note that developmental and teacher effects are both above zero. What this implies is that the effect size represented by the arrow is not the effect as compared to a control group of students that got traditional schooling, nor even the effect size as compared to students who got no schooling but simply grew their brains over the course of the study, but the effect size as compared to the same students before the study began.

This would imply that offering homework, with a reported effect size of .29, is actually worse than having students just do normal school, or that multi-grade classes, with an effect size of .04, make kids learn nothing.

Now, that is obviously not what Hattie means.  The truth is that Hattie sometimes uses "effect size" to mean "as compared to a control group" and other times uses it to mean "as compared to the same students before the study started."  He seems comfortable with this ambiguity, but I am not.  Not only is the "barometer" very confusing in cases like homework and multi-grade classrooms, where the graphic seems clearly to imply that those practices are less effective than just doing the regular thing (especially confusing in the case of homework, which is the regular thing), but this confusion also makes me very, very skeptical of the way Hattie compares these different effect sizes.  The comparison of these "effect sizes" is absolutely central to the book.  Comparing effect sizes (and he rank-orders them in an appendix) is just not acceptable if the effects are being measured against dramatically different comparison groups.
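
To see how much the choice of comparison group matters, here is a small sketch with invented scores: the very same end-of-year results for a treatment group yield wildly different "effect sizes" depending on whether you compare them to a control group or to the same students' own scores from the start of the year.

```python
# Invented scores, purely for illustration: the same treatment-group data
# produces two very different effect sizes depending on the comparison group.
from statistics import mean, stdev

pre_treatment  = [60, 62, 58, 65, 61]   # treatment group, start of year
post_treatment = [70, 73, 68, 75, 71]   # treatment group, end of year
post_control   = [69, 72, 68, 74, 70]   # control group, end of year

def effect_size(group_a, group_b):
    """Standardized mean difference (a rough Cohen's d; SDs simply averaged here)."""
    pooled_sd = (stdev(group_a) + stdev(group_b)) / 2
    return (mean(group_a) - mean(group_b)) / pooled_sd

print(effect_size(post_treatment, post_control))   # ~0.3: compared to a control group
print(effect_size(post_treatment, pre_treatment))  # ~3.9: compared to the same kids before the study
```

Putting numbers produced by those two different comparisons onto a single rank-ordered list is exactly the apples-to-oranges exercise that worries me.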

Hattie, in a comment on an earlier post in which I expressed annoyance at this confusion, suggested that we should think of effect sizes as "yardsticks"--but in the same comment he says that effect size is the effect as compared to two different things.  In his words: "An effect size of 0 means that the experimental group didn't learn more than the control group and that neither group learned anything."  Now, I am an English teacher, so I know that words can mean different things in different contexts.  But that is exactly what a yardstick is not supposed to do!

Of course, it is possible that many of Hattie's conclusions are correct.  Some of them (like the idea that if you explicitly teach something and have kids practice it under your close observation, then they will get better at it more quickly than if you just ask them to try it out for themselves) are pretty obvious.  But it is very hard to have much confidence in the book as a whole as a "solid scientific base" when it contains so much slipperiness, confusion and error.

Beyond these broad issues with Hattie's work, I also have some deep qualms about the way he handles reading in particular.  Maybe one day I'll address those in another post.

5 comments:

  1. Look, I agree about the moving target, but the bit about 0.29 for homework is, as I read it, saying that you want to consider your resource allocation carefully, and not throw a whole lot of effort into something that is a *relatively* "low performer" **without carefully considering the input**. He goes over and over that this takes us to the point of asking questions, and considering the cohort in question in fine detail. The school uniform thing is a great example; it allows us to see that if one were to pursue the uniform on academic achievement (and a few other) grounds, it would be hard to justify, but it does NOT mean that there aren't other grounds upon which any given group might want to employ a uniform in their particular context. And you say about explicit instruction being "obvious"; dude, I can tell you that there are plenty of folks still plugging along with "There's maths in that!"

    Replies
    1. You may be comfortable with Hattie's confident assertions and fuzzy math, but I am not. Obviously we want to consider resource allocation carefully, but Hattie goes on and on about how medieval the teaching profession is, and how we need modern scientists like him to bring us into the modern world; if that is his position, it is simply ridiculous for him to be so vague about what his numbers mean.

  2. Thanks for a nice introduction to the criticism of Hattie. It is being pushed (and over-simplified) in my school. I have two issues that maybe you can speak to a bit. I have not read the book, I have just seen several presentations that are at least once-removed, if not twice removed. 1. The areas researched are presented in a rank ordering. This sends the message that "feedback" is better than "teacher-student relationships" or that "not labeling students" is more effective than "concept mapping" when these seem totally unrelated. 2. I have seen his effect size reported as r and as CLES. Either way, there shouldn't be any values over 2, right? Am I missing something? Just like 0 is a weird number, even values very close to 1 seem out of place and over 1 seem like a disaster. What is going on with that?

    Replies
    1. I would advise you to put these questions to John himself! He is easily accessible through his university email account. He will explain that some of the items with a lower ranking are a pre-condition for items higher on the list. He specifically uses the example of teacher-student relationships as a pre-condition for effective feedback.

  3. Hi Emily: I'm traveling and don't have Hattie's book with me, but certainly my take is that you are right and Hattie's work is basically a mess. More specifically, yes, Hattie seems to stand by his rank-ordering, and we are I think supposed to understand him to be saying that feedback is EXTREMELY important, even more so than teacher-student relationships. Because I don't understand his math, I don't trust this conclusion. But who knows, it is imaginably true... As for your #2, yes, I think the values should be between 0 and 1. But as I've said, I really don't know what these numbers are supposed to represent, and as far as I can tell Hattie himself doesn't know, either. As I wrote in the post, he actually left a comment on my earlier post saying that the effect sizes mean different things for different studies. As far as I'm concerned, his work is pretty much useless. This is, unfortunately, par for the course in educational research, and it wouldn't bug me much except that he has the incredible nerve to blame us teachers for the unscientific nature of education. I'm sorry to hear that his work is being pushed in your school! --EC
