One of the foundations of Georgia’s new teacher evaluation system will be classroom observations by administrators, which are supposed to occur twice a year and last 30 minutes each.
There are already doubts about whether these classroom visits will actually occur, given the time constraints on principals, and whether they will yield reliable information on teacher effectiveness. (See the comment from the leader of the DeKalb teachers group, who says he is hearing complaints that these observations are not happening as required in the pilot program under way.)
Here is new research, from the Indiana University School of Education, that will add to those concerns:
Classroom observation measures don’t necessarily provide a clearer picture of teacher effectiveness than value-added measures based on student test scores, according to a review of the most recent report from the Measures of Effective Teaching Project’s large-scale examination of teacher evaluation methods. The review was led by Cassandra Guarino, associate professor in the educational leadership and policy studies department at the Indiana University School of Education, and co-authored by Brian Stacy, a doctoral fellow at Michigan State University.
The MET describes its latest report, published by the Bill & Melinda Gates Foundation, as “the largest study of instructional practice and its relationship to student outcomes.” The report, issued earlier this year, is part of a series examining issues of teaching and learning assessment. For this study, the MET team videotaped multiple lessons from teachers across the nation and scored lessons using several different classroom observation rubrics and trained raters. It found that classroom observation scores varied substantially from lesson to lesson, from rater to rater, and from instrument to instrument.
While Guarino and Stacy concur with the MET report authors that multiple measures of teacher effectiveness are needed to provide a more complete picture of teacher performance, they take the interpretation of the MET report findings several steps further.
“Classroom observation measures of teacher performance are as variable and imprecise as value-added measures based on student test scores and should be considered equally controversial,” Guarino said. “Given the current state of the art, neither should be held up as a gold standard, although they may both contain important information.
“Teaching effectiveness is a complex construct. I think we have to recognize that all the measures we currently have of effective teaching contain a lot of variability and measurement error,” Guarino said.
Her review of the MET Project findings was produced for the “Think Twice” think tank review project of the National Education Policy Center, with funding from the Great Lakes Center for Education Research and Practice. The center’s mission is to provide the public, policy makers and the press with timely, academically sound reviews of selected publications.
Guarino and Stacy particularly question the emphasis the MET report placed on validating classroom observations with test score gains. Although the MET concluded that teachers who scored well on classroom observations also tended to have higher student achievement on state tests and more favorable student evaluations, the correlations across measures were quite low.
Guarino said the classroom observations provide important information, but they are not uniform in either the instrument used to score the observation or in how the observation is carried out.
“The MET report highlights the variability in classroom observation measures of teacher performance. There’s a lot of variability in how different rubrics judge a teacher’s performance, and there’s variability in how a particular teacher’s lesson is judged from one rater to another, even using the same rubric,” Guarino said. “There’s also variability when two different lessons of the same teacher are judged. So what you end up with is a lot of disagreement as to what the teachers’ actual effectiveness score should be.”
As a result, she said, scores from multiple observations end up being averaged to improve statistical reliability, but the resulting measures remain imprecise.
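The averaging Guarino describes can be illustrated with a small simulation (an illustrative sketch with made-up numbers, not part of the study): treat each observed lesson score as a teacher’s true effectiveness plus lesson-to-lesson and rater-to-rater noise, then see how the typical error of the averaged score shrinks, without vanishing, as more lessons are averaged.

```python
import random
import statistics

random.seed(42)

def observed_scores(true_score, n_lessons, noise_sd=0.5):
    """Simulate n_lessons observation scores for one teacher:
    true score plus lesson/rater noise (all values hypothetical)."""
    return [true_score + random.gauss(0, noise_sd) for _ in range(n_lessons)]

TRUE_SCORE = 3.0  # hypothetical "true" effectiveness on a 1-5 rubric
avg_error = {}
for n in (1, 2, 4, 8):
    # Typical gap between the averaged score and the true score,
    # estimated over 2000 simulated teachers
    errors = [abs(statistics.mean(observed_scores(TRUE_SCORE, n)) - TRUE_SCORE)
              for _ in range(2000)]
    avg_error[n] = statistics.mean(errors)
    print(f"{n} lessons averaged -> typical error {avg_error[n]:.2f}")
```

The error falls roughly with the square root of the number of lessons averaged, so each additional observation buys less precision than the last, which is why averaged scores improve reliability yet remain imprecise.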
Also problematic, Guarino said, was the emphasis on how well classroom observations relate to value-added measurements — ones that determine how much “value” a teacher adds to a student’s learning based on students’ scores on standardized tests.
“The correlations between the classroom observations and the value added are relatively low,” Guarino said. “Perhaps they’re picking up some different aspects of effective teaching,” she said of value-added measures. “It doesn’t necessarily mean one is a substitute for another.”
Guarino said the MET report shows that classroom observations are most reliable when they focus on classroom dynamics rather than content.
“It may be the case that they pick up on the transmission of non-cognitive skills — things like learning to pay attention, collaborate or express oneself in a constructive manner,” she said. “Some current research suggests that these non-cognitive skills have a large payoff in later labor market earnings. Now, value-added measures — i.e., measures of teacher effectiveness based on student test score gains — may be better than classroom observations at picking up on the transmission of cognitive skills.”
Guarino and Stacy said the MET report does a good job describing classroom observation tools but leaves several unanswered questions about how they can be best implemented.
• Can school districts afford the time and money required to train classroom observers well enough to yield highly reliable evaluation measures?
• Do classroom observations primarily discern how teachers transmit non-cognitive skills?
• Can classroom observations provide feedback that improves teaching effectiveness in a meaningful way?
Guarino is currently directing a study on value-added measures funded by an Institute of Education Sciences grant. In the study, Guarino and colleagues investigate different methods with simulated data, controlling for effects such as low-achieving students being paired with high-performing teachers. Even with an idealized simulated data set, Guarino said, her team’s results have shown variability in measurement.
“What we found is that value-added measures were fairly imprecise and, in some cases, biased; although some methods were better than others, and measures based on these methods may contain helpful information,” she said.
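The imprecision Guarino describes can be sketched in the same spirit as a simulation study (a hypothetical illustration, not her team’s actual methods or code): generate teachers with known “true” effects, produce noisy student test-score gains for each class, and check how well the simplest value-added estimate — the class-average gain — recovers the true effects.

```python
import random
import statistics

random.seed(7)

def corr(xs, ys):
    """Pearson correlation (stdlib-only helper)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.stdev(xs) * statistics.stdev(ys) * (len(xs) - 1))

# Hypothetical setup: 200 teachers, 25 students per class.
true_effects = [random.gauss(0, 1) for _ in range(200)]
estimates = []
for effect in true_effects:
    # Each student's test-score gain = teacher effect + student-level noise;
    # the naive "value-added" estimate is just the class-average gain.
    gains = [effect + random.gauss(0, 4) for _ in range(25)]
    estimates.append(statistics.mean(gains))

r = corr(true_effects, estimates)
print(f"correlation between true effects and estimates: {r:.2f}")
```

Even in this idealized setting, with no nonrandom assignment of students to teachers, the estimates correlate only moderately with the true effects, which matches the broader point that value-added measures can contain useful information while remaining noisy.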
–From Maureen Downey, for the AJC Get Schooled blog