A reader sent me a series of questions about the state’s Criterion-Referenced Competency Tests, which I asked the state Department of Education to answer. I am running this today because the state just released district scores.
I appreciate the time that DOE took to draft this detailed response, as I requested that it be as jargon-free as possible for the non-educators on the blog. As DOE spokesman Matt Cardoza said, “While there is a lot of technical information in this answer, they asked a very technical question. It’s as jargon-free as we can get it.”
From the reader:
How does the DOE oversee the CRCT test validity and scoring year to year in order to do the charts and comparisons the DOE released this week? As a former principal and test coordinator before that, I was never told the cut scores on the first test or the retest. I felt that there was wiggle room at the state level in deciding how many questions a child could miss and still pass and that it was decided each year after the tests were all in and scored. I also felt the retest had a different and less demanding cut score. Who makes sure that the level of difficulty of the questions remains constant across years so that there is valid comparison? If the questions and cut scores are manipulated year to year, how can valid comparisons be made?
If I were state superintendent, I would want to show steady progress and improvement, and given that there is no transparency in the cut scores and that the test is created in-house and updated year to year, there is room to manipulate the results without changing a single child’s answer. I have no proof; I am just asking the question. Has anyone looked at this?
The process Georgia uses to build state assessments, such as the Criterion-Referenced Competency Tests (CRCT), is an established, time-tested practice that all reputable test developers use. This process follows the professional standards jointly developed by several organizations, including the American Psychological Association (APA), the National Council on Measurement in Education (NCME), and the American Educational Research Association (AERA). While the state contracts for the development, administration, scoring, and reporting of the assessment programs (for example, CTB/McGraw-Hill is the current contractor for the CRCT), the Georgia Department of Education (GaDOE) assessment staff provides direct oversight of this work.
Georgia educators make significant contributions to the state’s testing programs by reviewing test items both before and after they are field tested with Georgia students. Field testing involves trying out newly written items with a representative sample of students and is a crucial step towards ensuring the items are appropriate and not confusing for students before holding students and schools accountable for performance on the items. As such, field test items do not contribute to student results.
Multiple steps are taken to ensure the technical quality of each assessment program. For example, Georgia convenes a Technical Advisory Committee (TAC), composed of six nationally recognized experts in the field of educational measurement. The purpose of the TAC is to provide the state with impartial, expert advice on the technical qualities of the state’s assessments. The TAC meets quarterly and reviews every step of the test development, scoring, and reporting process for each testing program.
Additionally, testing programs such as the CRCT must undergo a comprehensive review process conducted by the U.S. Department of Education (US ED) known as Peer Review. During this review, each state must submit detailed documentation providing evidence of the technical qualities of the program(s). These include, but are not limited to, qualities such as alignment, development and maintenance procedures, and technical reports. A committee of peers (measurement, curriculum, and education policy experts) selected from other states reviews the evidence and evaluates the overall quality and soundness of the instruments.
So, how does the GaDOE know that year-to-year comparisons of test scores are valid? When test forms are built, careful consideration is given to both the content and statistical features of the items selected to comprise the form. A test blueprint, with both content and statistical targets, guides the form development. Throughout the test form building process, the goal is to develop a form that is as parallel as possible to the blueprint and previous forms. And while the test forms are created to be as parallel as possible in terms of content coverage and difficulty, the fact remains that differences in unique collections of items can result in subtle changes in difficulty. A statistical procedure called equating serves to equalize those differences.
Equating is not unique to Georgia; it is a process used by virtually all large-scale assessment programs (including the SAT, the ACT, and other state assessments). Undergirding each test program and each administration is a large body of statistical work that ensures the assessments are technically sound and equated appropriately. Equating ensures that students taking a test are always held to the same level of achievement, regardless of any differences in the collection of items that comprise the test form taken. Thus, whenever multiple test forms are used in the same administration, or when a different form is given in a subsequent administration (e.g., grade 7 science in 2011 and grade 7 science in 2012), they must be equated.
The technical work for the CRCT involves expressing each test form on a common metric called the theta scale, such that performance on each form can be compared. Working with a common metric means that differences in test performance can be interpreted as the result of changes in student achievement as opposed to changes in test difficulty. It is particularly important to understand that the cut scores determined within large-scale assessment programs like the CRCT are set on this common metric and are held constant for the lifetime of the testing program.
For example, the cut scores for grade 7 Mathematics were determined via an extensive standard-setting procedure in the first year of the testing program. However, the cut score for the CRCT is set on the theta scale, not the raw score. The particular raw score required to achieve the cut score of 800 in any subsequent test administration depends on the difficulty of the collection of test items that comprise that form.
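For readers who want to see the idea in miniature: under a Rasch model (one common item response theory model; the DOE’s answer does not say which model Georgia uses), the expected number of correct answers at a given ability level depends on how hard the items are. The sketch below uses made-up item difficulties and a made-up theta cut, not actual CRCT values, to show why the same fixed cut on the theta scale can correspond to different raw-score cuts on two forms of different difficulty:

```python
import math

def expected_raw_score(theta, item_difficulties):
    """Expected number-correct at ability `theta` under a Rasch model:
    the sum over items of P(correct) = 1 / (1 + exp(-(theta - b)))."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in item_difficulties)

# Two hypothetical 10-item forms; Form B's items are harder on average.
form_a = [-1.0, -0.5, -0.5, 0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
form_b = [-0.5, 0.0, 0.0, 0.5, 0.5, 0.5, 1.0, 1.0, 1.5, 1.5]

cut_theta = 0.4  # an illustrative fixed cut score on the theta scale

print(round(expected_raw_score(cut_theta, form_a), 1))  # → 5.7
print(round(expected_raw_score(cut_theta, form_b), 1))  # → 4.5
```

Because Form B’s items are harder, a student at the same ability level is expected to answer fewer of them correctly, so the number-correct score needed to reach the same theta cut is lower on Form B. That is the sense in which the raw passing score can vary from form to form while the standard itself stays fixed.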
Working within a common metric such as the theta scale, and implementing statistical procedures such as equating, allows us to attribute, with confidence, any changes in test performance to changes in student achievement rather than to the particular test form administered. The passing score always has the same meaning from administration to administration.
Finally, the technical work underlying the development and administration of the CRCT is well documented and has been extensively and independently reviewed by the TAC as well as the US ED. For many years the raw score (or number correct) achieved by students has been reported on class rosters and individual student reports.
Given this transparency it would be exceedingly difficult, if not impossible, for the state to “manipulate” the cut score to achieve some “desired” result.
Given the stakes associated with the test results and recent events in our state, it is natural to have questions about how tests are developed and validated. Individuals interested in learning more can Google the words “psychometrics” (the advanced study of measurement) and “test equating.”
–From Maureen Downey, for the AJC Get Schooled blog