Program Goal VIII: Assessment

As James Cooper (1999) has noted, assessment data is useful only if it contributes to sound educational decision-making. To use this data effectively, beginning teachers must have knowledge and skills in the areas of measurement fundamentals, standardized tests and their interpretation, validity and reliability, constructing formal and informal assessments, and appropriately using the results of both formative and summative assessments. Additionally, it is important for teachers to have knowledge of the No Child Left Behind Act and its impact on educational assessment. The paragraphs that follow describe the knowledge and skills in these areas that we believe our students must possess in order to make effective assessment decisions.

Measurement Fundamentals: Educators must understand the basic terminology and concepts related to assessment, evaluation, and measurement. The following concepts and definitions are of central importance to this understanding:

Assessment, Evaluation, and Measurement: Though the terms assessment and evaluation are often used interchangeably (Cooper, 1999), many writers differentiate between them. For the purposes of this chapter, assessment is defined as gathering information or evidence, and evaluation is the use of that information or evidence to make judgments (Snowman, McCown, and Biehler, 2012). Measurement involves assigning numbers or scores to an "attribute or characteristic of a person in such a way that the numbers describe the degree to which the person possesses the attribute" (Nitko and Brookhart, 2011, p. 507). Assigning grade equivalents to scores on a standardized achievement test is an example of measurement.

Teacher-Made vs. Standardized Assessments: In the broadest sense, assessments may be classified into two categories: teacher-made and standardized. Teacher-made assessments are constructed by an individual teacher or a group of teachers in order to measure the outcome of classroom instruction. Standardized assessments, on the other hand, are commercially prepared and have uniform procedures for administration and scoring. They are meant for gathering information on large groups of students in multiple settings (Karmel and Karmel, 1978).

Norm-Referenced, Criterion-Referenced, and Standards-Referenced Frameworks: Standardized assessments may be norm-referenced, criterion-referenced, or standards-referenced. Norm-referenced assessments compare individual scores to those of a norm-reference group, generally students of the same grade or age. They are designed to demonstrate "differences between and among students to produce a dependable rank order" (Bond, 1996, p. 1) and are often used to classify students for ability grouping, to help identify them for placement in special programs, and to provide information to report to parents. Criterion-referenced assessments determine the specific knowledge and skills possessed by a student; in other words, they identify "the kind of performances a student can do in a domain, rather than the student's standing in a norm group" (Nitko and Brookhart, 2011, p. 369). Standards-referenced assessments involve comparing students' scores to "clearly defined levels of achievement or proficiency" (Nitko and Brookhart, 2011, p. 514), such as state or national standards.

Formative vs. Summative Evaluation: Formative evaluation involves "collecting, synthesizing, and interpreting data for the purpose of improving learning or teaching" (Airasian, 1997, p. 402). Thus, formative assessments support learning and are not graded. "Rather, they serve as practice for students, just like a meaningful homework assignment" (Dodge, 2011) and can provide valuable information to teachers for improving student performance. Summative evaluations, on the other hand, "are given periodically to determine at a particular point in time what students know and do not know . . ." (Garrison and Ehringhaus, 2007). They are often used for assigning grades.

Types of Standardized Tests: A description of all forms of standardized tests is far beyond the scope of this document. Therefore, only those standardized tests most central to educational decision-making are included in this brief summary.

Intelligence Tests: Intelligence tests, or tests of cognitive ability, are often classified into two categories: individual and group. Individual intelligence tests, such as the Stanford-Binet and Wechsler Scales, are administered in a one-to-one setting by a trained examiner, usually a psychologist. They are most frequently given as part of an overall psychological evaluation, particularly to determine if a student requires special education (Hallahan, Kauffman, and Pullen, 2012). Group intelligence tests, such as the Cognitive Abilities Test, are typically administered in mass testing situations for general screening. Results of intelligence tests, whether group or individual, are often reported as intelligence quotients or IQ scores (Bee and Boyd, 2010). In the past, IQ scores were determined by calculating a ratio (mental age divided by chronological age, multiplied by 100, thus the term intelligence "quotient"). However, they are now typically calculated as standard scores with a mean of 100 and a standard deviation of 15 (Bee and Boyd, 2010). Though the results of individual intelligence tests tend to be more valid and reliable than those of group tests, teachers must be aware of the limitations of intelligence tests, such as their tendency to measure primarily abilities related to classroom achievement and their variability over time for some individuals (Snowman and McCown, 2012).
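The two IQ formulations described above can be illustrated with a short sketch. This is not an actual scoring procedure from any published test; the function names and the sample values are hypothetical, chosen only to show the arithmetic.

```python
def ratio_iq(mental_age, chronological_age):
    """Historical ratio IQ: mental age divided by chronological age, times 100."""
    return round(mental_age / chronological_age * 100)

def deviation_iq(raw_score, norm_mean, norm_sd):
    """Modern deviation IQ: a standard score rescaled to mean 100, SD 15."""
    z = (raw_score - norm_mean) / norm_sd
    return round(100 + 15 * z)

# A 10-year-old performing like a typical 12-year-old:
print(ratio_iq(12, 10))          # 120

# A raw score one standard deviation above the norm-group mean:
print(deviation_iq(65, 50, 15))  # 115
```

Note that both examples land above 100: in each formulation, 100 represents average performance for the student's age group.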

Tests of Academic Achievement: Like intelligence tests, individual achievement tests are most often administered to students who have been referred for possible placement in special education or remedial programs. Commonly administered tests such as the Wechsler Individual Achievement Test, Second Edition (WIAT-II) and the Woodcock-Johnson Tests of Achievement, Third Edition (WJ-III) are designed to measure basic academic skills and can provide a comparison of students' academic performance to the levels predicted by the results of intelligence testing (King, 2006). Group achievement tests, such as the California Basic Educational Skills Test, are typically administered annually or at planned intervals to all students as part of the district-wide testing program in order to certify students' achievement and provide information to parents. Many school districts in Minnesota administer the Measures of Academic Progress (MAP), which are aligned with state standards (NWEA, 2006). Minnesota school districts also must administer the Minnesota Comprehensive Assessments - Third Edition (MCA-III) to determine if students are making adequate yearly progress in reading and math as required by the No Child Left Behind Act (Minnesota Department of Education, 2011).

Assessment Data Interpretation: To understand and appropriately use the results of standardized assessments, educators must have a working knowledge of descriptive statistics, including measures of central tendency, measures of dispersion, norms, and standard scores.

Measures of central tendency include the mean, median, and mode. The mean is the arithmetic average and is obtained by adding all scores and dividing by the total number of scores. It is especially important because it is a necessary statistic for the calculation of standard scores. The median is the middle score of a distribution. In cases where there are extreme scores (or outliers), the median may be a better measure of central tendency than the mean. The mode is the most frequent score in a distribution. In large, normally distributed populations, the mode does not differ greatly from the mean and median. However, in distributions with small numbers of scores, it is typically the least useful of these three statistics (Glass and Stanley, 1970).
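A small worked example may make these three statistics concrete, including the point about outliers. The scores below are hypothetical; the calculations use Python's standard `statistics` module.

```python
from statistics import mean, median, mode

scores = [70, 75, 75, 80, 85, 90, 100]
print(mean(scores))    # 82.142857...
print(median(scores))  # 80 (the middle score when sorted)
print(mode(scores))    # 75 (the only score occurring twice)

# Replace the lowest score with an extreme outlier: the mean drops
# sharply, but the median is unchanged.
scores_outlier = [10, 75, 75, 80, 85, 90, 100]
print(mean(scores_outlier))    # 73.571428...
print(median(scores_outlier))  # still 80
```

This is why, as noted above, the median can be the better summary when a distribution contains extreme scores.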

Measures of dispersion include the range and standard deviation. The range is the spread of scores in a distribution (Vogt, 1993). It may be calculated by subtracting the lowest score from the highest and adding one. It is, at best, a rather crude statistic because it is based on only the two most extreme scores in the distribution. The standard deviation is a more precise and useful measure of dispersion. It is calculated by taking the square root of the average of the squared deviations of scores from the mean. Calculating the standard deviation is essential for determining standard scores such as those described below (Anastasi, 1988).
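The two dispersion measures can be sketched as follows. The scores are hypothetical; the inclusive range follows the highest-minus-lowest-plus-one convention described above, and `pstdev` computes the population standard deviation (the square root of the mean squared deviation).

```python
from statistics import pstdev

scores = [60, 70, 75, 80, 85, 90, 100]

# Inclusive range: highest minus lowest, plus one.
inclusive_range = max(scores) - min(scores) + 1
print(inclusive_range)            # 41

# Population standard deviation; the mean here is 80, and the squared
# deviations average to 150, so the SD is the square root of 150.
print(round(pstdev(scores), 3))   # 12.247
```

Notice that the range depends only on the scores 60 and 100, whereas every score contributes to the standard deviation.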

Norms: The results of standardized tests are typically reported as norms, which are statistics that allow one to interpret the score of an individual student in comparison to others of the same age or grade level. They include percentiles (the percentage of scores in the norm-reference group that fall at or below that of a particular student), grade equivalents (which describe a student's performance in terms of school grade levels), and standard scores such as Z-scores, T-scores, and IQ scores, all of which indicate how far a student's score varies from the mean in standard deviation units (Anastasi, 1988). It is important for teachers to understand the meaning of norms so that they are able to use the results of standardized tests to effectively plan instruction and interpret them for parents.
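Because all of the standard scores named above are linear transformations of the same z-score, the relationships among them can be shown in a few lines. The norm-group mean and standard deviation below are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

norm_mean, norm_sd = 50.0, 10.0  # hypothetical norm-group statistics
raw = 65.0                       # one student's raw score

z = (raw - norm_mean) / norm_sd  # z-score: mean 0, SD 1
t = 50 + 10 * z                  # T-score: mean 50, SD 10
iq = 100 + 15 * z                # deviation IQ: mean 100, SD 15
print(z, t, iq)                  # 1.5 65.0 122.5

# Under a normal distribution, the percentile is the proportion of the
# norm group scoring at or below this point:
percentile = NormalDist().cdf(z) * 100
print(round(percentile, 1))      # 93.3
```

A score 1.5 standard deviations above the mean is thus the same standing whether it is reported as z = 1.5, T = 65, a deviation IQ of about 122, or roughly the 93rd percentile.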

Validity and Reliability: In order to make effective decisions based on assessment data, the instruments used to collect that data must be valid and reliable. Validity is defined as "how accurately a test measures what users want it to measure" (Snowman, McCown, and Biehler, 2012). The validity of a test may be determined in different ways including carefully examining the test's content in regard to the curriculum (content validity), determining the degree to which test results reflect a theory or construct (construct validity), and examining the extent to which assessment results can be used to predict future performance (predictive validity). Of these approaches to validity, content validity is clearly the most important for teacher-made tests.

Reliability refers to the consistency or stability of scores yielded by a test (Airasian, 1997). In other words, a reliable test yields consistent scores over repeated administrations (repeated-measures reliability). A concrete means of judging a test's reliability is provided by the Standard Error of Measurement (SEM), which provides an estimate of the amount of error in the scores yielded by a test. It is inversely related to reliability: the higher the reliability coefficient, the smaller the SEM (Nitko and Brookhart, 2011). Though reliability and standard error of measurement are important considerations in choosing standardized tests, it is often difficult to determine the reliability and standard error of a teacher-made test. However, teachers should be knowledgeable regarding ways to increase the reliability of classroom assessments such as those described by Airasian (1997).
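The inverse relation between reliability and the SEM can be shown with the classical-test-theory formula SEM = SD × √(1 − reliability). The standard deviation and reliability coefficients below are hypothetical.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD times sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

sd = 15.0  # hypothetical test standard deviation
for r in (0.95, 0.90, 0.80):
    print(r, round(sem(sd, r), 2))
# As reliability falls, the SEM grows:
# 0.95 -> 3.35
# 0.90 -> 4.74
# 0.80 -> 6.71
```

The SEM is often used to build a confidence band around an observed score; with SD = 15 and reliability .95, a score of 100 would plausibly reflect a true score roughly between 97 and 103.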

Constructing Assessments: Formal, teacher-made assessments can be classified as traditional or alternative. Traditional assessments are typically paper-and-pencil tests and are often categorized as objective or essay. Alternative assessments are most often performance-based. The type of assessment one chooses is dependent on many factors including grade level, content, and time constraints. Regardless of the assessment format, good assessments have three features in common: "1) the assessment exercises and questions are related to the teacher's objectives and instruction; 2) the exercises and questions cover a representative sample of what students were taught; and 3) the items, directions, and scoring procedures are clear and appropriate" (Airasian, 1997, p. 149).

Traditional Assessments: Objective assessments are traditional assessments on which students are expected to provide the one correct answer. Typical objective assessment formats include multiple choice, true-false, matching, completion, and short answer. All of these assessments have the advantage of sampling students' knowledge of a wide range of content (and, therefore, have the potential for good content validity) in a minimal amount of time. A major disadvantage of objective test items is that they typically measure only student learning at the knowledge and comprehension levels of Bloom's Taxonomy (See Program Goal I for information about Bloom's Taxonomy). However, multiple choice items can be constructed to sample higher cognitive levels (Airasian, 1997). In order to construct effective objective assessments, teachers must be familiar with and apply guidelines for effective test construction such as those described by Nitko and Brookhart (2011) and Stiggins, Arter, Chappuis, and Chappuis (2006).

Essay Tests are another form of traditional, paper-and-pencil assessment. These tests provide an excellent format for assessing students' abilities to communicate ideas in writing. Other advantages of essay tests include measuring higher cognitive levels and directly measuring behaviors specified by performance objectives (Oosterhof, 1999). However, essay tests have disadvantages as well in that they sample less content than objective tests and, therefore, may have poorer content validity. Additionally, essay tests are time-consuming to score, and their scoring tends to be less reliable (Oosterhof, 1999). Because of these potential disadvantages, it is especially important for teachers to follow appropriate guidelines for constructing and scoring essay tests such as those described by Nitko and Brookhart (2011) and Stiggins, Arter, Chappuis, and Chappuis (2006).

Alternative Assessments: Performance assessments are the most common form of alternative assessment. Through the use of these assessments, teachers attempt "to gauge how well students can use basic knowledge and skill to perform complex tasks or solve problems under more or less realistic conditions" (Snowman, McCown, and Biehler, 2012, p. 612). A related form of alternative assessment, authentic assessment, "requires students to use the same competencies, or combinations of knowledge, skills and attitudes that they need to apply in the criterion situation in professional life" (Gulikers, Bastiaens, and Kirschner, 2004, p. 69).

Though the term authentic assessment is often used interchangeably with performance assessment, Oosterhof (1999) differentiates between the two, defining authentic assessments as tasks that require "a real application of a life skill beyond the instructional context" (p. 151). Therefore, according to Oosterhof, "All authentic assessments are performance assessments, but the inverse is not true" (p. 151). A major advantage of performance assessments is that they allow evaluation of skills that cannot be easily measured by paper-and-pencil tests. Additionally, they allow for evaluation of the process as well as the product. However, performance assessments tend to be time-consuming to administer and score; therefore, they typically do not allow one to sample a wide range of outcomes. Consistency of scoring (reliability) is also problematic (Oosterhof, 1999).

Portfolios: According to Nitko and Brookhart (2011), a portfolio is "a limited collection of a student's work used for assessment purposes either to present the student's best work(s) or demonstrate growth over a given time" (p. 510). Though portfolios are often considered to be a form of authentic or performance assessment, as Oosterhof (1999) noted, they typically include materials that are neither authentic nor performance-based. Portfolios are particularly useful for tracking change or growth in student performance over time and are most effective when they "are characterized by a clear vision of the student skills to be addressed; student involvement in selecting what goes into the portfolio; use of criteria to define quality performance and provide a basis for communication; and self-reflection through which students share what they think about their work, their learning environment, and themselves" (Arter, 1995, p. 4).

Informal Assessment: As noted by Oosterhof (1999), the majority of classroom assessments are informal in nature. They typically take place during instruction and allow the teacher to monitor student learning and make any necessary adjustments. These informal assessments most often take the form of observations and questions and are typically formative (Nitko and Brookhart, 2011). Though these techniques are efficient and adaptable, their technical quality "tends to be inferior to techniques associated with formal assessments" (Oosterhof, 1999, pp. 148-149). However, teachers can improve their use of informal questions by basing them on instructional goals, allowing sufficient wait-time, and recognizing the importance of teacher reactions to student answers (Oosterhof, 1999). The effectiveness of observations can be improved by keeping anecdotal records, using informal checklists (Oosterhof, 1999), and recognizing one's own bias (Good and Brophy, 1984).

Reporting Grades: Grades serve the important purposes of indicating the extent to which learners have achieved classroom goals and communicating this information to students and their parents. Methods for calculating grades include norm-referenced (or relative) grading, criterion-referenced (or absolute) grading, and individual-referenced grading. Norm-referenced (or relative) grading involves comparing "a pupil's performance to that of other pupils in the class" (Airasian, 1997, p. 301). Grading "on the curve" is an example of norm-referenced grading. Criterion-referenced (or absolute) grading involves determining grades based on "the extent to which each student has attained a defined standard or criterion of achievement or performance" (Snowman and McCown, 2012, p. 503). Calculating grades based on fixed ranges of cumulative scores is an example of criterion-referenced grading (Tombari and Borich, 1999).
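The fixed-range approach just described amounts to a simple cutoff lookup, which can be sketched as follows. The cutoff percentages and letter categories here are hypothetical, not a recommended grading scale.

```python
def letter_grade(percent):
    """Criterion-referenced (absolute) grading with fixed cutoff ranges."""
    cutoffs = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]
    for floor, grade in cutoffs:
        if percent >= floor:
            return grade
    return "F"

print(letter_grade(91))  # A
print(letter_grade(74))  # C
print(letter_grade(59))  # F
```

Because each grade depends only on the student's own cumulative score, every student in a class could, in principle, earn an A; under norm-referenced ("on the curve") grading, by contrast, the distribution of grades is fixed in advance.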

The mastery approach, which "allows students multiple opportunities to learn and demonstrate their mastery of instructional objectives" (Snowman, McCown, and Biehler, 2012, p. 504), is another form of criterion-referenced grading. Individual-referenced grading compares pupils' performance to their perceived abilities. Though this form of grading is sometimes used for students with disabilities, Airasian (1997) recommends that other grading approaches such as contract grading, IEP-based grading, or narrative grading be used for determining these students' grades. Of the approaches for assigning letter grades, the use of criterion-referenced grading is most consistent with the beliefs of those in our department.

No Child Left Behind: The No Child Left Behind Act (NCLB), which reauthorized the Elementary and Secondary Education Act (ESEA) of 1965, was signed into law by President Bush on January 8, 2002. It has become an important factor in specifying the timing and nature of school-wide assessments.

The goals of NCLB include increasing accountability for states, school districts, and schools; providing greater choice for parents and students, particularly those attending low-performing or failing schools; giving local education agencies (LEAs) more flexibility in using Federal education money (mostly Title I funds); and placing a stronger emphasis on reading, especially for young children. Every state must comply with NCLB or lose its Title I and other Federal education funds (U.S. Department of Education, 2004).
NCLB categorizes students into seven subgroups based on ethnicity, poverty, disability, and limited English proficiency (LEP) in order to identify and close achievement gaps. All students, including those in each of the subgroups, must be proficient in reading and math by 2014 (U.S. Department of Education, 2002). Each state sets its own criteria for proficiency. In Minnesota, students are considered proficient when they have "grade-level knowledge and skills" (AFT, 2003). To determine if schools are making adequate yearly progress (AYP) toward meeting this goal, states must administer high-stakes reading and math assessments (currently the Minnesota Comprehensive Assessments, Third Edition, or MCA-III, in Minnesota) annually to all students in grades three through eight. Students must also be tested at least once in science (Minnesota Department of Education, 2011). Each year a school does not make adequate yearly progress for its total population or any of the seven subgroups, increasingly severe penalties are imposed, culminating in corrective actions such as replacing most school staff (including the principal), closing the school and assigning students to other district schools, or transforming it into a charter school (Anderson, Minnesota Department of Education, 2009).

In addition to measuring AYP, schools also must annually assess the language skills of students with limited English proficiency (LEP). Minnesota schools meet this requirement by administering two language proficiency assessments: the Tests of Emerging Academic English (TEAE) and the Minnesota Student Oral Language Observation Matrix (MN SOLOM) (Minnesota Department of Education, 2003).

Graduation-Required Assessment for Diploma: In order to fulfill Minnesota high school graduation requirements, all students must take and pass the Graduation-Required Assessments for Diploma (GRAD), which "measure student performance on essential skills in Writing, Reading and Mathematics for success in the 21st century" (Minnesota Department of Education, 2011). These assessments are administered as part of the Minnesota Comprehensive Assessments (MCA-III) in grade 9 for written composition, grade 10 for reading, and grade 11 for mathematics. Students who do not attain the score necessary to meet the graduation requirement on their first attempt may retest (Minnesota Department of Education, 2011).

AFT (2003). State-by-state resources: Minnesota.
Airasian, P. (1997). Classroom assessment (3rd ed.). New York: Macmillan.
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Collier Macmillan.
Arter, J. (1995). Portfolios for assessment and instruction. Greensboro, NC: ERIC Clearinghouse on Guidance.
Bond, L. A. (1996). Norm- and criterion-referenced testing. Practical Assessment, Research & Evaluation, 5(2). Retrieved July 25, 2011 from
Cooper, J. M. (1999). The teacher as a decision-maker. In J. M. Cooper (Ed.), Classroom teaching skills (6th ed., pp. 1-19). Boston: Houghton Mifflin.
Dodge, J. (2011). What are formative assessments and why should we use them? Scholastic. Retrieved from
Garrison, C., & Ehringhaus, M. (2007). Formative and summative assessments in the classroom. Retrieved from
Glass, G., & Stanley, J. (1970). Statistical methods in education and psychology. Englewood Cliffs, NJ: Prentice-Hall.
Good, T. L., & Brophy, J. E. (1984). Looking in classrooms. New York: Harper & Row.
Gulikers, J., Bastiaens, T., & Kirschner, P. (2004). A five-dimensional framework for authentic assessment. Educational Technology Research and Development, 52(3), 67-86.
Karmel, L., & Karmel, M. (1978). Measurement and evaluation in the schools (2nd ed.). New York: Macmillan.
King, E. N. (2006). Understanding test scores. Retrieved from
Minnesota Department of Education (2003). Accommodations for students with limited English proficiency (LEP) on Minnesota statewide assessments. Retrieved from
Minnesota Department of Education (2011). Graduation required assessment for diploma. Retrieved from
Minnesota Department of Education (2011). No Child Left Behind programs. Retrieved from
Nitko, A., & Brookhart, S. (2011). Educational assessment of students (6th ed.). Boston: Pearson Education.
Northwest Evaluation Association (2006). Measures of Academic Progress (MAP) Minnesota state-aligned version 5. Retrieved from
Oosterhof, A. (1999). Developing and using classroom assessments (2nd ed.). Upper Saddle River, NJ: Merrill Prentice Hall.
Snowman, J., & McCown, R. (2009). Psychology applied to teaching (13th ed.). Belmont, CA: Wadsworth, Cengage Learning.
Stiggins, R., Arter, J., Chappuis, J., & Chappuis, S. (2006). Classroom assessment for student learning. Boston: Pearson Education.
U.S. Department of Education (2004). Introduction: No Child Left Behind. Retrieved from
Vogt, W. P. (1993). Dictionary of statistics and methodology. Newbury Park, CA: Sage.

Updated July 2012 by Edmund J. Sass