Title: Classroom Assessments in Large Scale Assessment Programs
1. Classroom Assessments in Large Scale Assessment Programs
- Catherine Taylor
- University of Washington/OSPI
- Lesley Klenk
- OSPI
2. History of Criterion-Referenced Assessment Models
- "Measurement-driven instruction" (e.g., Popham, 1987) emerged during the 1980s
- A process wherein tests are used as the driver for instructional change
- "If we value something, we must assess it."
- Minimum-competency movement of the 1980s
- Goal: "drive" instructional practices toward the teaching of basic skills
- The movement was successful: teachers did teach to the tests
- Unfortunately, teachers taught too closely to the tests (Smith, 1991; Haladyna, Nolen, & Haas, 1991)
- The tests were typically multiple-choice tests of discrete skills
- Instruction narrowed to the content that was tested, in the same form in which it was tested
3. History of Criterion-Referenced Assessment Models
- Large-scale achievement tests came under widespread criticism
- Negative impacts on the classroom (Darling-Hammond & Wise, 1985; Madaus, West, Harmon, Lomax, & Viator, 1992; Shepard & Dougherty, 1991)
- Lack of fidelity to valued performances
4. History of Criterion-Referenced Assessment Models
- Studies compared indirect and direct measures of
- writing (Stiggins, 1982)
- mathematical problem-solving (Baxter, Shavelson, Herman, Brown, & Valadez, 1993)
- science inquiry (Shavelson, Baxter, & Gao, 1993)
- Demonstrated that some of the knowledge and skills measured in each assessment format overlap
- Moderate to low correlations between different assessment modes
- Raised questions about the validity of multiple-choice test scores
- Other studies (Haladyna, Nolen, & Haas, 1991; Shepard & Dougherty, 1991; Smith, 1991) showed
- pressure to raise scores on large-scale tests
- narrowing of the curriculum to the specific content tested
- substantial classroom time spent teaching to the test and to item formats
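The "moderate to low correlations" reported in these comparison studies are typically Pearson correlations between students' scores on the two assessment modes. A minimal sketch of that calculation, assuming invented data (the multiple-choice scores and the 0-4 performance rubric below are hypothetical, not taken from the studies cited):

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for ten students on a multiple-choice test
# and on a performance task intended to cover the same standards.
mc_scores = [72, 85, 90, 65, 78, 88, 70, 95, 60, 80]
perf_scores = [3, 4, 3, 2, 4, 3, 3, 4, 2, 2]

r = pearson_r(mc_scores, perf_scores)
print(round(r, 2))  # a "moderate" correlation for these invented data
```

A correlation well below 1.0 between two measures of supposedly the same standards is what prompted the validity questions above: the two formats are plainly not capturing identical knowledge and skills.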
5. History of Criterion-Referenced Assessment Models
- In response to criticisms of multiple-choice tests, assessment reformers (e.g., Shepard, 1989; Wiggins, 1989) pressed for
- Different types of assessment
- Assessments that measure students' achievement of new curriculum standards
- Assessment formats that more closely match the ways knowledge, concepts, and skills are used in the world beyond tests
- Assessments that encourage teachers to teach higher-order thinking, problem-solving, and reasoning skills rather than rote skills and knowledge
6. History of Criterion-Referenced Assessment Models
- In response to these pressures to improve tests, LEAs, testing companies, and projects (e.g., the New Standards Project) incorporated performance assessments into testing programs
- "Performance assessments" included
- Short-answer items similar to multiple-choice items
- Carefully scaffolded, multi-step tasks with several short-answer items (e.g., Yen, 1993)
- Open-ended performance tasks (California, 1990; OSPI, 1997)
7. History of Criterion-Referenced Assessment Models
- Still, writers criticized these efforts
- Tasks are contrived and artificial (see, for example, Wiggins, 1992)
- Teachers complain that standardized tests don't assess what is taught in the classroom
- Shepard (2000) indicated that the promises of high-quality performance-based assessments have not been realized
- Authentic tasks are costly to implement, time-consuming, and difficult to evaluate
- Less expensive performance assessment options are less authentic
8. Impact of National Curriculum Standards
- Knowledge is expanding rapidly
- Education must shift away from knowledge
dissemination - Students must learn how to
- Gather information
- Comprehend, analyze, interpret information
- Evaluate the credibility of information
- Synthesize information from different sources
- Develop new knowledge
9. Early Attempts to Use Portfolios for State Assessment
- Three states attempted to use collections of classroom work for state assessment
- California (Kirst & Mazzeo, 1996; Palmquist, 1994)
- Kentucky (Kentucky State Department of Education, 1996)
- Vermont (Fontana, 1995; Forseth, 1992; Hewitt, 1993; Vermont State Department of Education, 1993, 1994a, 1994b)
10. Early Attempts to Use Portfolios for State Assessment
- Initial efforts were fraught with problems
- Inconsistency of raters when applying scoring criteria (Koretz, Stecher, & Deibert, 1992b; Koretz, Stecher, Klein, & McCaffrey, 1994a)
- Lack of teacher preparation in high-quality assessment development (Gearhart & Wolf, 1996)
- Inconsistencies in the focus, number, and types of evidence included in portfolios (Gearhart & Wolf, 1996; Koretz et al., 1992b)
- Costs and logistics associated with processing portfolios (Kirst & Mazzeo, 1996)
11. Research on Large Scale Portfolio Assessment
- Research on the impact of portfolios showed mixed results
- Teachers and administrators have generally positive attitudes about the use of portfolios (Klein, Stecher, & Koretz, 1995; Koretz et al., 1992a; Koretz et al., 1994a)
- Positive effects on instruction (Stecher & Hamilton, 1994)
- Teachers develop a better understanding of mathematical problem-solving (Stecher & Mitchell, 1995)
- Too much time spent on the assessment process (Stecher & Hamilton, 1994; Koretz et al., 1994a)
- Teachers work too hard to ensure that portfolios "look good" (Callahan, 1997)
12. Advantages of Using Classroom Evidence in Large-Scale Assessment Programs
- Evidence that teachers are preparing students to meet curriculum and performance standards (opportunity to learn)
- Broader evidence about student achievement
- Opportunity to assess knowledge and skills difficult to assess via standardized tests (e.g., speaking and presenting, report writing, scientific inquiry processes)
- Opportunity to include work that more closely represents the real contexts in which knowledge and skill are applied
13. Opportunity to Learn
- Little evidence is available about whether teachers are actually teaching to curriculum standards
- Claims about positive impacts of new assessments on instructional practices are largely anecdotal or based on teacher self-report
- Legal challenges to tests used for graduation, placement, and promotion demand evidence that students have had the opportunity to learn the tested curriculum (Debra P. v. Turlington, 1979)
- There is no efficient method to assess students' opportunity to learn the valued concepts and skills
- Collections of classroom work provide a window into the educational experiences of students
- Collections of classroom work provide a window into the educational practices of teachers
- Collections of classroom work could help administrators evaluate the effectiveness of in-service teacher development programs
- Classroom assessments could be used in court cases to provide evidence of individual students' opportunity to learn
14. Broader Evidence of Student Learning
- Some students function well in the classroom but do not perform well on tests
- "Stereotype threat" research: fear of confirming a negative stereotype can lead minority students and girls to perform less well than they should on standardized tests (Aronson, Lustina, Good, Keough, Steele, & Brown, 1999; Steele, 1999; Steele & Aronson, 2000)
- Students may have cultural values or language development issues that inhibit performance on timed, standardized tests
- These factors threaten the validity of large-scale test scores
- Classroom work can be more sensitive to students' cultural and linguistic backgrounds
- Collections of classroom work can be more reliable than standardized test scores
15. Including Standards That Are Difficult to Measure on Tests
- Some desirable curriculum standards are too unwieldy to measure on large-scale tests (e.g., scientific inquiry, research reports, oral presentations)
- Historically, standardized tests measure complex work by testing knowledge of how to conduct the work. Examples:
- Knowing where to locate sources for reports
- Knowing how to use tables of contents, bibliographies, card catalogues, and indexes
- Identifying control or experimental variables in a science experiment
- Knowing appropriate strategies for oral presentation
- Knowing appropriate ways to use visual aids
- Critics often note that knowing what to do doesn't necessarily mean one is able to do it
16. Authenticity
- Frederiksen (1984) raised the question of authenticity in assessment due to misrepresentation of domains by standardized tests
- Wiggins (1989) claimed that in every discipline there are tasks that are authentic to the given discipline
- Frederiksen (1998) stated that authentic achievement is "significant intellectual accomplishment that results in the construction of knowledge through disciplined inquiry to produce discourse, products, or performances that have meaning or value beyond success in school" (p. 19, italics added)
- Examples of performances:
- Policy analysis
- Historical narrative and evaluation of historical artifacts
- Geographic analysis of human movement
- Political debate
- Story and poetry writing
- Literary analysis/critique
- Mathematical modeling
- Investment or business analyses
- Geometric design and animation
- Written report of a scientific investigation
- Evaluation of the health of an ecosystem
17. Authenticity
- Some measurement specialists question the use of the terms authentic and direct measurement
- All assessments are indirect measures from which we make inferences about other, related performances (Terwilliger, 1997)
- However:
- Validity is related to the degree of inference necessary from scores on a standardized test to valued work
- Authentic classroom work requires less inference than multiple-choice test scores
18. Challenges with Inclusion of Classroom Work in Large Scale Programs
- Limited teacher preparation in classroom-based assessment (which can limit the quality of classroom-based evidence)
- Selections of evidence (which can limit comparisons across students)
- Reliability of raters (which can limit the believability of scores given to student work)
- Construct-irrelevant variance (which can limit the validity of scores)
19. Solving Teacher Preparation Issues
- Teachers must be taught how to
- Select, modify, and develop assessments
- Score (evaluate) student work
- Write scoring (marking) rules for assessments that align to standards
- Significant, ongoing professional development in assessment is essential
- Teachers need to re-examine
- Important knowledge and skills within each discipline
- How to teach so that students are more independent learners
20. Selection of Evidence
- "For which knowledge, concepts, and skills do we need classroom-based evidence?"
- Koretz et al. (1992b) claimed that, when teachers are free to select evidence, there is too much diversity in tasks
- Diversity may cause low inter-judge agreement among raters of the portfolios
- Koretz and his colleagues recommended placing some restrictions on the types of tasks considered acceptable for portfolios
- Teachers need guidance in terms of what constitutes appropriate types of evidence
21. Improving Selections of Evidence
- Provide guidelines for what constitutes an effective collection of evidence
- Provide models for the types of assignments (performances) that will demonstrate the standards
- Provide blueprints for tests that can assess the EALRs (Essential Academic Learning Requirements) assessed by the WASL (Washington Assessment of Student Learning)
- Provide guides for writing test questions and scoring rubrics
- Provide guides for writing directions and scoring rubrics for assignments (performances)
22. Guidelines for Collections Include
- Lists of important work samples to collect (e.g., research reports, mathematics problems)
- Number and types of evidence for each category
- Outline of steps in performances and work samples
- Tools for assessment of students' performances and work samples
23. Example Lists of Number and Types of Work Samples to Collect
- Writing Performances
- At least 2 different writing purposes
- At least 3 different audiences
- Some examples from courses other than English
- Science Investigations
- At least 3 investigations (physical, earth/space, life)
- Observational assessments of hands-on work
- Lab books
- Summary research reports
24. Develop Benchmark Performance Assessments
- Benchmark performances are performances that
- Have value in their own right
- Are complex and interdisciplinary
- Are expected of students by the end of some defined period of time (e.g., the end of middle school)
- Performances may require
- Application of knowledge, concepts, and skills across subject disciplines (e.g., survey research)
- Authentic work within one subject discipline (e.g., scientific investigations, expository writing)
25. Example Description of a Benchmark Performance in Reading
- By the end of middle school, students will select one important character from a novel, short story, or play and write a multi-paragraph essay describing the character, how the character's personality, actions, choices, and relationships influence the outcome of the story, and how the character was affected by the events in the story. Each paragraph will have a central thought that is unified into a greater whole, supported by factual material (direct quotations and examples from the text) as well as commentary to explain the relationship between the factual material and the student's ideas.
26. Example Description of a Benchmark Performance in Mathematics
- By the end of high school, students will investigate and report on a topic of personal interest by collecting data for a research question. Students will construct a questionnaire and obtain a sample from a relevant population. In the report, students will report the results in a variety of appropriate forms (including pictographs, circle graphs, bar graphs, histograms, line graphs, and/or stem-and-leaf plots, and incorporating the use of technology), analyze and interpret the data using statistical measures (central tendency, variability, and range) as appropriate, describe the results, make predictions, and discuss the limitations of their data collection methods. Graphics will be clearly labeled (including the name of the data, units of measurement, and appropriate scale) and informatively titled. References to data in reports will include units of measurement. Sources will be documented.
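The statistical measures this benchmark names (central tendency, variability, and range) can be made concrete. A minimal sketch over hypothetical survey data (the responses and the 1-5 rating scale are invented for illustration), using Python's standard statistics module:

```python
from statistics import mean, median, mode, pstdev

# Hypothetical questionnaire responses on a 1-5 rating scale.
responses = [3, 4, 4, 2, 5, 3, 4, 1, 3, 4]

print("mean:", mean(responses))          # central tendency
print("median:", median(responses))      # central tendency
print("mode:", mode(responses))          # most frequent response
print("range:", max(responses) - min(responses))
print("std dev:", round(pstdev(responses), 2))  # variability (population)
```

A student's written analysis would report these values alongside the graphs and then discuss what they do and do not show about the sampled population.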
27. Example of the Process of Developing Benchmark Performances
- Select work that would be familiar or meaningful
- Purchasing decision
- Describe the performance in some detail
- A person plans to buy a ___ on credit. The person figures out how much s/he can spend (down-payment and monthly payments), does research on the different types of ___, reads consumer reports or product reviews, compares costs and qualities, and makes a final selection. The person then locates the chosen product and purchases it or finances the purchase.
28. Example of the Process (continued)
- Define the steps adults take to complete the performance
- A person plans to buy a ___ on credit for ___ purpose.
- The person figures out how much s/he can spend
- Determines money available for a down-payment
- Compares income and monthly expenses to determine cash available for a monthly payment
- Does research on the different types of ___, including costs and finance options
- Reads consumer reports or product reviews
- Compares costs, qualities, and finance options
- Makes a final selection
- Locates the chosen product and finances the purchase
29. Example of the Process (continued)
- Create grade-level-appropriate steps
- The student plans to buy a ___ on credit for ___ purpose.
- The student
- Figures out how much s/he can spend
- Determines money available for a down-payment
- Compares income and monthly expenses to determine cash available for a monthly payment
- Does research on at least 3 types of ___
- Determines costs and finance options
- Reads consumer reports or product reviews
- Compares costs, qualities, and finance options
- Makes a final selection that is optimal for cost, quality, and finance options within budget
30. Example of the Process (continued)
- Identify the EALRs demonstrated at each step
- The student plans to buy a ___ on credit for ___ purpose.
- The student
- Figures out how much s/he can spend (EALR 4.1)
- Determines money available for a down-payment (EALR 4.1)
- Compares income and monthly expenses to determine cash available for a monthly payment (EALR 3.1)
- Does research on at least 3 types of ___ (EALR 4.1)
- Determines costs and finance options (EALR 1.5.4)
- Reads consumer reports or product reviews (EALR 4.1)
- Compares costs, qualities, and finance options (EALR 3.1)
- Makes a final selection that is optimal for cost, quality, and finance options within budget (EALR 2.1-2.3)
31. Example of the Process (continued)
- Modify the steps as needed to ensure demonstration of the EALRs
- The student plans to buy a ___ on credit for ___ purpose.
- The student
- Figures out how much s/he can spend (EALR 4.1)
- Determines money available for a down-payment (EALR 4.1)
- Compares income and monthly expenses to determine cash available for a monthly payment (EALR 3.1)
- Does research on at least 3 types of ___ (EALR 4.1)
- Determines costs and finance options (EALR 1.5.4)
- Reads consumer reports or product reviews (EALR 4.1)
- Creates a table to show a comparison of costs, qualities, and finance options (EALR 3.1)
- Makes a final selection and explains how it is optimal for cost, quality, and finance options within budget (EALR 2.1-2.3)
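The budget arithmetic behind the "figures out how much s/he can spend" steps rests on the standard fixed-payment loan formula, M = P*r / (1 - (1+r)^(-n)), where P is the financed amount, r the monthly interest rate, and n the number of payments. A hedged sketch; the price, interest rate, term, and income figures below are hypothetical, chosen only to illustrate the calculation a student would perform:

```python
def monthly_payment(principal, annual_rate, months):
    """Fixed monthly payment on a fully amortized loan."""
    r = annual_rate / 12  # monthly interest rate
    if r == 0:
        return principal / months  # interest-free: simple division
    return principal * r / (1 - (1 + r) ** -months)

price = 12000.00        # hypothetical purchase price
down_payment = 2000.00  # cash available up front
financed = price - down_payment

# 6% APR over 48 months (both figures invented for the example).
payment = monthly_payment(financed, 0.06, 48)

income, expenses = 2500.00, 2100.00
affordable = payment <= (income - expenses)  # fits the monthly budget?
print(round(payment, 2), affordable)
```

Comparing this payment against income minus monthly expenses is exactly the "cash available for a monthly payment" step the performance asks students to carry out.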
32. Possible Authentic Performances in Mathematics
- Survey Research
- Community issue
- School issue
- Return on investment (costs and sales)
- Purchasing decisions
- Graphic designs
- Animation
- Social science analyses
- Sources of GDP
- Major categories of federal budget
- Casualties during war
33. Possible Authentic Performances in Reading
- Literary analyses
- Comparisons across different works by the same author
- Comparisons across works by different authors on the same theme
- Analysis of theme, character, and plot development
- Reading journals
- Research reports
- Summary of information on a topic from multiple sources
- Investigation of a social or natural science research question using multiple sources
- Position paper based on information from multiple sources
34. Providing an Example Blueprint for Tests That Can Assess the Standards
35. Example Blueprint for Tests That Can Assess Standards
36. Solving Score Reliability Issues
- Train expert teachers to evaluate diverse collections of evidence
- Expert teachers evaluate the collection of work to determine whether it meets standards
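One standard way to check the rater consistency this slide targets is Cohen's kappa, which corrects raw percent agreement for the agreement expected by chance. A minimal sketch, assuming hypothetical data (the two raters' scores and the 1-4 rubric scale are invented):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: two-rater agreement, corrected for chance."""
    n = len(rater_a)
    # Observed proportion of exact agreements.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal score distribution.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical scores (1-4 rubric) given by two trained raters
# to the same ten collections of evidence.
rater_1 = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
rater_2 = [3, 2, 3, 3, 1, 2, 3, 4, 2, 4]

print(round(cohens_kappa(rater_1, rater_2), 2))
```

Kappa runs from about 0 (chance-level agreement) to 1 (perfect agreement); a state program would set a minimum acceptable value and retrain or re-calibrate raters who fall below it.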
37. Construct-Irrelevant Variance
- Factors unrelated to the targeted knowledge and skills that affect the validity of performance scores:
- Teachers provide too much help
- Teachers provide differential types of help
- Students get help from parents
- Directions for assignments are not clear
- Students are taught the content but not how to do the type of performance
38. Solving Construct-Irrelevant Variance Problems
- Provide guidelines for what constitutes valid evidence
- Provide model performance assessments or benchmark performance descriptions
- Provide professional development on appropriate levels of help
- Provide professional development on the EALRs and GLEs
- Provide professional development on how to teach to authentic work
39. Conclusion
- Collections of evidence CAN be used to measure valued knowledge and skills
- Collection of Evidence (COE) guidelines for Washington State
- Incorporate many of the characteristics that will ensure more valid student scores
- Will continue to improve as more examples are provided
- Scoring of collections
- Will involve the same rigor in scoring as for WASL items
- Will provide reliable student-level scores