By Adam Robbins
If you were to write a list of the most important things a middle leader must do, how high up would ‘developing an effective assessment model’ be?
I imagine it would not be top five. Maybe it would be top ten, at a push. In many cases, developing an effective assessment model would not be prioritised because, in general, middle leaders have an enviable confidence in the way they assess students and in their response to the data they gather.
Either that or they think assessment must be done the way it has always been done and so there is no benefit to exploring it in more detail. I argue that not enough time is spent considering the validity, reliability and personnel factors that assessment choices impact.
I want to take some time to discuss some aspects of assessment, with the aim of encouraging us all to think carefully about which assessment model we choose and what it will deliver. For the purpose of this article, I am focussing on the use of formal summative assessments, which I will call tests. I recognise that assessment also includes informal and more formative strategies, but they are not the focus of this article, as I consider them to be less a part of an “assessment model” and more to do with general teaching and learning.
Assessment is all about choices
Assessment is an imperfect solution to a complex problem: how do we know what our students know and can do? This means that the decisions we make about our assessment model have consequences. We should start our review of our assessment model by considering one question carefully: what do we want our model to achieve?
Do we want to generate data that shows a student’s performance in each area of the curriculum? Do we value an assessment style that guarantees marking-accuracy or one that embodies our subject’s intent more holistically? Do we need to report this to parents regularly? How often? Will performance in these assessments be used to determine future classes or a tier of entry for students? The list goes on.
Once we have decided what our utopian ideals might be, we may find that the practicalities of time, budget and expertise force us to make some pragmatic compromises, but at least we will get as close as we can.
In my opinion, the first place to start is to decide how valid we need the test to be. A valid test is one that accurately measures what you intend it to measure. We need to ask ourselves: what does a test in my subject need to contain to make it a true representation of that subject? For some subjects, like Maths, this is straightforward. For others, like Drama or History, this might be a more complex discussion.
For our purposes, test reliability is best interpreted as inter-rater reliability. This is a measure of how likely two markers are to award the same mark to the same piece of work. Again, a subject like Maths benefits from intrinsically high reliability. However, a subject like English has a choice to make. A more valid testing format, like essay writing, drastically reduces the reliability of the marking. This is not to say English teachers are poor at marking, just that essays are inherently harder to mark reliably than multiple-choice questions (MCQs), as there is the natural subjectivity of the marker to consider.
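To make inter-rater reliability concrete, here is a minimal Python sketch of two ways a department might quantify how closely two markers agree on the same set of scripts. The marks, the function names and the choice of Cohen's kappa as a chance-corrected measure are illustrative assumptions rather than a prescribed method. Note that kappa, as written here, treats each mark as a separate category, so a near miss counts the same as a wide one.

```python
from collections import Counter


def exact_agreement(marks_a, marks_b):
    """Proportion of scripts on which two markers awarded the same mark."""
    if len(marks_a) != len(marks_b) or not marks_a:
        raise ValueError("need two equal-length, non-empty lists of marks")
    matches = sum(a == b for a, b in zip(marks_a, marks_b))
    return matches / len(marks_a)


def cohens_kappa(marks_a, marks_b):
    """Agreement between two markers, corrected for agreement expected by chance."""
    n = len(marks_a)
    observed = exact_agreement(marks_a, marks_b)
    freq_a = Counter(marks_a)
    freq_b = Counter(marks_b)
    # Chance agreement: probability both markers pick the same mark independently.
    expected = sum(freq_a[m] * freq_b[m] for m in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both markers gave every script the same single mark
    return (observed - expected) / (1 - expected)


# Hypothetical marks for the same eight scripts, each marked twice.
essay_marker_1 = [4, 5, 3, 6, 4, 2, 5, 3]
essay_marker_2 = [4, 4, 3, 5, 4, 3, 5, 2]
mcq_marker_1 = [7, 9, 6, 8, 7, 5, 9, 6]
mcq_marker_2 = [7, 9, 6, 8, 7, 5, 9, 6]

print("Essay agreement:", exact_agreement(essay_marker_1, essay_marker_2))
print("Essay kappa:", round(cohens_kappa(essay_marker_1, essay_marker_2), 2))
print("MCQ agreement:", exact_agreement(mcq_marker_1, mcq_marker_2))
```

In this made-up data the two essay markers agree exactly on only half of the scripts, while the MCQ marks match every time, which is the trade-off between validity and reliability in miniature.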
Therefore, our first big decision is to decide how much we value validity and reliability and agree on a format we think suits our subject best.
How frequently you assess your students has a greater impact on the inferences you can make from the data than you might initially think. Some people think that when it comes to assessments, more is better. But often this is not the case. One of the key things leaders often want to do with assessment data is to generate some sort of grade to indicate student performance. The logic then goes: if we do more frequent, but smaller, assessments, we can generate more granular data that might give us an indication of student progress within the year.
Unfortunately, this is probably not possible. By increasing the frequency of testing, we will probably need to shorten the tests. By shortening them, we reduce the amount of the domain (the subject content) that can be sampled. If the assessment is only assessing a small section of the domain, then the inferences we can make are going to be limited.
Our students’ score on this test will not, therefore, reflect their overall performance in the subject as there will not be sufficient marks available to test a representative sample of the subject. So frequent shorter tests are good at providing a level of informal data to the teachers about what aspects of a topic the students grasped well and what areas need more time, but they cannot give us reliable performance data that allows us to award a grade.
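A rough way to see why shorter tests support weaker inferences is to treat each mark as an independent question drawn at random from the domain. That is a simplifying assumption, as are the function name and the numbers below, but under it the uncertainty in a student's percentage score shrinks only as the number of marks grows.

```python
import math


def score_standard_error(true_proportion, marks_available):
    """Standard error of a test score expressed as a proportion.

    Assumes each mark is an independent right-or-wrong question sampled
    at random from the whole domain (a simple binomial model).
    """
    if marks_available <= 0:
        raise ValueError("marks_available must be positive")
    return math.sqrt(true_proportion * (1 - true_proportion) / marks_available)


# Hypothetical student who would get 60% of the whole domain right.
short_test_se = score_standard_error(0.6, 10)  # a 10-mark topic quiz
long_test_se = score_standard_error(0.6, 60)   # a 60-mark end-of-term paper

print(f"10-mark test: 60% give or take about {short_test_se:.0%}")
print(f"60-mark test: 60% give or take about {long_test_se:.0%}")
```

Under these assumptions the 10-mark quiz leaves roughly 15 percentage points of uncertainty around the student's score, against roughly 6 for the 60-mark paper, which is why a grade built on the short test is so much shakier.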
If we decide that awarding a grade is valuable, then we will need to prioritise less frequent but longer assessments that cover a representative sample of all the content covered up to that point and have a high degree of test validity.
Communicating with students and parents
Awarding grades or scores is the most common way we communicate performance to our students and their parents/carers. But it is also incredibly difficult. Middle leaders can easily fall into the trap of thinking they can just assign grades based on external boundaries, like previous GCSE grade boundaries. This only works if the test is of a similar length and is as challenging as a normal GCSE or other terminal assessment.
My personal rule of thumb is, if the total marks being awarded are not at least 50% of the total marks awarded for the final grade then it’s probably not worth awarding a grade at all. At least, that is what I have found in Science. Other subjects might have sections that are smaller but are comparable and representative of the whole. I would worry that this would only hold true if performance between the sections was consistent. If they get a grade 6 in section A will they get a grade 6 overall?
Students also have a right to know how they performed. With shorter tests, it might be better just to give them their score as a percentage. You could then contrast this with a class or year average so they have an idea of how they performed relative to their peers. Some prefer to give a rank order, but this can create problems, and sometimes middle leaders feel it is cruel to those at the bottom. This makes ranking a value judgement that each department will need to make independently and very carefully.
Middle leaders ideally want an assessment model to measure a student’s progress within or between years. Often, they do not realise how incredibly difficult this is to do with any certainty. To measure performance over time we need to ditch the percentages, as there is no way we can create tests of equal difficulty. We will probably need to use some statistical analysis like creating standardised scores for each student. Then if we track a student’s progress over a series of tests, we will have an idea of how they have moved within our cohort. This is not literal progress but relative progress.
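As a minimal sketch of what standardised scores look like in practice, the Python below converts raw marks into z-scores (distance from the cohort mean, measured in standard deviations). The student names, the marks and the function name are hypothetical, and a real cohort would be far larger than three students.

```python
from statistics import mean, pstdev


def standardised_scores(raw_scores):
    """Convert a {student: raw mark} dict into {student: z-score}."""
    marks = list(raw_scores.values())
    mu = mean(marks)
    sigma = pstdev(marks)  # spread of this cohort on this test
    if sigma == 0:
        # Everyone scored the same, so nobody is above or below the cohort.
        return {name: 0.0 for name in raw_scores}
    return {name: (mark - mu) / sigma for name, mark in raw_scores.items()}


# Hypothetical marks on two tests of different difficulty.
autumn_raw = {"Asha": 40, "Ben": 50, "Cara": 60}
spring_raw = {"Asha": 35, "Ben": 30, "Cara": 40}  # a harder paper

autumn_z = standardised_scores(autumn_raw)
spring_z = standardised_scores(spring_raw)

for name in autumn_raw:
    change = spring_z[name] - autumn_z[name]
    print(f"{name}: {autumn_z[name]:+.2f} -> {spring_z[name]:+.2f} ({change:+.2f})")
```

In this invented example Asha's raw mark falls from 40 to 35, yet her standardised score rises from below the cohort mean to exactly on it, because the second paper was harder for everyone. That is relative movement within the cohort, not evidence of how much she has learned.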
Going further than this requires a stronger set of skills. We will need to dip our toes into item response theory and the Rasch model. If we design our assessments with commonly recurring questions, then it will give us a way of comparing students’ performance with more certainty. Even then, the elusive “progress over time” goal is challenging to achieve. So, it’s probably best to just keep it simple and do our best to avoid making such comparisons.
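For readers curious about what the Rasch model actually says, its core is a single formula: the chance of a correct answer depends only on the gap between a student's ability and a question's difficulty, both measured on the same scale. The sketch below only evaluates that formula for made-up values. Estimating abilities and difficulties from real responses needs specialist software and is not shown, and the function name and numbers are assumptions for illustration.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the Rasch model.

    Both arguments are on the same logit scale; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A hypothetical recurring question with difficulty 0.5, reused on two tests.
anchor_difficulty = 0.5

p_matched = rasch_probability(ability=0.5, difficulty=anchor_difficulty)
p_stronger = rasch_probability(ability=1.5, difficulty=anchor_difficulty)

print(f"Ability equal to difficulty: {p_matched:.2f}")
print(f"Ability one logit higher:    {p_stronger:.2f}")
```

A student whose ability matches the question's difficulty has an even chance of answering correctly, and the chance rises as ability pulls ahead. Recurring questions matter because their difficulty stays fixed, giving two different tests a common reference point.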
Any decision a middle leader makes around assessment has a drastic impact on the team’s workload. Moving from two to three assessments in Year 7 increases the teachers’ marking workload by 50%, as every student’s assessment still needs to be marked. This is time that those teachers cannot spend planning lessons and so on, and this opportunity cost is a huge consideration. Often middle leaders are tempted to resolve all these conflicting issues by trying to do a bit of everything. This is when the workload can balloon if not carefully considered.
Finding the way forward
Where does this leave us as leaders then? We have competing objectives that we will most likely have to balance or compromise on. What I have found to be the best approach is to figure out which of the priorities I can shift out of formal assessment and into informal classroom activities. For example, if I want to know which parts of a particular topic students have grasped, instead of doing a ‘topic test’ can I find this information in a more informal way? Can I use regular AfL (assessment for learning) to build a picture? Or can I reduce the test to a practice activity that is pupil-marked and low-stakes?
This way we keep our teachers’ time and effort focused on the thing we can’t get another way: a more robust measure of performance. To do this we need longer tests that are less frequent, accepting that they will cause more intense periods of marking.
Assessment is a difficult thing to get right. When designing an assessment model it is vital we are clear about what we want it to achieve and accept that by making these choices there will be consequences that we will have to live with. However, an assessment model that suits your highest priorities will be highly effective and provide you and your teachers with a useful tool, so it is worth the time to discuss, debate and plan a comprehensive model that has its aims deliberately stated and its constraints acknowledged.
After all, as the Rolling Stones said: “You can’t always get what you want, but if you try sometimes, well, you just might find, you get what you need.”