High Standards, Multiple Tries
This page collects resources for the talk “High Standards, Multiple Tries – How I’ve Been Grading”, which I’ve given in a few places.
Abstract: Supporting students who miss coursework or get off to a bad start presents many challenges. Extending deadlines, giving retakes, or creating alternate assignments strains course staff with rescheduling and regrading; further, judging which student situations warrant these extra opportunities is fraught. Curving course grades or exam grades, dropping lowest assignments, deciding grade cutoffs post hoc, and more fuel a perception of eroding rigor (justified or not).
In this talk I present how I’ve tried to make progress on this in my courses, at scale, over the past few years. I’ve used a mix of technological solutions, syllabus design decisions, staff training, and student communication. The primary design consideration is twofold: clearly articulating course outcomes and their mappings to activities, and assuming that most students will need more than one try on many course activities (to sloganize: “high standards, multiple tries”). From these principles, we design resubmission and regrade systems, a retake strategy for exams, and an overall grading policy that gives me straightforward replies to the vast majority of “exceptional” student situations and doesn’t make me feel like I’m compromising on assessment.
Slides: pptx
Several course websites that use (variants of) the grading scheme I mentioned:
This paper (though not about CSE15L, which was featured in the talk) describes a “two tries” exam strategy from our accelerated intro course, also in the spirit of the talk:
Stream your exam to the course staff
There are three mutually-supporting principles presented in the talk:
(1) High Standards Across Categories of Work
(2) Multiple Tries on Most Activities
(3) Coarse-Grained Rubrics
The development in the talk starts with (2) and then justifies the others from there. Here, we follow backwards design principles starting with outcomes and working back through implementation and student incentives that will get us there.
The first point is about what meaning is ascribed to an A, B+, C-, and so on. Most of this work I’ve done in the context of a programming-heavy lab-based course on software tools. Outcomes from the course are things like:
A student should be able to clone a repository that contains C code and a Makefile, cd into it, run make, and, if there is a compiler error from gcc, identify which file and line the error refers to.
A student should be able to write a C program that takes a UTF-8 encoded command-line argument and prints the first Unicode code point in it (given sufficient documentation about UTF-8).
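To make the second outcome concrete, here’s a rough sketch of the decoding logic it asks for. It’s written in Python rather than C (the outcome itself asks for a C program), and it’s only an illustration of the behavior, not a reference solution:

```python
# Sketch of the behavior the second outcome describes, in Python rather than C:
# decode the first UTF-8 code point of a command-line argument by hand.
import sys

def first_code_point(data: bytes) -> int:
    lead = data[0]
    if lead < 0x80:                    # 0xxxxxxx: 1-byte sequence (ASCII)
        return lead
    elif lead >> 5 == 0b110:           # 110xxxxx: 2-byte sequence
        length, value = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:          # 1110xxxx: 3-byte sequence
        length, value = 3, lead & 0x0F
    else:                              # 11110xxx: 4-byte sequence
        length, value = 4, lead & 0x07
    for byte in data[1:length]:        # continuation bytes: 10xxxxxx
        value = (value << 6) | (byte & 0x3F)
    return value

if __name__ == "__main__":
    print(hex(first_code_point(sys.argv[1].encode("utf-8"))))
```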
A key observation about programming in today’s courses is this: assigning these tasks as take-home work, having students submit code and prose about them, and inspecting those artifacts should not give us confidence that students have achieved the outcomes above. Excessive collaboration with peers, help from course staff, and LLM use all confound using this kind of assignment as a secure assessment. Efforts to police collaboration, coach LLM use, and train TAs not to “give away the answer” are variously laudable and ineffective, but none of them are sufficient for secure assessment.
This does not mean we shouldn’t give students take-home programming work! I simply don’t view take-home programming work as a secure way to assign grades on its own. It’s (an important) part of students’ learning, but not a sufficient certification.
By secure assessment I mean that we have high confidence that the work we assess was completed by a specific student under controlled conditions we chose, and those conditions were roughly uniform across students. In-person proctored exams are a great example. Secure assessments are an important part of certifying a student’s work at a particular grade level, because they remove the confounds of external help.
For the courses I’ve been teaching that have programming or tool-based learning outcomes like the ones above, a paper exam isn’t great. A paper exam can give me high confidence that students can trace code, identify git commands, or fill in parts of a C program, but it doesn’t let me observe the outcomes I stated above. So for these classes I’ve been giving proctored on-computer assessments that require actually running commands, checking out repositories, making commits, and so on in order to complete them. I do this because I don’t trust the programming assignments to certify that learning outcomes were reached, and I care deeply about observing that students achieved the outcomes I care about!
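As an illustration (this script is hypothetical, not my actual exam infrastructure), checking one of these on-computer tasks can be as simple as cloning the student’s repository and running the build, something like:

```python
# Hypothetical checker for an on-computer exam task: does the student's
# repository build with make after their commits? Not the real infrastructure.
import subprocess
import tempfile

def builds_cleanly(repo_url: str) -> bool:
    with tempfile.TemporaryDirectory() as workdir:
        clone = subprocess.run(["git", "clone", repo_url, workdir],
                               capture_output=True)
        if clone.returncode != 0:
            return False                     # repository missing or inaccessible
        build = subprocess.run(["make"], cwd=workdir, capture_output=True)
        return build.returncode == 0         # did make succeed?
```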
With secure assessments in hand that map directly to the learning outcomes I care about, and some take-home programming work, we can assemble the evidence of learning we have into a grade for each student. Here a bit of incentive design comes into play. We should not disincentivize students from doing take-home programming work (it is so good for their learning!). But we don’t want take-home programming work to dominate their grade; we can’t trust it for that. While we could try various weightings, I find it more direct to use the standards-based approach. To earn an A (or B, C, etc), students must achieve A/B/C-level success on both their assignments and the secure assessments. So a student who submits no programming work fails the course just as a student who doesn’t pass any exams fails the course.
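As a sketch of that policy (in Python, with hypothetical category names and levels that aren’t from any particular syllabus), each category supports some level, and the lower one decides the letter:

```python
# Hypothetical sketch of standards-based category grading.
# The category names and levels are illustrative, not a real course's thresholds.

LEVELS = ["F", "C", "B", "A"]  # ordered from lowest to highest

def course_letter(assignment_level: str, exam_level: str) -> str:
    """The letter grade is the highest level that *both* categories reach."""
    rank = {level: i for i, level in enumerate(LEVELS)}
    return LEVELS[min(rank[assignment_level], rank[exam_level])]

# A-level assignments with C-level exams earn a C, and a student who
# submits no assignments fails regardless of exam performance.
assert course_letter("A", "C") == "C"
assert course_letter("F", "A") == "F"
```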
Some people have called this “min-grading” because it is mathematically equivalent to taking the minimum of the categories. While true, this is not a helpful presentation for students (especially when considering the other factors below), because it comes across as draconian: if the assignments and exams are the same as they would have been otherwise (maybe they aren’t, maybe they are, but it’s not an unreasonable assumption) it seems we’ve simply “removed opportunities” or “made it harder” to get an A, such that a student who “used to” get an A under “normal” grading no longer would.
Indeed – this could be draconian if we made no other changes! And doing this with no further changes would immediately run into myriad predictable issues – students who do poorly early may have a “ceiling” on their grade that disincentivizes trying in other categories, we still need policies for missed work or exams, and so on.
Most modern courses have policies in place so that getting a bad grade on a single exam or an assignment (or even more than one) doesn’t put a hard ceiling on a student’s grade in the course. Often, since categories like assignments and exams are weighted, one category can help “make up” for another one (and sometimes policies take explicit advantage of this by giving extra points in categories, and so on).
In contrast, with the standards-based category cutoffs described above, if we aren’t careful, we can set up perverse incentives. A bad grade on an assignment or exam could cap a student’s overall grade at a B or C (or even F), no matter how well they do in another category, and this could happen quite early in the term. This is disappointing if it makes a student stop trying on assignments because of a bad exam grade, or vice versa!
There are other considerations about “bad” grades or “lost credit”.
Having retries, or a second round of submission, addresses all of these points. For assignments, this typically means (for me) having a second deadline a week or two after the initial deadline. Work is submitted, then graded within a few days or a week, and then there are a few days or a week for the resubmission. For exams, I’ve been giving several during the quarter and then scheduling all retry exams during finals week.
Some considerations:
In short, allowing multiple submissions supports the high standards of the assessments we discussed in the last section, along with providing other useful incentives and structures for missed work, engaging with feedback, and more.
A major tradeoff with resubmission for each assignment is the grading effort for anything that requires human review. If all students are afforded a resubmit, it seems natural that nearly twice as much grading would result.
In my experience, this is partially true. I have given fewer assignments (e.g. 5 or 6 instead of 8) when allowing resubmits. My final exams have also been mostly made up of retry exam material. However, when I know up front that grading is going to have to happen in bulk and involve resubmits, there are some related decisions we can make that help a lot, specifically in how we set up rubrics.
I have found there to be a lot of false precision when I default to grading things out of e.g. 100 points. I don’t really think that I have 100 increments of credit. Rather, a student’s work on an assignment or exam largely falls into buckets of achievement – they’ve demonstrated mastery, proficiency, not much understanding, etc. Exposing this false precision to students in grades causes issues:
These issues have led me to use coarse-grained rubrics and scoring on just about everything. That is, we may have many mechanisms for grading on the backend (autograding with dozens or hundreds of unit tests, manual review of code, reading written work, reading handwritten code on exams, and so on), but the score exposed to students is a whole number in, say, the 0-4 range. Any grading work we do is projected onto one of those scores, and each score sends a clear signal.
The clearest signal is 4 – if a student scores a 4 there is no more work for them to do; we have evaluated their work as demonstrating mastery and there’s no more “credit” to get. Crucially, there’s usually a bit of leeway below “perfection” that still earns a 4. We may still give feedback about small mistakes, or provide the few less important tests that failed, but all the credit is still earned. This immediately removes an entire class of submissions from needing resubmissions or retries. It also subtly signals that students can relax a little bit – we are looking for them to demonstrate thorough understanding and execution, but not perfection.
Scores of 3 and 2 are similar (sometimes I’ve used 0-3 grading). Significant errors were made, but the work is (mostly) complete and there’s some meaningful understanding demonstrated. A score of 1 usually means something was submitted but it was quite incomplete or missed the point, and a score of 0 is reserved for blank submissions, submissions of something totally wrong or different, or missed deadlines.
A typical rubric for a programming assignment might require that a submission pass all but a handful of test cases out of hundreds, and also demonstrate understanding in written design questions, in order to earn a 4. If some core test cases are passed but some meaningful features are incorrect, this could result in a 2 or a 3 (perhaps depending on the supporting written design work), and so on. Crucially, the human effort of deciding on the boundaries between 1-4 can be done (relatively) quickly. Further, all the benefits of retries described above apply – a grader won’t assign a 2 or 3 to work with no mistakes, so even if a grader gives a 3 when a 4 would have been appropriate, the only consequence is the student resubmits (and they learn more, and they did make mistakes!) to earn full credit.
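Here is a minimal sketch of that projection, with made-up test-count cutoffs (the real boundaries are a per-assignment judgment call, and “core tests” is just an illustrative split):

```python
# Hypothetical projection of detailed grading signals onto a 0-4 rubric score.
# The specific cutoffs below are invented for illustration.

def rubric_score(tests_passed: int, total_tests: int,
                 core_tests_passed: int, total_core_tests: int,
                 design_questions_ok: bool) -> int:
    if tests_passed == 0 and not design_questions_ok:
        return 0  # blank, missed deadline, or entirely different work
    if core_tests_passed < total_core_tests // 2:
        return 1  # something was submitted, but it misses the point
    if tests_passed >= total_tests - 3 and design_questions_ok:
        return 4  # near-perfect: a few small mistakes still earn full credit
    if core_tests_passed == total_core_tests:
        return 3  # core behavior works; some meaningful features incorrect
    return 2
```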
To avoid resubmissions being treated as the “real” deadline, I’ll have policies like:
This puts a strong incentive on submitting complete (if not correct) work at the initial deadline in order to get feedback and have the option for a full-credit resubmission available. At the same time, the overall grade thresholds are set to make it so a few 3 grades don’t ruin an overall course grade.
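For example (these counts are invented for illustration, not taken from a real syllabus of mine), the assignments category bar might tolerate a couple of 3s at the A level:

```python
# Invented example thresholds showing how a few 3s don't ruin a category level.
def assignment_category_level(scores: list[int]) -> str:
    """Map a list of 0-4 assignment scores to a hypothetical category level."""
    if all(s >= 3 for s in scores) and sum(s == 4 for s in scores) >= len(scores) - 2:
        return "A"   # up to two 3s, nothing below 3
    if all(s >= 2 for s in scores) and sum(s >= 3 for s in scores) >= len(scores) - 2:
        return "B"
    if sum(s >= 2 for s in scores) >= len(scores) // 2:
        return "C"
    return "F"

assert assignment_category_level([4, 4, 4, 4, 3, 3]) == "A"
```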
Typically for exams I allow regaining all credit on the retry exam. There are usually fewer exams than assignments, so missing one has more of an impact. In addition, exams interact with students’ schedules differently – they happen in a specific time slot, which a student can miss because of an illness, emergency, or other conflict that spans just a few hours. In contrast, assignments are usually available for a week, students can submit them early, and time management is a valuable skill. So incentivizing submitting assignments on time comes with a different set of tradeoffs than trying to incentivize the “be present at a specific time” nature of exams.
I’ve presented how I’ve been grading according to three principles that support one another in various ways: High Standards Across Categories of Work, Multiple Tries on Most Activities, and Coarse-Grained Rubrics.
Much of the grading strategy presented here is “nonstandard”, at least at my institution. Other courses tend to use some kind of “pile of points” grading with weighted averages across categories. This presents communication challenges: I need students to understand how their work translates to grades, while avoiding pitfalls like students perceiving the policy as draconian or unfair.
There are several explicit strategies that have come up.
I’ve been using syllabi designed from these principles for several years now. Some of these use variations on the principles above: scores may be 0-3 or 0-2 instead of 0-4, there are creative ways of applying resubmission credit to assignments, there are elements like participation in lab factored in, and so on. There’s actually a fairly rich design space here – in a lot of what I say above I pick specific examples and policies to describe things, but they are far from the only version I can imagine working.
All of the courses below have a grading/syllabus section that describes how the scoring works, some description of how retries work, and some publicly-available assignments:
Often courses have 5-10% of the score dedicated to things like participation, completing surveys, attending lecture, doing low-stakes weekly reading quizzes, and so on. Broadly I refer to these as “engagement”. Depending on the course, these might be one of the major categories (e.g. in Fall 2021 CSE11 there are achievement levels for participation for various grade levels). If not, typically what I do is say that the major components like exams and assignments will decide the A/B/C/F letter grade, and various measures of engagement will determine the +/- modifiers on the grade. So low engagement can’t make a grade drop from an A to a B, but it can drop an A to A-. I find this to be a useful minor incentive to engage without having it take over too much of the assessment of mastery.
In particular, engagement cannot increase a grade level or make the difference between passing and failing the course. That has to come from the more carefully designed secure assessments of specific outcomes.
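A sketch of that modifier logic, with invented engagement cutoffs (the 90%/50% boundaries and the “+”/“-” suffixes are just illustrative):

```python
# Hypothetical sketch: engagement only adjusts the +/- modifier within a letter,
# never the letter itself, and never turns an F into a passing grade.
def apply_engagement(letter: str, engagement_fraction: float) -> str:
    if letter == "F":
        return "F"  # engagement can't make the difference between passing and failing
    if engagement_fraction >= 0.9:
        return letter + "+"
    if engagement_fraction < 0.5:
        return letter + "-"
    return letter

assert apply_engagement("A", 0.4) == "A-"   # low engagement drops A to A-, not to B
```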
This has gone beyond an experimental phase for me; I plan to use variants of this grading strategy going forward until I find something deeply dissatisfying or notably better. It helps me balance objective and secure assessment with multiple-tries mastery. I have gotten positive or neutral feedback from students about it, and I like the set of motivations and incentives it sets up for them.
Feel free to reach out if you have questions or try some of these ideas and have feedback!