Single-case Intervention Designs in Educational Psychology
Last updated: Sep 16, 2025
Group designs have dominated educational/psychological intervention research. Aside from the general bias in these fields toward large-n research, there are several practical reasons group designs are especially useful in educational settings, not the least of which is that students are conveniently clustered in hierarchical units (classrooms, schools, districts, etc.).
However, group designs are often infeasible given the intensity of some interventions (e.g., intensive math interventions with multiple sessions per week) and the widely varying needs of the students receiving them, such as those with low-incidence disabilities. Single-case intervention designs (SCIDs), known elsewhere as single-case experimental designs, are a necessary method for addressing these prohibitive limitations of group designs (for example, by reducing the resources needed).
Schools have an ethical obligation to provide evidence-based services (National Association of School Psychologists, 2020). Group-based designs seek to create a foundation of scientific evidence to address the needs of an “average” student in a population and/or a subgroup. Even when group designs can be targeted to specific subgroups, design features allowing everyone to eventually receive treatment like waitlist control conditions may not address the unique needs of students with particularly intensive academic, behavioral, and mental health needs. SCIDs help address many of these logistical, ethical, and methodological limitations of group designs, helping establish the evidence base for more intensive and specific interventions.
In the last 25 years, research on evidence-based education practices in the United States has grown rapidly. This growth has been in large part the product of national legislation to improve student academic achievement and hold schools accountable (e.g., No Child Left Behind). Federal research funding mechanisms to conduct scientific research on educational practices also contributed to this growth.
Establishing a clearer scientific basis of educational practices required the enhancement of research design and statistical methods to evaluate program effectiveness within complex educational systems. This resulted in wide adoption of randomized trials for testing educational practices, trials that used individual- and group-level quantitative methods (e.g., multilevel modeling).
Around the same time, the primary federal special education legislation in the U.S. was reauthorized as the Individuals with Disabilities Education Improvement Act (IDEIA). This shifted several aspects of special education evaluation methods, specifically those for specific learning disability (SLD), toward a response-to-intervention (RTI) approach and away from more traditional IQ-achievement discrepancy models of SLD evaluation. Although RTI approaches had been developed and promoted within the special education and school psychology community before IDEIA, broader national initiatives to take this approach required the development of more tools to evaluate whether interventions were improving students’ achievement as intended.
In the RTI model, if a student does not respond to evidence-based and high-fidelity interventions (e.g., a reading intervention successfully implemented with all necessary evidence-based components), the general education setting may be unable to meet the student’s needs. More intensive, longer-term special education services may then be warranted. Determining “response” is based on psychometrically sound progress-monitoring of the target skills (e.g., oral reading fluency) and on the internal validity of the evaluation procedure. This data-based individualization of interventions necessitates a single-case design approach to address individual student needs, and ensuring the integrity of data-based decisions requires nuanced understanding of the factors that may impede internal validity of intervention implementation and data collection (Riley-Tillman et al., 2020).
RTI is most often associated with academic skills, but many schools also implement a similar model for behavior called positive behavioral interventions and supports (PBIS; Center on PBIS, 2025). SCIDs are also essential for evaluating the impact of intensive behavioral interventions in schools (Steege et al., 2019). Both RTI and PBIS fall under a broader umbrella of multi-tiered system of supports that emphasizes evidence-based practices across a continuum of general to intensive educational supports (Center on Multi-Tiered System of Supports, 2025).
Advances in group design educational intervention research methods have undoubtedly improved the state of education science. Study design and quantitative methods for large-scale, cluster-randomized intervention studies have been essential to effectively scale up efficacy trials into more realistic school settings. Simultaneously, SCIDs have undergone substantial development in terms of their design options and effect size quantification.
Yet despite the parallel and occasionally overlapping developments (Pustejovsky et al., 2014), SCIDs and group designs have typically operated in different spheres of causal inference. The traditional models of causal inference (e.g., the Rubin causal model) often focus more on between-person comparisons. SCIDs are an idiographic approach, and, just like group designs, many of them employ features such as randomization and replication (Kazdin, 2020). However, the effects of interest in an SCID typically pertain to individuals.
As a result, it may not be immediately clear that group designs and SCIDs are subject to many of the exact same assumptions of causal inference. Historically, different terminology across subfields and different design goals have also obscured this similarity. For example, SCID researchers/practitioners have often used the term “functional relation” to describe a systematic change in an outcome as a function of intervention implementation. Elsewhere, this quantity is treated as a specific population-level parameter in a statistical model and called the “average treatment effect.” In the end, these different terms may refer to the exact same thing: the average difference in some quantity across experimental conditions that is assumed to represent a causal relation.
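To make that correspondence concrete, here is a minimal sketch in potential-outcomes notation (a standard convention in the causal inference literature, not wording taken from the sources cited above):

```latex
% Potential-outcomes sketch of the average treatment effect (ATE).
% Y_i(1) is student i's outcome under the intervention, Y_i(0) under no
% intervention; only one of the two is ever observed for a given student.
\[
  \mathrm{ATE} = \mathbb{E}\left[\, Y_i(1) - Y_i(0) \,\right]
\]
% Both a "functional relation" (in SCID terms) and the ATE (in
% statistical-modeling terms) can refer to this average difference in
% outcomes across experimental conditions.
```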
A directed acyclic graph (DAG) is a tool for specifying causal relations using visual path-building procedures. In its simplest form, a DAG is just a directed arrow linking two variables, e.g., X → Y. A directed arrow indicates a causal relation.
In the DAG X → Y, if there is nothing else that also causes both X and Y, then X causes Y. But let’s say that Z causes X, so Z → X and X → Y. That still doesn’t interfere with the causal relation between X and Y. However, if X ← Z → Y, then the path X → Y would be confounded. This is a problem unless we adjust for Z in some way, either through the design of the study or via a statistical adjustment (as long as Z is really the only confounder).
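To make the adjustment idea concrete, here is a minimal simulation sketch in Python (my own illustration with arbitrary variable names and effect sizes, not an example from Hall et al., 2024):

```python
# Illustrative only: the DAG X <- Z -> Y plus X -> Y. The naive X-Y slope is
# confounded by Z, but adjusting for Z recovers the true causal effect of X.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

z = rng.normal(size=n)                       # confounder: Z -> X and Z -> Y
x = 0.8 * z + rng.normal(size=n)             # X depends on Z
y = 0.5 * x + 1.2 * z + rng.normal(size=n)   # true effect of X on Y is 0.5

# Naive estimate: regress Y on X alone (ignores Z) -> biased upward
naive_slope = np.polyfit(x, y, 1)[0]

# Adjusted estimate: regress Y on X and Z, which blocks the backdoor path
design = np.column_stack([np.ones(n), x, z])
adjusted_slope = np.linalg.lstsq(design, y, rcond=None)[0][1]

print(f"Naive slope (confounded): {naive_slope:.2f}")     # well above 0.5
print(f"Adjusted slope for X:     {adjusted_slope:.2f}")  # close to 0.5
```

In a real study, of course, we rarely know that Z is the only confounder, which is why design-based controls such as randomization are so valuable.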
Using an example from SCIDs, let’s say that three students receive a reading intervention (X) where the outcome (Y) is oral reading fluency (i.e., words read correct per minute) measured nine times over a fixed timespan. A common decision is when to start the intervention. For simplicity, suppose the intervention can be implemented at three different time points; it will be started after two, four, or six baseline measurements. This is called a “multiple-baseline design” because the intervention is introduced at a staggered point for each student, so each student has a different baseline length (AllPsych, 2021; Kazdin, 2020).
To best meet each student’s instructional needs, the instructor decides that one student has more intensive reading needs (i.e., is a “higher-need” student), and so will receive the intervention sooner (which may be a necessary decision in practice). This student will have two baseline-period measurements and seven intervention-period measurements. This “early intervention” plan is similar to a “high dose” or “active treatment” in biomedicine. The remaining two students have better pre-baseline reading abilities (i.e., are “lower need”), and so will have four (or six) baseline-period measurements and five (or three) intervention-period measurements; i.e., a “late intervention” plan akin to a “low dose” or “control treatment” in biomedicine.
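The staggered layout can be sketched as follows (purely illustrative; the student labels are hypothetical, and “A”/“B” denote baseline and intervention phases, a common SCID convention):

```python
# Sketch of the hypothetical multiple-baseline layout described above:
# nine measurement occasions per student, with the intervention starting
# after two, four, or six baseline measurements.
N_OCCASIONS = 9
baseline_lengths = {
    "Student 1 (higher need, early intervention)": 2,
    "Student 2 (lower need, late intervention)": 4,
    "Student 3 (lower need, late intervention)": 6,
}

for student, n_baseline in baseline_lengths.items():
    phases = ["A"] * n_baseline + ["B"] * (N_OCCASIONS - n_baseline)
    print(f"{student}: {' '.join(phases)}")

# Student 1 (higher need, early intervention): A A B B B B B B B
# Student 2 (lower need, late intervention): A A A A B B B B B
# Student 3 (lower need, late intervention): A A A A A A B B B
```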
The instructor will then compare the average change in reading fluency for the earlier-intervention student with the average change for the later-intervention students, and will conclude that the difference in these average changes is a representative estimate of the overall intervention effect.
But there is a problem. The instructor won’t observe how the higher-need student would have performed under a late-intervention plan. That is, they will not observe how such a student would have truly benefited from starting the intervention earlier versus later. The converse applies to the lower-need students. These “counterfactual” outcomes cannot be observed for any one student because each student only receives either an early or late intervention.
Does assigning students to early or late interventions based on their pre-baseline reading ability solve the problem? If the instructor almost always assigns higher-need students to early intervention, they will observe changes in reading fluency under the late-intervention plan for only the handful of higher-need students assigned to it. This will not give them a reliable idea of the treatment effect among higher-need students. Similarly, because lower-need students are rarely assigned to early intervention, they will not get a reliable idea of the treatment effect among lower-need students. In other words, the causal relation between X and Y is confounded by a student’s pre-baseline reading ability (X ← pre-baseline reading ability → Y). Hence, this design would lead to an inaccurate measure of the actual average impact of the intervention on all students’ oral reading fluency.
The way to reliably estimate the average treatment effect across both higher- and lower-need students is to randomize the interventions. Randomizing these three students to different intervention start points would experimentally control for this confounding.
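A small simulation can illustrate why. The sketch below is my own, with arbitrary assumed parameter values (a gain of 2 words per minute per intervention session and slower natural growth for higher-need students); it compares the early-versus-late contrast in average phase changes under need-based versus randomized assignment:

```python
# Hypothetical illustration (not an analysis from the article): how need-based
# assignment of intervention start points biases the early-vs-late contrast,
# and how randomizing start points removes that bias on average.
import numpy as np

rng = np.random.default_rng(0)
N_STUDENTS = 5_000      # many simulated students so the averages are stable
N_OCCASIONS = 9
TAU = 2.0               # assumed gain (words/min) per intervention session

def early_vs_late_contrast(assignment: str) -> float:
    """Estimated difference in average phase change, early minus late start."""
    higher_need = rng.random(N_STUDENTS) < 0.5
    # Assumption: natural (no-intervention) growth is slower for higher need.
    growth = np.where(higher_need, 0.5, 1.5)

    if assignment == "need_based":
        baseline_len = np.where(higher_need, 2, 6)  # higher need -> earlier start
    else:  # "randomized"
        baseline_len = rng.choice([2, 6], size=N_STUDENTS)

    t = np.arange(1, N_OCCASIONS + 1)
    changes = np.empty(N_STUDENTS)
    for i in range(N_STUDENTS):
        sessions = np.clip(t - baseline_len[i], 0, None)  # intervention "dose"
        y = 50 + growth[i] * t + TAU * sessions + rng.normal(0, 2, N_OCCASIONS)
        treated = t > baseline_len[i]
        changes[i] = y[treated].mean() - y[~treated].mean()

    early = baseline_len == 2
    return changes[early].mean() - changes[~early].mean()

# Under this data-generating model the true early-vs-late contrast is 2 * TAU = 4.
print(f"Need-based assignment: {early_vs_late_contrast('need_based'):.1f}")  # ~ -0.5
print(f"Randomized assignment: {early_vs_late_contrast('randomized'):.1f}")  # ~ 4.0
```

In the need-based version, the slower natural growth of the higher-need students masks the larger early-intervention dose and pulls the estimate below zero; randomizing the start points balances growth across conditions and recovers the contrast built into the simulation.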
Recently, my colleagues and I published a paper (Hall et al., 2024) describing the application of DAGs to single-case designs to help identify internal validity threats, even when a design includes randomization. Randomization is a clear solution to many internal validity problems, but how randomization accomplishes that (and its limitations) hadn’t yet been described in the context of DAGs. We hope this paper is a starting point for more DAG applications in SCID and small-n studies that can bridge causal inference across different kinds of designs and fields.
Causal inference in educational settings has both scientific implications (i.e., evidence-based practices) and legal implications (e.g., special education evaluations). DAGs are one method that can assist in explaining the key study design assumptions necessary to determine the impact of educational practices.
- AllPsych. (2021). Chapter 4.3: Multiple baselines. https://allpsych.com/research-methods/singlesubjectdesign/multiplebaselines/
- Center on PBIS. (2025). What is PBIS? https://www.pbis.org/pbis/what-is-pbis
- Center on Multi-Tiered System of Supports. (2025). Multi-level prevention system. Center on Multi-Tiered System of Supports at American Institutes for Research. https://mtss4success.org/essential-components/multi-level-prevention-system
- Hall, G. J., Putzeys, S., Kratochwill, T. R., & Levin, J. R. (2024). Discovering internal validity threats and operational concerns in single-case experimental designs using directed acyclic graphs. Educational Psychology Review, 36, 128. https://doi.org/10.1007/s10648-024-09962-2
- Kazdin, A. E. (2020). Single-case research designs. Oxford University Press.
- National Association of School Psychologists. (2020). The professional standards of the National Association of School Psychologists. Author.
- Pustejovsky, J. E., Hedges, L. V., & Shadish, W. R. (2014). Design-comparable effect sizes in multiple baseline designs: A general modeling framework. Journal of Educational and Behavioral Statistics, 39(5), 368-393. https://doi.org/10.3102/1076998614547577
- Riley-Tillman, C. T., Burns, M. K., & Kilgus, S. (2020). Evaluating educational interventions (2nd ed.). Guilford.
- Steege, M. W., Pratt, J. L., Wickard, G., Guare, R., & Watson, T. S. (2019). Conducting school-based functional behavioral assessments. Guilford.
Dr. Garret Hall is an assistant professor in the MS/EdS School Psychology program as well as the Combined Counseling and School Psychology PhD program. He graduated with his PhD in Educational Psychology (School Psychology Area) from the University of Wisconsin-Madison in 2020. He received his M.S. in Educational Psychology (School Psychology Area) from the University of Wisconsin-Madison in 2017, and he received a B.A. with majors in English and Psychology from Northern Illinois University in 2015.
Dr. Hall’s research focuses primarily on development of academic skills (especially mathematics). This includes examining predictors of academic development (such as language and executive functions) as well as assessment, prevention, and intervention methods for academic difficulties. He is also interested in quantitative methods in school psychology research and practice (such as causal inference, measurement, and longitudinal methods) and how these methods can inform school-based prevention and intervention within multi-tiered systems of supports (MTSS). Last, he is interested in how ecological factors within MTSS promote students’ school success, such as implementation fidelity and family-school involvement.