LLM-Assisted Elaboration Reduces Flashcard Re-Lapse Rates by 48%

A within-card paired analysis of 249,281 spaced repetition reviews

Gokul Srinivasan, Founder · Complement Research

Abstract. We analyzed 34,179 lapse events across 24,732 flashcards studied over 18 months on Complement, an AI-augmented spaced repetition platform. Using three analytical designs—raw comparison, stratified matching, and within-card paired analysis—we measured whether chatting with an LLM after forgetting a card reduces subsequent re-lapse probability. In the most conservative design (within-card paired, n = 496 treatment vs. 1,461 control across 472 cards), the re-lapse rate was 11.1% [95% CI: 8.5%, 14.1%] with chat vs. 21.2% [19.2%, 23.3%] without, a 48% relative reduction (OR = 0.46 [0.34, 0.63], p < 10⁻⁶). The effect exhibits a dose-response relationship, peaking at 3–5 message exchanges (5.1% re-lapse rate), and persists across 5 subsequent reviews.

1. Introduction

Spaced repetition systems schedule reviews at increasing intervals to maximize long-term retention. When a learner lapses—fails to recall a card—the scheduling algorithm resets the interval, and the cycle restarts. Lapses are costly: each one represents wasted prior study time and signals that the learner’s encoding may be insufficient.

Complement embeds an LLM chat interface directly into the study screen. When a learner encounters a card they’ve forgotten, they can immediately ask the AI to explain the underlying mechanism, provide context, or draw connections to related concepts. The hypothesis is that this interaction creates a richer memory trace—one grounded in understanding rather than rote pattern-matching—that makes the card more resistant to subsequent forgetting.

This is grounded in established cognitive science: elaborative interrogation improves retention over re-reading (Pressley et al., 1987; Dunlosky et al., 2013), generating explanations produces stronger memory traces than passive review (Bjork & Bjork, 2011), and encountering material in different formats creates additional retrieval pathways (Bower, 1972).

2. Data

All data comes from the author’s own account on Complement—the founder’s personal study log spanning August 2024 to March 2026. This is an N = 1 longitudinal study with a large number of within-subject observations.

| Metric | Value |
| --- | --- |
| Unique cards | 24,732 |
| Total review events | 249,281 |
| Lapse events (ease ≤ 1) | 34,179 |
| Chat sessions linked to cards | 3,989 |
| Lapses with chat within 5 min | 678 |
| Study period | Aug 2024 – Mar 2026 |
| Overall lapse rate | 13.7% |

A lapse is defined as any review with ease ≤ 1 on the 0–4 Anki rating scale. A treatment event is a lapse where an AI chat session occurred within 300 seconds of the review timestamp (median proximity: 37 seconds).
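The temporal-linkage rule can be sketched as follows. This is a simplified illustration with hypothetical names (`link_chats_to_lapses`, plain Unix-second timestamps); the actual pipeline additionally requires a card_association match, which is omitted here.

```python
import bisect

def link_chats_to_lapses(lapse_times, chat_times, window=300):
    """Mark each lapse as 'treated' if any chat session started within
    `window` seconds of the lapse timestamp (either side).
    Simplified sketch: the real pipeline also matches on card_association."""
    chat_times = sorted(chat_times)
    treated = []
    for t in lapse_times:
        # Find the first chat at or after (t - window); if it also falls
        # at or before (t + window), a chat occurred within the window.
        i = bisect.bisect_left(chat_times, t - window)
        treated.append(i < len(chat_times) and chat_times[i] <= t + window)
    return treated
```

Sorting once and bisecting per lapse keeps the linkage O((n + m) log m) rather than comparing every lapse against every chat.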

Disclosure. The author is the founder of Complement. All data analyzed is from the author’s personal study activity. While this creates a potential conflict of interest, the within-card paired design and statistical methods are designed to let the data speak independently of the analyst’s role. Readers should weigh this context when interpreting results.

3. Methods

3.1 Analytical designs

We employ three progressively more rigorous designs:

  • Raw comparison. All chat-assisted lapses (n = 678) vs. all unassisted lapses (n = 33,501). High-powered but vulnerable to selection bias.
  • Stratified matching. Lapse events stratified by previous ease (0–4) and card maturity bucket (reviews 1–3, 4–6, 7–10, 11+). Eliminates confounders correlated with pre-lapse trajectory and card maturity.
  • Within-card paired. Cards that lapsed both with and without chat at different times. The same card—same content, same inherent difficulty, same learner—serves as its own control. 472 cards qualify.
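The within-card paired design above can be sketched as follows. Function and record shapes are hypothetical; the point is the qualifying rule (a card must have lapsed in both arms) and the pooled per-arm rates.

```python
from collections import defaultdict

def within_card_rates(events):
    """events: iterable of (card_id, treated, relapsed) lapse events.
    Keep only cards that lapsed both with and without chat, so each card
    serves as its own control; return (treated_rate, control_rate).
    Simplified sketch of the paired design in Section 3.1."""
    by_card = defaultdict(list)
    for card_id, treated, relapsed in events:
        by_card[card_id].append((treated, relapsed))

    t_n = t_r = c_n = c_r = 0
    for card_events in by_card.values():
        arms = {treated for treated, _ in card_events}
        if arms != {True, False}:   # card must appear in both arms
            continue
        for treated, relapsed in card_events:
            if treated:
                t_n += 1; t_r += relapsed
            else:
                c_n += 1; c_r += relapsed
    if t_n == 0 or c_n == 0:
        return None                 # no qualifying paired cards
    return t_r / t_n, c_r / c_n
```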

3.2 Statistical methods

All confidence intervals are 95% bootstrap CIs (2,000 resamples, percentile method). Between-group comparisons use Fisher’s exact test. Effect sizes are reported as odds ratios with Woolf logit CIs and Cohen’s h for proportions.
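The three estimators named above can be sketched with the standard library alone. These are textbook formulas matching the stated methods (percentile bootstrap, Woolf logit CI, Cohen's h), not the project's actual analysis scripts.

```python
import math
import random

def bootstrap_ci(successes, n, reps=2000, seed=0):
    """95% percentile bootstrap CI for a proportion, resampling the
    underlying 0/1 outcomes with 2,000 resamples as in the paper."""
    rng = random.Random(seed)
    data = [1] * successes + [0] * (n - successes)
    stats = sorted(sum(rng.choices(data, k=n)) / n for _ in range(reps))
    return stats[int(0.025 * reps)], stats[int(0.975 * reps)]

def odds_ratio_woolf(a, b, c, d):
    """Odds ratio with 95% Woolf (logit) CI for a 2x2 table:
    a = treated relapses, b = treated non-relapses,
    c = control relapses, d = control non-relapses."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, lo, hi

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions (arcsine transform)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
```

As a sanity check, `cohens_h(0.081, 0.201)` reproduces the raw-comparison h of about −0.35 reported in Section 4.1.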

4. Results

4.1 Primary finding

Headline result (within-card paired): a 48% relative reduction in re-lapse rate. 11.1% [8.5, 14.1] with chat vs. 21.2% [19.2, 23.3] without; OR = 0.46 [0.34, 0.63], p = 2.4 × 10⁻⁷.

Figure 1. Re-lapse rate on the next review after a lapse event, for the raw comparison (n = 678 vs. 33,501) and the within-card paired design (472 cards as their own controls). Error bars show 95% bootstrap CIs. Left: all lapse events. Right: same cards serving as their own controls.
| Metric | Chat (n = 678) | No chat (n = 33,501) | Difference |
| --- | --- | --- | --- |
| Re-lapse rate | 8.1% [6.0, 10.2] | 20.1% [19.7, 20.6] | −12.0pp [−14.1, −9.8] |
| Good-or-better rate | 90.4% [88.2, 92.6] | 73.8% [73.3, 74.3] | +16.6pp |
| Mean next ease | 3.26 [3.19, 3.32] | 2.72 [2.71, 2.74] | +0.54 |
| Odds ratio | 0.35 [0.27, 0.46] (raw) · 0.46 [0.34, 0.63] (within-card) | | |
| Cohen's h | −0.35 (raw) · −0.28 (within-card) | | |

4.2 Stratified analysis

After stratifying by previous ease and maturity bucket, the chat effect holds in every stratum where sufficient data exists (n_chat ≥ 5). In 11 qualifying strata, the weighted average re-lapse rate is 3.2% (chat) vs. 6.6% (control)—a 51% relative reduction.

Individual strata lack statistical power (most p > 0.05), which is expected given small per-stratum chat samples. The consistency of direction across all strata is the relevant signal.
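A standard way to pool underpowered strata into one estimate is the Mantel-Haenszel odds ratio. The paper reports a weighted-average rate rather than naming its pooling estimator, so the sketch below is a complementary illustration, not the published method.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across strata.
    Each stratum is a 2x2 tuple (a, b, c, d):
    a = chat relapses, b = chat non-relapses,
    c = control relapses, d = control non-relapses.
    Illustrative; not necessarily the estimator used in the paper."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den
```

With a single stratum this reduces to the ordinary odds ratio (ad)/(bc); with many small strata it weights each by its sample size, which is exactly what the per-stratum forest plot lacks.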

Odds ratios across analytical designs (OR < 1 favors chat; 95% Woolf logit CIs):

| Stratum / design | OR [95% CI] |
| --- | --- |
| Consecutive lapse, early | 0.66 [0.28, 1.51] |
| Post-Hard, mid | 0.35 [0.02, 5.84] |
| Post-Hard, mature | 0.35 [0.02, 5.84] |
| Post-Hard, very mature | 0.45 [0.06, 3.27] |
| Post-Good, early | 0.51 [0.07, 3.87] |
| Post-Good, mid | 0.15 [0.01, 2.47] |
| Post-Good, mature | 0.74 [0.10, 5.58] |
| Post-Good, very mature | 0.11 [0.01, 1.80] |
| Raw (all events) | 0.35 [0.27, 0.46] |
| Within-card paired | 0.46 [0.34, 0.63] |

Figure 5. Forest plot of odds ratios across stratified and aggregate analyses. Circles show per-stratum estimates; diamonds show aggregate estimates. All strata show OR < 1 (favoring chat), though individual strata are underpowered. The aggregate estimates are statistically significant.

4.3 Dose-response

Bucketing chat-assisted lapses by the number of user messages reveals a dose-response relationship:

Figure 2. Re-lapse rate by chat engagement depth (number of user messages in the chat session). Bars show point estimates; whiskers show 95% bootstrap CIs. The dashed line marks the baseline (no-chat) rate. The optimal engagement is 3–5 messages (a 75% reduction from baseline).

The optimal engagement is a 3–5 message exchange—enough to ask a question, receive an explanation, and confirm understanding. The slight regression at 6+ messages likely reflects unusually confusing cards requiring extended discussion (note the wide CI).
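The bucketing behind this analysis can be sketched as follows. The bucket edges (1–2, 3–5, 6+) follow Figure 2; the function name and record shape are hypothetical.

```python
def dose_response(sessions):
    """sessions: list of (n_user_messages, relapsed) pairs for
    chat-assisted lapses. Bucket by engagement depth and return the
    re-lapse rate per bucket (None where a bucket is empty).
    Bucket edges follow Figure 2; a sketch, not the production code."""
    buckets = {"1-2": [], "3-5": [], "6+": []}
    for n_messages, relapsed in sessions:
        key = "1-2" if n_messages <= 2 else "3-5" if n_messages <= 5 else "6+"
        buckets[key].append(relapsed)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```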

4.4 Longitudinal trajectory

The chat effect persists beyond the immediate next review. Tracking lapse rates across 5 subsequent reviews:

Figure 3. Lapse rate trajectory (percentage of reviews resulting in a lapse) across 5 subsequent reviews after the treatment event. The chat effect is strongest at R+1 (−12pp), nearly disappears at R+2 (same-day re-review), then re-emerges and persists through R+3 to R+5. Shaded regions show 95% CIs.

Figure 4. FSRS stability trajectory (memory half-life in days, FSRS model estimate). Chat-assisted cards start with 10× lower stability (4.3 vs. 43.4 days) but converge to match non-chatted cards by R+5 (31.7 vs. 30.1 days). Note: this comparison is confounded by baseline differences; the stratified and within-card analyses provide properly controlled estimates.

4.5 Subject area

| Deck | Chat n | Chat re-lapse [95% CI] | Control re-lapse | Reduction |
| --- | --- | --- | --- | --- |
| Step 1 (AnKing) | 631 | 7.8% [5.7, 9.8] | 20.2% | −61% |
| Step 2 | 47 | 12.8% [4.3, 23.4] | 18.5% | −31% |

The stronger effect in Step 1 material (preclinical sciences) may reflect that these subjects have more mechanistic content where an AI explanation adds particular value.

5. Limitations

  • N = 1, founder’s data. All data comes from the founder’s personal account. This introduces both a generalizability limitation and a potential conflict of interest. Internal validity is strong (249,281 review events, 472 paired cards), but external validity requires replication across independent users.
  • Selection bias. The learner may selectively chat about cards where they feel “close” to remembering. The within-card paired design controls for card-level confounders but cannot fully eliminate within-card temporal selection effects.
  • Observational design. This is not a randomized trial. The gold standard would be randomly assigning chats to some lapses and withholding them from others.
  • Stratified tests are underpowered. Most individual strata have p > 0.05 due to small per-stratum chat samples. The aggregate pattern (consistent direction across all strata) is the relevant evidence.
  • Chat content not analyzed. Not all chats may be substantive. The dose-response relationship provides indirect evidence that quality of engagement matters.

6. Discussion

Across three designs, the re-lapse reduction ranges from 48% to 60%. The most plausible mechanism is that the AI chat helps the learner transition from surface-level memorization to mechanistic understanding. A card previously stored as an isolated fact becomes grounded in a causal model, creating more retrieval pathways and greater resistance to interference.

The magnitude is large by educational intervention standards. For comparison, the testing effect (retrieval practice vs. re-reading) typically produces 20–40% improvement, and elaborative interrogation 15–30%. The AI chat effect may exceed these because it combines elaboration, retrieval, and encoding variability simultaneously at the optimal moment—the point of failure.

The dose-response finding has a direct product implication: the AI should aim for focused, 3–5 turn conversations rather than single-shot explanations or open-ended discussions.

7. Conclusion

Chatting with an LLM after forgetting a flashcard reduces the probability of forgetting it again by approximately 48% in a within-card paired design (OR = 0.46 [0.34, 0.63], p = 2.4 × 10⁻⁷). The effect exhibits a dose-response relationship and persists across 5 subsequent reviews. These findings, while limited to a single user, represent one of the first empirical analyses of LLM-augmented spaced repetition at scale.

Methodology notes

  • All CIs are 95% bootstrap (2,000 resamples, percentile method)
  • Between-group tests: Fisher’s exact (two-sided)
  • Effect sizes: Odds ratios with Woolf logit CIs; Cohen’s h for proportions
  • Lapse definition: ease ≤ 1 on 0–4 scale
  • Chat linkage: card_association match + temporal proximity < 300s
  • Data: Founder’s personal study data, 24,732 cards, 249,281 reviews, 18 months
  • Raw data and analysis scripts available in the project repository