
I kept rewriting the same quiz + spaced-repetition code. So I packaged it into an API

Show IH: While building my own medical edtech product, I wrote the same "text → quizzes + flashcards + review schedule" plumbing at least three times. Every new feature, same chore: call an LLM, parse the output, schedule the next review, and avoid paying twice for the same input.
So I cleaned it up into one small API — QuizForge:

  • POST /generate: text, a URL, or a PDF → MCQs, short-answer questions, and flashcards, in the language you ask for.
  • POST /grade: score a free-text answer 0–5, with feedback.
  • POST /review/next: SM-2 spaced repetition; returns the next due date for a card.

Two things I cared about:
  1. Don't pay twice. Identical input is served from a content-hash cache — no second LLM call.
  2. Make it testable. The LLM is injected, so the whole thing runs offline — 29 tests, no network.
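
A minimal sketch of those two ideas together, not QuizForge's actual code. Every name here (`Generator`, `FakeLLM`, `content_key`, the prompt string, the JSON shape) is an assumption made for illustration. It shows a SHA-256 content key deciding whether the model is called at all, and an injected callable standing in for the LLM so the check runs offline.

```python
import hashlib
import json
from typing import Callable, Dict

# Hypothetical names throughout: the post does not show QuizForge's internals.
LLM = Callable[[str], str]  # takes a prompt, returns the model's raw JSON text


def content_key(text: str, language: str) -> str:
    """One-way SHA-256 key over the inputs that affect the output."""
    payload = json.dumps({"text": text, "language": language}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Generator:
    def __init__(self, llm: LLM) -> None:
        self._llm = llm                     # injected, so tests can pass a fake
        self._cache: Dict[str, dict] = {}   # in-memory: key -> derived output only

    def generate(self, text: str, language: str = "en") -> dict:
        key = content_key(text, language)
        if key in self._cache:              # identical input: no second LLM call
            return self._cache[key]
        raw = self._llm(f"Write quiz items in {language} for:\n{text}")
        result = json.loads(raw)
        self._cache[key] = result
        return result


class FakeLLM:
    """Offline stand-in that counts calls and returns canned JSON."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return json.dumps({"flashcards": [{"front": "Q", "back": "A"}]})
```

In a test, passing `FakeLLM()` to `Generator` and calling `generate` twice with the same text should leave `calls` at 1. Changing the text or the language produces a different key and a fresh call.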

It's live as a hosted API with a free tier to try it, and there's a self-host version (Docker, bring your own LLM key) for people who'd rather run it themselves.
Full disclosure: I build with AI assistance (I pair with Claude), and I'm a solo founder, not a team: a maxillofacial surgeon who got into shipping software. Support is email-only, but I answer everything.
Happy to dig into the design, and I'd genuinely like to hear how you'd price the hosted version.
Hosted (free tier): https://rapidapi.com/limack0/api/quizforge
Self-host (source + Docker): https://limack.gumroad.com/l/quizforge
Posted to Show IH on June 2, 2026
  1. 1

    The abstain option is probably the most honest solution here. A low confidence flag is more defensible than a 3/5 that could mean anything. The grounding route has a ceiling too: if the source passage itself is thin, pulling context back to it just moves the vagueness upstream. I'd try the abstain logic first and have the grader return a flag when the reference quality falls below a threshold. The question is whether your users accept 'not enough context to grade this accurately' as a result or if that breaks the product experience for them.

    1. 1

      Yeah, you talked me into both — pushed them live yesterday.
      /grade takes an optional source passage now. If it's there, the grader treats it as the real reference and the expected answer is just a hint. And if the passage is too thin or off topic to actually decide anything, it returns null + an "abstained" flag instead of making up a number. Old calls that only send an expected answer still work the same.
      But your ceiling point stuck with me, because you're right — grounding doesn't fix a thin source, it just moves the vagueness somewhere else. That's basically why I didn't treat grounding as the answer and leaned on the abstain flag instead. When there's genuinely not enough to go on, the honest thing is to say so, not guess a 3.
      On whether "not enough context to grade this" breaks the experience — I kind of dodged that on purpose. The API just hands back the flag and a null score, it doesn't show anything to a learner. So whoever builds on top decides what to do: re-ask, send it to a human, skip it, whatever. A flashcard app and a real exam want totally different things there and I didn't want to pick for them. Honestly I don't have enough real grading traffic yet to know which way people lean — that's the thing I'm waiting to see.
      You've clearly built one of these before though — when you added an abstain path, did you actually show it to the user, or just quietly not count the question?
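
For anyone building on top of this, here is a rough caller-side sketch of handling the abstain path. The field names (`score`, `abstained`, `feedback`) and the action labels are assumptions; the reply above only says the API returns null plus an "abstained" flag. The score ≥ 3 cutoff for "correct" is the one mentioned elsewhere in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradeResult:
    # Assumed response shape, not a documented schema.
    score: Optional[int]   # 0-5, or None when the grader abstains
    abstained: bool
    feedback: str = ""


def parse_grade(payload: dict) -> GradeResult:
    """Build a GradeResult from a decoded JSON body; a null score counts as abstention."""
    score = payload.get("score")
    abstained = bool(payload.get("abstained", False)) or score is None
    return GradeResult(
        score=None if abstained else int(score),
        abstained=abstained,
        feedback=payload.get("feedback", ""),
    )


def next_action(result: GradeResult, high_stakes: bool) -> str:
    """The calling product, not the API, decides what an abstention means."""
    if result.abstained:
        # Example policy only: an exam routes to a human, a flashcard app re-asks.
        return "human_review" if high_stakes else "re_ask"
    return "mark_correct" if result.score >= 3 else "mark_incorrect"
```

The point of the split is that `next_action` lives in the caller. A flashcard app and an exam can share `parse_grade` and differ only in the policy they apply to an abstention.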

      1. 1

        Shipping both the same day is the move. The null + abstained design is clean — it puts the UX decision where it belongs, in the layer that actually knows the learner. You're right that grounding just moves the vagueness, it doesn't resolve it. Curious what you're seeing so far — are the abstain flags clustering around any particular type of question?

  2. 1

    The spaced repetition scheduling algorithm is deceptively simple to describe but gets annoying to maintain when you add edge cases: what happens when someone misses a session, how do you handle items that are 'almost due', how does the interval scale for items the learner keeps getting wrong? Packaging this as an API makes sense because those edge cases accumulate fast and most teams just want the 'when should this user see this item next' answer without owning the algorithm. Curious how you handle the multi-device case where a user reviews on mobile and desktop. Does the scheduling stay consistent across sessions, or does each device have its own state?

    1. 1

      You've named the exact reason /review/next is stateless. QuizForge doesn't store the card — you POST the current state (repetitions, ease factor, interval) plus how the answer went, and it returns the next state and due date. It's a pure function, not a session. Which answers your multi-device question by sidestepping it: there's no per-device state to drift, because the API holds none at all. Consistency is whatever your backend record says. If mobile and desktop read and write the same card row, scheduling is identical by construction; if each device keeps its own local copy, they diverge — but that's an architecture choice on the caller's side, not something the algorithm decides.
      On your edge cases: one is in the algorithm, the other two deliberately aren't, and I'd rather be straight about which:
      - Keeps getting it wrong (quality < 3): repetitions reset to 0, interval drops back to 1 day, and the ease factor decays on the SM-2 curve with a hard floor at 1.3. A stubborn item collapses to daily and stays aggressive until it sticks.
      - Missed session / "almost due": not modeled, on purpose. The scheduler computes the next due date forward from the day you actually review; it doesn't penalize lateness or decay an overdue item. The "is this due / almost due / overdue" bookkeeping lives in the caller. I didn't want the API guessing a learner's calendar.
      The honest tradeoff: stateless means I don't solve multi-device for you; I just make sure I'm never the thing that breaks it. Did you end up wanting the scheduler to punish overdue items, or did leaving lateness out keep it more predictable?
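
The stateless contract described in this reply can be sketched as a pure function. This is a textbook-style SM-2 step written under assumptions, not QuizForge's source. The names `CardState` and `review_next`, the 2.5 starting ease, and the 1-day and 6-day first intervals come from the commonly published SM-2 description. The reset on quality < 3, the ease decay on failures, the 1.3 floor, and scheduling forward from the actual review day come from the reply.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Tuple

MIN_EASE = 1.3  # hard floor on the ease factor, as stated in the reply


@dataclass(frozen=True)
class CardState:
    repetitions: int = 0
    ease_factor: float = 2.5   # conventional SM-2 starting value (assumption)
    interval_days: int = 0


def review_next(state: CardState, quality: int, reviewed_on: date) -> Tuple[CardState, date]:
    """Pure function: (current state, answer quality 0-5, review day) -> (next state, due date)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")

    # SM-2 ease update, applied on every review and clamped at the floor.
    delta = 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)
    ease = max(MIN_EASE, state.ease_factor + delta)

    if quality < 3:
        # Failed recall: start over and see the card again tomorrow.
        repetitions, interval = 0, 1
    else:
        repetitions = state.repetitions + 1
        if repetitions == 1:
            interval = 1
        elif repetitions == 2:
            interval = 6
        else:
            interval = round(state.interval_days * ease)

    # The due date runs forward from the actual review day; lateness is not modeled.
    return CardState(repetitions, ease, interval), reviewed_on + timedelta(days=interval)
```

Because nothing is stored between calls, two devices that read and write the same `CardState` row get the same schedule. Whether the new interval uses the ease factor before or after the update varies between SM-2 implementations; this sketch uses the updated value.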

  3. 1

    I built an internal study-tool API last month, and the content-hash cache jumped out at me. "Don't pay twice" is the exact pain dev buyers understand without a demo.

    Also, the hosted plus self-host split is smart. A lot of teams will compare RapidAPI, a DIY Supabase edge setup, or just rolling their own, but the privacy/control buyer usually wants the boring answer on data retention and deletion before they plug real student text into anything. I tried Termly, then iubenda, then PrivacyForge when our policy docs kept drifting from the product.

    Feels like your first paying users are developers building course tools, not teachers shopping around.

    1. 1

      The cache landing first is the right read: "don't pay twice for the same chapter" is the line that needs no demo. You've clearly felt it.
      On the boring-but-important part, here's the honest data story, because privacy buyers deserve the unsexy version:

      • QuizForge doesn't persist raw student text. On /generate the text is resolved, sent to the model, and dropped. The cache stores only the derived output — quiz and flashcards — keyed by a one-way SHA-256 hash of the input. You can't reverse the key back into the source.
      • That cache is in-memory by default, so it's wiped on every restart. Redis is optional and, today, has no TTL. That's a gap I'd close before anyone leans on it hard.
      • /grade and /review/next store nothing at all.
      • The one boundary I won't hand-wave: on the hosted rail the text does transit to the LLM provider to be processed. So if "real student text" is the bar, the honest answer is the self-host kit — run it on your own infra against your own model key, and the text never leaves your perimeter except to the model you chose. That's exactly why the split exists.
      • And the gap, said plainly: there's no per-record delete endpoint yet. Deletion today means restarting the instance or flushing the store. If the privacy/control buyer is real, that's the next thing I'd build.
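
As a rough illustration of what closing those two gaps could look like, here is a sketch of an in-memory cache with a TTL and a delete-by-key operation. It is not QuizForge code and none of it is confirmed by the post beyond the SHA-256 key and the "derived output only" rule; `TTLCache`, `input_key`, and the injectable clock are hypothetical. With Redis, the expiry would more likely be set on the key itself than tracked by hand like this.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


def input_key(text: str) -> str:
    """One-way SHA-256 digest; the source text cannot be read back from it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class TTLCache:
    """Stores only derived output (quiz, flashcards), never the raw input text."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def put(self, text: str, derived: dict) -> str:
        key = input_key(text)
        self._entries[key] = (self._clock() + self._ttl, derived)
        return key            # the caller can keep this to request deletion later

    def get(self, text: str) -> Optional[dict]:
        key = input_key(text)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, derived = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # lazily evict expired entries on read
            return None
        return derived

    def delete(self, key: str) -> bool:
        """Per-record delete: True if something was removed."""
        return self._entries.pop(key, None) is not None
```

Returning the key from `put` is one way to make per-record deletion possible without the service ever holding the source text. The caller stores the digest next to its own record and sends it back when that record has to go.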
  4. 1

    Turning your own repeated pain into an API is usually the right move — you're not guessing at what developers need because you were the developer.

    The content-hash cache for identical inputs is the detail that actually matters at scale. Without it, edtech products that let users paste the same textbook chapter get expensive fast.

    The grading endpoint is interesting. 0-5 with feedback is a good choice — open-ended grading is the part that most quiz APIs skip or just offload entirely to the caller. How are you handling the edge case where a technically correct answer uses different terminology than the source material? That's the one that trips up most LLM graders.

    1. 1

      Exactly — that terminology-vs-meaning gap is the case I worried about most. A learner who's right but phrases it differently is the worst person to mark wrong.
      The way it's set up: the grader doesn't match strings. The prompt explicitly tells the model to judge on meaning rather than wording, and to score 0–5 instead of pass/fail — so "right idea, off vocabulary" can land at a 3 or 4 instead of a hard zero (the correct flag flips at score ≥ 3). That soft band is doing most of the work on exactly your edge case.
      The honest limit: it grades the answer against the expected answer for that question, not against the full source passage. So if the expected answer is thin, the model can still under-credit a valid alternative phrasing; the robustness is only as good as that reference answer. I haven't grounded grading back to the source text yet; that's the next thing if graders keep tripping on it.
      You clearly have scar tissue here. How did you end up handling it? Better reference answers, or did you go to retrieval against the source?

      1. 1

        The 0-5 soft band is the right call. Binary pass/fail punishes edge cases too hard, and in any learning context that translates directly into drop-off — learner fails on wording, loses confidence, stops engaging. The next hardest thing after vocabulary gap is probably confidence calibration: how the model handles an ambiguous or thin reference answer. That's where partial scores can quietly over-credit. Have you run into cases where the expected answer itself was the weak link?

        1. 1

          That over-credit framing is sharper than the one I gave you. I was worried about marking a right answer wrong; you're pointing at the quieter failure: marking a vague answer right because the reference gave the model nothing firm to push against. That's the worse bug, because nobody complains about it.
          And yes, the expected answer has been the weak link, structurally. In QuizForge the reference answer isn't hand-written: /generate produces the short-answer questions and their answers with the same model. So at grade time you've got a model scoring a learner against a reference another model wrote. When the source passage is thin or abstract, that generated reference comes out vague, and a vague reference next to a vague-but-plausible learner answer is exactly where the 0–5 band drifts up to a soft 3 it hasn't earned. The grader has nothing to anchor on, so it splits the difference.
          I haven't fixed it, and I don't think "better prompts" fixes it. The two directions I'd actually trust: ground both the reference and the grade on the source passage instead of the standalone Q/A pair, or let the grader abstain, returning low confidence when the reference is too thin to defend a score rather than inventing a partial one.
          When you hit this, did you go the grounding route, or did you make the grader admit it didn't know?

  5. 1

    The first buyer segment question is worth thinking through carefully. The self-host option tells you something: the people most likely to pay early are developers already building an edtech product who have hit this exact loop themselves, not people evaluating quiz tools generically. They come in with context, they understand the build cost you saved them, and they can justify $15/mo instantly because they know what one afternoon of plumbing this yourself costs.

    I would find one or two of those people specifically, give them free hosted access for a month, and ask them to put a number on what it saved them. That number becomes your pricing anchor for everything above the starter tier.

    1. 1

      The self-host option as a signal for who to target first is something I learned with Genie 007. People who had already tried to solve the problem manually were the fastest to convert and the most useful for pricing. They could actually quantify it. "This saves me 30 minutes a day" is a completely different sales conversation than "this might save me time." The free month to anchor on their own number is the right sequence. Makes the pricing conversation about their data, not your estimate.

    2. 1

      That first buyer segment framing is exactly right — and it reframes something I had been treating as a distribution problem.
      The people most likely to pay early aren't browsing for quiz tools. They're developers who already built the loop once, felt the cost, and are doing it again. They arrive with context and a number in their head. That's a completely different conversation than convincing someone the problem exists.
      On the free hosted month in exchange for a number: I'm going to try this. I have one person already in a positioning conversation (another IH commenter who offered a pricing breakdown). He's the right profile — he understood the build immediately, not generically. That's where I'll start.
      One honest caveat: the "what did this save you" number is only useful if the person actually builds something during that month, not just pokes at the API for ten minutes. So I'm going to be selective about the first one or two rather than offering it broadly.
      Thanks for the anchor framing — that's a concrete output I can actually use in copy once it's a real number.

      1. 1

        Makes sense. 1 developer who actually ships something with it during that month is worth 10 who just poke around. Good luck with the first 1!

  6. 1

    This is more useful than a generic quiz generator because you packaged the painful backend loop: generate, grade, schedule review, cache identical input, and make it testable.

    For pricing, I would avoid pricing it only like an API wrapper around LLM calls. The value is saved build time plus saved repeated LLM spend.

    I’d probably test three hosted tiers: a small free tier for developers, a starter tier for indie edtech builders, and a higher tier based on generated items or active learners. The self-hosted version can stay as the “control/privacy” option for people who want to bring their own key.

    The positioning should probably be less “text to quizzes API” and more “assessment and spaced-repetition infrastructure for edtech builders.” That makes the hosted pricing easier to justify.

    If useful, I can put together a short written pricing/positioning breakdown for QuizForge: tier structure, who each tier is for, hosted vs self-host framing, and the first buyer segment to test.

    1. 1

      This is one of the most useful replies I've gotten. Thank you for taking the time on it. You actually described the build better than my own post did: the value isn't the LLM call, it's the loop around it (generate → grade → schedule review → cache identical input → keep it testable), plus the build time and repeated LLM spend it saves. Funny timing on the pricing, because that's roughly the structure I already landed on, which makes me think the instinct is right:

      • free dev tier (30 calls/mo) to try it
      • a starter tier for indie edtech builders ($15/mo)
      • higher hosted tiers above that
      • self-host kit (source + Docker, bring-your-own-key) as the control/privacy option

      The piece I haven't done is your last point: metering the top tier on generated items / active learners instead of raw requests. Right now it's request-based hard-limit tiers (no overage, so no surprise bills), but a learner/item-based unit is a much better story for an edtech buyer and I hadn't framed it that way. Same with the positioning: "assessment + spaced-repetition infrastructure for edtech builders" is a far sharper line than "text → quizzes API." Taking both.
      And yes, I'd genuinely take you up on the written breakdown. Tier structure + who each tier is for + the first buyer segment to test is exactly the part I'm least sure about. I can share my real numbers (actual per-call LLM cost, etc.) so it's grounded rather than hypothetical.
      1. 1

        Yes, real numbers would make this much stronger.

        Send me your email and I’ll continue privately. I can keep the breakdown focused on hosted pricing, self-host framing, and the first buyer segment to test.

        1. 1

          Thanks — yes, let's do it.
          Happy to share real numbers: actual per-call LLM cost, current tier limits, what I'm seeing on the self-host vs hosted split so far.
          My email: [email protected] — feel free to send the breakdown there and I'll reply with the data so you have something concrete to work with.

          1. 1

            Just sent you a note.

            Kept it focused on hosted pricing, self-host framing, buyer segment, and the pricing unit question we discussed.
