A Position Paper on Triadic Human–Human–AI Data for Long-Horizon SWE Agents
Yelin Kim (lynnyelinkim@gmail.com) – April 30, 2026
Abstract
Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer deliverables with ambiguous specifications. This paper takes a position on what training data is needed to close the gap. The substrate for the next generation of SWE agents is not larger GitHub scrapes, not more solo-agent trajectories, and not open human–AI dialogue logs on their own, valuable as those are. It is triadic data: synchronized capture of the human–human conversations where engineering context is formed, the human–AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both. We argue that the canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies, i.e., instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure. We further argue that this data is capturable in 12–18 months with methods already mature in adjacent fields, that it is the empirical key to four open questions in agent training, and that the field’s near-term research agenda should include it explicitly.
Keywords: software engineering agents, long-horizon reasoning, multimodal data, post-training, simulated environments, RLHF, expert annotation, frontier model training, agent evaluation
1. The Position
The state of frontier software engineering agents in mid-2026 contains a contradiction that the data agenda has not yet caught up to.
On one side, short-horizon benchmarks have saturated. SWE-bench Verified moved from 13.8% (Devin, March 2024) to 82.0% (Claude Sonnet 4.5, September 2025). Terminal-Bench 2.0 top scores are at 82.0% (Codex with GPT-5.5, April 2026). Aider’s polyglot leaderboard places GPT-5 at 88.0% across six languages. Whatever signal these benchmarks were carrying, frontier models have absorbed it.
On the other side, long-horizon benchmarks are not saturating. METR’s Time Horizon 1.1 measurement program places frontier models at ~100% success on tasks taking humans under four minutes and under 10% on tasks taking over four hours. SWE-EVO, designed around 48 long-horizon repository-evolution tasks, reports the best frontier model at 25.0% — against 72.8% on standard SWE-bench Verified for the same model class. SWE-Lancer, evaluating frontier models on real freelance work, reports models “unable to solve the majority of tasks.” The gap between short and long horizons is not closing. It is being revealed.
The dominant explanation in the literature is that this is a context-handling gap, not a reasoning gap. Anthropic’s harness-design literature — “Effective harnesses for long-running agents” (October 2025), “Managed Agents” (February 2026) — centers context engineering as the runtime lever, with proposed fixes (context resets, sub-agent decomposition, getEvents() virtualization) acting as workarounds for the absence of a principled training signal. Mercor and Cognition’s APEX-SWE benchmark identifies “epistemic discipline” — the ability to distinguish what was assumed from what was verified — as the dominant skill gap, with frontier models clustering tightly in the 30–40% pass@1 range on Integration and Observability tasks. METR’s failure-mode analysis shows that reasoning quality on subproblems extracted from long-horizon failures matches the corresponding short-horizon benchmarks; what degrades is state handling across time, not reasoning per token.
If reasoning is solved on short horizons and the bottleneck on long horizons is context, then the next generation of SWE agents will be built on training data that captures how senior engineers handle context. That data does not exist at scale in any public corpus. It is not in GitHub commits, which capture artifacts and not deliberation. It is not in agent traces, which capture what agents did with context but not how the context was formed. It is not in human–AI dialogue logs alone — though those are valuable and underexploited — because the substantive engineering deliberation usually happens before the developer sits down with the AI.
The data lives in human–human conversations: design reviews, on-call hand-offs, pair programming, architecture debates, postmortems, the ambient triage that happens in Slack and over whiteboards and across tickets. The data also lives in the cross-functional friction between engineers, PMs, designers, and data scientists working through ambiguous deliverables over weeks. And it lives in the moments where these conversations meet AI tools — where context formed in human deliberation is partially surfaced in a prompt, partially missed, partially recovered through correction.
We call this triadic data: human–human–AI rather than human–AI. We argue that capturing it at scale, primarily through simulated cross-functional companies and instrumented long-horizon expert sessions, is the most leveraged training-data investment available to frontier SWE labs over the next 12–18 months.
The remainder of this paper develops the position in six movements. Section 2 establishes why existing data approaches are insufficient, including the recent and important call by Wang et al. (2025) for open human–AI dialogue data, which we endorse as necessary but argue is insufficient. Section 3 develops the triadic frame and its three configurations. Section 4 specifies the two complementary data products and their methodologies. Section 5 outlines four research directions the substrate enables. Section 6 names what this approach deprioritizes and what would invalidate the position. Section 7 sketches future directions, including the perception-native and visual-SWE territory where triadic methodology has natural extensions.
2. Why Existing Data Approaches Are Insufficient
The training-data agenda for SWE agents has consolidated around three approaches over the past three years. Each has produced real gains. None is sufficient for the long-horizon regime.
2.1 GitHub-scale scraping has hit diminishing returns
Public code corpora capture artifacts, not deliberation. SWE-bench+ audited successful patches across the standard SWE-bench resolution set and found 32.67% had solution leakage in the issue text and 31.08% had tests too weak to verify correctness; filtered resolution rates dropped from 12.47% to 3.97% — a threefold inflation. Independent quality analysis from GitClear shows code churn doubled in the post-AI period, consistent with models trained on pattern-rich tutorial code learning to produce fluent but architecturally fragile output. The signal is not absent, but the marginal trajectory is unfavorable.
2.2 Solo-agent self-trajectories suffer cold-start and bias problems
Reinforcement learning from agent-generated trajectories in environments with verifiable rewards has driven recent gains, but the mechanism amplifies what models already do rather than introducing what experts do. Yue et al. (2025), in a NeurIPS oral, argue that reinforcement learning with verifiable rewards focuses existing capabilities rather than creating new ones. Synthetic data shows similarly diminishing returns. Without an exogenous expert signal, the regime risks the long-horizon equivalent of model collapse: agents that are increasingly fluent in the failure modes of their predecessors.
2.3 Dyadic human–AI dialogue is necessary but insufficient
Wang et al. (2025), in their position paper “Humans are Missing from AI Coding Agent Research,” correctly observe that “open conversation data between humans and AI coding systems remains starkly lacking.” We endorse this observation and the call to collect such data. But the dyadic frame — what the developer typed, what the agent generated, what was accepted or rejected — captures only what happens after the developer sits down with the agent. By the time the developer is typing into a coding system, the substantive engineering deliberation has already happened, mostly with other engineers, mostly off-record. The dyadic frame sees the prompt; it does not see the meeting that produced the prompt.
2.4 The capability gaps converge on the same diagnosis
Synthesizing the recent gap analyses — long-horizon planning (METR, SWE-EVO), multi-file reasoning under realistic conditions (SWE-bench+), root-cause debugging (Stack Overflow 2024, Cognition’s Devin reports), DevOps and infrastructure (the 29-point gap between SWE-bench Verified and Terminal-Bench), ambiguous requirements (Ambig-SWE’s 74% improvement when models can ask clarifying questions), architectural reasoning (Sourcegraph’s “technically functional but architecturally catastrophic”), security (Perry et al.’s confidence-competence inversion), and visual SWE (12% on SWE-bench Multimodal) — yields a striking pattern. The gaps differ in surface form but share a common substrate: each is a domain where the operative knowledge lives in human–human deliberation rather than in code artifacts. Architectural decisions live in design reviews. Root-cause analysis lives in pair debugging and postmortems. Ambiguous requirements live in negotiation between PMs, designers, and engineers. The diagnosis converges: the missing data is the conversation that produces the artifact, not the artifact itself.
2.5 A concrete instance
The pattern is easier to see through a representative case. Consider an infrastructure engineer asked to configure Apache Airflow on a new compute region. The codebase spans multiple regions with different node sizes. Some configuration is in code; some requires runtime inspection (numactl --hardware, kubectl describe node); some lives in nobody’s documentation, only in the heads of the team that scaled the cluster the previous Tuesday. The engineer opens an AI agent and types a prompt; the agent attempts the task.
What governs the success of this interaction is not the quality of the human–AI dialogue. It is the prior state of the engineer’s own context. If they have just come from a meeting where the cluster-scale change was discussed, they prompt the agent with that context and the task succeeds. If they have not, they prompt the agent without it, the agent assumes the documented configuration is current, and the task fails — possibly silently, possibly in production. The dyadic frame sees the prompt and the response. It does not see the meeting, the on-call hand-off, the Slack thread, or the design review where the change was first agreed.
This is the unit of failure the next generation of training data has to address.
3. The Triadic Frame
We propose that the unit of analysis for SWE training data should be the triad (two or more humans and at least one AI system) rather than the dyad. The triad is not a rebranding of multimodal capture. It is a claim about where the engineering signal actually lives.
3.1 Three configurations
Configuration A — Pair-with-AI. Two engineers work jointly with an AI agent in real time. One drives the keyboard; the other observes and intervenes. The AI agent is a third participant — sometimes consulted, sometimes overridden. The signal value is in the meta-conversation between the engineers: what one engineer says to the other about what to ask the AI, when to accept its proposal, when to override.
Configuration B — Human-human-then-AI. Two or more engineers complete a synchronous deliberation (design review, on-call hand-off, postmortem) without the AI agent present. One subsequently implements the agreed plan with AI assistance. The signal is the connection between the deliberation and the implementation: what context did the engineer carry into the AI session, what was surfaced in the prompt, what was lost.
Configuration C — Human-human-around-AI. Multiple engineers, possibly across teams, work asynchronously around shared AI agents over days or weeks. Conversations span Slack, video calls, in-person meetings, code review, and design docs. The AI agent participates intermittently. This is the configuration closest to real production engineering at scale, and the data is multimodal, multi-channel, and longitudinal.
All three share a property: the substantive engineering signal is in the human–human portion, with the human–AI portion as downstream consumer or executor.
3.2 Properties of triadic data
Triadic data differs from existing public corpora along three axes that matter for both methodology and licensing.
Synchrony. Triadic data is captured in real time, with all modalities aligned to a single timeline at sub-second resolution: audio per participant, screen content, IDE state, structured action streams, terminal output. Post-hoc transcript-only data loses the screen and IDE state where most engineering decisions are anchored. The alignment infrastructure is mature in egocentric vision research (Ego4D, EPIC-KITCHENS), and the same instrumentation now deployed in consumer AR/VR and industrial wearables transfers directly to engineering settings.
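To make the synchrony requirement concrete, the sketch below shows one way a capture record and its timeline merge might look. It is a minimal illustration in Python, assuming a shared capture clock; the stream names and payload fields are ours, not a proposed standard.

from dataclasses import dataclass, field
from heapq import merge
from typing import Iterable, Iterator

@dataclass(order=True)
class CaptureEvent:
    t: float                              # seconds on the shared capture clock
    stream: str = field(compare=False)    # e.g. "audio:alice", "ide", "terminal"
    payload: dict = field(compare=False)  # stream-specific content

def align(streams: Iterable[Iterable[CaptureEvent]]) -> Iterator[CaptureEvent]:
    # Interleave per-stream event lists into one timeline. Each stream is
    # assumed to be already sorted on the shared clock; sub-second alignment
    # is then a property of capture, not of this merge step.
    return merge(*streams)

# An IDE edit landing between two utterances, reconstructable only
# because all three streams share a clock.
timeline = list(align([
    [CaptureEvent(12.40, "audio:alice", {"text": "the staging region was resized"})],
    [CaptureEvent(12.95, "ide", {"action": "edit", "file": "airflow.cfg"})],
    [CaptureEvent(13.20, "audio:bob", {"text": "so the node sizes in config are stale"})],
]))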
Consent and sanitization. Triadic data is captured from real engineering work, often involving proprietary or sensitive material. Sanitization must occur at capture time, not post-hoc. We propose a layered approach: automated named-entity recognition for hostnames, function names, and customer references, with redaction overlays; automated audio diarization and silencing of identifiable customer names; human-in-the-loop verification on a calibrated sample. Methods from medical-imaging deidentification and protected-speech ASR provide direct precedent. Crucially, simulated companies — fictional companies populated by real senior contributors — are the configuration where consent and sanitization are most tractable, because the proprietary problem is upstream-engineered out.
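A minimal sketch of the first, automated layer follows, using spaCy’s off-the-shelf NER as a stand-in. The entity labels kept, the placeholder scheme, and the hostname pattern are all illustrative assumptions; the downstream layers (diarized-audio silencing, sampled human verification) operate on the same event timeline.

import re
import spacy  # assumes the en_core_web_sm model is installed

nlp = spacy.load("en_core_web_sm")
HOSTNAME = re.compile(r"\b[a-z0-9-]+\.(?:internal|corp|prod)\.[a-z0-9.-]+\b")

def redact(text: str) -> str:
    # Layer 1: pattern-based redaction for infrastructure identifiers.
    text = HOSTNAME.sub("[HOST]", text)
    # Layer 2: NER-based redaction for customer-identifying spans.
    doc = nlp(text)
    out, cursor = [], 0
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "ORG", "GPE"}:
            out.append(text[cursor:ent.start_char])
            out.append(f"[{ent.label_}]")
            cursor = ent.end_char
    out.append(text[cursor:])
    return "".join(out)

print(redact("Paged by Acme Corp about db-7.prod.example.net going dark."))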
Ambiguity tolerance. Triadic data does not have a single ground truth. Two senior engineers may produce different correct designs for the same problem. Methods from affective computing, where ground truth is fundamentally uncertain, provide the relevant prior art. We adopt the principle that rubrics should target inter-rater agreement at κ ≥ 0.75 for categories of events rather than for specific labels, and that disagreement should be preserved in the released dataset rather than majority-voted away. This is a methodological shift the SWE community is unaccustomed to but the perception community has practiced for decades.
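For readers outside the perception community, the agreement target is cheap to operationalize. A toy computation with an invented category set is shown below; agreement is computed over event categories, and the disagreeing labels themselves are retained in the release.

from collections import Counter

def cohen_kappa(a: list[str], b: list[str]) -> float:
    # Chance-corrected agreement between two annotators over the
    # same sequence of events.
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented event-category labels from two annotators on four events.
ann1 = ["implicit_knowledge", "drift_event", "design_decision", "drift_event"]
ann2 = ["implicit_knowledge", "drift_event", "design_decision", "implicit_knowledge"]
print(cohen_kappa(ann1, ann2))  # ~0.64 here; the release gate is 0.75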
3.3 What the triadic frame is not
The triadic reframing is not a rejection of dyadic data; we explicitly endorse the call to collect human–AI dialogue at scale. Nor is it a recommendation for surveillance: every methodology described here is consent-based, applied to volunteer participants, with explicit deletion rights. Nor is it a claim that triadic context replaces solo benchmarks: SWE-bench Pro, Terminal-Bench 2.0, SWE-EVO, and APEX-SWE remain essential evaluation substrate. The triadic frame is a claim about training data, not about evaluation.
4. Two Complementary Data Products
A position paper that names a substrate without specifying capture is incomplete. We outline two complementary products that together instantiate the triadic frame.
4.1 Long-Horizon Expert Trajectories with Stimulated Recall
Senior engineers work through real debugging sessions and feature-implementation tasks in an instrumented environment — IDE telemetry, screen recording, terminal capture — followed by a stimulated-recall walkthrough within thirty minutes, in which the engineer re-watches the session and verbalizes their reasoning. The hybrid passive-then-recall protocol draws on a forty-year HCI tradition (Ericsson and Simon, 1993) and produces richer reasoning traces than concurrent think-aloud, with substantially less behavioral distortion of the underlying work.
Each trajectory is delivered as a structured object: timestamped IDE events, screen segments keyed to events, recall transcript with timestamps, final code diff, and step-level annotations defined at test-execution boundaries. The result is a reasoning trace with verifiable checkpoints — directly compatible with supervised fine-tuning, with preference-pair construction (expert versus baseline-model trajectory on the same task), and with process reward modeling at step granularity.
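A minimal sketch of the delivery object follows; the field names and granularity are illustrative, not a released schema.

from dataclasses import dataclass

@dataclass
class Step:
    start_t: float            # step boundaries fall on test executions
    end_t: float
    ide_events: list[dict]    # timestamped edits, navigations, command runs
    screen_refs: list[str]    # screen-segment ids keyed to the events
    recall: str               # stimulated-recall transcript for this span
    tests_passed: bool        # the verifiable checkpoint

@dataclass
class Trajectory:
    task_id: str
    steps: list[Step]
    final_diff: str           # unified diff of the completed work

def checkpoints(traj: Trajectory) -> list[bool]:
    # Step-level process-reward targets: one verified outcome per step.
    return [s.tests_passed for s in traj.steps]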
The trajectory mix should be deliberately weighted toward the long horizon: a substantial majority of capture time on tasks of one to four hours and an explicit minority on multi-hour extended sessions, with shorter-horizon work serving as an anchor distribution rather than the primary signal. Per-trajectory cost is the binding economic constraint: at realistic senior-engineer rates, a four-hour capture plus its recall walkthrough runs on the order of $1,000 per trajectory. The per-trajectory eval delta on long-horizon evaluations, however, should be commensurately larger than for short-horizon data. The empirical question of whether expert traces yield disproportionate eval uplift, in the spirit of LIMA’s curated-over-quantity result for instruction tuning (Zhou et al., 2023), is the first question the trajectory product is designed to answer.
4.2 Simulated Cross-Functional Companies
The expert-trajectory product captures individual senior reasoning. It does not capture cross-functional friction — the negotiation between engineers, product managers, designers, and data scientists that produces ambiguous requirements, contested architectural decisions, and the lived practice of interpretation. This is what simulated companies are for.
A simulated company is a fictional company — fictional product, fictional customer base, fictional internal politics — staffed by real senior contributors in their actual professional roles. Teams of four to six (engineers, PM, designer, data scientist, occasional security or SRE specialist) work through one-to-three week deliverables using the standard tools of distributed engineering: Slack or equivalent for asynchronous discussion, GitHub for code, design tooling for mockups, video conferencing for synchronous meetings. All channels are instrumented with API capture; all participants sign explicit consent; all sensitive personal information is scrubbed at capture.
The output of a single project is dense and qualitatively different from anything in current corpora. Hundreds of conversational turns across roles. Dozens of document revisions tracking how requirements evolve under negotiation. Design artifacts with version history. A commit graph with associated code review threads. A decision log linking design conversations to architectural choices to implementation. Crucially: the connections between channels — the Slack discussion that produced the design doc that produced the ticket that produced the PR that produced the code review feedback — are preserved.
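One way to see why the preserved connections matter: represented as typed edges (the artifact identifiers and edge types below are invented), the chain becomes a queryable provenance graph rather than a set of disconnected corpora.

# Hypothetical artifact ids; the point is that each link in the
# Slack -> design doc -> ticket -> PR -> review chain is an explicit,
# queryable edge rather than a lost implicit connection.
EDGES = [
    ("slack/thread-4812", "produced", "docs/payments-redesign-v3"),
    ("docs/payments-redesign-v3", "scoped", "tickets/PAY-231"),
    ("tickets/PAY-231", "implemented_by", "github/pr-1077"),
    ("github/pr-1077", "reviewed_in", "github/pr-1077/review-2"),
]

def provenance(artifact: str, edges=EDGES) -> list[str]:
    # Walk backwards from any artifact to the conversation that produced it.
    parents = {dst: src for src, _, dst in edges}
    chain = [artifact]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

print(provenance("github/pr-1077/review-2"))
# ['github/pr-1077/review-2', 'github/pr-1077', 'tickets/PAY-231',
#  'docs/payments-redesign-v3', 'slack/thread-4812']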
The configuration matters. Simulated companies sidestep the proprietary-data problem that makes Configuration C (real human-human-around-AI work) hardest to license. Because the company is fictional, IP and confidentiality concerns are bounded. Because the contributors are real senior practitioners, the engineering signal is preserved. Because the projects span weeks rather than hours, the data captures the temporal scales where current agents fail.
The first-order economic question for simulated companies is whether the per-project cost (substantial: multiple senior contributors for multiple weeks) yields training signal proportionate to the investment. We propose that this is empirically tractable at a pilot scale of a handful of projects, with frontier-lab evaluation of the resulting data as the natural go/no-go gate.
5. Research Directions This Substrate Enables
We outline four research directions enabled by triadic-context data. These are tractable handles, not an exhaustive list.
Implicit-knowledge invocation modeling. Senior engineers continually invoke context that is not in the codebase: “wait, region X is on the older Kubernetes version,” “we tried that two years ago and it didn’t work because of throughput,” “that’s the wrong abstraction.” A model trained on annotated triadic data could learn to detect these moments in real time, propose to elicit the implicit knowledge, and update its working context accordingly. This is a sequence-labeling problem with multimodal inputs, with direct precedents in dialogue-act tagging and emotion-event detection.
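A sketch of the framing, at utterance granularity; the tag set and toy dialogue are invented. A trained model would emit these tags over the synchronized timeline and trigger a context update on each invocation span.

dialogue = [
    ("alice", "let's roll the config change out to all regions"),
    ("bob",   "wait, region X is still on the older Kubernetes version"),
    ("alice", "right, so we gate it behind the version check first"),
]
tags = ["O", "INVOKE", "RESOLVE"]  # one tag per utterance

def context_updates(dialogue, tags):
    # Yield the implicit-knowledge spans an agent should fold into its
    # working context the moment they surface in conversation.
    for (speaker, text), tag in zip(dialogue, tags):
        if tag == "INVOKE":
            yield {"source": speaker, "fact": text}

print(list(context_updates(dialogue, tags)))
# [{'source': 'bob', 'fact': 'wait, region X is still on the older Kubernetes version'}]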
Drift-aware agent training. Solo-trajectory training never teaches that system state changes mid-task. Triadic data captured longitudinally over hours and days contains many examples of state changes — a teammate scales a cluster, a design doc is revised, an upstream service deprecates an API. Agents trained on this data can learn to detect and re-plan around drift. Drift-injected training environments are the natural verifiable-reward complement.
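A minimal sketch of drift injection, assuming a gym-style step/reset interface; the base environment and the drift events are placeholders. The essential property is that state mutates mid-episode independently of the agent’s actions, while the reward still requires a correct final state.

import random

class DriftWrapper:
    def __init__(self, env, drift_events, rate=0.05, seed=0):
        self.env = env
        self.drift_events = drift_events  # callables that mutate env state
        self.rate = rate
        self.rng = random.Random(seed)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if not done and self.rng.random() < self.rate:
            drift = self.rng.choice(self.drift_events)
            drift(self.env)                 # e.g. scale a node pool,
            info["drift"] = drift.__name__  # revise a design doc
        return obs, reward, done, info

# Usage sketch: DriftWrapper(repo_env, [scale_cluster, deprecate_api])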
Cross-functional coordination learning. The capability-gap synthesis of Section 2.4 places ambiguous-requirement handling and architectural reasoning among the top open problems. These are inherently cross-functional. Simulated-company data, captured in Configuration C, is the supervised substrate for learning how senior engineers translate vague PM input into bounded technical specifications, how designers’ constraints propagate through implementation, and how data scientists’ analytical questions become engineering work. The training problem here is hierarchical — per-channel summarization, cross-channel decision tracing, project-level outcome prediction — with direct precedents in long-horizon planning and meeting-summarization research.
Disagreement-grounded reward modeling. APEX-SWE’s epistemic-discipline gap suggests that reward models trained on non-expert preferences over routine problems will not generalize to expert-level work. We propose that triadic data naturally generates expert disagreement on expert-level problems — two senior pairs solving the same simulated-company task in different but defensible ways — which provides a direct training substrate for reward models that learn to detect “locally plausible but globally naive” solutions. Existing preference datasets typically contain non-expert preferences on routine problems; the disagreement-set methodology inverts this.
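A sketch of the construction; the solution records are hypothetical. The key move is that each defensible expert solution is preferred over the baseline, while the expert–expert pair is released as annotated disagreement rather than collapsed by majority vote.

from itertools import combinations

def build_pairs(expert_solutions: list[str], baseline: str):
    # Preference pairs: every defensible expert solution beats the baseline.
    pairs = [{"chosen": s, "rejected": baseline} for s in expert_solutions]
    # Disagreement set: expert-vs-expert pairs are preserved, not voted away.
    disagreements = [{"pair": (a, b), "label": "both_defensible"}
                     for a, b in combinations(expert_solutions, 2)]
    return pairs, disagreements

experts = ["queue-based retry design (pair A)", "idempotent-write design (pair B)"]
pairs, disagreements = build_pairs(experts, "locally plausible but globally naive patch")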
6. What This Approach Deprioritizes — and What Would Invalidate It
A position paper that does not explicitly name what it argues against is unfalsifiable. We name both the alternatives we argue should be deprioritized and the empirical results that would invalidate the position.
6.1 Deprioritizations
Competitive-programming-style data. HumanEval is saturated and likely contaminated. Competitive programming is single-file, well-specified, and deterministically tested: the opposite of real engineering. We argue this data should drop from a substantial fraction of the code-training mix to under 5%, replaced by long-horizon and cross-functional capture.
Single-file isolated-function generation. Real production code imports from dozens of modules, maintains complex state, and interacts with infrastructure. The SWE-bench+ inflation result (12.47% → 3.97% with leakage filtered) suggests that “realistic” benchmarks built on isolated-function reasoning can be gamed. This data should be phased out as a primary training signal.
Python-only annotation pipelines. The twenty-point gap between Python and non-Python SWE-bench performance is a function of annotator-pool composition as much as model behavior. The remediation is language-specialist annotators, not Python generalists working in unfamiliar territory. We acknowledge this gap will narrow rapidly via synthetic-test feedback loops at frontier labs and is therefore not the central position-paper bet, but it should not be left to fester.
Tutorial-quality and toy data. GitClear’s churn analysis is consistent with models that have learned tutorial idioms. Production codebases — large, messy, historically encrusted — should be weighted heavily over tutorial code in any curated mix.
6.2 What would invalidate this position
The position rests on three empirical claims, each testable.
First: that triadic-data-trained models outperform dyadic-data-trained models on long-horizon, drift-prone, multi-team tasks. If carefully matched experiments show no differential, the position is wrong. We do not yet have these experiments; the proposed datasets are the experimental apparatus that would run them.
Second: that simulated-company data carries training signal proportionate to its capture cost. If pilot-scale simulated companies do not yield measurable evaluation uplift on long-horizon and cross-functional tasks, the simulated-company instantiation is wrong even if the triadic frame is right. The first-order pilot is small (single-digit number of projects) and the go/no-go gate is frontier-lab evaluation.
Third: that the agentic flywheel — synthetic data plus verifiable rewards plus larger models — does not absorb the long-horizon gap on its own. If self-play with verifiable rewards continues to produce gains at the rate of the past twelve months, the marginal value of expensive expert capture declines. The DeepSeek-R1-Zero result and the APEX-SWE plateau both have to be re-read in eighteen months.
We commit, on our side, to publishing the empirical results of pilot-scale capture under permissive license regardless of outcome.
7. Future Directions
Three directions deserve naming as extensions of triadic methodology beyond the position’s central claim. They are not the central argument, but the methodology carries over to them directly, and the field is short on practitioners with the right adjacencies.
Visual and perception-native SWE. SWE-bench Multimodal places frontier systems at 12% resolution on visual JavaScript bugs; OSWorld places humans at 72.4% versus ~12% for AI on GUI-rooted tasks. Recent work (SVRepair, FailureMem) demonstrates double-digit gains from semantic scene graphs and region-level visual grounding. The triadic-data methodology extends naturally to visual SWE: pair sessions on frontend bugs, design-mockup-to-implementation traces, accessibility audits, design-review conversations grounded in screen content. The capture infrastructure (synchronized screen plus IDE plus audio) is the same. The annotation taxonomy needs extension, but the precedents are mature in egocentric vision and human-activity recognition. This is a high-conviction direction for any group with deep perception expertise: pure-LLM researchers are bolting vision onto text-native architectures; the field is short on people who think natively in spatial and visual grounding.
Security-focused expert capture. Perry et al.’s confidence-competence inversion — AI-assisted developers producing less secure code while believing the opposite — is uniquely dangerous because it scales harm rather than productivity. Triadic capture of security engineers and penetration testers, in pair sessions on real vulnerability investigations, produces training signal that no scrape of GitHub will. The annotator pool here is small (security specialists, not coding generalists) and the consent protocols are stricter (real-CVE work requires careful de-identification), but the per-trajectory eval delta is correspondingly large.
Formal verification as RL reward. For safety-critical domains — embedded systems, medical devices, aerospace, financial infrastructure — probabilistic correctness is insufficient. Formal verification tools have reached the maturity to serve as automated reward signals in reinforcement learning, complementing test-suite pass/fail with stronger correctness guarantees. The training substrate here is not human capture but tooling integration; the data agenda is to identify the corpus of formally-specifiable problems where this is operationally feasible. Triadic data has a contribution here too: senior practitioners’ reasoning about what to formally verify and what to leave as test-suite-checked is itself a training signal worth capturing.
These three directions are complementary to the central position rather than alternatives to it. The point of naming them is to mark the methodology as extensible — the triadic frame is not specific to text-based engineering work — and to signal where natural research lineages exist for groups with relevant expertise.
8. Conclusion
The next substantial gains in frontier SWE agents will come from training data that captures how senior engineers handle context, not from more solo trajectories or harder benchmarks. Existing data approaches — GitHub scrapes, agent self-trajectories, dyadic human–AI dialogue — are necessary but insufficient. The deeper substrate is triadic: the human–human conversations where context is formed, the human–AI sessions where it is consumed, and the cross-functional work that surrounds both. Two complementary products instantiate this substrate: long-horizon expert trajectories with stimulated-recall capture, and simulated cross-functional companies. The capture methodology is already mature in adjacent fields. The legal and ethical infrastructure is operational. The empirical case is testable in a 12-to-18-month timeframe. The field’s near-term research agenda should include this work explicitly, and the case for it should be made before frontier labs begin asking for the data — which we expect within twelve months regardless of whether the literature has caught up.
References
Anthropic (2025). Effective harnesses for long-running agents. Engineering blog, October 2025.
Anthropic (2026). Managed Agents. Engineering blog, February 2026.
Anthropic (2025). Natural Emergent Misalignment from Reward Hacking. Research report, November 2025.
Aleithan, R., et al. (2024). SWE-Bench+: Enhanced Coding Benchmark for LLMs. arXiv:2410.06992.
DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
Ericsson, K. A., & Simon, H. A. (1993). Protocol Analysis: Verbal Reports as Data. MIT Press.
GitClear (2024). Coding on Copilot: 2024 Data Suggests Downward Pressure on Code Quality.
Mercor & Cognition (2026). APEX-SWE: A Benchmark for Integration and Observability Tasks in Software Engineering.
METR (2025). Measuring AI Ability to Complete Long Tasks. arXiv:2503.14499. Time Horizon 1.1 update, January 2026.
Perry, N., Srivastava, M., Kumar, D., & Boneh, D. (2023). Do Users Write More Insecure Code with AI Assistants? CCS ’23. arXiv:2211.03622.
Scale AI (2025). SWE-bench Pro: Verified instances for SWE benchmark evaluation.
Yang, J., Jimenez, C. E., et al. (2024). SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? arXiv:2410.03859.
SWE-EVO (2026). SWE-EVO: Long-Horizon Repository Evolution Benchmark. arXiv:2512.18470.
Miserendino, S., Wang, M., Patwardhan, T., & Heidecke, J. (2025). SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? arXiv:2502.12115.
Terminal-Bench (2026). Terminal-Bench 2.0 Leaderboard. tbench.ai, April 2026.
Wang, Z., et al. (2025). Position: Humans are Missing from AI Coding Agent Research. CMU/Stanford/Princeton.
Wang, Z. R., et al. (2026). Ambig-SWE: Evaluating LLMs on Ambiguous Software Engineering Tasks. ICLR 2026. arXiv:2502.13069.
Yue, Y., et al. (2025). Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? NeurIPS 2025 Oral. arXiv:2504.13837.
Zhou, C., et al. (2023). LIMA: Less Is More for Alignment. arXiv:2305.11206.
Author note: This paper was first published at yelinkim.com on May 1, 2026. An expanded version is in preparation for arXiv submission.