Turning "feels good" into a score: how AutoResearch-style loops can improve skills, workflows, and development tasks
An English analysis based on two WeChat articles and project READMEs, explaining how AutoResearch-style loops can quantify and improve Skills, workflows, and development tasks.
Turning "feels good" into a score: how AutoResearch-style loops can improve skills, workflows, and development tasks
Karpathy's autoresearch is built around a simple idea: don't ask an AI agent for one answer and stop there. Put it inside a measurable environment and let it iterate. The agent changes code, runs an experiment, checks the metric, and keeps the change only if the result improves. In the project's README, the scope is deliberately narrow: the agent edits train.py, does not touch prepare.py, trains for a fixed five-minute budget, and uses val_bpb as the metric. Lower is better. No vibes, no "this looks promising". A number decides whether the change stays.
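As a control flow, the loop fits in a few lines. The sketch below is only an illustration of the pattern, not the project's actual harness; propose_patch, run_experiment, apply, and revert are hypothetical helpers standing in for the agent's edit to train.py, the fixed five-minute training run, and working-copy management.

```python
# A minimal sketch of the keep-if-improved loop, under the assumptions above.
def autoresearch_loop(iterations, propose_patch, run_experiment, apply, revert):
    best = run_experiment()            # baseline val_bpb; lower is better
    history = []
    for i in range(iterations):
        patch = propose_patch()        # agent proposes a change to train.py only
        apply(patch)
        score = run_experiment()       # same fixed training budget, same metric
        kept = score < best            # a number decides whether the change stays
        if kept:
            best = score
        else:
            revert(patch)              # regressions are discarded, not debated
        history.append({"iteration": i, "val_bpb": score, "kept": kept})
    return best, history
```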
The two WeChat articles take that idea into two adjacent places. The first article, "I moved Karpathy's AutoResearch into software development," describes a system that starts from a GitHub Issue, lets agents implement and review code, runs tests, scores the result, then opens and merges a PR after the score passes the gate. Its scoring model uses five weighted dimensions: correctness at 35%, tests at 25%, code quality at 20%, security at 10%, and performance at 10%. The article says the threshold is 9.0 out of 10. The second article, "I used Karpathy's autoresearch idea to optimize a Skill," applies the same pattern to a Skill called story-builder. The agent may edit SKILL.md, but not the benchmark, eval, or rubric files. After each iteration, the system runs evaluation. If the score improves, keep it. If not, discard it. In that experiment, testability moved from 8.6 to 10.0, coverage from 98% to 100%, with 11 iterations, 5 kept and 4 discarded.
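The gate itself is simple arithmetic. Below is a minimal sketch of that weighted score, assuming each dimension is already rated on a 0-10 scale by the review step; the dictionary keys and function names are illustrative, not the article's actual code.

```python
# A sketch of the five-dimension weighted review score with a 9.0 gate,
# assuming per-dimension ratings on a 0-10 scale.
WEIGHTS = {
    "correctness": 0.35,
    "tests": 0.25,
    "code_quality": 0.20,
    "security": 0.10,
    "performance": 0.10,
}
THRESHOLD = 9.0

def review_score(ratings: dict[str, float]) -> float:
    return sum(WEIGHTS[dim] * ratings[dim] for dim in WEIGHTS)

def passes_gate(ratings: dict[str, float]) -> bool:
    return review_score(ratings) >= THRESHOLD

# Example: strong correctness and tests, weaker performance.
print(review_score({"correctness": 9.5, "tests": 9.0, "code_quality": 9.0,
                    "security": 9.0, "performance": 8.0}))  # 9.075
```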
The useful part is not simply "let AI do more work." The useful part is turning prompt writing into a small self-filtering system.
Why prompts alone are not enough
Good prompts matter. They define style, boundaries, output format, and constraints. But a prompt has one major weakness: it does not tell you whether this output is actually better than the last one.
Take a Skill that turns messy requirements into user stories. You can write in the prompt: "Make the acceptance criteria concrete and testable. Avoid vague wording." That is directionally right. But can the generated then field be turned into an automated assertion? If it says, "the system handles it normally and inventory is restored," is that testable? A human can tell it is not. A system cannot, unless you define the evaluation rule.
The second article gives a clean example. Before optimization, one then field said:
The system handles it normally and inventory is restored.
After optimization, it became:
The order status becomes "Cancelled", and product A inventory is restored to 52 (original 50 + returned 2).
The second version can be turned directly into assertions: assert order.status == "Cancelled" and assert product_a.stock == 52. That is the shift from a nice-sounding instruction to a measurable criterion.
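To make the contrast concrete, here is a tiny sketch with hypothetical Order and Product objects: the vague version offers nothing to assert, while the rewritten then field maps directly onto two checks.

```python
# Hypothetical objects standing in for the system under test.
from dataclasses import dataclass

@dataclass
class Order:
    status: str

@dataclass
class Product:
    stock: int

def check_cancellation(order: Order, product_a: Product) -> None:
    assert order.status == "Cancelled"
    assert product_a.stock == 52  # original 50 + returned 2

check_cancellation(Order(status="Cancelled"), Product(stock=52))
```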
Can this make skills self-improving?
Yes, but only under strict conditions.
Karpathy's original project has a natural metric: val_bpb. Software development does not. That is why the first article uses a multi-dimensional review score. Skills are even more vulnerable to score gaming, so the second article separates what the agent may edit from what it may not edit. The agent can edit the Skill. It cannot edit the benchmark or the rubric. That design matters. If the agent can change the exam, it is no longer improving the Skill. It is just rewriting the answer key.
Before applying this pattern to a real workflow, I would ask four questions:
- What exactly is being optimized: a SKILL.md file, a feature branch, an app prototype, or a research workflow?
- Which files are off-limits: tests, rubrics, benchmarks, CI files, production data?
- What counts as improvement, and can it be expressed as 3 to 6 stable evaluation signals?
- After improvement is detected, should the system keep the change automatically, ask for human approval, or only produce a candidate patch?
If those answers are unclear, do not start with full automation. Start with a semi-automated loop.
How to quantify things that seem hard to score
Many tasks look impossible to quantify because we try to score them as one large thing: "Is this article good?" "Is this Skill reliable?" "Is this app finished?" Those questions are too broad. The better move is to split them into smaller yes/no checks.
Translate subjective words into observable signals
Do not score "testability" directly from 1 to 10. Ask whether each then statement can be converted into an assertion without human judgment. Yes is 1. No is 0.
Do not score "coverage" by feel. Split the source requirement into atomic requirement points and calculate: covered requirement points divided by total requirement points.
Do not score "code quality" as a mood. Use harder signals: do tests pass, does lint pass, were forbidden paths modified, were TODO placeholders introduced, did the change add regression tests for the affected path?
Keep the number of metrics small
The Skill optimization article mentions a practical failure mode: when the author started with five metrics, the agent began optimizing unimportant side metrics while the core quality got worse. After cutting the set down to three, the direction became clearer. That matches what I have seen elsewhere. Too many metrics make the target blurry. For most Skills and workflows, 3 to 6 metrics are enough.
Use minimum scores to protect against weak spots
If a Skill generates 20 acceptance criteria and 19 are good but one is unusable, an average score hides the problem. The second article's choice to use the lowest then score for testability is worth copying. For high-risk work, the weakest link often matters more than the average.
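A two-line example makes the point: with 19 good criteria and one unusable one, the average still looks healthy while the minimum exposes the gap.

```python
# Per-criterion testability scores: 19 assertable, 1 not.
scores = [1.0] * 19 + [0.0]
print(sum(scores) / len(scores))  # 0.95 -- the average hides the problem
print(min(scores))                # 0.0  -- the minimum exposes it
```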
Separate hard gates from soft scoring
Some checks should not be weighted. They should be hard failures:
- Tests fail.
- The agent modifies forbidden paths.
- An article contains an unsupported factual claim.
- The app cannot launch.
Score the remaining quality after those gates pass. Otherwise one serious failure can be hidden by a few attractive secondary scores.
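A small sketch of that ordering, assuming a hypothetical run object with boolean results and a separate soft scorer such as the weighted one described earlier.

```python
# Hard gates short-circuit; soft scoring happens only for runs that survive them.
def evaluate(run, soft_quality_score) -> float | None:
    hard_gates = [
        run.tests_pass,                    # tests fail -> hard failure
        not run.modified_forbidden_paths,  # touched locked files -> hard failure
        not run.has_unsupported_claims,    # invented facts -> hard failure
        run.app_launches,                  # cannot launch -> hard failure
    ]
    if not all(hard_gates):
        return None                        # no score at all; the change is discarded
    return soft_quality_score(run)         # weighted scoring only after the gates pass
```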
Three practical examples
Example one: optimizing a Skill
Suppose you have a competitor-analysis Skill. Instead of writing "produce a high-quality competitor analysis," build a loop like this:
- Editable file: SKILL.md.
- Locked files: eval/, benchmark/, rubric.yaml.
- Evaluation set: five historical tasks, such as app competitor analysis, AI product analysis, pricing page analysis, review analysis, and ad creative analysis.
- Hard gate: no invented data; every numerical or factual claim must trace back to a source file or API output.
- Metrics: factual accuracy, coverage, actionable recommendation ratio, and citation completeness.
Factual accuracy can be measured by extracting factual claims from the output and checking whether they appear in the source material. Coverage can be measured by comparing the required sections in the rubric with the completed sections in the answer. An actionable recommendation can be defined as one that includes an object, an action, an expected effect, and a verification method. Once you define these rules, "analysis quality" becomes less mysterious.
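A minimal sketch of those rules, assuming recommendations arrive as structured fields and factual claims have already been extracted from the output; the substring match for claim tracing is deliberately crude and stands in for a real entity-and-number check.

```python
# Illustrative evaluation rules for the competitor-analysis Skill.
def is_actionable(rec: dict) -> bool:
    # actionable = object + action + expected effect + verification method
    return all(rec.get(k) for k in ("object", "action", "expected_effect", "verification"))

def factual_accuracy(claims: list[str], source_text: str) -> float:
    # a claim counts as supported only if it traces back to the source material
    supported = [c for c in claims if c in source_text]
    return len(supported) / len(claims) if claims else 1.0

def coverage(required_sections: set[str], completed_sections: set[str]) -> float:
    # required sections in the rubric vs. completed sections in the answer
    return len(required_sections & completed_sections) / len(required_sections)
```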
Example two: implementing a feature
The software-development version in the first article already gives the shape: Issue input, agent implementation, cross-agent review, tests, then a weighted score across correctness, tests, code quality, security, and performance. This is best suited for medium-sized tasks with clear boundaries. The article's example around adding timeout, retries, and agent selection to a Job flow reportedly reached 9.0/10 after roughly 10 minutes and 3 iterations.
For larger work, such as a full architecture refactor or a product-level redesign, automatic merge is too risky. A safer version is: the agents generate a proposal and a patch, the system runs tests and static checks, and a human decides whether to merge.
Example three: rebuilding an app prototype
If the goal is to recreate a small app, split the work into measurable slices:
- Launch: the app starts on the target device or simulator.
- Core flow: the user can register, create content, publish it, and view the result.
- UI alignment: screenshots match the design structure within an agreed threshold, or a visual review agent scores specific elements.
- Data integrity: mock data is not passed off as real data unless the task explicitly allows it.
- Regression: existing smoke tests still pass.
The important part is not to ask an LLM whether the app "looks good" and trust the answer. If a visual agent is used, its judgment should be split into concrete checks: is navigation present, are buttons clickable, is copy consistent, is the screenshot preview visible, are empty states handled?
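A small sketch of that split, assuming a hypothetical visual-review agent that answers yes/no questions about a screenshot; the check list and function names are illustrative.

```python
# "Looks good" decomposed into concrete, countable checks.
CHECKS = [
    "Is the navigation present?",
    "Are the primary buttons visible and clickable?",
    "Is the copy consistent with the design text?",
    "Is the screenshot preview visible?",
    "Are empty states handled?",
]

def ui_alignment_score(screenshot_path: str, ask_visual_agent) -> float:
    answers = [ask_visual_agent(screenshot_path, q) for q in CHECKS]  # each True/False
    return sum(answers) / len(CHECKS)
```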
Does this reduce hallucination?
It can, but not because the model suddenly becomes more honest.
It reduces hallucination by adding external pressure. If the agent says tests pass, the system actually runs the tests. If it claims coverage improved, the system calculates coverage. If it modifies a forbidden path, the script blocks the run. Each iteration faces feedback from the environment, the rubric, or another agent. Hallucination is not solved by telling the model to be careful. It is reduced when wrong claims hit hard boundaries.
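In code, the pressure is just the harness doing the work itself rather than trusting the claim. The sketch below assumes a standard pytest invocation and a list of changed files taken from the agent's diff; the function names are illustrative.

```python
import subprocess

def tests_actually_pass() -> bool:
    # run the suite instead of believing the agent's report
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0

def touched_forbidden_paths(changed_files: list[str], forbidden: tuple[str, ...]) -> bool:
    # block the run if any change lands under a locked path
    return any(path.startswith(forbidden) for path in changed_files)
```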
That is why this approach is more stable than a prompt alone. A prompt expresses intent once. An AutoResearch-style workflow creates a loop. It records scores, discards regressions, and feeds failure information into the next iteration. Over time, that looks less like chatting and more like engineering.
Advantages
- More stable outputs, because every iteration passes through the same evaluation.
- A traceable improvement path: keep/discard logs show which ideas actually helped.
- Strong fit for Skills, because a Skill is already an operating procedure for an agent.
- Human experience becomes system behavior. You stop reminding the agent every time and put the rule into the evaluator.
- Cross-agent review can reduce single-model blind spots. The first article's rotation across Codex, Claude, and OpenCode follows that logic.
Weaknesses
- Eval design is hard. Bad metrics make the system optimize the wrong thing.
- Score gaming is real. If the agent can edit tests, rubrics, or benchmarks, the result is no longer trustworthy.
- Cost can rise quickly. Multiple agents, repeated test runs, and review loops consume time and tokens.
- Not every task fits. Open-ended research, taste-heavy judgment, and strategy work are hard to automate fully.
- Scores can create false confidence. A 9.0/10 score means the change passed this evaluation, not that it has no bugs.
How I would apply it
I would not start with a fully autonomous self-improving system. I would pick one frequent, bounded, low-risk Skill first: article editing, requirement splitting, code review, or research synthesis. Freeze 10 to 20 historical examples as the evaluation set. Write three core evals. Run 10 iterations. Allow the agent to edit only SKILL.md, not the evaluation set. Keep improvements, discard regressions, then do a human spot check.
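One way to pin those boundaries down is a small configuration the harness reads before every run. The keys below are hypothetical; the point is that the editable surface, the locked files, the three core evals, and the merge policy are written down rather than implied.

```python
# Illustrative configuration for the first semi-automated loop.
LOOP_CONFIG = {
    "editable_files": ["SKILL.md"],
    "locked_paths": ["eval/", "benchmark/", "rubric.yaml"],
    "evaluation_set": "10-20 frozen historical examples",
    "metrics": ["testability", "coverage", "factual_accuracy"],
    "iterations": 10,
    "on_improvement": "keep",
    "on_regression": "discard",
    "merge_policy": "human_spot_check",  # no automatic merge at this stage
}
```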
If that works, extend the pattern into development tasks. Start with low-risk Issues in projects with decent tests. Automatic PR creation is fine. Automatic merge should come later.
This is why I find the idea genuinely useful. It is better than just writing a longer prompt. The prompt still matters, but it is only one part of the system. The more important pieces are evals, logs, rollback, permission boundaries, and feedback loops. Once those are in place, the AI is not starting from zero every time. It is improving along a path that has been tested.
So, is this one possible future for reducing hallucination and making AI work more reliable? I think yes. It will not replace human judgment. It moves part of that judgment earlier, into rules, evals, and gates. For personal workflows, team Skills, and long-running development tasks, that is already a meaningful upgrade.
Sources
- "I moved Karpathy's AutoResearch into software development," Baidu Geek Talk.
- "I used Karpathy's autoresearch idea to optimize a Skill," Agile Performance / AI Change Leadership.
- karpathy/autoresearch README, which describes the single editable file, fixed five-minute budget, val_bpb metric, and program.md.
- smallnest/autoresearch README, which describes GitHub Issue, local Issue, and Baidu iCafe modes, plus rotating review with Claude, Codex, and OpenCode.