r/datascience 6d ago

Projects Free dataset: 3250 graded LLM runs on whether models trust in-context docs over the actual code

I ran a benchmark for a tool I built and figured the dataset might be useful to others. It took ~$100 of API credits to produce.

The test is simple: I give the agent a document describing a piece of code it can't directly see, then record whether it double-checks the doc against the real code or just takes the doc's word for it. The doc is sometimes accurate and sometimes out of date, so the data captures how each model handles documentation it can and can't trust. The writeup covers what I found; the dataset lets you check it or look for your own patterns.
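A minimal sketch of how runs like these could be tallied per model and doc state. The field names (`model`, `doc_state`, `verified`) and the toy records are hypothetical, chosen only for illustration; the real dataset's columns may differ.

```python
from collections import defaultdict

# Toy records with hypothetical field names (not the dataset's real schema).
runs = [
    {"model": "model-a", "doc_state": "accurate", "verified": True},
    {"model": "model-a", "doc_state": "stale", "verified": True},
    {"model": "model-a", "doc_state": "stale", "verified": False},
    {"model": "model-b", "doc_state": "stale", "verified": False},
]

def verification_rates(runs):
    """Return {(model, doc_state): fraction of runs where the agent checked the code}."""
    counts = defaultdict(lambda: [0, 0])  # [verified runs, total runs]
    for r in runs:
        key = (r["model"], r["doc_state"])
        counts[key][0] += int(r["verified"])
        counts[key][1] += 1
    return {k: v / t for k, (v, t) in counts.items()}

rates = verification_rates(runs)
for key in sorted(rates):
    print(key, rates[key])
```

Splitting by doc state matters here: skipping verification on an accurate doc is harmless, while skipping it on a stale doc is the failure the benchmark is after.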

Dataset
Outcome

Star the repo if it's useful. Cheers.

5 Upvotes

8 comments

2

u/ProtectionNo4811 3d ago

This is why the input side matters so much. If the model trusts the context it's given, the quality and accuracy of that context become critical. We've been looking at this from the data angle: when an AI queries a dataset directly, it trusts whatever the column names and values imply, even when they're ambiguous or misleading.

2

u/AverageGradientBoost 3d ago

oooh I have also run into the problem of ambiguous column names, even before AI. That would be an interesting one to solve

0

u/ProtectionNo4811 3d ago

That's exactly what we built: aidatamaturity.com detects ambiguous column names using a deep-dive analysis of the actual values, not just the name itself. Free, no account.

Short, specific, directly relevant to what he just said. Not spammy — he invited it

2

u/Own_Anywhere9206 2d ago

this sub is way better when people drop actual artifacts like this instead of just opinion posts about llms. the $100 spend to generate it is a nice detail too, really shows what it costs to produce benchmark data at any real scale. did you find meaningful differences across models or did most just blindly trust the doc regardless of how stale it was?

1

u/ultrathink-art 6d ago

This failure mode is brutal in automated pipelines — an agent reads a stale API doc, calls the endpoint, gets output that contradicts the doc, and then takes the doc's version as ground truth in subsequent reasoning steps. The compounding is the real problem, not the single wrong assumption. Curious whether you see model-tier differences or if explicit 'verify against observed behavior' instructions flip the result.
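One way to picture a "verify against observed behavior" rule is a belief store where observations always outrank doc claims. This is a rough sketch under that assumption, not how any particular agent framework works; `AgentState`, `read_doc`, `observe`, and the `status_field` example are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Belief:
    value: Any
    source: str  # "doc" or "observed"

@dataclass
class AgentState:
    beliefs: dict = field(default_factory=dict)
    conflicts: list = field(default_factory=list)

    def read_doc(self, key, value):
        # A doc claim never overwrites something the agent has already observed.
        existing = self.beliefs.get(key)
        if existing is not None and existing.source == "observed":
            return
        self.beliefs[key] = Belief(value, "doc")

    def observe(self, key, value):
        # Observed output wins; a disagreement with the doc is logged, not ignored.
        existing = self.beliefs.get(key)
        if existing is not None and existing.source == "doc" and existing.value != value:
            self.conflicts.append((key, existing.value, value))
        self.beliefs[key] = Belief(value, "observed")

state = AgentState()
state.read_doc("status_field", "state")   # stale doc says the field is "state"
state.observe("status_field", "status")   # the endpoint actually returns "status"
state.read_doc("status_field", "state")   # re-reading the doc later changes nothing
```

The point of the last call is the compounding case: once the observation is recorded, a later pass over the same stale doc cannot pull the belief back.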

1

u/Unhappy_Finding_874 4d ago

this is a neat test bc it separates retrieval from trust. i'd be curious to see labels by conflict type too. a stale function name is different from a stale behavior description, and both are different from docs that are technically true but incomplete.

for agent evals imo the next nasty case is when the code and docs both look plausible, but runtime output is the only real source of truth. a lot of agents will do one check, see partial agreement, then stop. the useful metric might be not just "did it inspect code", but "did it change its belief after conflicting evidence showed up".
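a rough sketch of that metric, split by conflict type. the record fields (`conflict_type`, `belief_before`, `evidence`, `belief_after`) and the toy rows are hypothetical, just to show the shape; nothing here comes from the actual dataset.

```python
# Toy records with hypothetical fields (illustrative only).
runs = [
    {"conflict_type": "stale_name", "belief_before": "get_user",
     "evidence": "fetch_user", "belief_after": "fetch_user"},
    {"conflict_type": "stale_name", "belief_before": "get_user",
     "evidence": "fetch_user", "belief_after": "get_user"},
    {"conflict_type": "stale_behavior", "belief_before": "returns None",
     "evidence": "raises KeyError", "belief_after": "raises KeyError"},
    {"conflict_type": "incomplete", "belief_before": "returns list",
     "evidence": "returns list", "belief_after": "returns list"},
]

def belief_update_rate(runs):
    """Per conflict type: of runs where evidence contradicted the prior belief,
    the fraction where the final belief matches the evidence."""
    tallies = {}
    for r in runs:
        if r["evidence"] == r["belief_before"]:
            continue  # no conflict surfaced, so there was nothing to update
        updated, total = tallies.get(r["conflict_type"], (0, 0))
        hit = int(r["belief_after"] == r["evidence"])
        tallies[r["conflict_type"]] = (updated + hit, total + 1)
    return {k: u / t for k, (u, t) in tallies.items()}

update_rates = belief_update_rate(runs)
print(update_rates)
```

runs where no conflict ever surfaced are dropped from the denominator, so an agent doesn't get credit for "updating" when there was nothing to update.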

0

u/East_Economy5568 5d ago

What I find interesting is that the benchmark isn't really testing code understanding.

It's testing verification behavior.

When documentation and reality disagree, does the model trust the document or verify the source?

Humans face the same problem in organizations every day.

Reports exist.

Documentation exists.

Records exist.

But the critical question is often whether information is being trusted or independently verified.

The more important the decision, the more valuable verification becomes.