BigCodeBench: Testing How LLMs Handle Real Library Calls

  • BigCodeBench is a code-generation benchmark of 1,140 function-level tasks that force models to compose calls from 139 libraries across 7 domains.
  • Each task ships with roughly 5.6 test cases reaching 99% average branch coverage, so a passing solution has to actually run, not just look plausible.
  • Across 60 evaluated LLMs, the best scores around 60% while humans reach 97%, a gap suggesting that today’s models still misuse APIs when instructions get complex.

Overview

BigCodeBench is the BigCode team’s attempt to measure something most code benchmarks skip: whether a model can stitch together real library calls to solve a task, not just write a self-contained function. It was published by Hugging Face / BigCode on 22 June 2024, led by Terry Yue Zhuo with Leandro von Werra (BigCode’s lead) as the final author, and later accepted to ICLR 2025 as an oral.

The premise is simple and a little uncomfortable. Most real engineering work is gluing libraries together. If a model can pass HumanEval but can’t correctly call pandas, requests, and matplotlib in sequence to answer a complex prompt, the benchmark number is lying to you. BigCodeBench tries to make that failure visible.

What the benchmark actually tests

The 1,140 tasks are function-level: each one defines a function signature and a docstring describing what to build. What makes them hard is the dependency surface. Solutions draw on 139 libraries spanning seven domains — data analysis, web development, networking, and others — and a single task often requires invoking several of those libraries together. That is the “diverse function calls” half of the title.

The “complex instructions” half is about reading comprehension under pressure. The prompts carry constraints, edge conditions, and expected behaviors that a model has to translate into the right sequence of API calls. BigCode calls this compositional reasoning: understanding the full instruction and mapping it onto multiple tools at once, rather than pattern-matching a single idiom.
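To make the shape of such a task concrete, here is a minimal sketch of what a function-level, multi-library problem can look like. It is our own illustration, not an item from the dataset: the function name, the columns, and the constraints are all hypothetical, and it sticks to the standard library (csv, io, json, statistics, collections) so it runs anywhere.

```python
import csv
import io
import json
import statistics
from collections import defaultdict


def task_func(csv_text: str, group_col: str, value_col: str) -> str:
    """Group rows of a CSV string by group_col and return a JSON object
    mapping each group to the mean of value_col.

    Constraints (hypothetical, standing in for a "complex instruction"):
    - Rows whose value is missing or non-numeric are skipped.
    - Raise ValueError if either column is absent from the header.
    - JSON keys are sorted; means are rounded to 2 decimals.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if group_col not in header or value_col not in header:
        raise ValueError("missing column")

    groups = defaultdict(list)
    for row in reader:
        try:
            value = float(row[value_col])
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric values
        groups[row[group_col]].append(value)

    means = {key: round(statistics.mean(vals), 2) for key, vals in groups.items()}
    return json.dumps(means, sort_keys=True)
```

Nothing here is algorithmically hard. The difficulty is that a correct answer has to honor every constraint in the docstring while using four or five modules the way their APIs actually work.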

How tasks are graded

Every task carries an executable test suite, averaging about 5.6 test cases, and those suites reach 99% average branch coverage. That coverage figure is the part QA practitioners should care about most. High branch coverage means the tests exercise the conditional paths inside a solution, so a model can’t squeak by with code that handles the happy path and silently breaks on an edge case. Grading is functional correctness measured by Pass@1: whether the first solution a model generates runs and passes the task’s test suite, which the model does not see in its prompt.
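The grading idea can be sketched in a few lines. This is a simplified illustration under our own assumptions, not the benchmark’s actual harness: the helper names passes_suite and pass_at_1 and the safe_divide task are hypothetical, and a real harness would run untrusted code in a sandbox with timeouts rather than calling exec in-process.

```python
import unittest


def passes_suite(candidate_src: str, test_src: str) -> bool:
    """Return True only if every unittest case passes against the candidate."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the generated function
        exec(test_src, namespace)       # define the TestCase classes
    except Exception:
        return False

    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for obj in list(namespace.values()):
        if (isinstance(obj, type) and issubclass(obj, unittest.TestCase)
                and obj is not unittest.TestCase):
            suite.addTests(loader.loadTestsFromTestCase(obj))

    result = unittest.TestResult()
    suite.run(result)
    return result.testsRun > 0 and result.wasSuccessful()


def pass_at_1(outcomes) -> float:
    """Fraction of tasks whose single generated solution passed its suite."""
    return sum(outcomes) / len(outcomes)


TESTS = '''
import unittest

class TestSafeDivide(unittest.TestCase):
    def test_normal(self):
        self.assertEqual(safe_divide(6, 3), 2)

    def test_zero_divisor(self):
        # Exercises the branch a happy-path-only solution never handles.
        self.assertIsNone(safe_divide(1, 0))
'''

HAPPY_PATH_ONLY = "def safe_divide(a, b):\n    return a / b\n"
HANDLES_EDGE = "def safe_divide(a, b):\n    return None if b == 0 else a / b\n"

outcomes = [passes_suite(HAPPY_PATH_ONLY, TESTS), passes_suite(HANDLES_EDGE, TESTS)]
score = pass_at_1(outcomes)
```

The first candidate passes the normal case and still counts as a failure, because the zero-divisor test reaches a branch it never wrote. That is the practical effect of high branch coverage.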

Complete vs Instruct

BigCodeBench runs in two modes that probe different skills. The Complete variant gives the model a structured docstring — full signature, detailed description, examples — and asks it to finish the implementation. This is the classic code-completion setup and plays to a model’s training distribution.

The Instruct variant (BigCodeBench-Instruct) automatically rewrites those docstrings into short natural-language instructions that keep only the essential information. No hand-holding, no worked examples. The drop from Complete to Instruct is the interesting signal: it isolates how much a model’s apparent coding ability actually depends on being spoon-fed a rich spec.
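A rough way to picture the two modes is to derive a short instruction from a docstring-style prompt. The sketch below is only an approximation we wrote for illustration: the prompt text, the to_instruct helper, and the output wording are hypothetical, and the benchmark’s real conversion is not reproduced here. It needs Python 3.9+ for ast.unparse.

```python
import ast

# Hypothetical Complete-style prompt: signature plus a detailed docstring.
COMPLETE_PROMPT = '''
def task_func(values, threshold=0.5):
    """Return the items of values that exceed threshold.

    Parameters:
    - values (list of float): numbers to filter.
    - threshold (float): cutoff, exclusive.

    Example:
    >>> task_func([0.2, 0.9], threshold=0.5)
    [0.9]
    """
'''


def to_instruct(complete_prompt: str) -> str:
    """Keep the one-line summary and the signature; drop parameters and examples."""
    func = ast.parse(complete_prompt).body[0]
    doc = ast.get_docstring(func) or ""
    summary = doc.split("\n\n")[0].replace("\n", " ")
    signature = f"def {func.name}({ast.unparse(func.args)}):"
    return f"{summary} Write self-contained code starting with:\n{signature}"


INSTRUCT_PROMPT = to_instruct(COMPLETE_PROMPT)
```

Comparing a model’s score on the long prompt with its score on the short one is what exposes how much it leans on the parameter list and the worked example.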

The headline finding

Across 60 LLMs, the strongest reached roughly 60% on the benchmark. Human developers scored 97%. That spread is the paper’s central claim, and it’s worth sitting with. On simpler benchmarks frontier models look near-human; here, a 37-point gap opens up the moment tasks require correctly chaining real library calls under a complex instruction. The authors’ conclusion is blunt — current LLMs are not yet able to follow complex instructions to use function calls precisely.

Why it matters for code-generation QA

If your team evaluates AI-generated code with anything resembling HumanEval or MBPP, BigCodeBench is a warning that those scores overstate readiness for production work. Real codebases live on dependencies. A model that aces toy algorithm problems can still hallucinate a method that doesn’t exist on a pandas DataFrame, or pass the wrong keyword argument to a networking call. BigCodeBench is built to catch exactly that, and a top score of around 60% means even the strongest model evaluated fails roughly four tasks in ten.

The design choice we’d flag for anyone building their own eval harness is the test rigor. Branch coverage near 99% is the difference between a benchmark that measures correctness and one that measures plausibility. Too many internal evals score generated code by string similarity, an LLM judge, or a single smoke test. Each of those is gameable. Executable tests with deep branch coverage are much harder to game: the code either survives the suite or it doesn’t.
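A small experiment shows why similarity scoring misleads. In this sketch the reference and the "generated" snippet are both made up for illustration, and the generated one passes a keyword, sorted_keys, that json.dumps does not accept. difflib rates the two as nearly identical, while actually running them separates a working function from one that raises TypeError.

```python
import difflib
import json

REFERENCE = "def to_json(d):\n    return json.dumps(d, sort_keys=True)\n"
# Hypothetical model output: plausible, but sorted_keys is not a json.dumps parameter.
GENERATED = "def to_json(d):\n    return json.dumps(d, sorted_keys=True)\n"

# Character-level similarity between the two source strings (0.0 to 1.0).
similarity = difflib.SequenceMatcher(None, REFERENCE, GENERATED).ratio()


def runs_correctly(src: str) -> bool:
    """Execute the snippet and check its output on one concrete input."""
    namespace = {"json": json}
    try:
        exec(src, namespace)
        return namespace["to_json"]({"b": 1, "a": 2}) == '{"a": 2, "b": 1}'
    except Exception:
        return False


reference_ok = runs_correctly(REFERENCE)
generated_ok = runs_correctly(GENERATED)
```

A similarity threshold would wave the broken snippet through. One executed assertion does not, and a suite that covers every branch applies the same check to each code path.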

The caveats are real. BigCodeBench measures one-shot Pass@1, so it says nothing about how a model performs inside an agentic loop where it can read errors and retry — a setting much closer to how engineers actually use these tools, and where scores tend to climb. The 139-library surface skews toward Python’s data and web ecosystem, so a strong result there won’t transfer cleanly to, say, embedded C or a proprietary internal SDK. And like any static benchmark, it ages: once a dataset is public, contamination creeps into later training runs and headline numbers drift upward for reasons that have nothing to do with real capability. Use it as a floor, not a finish line — pair it with evals built on your own dependencies and your own definition of “correct.”

Read the original: BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions — Hugging Face / BigCode, 2024-06-22.