BaxBench
👋 Overview
BaxBench is a coding benchmark for evaluating the ability of LLMs to generate correct and secure code in realistic, security-critical settings. Each coding task in BaxBench consists of a scenario, which describes the API the backend application should implement, and a framework, which fixes the implementation language and backend framework to use. The scenarios can be found in the scenarios directory, while all supported frameworks are included in the env directory.
For more details and model evaluations, read our paper BaxBench: Can LLMs Generate Correct and Secure Backends? or visit our website.
Assets
- 📜 Paper: BaxBench: Can LLMs Generate Correct and Secure Backends?
- 🏆 Website & Leaderboard: baxbench.com
- 🤗 Dataset: datasets/LogicStar/BaxBench
🚀 Installation
Prerequisites:
- Python 3.12: install it from here.
- Docker: follow the instructions for installing Docker Desktop here (Windows, macOS, Linux) or the Docker Engine here (Linux). Make sure that Docker has root privileges on your machine.
- pipenv: the project uses pipenv for package management. You can install it by following the instructions here.
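As a quick sanity check, you can verify that all prerequisites are available on your PATH before proceeding:
python3 --version   # should report Python 3.12.x
docker --version    # any recent Docker installation with root privileges
pipenv --version    # confirms pipenv is installed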
Setting up the environment and running scripts
After ensuring that all prerequisites are installed, you can set up the environment by running pipenv install from the root of the repository. Please ensure that this action does not change Pipfile.lock. To run any Python script in the project environment, always run it from the project root using:
pipenv run python <path_to_python_script> <args>
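For example, a typical first-time setup might look like the following; the --help flag is only an assumption that main.py exposes a standard argparse interface:
pipenv install                          # create the environment from Pipfile/Pipfile.lock
pipenv run python src/main.py --help    # assumed argparse entry point; lists the available modes and arguments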
Setting API keys
To generate BaxBench task solutions, the repository requires the following API keys to be set as environment variables, e.g., in your .bashrc or the equivalent configuration file of your system:
export OPENAI_API_KEY="<your_API_key>"
export TOGETHER_API_KEY="<your_API_key>"
export ANTHROPIC_API_KEY="<your_API_key>"
export OPENROUTER_API_KEY="<your_API_key>"
Note: You may simply set any API key you do not intend to use to an empty or invalid string.
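For instance, if you only intend to use OpenAI models, a minimal configuration could look like this, with the unused keys left empty as per the note above:
export OPENAI_API_KEY="<your_API_key>"
export TOGETHER_API_KEY=""
export ANTHROPIC_API_KEY=""
export OPENROUTER_API_KEY=""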
💫 Contributing
We welcome contributions from the community. You may contribute by:
- Adding a scenario (a hedged sketch of this workflow follows the note below):
  - Create a new scenario in the scenarios directory. Look at the existing scenarios as an example of what is needed for completeness.
  - Add the scenario to the scenarios list in src/scenarios/__init__.py.
  - Open a pull request to integrate your scenario into the main branch.
- Adding a new framework:
  - Create a new environment in the env directory. Look at the existing environments as an example of what is needed for completeness.
  - Add the environment to the envs list in src/env/__init__.py.
  - Open a pull request to integrate your framework into the main branch.
- Adding tests to a scenario:
  - Open a pull request modifying the given scenario file to add further functionality tests or security exploits.
- Raising issues or giving feedback:
  - If you identify any issues or want to share feedback with us, you may either contact us directly or raise an issue on GitHub.
We are looking forward to working with the community and are extremely thankful for any contributions!
Note: Before contributing code, please run pipenv run pre-commit install in the repository root once to set up the pre-commit hooks.
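As a rough sketch, adding a scenario could follow a workflow like the one below. The file and branch names are illustrative, and it assumes scenario files are Python modules living next to src/scenarios/__init__.py:
cp src/scenarios/<existing_scenario>.py src/scenarios/my_scenario.py   # start from an existing scenario as a template
"$EDITOR" src/scenarios/my_scenario.py                                 # adapt the API description, functionality tests, and security exploits
"$EDITOR" src/scenarios/__init__.py                                    # add the new scenario to the scenarios list
pipenv run pre-commit install                                          # one-time hook setup before committing
git checkout -b add-my-scenario
git add src/scenarios/my_scenario.py src/scenarios/__init__.py
git commit -m "Add my_scenario"                                        # then open a pull request against the main branch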
👨🏻‍💻 Usage
Generating programs
To generate solutions to all scenarios in the scenarios list, run the following command:
pipenv run python src/main.py --models gpt-4o --mode generate --n_samples 10 --temperature 0.4
To restrict the generation to a subset of scenarios or environments, see the "Advanced" section below.
The generated programs and the generation logs will be saved in the results directory.
Testing generated programs
To test your generated solutions, run:
pipenv run python src/main.py --models gpt-4o --mode test --n_samples 10 --temperature 0.4
If you have generated solutions externally, e.g., using our Hugging Face dataset, make sure to place them under the following path relative to the root of this repository:
results/<model_name>/<scenario_id>/<env_id>/temp<t>-<spec_type>-<prompt_type>/sample<s>/code
Then set the corresponding parameters in the testing command accordingly. See "Advanced" below or the argument list in main.py.
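As a quick check that externally generated solutions ended up in the expected layout, you can list the code directories before testing; the gpt-4o model name below is just the one used in the example commands:
ls -d results/gpt-4o/*/*/temp*/sample*/code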
Evaluating and printing
To evaluate your results and print them as a table in your console, run:
pipenv run python src/main.py --models gpt-4o --mode evaluate --n_samples 10 --temperature 0.4
Advanced
Specific models/scenarios/frameworks/samples can be generated, tested, or evaluated by specifying the following arguments in the CLI:
--models
--scenarios
--envs
--only_samples
--safety_prompt
--spec_type
Each of these arguments takes values separated by spaces.
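For example, a generation run restricted to specific scenarios and frameworks could look like the following; the scenario and environment IDs are placeholders, and the available values are listed in src/scenarios/__init__.py and src/env/__init__.py:
pipenv run python src/main.py --models gpt-4o --mode generate --n_samples 10 --temperature 0.4 --scenarios <scenario_id_1> <scenario_id_2> --envs <env_id>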
✍️ Citation
If you find our work helpful, please use the following citation:
@article{vero2025baxbenchllmsgeneratecorrect,
title={BaxBench: Can LLMs Generate Correct and Secure Backends?},
author={Mark Vero and Niels Mündler and Victor Chibotaru and Veselin Raychev and Maximilian Baader and Nikola Jovanović and Jingxuan He and Martin Vechev},
year={2025},
eprint={2502.11844},
archivePrefix={arXiv},
}
📝 License
MIT. Check LICENSE.