
tbench API (tbench.ai)

Query Terminal-Bench leaderboard rankings, accuracy scores, and agent/model metadata across benchmark versions via 2 REST endpoints.


Category: Developer Tools (other) · 2 endpoints · Updated 8d ago

What is the tbench API?

The tbench.ai API exposes 2 endpoints for querying Terminal-Bench leaderboard data, covering ranked agent entries with accuracy scores, model metadata, and verification status. The get_leaderboard endpoint returns filtered, sorted results across multiple benchmark versions including terminal-bench and terminal-bench-science, while list_leaderboards enumerates all available benchmarks with their current status.

Each call costs 1 credit, charged only on success.


Endpoint URL pattern: https://api.parse.bot/scraper/8c0968f8-4e4a-4ec7-97f3-48511c29d583/<endpoint>


Call it over HTTP (grab a free API key at signup):

curl -X GET 'https://api.parse.bot/scraper/8c0968f8-4e4a-4ec7-97f3-48511c29d583/get_leaderboard' \
  -H "X-API-Key: $PARSE_API_KEY"
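If you would rather not depend on `requests`, the same GET can be issued with the Python standard library. This is a minimal sketch, assuming that `get_leaderboard` accepts its parameters as a query string (as the Python client's GET path does) and that the key travels in the `X-API-Key` header shown in the curl call; `build_request` and `fetch` are hypothetical helper names, not part of any Parse SDK.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base URL taken from the endpoint pattern on this page.
BASE = "https://api.parse.bot/scraper/8c0968f8-4e4a-4ec7-97f3-48511c29d583"


def build_request(endpoint: str, api_key: str, **params) -> Request:
    """Build (but do not send) a GET request for one endpoint."""
    # Drop unset optional parameters so they are not sent as the string "None".
    query = {k: v for k, v in params.items() if v is not None}
    url = f"{BASE}/{endpoint}"
    if query:
        url += "?" + urlencode(query)
    return Request(url, headers={"X-API-Key": api_key}, method="GET")


def fetch(endpoint: str, **params) -> dict:
    """Send the request using PARSE_API_KEY from the environment and decode JSON."""
    req = build_request(endpoint, os.environ["PARSE_API_KEY"], **params)
    with urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `fetch("get_leaderboard", benchmark="terminal-bench", version="2.0", limit=10)` makes one call (one credit). Keeping request construction separate from sending lets you check the URL logic offline.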

Python SDK · recommended

Typed, relational, agent-ready

A generated client with real types, enums, and the links between objects — the structure a flat JSON response can't carry. Autocompletes in your editor and reads cleanly to coding agents.

Fully typed · autocompletes
Objects link to objects
Typed errors & pagination

Typed Python client. Set up the SDK in your uv project, then pull this API’s typed client:

uv add parse-sdk
uv run parse init
uv run parse add --marketplace tbench-ai-api

uv run parse add --marketplace pulls a pinned snapshot of this canonical API — it won’t change underneath you. To customize it, subscribe and swap to your own copy.

"""
Terminal-Bench Leaderboard API Client
Get your API key from: https://parse.bot/settings
"""

import os
import requests
from typing import Optional, List, Dict, Any


class ParseClient:
    """Client for interacting with the Terminal-Bench Leaderboard API via Parse."""

    def __init__(self, api_key: Optional[str] = None):
        """
        Initialize the Parse API client.

        Args:
            api_key: API key for authentication. If not provided, reads from PARSE_API_KEY env var.
        """
        self.base_url = "https://api.parse.bot"
        self.scraper_id = "8c0968f8-4e4a-4ec7-97f3-48511c29d583"
        self.api_key = api_key or os.getenv("PARSE_API_KEY")

        if not self.api_key:
            raise ValueError(
                "API key not provided. Set PARSE_API_KEY environment variable or pass api_key parameter."
            )

    def _call(
        self, endpoint: str, method: str = "POST", **params
    ) -> Dict[str, Any]:
        """
        Make a request to the Parse API.

        Args:
            endpoint: The endpoint name to call
            method: HTTP method (GET or POST)
            **params: Parameters to pass to the endpoint

        Returns:
            Response data as dictionary

        Raises:
            requests.RequestException: If the API request fails
        """
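        url = f"{self.base_url}/scraper/{self.scraper_id}/{endpoint}"
        # requests sets Content-Type itself when a JSON body is sent
        headers = {"X-API-Key": self.api_key}

        # GET sends params as a query string, POST as a JSON body; a timeout
        # keeps a stalled connection from hanging the caller indefinitely.
        if method == "GET":
            response = requests.get(url, headers=headers, params=params, timeout=30)
        elif method == "POST":
            response = requests.post(url, headers=headers, json=params, timeout=30)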
        else:
            raise ValueError(f"Unsupported HTTP method: {method}")

        response.raise_for_status()
        return response.json()

    def list_leaderboards(self) -> List[Dict[str, Any]]:
        """
        List all available Terminal-Bench leaderboards.

        Returns:
            List of leaderboard objects with benchmark name, version, path, and status
        """
        result = self._call("list_leaderboards", method="GET")
        return result.get("leaderboards", [])

    def get_leaderboard(
        self,
        benchmark: str = "terminal-bench",
        version: str = "2.0",
        search: Optional[str] = None,
        agent: Optional[str] = None,
        model: Optional[str] = None,
        verified_only: str = "false",
        limit: int = 50,
    ) -> Dict[str, Any]:
        """
        Get leaderboard entries for a specific Terminal-Bench benchmark version.

        Args:
            benchmark: Benchmark name (e.g., 'terminal-bench', 'terminal-bench-science')
            version: Benchmark version (e.g., '2.0', '3.0')
            search: Free-text search filter matching agent name, organization, or model names
            agent: Filter by agent name (case-insensitive substring match)
            model: Filter by model name (case-insensitive substring match)
            verified_only: Return only verified entries when set to 'true'
            limit: Maximum number of entries to return

        Returns:
            Dictionary containing benchmark info, total entries, filtered entries count, and entries list
        """
        params = {
            "benchmark": benchmark,
            "version": version,
            "verified_only": verified_only,
            "limit": limit,
        }

        # Only add optional parameters if provided
        if search:
            params["search"] = search
        if agent:
            params["agent"] = agent
        if model:
            params["model"] = model

        return self._call("get_leaderboard", method="GET", **params)


def main():
    """Demonstrate a practical workflow with the Terminal-Bench Leaderboard API."""
    # Initialize the client
    client = ParseClient()

    print("=" * 70)
    print("Terminal-Bench Leaderboard API - Practical Workflow")
    print("=" * 70)

    # Step 1: List all available leaderboards
    print("\n1. Fetching all available leaderboards...")
    leaderboards = client.list_leaderboards()
    print(f"   Found {len(leaderboards)} leaderboards:")
    for lb in leaderboards:
        status = lb.get("status", "unknown")
        print(
            f"   - {lb['benchmark']} v{lb['version']} (Status: {status}) - {lb['path']}"
        )

    # Step 2: Get the main leaderboard (terminal-bench v2.0)
    print("\n2. Fetching terminal-bench v2.0 leaderboard (top 10)...")
    leaderboard_data = client.get_leaderboard(
        benchmark="terminal-bench", version="2.0", limit=10
    )
    print(
        f"   Benchmark: {leaderboard_data['benchmark']} v{leaderboard_data['version']}"
    )
    print(f"   Total entries: {leaderboard_data['total_entries']}")
    print(f"   Showing: {leaderboard_data['filtered_entries']} entries\n")

    # Display top entries
    print("   Top 10 Leaderboard Entries:")
    print("   " + "-" * 66)
    print(
        f"   {'Rank':>4} | {'Agent':<15} | {'Model':<15} | {'Accuracy':>8} | {'Verified':^8} | Org"
    )
    print("   " + "-" * 66)
    for entry in leaderboard_data["entries"]:
        rank = entry["rank"]
        agent = entry["agent"][:15].ljust(15)
        model = entry["model"][0][:15].ljust(15) if entry["model"] else "N/A".ljust(15)
        accuracy = f"{entry['accuracy']:.1%}".rjust(8)
        verified = "✓" if entry["verified"] else " "
        # agent_organization may be missing or null; fall back to "unknown"
        org = (entry.get("agent_organization") or "unknown")[:20]
        print(
            f"   {rank:4d} | {agent} | {model} | {accuracy} | {verified:^8} | {org}"
        )

    # Step 3: Search for GPT models
    print("\n3. Searching for entries using GPT models (top 5)...")
    gpt_results = client.get_leaderboard(
        benchmark="terminal-bench", version="2.0", model="gpt", limit=5
    )
    print(f"   Found {gpt_results['filtered_entries']} entries with GPT models:")
    for entry in gpt_results["entries"]:
        models = ", ".join(entry.get("model") or []) or "N/A"
        print(f"   - Rank {entry['rank']}: {entry['agent']} with {models}")

    # Step 4: Filter by verified entries only
    print("\n4. Fetching only verified entries (top 5)...")
    verified_results = client.get_leaderboard(
        benchmark="terminal-bench", version="2.0", verified_only="true", limit=5
    )
    print(f"   Found {verified_results['filtered_entries']} verified entries:")
    for entry in verified_results["entries"]:
        print(
            f"   - Rank {entry['rank']}: {entry['agent']} - Accuracy: {entry['accuracy']:.1%}"
        )

    # Step 5: Analyze the results
    print("\n5. Analysis of the leaderboard:")
    all_entries = leaderboard_data["entries"]
    if all_entries:
        # Calculate average accuracy
        accuracies = [e["accuracy"] for e in all_entries]
        avg_accuracy = sum(accuracies) / len(accuracies)
        max_accuracy = max(accuracies)
        min_accuracy = min(accuracies)

        # Count unique models
        all_models = set()
        for entry in all_entries:
            for model in entry.get("model") or []:
                all_models.add(model.lower())

        # Count verified entries
        verified_count = sum(1 for e in all_entries if e.get("verified", False))

        print(f"   - Average accuracy: {avg_accuracy:.1%}")
        print(f"   - Top accuracy: {max_accuracy:.1%}")
        print(f"   - Lowest accuracy (in top 10): {min_accuracy:.1%}")
        print(f"   - Unique models used: {len(all_models)}")
        print(f"   - Sample models: {', '.join(sorted(all_models)[:5])}")
        print(f"   - Verified entries in top 10: {verified_count}")

    # Step 6: Try the science benchmark if available
    print("\n6. Checking terminal-bench-science v1.0...")
    try:
        science_results = client.get_leaderboard(
            benchmark="terminal-bench-science", version="1.0", limit=5
        )
        if science_results["filtered_entries"] > 0:
            print(f"   Found {science_results['filtered_entries']} entries:")
            for entry in science_results["entries"]:
                print(
                    f"   - {entry['agent']}: {entry['accuracy']:.1%} accuracy"
                )
        else:
            print("   No entries found in terminal-bench-science v1.0")
    except Exception as e:
        print(f"   Could not fetch terminal-bench-science: {e}")

    print("\n" + "=" * 70)
    print("Workflow completed successfully!")
    print("=" * 70)


if __name__ == "__main__":
    main()

All endpoints (2 total)

get_leaderboard

Get leaderboard entries for a specific Terminal-Bench benchmark version. Returns ranked entries sorted by accuracy descending, with optional filtering by search term, agent name, model name, and verified status.

Input

Param	Type	Description
agent	string	Filter by agent name (case-insensitive substring match).
limit	integer	Maximum number of entries to return.
model	string	Filter by model name (case-insensitive substring match against any model in the entry's model list).
search	string	Free-text search filter. Matches against agent name, agent organization, or model names (case-insensitive).
version	string	Benchmark version. Known values for terminal-bench: 1.0, 2.0, 2.1, 3.0. Known values for terminal-bench-science: 1.0.
benchmark	string	Benchmark name. Known values: terminal-bench, terminal-bench-science.
verified_only	string	When set to 'true', returns only verified entries.

Response

{
  "type": "object",
  "fields": {
    "entries": "array of leaderboard entry objects with rank, agent, model, accuracy, and metadata",
    "version": "string",
    "benchmark": "string",
    "total_entries": "integer",
    "filtered_entries": "integer"
  },
  "sample": {
    "entries": [
      {
        "key": "nexau-ahe__gpt-5.5",
        "date": "2026-05-14",
        "rank": 1,
        "agent": "NexAU-AHE",
        "model": [
          "GPT-5.5"
        ],
        "stderr": 0.0107,
        "accuracy": 0.847,
        "verified": false,
        "agent_url": "https://github.com/china-qijizhifeng/agentic-harness-engineering.git",
        "agent_name": "nexau",
        "model_names": [
          "gpt-5.5"
        ],
        "agent_version": "unknown",
        "model_providers": [
          "openai"
        ],
        "agent_organization": "china-qijizhifeng",
        "integration_method": "API",
        "model_organization": [
          "OpenAI"
        ]
      }
    ],
    "version": "2.0",
    "benchmark": "terminal-bench",
    "total_entries": 142,
    "filtered_entries": 142
  }
}
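The sample above can be mapped onto typed records so that missing or renamed fields fail loudly at parse time. This is a minimal sketch, assuming only the field names shown in the sample; which fields are always present is not documented, so the optional ones (`stderr`, `agent_organization`) are treated as nullable here. `LeaderboardEntry`, `LeaderboardPage`, and `parse_page` are illustrative names, not the Parse SDK's generated types.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class LeaderboardEntry:
    rank: int
    agent: str
    models: List[str]            # the response's "model" field is a list
    accuracy: float              # fraction, e.g. 0.847
    stderr: Optional[float]
    verified: bool
    agent_organization: Optional[str]


@dataclass(frozen=True)
class LeaderboardPage:
    benchmark: str
    version: str
    total_entries: int
    filtered_entries: int
    entries: List[LeaderboardEntry]


def parse_page(payload: Dict[str, Any]) -> LeaderboardPage:
    """Convert a get_leaderboard JSON payload into typed records (subset of fields)."""
    entries = [
        LeaderboardEntry(
            rank=int(e["rank"]),
            agent=e["agent"],
            models=list(e.get("model") or []),
            accuracy=float(e["accuracy"]),
            stderr=e.get("stderr"),
            verified=bool(e.get("verified", False)),
            agent_organization=e.get("agent_organization"),
        )
        for e in payload.get("entries", [])
    ]
    return LeaderboardPage(
        benchmark=payload["benchmark"],
        version=payload["version"],
        total_entries=int(payload["total_entries"]),
        filtered_entries=int(payload["filtered_entries"]),
        entries=entries,
    )
```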

About the tbench API

What the API Returns

The get_leaderboard endpoint returns an array of ranked leaderboard entries sorted by accuracy descending. Each entry includes the agent name, associated models, accuracy score, rank, and metadata such as organization and verification status. The response also surfaces total_entries and filtered_entries counts alongside the queried benchmark and version fields, so callers always know the scope of the result set.
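Because entries arrive sorted by accuracy, a caller can sanity-check the ordering and, using the `stderr` field from the sample, judge whether two neighboring ranks differ meaningfully. The sketch assumes `stderr` is the standard error of `accuracy` (the page does not say so) and uses the usual normal-approximation 95% interval, accuracy ± 1.96 × stderr. For the sample entry that gives 0.847 ± 0.021, roughly 0.826 to 0.868. Overlapping intervals are only a rough heuristic, not a formal significance test, and the helper names are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def is_sorted_by_accuracy(entries: List[Dict[str, Any]]) -> bool:
    """True if accuracy never increases from one entry to the next."""
    accs = [e["accuracy"] for e in entries]
    return all(a >= b for a, b in zip(accs, accs[1:]))


def confidence_interval(entry: Dict[str, Any], z: float = 1.96) -> Tuple[float, float]:
    """accuracy ± z * stderr, assuming stderr is the standard error of accuracy."""
    half_width = z * entry["stderr"]
    return entry["accuracy"] - half_width, entry["accuracy"] + half_width


def overlaps(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    """True if the two entries' intervals intersect (rough 'too close to call' check)."""
    lo_a, hi_a = confidence_interval(a)
    lo_b, hi_b = confidence_interval(b)
    return lo_a <= hi_b and lo_b <= hi_a
```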

Filtering and Versioning

Results from get_leaderboard can be narrowed with several optional parameters. The search param matches case-insensitively against agent name, agent organization, and model names in one pass. The agent and model params each perform a case-insensitive substring match on their own field only. Setting verified_only to 'true' restricts results to verified entries. The version parameter accepts known values 1.0, 2.0, 2.1, and 3.0 for terminal-bench and 1.0 for terminal-bench-science, while the benchmark parameter switches between terminal-bench and terminal-bench-science. The limit parameter caps the number of returned entries.
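To make the difference between `search` and the field-specific filters concrete, here is a client-side emulation of the filtering semantics described above. It is a hypothetical sketch, not the server's code: it assumes all supplied filters combine with AND, that `agent` matches the display field `agent` (the sample also carries `agent_name`), that `limit` is applied after filtering, and it takes `verified_only` as a Python bool rather than the API's `'true'` string.

```python
from typing import Any, Dict, Iterable, List, Optional


def _contains(haystack: Optional[str], needle: str) -> bool:
    """Case-insensitive substring test that tolerates missing fields."""
    return haystack is not None and needle.lower() in haystack.lower()


def matches(entry: Dict[str, Any], search: Optional[str] = None,
            agent: Optional[str] = None, model: Optional[str] = None,
            verified_only: bool = False) -> bool:
    models: List[str] = entry.get("model") or []
    if search is not None and not (
        _contains(entry.get("agent"), search)
        or _contains(entry.get("agent_organization"), search)
        or any(_contains(m, search) for m in models)
    ):
        return False  # search looks at agent, organization, and every model name
    if agent is not None and not _contains(entry.get("agent"), agent):
        return False  # agent looks only at the agent name
    if model is not None and not any(_contains(m, model) for m in models):
        return False  # model matches if any listed model contains the substring
    if verified_only and not entry.get("verified", False):
        return False
    return True


def apply_filters(entries: Iterable[Dict[str, Any]], limit: Optional[int] = None,
                  **filters: Any) -> List[Dict[str, Any]]:
    """Keep matching entries in their original (accuracy-sorted) order, then cap."""
    hits = [e for e in entries if matches(e, **filters)]
    return hits if limit is None else hits[:limit]
```

With the sample entry, `search="qiji"` matches through `agent_organization`, while `agent="qiji"` does not: that is the practical difference between the broad and the field-specific filters.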

Discovering Available Leaderboards

The list_leaderboards endpoint takes no inputs and returns all available leaderboards as an array of objects containing benchmark name, version, path, and status. Status values indicate whether a leaderboard is live or in progress, which helps callers determine which leaderboard versions have finalized rankings versus those still accumulating submissions.
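A status dashboard typically wants the newest live version of each benchmark. The sketch below assumes each `list_leaderboards` object has `benchmark`, `version`, and `status` keys and that a finalized leaderboard reports the literal status `"live"`; check the real status strings before relying on it. Versions are compared numerically so that a hypothetical `2.10` would sort after `2.9`, which plain string comparison gets wrong. `latest_live` and `version_key` are illustrative names.

```python
from typing import Any, Dict, List, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """'2.10' -> (2, 10), so versions compare numerically rather than as strings."""
    return tuple(int(part) for part in version.split("."))


def latest_live(leaderboards: List[Dict[str, Any]], live_status: str = "live") -> Dict[str, str]:
    """Newest version per benchmark among leaderboards whose status equals live_status."""
    latest: Dict[str, str] = {}
    for lb in leaderboards:
        if str(lb.get("status", "")).lower() != live_status:
            continue  # skip in-progress (or unknown-status) leaderboards
        bench, version = lb["benchmark"], lb["version"]
        if bench not in latest or version_key(version) > version_key(latest[bench]):
            latest[bench] = version
    return latest
```

Comparing the result of successive polls against the previous one is enough to notice when a new version goes live.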

Reliability & maintenance

The tbench API is a managed, monitored endpoint for tbench.ai — not a raw scraper you maintain. Every endpoint is automatically health-checked on a schedule, and when tbench.ai changes and a check fails, the API is automatically queued for repair and re-verified. It is built to keep working as the site underneath it changes.

This isn't an official tbench.ai API — it's an independent, maintained REST wrapper over public data. Where the source has no official API (or only a limited one), Parse gives you a stable contract over a source that never promised one, and keeps it current. Need a new endpoint or field? You can revise it yourself in plain English and the agent rebuilds it against the live site in minutes — contributing the change back to the shared API is free.

Will this API break when the source site changes?

It's built not to. Every endpoint is health-checked on a schedule with automated test probes. When the source site changes and a check fails, the API is automatically queued for repair and re-verified — that's the self-healing layer. Each API page shows when its endpoints were last verified. And because marketplace APIs are shared, any fix reaches everyone using it.

Is this an official API from the source site?

No — Parse APIs are independent, managed REST wrappers over publicly available data. That is the point: where a site has no official API (or only a limited one), Parse gives you a maintained, monitored endpoint for that data and keeps it working as the site changes — so you get a stable contract over a source that never promised one.

Can I fix or extend this API myself if I need a new endpoint or field?

Yes — and you don't have to wait on us. This API was generated by the Parse agent, which stays attached. Describe the change in plain English ("add an endpoint that returns reviews", "fix the price field") in the revise box on the API page or via the revise_api MCP tool, and the agent rebuilds it against the live site in minutes. Contributing the change back to the public API is free.

What happens if I call an endpoint that has an issue?

Errors are machine-readable: a bad call returns a clean status with the list of available endpoints and a repair hint, so an agent (or you) can recover or trigger a fix instead of failing silently. Confirmed failures feed the automatic repair queue.

Common use cases

Track how a specific agent's accuracy rank shifts across Terminal-Bench versions 1.0 through 3.0
Filter the terminal-bench-science leaderboard to surface only verified agent entries
Compare model performance by querying get_leaderboard with the model substring filter
Build a status dashboard that polls list_leaderboards to detect when a new benchmark version goes live
Aggregate accuracy distributions across all entries in a given benchmark version for statistical analysis
Monitor an organization's agents by using the search param to match against agent organization name
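The first use case, tracking an agent's rank across versions, reduces to one `get_leaderboard` call per version plus a small fold over the results. In this sketch `rank_history` is a hypothetical helper; it assumes the `rank` field is the entry's position on the full leaderboard and takes the best rank when an agent has several entries (for example, one per model).

```python
from typing import Any, Dict, List, Optional


def rank_history(pages: Dict[str, List[Dict[str, Any]]], agent: str) -> Dict[str, Optional[int]]:
    """Best (lowest) rank per version for entries whose agent name contains `agent`.

    `pages` maps a version string to that version's `entries` list; versions
    where the agent does not appear map to None.
    """
    needle = agent.lower()
    history: Dict[str, Optional[int]] = {}
    for version, entries in pages.items():
        ranks = [e["rank"] for e in entries if needle in (e.get("agent") or "").lower()]
        history[version] = min(ranks) if ranks else None
    return history
```

With the `ParseClient` above, `pages` could be built as `{v: client.get_leaderboard(version=v, limit=200)["entries"] for v in ("1.0", "2.0", "2.1", "3.0")}`, choosing a `limit` large enough to cover each board; that is four calls, so four credits.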

Pricing & limits

Tier	Price	Credits/month	Rate limit
Free	$0/mo	100	5 req/min
Hobby	$30/mo	1,000	20 req/min
Developer	$100/mo	5,000	100 req/min
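Since one credit is one successful call, choosing a tier is mostly arithmetic. For instance, polling `list_leaderboards` once an hour for a 30-day month is 30 × 24 = 720 calls, which exceeds the Free tier's 100 credits but fits Hobby's 1,000. The sketch encodes the table above; the 30-day month and the helper names `calls_per_month` and `cheapest_tier` are assumptions for illustration.

```python
import math
from typing import Dict, Optional

# Figures copied from the pricing table above.
TIERS: Dict[str, Dict[str, int]] = {
    "Free": {"price_usd": 0, "credits": 100, "rate_per_min": 5},
    "Hobby": {"price_usd": 30, "credits": 1_000, "rate_per_min": 20},
    "Developer": {"price_usd": 100, "credits": 5_000, "rate_per_min": 100},
}


def calls_per_month(interval_minutes: float, calls_per_poll: int = 1, days: int = 30) -> int:
    """Successful calls needed to poll every `interval_minutes` for `days` days."""
    polls = math.ceil(days * 24 * 60 / interval_minutes)
    return polls * calls_per_poll


def cheapest_tier(monthly_calls: int) -> Optional[str]:
    """Cheapest listed tier whose monthly credits cover `monthly_calls`, or None."""
    for name, tier in sorted(TIERS.items(), key=lambda kv: kv[1]["price_usd"]):
        if tier["credits"] >= monthly_calls:
            return name
    return None
```

`cheapest_tier(calls_per_month(60))` gives `"Hobby"`, while daily polling, `calls_per_month(24 * 60)`, is 30 calls and fits `"Free"`. At these polling rates the per-minute limits never come into play.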

One credit = one API call regardless of which marketplace API you call. Exceeding the rate limit returns a 429 response. Authenticate with the X-API-Key header.
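To stay under the per-minute limit and recover from a 429, a client can space calls at least 60 / limit seconds apart (12 s on the Free tier's 5 req/min) and back off exponentially when throttled. This is a sketch under assumptions: the page does not say whether a `Retry-After` header accompanies the 429, so it is honored only when present as whole seconds, and `send` is a hypothetical zero-argument callable returning `(status, headers, body)` that you would wrap around `requests` or the client above. Injecting `sleep` and `clock` keeps the logic testable without real waiting.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# (status_code, headers, body) as returned by the hypothetical `send` callable.
Response = Tuple[int, Dict[str, str], bytes]


class Throttle:
    """Enforces a minimum gap between calls derived from a requests-per-minute limit."""

    def __init__(self, rate_per_min: int,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = 60.0 / rate_per_min  # e.g. 5 req/min -> 12.0 s
        self._sleep = sleep
        self._clock = clock
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now += remaining
        self._last = now


def call_with_retry(send: Callable[[], Response], throttle: Throttle,
                    max_retries: int = 4, base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send(), retrying only on HTTP 429; returns the last response."""
    for attempt in range(max_retries + 1):
        throttle.wait()
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt < max_retries:
            # Honor Retry-After (whole seconds) if present, else exponential backoff.
            retry_after = headers.get("Retry-After", "")
            delay = float(retry_after) if retry_after.isdigit() else base_delay * 2 ** attempt
            sleep(delay)
    return status, headers, body
```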

Frequently asked questions

Does tbench.ai offer an official developer API?

tbench.ai does not currently publish a documented public developer API. This Parse API provides structured programmatic access to the leaderboard data available at tbench.ai.

What does the `get_leaderboard` endpoint return for each entry?

Each entry in the entries array includes the agent's rank, agent name, associated models, accuracy score, and metadata such as organization and verification status. The response also includes top-level fields for benchmark, version, total_entries, and filtered_entries.

Can I retrieve individual submission details or per-task breakdowns beyond the aggregate accuracy score?

Not currently. The API surfaces aggregate accuracy and metadata at the entry level; per-task or per-category score breakdowns are not part of the response schema. You can fork the API on Parse and revise it to add an endpoint targeting that finer-grained data if it becomes available.

How does the `search` parameter differ from the `agent` and `model` filters?

The search parameter matches case-insensitively against the agent name, agent organization, and all model names in a single pass. The agent and model params each target only their respective field. Using search is useful for broad lookups; using agent or model is better when you want to isolate a specific dimension.

Are historical leaderboard versions accessible, or only the latest?

Both. The benchmark and version parameters on get_leaderboard let you query any known version, including older ones like 1.0 and 2.0 for terminal-bench. The list_leaderboards endpoint shows all versions and their statuses, so you can discover which historical snapshots are available before querying.

Page content last updated June 15, 2026. Spec covers 2 endpoints from tbench.ai.

Related APIs in Developer Tools

crt.sh API

Search for SSL/TLS certificates across public transparency logs by domain, fingerprint, serial number, or public key, and retrieve detailed certificate information including issuer, validity dates, and certificate chain details. Monitor certificate issuance for domains you care about to track security changes and detect unauthorized certificates.

artificialanalysis.ai API

Compare and rank LLM models and providers across performance benchmarks, then dive into detailed specifications for any model to find the best fit for your needs. Discover performance metrics for specialized AI systems handling speech, images, and video, plus benchmark data for different hardware configurations.

python.org API

Access comprehensive Python release information including downloads, versions, and supported operating systems, plus stay updated with the latest Python news and events. Search across Python.org's resources and browse release files, details, and the FTP index all in one place.

nvidia.com API

alienvault.com API

Search and analyze global threat intelligence data including indicators of compromise, threat pulses, and adversary profiles from the Open Threat Exchange community. Monitor recent security alerts and access detailed information about threats and adversaries to strengthen your cybersecurity defenses.

lucide.dev API

Browse and download thousands of Lucide icons with instant search and category filtering to find exactly what you need. Get SVG files and metadata for each icon to integrate them seamlessly into your projects.

theresanaiforthat.com API

Search and discover AI tools across different tasks, get detailed information about specific tools, browse available deals, and stay updated on the latest tools. Find the perfect AI solution for your needs by filtering by task category or checking featured and trending tools.

dataforseo.com API

Monitor top-performing websites and trending keywords on Google while tracking SERP volatility to stay ahead of SEO trends. Get real-time insights into ranking keywords, search demand patterns, and search engine result page changes to inform your SEO strategy.