NanoGPT Evals and Observability API
NanoGPT Evals lets you run durable prompt and model experiments, freeze datasets, version scorers, inspect traces, and aggregate latency, cost, token, error, and score trends. The API is available under /api/v1/evals/*.
The same platform powers Prompt Lab at /prompt-lab.
Core Concepts
Projects
A project groups datasets, experiments, traces, and dashboard metrics. If an experiment is created without a project_id, NanoGPT uses a default Prompt Lab project for the authenticated user.
Datasets and Dataset Versions
A dataset is an editable set of eval rows. A dataset version is a frozen snapshot of the dataset at a point in time. Experiments should use dataset versions when reproducibility matters.
Dataset rows support:
{
"input": "User prompt or task",
"expected_output": "Optional target answer",
"context": "Optional extra context",
"metadata": {
"case": "optional structured metadata"
}
}
Scorers
A scorer grades candidate outputs. Scorers are versioned, so an experiment stores the exact scorer snapshot used at run time.
Supported scorer types:
exact_match
contains
regex
json_schema
threshold
llm_judge
pairwise_llm
This version does not run arbitrary JavaScript or Python scorers.
Experiments
An experiment compares one or more candidates over a dataset or inline rows. Experiments are asynchronous and durable.
Experiment statuses:
queued
in_progress
completed
failed
cancelled
Experiments store candidate prompt, model, and config snapshots, scorer snapshots, progress, traces, scores, errors, and cost and usage metadata.
Traces
A trace records a generation or scoring call. Traces store metadata, timings, usage, cost, status, and errors. Prompt and output content is not stored unless explicitly requested.
Privacy Defaults
By default, traces are metadata-only.
NanoGPT stores prompt and output content only when:
- a user explicitly saves Prompt Lab dataset or experiment content
- an API caller passes a content-storage opt-in such as nanogpt_eval_store_content: true
Requested content storage can be suppressed when content is not safe or not available to store. When suppression happens, trace metadata includes content_suppressed_reason.
Current suppression reasons include:
| Reason | Description |
| --- | --- |
| pii_redaction_enabled | Redaction was enabled for the request. |
| output_content_unavailable | Content storage was requested, but no output text was available to persist. |
Authentication
Use the same authentication as the NanoGPT API. For API callers, pass your API key in the Authorization header:
Authorization: Bearer $NANOGPT_API_KEY
All eval objects are scoped to the authenticated session, team, and API key context.
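For example, a minimal Python sketch (using the requests library) that sends the same header to the documented list-projects endpoint:

import os
import requests

API_BASE = "https://nano-gpt.com/api/v1/evals"
HEADERS = {
    "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",
    "Content-Type": "application/json",
}

# List projects visible to this API key (same scoping as the session/team context).
response = requests.get(f"{API_BASE}/projects", headers=HEADERS)
response.raise_for_status()
print(response.json())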
Limits
Current experiment limits:
- up to 100 eval items per dataset or inline run
- up to 5 candidates per experiment
- up to 10 scorers per experiment
- up to 100 generation and scoring work units per experiment
Work units are calculated as:
items * candidates * max(1, scorer_count + 1)
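As an illustration of the formula above, a small pre-flight sketch that estimates work units and checks them against the current limits before submitting an experiment (the constants come from the list above; the helper functions are hypothetical):

# Illustrative pre-flight check for the documented experiment limits.
MAX_ITEMS = 100
MAX_CANDIDATES = 5
MAX_SCORERS = 10
MAX_WORK_UNITS = 100

def work_units(items: int, candidates: int, scorer_count: int) -> int:
    # items * candidates * max(1, scorer_count + 1), as documented above
    return items * candidates * max(1, scorer_count + 1)

def check_limits(items: int, candidates: int, scorer_count: int) -> None:
    if items > MAX_ITEMS or candidates > MAX_CANDIDATES or scorer_count > MAX_SCORERS:
        raise ValueError("dataset, candidate, or scorer count exceeds documented limits")
    units = work_units(items, candidates, scorer_count)
    if units > MAX_WORK_UNITS:
        raise ValueError(f"{units} work units exceeds the {MAX_WORK_UNITS} unit limit")

check_limits(items=10, candidates=2, scorer_count=1)  # 10 * 2 * 2 = 40 units, OK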
Eval run and item rate limits are applied to normal runs and reruns. Rate-limited responses return HTTP 429 with Retry-After.
Object Shapes
Project
{
"id": "project_...",
"object": "eval.project",
"name": "Support Bot",
"description": "Support prompt evaluation",
"settings": {},
"created_at": "2026-05-15T12:00:00.000Z",
"updated_at": "2026-05-15T12:00:00.000Z"
}
Dataset Version
{
"id": "datasetv_...",
"object": "eval.dataset_version",
"dataset_id": "evaldataset_...",
"version": 3,
"item_count": 25,
"items": [
{
"id": "evalitem_...",
"dataset_item_id": "evalitem_...",
"input": "Explain rate limits",
"expected_output": "Mentions quotas and retry behavior",
"context": null,
"metadata": { "topic": "billing" },
"metadata_index": 0
}
],
"source": "manual",
"created_at": "2026-05-15T12:00:00.000Z"
}
Scorer
{
"id": "scorer_...",
"object": "eval.scorer",
"version_id": "scorerv_...",
"version": 1,
"name": "Helpful Judge",
"description": "Scores helpfulness from 0 to 1",
"scorer_type": "llm_judge",
"config": {},
"prompt": "Evaluate the response. Return {\"score\": number, \"reasoning\": string}.",
"judge_model": "openai/gpt-5.4-mini",
"created_at": "2026-05-15T12:00:00.000Z"
}
Experiment
{
"id": "experiment_...",
"object": "eval.experiment",
"project_id": "project_...",
"name": "Support answer comparison",
"description": null,
"dataset_version_id": "datasetv_...",
"status": "in_progress",
"progress": {
"total_items": 10,
"total_traces": 20,
"completed_traces": 8,
"failed_traces": 0,
"total_scores": 40,
"completed_scores": 12,
"failed_scores": 0
},
"candidates": [],
"scorers": [],
"settings": {
"store_content": true,
"redaction": false
},
"error": null,
"created_at": "2026-05-15T12:00:00.000Z",
"started_at": "2026-05-15T12:00:02.000Z",
"completed_at": null,
"cancelled_at": null,
"expires_at": "2026-06-14T12:00:00.000Z"
}
Trace
{
"id": "trace_...",
"object": "eval.trace",
"project_id": "project_...",
"experiment_id": "experiment_...",
"experiment_item_id": "experimentitem_...",
"trace_type": "generation",
"source": "prompt_lab_experiment",
"group_id": "experiment_...",
"parent_trace_id": null,
"status": "completed",
"model": "openai/gpt-5.4-mini",
"provider": "nanogpt",
"store_content": false,
"input_content": null,
"output_content": null,
"metadata": {
"candidate_id": "candidate_1"
},
"usage": {
"prompt_tokens": 120,
"completion_tokens": 80
},
"cost_usd": 0.0004,
"latency_ms": 1200,
"error": null,
"started_at": "2026-05-15T12:00:00.000Z",
"completed_at": "2026-05-15T12:00:01.200Z",
"expires_at": "2026-06-14T12:00:00.000Z"
}
Quick Start
1. Create a dataset
curl -X POST "https://nano-gpt.com/api/v1/evals/datasets" \
-H "Authorization: Bearer $NANOGPT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Support QA",
"items": [
{
"input": "Explain API rate limits to a non-technical founder.",
"expected_output": "Mentions quotas, retry behavior, and practical next steps.",
"metadata": { "topic": "api" }
}
]
}'
2. Freeze a dataset version
curl -X POST "https://nano-gpt.com/api/v1/evals/datasets/evaldataset_abc123/versions" \
-H "Authorization: Bearer $NANOGPT_API_KEY"
3. Create a scorer
curl -X POST "https://nano-gpt.com/api/v1/evals/scorers" \
-H "Authorization: Bearer $NANOGPT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Helpful Judge",
"scorer_type": "llm_judge",
"judge_model": "openai/gpt-5.4-mini",
"prompt": "Evaluate whether the response is helpful. Input: {{input}}\nResponse: {{output}}\nExpected: {{expected_output}}\nReturn only JSON: {\"score\": number, \"reasoning\": string}."
}'
4. Create an async experiment
curl -X POST "https://nano-gpt.com/api/v1/evals/experiments" \
-H "Authorization: Bearer $NANOGPT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Support QA prompt comparison",
"dataset_version_id": "datasetv_abc123",
"candidates": [
{
"id": "baseline",
"name": "Baseline",
"model": "openai/gpt-5.4-mini",
"system": "You are concise and practical.",
"prompt": "{{input}}"
},
{
"id": "detailed",
"name": "Detailed",
"model": "openai/gpt-5.4-mini",
"system": "You are clear, practical, and include examples.",
"prompt": "{{input}}"
}
],
"scorer_ids": ["scorer_abc123"],
"settings": {
"store_content": true,
"redaction": false
}
}'
The response returns an experiment with status queued or in_progress. Poll the experiment until status is terminal.
5. Poll the experiment
curl "https://nano-gpt.com/api/v1/evals/experiments/experiment_abc123" \
-H "Authorization: Bearer $NANOGPT_API_KEY"
6. Read output items
curl "https://nano-gpt.com/api/v1/evals/experiments/experiment_abc123/output_items" \
-H "Authorization: Bearer $NANOGPT_API_KEY"
API Reference
Projects
List projects
GET /api/v1/evals/projects
Returns:
{
"object": "list",
"data": []
}
Create project
POST /api/v1/evals/projects
Body:
{
"name": "Support Bot",
"description": "Optional description",
"settings": {}
}
Get project
GET /api/v1/evals/projects/{project_id}
Update project
PATCH /api/v1/evals/projects/{project_id}
Body fields:
{
"name": "New name",
"description": "New description",
"settings": {}
}
Delete project
DELETE /api/v1/evals/projects/{project_id}
Returns:
{
"deleted": true,
"id": "project_..."
}
Datasets
List datasets
GET /api/v1/evals/datasets
Create dataset
POST /api/v1/evals/datasets
Body:
{
"id": "evaldataset_optional_custom_id",
"name": "Dataset name",
"description": "Optional description",
"items": [
{
"input": "Required input",
"output": "Optional existing output",
"system": "Optional system message",
"expected_output": "Optional expected output",
"context": "Optional context",
"metadata": {}
}
]
}
Custom dataset IDs must start with evaldataset_ and contain 6 to 80 letters, numbers, underscores, or dashes after the prefix.
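For illustration, the custom ID rule can be expressed as a regular expression (a sketch based on the constraint above, not an official validator):

import re

# evaldataset_ prefix followed by 6 to 80 letters, numbers, underscores, or dashes.
CUSTOM_DATASET_ID = re.compile(r"^evaldataset_[A-Za-z0-9_-]{6,80}$")

assert CUSTOM_DATASET_ID.match("evaldataset_support_qa_v1")
assert not CUSTOM_DATASET_ID.match("evaldataset_abc")  # only 3 characters after the prefix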
Get dataset
GET /api/v1/evals/datasets/{dataset_id}
Returns the dataset and its current items.
Delete dataset
DELETE /api/v1/evals/datasets/{dataset_id}
Deletes the dataset by marking it deleted. Historical runs and versions keep their snapshots.
Dataset Versions
List dataset versions
GET /api/v1/evals/datasets/{dataset_id}/versions
Create dataset version
POST /api/v1/evals/datasets/{dataset_id}/versions
Freezes the current dataset rows into a new immutable version.
Scorers
List scorers
GET /api/v1/evals/scorers
Includes built-in scorers, legacy custom evaluators, and versioned scorers.
Create scorer
POST /api/v1/evals/scorers
Body:
{
"id": "optional_scorer_id",
"scorer_id": "optional_scorer_id",
"name": "Scorer name",
"description": "Optional description",
"scorer_type": "llm_judge",
"config": {},
"prompt": "Required for llm_judge",
"judge_model": "openai/gpt-5.4-mini"
}
For llm_judge, prompt is required.
If both id and scorer_id are omitted, NanoGPT generates a scorer ID.
Get latest scorer
GET /api/v1/evals/scorers/{scorer_id}
Delete scorer
DELETE /api/v1/evals/scorers/{scorer_id}
Deletes all stored versions for the scorer ID.
Scorer Configuration
exact_match
Compares output to expected_output.
Config:
{
"case_sensitive": false
}
contains
Checks whether output contains a configured value or expected_output.
Config:
{
"value": "required substring",
"case_sensitive": false
}
regex
Checks whether output matches a regular expression.
Config:
{
"pattern": "success|passed",
"flags": "i"
}
If pattern is omitted, the scorer uses expected_output as the pattern.
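A minimal sketch of the matching behavior, assuming flags map to standard regex flags (for example i for case-insensitive) and that a missing pattern falls back to expected_output:

import re

def regex_score(output: str, expected_output: str | None, config: dict) -> float:
    # Fall back to expected_output when no pattern is configured.
    pattern = config.get("pattern") or expected_output or ""
    flags = re.IGNORECASE if "i" in config.get("flags", "") else 0
    return 1.0 if re.search(pattern, output, flags) else 0.0

print(regex_score("Deployment PASSED in 42s", None, {"pattern": "success|passed", "flags": "i"}))  # 1.0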
json_schema
Parses output as JSON and validates a supported JSON-schema subset.
Config:
{
"schema": {
"type": "object",
"required": ["answer"],
"properties": {
"answer": {
"type": "string",
"minLength": 2
}
}
}
}
Supported schema fields:
type
required
properties
items
enum
minimum
maximum
minLength
maxLength
Nested properties and items validation is capped at 10 levels.
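A rough sketch of what validating this subset could look like (illustrative only; it covers type, required, properties, items, enum, and the numeric and length bounds listed above, with the same 10-level depth cap):

import json

TYPES = {"object": dict, "array": list, "string": str, "number": (int, float), "boolean": bool}

def validate(value, schema: dict, depth: int = 0) -> bool:
    # Illustrative validator for the documented subset; nesting capped at 10 levels.
    if depth > 10:
        return True
    expected = TYPES.get(schema.get("type"))
    if expected and not isinstance(value, expected):
        return False
    if "enum" in schema and value not in schema["enum"]:
        return False
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if "minimum" in schema and value < schema["minimum"]:
            return False
        if "maximum" in schema and value > schema["maximum"]:
            return False
    if isinstance(value, str):
        if "minLength" in schema and len(value) < schema["minLength"]:
            return False
        if "maxLength" in schema and len(value) > schema["maxLength"]:
            return False
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                return False
        for key, sub in schema.get("properties", {}).items():
            if key in value and not validate(value[key], sub, depth + 1):
                return False
    if isinstance(value, list):
        item_schema = schema.get("items")
        if item_schema and not all(validate(v, item_schema, depth + 1) for v in value):
            return False
    return True

schema = {"type": "object", "required": ["answer"], "properties": {"answer": {"type": "string", "minLength": 2}}}
print(validate(json.loads('{"answer": "Use exponential backoff."}'), schema))  # True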
threshold
Converts a value to a number and passes if it is greater than or equal to a threshold.
Config:
{
"source": "metadata.score",
"threshold": 0.7
}
Supported sources:
output
expected_output
metadata.score
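A small sketch of the pass/fail logic, assuming the configured source is resolved from the eval row and candidate output as described above:

def threshold_score(output: str, expected_output: str | None, metadata: dict, config: dict) -> float:
    # Resolve the configured source, coerce it to a number, then compare to the threshold.
    source = config.get("source", "output")
    if source == "output":
        raw = output
    elif source == "expected_output":
        raw = expected_output
    elif source == "metadata.score":
        raw = metadata.get("score")
    else:
        raw = None
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0
    return 1.0 if value >= config.get("threshold", 0) else 0.0

print(threshold_score("", None, {"score": 0.82}, {"source": "metadata.score", "threshold": 0.7}))  # 1.0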
llm_judge
Calls a judge model and expects JSON:
{
"score": 0.8,
"reasoning": "The response directly answers the user."
}
The score is clamped to 0..1.
Prompt templates may reference:
{{input}}
{{output}}
{{expected_output}}
{{context}}
{{system}}
{{metadata.some_key}}
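As a sketch, the placeholder substitution can be thought of as simple string replacement over these variables (illustrative; the exact server-side rendering is not specified here):

import re

def render_prompt(template: str, row: dict) -> str:
    # Replace {{input}}, {{output}}, {{expected_output}}, {{context}}, {{system}},
    # and {{metadata.some_key}} with values from the eval row.
    def lookup(match: re.Match) -> str:
        key = match.group(1).strip()
        if key.startswith("metadata."):
            return str(row.get("metadata", {}).get(key[len("metadata."):], ""))
        return str(row.get(key, ""))
    return re.sub(r"\{\{([^}]+)\}\}", lookup, template)

row = {"input": "Explain rate limits", "output": "Rate limits cap request volume...", "metadata": {"topic": "api"}}
print(render_prompt("Input: {{input}}\nResponse: {{output}}\nTopic: {{metadata.topic}}", row))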
pairwise_llm
Compares a challenger candidate against a baseline candidate with an LLM judge. A score of 1 means the challenger is better, 0 means the baseline is better, and 0.5 means a tie.
Config:
{
"baseline_candidate_id": "baseline"
}
If no baseline is configured, the first candidate is used.
Experiments
List experiments
GET /api/v1/evals/experiments
Query parameters:
| Parameter | Description |
| --- | --- |
| project_id | Optional project filter. |
| limit | Default 50, maximum 100. |
Create experiment
POST /api/v1/evals/experiments
Body:
{
"project_id": "project_...",
"name": "Experiment name",
"description": "Optional description",
"dataset_id": "evaldataset_...",
"dataset_version_id": "datasetv_...",
"data": [
{
"input": "Inline row",
"expected_output": "Optional expected output",
"context": "Optional context",
"metadata": {}
}
],
"candidates": [
{
"id": "baseline",
"name": "Baseline",
"model": "openai/gpt-5.4-mini",
"system": "Optional system message",
"prompt": "{{input}}",
"config": {
"temperature": 0
}
}
],
"scorer_ids": ["scorer_..."],
"settings": {
"store_content": true,
"redaction": false,
"trace_group": "optional-group",
"judge_model": "openai/gpt-5.4-mini"
}
}
Use one data source:
- dataset_id
- dataset_version_id
- inline data or items
Do not pass both dataset_id and dataset_version_id.
For inline experiments, content storage must be enabled because the experiment needs row snapshots to run asynchronously.
Candidate fields:
| Field | Description |
| --- | --- |
| id | Optional. Defaults to candidate_1, candidate_2, and so on. |
| name | Optional candidate name. |
| model | Required model ID. |
| system | Optional system message. |
| prompt | Optional. Defaults to {{input}}. |
| config | Optional generation config. model, messages, and stream are ignored if included. |
The endpoint returns 202 Accepted and an experiment object.
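The same request can be issued from Python. A minimal sketch using an inline data row (store_content must be true for inline runs, as noted above; the scorer ID is a placeholder):

import os
import requests

API_BASE = "https://nano-gpt.com/api/v1/evals"
HEADERS = {
    "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",
    "Content-Type": "application/json",
}

body = {
    "name": "Inline smoke test",
    "data": [{"input": "Explain API rate limits.", "expected_output": "Mentions quotas and retries."}],
    "candidates": [{"id": "baseline", "model": "openai/gpt-5.4-mini", "prompt": "{{input}}"}],
    "scorer_ids": ["scorer_abc123"],
    "settings": {"store_content": True},  # required for inline rows
}

response = requests.post(f"{API_BASE}/experiments", headers=HEADERS, json=body)
response.raise_for_status()  # expect 202 Accepted
print(response.json()["id"], response.json()["status"])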
Get experiment
GET /api/v1/evals/experiments/{experiment_id}
Use this endpoint to poll status and progress.
Cancel experiment
POST /api/v1/evals/experiments/{experiment_id}/cancel
Only queued or in-progress experiments can be cancelled.
Rerun experiment
POST /api/v1/evals/experiments/{experiment_id}/rerun
Creates a new experiment from the original experiment snapshot and schedules it asynchronously.
List experiment output items
GET /api/v1/evals/experiments/{experiment_id}/output_items
Query parameters:
| Parameter | Description |
| --- | --- |
| redact_content | Set to true to hide stored input and output content in the response. |
Response:
{
"experiment": {},
"items": [],
"data": [
{
"id": "trace_...",
"object": "eval.trace",
"status": "completed",
"output_content": "Stored output if store_content was true",
"scores": [
{
"id": "score_...",
"scorer_id": "scorer_...",
"score": 0.8,
"reasoning": "Good answer",
"status": "completed"
}
]
}
]
}
Traces
List traces
GET /api/v1/evals/traces
Query parameters:
| Parameter | Description |
| --- | --- |
| project_id | Optional project filter. |
| experiment_id | Optional experiment filter. |
| model | Optional model filter. |
| provider | Optional routing label. Public responses use generic NanoGPT routing labels rather than internal provider names. |
| status | Optional status filter. |
| source | Optional source filter. |
| limit | Default 50, maximum 200. |
Get trace
GET /api/v1/evals/traces/{trace_id}
Returns the trace plus attached scores.
Dashboard
GET /api/v1/evals/dashboard
Query parameters:
| Parameter | Description |
| --- | --- |
| project_id | Optional project filter. |
| experiment_id | Optional experiment filter. |
Returns aggregate metrics:
{
"trace_count": 100,
"cost_usd": 0.42,
"prompt_tokens": 10000,
"completion_tokens": 5000,
"avg_latency_ms": 1200,
"p50_latency_ms": 900,
"p95_latency_ms": 2400,
"error_count": 2,
"error_rate": 0.02,
"model_provider_breakdown": [],
"scorer_trends": []
}
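For example, a small sketch that pulls project-level metrics and prints latency and error figures (the project ID is a placeholder):

import os
import requests

API_BASE = "https://nano-gpt.com/api/v1/evals"
HEADERS = {"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"}

metrics = requests.get(
    f"{API_BASE}/dashboard",
    headers=HEADERS,
    params={"project_id": "project_abc123"},  # optional filter
).json()
print(f"p95 latency: {metrics['p95_latency_ms']} ms, error rate: {metrics['error_rate']:.2%}")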
Opt-in Chat Completion Tracing
Normal /v1/chat/completions requests do not create eval traces.
To trace a normal API request, add metadata.nanogpt_eval_trace: true to the chat completion request.
Example:
curl -X POST "https://nano-gpt.com/v1/chat/completions" \
-H "Authorization: Bearer $NANOGPT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-5.4-mini",
"messages": [
{ "role": "user", "content": "Explain API rate limits." }
],
"metadata": {
"nanogpt_eval_trace": true,
"nanogpt_eval_project_id": "project_...",
"nanogpt_eval_trace_group": "docs-example",
"nanogpt_eval_store_content": false,
"customer_request_id": "kept-and-forwarded"
}
}'
Supported eval metadata keys:
| Key | Description |
| --- | --- |
| nanogpt_eval_trace | Boolean. Must be true to create a trace. |
| nanogpt_eval_project_id | Optional project ID. |
| nanogpt_eval_experiment_id | Optional experiment ID. |
| nanogpt_eval_trace_group | Optional group ID. |
| nanogpt_eval_store_content | Boolean. Stores prompt and output content only when true. |
NanoGPT strips only metadata.nanogpt_eval_* keys before provider dispatch. Other metadata keys remain untouched.
If nanogpt_eval_store_content is omitted or false, the trace stores metadata, usage, cost, latency, status, and errors, but not prompt or output content.
If nanogpt_eval_store_content is true but the request does not produce output text available to the trace recorder, NanoGPT keeps the trace metadata-only and records content_suppressed_reason: "output_content_unavailable" in trace metadata.
Legacy Evaluator Endpoints
The original evaluator API remains available for compatibility.
List legacy evaluators
Create legacy evaluator
Body:
{
"id": "eval_optional_custom_id",
"name": "Helpfulness",
"description": "Optional",
"prompt": "Evaluate this response. Return JSON with score and reasoning.",
"judge_model": "openai/gpt-5.4-mini"
}
Custom evaluator IDs must start with eval_ and contain 6 to 80 letters, numbers, underscores, or dashes after the prefix.
Run legacy evaluator
POST /api/v1/evals/{eval_id}/runs
Body:
{
"dataset_id": "evaldataset_...",
"data": [
{
"input": "Question",
"output": "Candidate answer",
"expected_output": "Expected answer",
"context": "Optional context",
"metadata": {}
}
],
"store": true,
"judge_model": "openai/gpt-5.4-mini",
"redaction": false,
"concurrency": 4,
"metadata": {}
}
Use either dataset_id or inline data.
Get stored legacy run
GET /api/v1/evals/{eval_id}/runs/{run_id}
Get stored legacy run output items
GET /api/v1/evals/{eval_id}/runs/{run_id}/output_items
Retention
Default retention is 30 days. Trace records, stored trace content, jobs, and old experiment artifacts are cleaned up by the eval cleanup job. Projects, datasets, dataset versions, and scorers are durable until deleted.
Error Responses
Validation errors return HTTP 400:
{
"error": "name is required"
}
Missing resources return HTTP 404:
{
"error": "Experiment not found"
}
Rate limits return HTTP 429:
{
"error": {
"message": "Too many eval runs. Please slow down and try again later.",
"type": "rate_limit_error",
"code": "rate_limit_exceeded"
}
}
Unexpected server failures return HTTP 500:
{
"error": "Internal Server Error"
}
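Because rate-limited responses include Retry-After, callers can back off before retrying. A minimal sketch:

import os
import time
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"}

def get_with_retry(url: str, max_attempts: int = 5) -> requests.Response:
    # Honor Retry-After on HTTP 429 instead of retrying immediately.
    for attempt in range(max_attempts):
        response = requests.get(url, headers=HEADERS)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        time.sleep(float(response.headers.get("Retry-After", 2 ** attempt)))
    raise RuntimeError("still rate limited after retries")

experiment = get_with_retry("https://nano-gpt.com/api/v1/evals/experiments/experiment_abc123").json()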
Operational Notes
The eval platform stores durable project, dataset, scorer, experiment, trace, and score records in dedicated tables, including:
eval_projects
eval_dataset_versions
eval_scorer_versions
eval_experiments
eval_experiment_items
eval_traces
eval_trace_scores
The same database migration also ensures the legacy eval tables exist.
Async experiment execution uses NanoGPT’s background scheduler. Stale queued or in-progress experiments are retried, and the cleanup job removes expired legacy runs, traces, and experiments.