NanoGPT Evals and Observability API
NanoGPT Evals lets you run durable prompt and model experiments, freeze datasets, version scorers, inspect traces, and aggregate latency, cost, token, error, and score trends. The API is available under /api/v1/evals/*.
The same platform powers Prompt Lab at /prompt-lab.
Core Concepts
Projects
A project groups datasets, experiments, traces, and dashboard metrics. If an experiment is created without a project_id, NanoGPT uses a default Prompt Lab project for the authenticated user.
Datasets and Dataset Versions
A dataset is an editable set of eval rows. A dataset version is a frozen snapshot of the dataset at a point in time. Experiments should use dataset versions when reproducibility matters.
Dataset rows support:
{
"input": "User prompt or task",
"expected_output": "Optional target answer",
"context": "Optional extra context",
"metadata": {
"case": "optional structured metadata"
}
}
Scorers
A scorer grades candidate outputs. Scorers are versioned, so an experiment stores the exact scorer snapshot used at run time.
Supported scorer types:
exact_match
contains
regex
json_schema
threshold
llm_judge
pairwise_llm
This version does not run arbitrary JavaScript or Python scorers.
Experiments
An experiment compares one or more candidates over a dataset or inline rows. Experiments are asynchronous and durable.
Experiment statuses:
queued
in_progress
completed
failed
cancelled
Experiments store candidate prompt, model, and config snapshots, scorer snapshots, progress, traces, scores, errors, and cost and usage metadata.
Traces
A trace records a generation or scoring call. Traces store metadata, timings, usage, cost, status, and errors. Prompt and output content is not stored unless explicitly requested.
Privacy Defaults
By default, traces are metadata-only.
NanoGPT stores prompt and output content only when:
- a user explicitly saves Prompt Lab dataset or experiment content
- an API caller passes a content-storage opt-in such as nanogpt_eval_store_content: true
Requested content storage can be suppressed when content is not safe or not available to store. When suppression happens, trace metadata includes content_suppressed_reason.
Current suppression reasons include:
| Reason | Description |
| --- | --- |
| pii_redaction_enabled | Redaction was enabled for the request. |
| output_content_unavailable | Content storage was requested, but no output text was available to persist. |
Authentication
Use the same authentication as the NanoGPT API. For API callers, pass your API key in the Authorization header:
Authorization: Bearer $NANOGPT_API_KEY
All eval objects are scoped to the authenticated session, team, and API key context.
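For example, a minimal Python sketch (using the requests library) that sends the same header to the documented list-projects endpoint:

import os
import requests

API_BASE = "https://nano-gpt.com/api/v1/evals"
HEADERS = {
    "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",
    "Content-Type": "application/json",
}

# List projects visible to this API key (same scoping as the session/team context).
response = requests.get(f"{API_BASE}/projects", headers=HEADERS)
response.raise_for_status()
print(response.json())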
Limits
Current experiment limits:
- up to 100 eval items per dataset or inline run
- up to 5 candidates per experiment
- up to 10 scorers per experiment
- up to 100 generation and scoring work units per experiment
Work units are calculated as:
items * candidates * max(1, scorer_count + 1)
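As an illustration of the formula above, a small pre-flight sketch that estimates work units and checks them against the current limits before submitting an experiment (the constants come from the list above; the helper functions are hypothetical):

# Illustrative pre-flight check for the documented experiment limits.
MAX_ITEMS = 100
MAX_CANDIDATES = 5
MAX_SCORERS = 10
MAX_WORK_UNITS = 100

def work_units(items: int, candidates: int, scorer_count: int) -> int:
    # items * candidates * max(1, scorer_count + 1), as documented above
    return items * candidates * max(1, scorer_count + 1)

def check_limits(items: int, candidates: int, scorer_count: int) -> None:
    if items > MAX_ITEMS or candidates > MAX_CANDIDATES or scorer_count > MAX_SCORERS:
        raise ValueError("dataset, candidate, or scorer count exceeds documented limits")
    units = work_units(items, candidates, scorer_count)
    if units > MAX_WORK_UNITS:
        raise ValueError(f"{units} work units exceeds the {MAX_WORK_UNITS} unit limit")

check_limits(items=10, candidates=2, scorer_count=1)  # 10 * 2 * 2 = 40 units, OK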
Eval run and item rate limits are applied to normal runs and reruns. Rate-limited responses return HTTP 429 with Retry-After.
Object Shapes
Project
{
"id": "project_...",
"object": "eval.project",
"name": "Support Bot",
"description": "Support prompt evaluation",
"settings": {},
"created_at": "2026-05-15T12:00:00.000Z",
"updated_at": "2026-05-15T12:00:00.000Z"
}
Dataset Version
{
"id": "datasetv_...",
"object": "eval.dataset_version",
"dataset_id": "evaldataset_...",
"version": 3,
"item_count": 25,
"items": [
{
"id": "evalitem_...",
"dataset_item_id": "evalitem_...",
"input": "Explain rate limits",
"expected_output": "Mentions quotas and retry behavior",
"context": null,
"metadata": { "topic": "billing" },
"metadata_index": 0
}
],
"source": "manual",
"created_at": "2026-05-15T12:00:00.000Z"
}
Scorer
{
"id": "scorer_...",
"object": "eval.scorer",
"version_id": "scorerv_...",
"version": 1,
"name": "Helpful Judge",
"description": "Scores helpfulness from 0 to 1",
"scorer_type": "llm_judge",
"config": {},
"prompt": "Evaluate the response. Return {\"score\": number, \"reasoning\": string}.",
"judge_model": "openai/gpt-5.4-mini",
"created_at": "2026-05-15T12:00:00.000Z"
}
Experiment
{
"id": "experiment_...",
"object": "eval.experiment",
"project_id": "project_...",
"name": "Support answer comparison",
"description": null,
"dataset_version_id": "datasetv_...",
"status": "in_progress",
"progress": {
"total_items": 10,
"total_traces": 20,
"completed_traces": 8,
"failed_traces": 0,
"total_scores": 40,
"completed_scores": 12,
"failed_scores": 0
},
"candidates": [],
"scorers": [],
"settings": {
"store_content": true,
"redaction": false
},
"error": null,
"created_at": "2026-05-15T12:00:00.000Z",
"started_at": "2026-05-15T12:00:02.000Z",
"completed_at": null,
"cancelled_at": null,
"expires_at": "2026-06-14T12:00:00.000Z"
}
Trace
{
"id": "trace_...",
"object": "eval.trace",
"project_id": "project_...",
"experiment_id": "experiment_...",
"experiment_item_id": "experimentitem_...",
"trace_type": "generation",
"source": "prompt_lab_experiment",
"group_id": "experiment_...",
"parent_trace_id": null,
"status": "completed",
"model": "openai/gpt-5.4-mini",
"provider": "nanogpt",
"store_content": false,
"input_content": null,
"output_content": null,
"metadata": {
"candidate_id": "candidate_1"
},
"usage": {
"prompt_tokens": 120,
"completion_tokens": 80
},
"cost_usd": 0.0004,
"latency_ms": 1200,
"error": null,
"started_at": "2026-05-15T12:00:00.000Z",
"completed_at": "2026-05-15T12:00:01.200Z",
"expires_at": "2026-06-14T12:00:00.000Z"
}
Quick Start
1. Create a dataset
curl -X POST "https://nano-gpt.com/api/v1/evals/datasets" \
-H "Authorization: Bearer $NANOGPT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Support QA",
"items": [
{
"input": "Explain API rate limits to a non-technical founder.",
"expected_output": "Mentions quotas, retry behavior, and practical next steps.",
"metadata": { "topic": "api" }
}
]
}'
2. Freeze a dataset version
curl -X POST "https://nano-gpt.com/api/v1/evals/datasets/evaldataset_abc123/versions" \
-H "Authorization: Bearer $NANOGPT_API_KEY"
3. Create a scorer
curl -X POST "https://nano-gpt.com/api/v1/evals/scorers" \
-H "Authorization: Bearer $NANOGPT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Helpful Judge",
"scorer_type": "llm_judge",
"judge_model": "openai/gpt-5.4-mini",
"prompt": "Evaluate whether the response is helpful. Input: {{input}}\nResponse: {{output}}\nExpected: {{expected_output}}\nReturn only JSON: {\"score\": number, \"reasoning\": string}."
}'
4. Create an async experiment
curl -X POST "https://nano-gpt.com/api/v1/evals/experiments" \
-H "Authorization: Bearer $NANOGPT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Support QA prompt comparison",
"dataset_version_id": "datasetv_abc123",
"candidates": [
{
"id": "baseline",
"name": "Baseline",
"model": "openai/gpt-5.4-mini",
"system": "You are concise and practical.",
"prompt": "{{input}}"
},
{
"id": "detailed",
"name": "Detailed",
"model": "openai/gpt-5.4-mini",
"system": "You are clear, practical, and include examples.",
"prompt": "{{input}}"
}
],
"scorer_ids": ["scorer_abc123"],
"settings": {
"store_content": true,
"redaction": false
}
}'
The response returns an experiment with status queued or in_progress. Poll the experiment until status is terminal.
5. Poll the experiment
curl "https://nano-gpt.com/api/v1/evals/experiments/experiment_abc123" \
-H "Authorization: Bearer $NANOGPT_API_KEY"
6. Read output items
curl "https://nano-gpt.com/api/v1/evals/experiments/experiment_abc123/output_items" \
-H "Authorization: Bearer $NANOGPT_API_KEY"
API Reference
Projects
List projects
GET /api/v1/evals/projects
Returns:
{
"object": "list",
"data": []
}
Create project
POST /api/v1/evals/projects
Body:
{
"name": "Support Bot",
"description": "Optional description",
"settings": {}
}
Get project
GET /api/v1/evals/projects/{project_id}
Update project
PATCH /api/v1/evals/projects/{project_id}
Body fields:
{
"name": "New name",
"description": "New description",
"settings": {}
}
Delete project
DELETE /api/v1/evals/projects/{project_id}
Returns:
{
"deleted": true,
"id": "project_..."
}
Datasets
List datasets
GET /api/v1/evals/datasets
Create dataset
POST /api/v1/evals/datasets
Body:
{
"id": "evaldataset_optional_custom_id",
"name": "Dataset name",
"description": "Optional description",
"items": [
{
"input": "Required input",
"output": "Optional existing output",
"system": "Optional system message",
"expected_output": "Optional expected output",
"context": "Optional context",
"metadata": {}
}
]
}
Custom dataset IDs must start with evaldataset_ and contain 6 to 80 letters, numbers, underscores, or dashes after the prefix.
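For illustration, the custom ID rule can be expressed as a regular expression (a sketch based on the constraint above, not an official validator):

import re

# evaldataset_ prefix followed by 6 to 80 letters, numbers, underscores, or dashes.
CUSTOM_DATASET_ID = re.compile(r"^evaldataset_[A-Za-z0-9_-]{6,80}$")

assert CUSTOM_DATASET_ID.match("evaldataset_support_qa_v1")
assert not CUSTOM_DATASET_ID.match("evaldataset_abc")  # only 3 characters after the prefix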
Get dataset
GET /api/v1/evals/datasets/{dataset_id}
Returns the dataset and its current items.
Delete dataset
DELETE /api/v1/evals/datasets/{dataset_id}
Deletes the dataset by marking it deleted. Historical runs and versions keep their snapshots.
Dataset Versions
List dataset versions
GET /api/v1/evals/datasets/{dataset_id}/versions
Create dataset version
POST /api/v1/evals/datasets/{dataset_id}/versions
Freezes the current dataset rows into a new immutable version.
Scorers
List scorers
GET /api/v1/evals/scorers
Includes built-in scorers, legacy custom evaluators, and versioned scorers.
Create scorer
POST /api/v1/evals/scorers
Body:
{
"id": "optional_scorer_id",
"scorer_id": "optional_scorer_id",
"name": "Scorer name",
"description": "Optional description",
"scorer_type": "llm_judge",
"config": {},
"prompt": "Required for llm_judge",
"judge_model": "openai/gpt-5.4-mini"
}
For llm_judge, prompt is required.
If both id and scorer_id are omitted, NanoGPT generates a scorer ID.
Get latest scorer
GET /api/v1/evals/scorers/{scorer_id}
Delete scorer
DELETE /api/v1/evals/scorers/{scorer_id}
Deletes all stored versions for the scorer ID.
Scorer Configuration
exact_match
Compares output to expected_output.
Config:
{
"case_sensitive": false
}
contains
Checks whether output contains a configured value or expected_output.
Config:
{
"value": "required substring",
"case_sensitive": false
}
regex
Checks whether output matches a regular expression.
Config:
{
"pattern": "success|passed",
"flags": "i"
}
If pattern is omitted, the scorer uses expected_output as the pattern.
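A minimal sketch of the matching behavior, assuming flags map to standard regex flags (for example i for case-insensitive) and that a missing pattern falls back to expected_output:

import re

def regex_score(output: str, expected_output: str | None, config: dict) -> float:
    # Fall back to expected_output when no pattern is configured.
    pattern = config.get("pattern") or expected_output or ""
    flags = re.IGNORECASE if "i" in config.get("flags", "") else 0
    return 1.0 if re.search(pattern, output, flags) else 0.0

print(regex_score("Deployment PASSED in 42s", None, {"pattern": "success|passed", "flags": "i"}))  # 1.0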
json_schema
Parses output as JSON and validates a supported JSON-schema subset.
Config:
{
"schema": {
"type": "object",
"required": ["answer"],
"properties": {
"answer": {
"type": "string",
"minLength": 2
}
}
}
}
Supported schema fields:
type
required
properties
items
enum
minimum
maximum
minLength
maxLength
Nested properties and items validation is capped at 10 levels.
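A rough sketch of what validating this subset could look like (illustrative only; it covers type, required, properties, items, enum, and the numeric and length bounds listed above, with the same 10-level depth cap):

import json

TYPES = {"object": dict, "array": list, "string": str, "number": (int, float), "boolean": bool}

def validate(value, schema: dict, depth: int = 0) -> bool:
    # Illustrative validator for the documented subset; nesting capped at 10 levels.
    if depth > 10:
        return True
    expected = TYPES.get(schema.get("type"))
    if expected and not isinstance(value, expected):
        return False
    if "enum" in schema and value not in schema["enum"]:
        return False
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if "minimum" in schema and value < schema["minimum"]:
            return False
        if "maximum" in schema and value > schema["maximum"]:
            return False
    if isinstance(value, str):
        if "minLength" in schema and len(value) < schema["minLength"]:
            return False
        if "maxLength" in schema and len(value) > schema["maxLength"]:
            return False
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                return False
        for key, sub in schema.get("properties", {}).items():
            if key in value and not validate(value[key], sub, depth + 1):
                return False
    if isinstance(value, list):
        item_schema = schema.get("items")
        if item_schema and not all(validate(v, item_schema, depth + 1) for v in value):
            return False
    return True

schema = {"type": "object", "required": ["answer"], "properties": {"answer": {"type": "string", "minLength": 2}}}
print(validate(json.loads('{"answer": "Use exponential backoff."}'), schema))  # True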
threshold
Converts a value to a number and passes if it is greater than or equal to a threshold.
Config:
{
"source": "metadata.score",
"threshold": 0.7
}
Supported sources:
output
expected_output
metadata.score
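A small sketch of the pass/fail logic, assuming the configured source is resolved from the eval row and candidate output as described above:

def threshold_score(output: str, expected_output: str | None, metadata: dict, config: dict) -> float:
    # Resolve the configured source, coerce it to a number, then compare to the threshold.
    source = config.get("source", "output")
    if source == "output":
        raw = output
    elif source == "expected_output":
        raw = expected_output
    elif source == "metadata.score":
        raw = metadata.get("score")
    else:
        raw = None
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0
    return 1.0 if value >= config.get("threshold", 0) else 0.0

print(threshold_score("", None, {"score": 0.82}, {"source": "metadata.score", "threshold": 0.7}))  # 1.0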
llm_judge
Calls a judge model and expects JSON:
{
"score": 0.8,
"reasoning": "The response directly answers the user."
}
The score is clamped to 0..1.
Prompt templates may reference:
{{input}}
{{output}}
{{expected_output}}
{{context}}
{{system}}
{{metadata.some_key}}
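As a sketch, the placeholder substitution can be thought of as simple string replacement over these variables (illustrative; the exact server-side rendering is not specified here):

import re

def render_prompt(template: str, row: dict) -> str:
    # Replace {{input}}, {{output}}, {{expected_output}}, {{context}}, {{system}},
    # and {{metadata.some_key}} with values from the eval row.
    def lookup(match: re.Match) -> str:
        key = match.group(1).strip()
        if key.startswith("metadata."):
            return str(row.get("metadata", {}).get(key[len("metadata."):], ""))
        return str(row.get(key, ""))
    return re.sub(r"\{\{([^}]+)\}\}", lookup, template)

row = {"input": "Explain rate limits", "output": "Rate limits cap request volume...", "metadata": {"topic": "api"}}
print(render_prompt("Input: {{input}}\nResponse: {{output}}\nTopic: {{metadata.topic}}", row))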
pairwise_llm
Compares a challenger candidate against a baseline candidate with an LLM judge. A score of 1 means the challenger is better, 0 means the baseline is better, and 0.5 means a tie.
Config:
{
"baseline_candidate_id": "baseline"
}
If no baseline is configured, the first candidate is used.
Experiments
List experiments
GET /api/v1/evals/experiments
Query parameters:
| Parameter | Description |
| --- | --- |
| project_id | Optional project filter. |
| limit | Default 50, maximum 100. |
Create experiment
POST /api/v1/evals/experiments
Body:
{
"project_id": "project_...",
"name": "Experiment name",
"description": "Optional description",
"dataset_id": "evaldataset_...",
"dataset_version_id": "datasetv_...",
"data": [
{
"input": "Inline row",
"expected_output": "Optional expected output",
"context": "Optional context",
"metadata": {}
}
],
"candidates": [
{
"id": "baseline",
"name": "Baseline",
"model": "openai/gpt-5.4-mini",
"system": "Optional system message",
"prompt": "{{input}}",
"config": {
"temperature": 0
}
}
],
"scorer_ids": ["scorer_..."],
"settings": {
"store_content": true,
"redaction": false,
"trace_group": "optional-group",
"judge_model": "openai/gpt-5.4-mini"
}
}
Use one data source:
- dataset_id
- dataset_version_id
- inline data or items
Do not pass both dataset_id and dataset_version_id.
For inline experiments, content storage must be enabled because the experiment needs row snapshots to run asynchronously.
Candidate fields:
| Field | Description |
| --- | --- |
| id | Optional. Defaults to candidate_1, candidate_2, and so on. |
| name | Optional candidate name. |
| model | Required model ID. |
| system | Optional system message. |
| prompt | Optional. Defaults to {{input}}. |
| config | Optional generation config. model, messages, and stream are ignored if included. |
The endpoint returns 202 Accepted and an experiment object.
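The same request can be issued from Python. A minimal sketch using an inline data row (store_content must be true for inline runs, as noted above; the scorer ID is a placeholder):

import os
import requests

API_BASE = "https://nano-gpt.com/api/v1/evals"
HEADERS = {
    "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",
    "Content-Type": "application/json",
}

body = {
    "name": "Inline smoke test",
    "data": [{"input": "Explain API rate limits.", "expected_output": "Mentions quotas and retries."}],
    "candidates": [{"id": "baseline", "model": "openai/gpt-5.4-mini", "prompt": "{{input}}"}],
    "scorer_ids": ["scorer_abc123"],
    "settings": {"store_content": True},  # required for inline rows
}

response = requests.post(f"{API_BASE}/experiments", headers=HEADERS, json=body)
response.raise_for_status()  # expect 202 Accepted
print(response.json()["id"], response.json()["status"])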
Get experiment
GET /api/v1/evals/experiments/{experiment_id}
Use this endpoint to poll status and progress.
Cancel experiment
POST /api/v1/evals/experiments/{experiment_id}/cancel
Only queued or in-progress experiments can be cancelled.
Rerun experiment
POST /api/v1/evals/experiments/{experiment_id}/rerun
Creates a new experiment from the original experiment snapshot and schedules it asynchronously.
List experiment output items
GET /api/v1/evals/experiments/{experiment_id}/output_items
Query parameters:
| Parameter | Description |
| --- | --- |
| redact_content | Set to true to hide stored input and output content in the response. |
Response:
{
"experiment": {},
"items": [],
"data": [
{
"id": "trace_...",
"object": "eval.trace",
"status": "completed",
"output_content": "Stored output if store_content was true",
"scores": [
{
"id": "score_...",
"scorer_id": "scorer_...",
"score": 0.8,
"reasoning": "Good answer",
"status": "completed"
}
]
}
]
}
Traces
List traces
GET /api/v1/evals/traces
Query parameters:
| Parameter | Description |
| --- | --- |
| project_id | Optional project filter. |
| experiment_id | Optional experiment filter. |
| model | Optional model filter. |
| provider | Optional routing label. Public responses use generic NanoGPT routing labels rather than internal provider names. |
| status | Optional status filter. |
| source | Optional source filter. |
| limit | Default 50, maximum 200. |
Get trace
GET /api/v1/evals/traces/{trace_id}
Returns the trace plus attached scores.
Dashboard
GET /api/v1/evals/dashboard
Query parameters:
| Parameter | Description |
| --- | --- |
| project_id | Optional project filter. |
| experiment_id | Optional experiment filter. |
Returns aggregate metrics:
{
"trace_count": 100,
"cost_usd": 0.42,
"prompt_tokens": 10000,
"completion_tokens": 5000,
"avg_latency_ms": 1200,
"p50_latency_ms": 900,
"p95_latency_ms": 2400,
"error_count": 2,
"error_rate": 0.02,
"model_provider_breakdown": [],
"scorer_trends": []
}
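For example, a small sketch that pulls project-level metrics and prints latency and error figures (the project ID is a placeholder):

import os
import requests

API_BASE = "https://nano-gpt.com/api/v1/evals"
HEADERS = {"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"}

metrics = requests.get(
    f"{API_BASE}/dashboard",
    headers=HEADERS,
    params={"project_id": "project_abc123"},  # optional filter
).json()
print(f"p95 latency: {metrics['p95_latency_ms']} ms, error rate: {metrics['error_rate']:.2%}")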
Opt-in Chat Completion Tracing
Normal /v1/chat/completions requests do not create eval traces.
To trace a normal API request, add metadata.nanogpt_eval_trace: true to the chat completion request.
Example:
curl -X POST "https://nano-gpt.com/v1/chat/completions" \
-H "Authorization: Bearer $NANOGPT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-5.4-mini",
"messages": [
{ "role": "user", "content": "Explain API rate limits." }
],
"metadata": {
"nanogpt_eval_trace": true,
"nanogpt_eval_project_id": "project_...",
"nanogpt_eval_trace_group": "docs-example",
"nanogpt_eval_store_content": false,
"customer_request_id": "kept-and-forwarded"
}
}'
Supported eval metadata keys:
| Key | Description |
| --- | --- |
| nanogpt_eval_trace | Boolean. Must be true to create a trace. |
| nanogpt_eval_project_id | Optional project ID. |
| nanogpt_eval_experiment_id | Optional experiment ID. |
| nanogpt_eval_trace_group | Optional group ID. |
| nanogpt_eval_store_content | Boolean. Stores prompt and output content only when true. |
NanoGPT strips only metadata.nanogpt_eval_* keys before provider dispatch. Other metadata keys remain untouched.
If nanogpt_eval_store_content is omitted or false, the trace stores metadata, usage, cost, latency, status, and errors, but not prompt or output content.
If nanogpt_eval_store_content is true but the request does not produce output text available to the trace recorder, NanoGPT keeps the trace metadata-only and records content_suppressed_reason: "output_content_unavailable" in trace metadata.
Legacy Evaluator Endpoints
The original evaluator API remains available for compatibility.
List legacy evaluators
Create legacy evaluator
Body:
{
"id": "eval_optional_custom_id",
"name": "Helpfulness",
"description": "Optional",
"prompt": "Evaluate this response. Return JSON with score and reasoning.",
"judge_model": "openai/gpt-5.4-mini"
}
Custom evaluator IDs must start with eval_ and contain 6 to 80 letters, numbers, underscores, or dashes after the prefix.
Run legacy evaluator
POST /api/v1/evals/{eval_id}/runs
Body:
{
"dataset_id": "evaldataset_...",
"data": [
{
"input": "Question",
"output": "Candidate answer",
"expected_output": "Expected answer",
"context": "Optional context",
"metadata": {}
}
],
"store": true,
"judge_model": "openai/gpt-5.4-mini",
"redaction": false,
"concurrency": 4,
"metadata": {}
}
Use either dataset_id or inline data.
Get stored legacy run
GET /api/v1/evals/{eval_id}/runs/{run_id}
Get stored legacy run output items
GET /api/v1/evals/{eval_id}/runs/{run_id}/output_items
Retention
Default retention is 30 days. Trace records, stored trace content, jobs, and old experiment artifacts are cleaned up by the eval cleanup job. Projects, datasets, dataset versions, and scorers are durable until deleted.
Error Responses
Validation errors return HTTP 400:
{
"error": "name is required"
}
Missing resources return HTTP 404:
{
"error": "Experiment not found"
}
Rate limits return HTTP 429:
{
"error": {
"message": "Too many eval runs. Please slow down and try again later.",
"type": "rate_limit_error",
"code": "rate_limit_exceeded"
}
}
Unexpected server failures return HTTP 500:
{
"error": "Internal Server Error"
}
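Because rate-limited responses include Retry-After, callers can back off before retrying. A minimal sketch:

import os
import time
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"}

def get_with_retry(url: str, max_attempts: int = 5) -> requests.Response:
    # Honor Retry-After on HTTP 429 instead of retrying immediately.
    for attempt in range(max_attempts):
        response = requests.get(url, headers=HEADERS)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        time.sleep(float(response.headers.get("Retry-After", 2 ** attempt)))
    raise RuntimeError("still rate limited after retries")

experiment = get_with_retry("https://nano-gpt.com/api/v1/evals/experiments/experiment_abc123").json()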
Operational Notes
The eval platform stores durable project, dataset, scorer, experiment, trace, and score records in dedicated tables, including:
eval_projects
eval_dataset_versions
eval_scorer_versions
eval_experiments
eval_experiment_items
eval_traces
eval_trace_scores
The same database migration also ensures the legacy eval tables exist.
Async experiment execution uses NanoGPT’s background scheduler. Stale queued or in-progress experiments are retried, and the cleanup job removes expired legacy runs, traces, and experiments.