What is a Golden Prompt Set?
A ‘golden prompt set’ is a term introduced by OpenAI in October 2025 alongside the launch of Apps within ChatGPT. It’s an essential tool for ChatGPT app metadata optimization and OpenAI app discovery.
A golden prompt set includes three types of prompts that users might type into ChatGPT:
- Direct Prompts - where users ask for your app or product by name
- Indirect Prompts - where users ask for something your app or product is designed to help with
- Negative Prompts - where users ask for something that your app or product shouldn’t be used for, and where default tools (e.g. web search) should be used instead.
Why Create a Golden Prompt Set for ChatGPT App Metadata Optimization?
A golden prompt set is a critical evaluation tool that developers and marketing teams use when building OpenAI Apps. It lets you measure how well your ChatGPT app will be discovered by users within ChatGPT (and eventually, in similar AI tools like Claude or Perplexity), and how well your metadata is performing.
By testing each prompt from your ‘golden prompt set’ in ChatGPT, you can verify that your app triggers when it should (Direct Prompts and Indirect Prompts) and isn’t called when it shouldn’t be (Negative Prompts).
Here’s a summary of what that might look like for the Spotify ChatGPT App:
| Type | Description | Expected Behavior |
|---|---|---|
| direct | User explicitly names your app/tool (“Use Spotify to create a TSwift playlist”) | Model must call that tool |
| indirect | User implies a need your tool covers (“Create a Taylor Swift playlist”) | Model should infer the need and call the right connector |
| negative | User asks something outside the connector’s scope (“Who is Taylor Swift’s producer?”) | Model should not call the tool — use text or web search instead |
How to Create a Golden Prompt Set for Your ChatGPT App
A golden prompt set is an evaluation dataset (also called an “eval”) that helps you optimize your ChatGPT app metadata. You use your golden prompt set to define the ‘ground truth’ for when your OpenAI App should ideally be invoked and which tool should be called in each case.
Golden prompt set eval datasets are typically stored as a JSONL file: structured data with one JSON object per line, which drives your ChatGPT app metadata optimization. The exact structure depends on how you’ve set up your evaluation script, but here’s an example of what that might look like.
The `query` is the prompt that the user types into ChatGPT, the `answer` is roughly what you want the response to say, and the `ideal` defines whether a tool should be called and, if so, which tool we expect. Finally, the `type` tags it as `direct`, `indirect`, or `negative` so you can compute summary statistics on the results.
Indirect Prompt Example
{"query":"Find me flights from Boston to Santiago, Chile on December 14.",
"answer":"Returns flight options BOS→SCL for Dec 14.",
"ideal":{"should_call_tool":true,"expected_tool":"expedia.search_flights"},
"type":"indirect"}
Direct Prompt Example
{"query":"Use Expedia to search flights BOS→SCL for Dec 14.",
"answer":"Returns flight options BOS→SCL for Dec 14 via Booking.",
"ideal":{"should_call_tool":true,"expected_tool":"expedia.search_flights"},
"type":"direct"}
Negative Prompt Example
{"query":"What's the population of Chile?",
"answer":"Gives population figure as text.",
"ideal":{"should_call_tool":false},
"type":"negative"}
To create a golden prompt set dataset for your ChatGPT App, you can use AI tools to generate prompts: provide examples of when you want your tool used, when you want it left alone, and the example format shown above.
Create golden prompt set examples for each possible action or tool that your ChatGPT App includes. This comprehensive approach ensures your OpenAI app metadata optimization covers all use cases.
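Here’s a minimal sketch of that generation step in Python, assuming the OpenAI Python SDK and a hypothetical flight-search app; the model name, system prompt, and output path are illustrative, so adapt them to your own actions.

```python
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Hypothetical instructions describing when the tool should (and shouldn't) be used.
SYSTEM_PROMPT = """You generate evaluation prompts for a flight-search ChatGPT app.
Return a JSON object {"examples": [...]} where each example has keys:
query, answer, ideal, type.
- direct: the user names Expedia explicitly.
- indirect: the user asks for flights without naming any tool.
- negative: the user asks something travel-adjacent that should NOT call the tool.
For direct/indirect: ideal = {"should_call_tool": true, "expected_tool": "expedia.search_flights"}.
For negative: ideal = {"should_call_tool": false}."""

response = client.chat.completions.create(
    model="gpt-4o",  # any capable model works here
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Generate 10 examples of each type."},
    ],
    response_format={"type": "json_object"},  # force parseable JSON output
)
examples = json.loads(response.choices[0].message.content)["examples"]

# Write one JSON object per line -- the JSONL format shown above.
with open("golden_prompt_set.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

Review the generated prompts by hand before using them as ground truth; the point of a golden prompt set is that a human has verified each label.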
How to Evaluate ChatGPT App Performance with a Golden Prompt Set
Let’s say you’re using your golden prompt set to evaluate how well the model knows when to call Expedia’s flight search action for your ChatGPT App, and when not to.
Each example in your golden prompt set eval dataset is labeled with the correct action (e.g., `expedia.search_flights`, none, or another tool).
Create a script (or use the OpenAI Evals tool) to run the model against each query in your golden prompt set. Then, compare the results against your `ideal` output to measure your ChatGPT app metadata optimization performance.
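Here’s a minimal sketch of such a script, assuming you approximate ChatGPT’s routing by calling the OpenAI API directly with a function-calling schema attached. The schema below is hypothetical (API function names can’t contain dots, so `expedia.search_flights` becomes `expedia_search_flights` here), and API-level tool calling approximates, rather than reproduces, how ChatGPT selects published apps.

```python
import json

from openai import OpenAI

client = OpenAI()

# Hypothetical schema mirroring the expedia.search_flights action.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "expedia_search_flights",  # dots aren't allowed in API function names
        "description": "Search for flights between two cities on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["origin", "destination", "date"],
        },
    },
}]

results = []
with open("golden_prompt_set.jsonl") as f:
    for line in f:
        example = json.loads(line)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": example["query"]}],
            tools=TOOLS,
        )
        message = response.choices[0].message
        called = bool(message.tool_calls)  # did the model decide to call a tool?
        tool = message.tool_calls[0].function.name if called else None
        results.append({**example, "called": called, "tool": tool})
```

Each entry in `results` now pairs the ground-truth label with what the model actually did, which is everything you need for the metrics below.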
Golden Prompt Set Evals: Key Terms
| Term | Meaning |
|---|---|
| True Positive (TP) | The model did call `expedia.search_flights`, and that was correct. |
| False Positive (FP) | The model called `expedia.search_flights`, but shouldn’t have. |
| False Negative (FN) | The model didn’t call `expedia.search_flights`, but should have. |
| True Negative (TN) | The model correctly didn’t call this tool when another tool (or none) was right. |
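To make these labels concrete, here’s one way to classify each result from the eval script above against its ground truth. The `norm` helper is an assumption to reconcile the dotted tool name in the dataset with the underscored API function name.

```python
from collections import Counter

def norm(name):
    # Map "expedia.search_flights" to "expedia_search_flights" for comparison.
    return (name or "").replace(".", "_")

def classify(example, called, tool, target="expedia_search_flights"):
    """Label one eval result as TP, FP, FN, or TN for a single target tool."""
    ideal = example["ideal"]
    should_call = ideal["should_call_tool"] and norm(ideal.get("expected_tool")) == target
    did_call = called and norm(tool) == target

    if should_call and did_call:
        return "TP"  # called the target tool, and that was correct
    if did_call:
        return "FP"  # called the target tool, but shouldn't have
    if should_call:
        return "FN"  # should have called the target tool, but didn't
    return "TN"      # correctly left the target tool alone

# Tally the results from the eval script sketched earlier.
counts = Counter(classify(r, r["called"], r["tool"]) for r in results)
```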
Interpreting ChatGPT App Action-Calling Metrics
Using your true and false positives and negatives, you can compute summary statistics about how your tool performed:
| Metric | Formula | What It Tells You |
|---|---|---|
| Precision | TP ÷ (TP + FP) | Of all the times the model used Expedia search flights, how many were correct? |
| Recall | TP ÷ (TP + FN) | Of all the times Expedia search flights should’ve been used, how often did the model actually use it? |
| F1 Score | 2 × (Precision × Recall) ÷ (Precision + Recall) | A balanced measure of being accurate and complete. |
| Accuracy | (TP + TN) ÷ (TP + FP + FN + TN) | Overall correctness; how often the model got it right. |
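A small helper to compute all four metrics from the tallied counts (a sketch that pairs with the `classify` function above):

```python
def metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```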
Example Metric Calculation
Suppose you test 10 prompts:
| Situation | Count |
|---|---|
| Should have called Expedia, and did | 4 |
| Should have called Expedia, but didn’t | 1 |
| Shouldn’t have called Expedia, but did | 2 |
| Shouldn’t have called Expedia, and didn’t | 3 |
Compute:
- Precision = 4 ÷ (4 + 2) = 0.67
- Recall = 4 ÷ (4 + 1) = 0.80
- F1 = 2 × (0.67 × 0.80) ÷ (0.67 + 0.80) ≈ 0.73
- Accuracy = (4 + 3) ÷ 10 = 0.70
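Plugging those counts into the `metrics` helper sketched above reproduces the same numbers:

```python
metrics(tp=4, fp=2, fn=1, tn=3)
# {'precision': 0.667, 'recall': 0.8, 'f1': 0.727, 'accuracy': 0.7}  (rounded)
```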
Here, we see that precision is a little low: the tool is sometimes being called when it shouldn’t be. This tells us we should consider adjusting our Expedia flight search tool metadata to be more narrowly scoped for better ChatGPT app discovery.
We can then look at specific prompts from our golden prompt set to inform updates to our OpenAI app metadata and tool descriptions, then run this same evaluation again to see if our ChatGPT app metadata optimization metrics improve.
Interpreting Golden Prompt Set Evals
- High precision, low recall: The model only calls Expedia when it’s really sure, but misses valid flight requests.
- Low precision, high recall: The model calls Expedia often, even when it shouldn’t.
- High F1: It’s striking the right balance.
- High accuracy: It’s overall good at deciding when to use Expedia vs. another tool.
In short
- Precision = How careful is the model when using Expedia?
- Recall = How good is the model at spotting flight requests?
- F1 = Does the model balance both well?
- Accuracy = How often is the model right overall?
Optimizing ChatGPT App Metadata with Golden Prompt Set Metrics
Once you have these golden prompt set metrics in hand, you can make adjustments to your OpenAI App metadata to better optimize your ChatGPT app discovery performance.
For example, if your ChatGPT App has low precision and high recall, your tool is being called even when it shouldn’t be. Review your OpenAI app metadata and consider adding ‘DO NOT USE’ or ‘DO NOT CALL’ caveats to your tool descriptions. If you see the opposite (high precision, low recall), consider removing some of those limits or adding more keyword terms to your app description to improve discovery.
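As a hypothetical illustration (the field names below follow a generic tool-definition shape, not any official Apps SDK schema), a low-precision flight-search description might be tightened like this:

```json
{
  "name": "expedia.search_flights",
  "description": "Search bookable flights between two airports on a specific date. Use ONLY when the user asks for flight options or prices. DO NOT USE for general travel questions, destination facts, or hotel searches."
}
```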
P.S. The Bullseye is building out a tool for metadata optimization based on this process; if that sounds interesting, reach out to connect with us.
Additional Considerations for ChatGPT App Metadata Optimization
In addition to golden prompt set metrics like accuracy and precision, you should also monitor tool errors and latency when optimizing your OpenAI app metadata. Aim to optimize these metrics with the shortest descriptions possible; token efficiency matters for ChatGPT app metadata optimization, so making descriptions longer isn’t always the best way to improve performance.
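For example, the eval script sketched earlier could be extended to record per-call latency and errors (a minimal sketch; the `example` and `TOOLS` names come from that script):

```python
import time

start = time.perf_counter()
try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": example["query"]}],
        tools=TOOLS,
    )
    error = None
except Exception as exc:  # log failures instead of aborting the whole eval run
    response, error = None, str(exc)
latency_s = time.perf_counter() - start
```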
Finally, golden prompt set evaluation should be paired with UI testing and evaluation to ensure your ChatGPT App is responsive and provides a good user experience.
Further Reading on ChatGPT App Metadata Optimization and Golden Prompt Sets
If you want to learn more about golden prompt sets and ChatGPT app metadata optimization, check out the following resources: