Best practices

Follow these recommendations to get the most value from Testing at Scale.

Test cases

When creating test cases, focus on realistic scenarios that reflect actual end-user inquiries:

  • Source from real conversations: Use inquiries from live customer conversations to create realistic test cases.
  • Use descriptive titles: Give each test case a unique, recognizable name that describes the scenario.
  • Organize by type: Consider naming test cases by scenario type (for example, policy questions, troubleshooting, compliance) for easier management.
  • Include context: Set appropriate variables, language, and channel to match real end-user conditions.
  • Test edge cases: Include scenarios that test boundary conditions and compliance-sensitive situations.
  • Re-run regularly: Run test cases periodically to monitor changes and catch regressions early.
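
To make these recommendations concrete, here is a minimal, illustrative sketch of how a test case could be represented. It is not the product's actual schema; the `TestCase` class and its field names are assumptions chosen to mirror the points above.

```python
# Illustrative only (not the product's schema): one way to keep titles,
# scenario types, and end-user context explicit so test cases stay easy
# to recognize, organize, and re-run.
from dataclasses import dataclass, field

@dataclass
class TestCase:
    title: str                      # unique, recognizable name for the scenario
    scenario_type: str              # e.g. "policy question", "troubleshooting", "compliance"
    inquiry: str                    # ideally sourced from a real customer conversation
    language: str = "en"            # match real end-user conditions
    channel: str = "chat"           # e.g. chat, email, messaging
    variables: dict = field(default_factory=dict)          # context such as an order ID
    expected_outcomes: list = field(default_factory=list)  # one criterion per entry

# An edge case sourced from a real conversation and named by scenario type.
refund_edge_case = TestCase(
    title="Compliance - refund requested after the 14-day window",
    scenario_type="compliance",
    inquiry="I bought this three weeks ago, can I still get my money back?",
    variables={"order_id": "order_1"},
    expected_outcomes=[
        'Agent must state that the refund window is "14 days" after payment.',
        "Agent must offer to escalate to a human agent.",
    ],
)
```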

Expected outcomes

Expected outcomes determine how the AI Agent’s response is evaluated. Clear, well-structured criteria improve evaluation consistency and make failures easier to diagnose.

General principles

  • Define expected outcomes per test case: Write specific criteria for each individual test case rather than applying the same expectations across multiple test cases. Even when test cases cover the same topic (for example, order returns), each scenario may involve different contexts and require different outcomes.
  • Ensure binary evaluation: Each criterion must be assessable as strictly pass or fail. Avoid vague criteria like the response is polite or the answer is complete without defining what that looks like objectively.
  • One idea per criterion: Each criterion should capture a single outcome. Combining multiple requirements into one criterion reduces interpretability and makes failures harder to diagnose.
  • Describe outcomes, not exact phrasing: Focus on what the AI Agent should communicate, not the exact wording. However, quote specific values, numbers, dates, time frames, and branded terms when they must be exact (for example, Agent must explain that “Order_UD10294” will be refunded in “2-3 business days”).
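
As a rough illustration of binary, single-idea criteria, the sketch below treats each expected outcome as an independent string that is judged pass or fail on its own. The `judge_criterion` helper is a hypothetical placeholder for whatever evaluation actually runs; the point is the structure, not the implementation.

```python
# Illustrative only: each criterion captures one outcome and gets its own
# pass/fail verdict, so a failing test case points at exactly which
# expectation was missed.

def judge_criterion(response: str, criterion: str) -> bool:
    """Hypothetical placeholder for the evaluator: returns True only if the
    response clearly satisfies the single outcome described by `criterion`."""
    raise NotImplementedError

def evaluate(response: str, criteria: list[str]) -> dict[str, bool]:
    # One verdict per criterion; no partial credit and no combined criteria.
    return {criterion: judge_criterion(response, criterion) for criterion in criteria}

criteria = [
    'Agent must explain that "Order_UD10294" will be refunded in "2-3 business days".',
    "Agent must not claim it can modify billing or payment data.",
]
```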

Specificity and precision

  • Use quotes for exact values: When specific numbers, dates, or terms must be exact, put them in quotes.

    • ✅ Good: Agent must state that the refund window is “14 days” after payment.
    • ❌ Bad: Agent should mention the refund period.
  • Specify “ALL” when completeness matters: Make it clear when every item in a list is required.

    • ✅ Good: Agent must provide ALL the following steps: check eligibility, verify account, confirm details.
    • ❌ Bad: Agent should provide troubleshooting steps.
  • Quote branded terms and technical language: When specific terminology must be used (brand names, technical terms, legal language), make it explicit.

    • ✅ Good: Agent must refer to the feature as “Product Name®” when explaining the service.

Behavioral and limitation criteria

When testing security, privacy, or out-of-scope scenarios, define what the AI Agent must not do:

  • ✅ Good: The response must not imply it can access, retrieve, or modify personal or transactional data such as orders, billing, or payments.
  • ✅ Good: Agent must not repeat, spell out, or reproduce any offensive or policy-violating language provided by the user.

Always specify acceptable alternatives when defining what the AI Agent cannot do:

  • ✅ Good: Agent must not process the refund directly. Acceptable responses explain its limitation or offer to escalate to a human agent.

Expected outcomes for Actions

When an Action’s outcome (the values it returns) is specific enough to evaluate, focus on that outcome rather than on which tool was selected:

  • ✅ Good: Agent reports that the booking under username “user123” is for Hotel Retreat with check-in “October 5, 2025” and check-out “October 10, 2025”.

Exception: If the outcome is too generic (for example, “Your order has been cancelled”), explicitly include both the confirmation and the Action:

  • ✅ Good: Agent confirms to the user that order_1 was successfully cancelled.

Expected outcomes for informational responses

Simulated responses may vary slightly in wording while delivering the same information. Focus on semantic content, not exact phrasing:

  • ✅ Good: Agent explains that tax-loss harvesting aims to reduce tax liability by selling securities at a loss and purchasing similar securities.
  • ❌ Bad: Agent must say exactly: “Tax-loss harvesting reduces your tax liability by selling securities at a loss.”
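
The toy comparison below illustrates why semantic criteria hold up better than exact-wording criteria. The concept check here is a naive keyword match used purely for illustration; the real evaluation is semantic, so treat this only as a sketch of the contrast.

```python
# Illustrative only: exact-wording checks break on harmless rephrasing,
# while concept-level checks still pass when the meaning is preserved.
response = ("Tax-loss harvesting lets you lower your tax bill by selling "
            "securities at a loss and buying similar securities to keep "
            "your portfolio balanced.")

# Brittle: fails whenever the wording differs, even if the meaning is right.
exact_pass = response == ("Tax-loss harvesting reduces your tax liability "
                          "by selling securities at a loss.")

# Closer to the recommended style: check that each required concept is covered.
required_concepts = ["tax", "selling securities at a loss", "similar securities"]
semantic_pass = all(concept in response.lower() for concept in required_concepts)

print(exact_pass, semantic_pass)  # False True
```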

Quick reference

| Do | Don’t |
| --- | --- |
| Quote exact values, numbers, dates, and branded terms | Require exact wording unless terminology is critical |
| Specify “ALL” when every item is required | Combine multiple criteria into one field |
| State what agent must NOT do for security/privacy tests | Forget to specify what SHOULD happen alongside what shouldn’t |
| Specify both what should and shouldn’t happen | Assume the evaluator knows your interpretation of “polite” or “empathetic” |

Templates for common patterns

Use these templates as starting points for writing expected outcomes:

Procedural steps:

Agent must provide the following steps to [ACTION]:
[Step 1]
[Step 2]
[Step 3]

Limitation or refusal:

The response must not imply it can [PROHIBITED ACTION].
Agent must not pretend to have [PROHIBITED BEHAVIORS].
Acceptable responses explain its limitation or offer to escalate to a human agent.

Information coverage:

Agent must explain that [CONCEPT/FACT 1].
Agent must state that [CONCEPT/FACT 2].
Agent must provide [SPECIFIC INFORMATION like URL, number, timeframe].

Action outcome:

Agent reports [SPECIFIC DATA POINTS FROM RETURN VALUE], including [FIELD 1], [FIELD 2], and [FIELD 3].

Playbook interaction:

Agent asks the customer [CLARIFYING QUESTION].
Agent requests [REQUIRED INPUT like order_id, account number].

Troubleshooting failed test cases

If a test case fails unexpectedly, follow these steps to investigate:

  1. Review the expected outcome: Ensure the expected outcome follows best practices—specific, binary, and targeting single-turn behavior.
  2. Check the response source: In the test results, click into the referenced generative entity (Knowledge article, Playbook, or Action) to verify the expected content is available.
  3. Identify content gaps: If the correct information is missing, update existing entities or create new ones with the required content.
  4. Apply Coaching: If the correct content exists but the AI Agent selected the wrong entity, use Coaching to guide the AI Agent toward the appropriate behavior:
    1. Re-create the test conversation in the test widget.
    2. Open the resulting test_user conversation in the Conversations view.
    3. Apply Coaching to guide the AI Agent toward the correct response.
  5. Re-run the test: After making changes, re-run the test case to verify the improvement.