🎥 Watch the step-by-step demo: Evaluating your App
In this lab, you’ll use the built-in evaluation harness in the gpt-rag orchestrator template to measure how well your agent answers real questions from your knowledge base (e.g., the Contoso Electronics Employee Handbook). The template includes an evaluations folder with a Python script (evaluate.py) and a test dataset (golden-dataset.jsonl). You will run that script from PowerShell (or a compatible shell), inspect the JSON output locally, and then review the same results in the Azure AI Foundry portal under your AI Foundry project’s Evaluations tab.
- Locate and inspect evaluation code and data in your cloned repository
- Install all Python dependencies and load environment variables correctly
- Run the evaluation script and inspect its JSON output
- Understand the key metrics produced (e.g., similarity scores)
- Open the AI Foundry project in Azure Portal and view the evaluation entry
Prerequisites
- Bootstrap: Complete the bootstrapping lab and have a running environment.
- Prototyping and Building: Finish the prototyping and building labs.
- Tools:
  - Visual Studio Code (or your preferred editor).
  - Azure CLI installed and authenticated (`az login`).
  - PowerShell (Windows) or a compatible shell.
Before running evaluations, update your agent’s chat model to ensure consistency with the lab assumptions. In the AI Foundry portal, update the model deployment used by your agent to chat.
- In the Azure Portal, navigate to your AI Foundry project.
- Go to the Agents page and select the agent you created.
- On the agent page, select the chat model deployment and save it.
- Open a terminal and navigate to your local repository:

  ```shell
  cd workspace/contoso-orchestrator
  ```

- Ensure you are on the correct branch (e.g., genaiops-workshop):

  ```shell
  git switch genaiops-workshop
  ```

- Open the project in Visual Studio Code (or your editor):

  ```shell
  code .
  ```
- Verify you see:
  - evaluations/generate_eval_input.py: generates evaluation input data.
  - evaluations/evaluate.py: runs the AI Foundry evaluation.
  - evaluations/dataset/golden-dataset.jsonl: JSON Lines file with sample queries and ground truth.
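A JSON Lines dataset holds one self-contained JSON object per line. As a rough illustration only, the sketch below parses two hypothetical lines in that shape; the field names (`query`, `ground_truth`) and the sample content are assumptions for illustration, so open golden-dataset.jsonl to confirm the actual format before relying on it.

```python
import json

# Hypothetical lines in the style of a golden dataset -- the real file's
# field names and content may differ; check golden-dataset.jsonl itself.
sample_lines = [
    '{"query": "How many vacation days do employees get?", "ground_truth": "Employees receive 20 vacation days per year."}',
    '{"query": "What is the device return policy?", "ground_truth": "Devices must be returned within 30 days."}',
]

# JSON Lines: each line is parsed independently as one JSON object.
records = [json.loads(line) for line in sample_lines]
for rec in records:
    print(rec["query"], "->", rec["ground_truth"])
```

Because every line is independent, the dataset can be appended to or streamed without reparsing the whole file.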
- Open these files to familiarize yourself with the data format and evaluation logic.
  - In generate_eval_input.py, inspect how test queries and expected responses are formatted and prepared.
  - In evaluate.py, review which evaluators and metrics are used (e.g., similarity evaluators).
Tip: Understanding the metrics helps interpret results and adjust your agent if needed.
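To build intuition for what a similarity score captures, here is a deliberately naive token-overlap F1 metric. This is a stand-in illustration written for this lab, not the similarity evaluator evaluate.py actually uses (AI Foundry's evaluators are typically model-based); it only shows the general idea of comparing a response against ground truth.

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Naive token-overlap F1 -- an illustration of 'similarity',
    not the evaluator used by evaluate.py."""
    resp = response.lower().split()
    gt = ground_truth.lower().split()
    if not resp or not gt:
        return 0.0
    # Count tokens present in both strings (multiset intersection).
    overlap = sum((Counter(resp) & Counter(gt)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(gt)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Employees get 20 vacation days",
               "Employees receive 20 vacation days per year"))  # 2/3
```

A model-based similarity evaluator serves the same purpose but tolerates paraphrasing, which simple token overlap penalizes.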
- Azure CLI Login: Ensure you are authenticated:

  ```shell
  az login
  ```

- Set environment variables: Export the App Configuration endpoint that the evaluation scripts expect.

  In PowerShell (Windows):

  ```powershell
  $Env:APP_CONFIG_ENDPOINT = "https://<your-app-config-name>.azconfig.io"
  ```

  Or in Bash (macOS/Linux):

  ```shell
  export APP_CONFIG_ENDPOINT="https://<your-app-config-name>.azconfig.io"
  ```
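A script that depends on this variable would typically read it from the environment and fail fast with a clear message if it is unset. The sketch below uses the variable name from this lab, but the validation logic is illustrative and not copied from evaluate.py:

```python
import os

def get_app_config_endpoint() -> str:
    # Read the endpoint exported above; fail fast if it is missing so the
    # error surfaces before any evaluation work starts.
    endpoint = os.environ.get("APP_CONFIG_ENDPOINT")
    if not endpoint:
        raise RuntimeError(
            "APP_CONFIG_ENDPOINT is not set. Export it before running the "
            "evaluation, e.g. "
            'export APP_CONFIG_ENDPOINT="https://<your-app-config-name>.azconfig.io"'
        )
    return endpoint
```

Failing early like this is useful because a missing endpoint otherwise tends to surface as a confusing connection error deep inside the run.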
- Run the evaluation script:
  - In PowerShell (Windows):

    ```powershell
    .\evaluations\evaluate.ps1
    ```

  - In Bash (macOS/Linux):

    ```shell
    ./evaluations/evaluate.sh
    ```
- The script generates the evaluation input and submits the evaluation run to AI Foundry.
- The script typically prints a URL or ID to view the evaluation run in the AI Foundry portal.
Note: Evaluation may take a few minutes. Be patient and monitor console output for any errors.
- Portal Inspection: Either click the link provided by the script output to go directly to the evaluation page in the AI Foundry portal, or:
  - In the Azure Portal, navigate to your AI Foundry project.
  - Select the Evaluations tab.
  - Locate the latest evaluation entry (by timestamp or ID printed by the script).
- Click into the entry and review summary metrics.
- Go to the Data (or Details) section to see per-query results and any additional diagnostic information.
Congratulations! You have successfully run an evaluation of your GenAI App and examined the results both locally and in the AI Foundry Portal. Next up: Lab – Automating Deployment with CI/CD.
