
[Feature Request] Visualizing the evaluations should look different from the promptflow traces, should provide some kind of data visualization #3492

Open
tyler-suard-parker opened this issue Jul 2, 2024 · 5 comments
Labels
enhancement New feature or request


@tyler-suard-parker
Contributor

Is your feature request related to a problem? Please describe.
Right now, when we visualize the evaluations, it is not easy to understand the results. For example, the result of visualizing in the notebook promptflow\examples\flex-flows\chat-async-stream\chat-stream-with-async-flex-flow.ipynb looks like this:

[image: screenshot of the current trace-style visualization]

It is not easy to see which evals failed and which succeeded, or the proportion of successes vs. failures.

Describe the solution you'd like
It would be nice to have a clearer visualization for the evaluations, because their purpose is different from that of the traces. For an evaluation we usually just want a simple pass/fail, whereas for a trace we want the full details. Here is an example:

eval report.zip
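
For a rough idea of the numbers behind such a report, something like this could be computed from the evaluation run (a sketch only, assuming promptflow's PFClient and get_details; the run name and the outputs.correct column are hypothetical and depend on what the evaluation flow returns):

from promptflow.client import PFClient  # older versions: from promptflow import PFClient

pf = PFClient()
eval_run = pf.runs.get("<eval-run-name>")  # hypothetical run name

# get_details returns a pandas DataFrame with one row per evaluated line
details = pf.get_details(eval_run)

# "outputs.correct" is hypothetical: use whatever boolean column your evaluation flow emits
passed = int(details["outputs.correct"].sum())
total = len(details)
print(f"{passed}/{total} evaluations passed ({passed / total:.0%})")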

@tyler-suard-parker added the enhancement (New feature or request) label Jul 2, 2024
@zhengfeiwang
Contributor

Thank you for your suggestion! Adding a screenshot of your example below:

[image: screenshot of the suggested evaluation report example]

@tyler-suard-parker one thing I'd like to confirm: at which step do you get the trace UI page above? I see there are two runs in the URL, so I guess you are getting this from the line pf.visualize([base_run, eval_run])?

If so, how about changing it to pf.visualize(base_run) to see if it looks better? The evaluation run's results will be appended to the corresponding lines. Maybe we should also update our notebook there; pf.visualize used to behave differently and only recently switched to leveraging the trace UI.
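
To make the comparison concrete, here is a minimal sketch of the two calls (the import path and the runs.get lookup are assumptions based on the promptflow Python SDK; the run names are placeholders, and in the notebook base_run and eval_run come from earlier pf.run(...) calls):

from promptflow.client import PFClient  # older versions: from promptflow import PFClient

pf = PFClient()

# Placeholder run names; in the notebook these objects come from pf.run(...)
base_run = pf.runs.get("<base-run-name>")
eval_run = pf.runs.get("<eval-run-name>")

# Opens the trace UI with both runs in the URL (the view in the screenshot above)
pf.visualize([base_run, eval_run])

# Opens the view for the base run only; the evaluation run's results are
# appended as extra columns on the corresponding lines
pf.visualize(base_run)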

@tyler-suard-parker
Contributor Author

Yes, I am getting this from the line pf.visualize([base_run, eval_run]). I will try using pf.visualize(base_run) and let you know what happens.

I'm glad you like my suggestion; note that in my example you can click on each question to expand it. Having the traces as you already do is nice, but it would also help to have some kind of quick summary I can look at just to make sure all my evaluations came out OK. For example, a bar chart for each input-output pair showing correctness, etc., with an explanation available when you click on a bar.

@tyler-suard-parker
Contributor Author

tyler-suard-parker commented Jul 3, 2024

I tried running pf.visualize(base_run) and got this. When I enabled the metrics column it looked a little better, but there is still a lot of information I don't need when I'm doing an evaluation:
[image: screenshot of pf.visualize(base_run) with the metrics column enabled]

I use evaluations as unit tests for my prompt engineering. I have 10 standard questions I ask. Every time I change one of my agent prompts, I run those 10 standard questions again as part of my CI/CD tests and check the test report, just to make sure none of my changes caused any incorrect answers. When I'm doing this on every commit, I don't have time to read through all the traces. It would be nice to have a single diagram that shows how the entire evaluation batch went.
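
Roughly, the kind of check I have in mind for CI could look like this (a sketch only; the run name, metric name, and threshold are hypothetical, and it assumes promptflow's PFClient with get_metrics and an evaluation flow that logs a correctness metric):

import sys

from promptflow.client import PFClient  # older versions: from promptflow import PFClient

pf = PFClient()

eval_run = pf.runs.get("<eval-run-name>")  # hypothetical run name
metrics = pf.get_metrics(eval_run)         # e.g. {"correctness": 0.9}

THRESHOLD = 1.0  # hypothetical: require all 10 standard questions to pass
score = metrics.get("correctness", 0.0)
print(f"correctness: {score:.2f} (threshold {THRESHOLD})")

if score < THRESHOLD:
    sys.exit(1)  # fail the CI job so the commit gets flagged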

@tyler-suard-parker
Contributor Author

Something similar to this:
[image: example of a summary chart for an evaluation batch]

@zhengfeiwang
Contributor

zhengfeiwang commented Jul 4, 2024

Thank you for trying that out, and for the description of your scenario! Yes, I think something like a report would serve you better and be more intuitive; the trace UI page does not support that well for now.

Engaging PM Chenlu @jiaochenlu on this topic.
