
[Feature Request] Visualizing the evaluations should look different from the promptflow traces, should provide some kind of data visualization #3492

Open
tyler-suard-parker opened this issue Jul 2, 2024 · 5 comments
Labels
enhancement New feature or request


@tyler-suard-parker
Contributor

Is your feature request related to a problem? Please describe.
Right now, when we visualize the evaluations, it is not easy to understand the results. For example, the result of visualizing in the notebook promptflow\examples\flex-flows\chat-async-stream\chat-stream-with-async-flex-flow.ipynb looks like this:

[image: screenshot of the current trace-style visualization]

It is not easy to see which evals failed and which succeeded, or the proportion of successes vs. failures.

Describe the solution you'd like
It would be nice to have a clearer visualization for the evaluations, because their purpose is different from that of the traces. For an evaluation we usually just want a simple pass/fail, whereas for a trace we want the full details. Here is an example:

eval report.zip
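
For a rough idea of the numbers behind such a report, something like this could be computed from the evaluation run (a sketch only, assuming promptflow's PFClient and get_details; the run name and the outputs.correct column are hypothetical and depend on what the evaluation flow returns):

from promptflow.client import PFClient  # older versions: from promptflow import PFClient

pf = PFClient()
eval_run = pf.runs.get("<eval-run-name>")  # hypothetical run name

# get_details returns a pandas DataFrame with one row per evaluated line
details = pf.get_details(eval_run)

# "outputs.correct" is hypothetical: use whatever boolean column your evaluation flow emits
passed = int(details["outputs.correct"].sum())
total = len(details)
print(f"{passed}/{total} evaluations passed ({passed / total:.0%})")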

@tyler-suard-parker added the enhancement (New feature or request) label Jul 2, 2024
@zhengfeiwang
Contributor

Thank you for your suggestion! Adding a screenshot of your example below:

[image: screenshot of the suggested evaluation report example]

@tyler-suard-parker one thing I'd like to confirm: at which step do you get the trace UI page above? I see there are two runs in the URL, so I guess you are getting this from the line pf.visualize([base_run, eval_run])?

If so, how about changing it to pf.visualize(base_run) to see if it looks better? The evaluation run's results will be appended to the corresponding lines. Maybe we should also update our notebook there; pf.visualize used to behave differently and only recently switched to leveraging the trace UI.
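
To make the comparison concrete, here is a minimal sketch of the two calls (the import path and the runs.get lookup are assumptions based on the promptflow Python SDK; the run names are placeholders, and in the notebook base_run and eval_run come from earlier pf.run(...) calls):

from promptflow.client import PFClient  # older versions: from promptflow import PFClient

pf = PFClient()

# Placeholder run names; in the notebook these objects come from pf.run(...)
base_run = pf.runs.get("<base-run-name>")
eval_run = pf.runs.get("<eval-run-name>")

# Opens the trace UI with both runs in the URL (the view in the screenshot above)
pf.visualize([base_run, eval_run])

# Opens the view for the base run only; the evaluation run's results are
# appended as extra columns on the corresponding lines
pf.visualize(base_run)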

@tyler-suard-parker
Contributor Author

Yes, I am getting this from the line pf.visualize([base_run, eval_run]). I will try using pf.visualize(base_run) and let you know what happens.

I'm glad you like my suggestion; note that in my example you can click on each question to expand it. Having the traces as you already do is nice, but it would also help to have some kind of quick summary I can look at just to make sure all my evaluations came out OK. For example, a bar chart for each input-output pair showing correctness, etc., with an explanation available when you click on a bar.

@tyler-suard-parker
Contributor Author

tyler-suard-parker commented Jul 3, 2024

I tried running pf.visualize(base_run) and got this. When I enabled the metrics column it looked a little better, but there is still a lot of information I don't need when I'm doing an evaluation:
[image: screenshot of pf.visualize(base_run) with the metrics column enabled]

I use evaluations as unit tests for my prompt engineering. I have 10 standard questions I ask. Every time I change one of my agent prompts, I run those 10 standard questions again as part of my CI/CD tests and check the test report, just to make sure none of my changes caused any incorrect answers. When I'm doing this on every commit, I don't have time to read through all the traces. It would be nice to have a single diagram that shows how the entire evaluation batch went.
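
Roughly, the kind of check I have in mind for CI could look like this (a sketch only; the run name, metric name, and threshold are hypothetical, and it assumes promptflow's PFClient with get_metrics and an evaluation flow that logs a correctness metric):

import sys

from promptflow.client import PFClient  # older versions: from promptflow import PFClient

pf = PFClient()

eval_run = pf.runs.get("<eval-run-name>")  # hypothetical run name
metrics = pf.get_metrics(eval_run)         # e.g. {"correctness": 0.9}

THRESHOLD = 1.0  # hypothetical: require all 10 standard questions to pass
score = metrics.get("correctness", 0.0)
print(f"correctness: {score:.2f} (threshold {THRESHOLD})")

if score < THRESHOLD:
    sys.exit(1)  # fail the CI job so the commit gets flagged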

@tyler-suard-parker
Contributor Author

Something similar to this:
[image: example of a summary chart for an evaluation batch]

@zhengfeiwang
Contributor

zhengfeiwang commented Jul 4, 2024

Thank you for trying that out, and for the description of your scenario! Yes, I think something like a report would serve you better and be more intuitive; the trace UI page does not support that well for now.

Engaging PM Chenlu @jiaochenlu on this topic.
