[BUG] Evaluate on test dataset using evaluate() with SimilarityEvaluator returns NaN #3381

Open · bhonris opened this issue Jun 6, 2024 · 3 comments
Labels: bug (Something isn't working)

bhonris commented Jun 6, 2024

Describe the bug
When running an evaluation on a dataset with evaluate() and the similarity evaluator, I have come across scenarios where the result is NaN (not a number).
How To Reproduce the bug
Model config
{"azure_deployment": "gpt4-turbo-preview", "api_version": "2024-02-01"}
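
For context, a minimal sketch of how such a model config is typically constructed for promptflow-evals; the endpoint and key values are placeholders, not taken from the report:

from promptflow.core import AzureOpenAIModelConfiguration

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-api-key>",                                   # placeholder
    azure_deployment="gpt4-turbo-preview",
    api_version="2024-02-01",
)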
jsonl file
{"Question":"How can you get the version of the Kubernetes cluster?","Answer":"{\"code\": \"kubectl version\" }","output":"{code: kubectl version --output=json}"}
Evaluate Config

# Imports as of promptflow-evals 0.3.0.
from promptflow.evals.evaluate import evaluate
from promptflow.evals.evaluators import SimilarityEvaluator

result = evaluate(
    data="testdata2.jsonl",
    evaluators={
        "similarity": SimilarityEvaluator(model_config)
    },
    evaluator_config={
        "default": {
            # Map dataset columns to the evaluator's inputs.
            "question": "${data.Question}",
            "answer": "${data.output}",
            "ground_truth": "${data.Answer}"
        }
    }
)

Expected behavior
A numeric similarity score is returned.

Running Information (please complete the following information):

  • Promptflow Package Version using pf -v:
{
  "promptflow": "1.1.1",
  "promptflow-azure": "1.11.0",
  "promptflow-core": "1.11.0",
  "promptflow-devkit": "1.11.0",
  "promptflow-evals": "0.3.0",
  "promptflow-tracing": "1.11.0"
}
  • Operating System: Windows 11
  • Python Version using python --version: 3.10.11

Additional context

  • Checking the value actually logged in _similarity.py suggests the evaluator returned the string 'The' rather than a numeric score.
  • I notice this issue usually occurs when the answer does not match what the LLM's response to the question would be, for example: {"Question": "What is the capital of France?", "Answer": "Washington DC"}. A sketch of how such a non-numeric reply becomes NaN follows below.
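
The NaN is consistent with the evaluator coercing the model's raw reply to a float. A minimal sketch of that failure mode, where parse_score is a hypothetical stand-in for the parsing logic and not the actual code in _similarity.py:

def parse_score(llm_output: str) -> float:
    # Hypothetical: interpret the model's reply as a number, else NaN.
    try:
        return float(llm_output)
    except ValueError:
        return float("nan")

print(parse_score("4"))    # 4.0 -- model followed the 1-5 rating format
print(parse_score("The"))  # nan -- model replied with prose instead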
bhonris added the bug label Jun 6, 2024
bhonris (Author) commented Jun 6, 2024

I have added the following text to similarity.prompty: "You will respond with a single digit number between 1 and 5. You will include no other text or information." This seems to fix the issue.
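
A complementary code-side mitigation, purely illustrative and not the library's implementation, would be to extract the first rating digit from the reply before converting, so prose around the number no longer breaks parsing:

import re

def extract_score(llm_output: str) -> float:
    # Illustrative only: take the first digit 1-5 anywhere in the reply,
    # so "The similarity score is 4." still yields 4.0.
    match = re.search(r"[1-5]", llm_output)
    return float(match.group()) if match else float("nan")

print(extract_score("The similarity score is 4."))  # 4.0
print(extract_score("The"))                         # nan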

brynn-code (Contributor) commented:

Hi @singankit and @luigiw, could you please help take a look at this issue?

luigiw (Member) commented Jun 14, 2024

@bhonris, thank you for reporting the issue and sharing a workaround. It is a known issue that some preview OpenAI models can produce NaN results. Please also try with stable (non-preview) model versions.
