Part 1: Evaluating VLM Accuracy on P&ID Extraction - Metrics for Quantitative Evaluation

Author: Law May, Co-founder of Rosary Labs

Date: 22nd January 2026

Summary

Improving VLMs for P&ID equipment tag extraction from prototype to production requires systematic evaluation. We demonstrate how classification metrics such as Accuracy, Precision, Recall, and F1-Score can be applied to engineering diagram analysis by treating each extracted tag as a binary classification.

The critical challenge is balancing False Positives (hallucinated extractions) against False Negatives (missed extractions). Through simple reasoning, we arrive at the F1-score, with a slight tilt toward precision, as our preferred metric for quantitative evaluation.

How do we evaluate the effectiveness of a VLM?

In our previous post, Can Vision-Language Models Reliably Extract Equipment Tags from P&IDs?, we only briefly described how evaluation was performed. In this series, we deep-dive into evaluation in three parts. Part 1: Metrics for Quantitative Evaluation, Part 2: Implementing Quantitative Evaluation, and Part 3: Qualitative Evaluation.

Consider this scenario: you've uploaded your Piping & Instrumentation Diagrams (P&IDs) into ChatGPT, but you realised that the output is not great. You keep tweaking your prompts, but how do you properly track which prompt works and which does not?

Alternatively, you decide to pre-process the PDFs before uploading them to ChatGPT, whether by cropping specific regions or increasing image contrast. You've put together a simple prototype for P&ID extraction. But how do you advance from a 10% functional prototype to a 90% functional tool?

There are typically two stages of evaluation:

  1. Quantitative evaluation - benchmarking the outcome of the tool against an expected outcome
  2. Qualitative evaluation - user testing to evaluate the tool more holistically

Let's dive in!

Metrics for Quantitative evaluation

Let's consider a classic classification problem in computer vision: classifying images as cats or dogs.

Traditionally, we use a simple confusion matrix to compare positive (dog, because we're dog lovers) or negative (cat) predictions against the actual categories.

[Figure: Cats-dogs image classification. Source: Medium]

[Figure: Confusion matrix. Source: Geeks For Geeks]
  • True Positive: Dog is predicted and the image is an actual dog
  • True Negative: Cat is predicted and the image is an actual cat
  • False Positive: Dog is wrongly predicted when the image is a cat
  • False Negative: Cat is wrongly predicted when the image is a dog

This matrix can be applied directly to P&ID extraction output, where each extracted tag is classified as correct or incorrect.
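Concretely, once we have a hand-labelled ground-truth tag list for a sheet, the confusion-matrix counts fall out of simple set operations. A minimal sketch (the tag names below are purely illustrative, not from a real diagram):

```python
# Illustrative example: score one P&ID sheet by comparing the tool's
# extracted tags against a hand-labelled ground truth.
ground_truth = {"P-101", "P-102", "V-201", "FT-305"}
extracted = {"P-101", "P-102", "V-201", "FT-306"}  # FT-306 is hallucinated

tp = len(extracted & ground_truth)   # correct extractions
fp = len(extracted - ground_truth)   # hallucinated tags
fn = len(ground_truth - extracted)   # missed tags

print(tp, fp, fn)  # 3 1 1
```

Note that a true-negative count does not appear here; we return to why that is later in the article.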

Common classification metrics and the trade-off

The confusion matrix unlocks multiple metrics:

1. Accuracy

Measures how many predictions the model got right out of all predictions.

Accuracy = \frac{\text{correct classifications}}{\text{total classifications}} = \frac{TP + TN}{TP + TN + FP + FN}

When to use:

A balanced dataset, where positive and negative classes are roughly equal in size

When NOT to use:

Imbalanced datasets such as anomaly detection

2. Precision

Measures correct positive predictions out of all positive predictions.

Precision = \frac{\text{correctly classified actual positives}}{\text{everything classified as positive}} = \frac{TP}{TP + FP}

When to use:

When FP should be minimized in applications like email spam detection (marking legitimate emails as spam)

When NOT to use:

When FN proves more critical than FP in scenarios like cancer detection. Missing a cancer case (FN) is worse than a false alarm (FP)

3. Recall

Measures true positives detected out of all actual positive instances.

Recall = \frac{\text{correctly classified actual positives}}{\text{all actual positives}} = \frac{TP}{TP + FN}

When to use:

When we want to capture all positive cases in critical applications like fraud detection. Missing a fraudulent transaction (FN) has greater consequences than a false alarm

When NOT to use:

When FP proves more critical than FN in scenarios like judicial decisions. False accusations (FP) can ruin lives

4. F1-Score

The harmonic mean of precision and recall, delivering a single score that balances both metrics and gives equal weight to false positives and false negatives.

F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} = \frac{2TP}{2TP + FP + FN}

When to use:

When we want to achieve balance between precision and recall on imbalanced datasets. F1-score is more informative than accuracy for imbalanced classes

When NOT to use:

When FP and FN have very different costs. If missing a real threat (FN) is far worse than a false alarm, as in cancer, fraud, or Covid testing, it is better to prioritize recall; if a false alarm (FP) is more damaging, as in wrongful accusations or labelling legitimate email as spam, it is better to prioritize precision.
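All four metrics above can be computed directly from the confusion-matrix counts. A minimal sketch (the helper name and sample counts are illustrative):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Accuracy, Precision, Recall, and F1 from confusion-matrix
    counts, guarding against division by zero."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Example: 8 correct tags, 2 hallucinated, 2 missed, no true negatives.
m = classification_metrics(tp=8, tn=0, fp=2, fn=2)
print(m)  # precision = recall = f1 = 0.8
```

Libraries such as scikit-learn provide the same computations, but for a small evaluation harness the explicit formulas keep the trade-offs visible.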

Which metrics should I use for evaluation?

In the case of P&ID extraction using a simple VLM-powered tool:

  • True Positive: The equipment tag exists, and the tool extracts the correct tag.
  • True Negative: The equipment tag does not exist, and the tool does not extract it.
  • False Positive: The equipment tag does not exist, but the tool hallucinates and extracts phantom tags.
  • False Negative: The equipment tag exists, but the tool misses it.

Let's consider these questions:

  1. Which category do we desire to have high and low values?
  2. Between FP and FN, which is worse if the value is high?

Let's examine these slowly:

  1. Which category do we desire to have high and low values?
    1. In our use case, TN is not meaningful: there is an infinite number of non-existent equipment tags that the tool correctly does not extract.
    2. Ideally, we want high TP and low FP and FN.
  2. Between FP and FN, which is worse if the value is high?
    1. FP: the VLM hallucinates and extracts non-existent tags.
    2. FN: the VLM misses tags that exist.
    3. Between the two, the former poses the greater risk, but the extraction is still unreliable if FN is high.

Based on these two questions, we know that the risk is higher with a high FP count (hallucinated tags) than a high FN count (missed tags). However, a high FN is undesirable too. For this reason, we built Rosary Vision with our evaluation focused on the F1-score for a balance between FP and FN, with slightly more weight on precision.
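One standard way to encode "slightly more weight on precision" (not discussed above, but a common generalisation of F1) is the F-beta score with beta below 1. A sketch, with illustrative precision/recall values:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score: beta < 1 weights precision more than recall;
    beta = 1 recovers the standard F1-score."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# A precision-heavy system beats a recall-heavy one under F0.5,
# even though both have the same F1-score of 0.72.
print(round(f_beta(0.9, 0.6), 3))  # 0.818
print(round(f_beta(0.6, 0.9), 3))  # 0.643
```

This lets the penalty on hallucinated tags be tuned explicitly rather than left implicit in the choice between precision and F1.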

Key Takeaways:

  1. Traditional classification metrics can be applied to P&ID extraction tasks:
    1. P&ID extraction can be evaluated using traditional classification metrics (Accuracy, Precision, Recall, F1-Score) by treating each extracted item as either correct or incorrect.
    2. The confusion matrix framework (TP, TN, FP, FN) quantifies performance and identifies specific extraction weaknesses.
  2. Quantitative metrics selection depends on the relative cost of errors (which is worse if the value is high - False Positive or False Negative?)
    1. For P&ID extraction using Vision-Language Models, F1-Score is the preferred metric because both false positives (hallucinated tags) and false negatives (missed tags) are problematic, though false positives create slightly more risk.

In the next article, we will uncover how quantitative evaluation can be implemented into the P&ID extraction pipeline.

Get in Touch

Interested in learning more? Schedule a call with us and we'll get back to you shortly.