
CVE-CWE Benchmark Comparison Algorithm Requirements

Overview

This section outlines the requirements for the algorithm that evaluates automated CWE assignments against the gold-standard benchmark dataset.

Algorithm Core Requirements

REQ_ALGO_MULTI_LABEL: The comparison algorithm MUST treat CWE assignment as a multi-label classification problem.

REQ_ALGO_HIERARCHICAL: The comparison algorithm MUST incorporate the hierarchical nature of CWEs in its matching logic.

REQ_ALGO_PARTIAL_CREDIT: The comparison algorithm MUST support partial credit for related but non-exact CWE matches.

REQ_ALGO_EXPLAINABLE: The comparison algorithm MUST provide explanations for why specific matches were scored in a particular way.

CWE Matching Requirements

REQ_MATCH_EXACT: The algorithm MUST identify and give full credit (score = 1.0) for exact CWE ID matches.

REQ_MATCH_UNRELATED: The algorithm MUST identify unrelated CWEs (CWEs on different branches with no close common ancestor) and give them no credit (score = 0.0).

REQ_MATCH_CUSTOMIZABLE: The algorithm MUST allow customizable weights for each degree of match to compute similarity scores.

REQ_MATCH_CWE_GRAPH: The algorithm MUST use the official CWE 1000 View graph to determine relationships between CWEs.
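
Illustrative sketch (non-normative): the matching requirements above could be combined roughly as follows. The parent map is a small hand-copied, single-parent excerpt standing in for the real View 1000 graph, and the intermediate match degrees ("parent_child", "sibling"), their default weights, and all function names are assumptions made for this example only.

```python
from typing import Dict, Optional, Tuple

# Toy excerpt of child -> parent links. A real implementation would load
# the full relationship data from the official CWE View 1000 instead.
PARENT: Dict[str, Optional[str]] = {
    "CWE-707": None,
    "CWE-74": "CWE-707",
    "CWE-79": "CWE-74",
    "CWE-77": "CWE-74",
    "CWE-78": "CWE-77",
    "CWE-664": None,
    "CWE-118": "CWE-664",
    "CWE-119": "CWE-118",
    "CWE-787": "CWE-119",
    "CWE-125": "CWE-119",
}

# Hypothetical default weights; REQ_MATCH_CUSTOMIZABLE says callers can override them.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "exact": 1.0,
    "parent_child": 0.5,
    "sibling": 0.25,
    "unrelated": 0.0,
}


def match_score(
    predicted: str, gold: str, weights: Optional[Dict[str, float]] = None
) -> Tuple[float, str]:
    """Return (score, reason). The reason string doubles as a short explanation."""
    w = DEFAULT_WEIGHTS if weights is None else weights
    if predicted == gold:
        return w["exact"], "exact"
    pred_parent = PARENT.get(predicted)
    gold_parent = PARENT.get(gold)
    # One CWE is the direct parent of the other.
    if pred_parent == gold or gold_parent == predicted:
        return w["parent_child"], "parent_child"
    # Both CWEs share the same direct parent.
    if pred_parent is not None and pred_parent == gold_parent:
        return w["sibling"], "sibling"
    return w["unrelated"], "unrelated"


print(match_score("CWE-79", "CWE-79"))
print(match_score("CWE-79", "CWE-74"))
print(match_score("CWE-787", "CWE-125"))
print(match_score("CWE-79", "CWE-787"))
```

Returning the reason alongside the score is one simple way to keep each scoring decision explainable; a fuller implementation would likely also consider more distant ancestors and CWEs with multiple parents.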

Multiple CWE Handling Requirements

REQ_MULTI_LABEL_BASED: The algorithm MUST support multi-label-based metrics for evaluating individual CWE predictions across all CVEs.

Abstraction Level Requirements

REQ_ABSTRACT_TRACKING: The algorithm MUST track the abstraction level (Pillar, Class, Base, Variant) of each CWE.

REQ_ABSTRACT_FILTERING: The algorithm MUST support filtering or grouping results by abstraction level.

REQ_ABSTRACT_SEPARATE_METRICS: The algorithm MUST calculate metrics separately for different abstraction levels when requested.

REQ_ABSTRACT_MISMATCH: The algorithm MUST identify and report when predictions match at a different abstraction level than the gold standard.

REQ_ABSTRACT_PREFERENCE: The algorithm SHOULD implement a preference for Base/Variant level matches over Class/Pillar matches, in accordance with CWE mapping guidance.
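
Illustrative sketch (non-normative): abstraction tracking and mismatch reporting might look like the following. The lookup table is a tiny hand-written excerpt, and the function name and report fields are hypothetical; a real implementation would read abstraction levels from the CWE data itself.

```python
from typing import Dict

# Toy excerpt of CWE ID -> abstraction level.
ABSTRACTION: Dict[str, str] = {
    "CWE-707": "Pillar",
    "CWE-74": "Class",
    "CWE-119": "Class",
    "CWE-79": "Base",
    "CWE-89": "Base",
    "CWE-787": "Base",
}

# Levels preferred for mapping, per REQ_ABSTRACT_PREFERENCE.
PREFERRED_LEVELS = {"Base", "Variant"}


def abstraction_report(predicted: str, gold: str) -> Dict[str, object]:
    """Describe the abstraction levels of a prediction/gold pair."""
    pred_level = ABSTRACTION.get(predicted, "Unknown")
    gold_level = ABSTRACTION.get(gold, "Unknown")
    return {
        "predicted_level": pred_level,
        "gold_level": gold_level,
        "level_mismatch": pred_level != gold_level,
        "predicted_at_preferred_level": pred_level in PREFERRED_LEVELS,
    }


print(abstraction_report("CWE-74", "CWE-79"))
```

Per-pair records like this can then be filtered or grouped by "predicted_level" or "gold_level" to produce separate metrics per abstraction level.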

Evaluation Metrics Requirements

REQ_METRIC_EXACT_MATCH: The algorithm MUST calculate and report the Exact Match Rate (EMR): the proportion of CVEs whose predicted CWE set exactly matches the gold CWE set.

REQ_METRIC_AT_LEAST_ONE: The algorithm MUST calculate and report the At Least One Match Rate: the proportion of CVEs for which at least one gold CWE was correctly predicted.
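
Illustrative sketch (non-normative): both set-level rates can be computed directly from per-CVE sets of CWE IDs. The CVE keys and function names below are placeholders, and treating a CVE with no prediction as an empty predicted set is an assumption of this example.

```python
from typing import Dict, Set


def exact_match_rate(predictions: Dict[str, Set[str]], gold: Dict[str, Set[str]]) -> float:
    """Proportion of gold CVEs whose predicted set equals the gold set."""
    if not gold:
        return 0.0
    hits = sum(1 for cve, g in gold.items() if set(predictions.get(cve, ())) == set(g))
    return hits / len(gold)


def at_least_one_match_rate(predictions: Dict[str, Set[str]], gold: Dict[str, Set[str]]) -> float:
    """Proportion of gold CVEs with at least one correctly predicted CWE."""
    if not gold:
        return 0.0
    hits = sum(1 for cve, g in gold.items() if set(predictions.get(cve, ())) & set(g))
    return hits / len(gold)


# Placeholder data: three CVEs, one fully correct, one partially, one wrong.
gold = {"CVE-A": {"CWE-79"}, "CVE-B": {"CWE-787", "CWE-125"}, "CVE-C": {"CWE-89"}}
predictions = {"CVE-A": {"CWE-79"}, "CVE-B": {"CWE-787"}, "CVE-C": {"CWE-22"}}

print(exact_match_rate(predictions, gold))         # 1 of 3 CVEs matches exactly
print(at_least_one_match_rate(predictions, gold))  # 2 of 3 CVEs have an overlap
```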

REQ_METRIC_PRECISION: The algorithm MUST calculate and report Precision (Positive Predictive Value) based on true and false positives.

REQ_METRIC_RECALL: The algorithm MUST calculate and report Recall (Sensitivity) based on true positives and false negatives.

REQ_METRIC_F1: The algorithm MUST calculate and report the F1 Score as the harmonic mean of precision and recall.
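
Illustrative sketch (non-normative): one way to obtain precision, recall, and F1 is to pool true positives, false positives, and false negatives over all (CVE, CWE) pairs. The function name and the choice to return 0.0 for undefined ratios are assumptions of this example.

```python
from typing import Dict, Set, Tuple


def precision_recall_f1(
    predictions: Dict[str, Set[str]], gold: Dict[str, Set[str]]
) -> Tuple[float, float, float]:
    """Pool TP/FP/FN over all CVEs, then compute precision, recall, and F1."""
    tp = fp = fn = 0
    for cve in set(predictions) | set(gold):
        pred_set = set(predictions.get(cve, ()))
        gold_set = set(gold.get(cve, ()))
        tp += len(pred_set & gold_set)   # predicted and in gold
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Placeholder data: TP = 2, FP = 1, FN = 2.
gold = {"CVE-A": {"CWE-79"}, "CVE-B": {"CWE-787", "CWE-125"}, "CVE-C": {"CWE-89"}}
predictions = {"CVE-A": {"CWE-79"}, "CVE-B": {"CWE-787"}, "CVE-C": {"CWE-22"}}

print(precision_recall_f1(predictions, gold))
```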

REQ_METRIC_BALANCED_ACCURACY: The algorithm MUST calculate and report Balanced Accuracy, accounting for class imbalance by averaging recall across CWE classes.

REQ_METRIC_SOFT_VERSIONS: The algorithm MUST support "soft" versions of precision, recall, and F1 that incorporate partial match scores.
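
Illustrative sketch (non-normative): one possible reading of "soft" metrics is to replace the 0/1 match indicator with the best partial-match score. Soft precision then averages, over predicted CWEs, the best score against any gold CWE; soft recall does the same from the gold side. The similarity function here is a hypothetical stand-in for whatever graph-based scorer the implementation uses.

```python
from typing import Callable, Set, Tuple

# Hypothetical related pairs that earn partial credit in this toy example.
RELATED = {frozenset({"CWE-79", "CWE-74"})}


def toy_similarity(a: str, b: str) -> float:
    """Stand-in scorer: 1.0 for exact, 0.5 for a listed related pair, else 0.0."""
    if a == b:
        return 1.0
    return 0.5 if frozenset({a, b}) in RELATED else 0.0


def soft_prf(
    predicted: Set[str], gold: Set[str], sim: Callable[[str, str], float]
) -> Tuple[float, float, float]:
    """Soft precision, recall, and F1 for a single CVE."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0  # assumption: empty sets earn no credit
    precision = sum(max(sim(p, g) for g in gold) for p in predicted) / len(predicted)
    recall = sum(max(sim(p, g) for p in predicted) for g in gold) / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


print(soft_prf({"CWE-74"}, {"CWE-79"}, toy_similarity))
print(soft_prf({"CWE-79", "CWE-787"}, {"CWE-79"}, toy_similarity))
```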

REQ_METRIC_BREAKDOWN: The algorithm SHOULD provide breakdowns of metrics by CWE frequency, abstraction level, or other relevant criteria.

REQ_METRIC_MACRO_MICRO: The algorithm SHOULD calculate both micro and macro averages for precision, recall, and F1 when appropriate.
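
Illustrative sketch (non-normative): micro averages pool counts over all CWE classes before dividing, while macro averages compute the ratio per class and then take the unweighted mean. In this example, classes are the union of all predicted and gold CWEs, and undefined per-class ratios count as 0.0; both are design choices assumed here, not mandated by the requirement.

```python
from collections import Counter
from typing import Dict, Set


def _div(num: float, den: float) -> float:
    """Division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def micro_macro(predictions: Dict[str, Set[str]], gold: Dict[str, Set[str]]) -> Dict[str, float]:
    tp, fp, fn = Counter(), Counter(), Counter()
    for cve in set(predictions) | set(gold):
        pred_set = set(predictions.get(cve, ()))
        gold_set = set(gold.get(cve, ()))
        tp.update(pred_set & gold_set)
        fp.update(pred_set - gold_set)
        fn.update(gold_set - pred_set)
    classes = set(tp) | set(fp) | set(fn)
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return {
        # Micro: pool counts first, so frequent CWEs dominate.
        "micro_precision": _div(total_tp, total_tp + total_fp),
        "micro_recall": _div(total_tp, total_tp + total_fn),
        # Macro: average per-class ratios, so every CWE counts equally.
        "macro_precision": _div(sum(_div(tp[c], tp[c] + fp[c]) for c in classes), len(classes)),
        "macro_recall": _div(sum(_div(tp[c], tp[c] + fn[c]) for c in classes), len(classes)),
    }


gold = {"CVE-A": {"CWE-79"}, "CVE-B": {"CWE-787", "CWE-125"}, "CVE-C": {"CWE-89"}}
predictions = {"CVE-A": {"CWE-79"}, "CVE-B": {"CWE-787"}, "CVE-C": {"CWE-22"}}

print(micro_macro(predictions, gold))
```

The gap between micro and macro values is itself informative: a large gap suggests that performance on rare CWEs differs from performance on common ones.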

REQ_METRIC_CONFIDENCE: The algorithm SHOULD incorporate confidence intervals or uncertainty measures for the reported metrics when sample sizes permit.
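
Illustrative sketch (non-normative): a percentile bootstrap over per-CVE scores is one common way to attach an interval to a reported rate. The function name, the number of resamples, and the fixed seed are assumptions of this example; the seed is fixed so that repeated runs give the same interval.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    values: List[float], n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-CVE scores."""
    if not values:
        raise ValueError("need at least one per-CVE score")
    rng = random.Random(seed)  # local generator keeps results reproducible
    n = len(values)
    # Resample CVEs with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Placeholder per-CVE exact-match indicators (1.0 = exact match, 0.0 = not).
per_cve_scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]
print(bootstrap_ci(per_cve_scores))
```

With only a handful of CVEs the interval will be wide, which is consistent with the requirement's caveat about sample sizes.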

Grounding and Low-Information Requirements

REQ_GROUND_HALLUCINATION: The algorithm MUST implement detection for "hallucinated" CWE assignments that have no apparent support in the CVE description.

REQ_GROUND_KEYWORD: The algorithm SHOULD use keyword/term matching to check if a CWE's characteristic terms appear in the CVE description.

REQ_GROUND_SEMANTIC: The algorithm SHOULD use semantic similarity techniques to assess if predictions are related to the CVE text content.

REQ_GROUND_SEPARATE_METRICS: The algorithm MUST report the percentage of predictions flagged as potential hallucinations.
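
Illustrative sketch (non-normative): a very simple keyword-based grounding check, together with the percentage of predictions it flags. The keyword lists, CVE key, and function names are hypothetical; real characteristic terms would need to be derived from CWE entries, and a keyword miss only marks a potential hallucination, not a confirmed one.

```python
from typing import Dict, List, Set

# Hypothetical characteristic terms per CWE, lowercase.
CWE_KEYWORDS: Dict[str, Set[str]] = {
    "CWE-79": {"cross-site scripting", "xss"},
    "CWE-89": {"sql injection", "sqli"},
    "CWE-787": {"out-of-bounds write", "buffer overflow"},
}


def is_grounded(cwe: str, description: str) -> bool:
    """True if any characteristic term of the CWE occurs in the description."""
    text = description.lower()
    return any(term in text for term in CWE_KEYWORDS.get(cwe, ()))


def flagged_percentage(predictions: Dict[str, List[str]], descriptions: Dict[str, str]) -> float:
    """Percentage of predicted CWEs with no keyword support in their CVE text."""
    total = flagged = 0
    for cve, cwes in predictions.items():
        for cwe in cwes:
            total += 1
            if not is_grounded(cwe, descriptions.get(cve, "")):
                flagged += 1
    return 100.0 * flagged / total if total else 0.0


descriptions = {"CVE-A": "Stored XSS in the comment form allows script execution."}
predictions = {"CVE-A": ["CWE-79", "CWE-89"]}
print(flagged_percentage(predictions, descriptions))  # CWE-89 has no support here
```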

REQ_LOW_INFO_EXCLUDE: The algorithm MUST support excluding low-information CVEs from primary evaluation metrics.

REQ_LOW_INFO_SEPARATE: The algorithm MUST report metrics with and without low-information CVEs to show their impact on overall performance.
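
Illustrative sketch (non-normative): reporting a metric with and without low-information CVEs can be as simple as computing it twice over different subsets. How a CVE is labeled low-information is not specified here; this example assumes the benchmark supplies a set of such CVE IDs, and all names are placeholders.

```python
from typing import Dict, Iterable, Set


def _rate(per_cve_correct: Dict[str, bool], cves: Iterable[str]) -> float:
    """Fraction of the given CVEs that are marked correct."""
    cves = list(cves)
    return sum(per_cve_correct[c] for c in cves) / len(cves) if cves else 0.0


def low_info_impact(per_cve_correct: Dict[str, bool], low_info: Set[str]) -> Dict[str, float]:
    """Report the same rate over all CVEs and over informative CVEs only."""
    all_cves = list(per_cve_correct)
    informative = [c for c in all_cves if c not in low_info]
    return {
        "rate_all": _rate(per_cve_correct, all_cves),
        "rate_excluding_low_info": _rate(per_cve_correct, informative),
        "n_all": len(all_cves),
        "n_low_info_excluded": len(all_cves) - len(informative),
    }


per_cve_correct = {"CVE-A": True, "CVE-B": False, "CVE-C": True, "CVE-D": False}
print(low_info_impact(per_cve_correct, low_info={"CVE-D"}))
```

Reporting the excluded count alongside both rates makes it clear how much of any difference is attributable to the filter.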

Implementation Requirements

REQ_IMPL_PERFORMANCE: The algorithm MUST be efficient enough to process large CVE datasets (tens of thousands of entries) in a reasonable time.

REQ_IMPL_OUTPUT_FORMAT: The algorithm MUST produce structured output in a machine-readable format (e.g., JSON) with clear organization of metrics and results.

REQ_IMPL_DETAILED_RESULTS: The algorithm MUST provide detailed per-CVE results showing matches, scores, and classification decisions.

REQ_IMPL_SUMMARY_STATS: The algorithm MUST generate summary statistics and aggregate metrics for the entire evaluation.

REQ_IMPL_VERSION_INFO: The algorithm output report MUST include: 1) the version of the benchmark dataset used, 2) the version of the comparison algorithm, and 3) a timestamp indicating when the comparison was run.
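
Illustrative sketch (non-normative): a JSON report that carries summary metrics, per-CVE details, and the required version metadata. The field names, version strings, and metric values below are placeholders invented for this example, not a prescribed schema.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_report(
    summary: Dict[str, float],
    per_cve: List[Dict[str, Any]],
    dataset_version: str,
    algorithm_version: str,
) -> Dict[str, Any]:
    """Assemble a machine-readable report with version and timestamp metadata."""
    return {
        "metadata": {
            "benchmark_dataset_version": dataset_version,
            "algorithm_version": algorithm_version,  # Semantic Versioning string
            "run_timestamp": datetime.now(timezone.utc).isoformat(),
        },
        "summary": summary,
        "per_cve": per_cve,
    }


report = build_report(
    summary={"exact_match_rate": 0.33, "precision": 0.67},
    per_cve=[{"cve": "CVE-A", "predicted": ["CWE-79"], "gold": ["CWE-79"], "score": 1.0}],
    dataset_version="0.1.0",
    algorithm_version="1.2.0",
)
# sort_keys gives a stable key order, which helps when diffing reports.
print(json.dumps(report, indent=2, sort_keys=True))
```

Because the timestamp changes on every run, reproducibility checks would compare the "summary" and "per_cve" sections rather than the whole file.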

REQ_IMPL_LOGGING: The algorithm MUST produce a detailed log file capturing the execution process, including the configuration settings used, processing steps, any warnings or errors encountered, and a summary of results. This log file MUST be separate from the main output report and SHOULD provide sufficient information for debugging and audit purposes.

REQ_IMPL_VISUALIZATION: The algorithm SHOULD generate visualizations of results such as confusion matrices, precision-recall curves, or match distributions.

REQ_IMPL_DOCUMENTABILITY: The algorithm MUST be well-documented, with clear explanations of all metrics, weighting schemes, and decision processes.

REQ_IMPL_REPRODUCIBILITY: The algorithm MUST produce reproducible results given the same input data and configuration parameters.

REQ_IMPL_EXTENSIBILITY: The algorithm SHOULD be designed to be extensible for future CWE versions, additional metrics, or enhanced matching techniques.

REQ_IMPL_SEMVER: The algorithm implementation and its scripts MUST use Semantic Versioning to clearly indicate compatibility and feature changes between releases.

References

  1. CWE - CVE → CWE Mapping "Root Cause Mapping" Guidance
  2. RFC 2119 for definitions of "SHOULD", "MUST", etc.
  3. QualiTagger: Automating software quality detection in issue trackers
  4. balanced_accuracy_score — scikit-learn 1.6.1 documentation