CVE-CWE Root Cause Mapping Dataset Requirements

Overview

This section outlines the requirements for creating a comprehensive benchmark dataset for CVE-CWE Root Cause mapping.

Table of Contents

Coverage Requirements
Quality Requirements
Data Sources
- Datasets
- CVE Info
Validation
References
ToDo

Coverage Requirements

REQ_COVERAGE_PRACTICAL_CWES: The dataset MUST include all CWEs used in practice, i.e., every CWE assigned to at least one published CVE (approximately 400-500 CWEs).

REQ_COVERAGE_CWE1003_VIEW: The dataset MUST include all CWEs in the CWE-1003 view (approximately 130).

REQ_COVERAGE_TOP25_CWES: The dataset MUST include all CWEs that appear in any CWE Top 25 list (https://cwe.mitre.org/top25/) from 2019 through the current 2024 list.

See https://github.com/CWE-CVE-Benchmark/CWE_analysis/blob/main/data_in/Top25_per_year.tsv, which lists the Top 25 CWEs per year.

REQ_COVERAGE_ALL_CWES: The dataset MAY include all CWEs (approximately 1000).

REQ_COVERAGE_RECENT_CVES: The dataset SHOULD include mostly CVEs from recent CVE publication years to reflect current reality. The dataset should be expanded in reverse chronological order until sufficient examples are collected for each targeted CWE class.

REQ_COVERAGE_MIN_ENTRIES_PER_CWE: The dataset MUST include at least 5 entries per CWE (if such CVE examples exist).
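
A coverage check of this kind could be automated along the following lines. This is a minimal sketch, not the project's validator: the entry layout (a dict with a `cwe_ids` list) and the function name are assumptions made for illustration.

```python
from collections import Counter

# Threshold taken from REQ_COVERAGE_MIN_ENTRIES_PER_CWE.
MIN_ENTRIES_PER_CWE = 5


def underrepresented_cwes(entries, minimum=MIN_ENTRIES_PER_CWE):
    """Return {cwe_id: count} for CWEs that have fewer than `minimum` entries.

    `entries` is an iterable of dicts with a hypothetical "cwe_ids" list field.
    A CVE mapped to several CWEs counts once toward each of them.
    """
    counts = Counter(cwe for entry in entries for cwe in set(entry["cwe_ids"]))
    return {cwe: n for cwe, n in counts.items() if n < minimum}
```

Because the requirement only applies "if such CVE examples exist", the result is better treated as a review list than as an automatic failure.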

REQ_COVERAGE_DIVERSE_DESCRIPTIONS: For a given CWE, the dataset MUST include CVE Descriptions that are sufficiently different from one another, measured quantitatively using fuzzy and semantic similarity against a threshold.
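
The fuzzy half of this measurement could look roughly like the sketch below, which flags description pairs whose character-level similarity reaches a threshold. The threshold value (0.9), the function name, and the input layout are assumptions; the actual thresholds and methods are still listed under ToDo, and semantic similarity (e.g., embedding-based) is not covered here.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_pairs(descriptions, threshold=0.9):
    """Return (id_a, id_b, ratio) for description pairs at or above `threshold`.

    `descriptions` maps a CVE ID to its description text; all entries are
    expected to belong to the same CWE. The 0.9 default is a placeholder.
    """
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(descriptions.items()), 2):
        # Case-insensitive similarity ratio in the range [0, 1].
        ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((id_a, id_b, round(ratio, 3)))
    return pairs
```

The pairwise comparison is quadratic in the number of descriptions per CWE, which is acceptable for small per-CWE groups but would need blocking or indexing at larger scale.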

REQ_COVERAGE_MULTIPLE_CWES: The dataset MUST include CVEs with multiple CWEs assigned, as reflected by the Top25 datasets.

REQ_COVERAGE_REF_CONTENT_RAW: The dataset MUST include CVE Reference Link Content as the raw content in text format.

REQ_COVERAGE_REF_CONTENT_SUMMARY: The dataset MUST include CVE Reference Link Content as a summary (e.g., as in https://github.com/CyberSecAI/cve_info_refs).

REQ_COVERAGE_LICENSE: The dataset MUST include a LICENSE as agreed by MITRE CWE.

REQ_COVERAGE_README: The dataset MUST include a README.

REQ_COVERAGE_MAINTENANCE_PLAN: The dataset MUST include a documented Maintenance Plan.

REQ_COVERAGE_CHANGE_TRACKING: The dataset MUST include change tracking and full history (e.g., via GitHub).

REQ_COVERAGE_VALIDATION_PROCESS: The dataset MUST include a documented validation process/script that automatically validates the dataset meets the requirements.

REQ_COVERAGE_FEEDBACK_PROCESS: The dataset MUST include a documented feedback process.

REQ_COVERAGE_SEMVER: The dataset MUST use Semantic Versioning.
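
A validator could check the dataset's version string with a regular expression. The sketch below is modeled on the pattern suggested at semver.org for Semantic Versioning 2.0.0; the function name is an assumption, and where the version string is stored is not specified here.

```python
import re

# Core version, optional pre-release, optional build metadata.
SEMVER_RE = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?"
)


def is_semver(version: str) -> bool:
    """Return True if `version` is a well-formed Semantic Versioning string."""
    # fullmatch anchors the pattern at both ends of the string.
    return SEMVER_RE.fullmatch(version) is not None
```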

REQ_COVERAGE_KEYPHRASES: The dataset SHOULD include the CVE Description Vulnerability KeyPhrases (e.g., as in https://github.com/CyberSecAI/cve_info).

REQ_COVERAGE_NO_REJECTED_CVES: The dataset SHOULD NOT include CVEs that are REJECTED at the time of creating the dataset. Over time, CVEs in the dataset may become REJECTED, and these should be removed or identified/labeled.

REQ_COVERAGE_NO_DEPRECATED_CWES: The dataset SHOULD NOT include CWEs that are Deprecated, Obsolete, or Prohibited at the time of creating the dataset. Over time, CWEs in the dataset may become deprecated or obsolete, and these should be removed or identified/labeled.

REQ_COVERAGE_TOOL_AGNOSTIC: The dataset MUST be tool/solution agnostic. For example, it MUST NOT include prompts such as those in https://huggingface.co/datasets/AI4Sec/cti-bench/viewer/cti-rcm?views%5B%5D=cti_rcm&row=1

REQ_CONSENSUS_CNA_SCORE: The dataset SHOULD include a consensus score that reflects the level of agreement between CNAs in assigning CWEs to a CVE (e.g., if Red Hat, Microsoft, and IBM all select the same CWE ID for a specific CVE, then the score is 100).
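
The scoring methodology is not yet defined (see ToDo). As one possible reading, the sketch below scores a CVE by the percentage of sources that selected the most common CWE, which yields 100 for the unanimous case in the example. The function name, the input layout, and the choice of "share of the majority CWE" as the metric are all assumptions.

```python
from collections import Counter


def cna_consensus_score(assignments):
    """Return a 0-100 agreement score, or None if there are no assignments.

    `assignments` maps a source name (e.g., a CNA) to the CWE ID it selected.
    The score is the share of sources that chose the most common CWE.
    """
    if not assignments:
        return None
    counts = Counter(assignments.values())
    top_count = counts.most_common(1)[0][1]
    return round(100 * top_count / len(assignments))
```

Under this definition a CVE with a single source always scores 100, so a real scheme would probably also need to record how many sources contributed.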

REQ_CONSENSUS_DESC_SCORE: The dataset SHOULD include a consensus score that reflects the level of agreement between CWEs assigned for similar CVE descriptions. See https://github.com/CyberSecAI/cve_dedup/.

Quality Requirements

REQ_QUALITY_KNOWN_GOOD: The dataset MUST be known-good (i.e., the assigned CWEs are correct). CWE Observed Examples provide high-quality mappings but should be limited to those published after 2015 to ensure consistency with modern CVE description standards.

REQ_QUALITY_UTF8: The dataset MUST be UTF-8 only.
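
A byte-level check for this requirement needs only the standard decoder. The sketch below reports where the first invalid byte sequence starts; the function name is an assumption.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return exc.start
    return None
```

Running this over each dataset file (read in binary mode) gives a precise location to report instead of a bare pass/fail.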

REQ_QUALITY_SCHEMA_VALIDATOR: The dataset MUST have a schema and validator script that is part of the dataset.

REQ_QUALITY_LOW_QUALITY_CVES: The dataset MUST include and identify low-quality CVEs. A sufficient number of low-information CVEs need to be labeled so that the results for those CVEs can be isolated, e.g., to test for hallucinations/grounding.

REQ_QUALITY_DATA_SPLITS: The dataset MUST have train/validate/test splits.
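
One way to keep splits stable as the dataset grows is to derive each entry's split from a hash of its CVE ID, so that adding entries never moves existing ones. This is a sketch under assumptions: the 80/10/10 ratios and the function name are illustrative, and the requirement does not prescribe a splitting method.

```python
import hashlib


def assign_split(cve_id, train=0.8, validate=0.1):
    """Deterministically map a CVE ID to "train", "validate", or "test".

    The ratios are placeholders; whatever remains after `train` and
    `validate` goes to "test".
    """
    digest = hashlib.sha256(cve_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a fraction in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + validate:
        return "validate"
    return "test"
```

Hash-based assignment does not stratify by CWE, so rare CWEs may be missing from a split; a stratified scheme would be needed if every split must cover every CWE.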

REQ_QUALITY_PRESERVE_ISSUES: The dataset SHOULD NOT fix existing typos or other issues in the CVEs.

REQ_QUALITY_ABSTRACTION_CHECK: The dataset MUST check abstraction levels of CWEs and verify that high-level CWEs (such as Pillars) are only included where no lower-level CWE exists.
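
The first half of this check, finding entries mapped to high-level CWEs, can be automated; deciding whether a lower-level CWE would fit still needs human review. The sketch below flags such entries. The abstraction table is a small hand-written excerpt for illustration (in practice it would be derived from the CWE catalog), and the entry layout and function name are assumptions.

```python
# Hypothetical excerpt of CWE abstraction levels, for illustration only.
ABSTRACTION = {
    "CWE-664": "Pillar",
    "CWE-20": "Class",
    "CWE-79": "Base",
}


def flag_high_level_mappings(entries, abstraction=ABSTRACTION, flagged_levels=("Pillar",)):
    """Return (cve_id, cwe_id) pairs whose CWE has a flagged abstraction level.

    `entries` is an iterable of dicts with hypothetical "cve_id" and
    "cwe_ids" fields. CWEs missing from `abstraction` are not flagged.
    """
    flagged = []
    for entry in entries:
        for cwe in entry["cwe_ids"]:
            if abstraction.get(cwe) in flagged_levels:
                flagged.append((entry["cve_id"], cwe))
    return flagged
```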

Data Sources

Datasets

REQ_DATASRC_TOP25_2023: The dataset SHOULD use Top25 2023 (approximately 7K entries) as a source.

REQ_DATASRC_TOP25_2022: The dataset SHOULD use Top25 2022 (approximately 7K entries) as a source.

REQ_DATASRC_NO_OBSERVED_EXAMPLES: The dataset SHOULD NOT use CWE Observed Examples published before 2015, as many are worded differently from modern CVE descriptions. Observed examples from 2015 onwards may be included as they provide high-quality mappings.

CVE Info

REQ_CVEINFO_JSON_5_0: https://github.com/CVEProject/cvelistV5 is the canonical source for CVE descriptions and references.
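
Records in that repository are JSON files in the CVE Record Format 5.x. The sketch below pulls the fields the requirements above rely on (state, English description, reference URLs, CWE IDs) from an already-parsed record. The output keys and function name are assumptions, and only the CNA container is read; CWEs supplied by other containers in the record are not considered.

```python
def extract_cve_fields(record):
    """Extract ID, state, English description, reference URLs, and CWE IDs.

    `record` is a parsed CVE JSON 5.x record (a dict). Missing sections are
    tolerated, since REJECTED records typically carry no descriptions.
    """
    meta = record.get("cveMetadata", {})
    cna = record.get("containers", {}).get("cna", {})
    # Accept "en" as well as regional tags such as "en-US".
    descriptions = [
        d.get("value", "")
        for d in cna.get("descriptions", [])
        if d.get("lang", "").lower().startswith("en")
    ]
    cwe_ids = sorted(
        {
            pt_desc["cweId"]
            for problem_type in cna.get("problemTypes", [])
            for pt_desc in problem_type.get("descriptions", [])
            if "cweId" in pt_desc
        }
    )
    return {
        "cve_id": meta.get("cveId"),
        "state": meta.get("state"),
        "description": descriptions[0] if descriptions else None,
        "reference_urls": [r["url"] for r in cna.get("references", []) if "url" in r],
        "cwe_ids": cwe_ids,
    }
```

The `state` field is what a REQ_COVERAGE_NO_REJECTED_CVES check would inspect when the dataset is refreshed.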

Validation

REQ_VALIDATION_SCHEMA: JSON schema validation MUST pass with schema.json.

REQ_VALIDATION_REQUIREMENTS: The dataset MUST be validated against the requirements.

REQ_VALIDATION_STATISTICS: Basic statistics (token lengths, keyphrase counts) MUST be generated.
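
A basic statistics pass might look like the sketch below. The tokenizer is not specified, so whitespace splitting is used as a stand-in; the entry fields (`description`, `keyphrases`) and the output keys are likewise assumptions.

```python
import statistics


def description_stats(entries):
    """Summarize token lengths and keyphrase counts for a non-empty dataset.

    Tokens are approximated by whitespace-separated words; a model-specific
    tokenizer would give different counts.
    """
    lengths = [len(entry["description"].split()) for entry in entries]
    keyphrase_counts = [len(entry.get("keyphrases", [])) for entry in entries]
    return {
        "entries": len(entries),
        "tokens_min": min(lengths),
        "tokens_median": statistics.median(lengths),
        "tokens_max": max(lengths),
        "keyphrases_mean": statistics.mean(keyphrase_counts),
    }
```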

REQ_VALIDATION_CONSENSUS_SCORES: Consensus scores for both CNA agreement and similar CVE descriptions MUST be calculated and validated.

References

RFC 2119 for definitions of "SHOULD", "MUST", etc.
CWE-1003 View for maintaining consistency in CWE assignments
CWE Top 25 for priority CWE coverage
Developer View (View-699) for comprehensive coverage approach

ToDo

Define Known-Good: specify the validation standard and process.
Define Quantitative Metrics: specify thresholds/methods for "sufficiently different".
Define Qualitative Labels: specify criteria for "low-quality".
Specify the methodology for calculating consensus scores (both CNA and description-based).
Define minimum consensus score thresholds for creating high-quality benchmark subsets.