CVE-CWE Root Cause Mapping Dataset Requirements

Overview

This section outlines the requirements for creating a comprehensive benchmark dataset for CVE-CWE Root Cause mapping.

Coverage Requirements

REQ_COVERAGE_PRACTICAL_CWES: The dataset MUST include all CWEs used in practice, i.e., every CWE assigned to at least one published CVE (approximately 400-500).

REQ_COVERAGE_CWE1003_VIEW: The dataset MUST include all CWEs in CWE-1003 view (approximately 130).

REQ_COVERAGE_TOP25_CWES: The dataset MUST include all CWEs in every CWE Top 25 list (https://cwe.mitre.org/top25/) from 2019 through the current 2024 list.

REQ_COVERAGE_ALL_CWES: The dataset MAY include all CWEs (approximately 1000).

REQ_COVERAGE_RECENT_CVES: The dataset SHOULD include mostly CVEs from recent CVE publication years to reflect current reality. The dataset should be expanded in reverse chronological order until sufficient examples are collected for each targeted CWE class.

REQ_COVERAGE_MIN_ENTRIES_PER_CWE: The dataset MUST include at least 5 entries per CWE (if such CVE examples exist).
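A check for the minimum-entries requirement could look like the following minimal sketch. The entry layout (a `cwe_ids` list per entry) and the placeholder CVE IDs are assumptions for illustration, not part of any agreed schema. The function only reports under-represented CWEs; deciding whether more CVE examples exist for them remains a manual step.

```python
from collections import Counter

MIN_ENTRIES_PER_CWE = 5  # threshold stated in REQ_COVERAGE_MIN_ENTRIES_PER_CWE


def cwes_below_minimum(entries, minimum=MIN_ENTRIES_PER_CWE):
    """Return {cwe_id: count} for CWEs that have fewer than `minimum` entries.

    `entries` is an iterable of dicts with a hypothetical "cwe_ids" list field.
    """
    counts = Counter(cwe for entry in entries for cwe in entry["cwe_ids"])
    return {cwe: n for cwe, n in counts.items() if n < minimum}


# Placeholder entries: five CVEs mapped to CWE-79, one mapped to CWE-79 and CWE-89.
sample = [
    {"cve_id": f"CVE-0000-{i:04d}", "cwe_ids": ["CWE-79"]} for i in range(5)
] + [{"cve_id": "CVE-0000-9999", "cwe_ids": ["CWE-79", "CWE-89"]}]

print(cwes_below_minimum(sample))
```

In this sample only CWE-89 is reported, because it appears in a single entry.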

REQ_COVERAGE_DIVERSE_DESCRIPTIONS: For a given CWE, the dataset MUST include CVE Descriptions that are sufficiently different from one another, measured quantitatively using fuzzy and semantic similarity against a threshold.
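The fuzzy half of this measurement can be sketched with the standard library alone. The example below uses `difflib.SequenceMatcher` as a stand-in fuzzy metric and a threshold of 0.9. Both are assumptions, since the threshold and method are still listed as open items, and semantic similarity (which typically needs an embedding model) is not covered.

```python
from difflib import SequenceMatcher
from itertools import combinations

SIMILARITY_THRESHOLD = 0.9  # hypothetical value; the requirement leaves it undefined


def near_duplicate_pairs(descriptions, threshold=SIMILARITY_THRESHOLD):
    """Return (i, j, ratio) for every pair whose fuzzy ratio is >= threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(descriptions), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs


# Made-up descriptions for a single CWE; the first two differ by one character.
descriptions = [
    "SQL injection in login.php in Foo 1.0 allows remote attackers to execute SQL commands.",
    "SQL injection in login.php in Foo 1.1 allows remote attackers to execute SQL commands.",
    "Buffer overflow in the image parser allows a crash via a crafted PNG file.",
]

for i, j, ratio in near_duplicate_pairs(descriptions):
    print(f"descriptions {i} and {j} are near-duplicates (ratio={ratio:.2f})")
```

The pairwise comparison is quadratic in the number of descriptions, which is acceptable per CWE but would need blocking or indexing for the full CVE corpus.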

REQ_COVERAGE_MULTIPLE_CWES: The dataset MUST include CVEs with multiple CWEs assigned, as reflected by the Top25 datasets.

REQ_COVERAGE_REF_CONTENT_RAW: The dataset MUST include CVE Reference Link Content as the raw content in text format.

REQ_COVERAGE_REF_CONTENT_SUMMARY: The dataset MUST include CVE Reference Link Content as a summary (e.g., as in https://github.com/CyberSecAI/cve_info_refs).

REQ_COVERAGE_LICENSE: The dataset MUST include a LICENSE as agreed by MITRE CWE.

REQ_COVERAGE_README: The dataset MUST include a README.

REQ_COVERAGE_MAINTENANCE_PLAN: The dataset MUST include a documented Maintenance Plan.

REQ_COVERAGE_CHANGE_TRACKING: The dataset MUST include change tracking and full history (e.g., via GitHub).

REQ_COVERAGE_VALIDATION_PROCESS: The dataset MUST include a documented validation process/script that automatically validates the dataset meets the requirements.

REQ_COVERAGE_FEEDBACK_PROCESS: The dataset MUST include a documented feedback process.

REQ_COVERAGE_SEMVER: The dataset MUST use Semantic Versioning.
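A validator script could check the dataset version string with a regular expression. The sketch below is a simplified pattern (MAJOR.MINOR.PATCH with optional pre-release and build suffixes) and does not enforce every rule of the Semantic Versioning specification. The function name is hypothetical.

```python
import re

# Simplified pattern: numeric MAJOR.MINOR.PATCH without leading zeros,
# plus optional "-prerelease" and "+build" suffixes.
SEMVER_PATTERN = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?"
)


def looks_like_semver(version: str) -> bool:
    """Return True if `version` matches the simplified pattern above."""
    return SEMVER_PATTERN.fullmatch(version) is not None


for candidate in ["1.0.0", "1.2.3-rc.1", "1.0", "v1.0.0"]:
    print(candidate, looks_like_semver(candidate))
```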

REQ_COVERAGE_KEYPHRASES: The dataset SHOULD include the CVE Description Vulnerability KeyPhrases (e.g., as in https://github.com/CyberSecAI/cve_info).

REQ_COVERAGE_NO_REJECTED_CVES: The dataset SHOULD NOT include CVEs that are REJECTED at the time of creating the dataset. Over time, CVEs in the dataset may become REJECTED, and these should be removed or identified/labeled.

REQ_COVERAGE_NO_DEPRECATED_CWES: The dataset SHOULD NOT include CWEs that are Deprecated, Obsolete, or Prohibited at the time of creating the dataset. Over time, CWEs in the dataset may become deprecated or obsolete, and these should be removed or identified/labeled.

REQ_COVERAGE_TOOL_AGNOSTIC: The dataset MUST be tool/solution agnostic; e.g., it MUST NOT include prompts such as those in https://huggingface.co/datasets/AI4Sec/cti-bench/viewer/cti-rcm?views%5B%5D=cti_rcm&row=1

REQ_CONSENSUS_CNA_SCORE: The dataset SHOULD include a consensus score that reflects the level of agreement between CNAs in assigning CWEs to a CVE (e.g., if Red Hat, Microsoft, and IBM all select the same CWE ID for a specific CVE, then the score is 100).
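One possible way to compute such a score is the percentage of sources that chose the most common CWE ID. This is only an assumed definition, since the methodology is still an open item; it is chosen because it yields 100 for the unanimous case described above. The function name and the input mapping are hypothetical.

```python
from collections import Counter


def cna_consensus_score(assignments):
    """Percentage of sources that chose the most common CWE ID.

    `assignments` maps a source name to the single CWE ID it assigned.
    """
    if not assignments:
        raise ValueError("at least one assignment is required")
    counts = Counter(assignments.values())
    top_count = counts.most_common(1)[0][1]
    return 100.0 * top_count / len(assignments)


unanimous = {"Red Hat": "CWE-787", "Microsoft": "CWE-787", "IBM": "CWE-787"}
split_vote = {"Red Hat": "CWE-787", "Microsoft": "CWE-787", "IBM": "CWE-119"}

print(cna_consensus_score(unanimous))
print(round(cna_consensus_score(split_vote), 1))
```

This definition treats CWE-787 and its parent CWE-119 as a plain disagreement. A scheme that gives partial credit for parent/child relationships would be a different design choice.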

REQ_CONSENSUS_DESC_SCORE: The dataset SHOULD include a consensus score that reflects the level of agreement between CWEs assigned for similar CVE descriptions. See https://github.com/CyberSecAI/cve_dedup/.

Quality Requirements

REQ_QUALITY_KNOWN_GOOD: The dataset MUST be known-good (i.e., the assigned CWEs are correct). CWE Observed Examples provide high-quality mappings but should be limited to those published after 2015 to ensure consistency with modern CVE description standards.

REQ_QUALITY_UTF8: The dataset MUST be UTF-8 only.
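The UTF-8 requirement can be checked by strictly decoding each file's bytes. The following is a minimal sketch; the helper names and the idea of scanning a directory tree are assumptions about how the dataset might be stored.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_utf8_files(root):
    """Return the files under `root` whose bytes are not valid UTF-8."""
    return [
        path
        for path in sorted(Path(root).rglob("*"))
        if path.is_file() and not is_valid_utf8(path.read_bytes())
    ]


print(is_valid_utf8("café".encode("utf-8")))    # valid UTF-8 bytes
print(is_valid_utf8("café".encode("latin-1")))  # lone 0xE9 byte is not valid UTF-8
```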

REQ_QUALITY_SCHEMA_VALIDATOR: The dataset MUST have a schema and validator script that is part of the dataset.

REQ_QUALITY_LOW_QUALITY_CVES: The dataset MUST include and identify low-quality CVEs: a sufficient number of low-information CVEs are labeled so that results for those CVEs can be isolated, e.g., to test for hallucinations/grounding.

REQ_QUALITY_DATA_SPLITS: The dataset MUST have train/validate/test splits.
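One way to keep splits stable across dataset versions is to derive the split from a hash of the CVE ID, so that an entry never moves between splits when new entries are added. The sketch below assumes an 80/10/10 ratio, which the requirement does not specify, and it does not stratify by CWE.

```python
import hashlib

# Hypothetical cumulative percentages: 80% train, 10% validate, 10% test.
SPLIT_BOUNDS = (("train", 80), ("validate", 90), ("test", 100))


def assign_split(cve_id: str) -> str:
    """Deterministically map a CVE ID to a split via a SHA-256 bucket (0-99)."""
    digest = hashlib.sha256(cve_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    for name, upper in SPLIT_BOUNDS:
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always below 100")


# Placeholder IDs; the same ID always lands in the same split.
for cve_id in ["CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"]:
    print(cve_id, assign_split(cve_id))
```

Because CWEs with only a handful of entries may end up missing from a split, a stratified variant (splitting within each CWE) may be preferable for rare classes.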

REQ_QUALITY_PRESERVE_ISSUES: The dataset SHOULD NOT fix existing typos or other issues in the CVEs.

REQ_QUALITY_ABSTRACTION_CHECK: The dataset MUST check abstraction levels of CWEs and verify that high-level CWEs (such as Pillars) are only included where no lower-level CWE exists.
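A first pass at this check could simply flag entries mapped to high-level CWEs for manual review. In the sketch below, the abstraction table is a small hand-written sample (a real run would load abstraction levels from the CWE catalog), treating Class as high-level alongside Pillar is an assumption, and the entry layout is hypothetical.

```python
# Hand-written sample of CWE abstraction levels, for illustration only.
ABSTRACTION = {
    "CWE-664": "Pillar",
    "CWE-119": "Class",
    "CWE-787": "Base",
    "CWE-79": "Base",
}

# Assumption: Pillar and Class both count as "high-level" for review purposes.
HIGH_LEVEL = frozenset({"Pillar", "Class"})


def high_level_assignments(entries, abstraction=ABSTRACTION, flagged=HIGH_LEVEL):
    """Return (cve_id, cwe_id, level) for assignments at a flagged abstraction level."""
    findings = []
    for entry in entries:
        for cwe in entry["cwe_ids"]:
            level = abstraction.get(cwe)
            if level in flagged:
                findings.append((entry["cve_id"], cwe, level))
    return findings


# Placeholder entries with made-up CVE IDs.
entries = [
    {"cve_id": "CVE-0000-0001", "cwe_ids": ["CWE-787"]},
    {"cve_id": "CVE-0000-0002", "cwe_ids": ["CWE-664", "CWE-79"]},
]

print(high_level_assignments(entries))
```

The output lists candidates only. Whether a lower-level CWE exists for a flagged entry still has to be decided by a reviewer.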

Data Sources

Datasets

REQ_DATASRC_TOP25_2023: The dataset SHOULD use Top25 2023 (approximately 7K entries) as a source.

REQ_DATASRC_TOP25_2022: The dataset SHOULD use Top25 2022 (approximately 7K entries) as a source.

REQ_DATASRC_NO_OBSERVED_EXAMPLES: The dataset SHOULD NOT use CWE Observed Examples published before 2015, as many are worded differently from modern CVE descriptions. Observed examples from 2015 onwards may be included as they provide high-quality mappings.

CVE Info

REQ_CVEINFO_JSON_5_0: https://github.com/CVEProject/cvelistV5 is the canonical source for CVE descriptions & references.
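Reading the description, reference URLs, and record state from a CVE JSON 5 record can be sketched as follows. The field paths (`cveMetadata.state`, `containers.cna.descriptions`, `containers.cna.references`) follow the CVE Record Format; the output layout, the function name, and the sample record are made up for illustration.

```python
def summarize_cve_record(record):
    """Extract ID, rejection state, English description, and reference URLs."""
    meta = record.get("cveMetadata", {})
    cna = record.get("containers", {}).get("cna", {})
    # Take the first description whose language tag starts with "en".
    description = next(
        (
            d.get("value", "")
            for d in cna.get("descriptions", [])
            if d.get("lang", "").lower().startswith("en")
        ),
        "",
    )
    return {
        "cve_id": meta.get("cveId"),
        "rejected": meta.get("state") == "REJECTED",
        "description": description,
        "reference_urls": [r["url"] for r in cna.get("references", []) if "url" in r],
    }


# Minimal made-up record with a placeholder ID and URL.
record = {
    "cveMetadata": {"cveId": "CVE-0000-0001", "state": "PUBLISHED"},
    "containers": {
        "cna": {
            "descriptions": [{"lang": "en", "value": "Example description."}],
            "references": [{"url": "https://example.com/advisory"}],
        }
    },
}

print(summarize_cve_record(record))
```

The `rejected` flag gives one way to support removing or labeling REJECTED CVEs when the dataset is refreshed.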

Validation

REQ_VALIDATION_SCHEMA: JSON schema validation MUST pass with schema.json.

REQ_VALIDATION_REQUIREMENTS: The dataset MUST be validated against the requirements.

REQ_VALIDATION_STATISTICS: Basic statistics (token lengths, keyphrase counts) MUST be generated.
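Such statistics could be produced with a short script like the sketch below. It counts whitespace-separated words as a stand-in for tokens, because no tokenizer is specified, and it assumes hypothetical `description` and `keyphrases` fields on each entry.

```python
from statistics import mean


def basic_statistics(entries):
    """Summarize description lengths (whitespace tokens) and keyphrase counts."""
    token_lengths = [len(entry["description"].split()) for entry in entries]
    keyphrase_counts = [len(entry.get("keyphrases", [])) for entry in entries]
    return {
        "entries": len(entries),
        "tokens_min": min(token_lengths),
        "tokens_max": max(token_lengths),
        "tokens_mean": mean(token_lengths),
        "keyphrases_mean": mean(keyphrase_counts),
    }


# Made-up entries; the second has no keyphrases field.
entries = [
    {"description": "Buffer overflow in parser.", "keyphrases": ["buffer overflow", "parser"]},
    {"description": "SQL injection in the login form allows data theft."},
]

print(basic_statistics(entries))
```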

REQ_VALIDATION_CONSENSUS_SCORES: Consensus scores for both CNA agreement and similar CVE descriptions MUST be calculated and validated.

References

  1. RFC 2119 for definitions of "SHOULD", "MUST", etc.
  2. CWE-1003 View for maintaining consistency in CWE assignments
  3. CWE Top 25 for priority CWE coverage
  4. Developer View (View-699) for comprehensive coverage approach

ToDo

  1. Define Known-Good: specify the validation standard and process.
  2. Define Quantitative Metrics: specify thresholds/methods for "sufficiently different".
  3. Define Qualitative Labels: specify criteria for "low-quality".
  4. Specify the methodology for calculating consensus scores (both CNA and description-based).
  5. Define minimum consensus score thresholds for creating high-quality benchmark subsets.