
CVE-CWE Dataset Schema and Example Layout

This document defines the JSON schema, example record, and directory structure used in the CVE-CWE Root Cause Mapping dataset. It serves as the "how" counterpart to the requirements document.


JSON Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "CVE_CWE_Record",
  "type": "object",
  "required": ["cve_id", "cwe_ids", "year", "description", "ref_links"],
  "properties": {
    "cve_id": {
      "type": "string",
      "pattern": "^CVE-\\d{4}-\\d{4,}$"
    },
    "cwe_ids": {
      "type": "array",
      "items": {
        "type": "string",
        "pattern": "^CWE-\\d+$"
      },
      "minItems": 1
    },
    "year": {
      "type": "integer",
      "minimum": 1999
    },
    "description": {
      "type": "string"
    },
    "ref_links": {
      "type": "array",
      "items": {
        "type": "string",
        "format": "uri"
      }
    },
    "ref_summaries": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "default": []
    },
    "keyphrases": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "default": []
    },
    "quality_flags": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "default": []
    },
    "consensus_cna_score": {
      "type": "number",
      "minimum": 0,
      "maximum": 100
    },
    "consensus_desc_score": {
      "type": "number",
      "minimum": 0,
      "maximum": 100
    }
  }
}
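
The `required` list and the two `pattern` constraints are the parts of the schema most likely to reject a record, so a quick pre-check can be useful before running a full validator. The sketch below uses only the standard library and is an illustration, not a replacement for a Draft 2020-12 validator. The helper name `precheck` is hypothetical, and `re.ASCII` is used on the assumption that only the digits 0-9 are intended.

```python
import re

# Patterns copied from the schema. fullmatch() stands in for the ^...$ anchors,
# and re.ASCII restricts \d to the digits 0-9.
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}", re.ASCII)
CWE_ID = re.compile(r"CWE-\d+", re.ASCII)
REQUIRED = ("cve_id", "cwe_ids", "year", "description", "ref_links")


def precheck(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = [f"missing required field: {name}" for name in REQUIRED if name not in record]

    cve_id = record.get("cve_id")
    if isinstance(cve_id, str) and not CVE_ID.fullmatch(cve_id):
        problems.append(f"bad cve_id: {cve_id!r}")

    cwe_ids = record.get("cwe_ids")
    if isinstance(cwe_ids, list):
        if not cwe_ids:
            problems.append("cwe_ids must contain at least one entry")
        for cwe in cwe_ids:
            if not (isinstance(cwe, str) and CWE_ID.fullmatch(cwe)):
                problems.append(f"bad cwe_id: {cwe!r}")
    return problems
```

This covers only a subset of the schema (it ignores `year`, the URI format, and the score ranges), so it is best treated as a fast filter in front of real schema validation.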

Example Record

{
  "cve_id": "CVE-2024-29009",
  "cwe_ids": ["CWE-352"],
  "year": 2024,
  "description": "CWE-352 Cross-Site Request Forgery (CSRF)",
  "ref_links": [
    "https://nvd.nist.gov/vuln/detail/CVE-2024-29009"
  ],
  "ref_summaries": [
    "Summary of reference content goes here"
  ],
  "keyphrases": [
    "cross-site request forgery",
    "unauthenticated action"
  ],
  "quality_flags": [],
  "consensus_cna_score": 100,
  "consensus_desc_score": 92
}
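
In the example record the `year` field matches the year segment of `cve_id`. The schema itself does not enforce that relationship, so the check below rests on the assumption that the two are meant to agree. The function name `year_from_cve_id` is hypothetical.

```python
def year_from_cve_id(cve_id: str) -> int:
    """Extract the four-digit year segment from an ID such as 'CVE-2024-29009'."""
    return int(cve_id.split("-")[1])


record = {"cve_id": "CVE-2024-29009", "year": 2024}

# Assumption: `year` is intended to equal the year embedded in `cve_id`.
consistent = year_from_cve_id(record["cve_id"]) == record["year"]
print(consistent)
```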

Example Directory Layout

top_level/
├── full.jsonl              # All records, one per line (gold labels)
├── schema.json             # JSON Schema (Draft 2020-12)
├── scripts/
│   ├── fetch.py            # Pull raw NVD entries
│   ├── clean.py            # Convert to schema format; deduplication, keyphrases
│   ├── benchmark.py        # Compare mapping predictions against benchmark
│   └── utils.py            # Shared helpers (e.g., semantic similarity)
├── README.md
├── LICENSE
├── Feedback.md
├── MaintenancePlan.md
└── validations/
    ├── schema_validator.py     # Validates schema compliance
    ├── requirements_check.py   # Verifies dataset meets core REQs
    └── stats_summary.py        # Reports on token lengths, distribution, etc.
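
Because `full.jsonl` stores one record per line, it can be streamed without loading the whole file into memory. A minimal reader sketch follows; the function name `iter_records` is hypothetical and does not describe what the scripts in `scripts/` or `validations/` actually contain.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a bad record is easy to locate.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc
```

A caller would iterate with something like `for record in iter_records(Path("top_level/full.jsonl")): ...`, validating or counting records as they arrive.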

Notes

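  • The keyphrases, quality_flags, and ref_summaries fields are optional but recommended; each defaults to an empty array.
  • The consensus_cna_score and consensus_desc_score fields are optional; when present, they range from 0 to 100 and align with REQ_CONSENSUS_CNA_SCORE and REQ_CONSENSUS_DESC_SCORE.
  • All schema validation, including validation code, must target JSON Schema Draft 2020-12.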
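
In JSON Schema, `default` is an annotation: a validator checks a record but generally does not insert default values into it. Code that expects the optional array fields to always be present therefore has to fill them itself. The sketch below shows one way to do that; the helper name `with_defaults` is hypothetical.

```python
import copy

# Optional array fields whose schema default is an empty array.
OPTIONAL_ARRAY_FIELDS = ("ref_summaries", "keyphrases", "quality_flags")


def with_defaults(record: dict) -> dict:
    """Return a copy of `record` with missing optional array fields set to []."""
    filled = copy.deepcopy(record)  # leave the caller's record untouched
    for name in OPTIONAL_ARRAY_FIELDS:
        filled.setdefault(name, [])  # a fresh list per field, never shared
    return filled
```

The consensus score fields are deliberately left alone here, since the schema defines no default for them and an absent score is not the same as a score of 0.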

This structure supports both ease of use and future extensibility, including automation for quality validation and benchmarking.