CVE-CWE Dataset Schema and Example Layout
This document defines the JSON schema, example record, and directory structure used in the CVE-CWE Root Cause Mapping dataset. It serves as the "how" counterpart to the requirements document.
JSON Schema
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "CVE_CWE_Record",
  "type": "object",
  "required": ["cve_id", "cwe_ids", "year", "description", "ref_links"],
  "properties": {
    "cve_id": {
      "type": "string",
      "pattern": "^CVE-\\d{4}-\\d{4,}$"
    },
    "cwe_ids": {
      "type": "array",
      "items": {
        "type": "string",
        "pattern": "^CWE-\\d+$"
      },
      "minItems": 1
    },
    "year": {
      "type": "integer",
      "minimum": 1999
    },
    "description": {
      "type": "string"
    },
    "ref_links": {
      "type": "array",
      "items": {
        "type": "string",
        "format": "uri"
      }
    },
    "ref_summaries": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "default": []
    },
    "keyphrases": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "default": []
    },
    "quality_flags": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "default": []
    },
    "consensus_cna_score": {
      "type": "number",
      "minimum": 0,
      "maximum": 100
    },
    "consensus_desc_score": {
      "type": "number",
      "minimum": 0,
      "maximum": 100
    }
  }
}
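To make the constraints above concrete, here is a minimal standard-library sketch that mirrors the schema's required fields, patterns, and numeric bounds. It is an approximation for illustration, not a substitute for a real JSON Schema validator: the names `check_record`, `CVE_ID`, `CWE_ID`, `REQUIRED`, and `SCORES` are hypothetical, the `uri` format is not checked, and the optional array fields are ignored.

```python
import re

# Same patterns as the schema. re.ASCII restricts \d to 0-9, and fullmatch()
# avoids Python's "$" also matching just before a trailing newline.
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}", re.ASCII)
CWE_ID = re.compile(r"CWE-\d+", re.ASCII)
REQUIRED = ("cve_id", "cwe_ids", "year", "description", "ref_links")
SCORES = ("consensus_cna_score", "consensus_desc_score")


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_record(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = [f"missing required field: {name}" for name in REQUIRED if name not in record]
    if problems:
        return problems

    cve_id = record["cve_id"]
    if not (isinstance(cve_id, str) and CVE_ID.fullmatch(cve_id)):
        problems.append("cve_id: expected CVE-YYYY-NNNN (4+ trailing digits)")

    cwe_ids = record["cwe_ids"]
    if not (isinstance(cwe_ids, list) and cwe_ids):
        problems.append("cwe_ids: expected a non-empty array")
    elif not all(isinstance(c, str) and CWE_ID.fullmatch(c) for c in cwe_ids):
        problems.append("cwe_ids: every item must look like CWE-<digits>")

    year = record["year"]
    if not (isinstance(year, int) and not isinstance(year, bool) and year >= 1999):
        problems.append("year: expected an integer >= 1999")

    if not isinstance(record["description"], str):
        problems.append("description: expected a string")

    links = record["ref_links"]
    if not (isinstance(links, list) and all(isinstance(u, str) for u in links)):
        problems.append("ref_links: expected an array of strings")

    for name in SCORES:
        if name in record and not (_is_number(record[name]) and 0 <= record[name] <= 100):
            problems.append(f"{name}: expected a number between 0 and 100")

    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a record at once, which tends to be more useful when cleaning a large dataset.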
Example Record
{
  "cve_id": "CVE-2024-29009",
  "cwe_ids": ["CWE-352"],
  "year": 2024,
  "description": "CWE-352 Cross-Site Request Forgery (CSRF)",
  "ref_links": [
    "https://nvd.nist.gov/vuln/detail/CVE-2024-29009"
  ],
  "ref_summaries": [
    "Summary of reference content goes here"
  ],
  "keyphrases": [
    "cross-site request forgery",
    "unauthenticated action"
  ],
  "quality_flags": [],
  "consensus_cna_score": 100,
  "consensus_desc_score": 92
}
Example Directory Layout
top_level/
├── full.jsonl                    # All records, one per line (gold labels)
├── schema.json                   # JSON Schema (Draft 2020-12)
├── scripts/
│   ├── fetch.py                  # Pull raw NVD entries
│   ├── clean.py                  # Convert to schema format; deduplication, keyphrases
│   ├── benchmark.py              # Compare mapping predictions against benchmark
│   └── utils.py                  # Shared helpers (e.g., semantic similarity)
├── README.md
├── LICENSE
├── Feedback.md
├── MaintenancePlan.md
└── validations/
    ├── schema_validator.py       # Validates schema compliance
    ├── requirements_check.py     # Verifies dataset meets core REQs
    └── stats_summary.py          # Reports on token lengths, distribution, etc.
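Because full.jsonl stores one record per line, scripts can stream it without loading the whole file into memory. The following is a rough sketch of how such a reader might look. The function name `iter_records` is hypothetical and is not taken from the repository's actual scripts.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[tuple[int, dict]]:
    """Yield (line_number, record) for each non-blank line of a JSONL stream."""
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # Include the line number so a bad record is easy to locate.
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield line_number, record


# Usage, assuming the file sits at the path shown in the layout above:
#
#     with open("full.jsonl", encoding="utf-8") as handle:
#         for line_number, record in iter_records(handle):
#             ...
```

Accepting any iterable of strings, rather than a file path, keeps the reader easy to exercise with in-memory data.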
Notes
- Fields like keyphrases, quality_flags, and ref_summaries are optional but recommended.
- consensus_*_score fields are optional but align with REQ_CONSENSUS_CNA_SCORE and REQ_CONSENSUS_DESC_SCORE.
- All code and schema validation must use Draft 2020-12.
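One practical consequence of the optional fields: in JSON Schema, `default` is an annotation, so validators generally do not insert the empty arrays into records that omit them. A consumer that wants a uniform shape can fill them in after loading. The sketch below shows one way to do that. The names `with_defaults` and `OPTIONAL_ARRAY_FIELDS` are hypothetical, and the consensus scores are left untouched because the schema declares no default for them.

```python
import copy

# Optional fields whose schema default is an empty array.
OPTIONAL_ARRAY_FIELDS = ("ref_summaries", "keyphrases", "quality_flags")


def with_defaults(record: dict) -> dict:
    """Return a copy of the record with missing optional array fields set to []."""
    filled = copy.deepcopy(record)  # never mutate the caller's record
    for name in OPTIONAL_ARRAY_FIELDS:
        # A fresh list per field, so the defaults are never shared.
        filled.setdefault(name, [])
    return filled
```

Copying first means the raw record stays available for validation or auditing exactly as it appeared in the file.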
This structure supports both ease of use and future extensibility, including automation for quality validation and benchmarking.