Dataset

The CTINEXUS dataset is a collection of annotated Cyber Threat Intelligence (CTI) reports designed to evaluate end-to-end knowledge graph construction systems. Unlike existing benchmarks that focus solely on triplet extraction from outdated reports, this dataset covers the complete pipeline: cybersecurity triplet extraction, hierarchical entity alignment, and long-distance relation prediction.

Table of contents

  1. Dataset Statistics
  2. Source Distribution
  3. Dataset Format
  4. Features

Dataset Statistics

  • 150 CTI reports published from May 2023 onwards
  • Sourced from 10 cybersecurity organizations (approximately 15 reports per source)
  • Publishers include Trend Micro, Symantec, The Hacker News, and others
  • Contains 4,292 mentions, 2,528 entities, and 2,503 relations

Source Distribution

Reports were collected from reputable cybersecurity sources to ensure diversity and quality of threat intelligence.

Dataset Format

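The dataset follows a JSON structure that captures the progressive enrichment of CTI information, from the raw report (CTI) through triplet extraction (IE) and entity alignment (EA) to long-distance relation prediction (LP):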

{
  "CTI": {
    "text": "Original report content...",
    "source": "Publisher name"
  },
  "IE": {
    "triplets": [
      {"subject": "Entity1", "predicate": "relation", "object": "Entity2"},
      ...
    ]
  },
  "EA": {
    "aligned_triplets": [
      {
        "subject": {"mention_id": 0, "mention_text": "Entity1", "entity_id": 1, ...},
        "predicate": "relation",
        "object": {"mention_id": 1, "mention_text": "Entity2", "entity_id": 2, ...}
      },
      ...
    ]
  },
  "LP": {
    "predicted_links": [
      {"subject": {...}, "relation": "implicit_relation", "object": {...}},
      ...
    ]
  }
}
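
To illustrate how the stages in this structure relate to one another, the following is a minimal Python sketch that collapses aligned mentions into entity nodes and gathers edges from the EA and LP sections. The sample record, its values, and the helper name build_graph are hypothetical and exist only for illustration. The sketch also assumes that the elided subject and object fields under "predicted_links" have the same shape as those under "aligned_triplets", which the structure above does not confirm.

```python
import json
from collections import defaultdict

# Hypothetical record that follows the layout above.
# Every value here is invented for illustration and is not taken from the dataset.
SAMPLE = json.loads("""
{
  "CTI": {"text": "ExampleGroup used ExampleLoader. The loader contacted example.test.",
          "source": "Example publisher"},
  "IE": {"triplets": [
    {"subject": "ExampleGroup", "predicate": "uses", "object": "ExampleLoader"},
    {"subject": "The loader", "predicate": "contacts", "object": "example.test"}
  ]},
  "EA": {"aligned_triplets": [
    {"subject": {"mention_id": 0, "mention_text": "ExampleGroup", "entity_id": 0},
     "predicate": "uses",
     "object": {"mention_id": 1, "mention_text": "ExampleLoader", "entity_id": 1}},
    {"subject": {"mention_id": 2, "mention_text": "The loader", "entity_id": 1},
     "predicate": "contacts",
     "object": {"mention_id": 3, "mention_text": "example.test", "entity_id": 2}}
  ]},
  "LP": {"predicted_links": [
    {"subject": {"mention_id": 0, "mention_text": "ExampleGroup", "entity_id": 0},
     "relation": "implicit_relation",
     "object": {"mention_id": 3, "mention_text": "example.test", "entity_id": 2}}
  ]}
}
""")


def build_graph(record):
    """Merge mentions that share an entity_id and collect edges from EA and LP."""
    mentions = defaultdict(set)  # entity_id -> set of mention_text values
    edges = []                   # (subject entity_id, label, object entity_id, stage)

    for triplet in record.get("EA", {}).get("aligned_triplets", []):
        for side in ("subject", "object"):
            mentions[triplet[side]["entity_id"]].add(triplet[side]["mention_text"])
        edges.append((triplet["subject"]["entity_id"], triplet["predicate"],
                      triplet["object"]["entity_id"], "EA"))

    # Predicted links store their label under "relation" rather than "predicate".
    # Assumption: their subject/object carry the same fields as aligned triplets.
    for link in record.get("LP", {}).get("predicted_links", []):
        for side in ("subject", "object"):
            mentions[link[side]["entity_id"]].add(link[side]["mention_text"])
        edges.append((link["subject"]["entity_id"], link["relation"],
                      link["object"]["entity_id"], "LP"))

    return dict(mentions), edges


nodes, edges = build_graph(SAMPLE)
for entity_id, texts in sorted(nodes.items()):
    print(entity_id, sorted(texts))
for edge in edges:
    print(edge)
```

In this sample, the mentions "ExampleLoader" and "The loader" share an entity_id and therefore end up in the same node, which is the effect entity alignment is meant to have on the extracted triplets.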

Features

This dataset enables researchers and practitioners to:

  • Evaluate end-to-end Cyber Security Knowledge Graph (CSKG) construction systems 🚀
  • Benchmark performance on modern threat intelligence (post-May 2023) 🚀
  • Test capabilities across multiple KG construction tasks rather than just triplet extraction 🚀
  • Develop more robust CTI analysis tools leveraging knowledge graph approaches 🚀
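
As one illustration of the evaluation use case listed above, the sketch below scores a system's extracted triplets against the annotated triplets of a single report using precision, recall, and F1. It is not the dataset's official metric: exact matching after lowercasing and trimming is an assumption made here for simplicity, and the function names and sample triplets are hypothetical.

```python
def normalize(triplet):
    """Lowercase and trim each field so trivial formatting differences do not count as errors."""
    return tuple(triplet[key].strip().lower() for key in ("subject", "predicate", "object"))


def triplet_prf(gold, predicted):
    """Return (precision, recall, f1) under exact match of normalized triplets."""
    gold_set = {normalize(t) for t in gold}
    pred_set = {normalize(t) for t in predicted}
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotated triplets (shaped like the "IE" section) and system output.
gold_triplets = [
    {"subject": "ExampleGroup", "predicate": "uses", "object": "ExampleLoader"},
    {"subject": "ExampleLoader", "predicate": "contacts", "object": "example.test"},
]
system_triplets = [
    {"subject": "examplegroup", "predicate": "uses", "object": "ExampleLoader "},
    {"subject": "ExampleLoader", "predicate": "drops", "object": "example.test"},
]

precision, recall, f1 = triplet_prf(gold_triplets, system_triplets)
print(precision, recall, f1)
```

Exact matching is strict and penalizes paraphrased entities or relations, so a real evaluation may prefer a softer matching scheme. The same set-based scoring could be applied to the EA and LP sections by comparing entity_id pairs instead of raw strings.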