Information Extraction (IE)
The IE module automatically processes unstructured CTI reports and extracts structured information in the form of triplets (subject-predicate-object). It uses demonstration-based learning to improve extraction accuracy and incorporates multiple LLM backends.
Table of contents
- Architecture
- Technical Components
- Configuration
- Usage Instructions
- Output Structure
- Key Features
- Extension Points
- Dependencies
Architecture
```
IE/
├── main.py               # Pipeline entry point
├── LLMAnnotator.py       # Core annotation processor
├── promptConstructor.py  # Builds prompts using templates
├── demoRetriever.py      # Retrieves relevant demonstration examples
├── LLMcaller.py          # Interfaces with different LLMs
├── responseParser.py     # Parses and structures LLM responses
├── usageCalculator.py    # Calculates API usage and costs
├── instructionLoader.py  # Loads instruction templates
└── config/               # Configuration directory
    └── example.yaml      # Default configuration file
```
Technical Components
Main Pipeline (`main.py`)
The main pipeline orchestrates the entire extraction process using Hydra for configuration management:
```python
import os

import hydra
from omegaconf import DictConfig

from LLMAnnotator import LLMAnnotator


@hydra.main(config_path="config", config_name="example", version_base="1.2")
def run(config: DictConfig):
    for CTI_Source in os.listdir(config.inSet):
        # Skip sources that already have annotated output
        annotatedCTISources = os.listdir(config.outSet)
        if CTI_Source in annotatedCTISources:
            continue
        # Process files in each source directory
        FolderPath = os.path.join(config.inSet, CTI_Source)
        for JSONFile in os.listdir(FolderPath):
            LLMAnnotator(config, CTI_Source, JSONFile).annotate()
```
LLM Annotation Process (`LLMAnnotator.py`)
The `LLMAnnotator` class coordinates the complete annotation workflow, sketched after the list below:
- Loads the input CTI report
- Retrieves relevant demonstrations (if configured)
- Constructs prompts using templates
- Calls the LLM API
- Parses responses and structures the output
- Saves results with metadata
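A minimal sketch of how these steps compose. The helper callables (`retrieve_demos`, `build_prompt`, `call_llm`, `parse_response`) are hypothetical stand-ins for the module's actual components, which may be wired differently:

```python
import json
import os
import time

def annotate(config, cti_source: str, json_file: str,
             retrieve_demos, build_prompt, call_llm, parse_response):
    # 1. Load the input CTI report
    with open(os.path.join(config.inSet, cti_source, json_file)) as f:
        report = json.load(f)
    # 2. Retrieve demonstrations when a retriever is configured
    demos = retrieve_demos(report) if config.retriever.shot else []
    # 3. Construct the prompt from the configured template
    prompt = build_prompt(report, demos)
    # 4. Call the LLM API and time the request
    start = time.time()
    response = call_llm(prompt)
    # 5. Parse the response into structured triplets
    output = parse_response(response, prompt)
    output["time"] = time.time() - start
    # 6. Save results (with metadata) under the output directory
    out_dir = os.path.join(config.outSet, cti_source)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, json_file), "w") as f:
        json.dump(output, f, indent=2)
```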
Demonstration Retrieval (`demoRetriever.py`)
Supports multiple strategies for selecting demonstration examples:
- kNN: Finds semantically similar examples using TF-IDF vectorization and distance metrics (see the sketch after this list)
- Random: Samples random examples from the demonstration set
- Fixed: Uses a manually specified set of examples
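A minimal sketch of the kNN strategy, assuming TF-IDF features and cosine distance; the exact metric and interface in `demoRetriever.py` may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def knn_demos(query: str, demo_pool: list, shot: int) -> list:
    # Fit TF-IDF over the query plus the demonstration pool
    matrix = TfidfVectorizer().fit_transform([query] + demo_pool)
    # Distance from the query (row 0) to every demonstration
    distances = cosine_distances(matrix[0:1], matrix[1:]).ravel()
    # Keep the `shot` nearest demonstrations
    return [demo_pool[i] for i in distances.argsort()[:shot]]
```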
Prompt Construction (`promptConstructor.py`)
Uses Jinja2 templating to build prompts with a flexible structure:
```python
from jinja2 import Environment, FileSystemLoader, meta

def generate_prompt(self):
    env = Environment(loader=FileSystemLoader(self.config.ie_prompt_set))
    DymTemplate = self.templ
    # Inspect the template source to find the variables it expects
    template_source = env.loader.get_source(env, DymTemplate)[0]
    parsed_content = env.parse(template_source)
    variables = meta.find_undeclared_variables(parsed_content)
    # Load and render template with appropriate variables
    template = env.get_template(DymTemplate)
    # ...
```
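For illustration, once the undeclared variables are known, rendering might look like the following; the template filename and variable names (`query`, `demos`) are hypothetical:

```python
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("IE_Prompts"))
# Hypothetical template file; the real one is set via the `templ` option
template = env.get_template("triplet_extraction.jinja")
prompt = template.render(
    query="Original CTI text...",
    demos=["demonstration 1", "demonstration 2"],
)
```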
LLM Integration (`LLMcaller.py`)
Supports multiple language model backends (a dispatch sketch follows the list):
- OpenAI Models: GPT-4 and variants
- Llama: Via local Ollama API or Hugging Face
- Qwen: Via local Ollama API
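A minimal dispatch sketch, assuming the official OpenAI Python SDK for GPT models and Ollama's local HTTP API for Llama/Qwen; the actual branching in `LLMcaller.py` may differ:

```python
import requests
from openai import OpenAI

def call_llm(model: str, prompt: str) -> str:
    if model.startswith("gpt"):
        # OpenAI backend; reads OPENAI_API_KEY from the environment
        client = OpenAI()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    # Llama/Qwen served by a local Ollama instance
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    return resp.json()["response"]
```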
Response Processing (`responseParser.py`)
Parses LLM responses and structures the extracted information:
```python
def parse(self):
    # GPT responses carry structured output and usage data; responses from
    # local models are wrapped into the same triplet structure manually.
    is_gpt = get_char_before_hyphen(self.config.model) == "gpt"
    self.output = {
        "CTI": self.query,
        "annotator": self.JSONResp if is_gpt else {
            "triplets": self.JSONResp,
            "triples_count": len(self.JSONResp),
        },
        "link": self.link,
        "usage": UsageCalculator(self.llm_response).calculate() if is_gpt else None,
        "prompt": self.prompt,
    }
    # Calculate triplet counts and additional metadata
    # ...
```
Usage Statistics (`usageCalculator.py`)
Calculates and tracks token usage and costs for API-based models:
```python
import json

def calculate(self):
    # model_price_menu points to a JSON price list (defined elsewhere)
    with open(model_price_menu, "r") as f:
        data = json.load(f)
    iprice = data[self.model]["input"]
    oprice = data[self.model]["output"]
    # Calculate input, output, and total costs
    # ...
```
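The elided arithmetic presumably multiplies token counts by these prices. A hypothetical helper mirroring the `cost` schema shown under Output Structure, assuming prices are quoted per 1K tokens (the menu's actual unit may differ):

```python
def summarize_cost(model: str, prompt_tokens: int, completion_tokens: int,
                   iprice: float, oprice: float) -> dict:
    # Cost per side = tokens / 1000 * price-per-1K-tokens (assumed unit)
    input_cost = prompt_tokens / 1000 * iprice
    output_cost = completion_tokens / 1000 * oprice
    return {
        "model": model,
        "input": {"tokens": prompt_tokens, "cost": input_cost},
        "output": {"tokens": completion_tokens, "cost": output_cost},
        "total": {"tokens": prompt_tokens + completion_tokens,
                  "cost": input_cost + output_cost},
    }
```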
Configuration
The module uses Hydra for configuration management. Key parameters in example.yaml:
```yaml
config_name: <*>      # Configuration profile name
inSet: <*>            # Input directory for CTI sources
outSet: <*>           # Output directory for results
model: <*>            # LLM model identifier
retriever:
  type: <*>           # Demonstration retrieval method
  permutation: <*>    # Order of retrieved examples
  shot: <*>           # Number of demonstrations to include
demo_set: <*>         # Directory with demonstration examples
ie_prompt_set: <*>    # Directory with prompt templates
templ: <*>            # Template file to use
ie_prompt_store: <*>  # Storage for used prompts
```
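For orientation, a hypothetical filled-in configuration; all values are illustrative, not defaults:

```yaml
config_name: example
inSet: ../data/CTI_reports
outSet: ../data/annotated
model: gpt-4
retriever:
  type: kNN
  permutation: ascending
  shot: 3
demo_set: ../data/demos
ie_prompt_set: IE_Prompts
templ: triplet_extraction.jinja
ie_prompt_store: ../data/prompts
```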
Usage Instructions
- Setup Configuration:
  - Modify `example.yaml` to specify input/output paths
  - Set your LLM model and API key (or use environment variables)
  - Configure demonstration parameters (shot count, retriever type)
  - Select an appropriate prompt template
- Run the Pipeline:

  ```bash
  cd IE
  python main.py
  ```
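Since the pipeline is managed by Hydra, configuration values can also be overridden from the command line, e.g. `python main.py model=<*> retriever.shot=<*>`.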
Output Structure
For each processed CTI report, the module generates:
- Structured JSON output:

  ```json
  {
    "CTI": "Original CTI text...",
    "IE": {
      "triplets": [
        {"subject": "...", "relation": "...", "object": "..."},
        ...
      ],
      "triples_count": ...,
      "cost": {
        "model": "...",
        "input": {"tokens": ..., "cost": ...},
        "output": {"tokens": ..., "cost": ...},
        "total": {"tokens": ..., "cost": ...}
      },
      "time": ...,
      "Prompt": {
        "constructed_prompt": "...",
        "prompt_template": "...",
        "demo_retriever": "...",
        "demos": ["...", ...],
        "demo_number": ...,
        "permutation": "..."
      }
    }
  }
  ```
- Prompt Archives: Each prompt is saved for reproducibility and debugging
Key Features
- Multi-model Support: Works with OpenAI GPT models, Llama, and Qwen 🚀
- Few-Shot Learning: Uses relevant examples to guide extraction 🚀
- Flexible Templating: Uses Jinja2 for adaptive prompt construction 🚀
- Smart Demo Retrieval: kNN-based selection of similar examples 🚀
- Usage Tracking: Calculates and tracks token usage and API costs 🚀
- Reproducibility: Saves all prompts and configuration details 🚀
Extension Points
- Custom Templates: Add new templates to `IE_Prompts`
- New Models: Extend `LLMcaller.py` to support additional LLMs
- Retrieval Methods: Implement alternatives to kNN in `demoRetriever.py`
- Output Formats: Modify `responseParser.py` for different output structures
Dependencies
- Python 3.8+
- Hydra
- Jinja2
- OpenAI API (for GPT models)
- Ollama (for local Llama/Qwen deployment)
- scikit-learn (for kNN retrieval)
- NLTK (for text processing)