Knowledge Graph Configuration Guide

Overview

This guide covers all configuration options for the AIECS Knowledge Graph system, including storage backends, feature flags, and performance tuning.

Table of Contents

  1. Storage Configuration

  2. Feature Flags

  3. Knowledge Fusion Configuration

  4. Reranking Configuration

  5. Cache Configuration

  6. Query Optimization

  7. Environment Variables

  8. Configuration Examples

  9. Best Practices

  10. Troubleshooting

Storage Configuration

Backend Selection

Choose the appropriate storage backend based on your use case:

# In-memory (default) - Fast, no persistence
KG_STORAGE_BACKEND=inmemory

# SQLite - File-based persistence, single-user
KG_STORAGE_BACKEND=sqlite

# PostgreSQL - Production-ready, multi-user
KG_STORAGE_BACKEND=postgresql
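
A minimal sketch of how an application might read and validate this setting. The helper name `resolve_backend` is hypothetical and not part of AIECS; only the variable name, the three values, and the `inmemory` default come from this guide.

```python
import os

# Values documented in this guide; the helper itself is illustrative only.
VALID_BACKENDS = ("inmemory", "sqlite", "postgresql")


def resolve_backend(env=None):
    """Return the configured backend, falling back to the documented default."""
    env = os.environ if env is None else env
    # Normalizing case and whitespace is an assumption made for convenience.
    backend = env.get("KG_STORAGE_BACKEND", "inmemory").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"KG_STORAGE_BACKEND={backend!r} is not one of {VALID_BACKENDS}"
        )
    return backend
```

Failing fast on an unknown value surfaces a typo at startup rather than at the first query.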

In-Memory Configuration

# Maximum number of nodes (default: 100000)
KG_INMEMORY_MAX_NODES=100000

Use Cases:

  • Development and testing

  • Temporary graphs

  • Small to medium datasets (<100K nodes)

SQLite Configuration

# Database file path (default: ./storage/knowledge_graph.db)
KG_SQLITE_DB_PATH=./storage/knowledge_graph.db

Use Cases:

  • Single-user applications

  • File-based persistence

  • Medium datasets (<1M nodes)

PostgreSQL Configuration

# PostgreSQL connection settings
KG_POSTGRES_HOST=localhost
KG_POSTGRES_PORT=5432
KG_POSTGRES_USER=postgres
KG_POSTGRES_PASSWORD=your_password
KG_POSTGRES_DATABASE=knowledge_graph

# Connection pool settings
KG_MIN_POOL_SIZE=5
KG_MAX_POOL_SIZE=20

# Enable pgvector for optimized vector search (requires pgvector extension)
KG_ENABLE_PGVECTOR=false
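
The connection variables above can be combined into a standard PostgreSQL connection URI. The sketch below is illustrative: `postgres_dsn` and `pool_bounds` are hypothetical helpers, and the fallback values simply mirror the example values shown in this guide, which may not be the library's real defaults.

```python
import os
from urllib.parse import quote


def postgres_dsn(env=None):
    """Build a postgresql:// URI from the KG_POSTGRES_* variables."""
    env = os.environ if env is None else env
    # Percent-encode credentials so characters such as '@' or '/' stay valid.
    user = quote(env.get("KG_POSTGRES_USER", "postgres"), safe="")
    password = quote(env.get("KG_POSTGRES_PASSWORD", ""), safe="")
    host = env.get("KG_POSTGRES_HOST", "localhost")
    port = int(env.get("KG_POSTGRES_PORT", "5432"))
    database = quote(env.get("KG_POSTGRES_DATABASE", "knowledge_graph"), safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


def pool_bounds(env=None):
    """Read and sanity-check the connection pool size settings."""
    env = os.environ if env is None else env
    minimum = int(env.get("KG_MIN_POOL_SIZE", "5"))
    maximum = int(env.get("KG_MAX_POOL_SIZE", "20"))
    if minimum < 0 or maximum < max(minimum, 1):
        raise ValueError(
            f"invalid pool bounds: min={minimum}, max={maximum}"
        )
    return minimum, maximum
```

Checking that the minimum does not exceed the maximum catches a common copy-paste mistake when the two values are tuned separately.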

Use Cases:

  • Production deployments

  • Multi-user applications

  • Large datasets (>1M nodes)

  • High concurrency
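
The use-case lists for the three backends can be condensed into a rough decision rule. This is only a sketch of the guidance in this section: `recommend_backend` is a hypothetical helper, and the exact handling of the 100K and 1M boundaries is an assumption.

```python
def recommend_backend(node_count, multi_user=False, needs_persistence=False):
    """Suggest a KG_STORAGE_BACKEND value from the use cases in this guide."""
    # Multi-user or very large graphs point to PostgreSQL.
    if multi_user or node_count >= 1_000_000:
        return "postgresql"
    # Medium graphs, or anything that must survive a restart, fit SQLite.
    if node_count >= 100_000 or needs_persistence:
        return "sqlite"
    # Small, temporary graphs can stay in memory.
    return "inmemory"
```

For example, `recommend_backend(50_000, needs_persistence=True)` suggests `sqlite`, because the in-memory backend does not persist data.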

Feature Flags

Control which features are enabled in your deployment:

Runnable Pattern

# Enable Runnable pattern for composable operations (default: true)
KG_ENABLE_RUNNABLE_PATTERN=true

Benefits:

  • Composable graph operations

  • Pipeline chaining

  • Async/sync compatibility

When to disable:

  • Legacy code compatibility

  • Simplified debugging

Knowledge Fusion

# Enable cross-document entity merging (default: true)
KG_ENABLE_KNOWLEDGE_FUSION=true

Benefits:

  • Merge duplicate entities across documents

  • Resolve property conflicts

  • Track provenance

When to disable:

  • Single-document graphs

  • No duplicate entities expected

Result Reranking

# Enable search result reranking (default: true)
KG_ENABLE_RERANKING=true

Benefits:

  • Improved search relevance

  • Multiple ranking signals

  • Better precision

When to disable:

  • Performance-critical applications

  • Simple search requirements

Logical Queries

# Enable logical query parsing (default: true)
KG_ENABLE_LOGICAL_QUERIES=true

Benefits:

  • Natural language to structured queries

  • Query validation

  • Execution planning

When to disable:

  • Simple query patterns only

  • No NLP requirements

Structured Data Import

# Enable CSV/JSON import (default: true)
KG_ENABLE_STRUCTURED_IMPORT=true

Benefits:

  • Import from CSV/JSON files

  • Schema mapping

  • Bulk data loading

When to disable:

  • Text-only extraction

  • No structured data sources

Knowledge Fusion Configuration

Similarity Threshold

# Similarity threshold for entity fusion (0.0-1.0, default: 0.85)
KG_FUSION_SIMILARITY_THRESHOLD=0.85

Guidelines:

  • 0.95-1.0: Very strict, only near-identical entities

  • 0.85-0.95: Balanced (recommended)

  • 0.70-0.85: Lenient, more merges

  • <0.70: Very lenient, risk of false positives
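
To make the effect of the threshold concrete, the sketch below uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure. This is an assumption for illustration only: the guide does not say how AIECS computes entity similarity, and `should_merge` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in, not the AIECS metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def should_merge(a, b, threshold=0.85):
    """Merge two entity names when their similarity reaches the threshold."""
    return name_similarity(a, b) >= threshold


# "acme corp" and "acme corp." share 9 of 19 total characters twice over,
# so the ratio is 18/19 (about 0.947).
print(should_merge("Acme Corp", "Acme Corp.", threshold=0.85))  # True
print(should_merge("Acme Corp", "Acme Corp.", threshold=0.95))  # False
```

The same pair merges under the balanced setting but not under the very strict one, which is the trade-off the guidelines describe.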

Conflict Resolution Strategy

# Strategy for resolving property conflicts (default: most_complete)
KG_FUSION_CONFLICT_RESOLUTION=most_complete

Available Strategies:

  1. most_complete: Prefer non-empty, longer values (default)

    • Best for: General use, data enrichment

  2. most_recent: Prefer values from most recent timestamp

    • Best for: Time-sensitive data, news articles

  3. most_confident: Prefer values from most confident sources

    • Best for: Weighted sources, quality-ranked data

  4. longest: Prefer longest string values

    • Best for: Descriptions, detailed text

  5. keep_all: Keep all conflicting values as a list

    • Best for: Preserving all information, manual review
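
The strategies above can be sketched as a single selection function. This is a simplified illustration under stated assumptions: the `Candidate` type, its `timestamp` and `confidence` fields, and the `resolve` function are hypothetical, and empty values are dropped for every strategy, so `most_complete` and `longest` coincide here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One conflicting property value plus hypothetical provenance metadata."""
    value: object
    timestamp: float = 0.0
    confidence: float = 0.0


def _is_empty(value):
    return value is None or value == "" or value == [] or value == {}


def resolve(candidates, strategy="most_complete"):
    """Pick a winning value (or all values) according to the named strategy."""
    usable = [c for c in candidates if not _is_empty(c.value)]
    if not usable:
        return None
    if strategy == "keep_all":
        return [c.value for c in usable]
    if strategy == "most_recent":
        return max(usable, key=lambda c: c.timestamp).value
    if strategy == "most_confident":
        return max(usable, key=lambda c: c.confidence).value
    if strategy in ("most_complete", "longest"):
        return max(usable, key=lambda c: len(str(c.value))).value
    raise ValueError(f"unknown conflict resolution strategy: {strategy!r}")
```

With three candidates for a company name, `most_recent` follows the newest timestamp while `most_confident` follows the highest source score, so the two can disagree on the same input.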

Reranking Configuration

Default Strategy

# Default reranking strategy (default: hybrid)
KG_RERANKING_DEFAULT_STRATEGY=hybrid

Available Strategies:

  1. text: BM25-based text similarity

    • Fast, keyword-focused

  2. semantic: Deep semantic similarity

    • Slower, meaning-focused

  3. structural: Graph importance signals

    • Graph-aware, centrality-based

  4. hybrid: Combines all signals (recommended)

    • Best results, slightly slower
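
One common way to combine several ranking signals is a weighted average. The sketch below illustrates that idea for the `hybrid` strategy; the weights, the result dictionary layout, and both function names are assumptions, since this guide does not describe how AIECS combines its signals.

```python
# Hypothetical weights; the real hybrid strategy may weigh signals differently.
DEFAULT_WEIGHTS = {"text": 0.3, "semantic": 0.5, "structural": 0.2}


def hybrid_score(signals, weights=None):
    """Weighted average of per-signal scores; missing signals count as 0."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    total = sum(weights.values())
    return sum(w * signals.get(name, 0.0) for name, w in weights.items()) / total


def rerank(results, weights=None):
    """Sort results by descending hybrid score."""
    return sorted(
        results,
        key=lambda r: hybrid_score(r["signals"], weights),
        reverse=True,
    )
```

Passing weights such as `{"text": 1.0}` reduces the same function to a single-signal ranking, which mirrors the relationship between `hybrid` and the other three strategies.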

Top-K Configuration

# Number of results to fetch before reranking (default: 100)
KG_RERANKING_TOP_K=100

Guidelines:

  • Higher values: Better recall, slower

  • Lower values: Faster, may miss relevant results

  • Recommended: 2-10x your final result count
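
The recall trade-off follows from the two-stage shape of reranked search: a cheap first pass selects `top_k` candidates, and only those are rescored. The sketch below is a generic illustration with hypothetical names, not the AIECS search implementation.

```python
def two_stage_search(items, cheap_score, rerank_score, top_k=100, limit=10):
    """Fetch top_k candidates by a cheap score, then rerank and truncate."""
    candidates = sorted(items, key=cheap_score, reverse=True)[:top_k]
    return sorted(candidates, key=rerank_score, reverse=True)[:limit]


# Each item is (id, cheap first-pass score, more accurate rerank score).
items = [("a", 0.9, 0.1), ("b", 0.8, 0.2), ("c", 0.1, 0.99)]

# With top_k=2, item "c" never reaches the reranker even though it scores best.
narrow = two_stage_search(items, lambda i: i[1], lambda i: i[2], top_k=2, limit=1)
wide = two_stage_search(items, lambda i: i[1], lambda i: i[2], top_k=3, limit=1)
print(narrow[0][0], wide[0][0])  # b c
```

A larger `top_k` gives the reranker a chance to promote results the first pass undervalued, at the cost of scoring more candidates.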

Cache Configuration

Query Cache

# Enable query result caching (default: true)
KG_ENABLE_QUERY_CACHE=true

# Cache TTL in seconds (default: 300 = 5 minutes)
KG_CACHE_TTL_SECONDS=300

Benefits:

  • Faster repeated queries

  • Reduced database load

  • Better performance

When to disable:

  • Real-time data requirements

  • Frequently changing graphs
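
To show what the TTL setting controls, here is a minimal time-to-live cache sketch. `TTLCache` is a hypothetical class written for this guide, not the AIECS cache; the injectable `clock` parameter exists only to make expiry easy to test.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expiry time, value)

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # Drop stale entries lazily on read.
            return None
        return value
```

A result stored at time 0 with the default TTL is served until just before 300 seconds have passed, which is why a long TTL can return stale results on a frequently changing graph.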

Schema Cache

# Enable schema caching (default: true)
KG_ENABLE_SCHEMA_CACHE=true

# Schema cache TTL in seconds (default: 3600 = 1 hour)
KG_SCHEMA_CACHE_TTL_SECONDS=3600

Benefits:

  • Faster schema operations

  • Reduced metadata queries

  • Better type inference

When to disable:

  • Frequently changing schemas

  • Development/testing

Query Optimization

Enable Optimization

# Enable query optimization (default: true)
KG_ENABLE_QUERY_OPTIMIZATION=true

Benefits:

  • Faster query execution

  • Better resource utilization

  • Automatic query planning

Optimization Strategy

# Optimization strategy (default: balanced)
KG_QUERY_OPTIMIZATION_STRATEGY=balanced

Available Strategies:

  1. cost: Minimize computational cost

    • Best for: Resource-constrained environments

  2. latency: Minimize query latency

    • Best for: Real-time applications

  3. balanced: Balance cost and latency (recommended)

    • Best for: General use

Environment Variables

Complete Reference

# =====================================
# Storage Configuration
# =====================================
KG_STORAGE_BACKEND=inmemory
KG_SQLITE_DB_PATH=./storage/knowledge_graph.db
KG_POSTGRES_HOST=localhost
KG_POSTGRES_PORT=5432
KG_POSTGRES_USER=postgres
KG_POSTGRES_PASSWORD=your_password
KG_POSTGRES_DATABASE=knowledge_graph
KG_MIN_POOL_SIZE=5
KG_MAX_POOL_SIZE=20
KG_ENABLE_PGVECTOR=false
KG_INMEMORY_MAX_NODES=100000

# =====================================
# Vector and Query Configuration
# =====================================
KG_VECTOR_DIMENSION=1536
KG_DEFAULT_SEARCH_LIMIT=10
KG_MAX_TRAVERSAL_DEPTH=5

# =====================================
# Cache Configuration
# =====================================
KG_ENABLE_QUERY_CACHE=true
KG_CACHE_TTL_SECONDS=300
KG_ENABLE_SCHEMA_CACHE=true
KG_SCHEMA_CACHE_TTL_SECONDS=3600

# =====================================
# Feature Flags
# =====================================
KG_ENABLE_RUNNABLE_PATTERN=true
KG_ENABLE_KNOWLEDGE_FUSION=true
KG_ENABLE_RERANKING=true
KG_ENABLE_LOGICAL_QUERIES=true
KG_ENABLE_STRUCTURED_IMPORT=true

# =====================================
# Knowledge Fusion Configuration
# =====================================
KG_FUSION_SIMILARITY_THRESHOLD=0.85
KG_FUSION_CONFLICT_RESOLUTION=most_complete

# =====================================
# Reranking Configuration
# =====================================
KG_RERANKING_DEFAULT_STRATEGY=hybrid
KG_RERANKING_TOP_K=100

# =====================================
# Query Optimization
# =====================================
KG_ENABLE_QUERY_OPTIMIZATION=true
KG_QUERY_OPTIMIZATION_STRATEGY=balanced
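
A subset of this reference can be loaded into a typed settings object, as sketched below. `KGSettings` is a hypothetical class, not the AIECS configuration API; the field defaults are copied from the reference above, and the range checks are assumptions about what counts as a sensible value.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class KGSettings:
    """Illustrative subset of the KG_* variables with their listed values."""
    storage_backend: str = "inmemory"
    vector_dimension: int = 1536
    default_search_limit: int = 10
    max_traversal_depth: int = 5
    cache_ttl_seconds: int = 300
    fusion_similarity_threshold: float = 0.85
    reranking_top_k: int = 100

    def __post_init__(self):
        if not 0.0 <= self.fusion_similarity_threshold <= 1.0:
            raise ValueError("KG_FUSION_SIMILARITY_THRESHOLD must be in [0.0, 1.0]")
        for name in ("vector_dimension", "default_search_limit",
                     "max_traversal_depth", "cache_ttl_seconds",
                     "reranking_top_k"):
            if getattr(self, name) <= 0:
                raise ValueError(f"KG_{name.upper()} must be positive")

    @classmethod
    def from_env(cls, env=None):
        """Override defaults with any KG_<FIELD_NAME> variables that are set."""
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get("KG_" + f.name.upper())
            if raw is not None:
                # Convert using the type of the default (str, int, or float).
                overrides[f.name] = type(f.default)(raw)
        return cls(**overrides)
```

Calling `KGSettings.from_env()` with no variables set yields the values listed above, so only deviations need to appear in an environment file.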

Configuration Examples

Development Setup

Fast iteration with in-memory storage:

# .env.development
KG_STORAGE_BACKEND=inmemory
KG_INMEMORY_MAX_NODES=50000
KG_ENABLE_QUERY_CACHE=false
KG_ENABLE_SCHEMA_CACHE=false
KG_ENABLE_QUERY_OPTIMIZATION=false

Testing Setup

File-based persistence for reproducible tests:

# .env.test
KG_STORAGE_BACKEND=sqlite
KG_SQLITE_DB_PATH=./test_data/test_graph.db
KG_ENABLE_QUERY_CACHE=true
KG_CACHE_TTL_SECONDS=60
KG_ENABLE_QUERY_OPTIMIZATION=true

Production Setup

PostgreSQL with all optimizations:

# .env.production
KG_STORAGE_BACKEND=postgresql
KG_POSTGRES_HOST=db.example.com
KG_POSTGRES_PORT=5432
KG_POSTGRES_USER=kg_user
KG_POSTGRES_PASSWORD=secure_password
KG_POSTGRES_DATABASE=knowledge_graph
KG_MIN_POOL_SIZE=10
KG_MAX_POOL_SIZE=50
KG_ENABLE_PGVECTOR=true

# Enable all features
KG_ENABLE_RUNNABLE_PATTERN=true
KG_ENABLE_KNOWLEDGE_FUSION=true
KG_ENABLE_RERANKING=true
KG_ENABLE_LOGICAL_QUERIES=true
KG_ENABLE_STRUCTURED_IMPORT=true

# Optimize for production
KG_ENABLE_QUERY_CACHE=true
KG_CACHE_TTL_SECONDS=600
KG_ENABLE_SCHEMA_CACHE=true
KG_SCHEMA_CACHE_TTL_SECONDS=7200
KG_ENABLE_QUERY_OPTIMIZATION=true
KG_QUERY_OPTIMIZATION_STRATEGY=balanced

# Reranking for best results
KG_RERANKING_DEFAULT_STRATEGY=hybrid
KG_RERANKING_TOP_K=200

# Fusion for data quality
KG_FUSION_SIMILARITY_THRESHOLD=0.85
KG_FUSION_CONFLICT_RESOLUTION=most_complete

High-Performance Setup

Optimized for speed:

# .env.performance
KG_STORAGE_BACKEND=postgresql
KG_ENABLE_PGVECTOR=true
# PostgreSQL's max_connections setting (typically 100 by default) must exceed this value
KG_MAX_POOL_SIZE=100

# Aggressive caching
KG_ENABLE_QUERY_CACHE=true
KG_CACHE_TTL_SECONDS=1800
KG_ENABLE_SCHEMA_CACHE=true
KG_SCHEMA_CACHE_TTL_SECONDS=14400

# Latency optimization
KG_ENABLE_QUERY_OPTIMIZATION=true
KG_QUERY_OPTIMIZATION_STRATEGY=latency

# Disable expensive features
KG_ENABLE_RERANKING=false
KG_ENABLE_KNOWLEDGE_FUSION=false

Data Quality Setup

Optimized for accuracy:

# .env.quality
KG_STORAGE_BACKEND=postgresql

# Enable all quality features
KG_ENABLE_KNOWLEDGE_FUSION=true
KG_ENABLE_RERANKING=true
KG_ENABLE_LOGICAL_QUERIES=true

# Strict fusion
KG_FUSION_SIMILARITY_THRESHOLD=0.90
KG_FUSION_CONFLICT_RESOLUTION=most_confident

# Best reranking
KG_RERANKING_DEFAULT_STRATEGY=hybrid
KG_RERANKING_TOP_K=500

# Balanced optimization
KG_QUERY_OPTIMIZATION_STRATEGY=balanced

Best Practices

1. Start Simple

Begin with default settings and adjust based on your needs:

# Minimal configuration
KG_STORAGE_BACKEND=inmemory

2. Monitor Performance

Track key metrics:

  • Query latency

  • Cache hit rate

  • Fusion merge rate

  • Reranking impact
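
Two of these metrics, query latency and cache hit rate, can be tracked with a small in-process recorder. The sketch below is a hypothetical helper for illustration; it does not reflect any metrics API provided by AIECS.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class QueryMetrics:
    """Accumulates per-query latency and cache hit/miss counts."""
    cache_hits: int = 0
    cache_misses: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, latency_ms, cache_hit):
        self.latencies_ms.append(latency_ms)
        if cache_hit:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    @property
    def cache_hit_rate(self):
        total = self.cache_hits + self.cache_misses
        return self.cache_hits / total if total else 0.0

    @property
    def median_latency_ms(self):
        return median(self.latencies_ms) if self.latencies_ms else 0.0
```

Comparing these numbers before and after a configuration change shows whether the change actually helped.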

3. Tune Gradually

Adjust one parameter at a time:

  1. Choose storage backend

  2. Enable/disable features

  3. Tune cache settings

  4. Optimize queries

4. Environment-Specific Configs

Use different configurations for different environments:

  • .env.development - Fast iteration

  • .env.test - Reproducible tests

  • .env.staging - Production-like

  • .env.production - Optimized for scale
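
A minimal sketch of selecting and parsing one of these files by environment name. The helper names are hypothetical, and the parser handles only plain `KEY=VALUE` lines and `#` comments as used in this guide; quoting and escapes are out of scope.

```python
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # Ignore lines without an assignment.
        values[key.strip()] = value.strip()
    return values


def env_file_for(environment, directory="."):
    """Map an environment name such as 'test' to its .env.<name> path."""
    return Path(directory) / f".env.{environment}"


def load_env(environment, directory="."):
    """Read and parse the environment file for the given name."""
    path = env_file_for(environment, directory)
    return parse_env_text(path.read_text(encoding="utf-8"))
```

For example, `load_env("test")` reads `./.env.test` and returns its settings as a dictionary, which can then be passed to whatever code consumes the `KG_*` values.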

5. Security Considerations

  • Never commit .env files to version control

  • Use strong passwords for PostgreSQL

  • Restrict database access

  • Enable SSL for production databases

Troubleshooting

Slow Queries

Problem: Queries are taking too long

Solutions:

  1. Enable query optimization: KG_ENABLE_QUERY_OPTIMIZATION=true

  2. Increase cache TTL: KG_CACHE_TTL_SECONDS=600

  3. Use PostgreSQL with pgvector: KG_ENABLE_PGVECTOR=true

  4. Reduce reranking top-K: KG_RERANKING_TOP_K=50

High Memory Usage

Problem: Application using too much memory

Solutions:

  1. Switch to SQLite or PostgreSQL: KG_STORAGE_BACKEND=sqlite

  2. Reduce in-memory max nodes: KG_INMEMORY_MAX_NODES=50000

  3. Disable caching: KG_ENABLE_QUERY_CACHE=false

  4. Reduce cache TTL: KG_CACHE_TTL_SECONDS=60

Too Many Merged Entities

Problem: Fusion is merging entities that should remain distinct

Solutions:

  1. Increase similarity threshold: KG_FUSION_SIMILARITY_THRESHOLD=0.90

  2. Change conflict resolution: KG_FUSION_CONFLICT_RESOLUTION=keep_all

  3. Review entity extraction quality

Poor Search Results

Problem: Search results are not relevant

Solutions:

  1. Enable reranking: KG_ENABLE_RERANKING=true

  2. Use hybrid strategy: KG_RERANKING_DEFAULT_STRATEGY=hybrid

  3. Increase reranking top-K: KG_RERANKING_TOP_K=200

  4. Verify that the vector dimension matches your embedding model's output size: KG_VECTOR_DIMENSION=1536

See Also