# Knowledge Graph Configuration Guide

## Overview

This guide covers all configuration options for the AIECS Knowledge Graph system, including storage backends, feature flags, and performance tuning.

## Table of Contents

1. [Storage Configuration](#storage-configuration)
2. [Feature Flags](#feature-flags)
3. [Knowledge Fusion Configuration](#knowledge-fusion-configuration)
4. [Reranking Configuration](#reranking-configuration)
5. [Cache Configuration](#cache-configuration)
6. [Query Optimization](#query-optimization)
7. [Environment Variables](#environment-variables)
8. [Configuration Examples](#configuration-examples)

## Storage Configuration

### Backend Selection

Choose the appropriate storage backend based on your use case:

```bash
# In-memory (default) - Fast, no persistence
KG_STORAGE_BACKEND=inmemory

# SQLite - File-based persistence, single-user
KG_STORAGE_BACKEND=sqlite

# PostgreSQL - Production-ready, multi-user
KG_STORAGE_BACKEND=postgresql
```

### In-Memory Configuration

```bash
# Maximum number of nodes (default: 100000)
KG_INMEMORY_MAX_NODES=100000
```

**Use Cases:**
- Development and testing
- Temporary graphs
- Small to medium datasets (<100K nodes)

### SQLite Configuration

```bash
# Database file path (default: ./storage/knowledge_graph.db)
KG_SQLITE_DB_PATH=./storage/knowledge_graph.db
```

**Use Cases:**
- Single-user applications
- File-based persistence
- Medium datasets (<1M nodes)

### PostgreSQL Configuration

```bash
# PostgreSQL connection settings
KG_POSTGRES_HOST=localhost
KG_POSTGRES_PORT=5432
KG_POSTGRES_USER=postgres
KG_POSTGRES_PASSWORD=your_password
KG_POSTGRES_DATABASE=knowledge_graph

# Connection pool settings
KG_MIN_POOL_SIZE=5
KG_MAX_POOL_SIZE=20

# Enable pgvector for optimized vector search (requires pgvector extension)
KG_ENABLE_PGVECTOR=false
```

**Use Cases:**
- Production deployments
- Multi-user applications
- Large datasets (>1M nodes)
- High concurrency

## Feature Flags

Control which features are enabled in your deployment:

### Runnable Pattern

```bash
# Enable Runnable pattern for composable operations (default: true)
KG_ENABLE_RUNNABLE_PATTERN=true
```

**Benefits:**
- Composable graph operations
- Pipeline chaining
- Async/sync compatibility

**When to disable:**
- Legacy code compatibility
- Simplified debugging

### Knowledge Fusion

```bash
# Enable cross-document entity merging (default: true)
KG_ENABLE_KNOWLEDGE_FUSION=true
```

**Benefits:**
- Merge duplicate entities across documents
- Resolve property conflicts
- Track provenance

**When to disable:**
- Single-document graphs
- No duplicate entities expected

### Result Reranking

```bash
# Enable search result reranking (default: true)
KG_ENABLE_RERANKING=true
```

**Benefits:**
- Improved search relevance
- Multiple ranking signals
- Better precision

**When to disable:**
- Performance-critical applications
- Simple search requirements

### Logical Queries

```bash
# Enable logical query parsing (default: true)
KG_ENABLE_LOGICAL_QUERIES=true
```

**Benefits:**
- Natural language to structured queries
- Query validation
- Execution planning

**When to disable:**
- Simple query patterns only
- No NLP requirements

### Structured Data Import

```bash
# Enable CSV/JSON import (default: true)
KG_ENABLE_STRUCTURED_IMPORT=true
```

**Benefits:**
- Import from CSV/JSON files
- Schema mapping
- Bulk data loading

**When to disable:**
- Text-only extraction
- No structured data sources

## Knowledge Fusion Configuration

### Similarity Threshold

```bash
# Similarity threshold for entity fusion (0.0-1.0, default: 0.85)
KG_FUSION_SIMILARITY_THRESHOLD=0.85
```

**Guidelines:**
- **0.95-1.0**: Very strict, only near-identical entities
- **0.85-0.95**: Balanced (recommended)
- **0.70-0.85**: Lenient, more merges
- **<0.70**: Very lenient, risk of false positives

### Conflict Resolution Strategy

```bash
# Strategy for resolving property conflicts (default: most_complete)
KG_FUSION_CONFLICT_RESOLUTION=most_complete
```

**Available Strategies:**

1. **most_complete**: Prefer non-empty, longer values (default)
   - Best for: General use, data enrichment
   
2. **most_recent**: Prefer values from most recent timestamp
   - Best for: Time-sensitive data, news articles
   
3. **most_confident**: Prefer values from most confident sources
   - Best for: Weighted sources, quality-ranked data
   
4. **longest**: Prefer longest string values
   - Best for: Descriptions, detailed text
   
5. **keep_all**: Keep all conflicting values as a list
   - Best for: Preserving all information, manual review

## Reranking Configuration

### Default Strategy

```bash
# Default reranking strategy (default: hybrid)
KG_RERANKING_DEFAULT_STRATEGY=hybrid
```

**Available Strategies:**

1. **text**: BM25-based text similarity
   - Fast, keyword-focused
   
2. **semantic**: Deep semantic similarity
   - Slower, meaning-focused
   
3. **structural**: Graph importance signals
   - Graph-aware, centrality-based
   
4. **hybrid**: Combines all signals (recommended)
   - Best results, slightly slower

### Top-K Configuration

```bash
# Number of results to fetch before reranking (default: 100)
KG_RERANKING_TOP_K=100
```

**Guidelines:**
- Higher values: Better recall, slower
- Lower values: Faster, may miss relevant results
- Recommended: 2-10x your final result count

## Cache Configuration

### Query Cache

```bash
# Enable query result caching (default: true)
KG_ENABLE_QUERY_CACHE=true

# Cache TTL in seconds (default: 300 = 5 minutes)
KG_CACHE_TTL_SECONDS=300
```

**Benefits:**
- Faster repeated queries
- Reduced database load
- Better performance

**When to disable:**
- Real-time data requirements
- Frequently changing graphs

### Schema Cache

```bash
# Enable schema caching (default: true)
KG_ENABLE_SCHEMA_CACHE=true

# Schema cache TTL in seconds (default: 3600 = 1 hour)
KG_SCHEMA_CACHE_TTL_SECONDS=3600
```

**Benefits:**
- Faster schema operations
- Reduced metadata queries
- Better type inference

**When to disable:**
- Frequently changing schemas
- Development/testing

## Query Optimization

### Enable Optimization

```bash
# Enable query optimization (default: true)
KG_ENABLE_QUERY_OPTIMIZATION=true
```

**Benefits:**
- Faster query execution
- Better resource utilization
- Automatic query planning

### Optimization Strategy

```bash
# Optimization strategy (default: balanced)
KG_QUERY_OPTIMIZATION_STRATEGY=balanced
```

**Available Strategies:**

1. **cost**: Minimize computational cost
   - Best for: Resource-constrained environments

2. **latency**: Minimize query latency
   - Best for: Real-time applications

3. **balanced**: Balance cost and latency (recommended)
   - Best for: General use

## Environment Variables

### Complete Reference

```bash
# =====================================
# Storage Configuration
# =====================================
KG_STORAGE_BACKEND=inmemory
KG_SQLITE_DB_PATH=./storage/knowledge_graph.db
KG_POSTGRES_HOST=localhost
KG_POSTGRES_PORT=5432
KG_POSTGRES_USER=postgres
KG_POSTGRES_PASSWORD=your_password
KG_POSTGRES_DATABASE=knowledge_graph
KG_MIN_POOL_SIZE=5
KG_MAX_POOL_SIZE=20
KG_ENABLE_PGVECTOR=false
KG_INMEMORY_MAX_NODES=100000

# =====================================
# Vector and Query Configuration
# =====================================
KG_VECTOR_DIMENSION=1536
KG_DEFAULT_SEARCH_LIMIT=10
KG_MAX_TRAVERSAL_DEPTH=5

# =====================================
# Cache Configuration
# =====================================
KG_ENABLE_QUERY_CACHE=true
KG_CACHE_TTL_SECONDS=300
KG_ENABLE_SCHEMA_CACHE=true
KG_SCHEMA_CACHE_TTL_SECONDS=3600

# =====================================
# Feature Flags
# =====================================
KG_ENABLE_RUNNABLE_PATTERN=true
KG_ENABLE_KNOWLEDGE_FUSION=true
KG_ENABLE_RERANKING=true
KG_ENABLE_LOGICAL_QUERIES=true
KG_ENABLE_STRUCTURED_IMPORT=true

# =====================================
# Knowledge Fusion Configuration
# =====================================
KG_FUSION_SIMILARITY_THRESHOLD=0.85
KG_FUSION_CONFLICT_RESOLUTION=most_complete

# =====================================
# Reranking Configuration
# =====================================
KG_RERANKING_DEFAULT_STRATEGY=hybrid
KG_RERANKING_TOP_K=100

# =====================================
# Query Optimization
# =====================================
KG_ENABLE_QUERY_OPTIMIZATION=true
KG_QUERY_OPTIMIZATION_STRATEGY=balanced
```

## Configuration Examples

### Development Setup

Fast iteration with in-memory storage:

```bash
# .env.development
KG_STORAGE_BACKEND=inmemory
KG_INMEMORY_MAX_NODES=50000
KG_ENABLE_QUERY_CACHE=false
KG_ENABLE_SCHEMA_CACHE=false
KG_ENABLE_QUERY_OPTIMIZATION=false
```

### Testing Setup

File-based persistence for reproducible tests:

```bash
# .env.test
KG_STORAGE_BACKEND=sqlite
KG_SQLITE_DB_PATH=./test_data/test_graph.db
KG_ENABLE_QUERY_CACHE=true
KG_CACHE_TTL_SECONDS=60
KG_ENABLE_QUERY_OPTIMIZATION=true
```

### Production Setup

PostgreSQL with all optimizations:

```bash
# .env.production
KG_STORAGE_BACKEND=postgresql
KG_POSTGRES_HOST=db.example.com
KG_POSTGRES_PORT=5432
KG_POSTGRES_USER=kg_user
KG_POSTGRES_PASSWORD=secure_password
KG_POSTGRES_DATABASE=knowledge_graph
KG_MIN_POOL_SIZE=10
KG_MAX_POOL_SIZE=50
KG_ENABLE_PGVECTOR=true

# Enable all features
KG_ENABLE_RUNNABLE_PATTERN=true
KG_ENABLE_KNOWLEDGE_FUSION=true
KG_ENABLE_RERANKING=true
KG_ENABLE_LOGICAL_QUERIES=true
KG_ENABLE_STRUCTURED_IMPORT=true

# Optimize for production
KG_ENABLE_QUERY_CACHE=true
KG_CACHE_TTL_SECONDS=600
KG_ENABLE_SCHEMA_CACHE=true
KG_SCHEMA_CACHE_TTL_SECONDS=7200
KG_ENABLE_QUERY_OPTIMIZATION=true
KG_QUERY_OPTIMIZATION_STRATEGY=balanced

# Reranking for best results
KG_RERANKING_DEFAULT_STRATEGY=hybrid
KG_RERANKING_TOP_K=200

# Fusion for data quality
KG_FUSION_SIMILARITY_THRESHOLD=0.85
KG_FUSION_CONFLICT_RESOLUTION=most_complete
```

### High-Performance Setup

Optimized for speed:

```bash
# .env.performance
KG_STORAGE_BACKEND=postgresql
KG_ENABLE_PGVECTOR=true
KG_MAX_POOL_SIZE=100

# Aggressive caching
KG_ENABLE_QUERY_CACHE=true
KG_CACHE_TTL_SECONDS=1800
KG_ENABLE_SCHEMA_CACHE=true
KG_SCHEMA_CACHE_TTL_SECONDS=14400

# Latency optimization
KG_ENABLE_QUERY_OPTIMIZATION=true
KG_QUERY_OPTIMIZATION_STRATEGY=latency

# Disable expensive features
KG_ENABLE_RERANKING=false
KG_ENABLE_KNOWLEDGE_FUSION=false
```

### Data Quality Setup

Optimized for accuracy:

```bash
# .env.quality
KG_STORAGE_BACKEND=postgresql

# Enable all quality features
KG_ENABLE_KNOWLEDGE_FUSION=true
KG_ENABLE_RERANKING=true
KG_ENABLE_LOGICAL_QUERIES=true

# Strict fusion
KG_FUSION_SIMILARITY_THRESHOLD=0.90
KG_FUSION_CONFLICT_RESOLUTION=most_confident

# Best reranking
KG_RERANKING_DEFAULT_STRATEGY=hybrid
KG_RERANKING_TOP_K=500

# Balanced optimization
KG_QUERY_OPTIMIZATION_STRATEGY=balanced
```

## Best Practices

### 1. Start Simple

Begin with default settings and adjust based on your needs:

```bash
# Minimal configuration
KG_STORAGE_BACKEND=inmemory
```

### 2. Monitor Performance

Track key metrics:
- Query latency
- Cache hit rate
- Fusion merge rate
- Reranking impact

### 3. Tune Gradually

Adjust one parameter at a time:
1. Choose storage backend
2. Enable/disable features
3. Tune cache settings
4. Optimize queries

### 4. Environment-Specific Configs

Use different configurations for different environments:
- `.env.development` - Fast iteration
- `.env.test` - Reproducible tests
- `.env.staging` - Production-like
- `.env.production` - Optimized for scale

### 5. Security Considerations

- Never commit `.env` files to version control
- Use strong passwords for PostgreSQL
- Restrict database access
- Enable SSL for production databases

## Troubleshooting

### Slow Queries

**Problem**: Queries are taking too long

**Solutions:**
1. Enable query optimization: `KG_ENABLE_QUERY_OPTIMIZATION=true`
2. Increase cache TTL: `KG_CACHE_TTL_SECONDS=600`
3. Use PostgreSQL with pgvector: `KG_ENABLE_PGVECTOR=true`
4. Reduce reranking top-K: `KG_RERANKING_TOP_K=50`

### High Memory Usage

**Problem**: Application using too much memory

**Solutions:**
1. Switch to SQLite or PostgreSQL: `KG_STORAGE_BACKEND=sqlite`
2. Reduce in-memory max nodes: `KG_INMEMORY_MAX_NODES=50000`
3. Disable caching: `KG_ENABLE_QUERY_CACHE=false`
4. Reduce cache TTL: `KG_CACHE_TTL_SECONDS=60`

### Too Many Duplicate Entities

**Problem**: Fusion is merging too many entities

**Solutions:**
1. Increase similarity threshold: `KG_FUSION_SIMILARITY_THRESHOLD=0.90`
2. Change conflict resolution: `KG_FUSION_CONFLICT_RESOLUTION=keep_all`
3. Review entity extraction quality

### Poor Search Results

**Problem**: Search results are not relevant

**Solutions:**
1. Enable reranking: `KG_ENABLE_RERANKING=true`
2. Use hybrid strategy: `KG_RERANKING_DEFAULT_STRATEGY=hybrid`
3. Increase reranking top-K: `KG_RERANKING_TOP_K=200`
4. Adjust vector dimension: `KG_VECTOR_DIMENSION=1536`

## See Also

- [Storage Backend Guide](../../developer/knowledge_graph/storage/SQLITE_BACKEND.md)
- [Performance Guide](./performance/PERFORMANCE_GUIDE.md)
- [Reranking Strategies Guide](./reasoning/reranking-strategies-guide.md)
- [Schema Caching Guide](./reasoning/schema-caching-guide.md)
- [Production Deployment](./deployment/PRODUCTION_DEPLOYMENT.md)