# Knowledge Graph Configuration Guide

This guide explains how to configure the knowledge graph capabilities in AIECS.

## Table of Contents

1. [Quick Start](#quick-start)
2. [Storage Backends](#storage-backends)
3. [Environment Variables](#environment-variables)
4. [Configuration Properties](#configuration-properties)
5. [Backend-Specific Configuration](#backend-specific-configuration)
6. [Query Configuration](#query-configuration)
7. [Cache Configuration](#cache-configuration)
8. [Validation](#validation)
9. [Examples](#examples)

## Quick Start

### Minimal Configuration (In-Memory)

No configuration needed! The default settings use in-memory storage:

```python
from aiecs.config import get_settings

settings = get_settings()
# Uses inmemory backend by default
```

### Development Configuration (SQLite)

Add to your `.env` file:

```bash
KG_STORAGE_BACKEND=sqlite
KG_SQLITE_DB_PATH=./storage/knowledge_graph.db
```

### Production Configuration (PostgreSQL)

Add to your `.env` file:

```bash
KG_STORAGE_BACKEND=postgresql

# Use main database (default)
# OR use a separate database:
KG_DB_HOST=localhost
KG_DB_PORT=5432
KG_DB_USER=kg_user
KG_DB_PASSWORD=your_password
KG_DB_NAME=aiecs_knowledge_graph
```

## Storage Backends

AIECS supports three storage backends for knowledge graphs:

### 1. In-Memory (Default)

- **Use Case**: Development, testing, small graphs
- **Pros**: Fast, no setup required
- **Cons**: Data lost on restart, limited by RAM
- **Max Nodes**: 100,000 (configurable)

```bash
KG_STORAGE_BACKEND=inmemory
```

### 2. SQLite

- **Use Case**: Development, embedded applications, file-based persistence
- **Pros**: Simple, portable, ACID transactions
- **Cons**: Single-writer, limited concurrency
- **Best For**: Single-user applications, up to ~1M nodes

```bash
KG_STORAGE_BACKEND=sqlite
KG_SQLITE_DB_PATH=./storage/knowledge_graph.db
```

### 3. PostgreSQL (Recommended for Production)

- **Use Case**: Production, multi-user, large-scale graphs
- **Pros**: Scalable, concurrent, ACID transactions, connection pooling
- **Cons**: Requires database setup
- **Best For**: Production applications, millions of nodes

```bash
KG_STORAGE_BACKEND=postgresql
```

## Environment Variables

### Core Configuration

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `KG_STORAGE_BACKEND` | string | `inmemory` | Storage backend: `inmemory`, `sqlite`, or `postgresql` |

### SQLite Configuration

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `KG_SQLITE_DB_PATH` | string | `./storage/knowledge_graph.db` | Path to SQLite database file |

### PostgreSQL Configuration

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `KG_POSTGRES_URL` | string | `""` | PostgreSQL connection string (DSN) |
| `KG_DB_HOST` | string | `""` | Database host (falls back to main `DB_HOST`) |
| `KG_DB_PORT` | int | `5432` | Database port |
| `KG_DB_USER` | string | `""` | Database user (falls back to main `DB_USER`) |
| `KG_DB_PASSWORD` | string | `""` | Database password (falls back to main `DB_PASSWORD`) |
| `KG_DB_NAME` | string | `""` | Database name (default: `aiecs_knowledge_graph`) |
| `KG_MIN_POOL_SIZE` | int | `5` | Minimum connection pool size |
| `KG_MAX_POOL_SIZE` | int | `20` | Maximum connection pool size |
| `KG_ENABLE_PGVECTOR` | bool | `false` | Enable pgvector extension for optimized vector search |

### In-Memory Configuration

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `KG_INMEMORY_MAX_NODES` | int | `100000` | Maximum number of nodes for in-memory storage |

### Query Configuration

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `KG_DEFAULT_SEARCH_LIMIT` | int | `10` | Default number of results to return in searches |
| `KG_MAX_TRAVERSAL_DEPTH` | int | `5` | Maximum depth for graph traversal queries (1-10) |
| `KG_VECTOR_DIMENSION` | int | `1536` | Dimension of embedding vectors (OpenAI ada-002 default) |

### Cache Configuration

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `KG_ENABLE_QUERY_CACHE` | bool | `true` | Enable caching of query results |
| `KG_CACHE_TTL_SECONDS` | int | `300` | Time-to-live for cached query results (seconds) |

## Configuration Properties

Access configuration programmatically:

```python
from aiecs.config import get_settings

settings = get_settings()

# Get database configuration for current backend
db_config = settings.kg_database_config
# Returns different config based on backend:
# - PostgreSQL: {"host": ..., "port": ..., "user": ..., etc.}
# - SQLite: {"db_path": ...}
# - In-memory: {"max_nodes": ...}

# Get query configuration
query_config = settings.kg_query_config
# Returns: {
#     "default_search_limit": 10,
#     "max_traversal_depth": 5,
#     "vector_dimension": 1536
# }

# Get cache configuration
cache_config = settings.kg_cache_config
# Returns: {
#     "enable_query_cache": True,
#     "cache_ttl_seconds": 300
# }
```

## Backend-Specific Configuration

### PostgreSQL: Using Main Database

By default, if you don't set KG-specific database parameters, the knowledge graph uses your main AIECS database:

```bash
# Main database config
DB_HOST=localhost
DB_PORT=5432
DB_USER=postgres
DB_PASSWORD=your_password
DB_NAME=aiecs

# Knowledge graph uses main DB
KG_STORAGE_BACKEND=postgresql
```

The knowledge graph creates its own tables (`graph_entities`, `graph_relations`) within the main database.
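
The fallback from `KG_DB_*` to the main `DB_*` variables can be pictured with a small, standard-library-only sketch. This is an illustration of the documented behavior for host, user, and password, not the AIECS implementation; `FALLBACKS` and `resolve_kg_connection` are hypothetical names, and treating an empty string as "not set" is an assumption based on the `""` defaults in the table.

```python
from typing import Dict, Mapping, Tuple

# (KG-specific variable, main-database variable) pairs that the PostgreSQL
# table documents as falling back. Hypothetical mapping for illustration.
FALLBACKS: Dict[str, Tuple[str, str]] = {
    "host": ("KG_DB_HOST", "DB_HOST"),
    "user": ("KG_DB_USER", "DB_USER"),
    "password": ("KG_DB_PASSWORD", "DB_PASSWORD"),
}


def resolve_kg_connection(env: Mapping[str, str]) -> Dict[str, str]:
    """Prefer the KG_* value; otherwise use the main DB_* value."""
    resolved: Dict[str, str] = {}
    for field, (kg_var, main_var) in FALLBACKS.items():
        # Assumption: an empty string is treated the same as an unset variable.
        resolved[field] = env.get(kg_var) or env.get(main_var, "")
    return resolved


# Only the user is overridden; host and password come from the main database.
example_env = {
    "DB_HOST": "localhost",
    "DB_USER": "postgres",
    "DB_PASSWORD": "your_password",
    "KG_DB_USER": "kg_user",
}
resolved = resolve_kg_connection(example_env)
```

Passing a plain mapping instead of reading `os.environ` directly keeps the sketch easy to test; in practice you would inspect `settings.kg_database_config` to see the values AIECS actually resolved.
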
### PostgreSQL: Separate Database

For better isolation, use a separate database:

```bash
# Main database config
DB_HOST=localhost
DB_USER=postgres
DB_PASSWORD=your_password
DB_NAME=aiecs

# Separate knowledge graph database
KG_STORAGE_BACKEND=postgresql
KG_DB_HOST=localhost
KG_DB_USER=kg_user
KG_DB_PASSWORD=kg_password
KG_DB_NAME=aiecs_knowledge_graph
```

### PostgreSQL: Cloud Database (Connection String)

For cloud databases (e.g., Google Cloud SQL, AWS RDS):

```bash
KG_STORAGE_BACKEND=postgresql
KG_POSTGRES_URL=postgresql://user:password@host:5432/dbname?sslmode=require
```

### PostgreSQL: Connection Pooling

Optimize for your workload:

```bash
# For high-concurrency applications
KG_MIN_POOL_SIZE=10
KG_MAX_POOL_SIZE=50

# For low-concurrency applications
KG_MIN_POOL_SIZE=2
KG_MAX_POOL_SIZE=10
```

### PostgreSQL: pgvector Extension

Enable optimized vector search (requires pgvector installed):

```bash
KG_ENABLE_PGVECTOR=true
```

**Prerequisites**:

1. Install pgvector extension in your PostgreSQL database
2. The extension will be automatically used for vector similarity search

### SQLite: Memory vs. File

```bash
# File-based persistence (recommended)
KG_SQLITE_DB_PATH=./storage/knowledge_graph.db

# In-memory SQLite (no persistence)
KG_SQLITE_DB_PATH=:memory:
```

## Query Configuration

### Search Limits

Control the number of results returned:

```bash
# Return more results (e.g., for comprehensive search)
KG_DEFAULT_SEARCH_LIMIT=50

# Return fewer results (e.g., for quick queries)
KG_DEFAULT_SEARCH_LIMIT=5
```

### Traversal Depth

Control how deep graph traversals can go:

```bash
# Shallow traversals (faster, less comprehensive)
KG_MAX_TRAVERSAL_DEPTH=3

# Deep traversals (slower, more comprehensive)
KG_MAX_TRAVERSAL_DEPTH=7
```

**Warning**: Values > 10 may cause performance issues.
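
To make the effect of a traversal depth cap concrete, here is a minimal sketch of a depth-limited breadth-first traversal over a plain adjacency dictionary. It is not the AIECS traversal implementation; `reachable_within` and the toy `chain` graph are hypothetical names used only for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def reachable_within(graph: Dict[str, List[str]], start: str, max_depth: int) -> Set[str]:
    """Return every node reachable from `start` in at most `max_depth` hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth cap reached: do not expand this node further
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


# Toy graph: a -> b -> c -> d
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
shallow = reachable_within(chain, "a", 2)  # stops before reaching "d"
deeper = reachable_within(chain, "a", 3)   # reaches the whole chain
```

On this four-node chain the difference is a single node, but in a densely connected graph each additional level can multiply the number of visited nodes, which is why deeper traversals are slower.
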
### Vector Dimensions

Match your embedding model:

```bash
# OpenAI ada-002 (default)
KG_VECTOR_DIMENSION=1536

# OpenAI text-embedding-3-small
KG_VECTOR_DIMENSION=1536

# OpenAI text-embedding-3-large
KG_VECTOR_DIMENSION=3072

# Sentence Transformers (various)
KG_VECTOR_DIMENSION=384  # all-MiniLM-L6-v2
KG_VECTOR_DIMENSION=768  # all-mpnet-base-v2
```

## Cache Configuration

### Enable/Disable Caching

```bash
# Enable caching (recommended for production)
KG_ENABLE_QUERY_CACHE=true

# Disable caching (for development/debugging)
KG_ENABLE_QUERY_CACHE=false
```

### Cache TTL

Control how long cached results remain valid:

```bash
# Short TTL (frequently changing data)
KG_CACHE_TTL_SECONDS=60

# Long TTL (stable data)
KG_CACHE_TTL_SECONDS=3600
```

## Validation

### Automatic Validation

Configuration is automatically validated when settings are loaded:

```python
from aiecs.config import get_settings

try:
    settings = get_settings()
except ValueError as e:
    print(f"Configuration error: {e}")
```

### Manual Validation

Validate configuration for specific operations:

```python
from aiecs.config import validate_required_settings

# Validate knowledge graph configuration
try:
    validate_required_settings("knowledge_graph")
    print("Knowledge graph configuration is valid")
except ValueError as e:
    print(f"Missing configuration: {e}")
```

### Validation Rules

1. **KG_STORAGE_BACKEND**: Must be `inmemory`, `sqlite`, or `postgresql`
2. **KG_SQLITE_DB_PATH**: Parent directory is automatically created
3. **KG_MAX_TRAVERSAL_DEPTH**: Must be ≥ 1; warning if > 10
4. **KG_VECTOR_DIMENSION**: Must be ≥ 1; warning if not a common dimension
5. **PostgreSQL**: At least one of KG_POSTGRES_URL, KG_DB_HOST, or main DB_PASSWORD must be set

## Examples

### Example 1: Development Setup (SQLite)

`.env`:

```bash
# SQLite for development
KG_STORAGE_BACKEND=sqlite
KG_SQLITE_DB_PATH=./dev_knowledge_graph.db

# Disable caching for development
KG_ENABLE_QUERY_CACHE=false

# More verbose search
KG_DEFAULT_SEARCH_LIMIT=20
```

### Example 2: Production Setup (PostgreSQL)

`.env`:

```bash
# PostgreSQL for production
KG_STORAGE_BACKEND=postgresql
KG_POSTGRES_URL=postgresql://kg_user:password@db.example.com:5432/aiecs_kg?sslmode=require

# Optimize connection pooling
KG_MIN_POOL_SIZE=10
KG_MAX_POOL_SIZE=50

# Enable pgvector
KG_ENABLE_PGVECTOR=true

# Production query settings
KG_DEFAULT_SEARCH_LIMIT=10
KG_MAX_TRAVERSAL_DEPTH=5

# Enable caching
KG_ENABLE_QUERY_CACHE=true
KG_CACHE_TTL_SECONDS=600
```

### Example 3: Testing Setup (In-Memory)

`.env.test`:

```bash
# In-memory for fast tests
KG_STORAGE_BACKEND=inmemory
KG_INMEMORY_MAX_NODES=10000

# Disable caching for predictable tests
KG_ENABLE_QUERY_CACHE=false
```

### Example 4: Programmatic Configuration

```python
import asyncio

from aiecs.config import get_settings
from aiecs.infrastructure.graph_storage import (
    InMemoryGraphStore,
    PostgresGraphStore,
    SQLiteGraphStore,
)


async def main() -> None:
    settings = get_settings()

    # Create store based on backend configuration
    if settings.kg_storage_backend == "inmemory":
        store = InMemoryGraphStore()
    elif settings.kg_storage_backend == "sqlite":
        config = settings.kg_database_config
        store = SQLiteGraphStore(db_path=config["db_path"])
    elif settings.kg_storage_backend == "postgresql":
        config = settings.kg_database_config
        store = PostgresGraphStore(**config)
    else:
        # Guard against an unexpected backend so `store` is never undefined
        raise ValueError(f"Unsupported backend: {settings.kg_storage_backend}")

    await store.initialize()
    try:
        # Use the store
        ...
    finally:
        # Always release the store, even if an operation above fails
        await store.close()


# `await` is only valid inside a coroutine, so run everything via asyncio
asyncio.run(main())
```

### Example 5: Multi-Environment Setup

Use different `.env` files for different environments:

**`.env.development`**:

```bash
KG_STORAGE_BACKEND=sqlite
KG_SQLITE_DB_PATH=./dev_kg.db
```

**`.env.staging`**:

```bash
KG_STORAGE_BACKEND=postgresql
KG_POSTGRES_URL=postgresql://user:pass@staging-db:5432/aiecs_kg
```

**`.env.production`**:

```bash
KG_STORAGE_BACKEND=postgresql
KG_POSTGRES_URL=postgresql://user:pass@prod-db:5432/aiecs_kg
KG_MIN_POOL_SIZE=20
KG_MAX_POOL_SIZE=100
KG_ENABLE_PGVECTOR=true
```

Load the appropriate file:

```bash
# Development
export ENV_FILE=.env.development
python -m aiecs

# Staging
export ENV_FILE=.env.staging
python -m aiecs

# Production
export ENV_FILE=.env.production
python -m aiecs
```

## Troubleshooting

### Issue: PostgreSQL connection fails

**Solution**: Check your connection parameters:

```python
from aiecs.config import get_settings

settings = get_settings()
print(settings.kg_database_config)
# Verify host, port, user, password, database are correct
```

### Issue: SQLite file not found

**Solution**: The parent directory is automatically created, but ensure the path is writable:

```bash
mkdir -p ./storage
chmod 755 ./storage
```

### Issue: Vector search returns no results

**Solution**: Check that the vector dimension matches your embeddings:

```bash
# If using OpenAI ada-002
KG_VECTOR_DIMENSION=1536

# If using a different model, adjust accordingly
```

### Issue: Queries are slow

**Solution**: Optimize configuration:

```bash
# Reduce traversal depth
KG_MAX_TRAVERSAL_DEPTH=3

# Enable caching
KG_ENABLE_QUERY_CACHE=true

# For PostgreSQL: enable pgvector
KG_ENABLE_PGVECTOR=true
```

## Best Practices

1. **Use PostgreSQL for production**: Scalable, concurrent, reliable
2. **Use SQLite for development**: Simple, portable, fast iteration
3. **Use in-memory for testing**: Fast, isolated, reproducible
4. **Enable caching in production**: Improves performance
5. **Match vector dimensions to your embedding model**: Prevents dimension mismatches
6. **Set reasonable traversal depth**: Balance comprehensiveness vs. performance
7. **Use a separate database for KG in production**: Better isolation and resource management
8. **Monitor connection pool usage**: Adjust min/max based on workload
9. **Enable pgvector for large-scale vector search**: Significantly faster than brute-force

## Migration

When changing backends, use the migration tools:

```python
import asyncio

from aiecs.infrastructure.graph_storage.migration import migrate_sqlite_to_postgres


async def main() -> None:
    # Migrate from SQLite to PostgreSQL
    await migrate_sqlite_to_postgres(
        sqlite_path="./dev_kg.db",
        postgres_config=None,  # Uses config from settings
        batch_size=1000,
        show_progress=True,
    )


asyncio.run(main())
```

See the [Knowledge Graph README](./README.md) for more details.

## See Also

- [API Reference](./API_REFERENCE.md)