Schema Caching Guide
Version: 1.0
Date: 2025-11-14
Phase: 3.5 - Documentation and Benchmarks
Overview
This guide explains how to use schema caching in the knowledge graph system to improve performance. Schema caching reduces redundant schema lookups and validation operations, significantly improving query performance.
What is Schema Caching?
Schema caching stores frequently accessed schema information (entity types, relation types, properties) in memory to avoid repeated lookups. The SchemaManager uses an LRU (Least Recently Used) cache to manage memory efficiently.
Benefits
Faster Queries: Reduce schema lookup time by 80-95%
Lower Latency: Cached lookups are ~100x faster than database queries
Reduced Load: Fewer database queries for schema information
Memory Efficient: LRU eviction keeps memory usage bounded
Enabling Schema Caching
Default Configuration
Schema caching is enabled by default with sensible defaults:
from aiecs.domain.knowledge_graph.schema.schema_manager import SchemaManager
# Create schema manager (caching enabled by default)
schema_manager = SchemaManager(schema)
# Caching is automatically enabled with:
# - Entity type cache: 100 entries
# - Relation type cache: 100 entries
# - Property cache: 500 entries
Custom Cache Configuration
Configure cache sizes based on your schema:
# Custom cache sizes
schema_manager = SchemaManager(
schema,
enable_cache=True,
entity_type_cache_size=200, # Larger entity type cache
relation_type_cache_size=150, # Larger relation type cache
property_cache_size=1000 # Larger property cache
)
Disabling Cache
Disable caching for testing or debugging:
# Disable all caching
schema_manager = SchemaManager(
schema,
enable_cache=False
)
Cache Sizing Guidelines
Small Schema (< 50 types)
schema_manager = SchemaManager(
schema,
entity_type_cache_size=50,
relation_type_cache_size=50,
property_cache_size=200
)
Medium Schema (50-200 types)
schema_manager = SchemaManager(
schema,
entity_type_cache_size=100, # Default
relation_type_cache_size=100, # Default
property_cache_size=500 # Default
)
Large Schema (200+ types)
schema_manager = SchemaManager(
schema,
entity_type_cache_size=300,
relation_type_cache_size=300,
property_cache_size=1500
)
Cache Operations
Warming the Cache
Pre-populate cache with frequently used types:
# Warm cache with common entity types
common_types = ["Person", "Paper", "Company", "Project"]
for entity_type in common_types:
schema_manager.get_entity_type(entity_type)
# Warm cache with common relation types
common_relations = ["WORKS_FOR", "AUTHORED_BY", "PUBLISHED_IN"]
for relation_type in common_relations:
schema_manager.get_relation_type(relation_type)
Clearing the Cache
Clear cache when schema changes:
# Clear all caches
schema_manager.clear_cache()
# Clear specific cache
schema_manager.clear_entity_type_cache()
schema_manager.clear_relation_type_cache()
schema_manager.clear_property_cache()
Cache Metrics
Monitor cache performance:
# Get cache metrics
metrics = schema_manager.get_cache_metrics()
print(f"Entity Type Cache:")
print(f" Hits: {metrics['entity_type_cache']['hits']}")
print(f" Misses: {metrics['entity_type_cache']['misses']}")
print(f" Hit Rate: {metrics['entity_type_cache']['hit_rate']:.2%}")
print(f" Size: {metrics['entity_type_cache']['size']}")
print(f"\nRelation Type Cache:")
print(f" Hits: {metrics['relation_type_cache']['hits']}")
print(f" Misses: {metrics['relation_type_cache']['misses']}")
print(f" Hit Rate: {metrics['relation_type_cache']['hit_rate']:.2%}")
print(f"\nProperty Cache:")
print(f" Hits: {metrics['property_cache']['hits']}")
print(f" Misses: {metrics['property_cache']['misses']}")
print(f" Hit Rate: {metrics['property_cache']['hit_rate']:.2%}")
Reset Metrics
Reset metrics for benchmarking:
# Reset all metrics
schema_manager.reset_cache_metrics()
Performance Impact
Benchmark Results
Without Caching:
Entity type lookup: ~1.2ms
Relation type lookup: ~1.5ms
Property lookup: ~0.8ms
With Caching (warm cache):
Entity type lookup: ~0.01ms (120x faster)
Relation type lookup: ~0.01ms (150x faster)
Property lookup: ~0.005ms (160x faster)
Overall Query Performance:
Simple queries: 15-20% faster
Complex queries: 30-40% faster
Validation-heavy queries: 50-60% faster
Best Practices
1. Enable Caching in Production
Always enable caching in production:
# Production configuration
schema_manager = SchemaManager(
schema,
enable_cache=True, # Always enable
entity_type_cache_size=100,
relation_type_cache_size=100,
property_cache_size=500
)
2. Size Caches Appropriately
Set cache sizes based on your schema:
# Count types in your schema
num_entity_types = len(schema.get_entity_type_names())
num_relation_types = len(schema.get_relation_type_names())
# Size caches to fit all types
schema_manager = SchemaManager(
schema,
entity_type_cache_size=num_entity_types,
relation_type_cache_size=num_relation_types,
property_cache_size=num_entity_types * 10 # ~10 properties per type
)
3. Warm Cache on Startup
Pre-populate cache with common types:
def warm_schema_cache(schema_manager):
"""Warm schema cache on application startup"""
# Load all entity types
for entity_type_name in schema_manager.schema.get_entity_type_names():
schema_manager.get_entity_type(entity_type_name)
# Load all relation types
for relation_type_name in schema_manager.schema.get_relation_type_names():
schema_manager.get_relation_type(relation_type_name)
# Call on startup
warm_schema_cache(schema_manager)
4. Monitor Cache Performance
Track cache hit rates:
def log_cache_metrics(schema_manager):
"""Log cache metrics for monitoring"""
metrics = schema_manager.get_cache_metrics()
for cache_name, cache_metrics in metrics.items():
hit_rate = cache_metrics['hit_rate']
if hit_rate < 0.8: # Less than 80% hit rate
logger.warning(
f"{cache_name} hit rate is low: {hit_rate:.2%}. "
f"Consider increasing cache size."
)
5. Clear Cache on Schema Updates
Clear cache when schema changes:
def update_schema(schema_manager, new_schema):
"""Update schema and clear cache"""
# Update schema
schema_manager.schema = new_schema
# Clear cache to avoid stale data
schema_manager.clear_cache()
# Optionally warm cache
warm_schema_cache(schema_manager)
Troubleshooting
Low Hit Rate
Problem: Cache hit rate < 80%
Solutions:
Increase cache size
Warm cache on startup
Check for schema changes during runtime
High Memory Usage
Problem: Cache using too much memory
Solutions:
Reduce cache sizes
Use LRU eviction (automatic)
Clear cache periodically
Stale Cache Data
Problem: Cache contains outdated schema information
Solutions:
Clear cache after schema updates
Implement cache TTL (time-to-live)
Use cache versioning
Conclusion
Schema caching is a simple but effective optimization that can significantly improve query performance. Enable it in production, size caches appropriately, and monitor performance to ensure optimal results.