Knowledge Graph Troubleshooting Guide

Common Issues and Solutions

Import Issues

Problem: CSV Import Fails with “Missing Column” Error

Symptoms:

Error: Column 'name' not found in CSV file

Solution:

Check that the column names in the schema mapping match the CSV headers exactly
Verify that the CSV file has a header row
Check for leading or trailing spaces in the header names

Example:

# ❌ Wrong - column name doesn't match
property_mappings={"name": "full_name"}  # CSV has "fullname" not "full_name"

# ✅ Correct
property_mappings={"name": "fullname"}
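
The three checks above can be scripted before an import is attempted. The following is a minimal sketch using only the standard library. `find_missing_columns` and `find_padded_headers` are hypothetical helpers, not part of the pipeline API, and the sample data is made up for illustration.

```python
import csv
import io


def find_missing_columns(csv_text, expected_columns):
    """Return the expected column names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty list if the file has no rows at all
    return [col for col in expected_columns if col not in header]


def find_padded_headers(csv_text):
    """Return header cells that carry leading or trailing whitespace."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [cell for cell in header if cell != cell.strip()]


# Illustrative data: the second header has a stray leading space
sample = "id, fullname,email\n1,Alice,alice@example.com\n"
print(find_missing_columns(sample, ["id", "fullname", "email"]))
print(find_padded_headers(sample))
```

A header that shows up in both lists is almost certainly a whitespace problem rather than a genuinely missing column.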

Problem: Import is Very Slow

Symptoms:

Import takes >10 seconds for 1000 rows
Throughput <50 rows/second

Solutions:

Increase batch size:

pipeline = StructuredDataPipeline(
    mapping=schema_mapping,
    graph_store=store,
    batch_size=500  # Increase from default 50
)

Use PostgreSQL for large datasets:

# Switch from InMemory to PostgreSQL
from aiecs.infrastructure.graph_storage.postgresql import PostgreSQLGraphStore
store = PostgreSQLGraphStore(connection_string="postgresql://...")

Enable skip_errors so malformed rows are skipped instead of interrupting the import (skipped rows are not imported):

pipeline = StructuredDataPipeline(
    mapping=schema_mapping,
    graph_store=store,
    skip_errors=True  # Skip malformed rows
)
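
To judge whether a change actually helped, it is useful to turn a measured import time into rows per second and compare it with the symptom thresholds. The sketch below is plain arithmetic. The function names are hypothetical, and the default baseline of 100 rows/second is taken from the lower end of the CSV import range listed under Performance Benchmarks.

```python
def rows_per_second(row_count, elapsed_seconds):
    """Return import throughput in rows per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return row_count / elapsed_seconds


def is_below_baseline(row_count, elapsed_seconds, baseline=100.0):
    """Return True if throughput falls under the baseline (rows per second)."""
    return rows_per_second(row_count, elapsed_seconds) < baseline


# 1000 rows in 10 seconds is exactly 100 rows/second
print(rows_per_second(1000, 10))
# 1000 rows in 25 seconds is 40 rows/second, below the baseline
print(is_below_baseline(1000, 25))
```

Measure the same file before and after changing batch_size or the storage backend so the numbers are comparable.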

Problem: JSON Import Fails with “Invalid JSON”

Symptoms:

Error: Expecting value: line 1 column 1 (char 0)

Solutions:

Validate JSON format:

python -m json.tool data.json

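Check for:
- An empty file, or content that does not start with a JSON value (this is what "Expecting value: line 1 column 1 (char 0)" indicates)
- Missing commas between objects
- Trailing commas
- Single quotes instead of double quotes
- Unescaped special characters

Use newline-delimited JSON for large files: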

{"id": "1", "name": "Alice"}
{"id": "2", "name": "Bob"}
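
With newline-delimited JSON, each line can be validated on its own, which makes it easy to pinpoint the offending record. A minimal sketch using Python's standard `json` module follows. `find_invalid_ndjson_lines` is a hypothetical helper, and skipping blank lines is an assumption that may not match how the importer treats them.

```python
import json


def find_invalid_ndjson_lines(text):
    """Return (line_number, error_message) for each line that is not valid JSON."""
    problems = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored rather than rejected
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_number, exc.msg))
    return problems


sample = "\n".join([
    '{"id": "1", "name": "Alice"}',
    "{'id': '2', 'name': 'Bob'}",      # single quotes instead of double quotes
    '{"id": "3", "name": "Carol",}',   # trailing comma
])

for line_number, message in find_invalid_ndjson_lines(sample):
    print(f"line {line_number}: {message}")
```

The exact error wording depends on the Python version, so rely on the line numbers rather than the message text.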

Search and Reranking Issues

Problem: Search Returns No Results

Symptoms:

Query returns empty list
Expected entities not found

Solutions:

Check that entity properties match the query:

# Verify entities have searchable text
entity = await store.get_entity("e1")
print(entity.properties)  # Should have text fields

Try different search modes:

# Try vector search
result = await tool.run(mode="vector", query="...")

# Try graph search
result = await tool.run(mode="graph", seed_entity_ids=["e1"])

# Try hybrid
result = await tool.run(mode="hybrid", query="...")

Check that embeddings are present:

entity = await store.get_entity("e1")
print(entity.embedding)  # Should not be None
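
Checking entities one at a time gets tedious, so the two checks above (text properties and embeddings) can be applied to a whole batch. The sketch below assumes each entity exposes `properties` and `embedding` as in the snippets above, plus an `id` attribute, which is an assumption. It uses stand-in objects so that it runs without a graph store.

```python
from types import SimpleNamespace


def find_unsearchable(entities):
    """Return (ids missing an embedding, ids with no non-empty text property)."""
    missing_embedding = [e.id for e in entities if e.embedding is None]
    no_text = [
        e.id
        for e in entities
        if not any(isinstance(v, str) and v.strip() for v in e.properties.values())
    ]
    return missing_embedding, no_text


# Stand-in objects; real entities would come from store.get_entity(...)
entities = [
    SimpleNamespace(id="e1", properties={"name": "Alice"}, embedding=[0.1, 0.2]),
    SimpleNamespace(id="e2", properties={"name": "Bob"}, embedding=None),
    SimpleNamespace(id="e3", properties={"age": 42}, embedding=[0.3, 0.4]),
]

missing_embedding, no_text = find_unsearchable(entities)
print("No embedding:", missing_embedding)
print("No text properties:", no_text)
```

Entities without an embedding cannot be found by vector search, and entities without text properties give text-based matching nothing to work with.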

Problem: Reranking is Too Slow

Symptoms:

Search takes >1 second
Latency >500ms

Solutions:

Use a faster reranking strategy:

# ❌ Slow - hybrid reranking
rerank_strategy="hybrid"  # 150-300ms

# ✅ Fast - text reranking
rerank_strategy="text"  # 50-100ms

Reduce top_k:

# ❌ Slow - reranking 200 results
top_k=200

# ✅ Fast - reranking 20 results
top_k=20

Disable reranking for simple queries:

result = await tool.run(
    query="...",
    enable_reranking=False  # Skip reranking
)
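
Before and after applying any of these changes, measure the latency of the call itself so the effect is visible. The following is a small sketch of an async timing wrapper. `timed` is a hypothetical helper, and `fake_search` is a stand-in for a real call such as `tool.run(...)`.

```python
import asyncio
import time


async def timed(awaitable):
    """Await the given awaitable and return (result, elapsed_milliseconds)."""
    start = time.perf_counter()
    result = await awaitable
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


async def fake_search(delay_seconds):
    """Stand-in for a search call; sleeps to simulate work."""
    await asyncio.sleep(delay_seconds)
    return ["e1", "e2"]


async def main():
    result, elapsed_ms = await timed(fake_search(0.05))
    print(f"{len(result)} results in {elapsed_ms:.0f} ms")


if __name__ == "__main__":
    asyncio.run(main())
```

Timing the same query with reranking enabled and disabled shows how much of the total latency the reranking step accounts for.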

Knowledge Fusion Issues

Problem: Too Many Entities Being Merged

Symptoms:

Unrelated entities are merged
Merge count is unexpectedly high

Solutions:

Increase similarity threshold:

# ❌ Too lenient - merges too many
fusion = KnowledgeFusion(store, similarity_threshold=0.70)

# ✅ More strict - fewer merges
fusion = KnowledgeFusion(store, similarity_threshold=0.90)

Filter by entity type:

# Only merge specific types
stats = await fusion.fuse_cross_document_entities(
    entity_types=["Person"]  # Don't merge other types
)

Review merge results:

# Check what was merged
provenance = await fusion.track_entity_provenance("e1")
print(f"Entity came from: {provenance}")
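
To get a feel for how strongly the threshold affects the number of merges, the effect can be reproduced on a toy example. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. This guide does not describe the metric KnowledgeFusion actually uses, so the scores are illustrative only, and `merge_candidates` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def merge_candidates(names, threshold):
    """Return (a, b, score) for name pairs whose similarity is at or above threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return pairs


names = ["Acme Corp", "Acme Corp.", "Acme Core"]

# Lenient threshold: every pair qualifies, including the distinct "Acme Core"
print(len(merge_candidates(names, 0.70)))
# Strict threshold: only the near-identical pair qualifies
print(merge_candidates(names, 0.90))
```

Running the same comparison on a sample of real entity names is a cheap way to pick a threshold before running fusion on the full graph.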

Problem: Fusion is Too Slow

Symptoms:

Fusion takes >30 seconds for 200 entities
Throughput <10 entities/second

Solutions:

Increase similarity threshold (fewer merges to perform):

fusion = KnowledgeFusion(store, similarity_threshold=0.90)

Run fusion periodically, not on every update:

# ❌ Slow - fusion after every import
await pipeline.import_from_csv("data.csv")
await fusion.fuse_cross_document_entities()

# ✅ Fast - fusion once at the end
await pipeline.import_from_csv("data1.csv")
await pipeline.import_from_csv("data2.csv")
await pipeline.import_from_csv("data3.csv")
await fusion.fuse_cross_document_entities()  # Once

Use faster conflict resolution:

# ❌ Slower
conflict_resolution_strategy="most_confident"

# ✅ Faster
conflict_resolution_strategy="most_complete"

Performance Issues

Problem: High Memory Usage

Symptoms:

Application uses >2GB RAM
Out-of-memory errors

Solutions:

Switch to SQLite or PostgreSQL:

# ❌ High memory - InMemory
from aiecs.infrastructure.graph_storage.in_memory import InMemoryGraphStore
store = InMemoryGraphStore()

# ✅ Low memory - SQLite
from aiecs.infrastructure.graph_storage.sqlite import SQLiteGraphStore
store = SQLiteGraphStore(db_path="graph.db")

Reduce cache sizes:

# Reduce schema cache
schema_manager = SchemaManager(
    cache_size=100,  # Reduce from 1000
    ttl_seconds=300
)

Process data in batches:

# Process large files in chunks
for chunk in pd.read_csv("large.csv", chunksize=1000):
    await pipeline.import_from_dataframe(chunk)
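
If pandas is not available, the same chunking idea can be sketched with the standard library alone. `iter_row_chunks` below is a hypothetical helper. It reads rows lazily, so only one chunk is held in memory at a time. How each chunk is then handed to the pipeline is left out, since that depends on the import methods available.

```python
import csv
import io
from itertools import islice


def iter_row_chunks(file_obj, chunk_size):
    """Yield lists of up to chunk_size rows (as dicts) from a CSV file object."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# In-memory sample; with a real file, pass open("large.csv", newline="")
sample = io.StringIO("id,name\n1,Alice\n2,Bob\n3,Carol\n")
for chunk in iter_row_chunks(sample, chunk_size=2):
    print(len(chunk), [row["name"] for row in chunk])
```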

Problem: Slow Query Performance

Symptoms:

Queries take >500ms
Search is slow

Solutions:

Enable query optimization:

# Enable in configuration
KG_ENABLE_QUERY_OPTIMIZATION=true
KG_QUERY_OPTIMIZATION_STRATEGY=balanced

Enable schema caching:

KG_ENABLE_SCHEMA_CACHE=true
KG_SCHEMA_CACHE_TTL_SECONDS=3600

Use PostgreSQL with pgvector:

KG_STORAGE_BACKEND=postgresql
KG_ENABLE_PGVECTOR=true

Add indexes (PostgreSQL):

CREATE INDEX idx_entity_type ON entities(entity_type);
CREATE INDEX idx_relation_type ON relations(relation_type);

Configuration Issues

Problem: Configuration Not Loading

Symptoms:

Settings not applied
Using default values

Solutions:

Check .env file location:

# Should be in project root
ls -la .env

Verify environment variables:

# Check if variables are set
env | grep KG_
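
The same check can be done from inside the Python process, which is where the settings are actually read. The sketch below compares the keys in a .env file with the process environment. All three functions are hypothetical helpers, the parser handles only plain KEY=VALUE lines (no quoting or `export`), and a key reported as missing is not necessarily a problem if the application reads the .env file directly rather than through the environment.

```python
import os


def kg_settings(environ=None, prefix="KG_"):
    """Return the environment variables whose names start with prefix."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k.startswith(prefix)}


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_from_environment(env_file_text, environ=None):
    """Return keys defined in the .env text but absent from the environment."""
    environ = os.environ if environ is None else environ
    return sorted(k for k in parse_env_file(env_file_text) if k not in environ)


env_text = "# storage\nKG_STORAGE_BACKEND=postgresql\nKG_ENABLE_PGVECTOR=true\n"
fake_environ = {"KG_STORAGE_BACKEND": "postgresql", "PATH": "/usr/bin"}

print(kg_settings(fake_environ))
print(missing_from_environment(env_text, fake_environ))
```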

Use explicit configuration:

from aiecs.config import Settings

settings = Settings(
    kg_storage_backend="postgresql",
    kg_enable_reranking=True
)

Tool Issues

Problem: Tool Returns “Unsupported Operation” Error

Symptoms:

Error: Unsupported operation: kg_builder

Solutions:

Use the correct operation name:

# ❌ Wrong operation name
await tool.run(op="kg_builder", ...)

# ✅ Correct - use tool's registered operations
await tool.run(op="build_from_text", ...)

Check available operations:

print(tool.input_schema())  # Shows available operations
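
Once the list of operations is known, a misspelled name can be matched against it automatically. The sketch below uses `difflib.get_close_matches` from the standard library. `suggest_operation` is a hypothetical helper, and apart from `build_from_text` the operation names are made up. Take the real ones from the schema output above.

```python
from difflib import get_close_matches


def suggest_operation(requested, available, cutoff=0.6):
    """Return up to three available operation names closest to the requested one."""
    return get_close_matches(requested, available, n=3, cutoff=cutoff)


# Hypothetical operation list for illustration only
available_ops = ["build_from_text", "build_from_document", "get_stats"]

# A near miss is matched to the closest registered name first
print(suggest_operation("build_from_txt", available_ops))
# A name with nothing in common yields no suggestions
print(suggest_operation("zzz", available_ops))
```

This only catches near misses. A name such as kg_builder, which refers to the tool rather than one of its operations, still has to be corrected by reading the schema.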

Getting Help

If you’re still experiencing issues:

Check the API Reference
Review the Configuration Guide
See the Performance Guide
Open an issue on GitHub with:
- Error message
- Minimal reproduction code
- Environment details (Python version, OS)
- Configuration settings

Performance Benchmarks

Expected performance for reference:

CSV Import: 100-300 rows/second
JSON Import: 100-250 records/second
Text Reranking: 50-100ms
Hybrid Reranking: 150-300ms
Schema Cache Hit: <1ms
Query Optimization: 40-70% improvement
Knowledge Fusion: 10-40 entities/second

If your performance is significantly worse, review the solutions above.