# Knowledge Graph Troubleshooting Guide

## Common Issues and Solutions

### Import Issues

#### Problem: CSV Import Fails with "Missing Column" Error

**Symptoms:**
```
Error: Column 'name' not found in CSV file
```

**Solution:**
1. Check that column names in schema mapping match CSV headers exactly
2. Verify CSV file has a header row
3. Check for extra spaces in column names

**Example:**
```python
# ❌ Wrong - column name doesn't match
property_mappings={"name": "full_name"}  # CSV has "fullname" not "full_name"

# ✅ Correct
property_mappings={"name": "fullname"}
```

#### Problem: Import is Very Slow

**Symptoms:**
- Import takes >10 seconds for 1000 rows
- Throughput <50 rows/second

**Solutions:**
1. Increase batch size:
```python
pipeline = StructuredDataPipeline(
    mapping=schema_mapping,
    graph_store=store,
    batch_size=500  # Increase from default 50
)
```

2. Use PostgreSQL for large datasets:
```python
# Switch from InMemory to PostgreSQL
from aiecs.infrastructure.graph_storage.postgresql import PostgreSQLGraphStore
store = PostgreSQLGraphStore(connection_string="postgresql://...")
```

3. Enable skip_errors for faster processing:
```python
pipeline = StructuredDataPipeline(
    mapping=schema_mapping,
    graph_store=store,
    skip_errors=True  # Skip malformed rows
)
```

#### Problem: JSON Import Fails with "Invalid JSON"

**Symptoms:**
```
Error: Expecting value: line 1 column 1 (char 0)
```

**Solutions:**
1. Validate JSON format:
```bash
python -m json.tool data.json
```

2. Check for:
   - Missing commas between objects
   - Trailing commas
   - Single quotes instead of double quotes
   - Unescaped special characters

3. Use newline-delimited JSON for large files:
```json
{"id": "1", "name": "Alice"}
{"id": "2", "name": "Bob"}
```

### Search and Reranking Issues

#### Problem: Search Returns No Results

**Symptoms:**
- Query returns empty list
- Expected entities not found

**Solutions:**
1. Check entity properties match query:
```python
# Verify entities have searchable text
entity = await store.get_entity("e1")
print(entity.properties)  # Should have text fields
```

2. Try different search modes:
```python
# Try vector search
result = await tool.run(mode="vector", query="...")

# Try graph search
result = await tool.run(mode="graph", seed_entity_ids=["e1"])

# Try hybrid
result = await tool.run(mode="hybrid", query="...")
```

3. Check embeddings are present:
```python
entity = await store.get_entity("e1")
print(entity.embedding)  # Should not be None
```

#### Problem: Reranking is Too Slow

**Symptoms:**
- Search takes >1 second
- Latency >500ms

**Solutions:**
1. Use faster reranking strategy:
```python
# ❌ Slow - hybrid reranking
rerank_strategy="hybrid"  # 150-300ms

# ✅ Fast - text reranking
rerank_strategy="text"  # 50-100ms
```

2. Reduce top_k:
```python
# ❌ Slow - reranking 200 results
top_k=200

# ✅ Fast - reranking 20 results
top_k=20
```

3. Disable reranking for simple queries:
```python
result = await tool.run(
    query="...",
    enable_reranking=False  # Skip reranking
)
```

### Knowledge Fusion Issues

#### Problem: Too Many Entities Being Merged

**Symptoms:**
- Unrelated entities are merged
- Merge count is unexpectedly high

**Solutions:**
1. Increase similarity threshold:
```python
# ❌ Too lenient - merges too many
fusion = KnowledgeFusion(store, similarity_threshold=0.70)

# ✅ More strict - fewer merges
fusion = KnowledgeFusion(store, similarity_threshold=0.90)
```

2. Filter by entity type:
```python
# Only merge specific types
stats = await fusion.fuse_cross_document_entities(
    entity_types=["Person"]  # Don't merge other types
)
```

3. Review merge results:
```python
# Check what was merged
provenance = await fusion.track_entity_provenance("e1")
print(f"Entity came from: {provenance}")
```

#### Problem: Fusion is Too Slow

**Symptoms:**
- Fusion takes >30 seconds for 200 entities
- Throughput <10 entities/second

**Solutions:**
1. Increase similarity threshold (fewer comparisons):
```python
fusion = KnowledgeFusion(store, similarity_threshold=0.90)
```

2. Run fusion periodically, not on every update:
```python
# ❌ Slow - fusion after every import
await pipeline.import_from_csv("data.csv")
await fusion.fuse_cross_document_entities()

# ✅ Fast - fusion once at the end
await pipeline.import_from_csv("data1.csv")
await pipeline.import_from_csv("data2.csv")
await pipeline.import_from_csv("data3.csv")
await fusion.fuse_cross_document_entities()  # Once
```

3. Use faster conflict resolution:
```python
# ❌ Slower
conflict_resolution_strategy="most_confident"

# ✅ Faster
conflict_resolution_strategy="most_complete"
```

### Performance Issues

#### Problem: High Memory Usage

**Symptoms:**
- Application using >2GB RAM
- Out of memory errors

**Solutions:**
1. Switch to SQLite or PostgreSQL:
```python
# ❌ High memory - InMemory
from aiecs.infrastructure.graph_storage.in_memory import InMemoryGraphStore
store = InMemoryGraphStore()

# ✅ Low memory - SQLite
from aiecs.infrastructure.graph_storage.sqlite import SQLiteGraphStore
store = SQLiteGraphStore(db_path="graph.db")
```

2. Reduce cache sizes:
```python
# Reduce schema cache
schema_manager = SchemaManager(
    cache_size=100,  # Reduce from 1000
    ttl_seconds=300
)
```

3. Process data in batches:
```python
# Process large files in chunks
for chunk in pd.read_csv("large.csv", chunksize=1000):
    await pipeline.import_from_dataframe(chunk)
```

#### Problem: Slow Query Performance

**Symptoms:**
- Queries take >500ms
- Search is slow

**Solutions:**
1. Enable query optimization:
```python
# Enable in configuration
KG_ENABLE_QUERY_OPTIMIZATION=true
KG_QUERY_OPTIMIZATION_STRATEGY=balanced
```

2. Enable schema caching:
```python
KG_ENABLE_SCHEMA_CACHE=true
KG_SCHEMA_CACHE_TTL_SECONDS=3600
```

3. Use PostgreSQL with pgvector:
```python
KG_STORAGE_BACKEND=postgresql
KG_ENABLE_PGVECTOR=true
```

4. Add indexes (PostgreSQL):
```sql
CREATE INDEX idx_entity_type ON entities(entity_type);
CREATE INDEX idx_relation_type ON relations(relation_type);
```

### Configuration Issues

#### Problem: Configuration Not Loading

**Symptoms:**
- Settings not applied
- Using default values

**Solutions:**
1. Check .env file location:
```bash
# Should be in project root
ls -la .env
```

2. Verify environment variables:
```bash
# Check if variables are set
env | grep KG_
```

3. Use explicit configuration:
```python
from aiecs.config import Settings

settings = Settings(
    kg_storage_backend="postgresql",
    kg_enable_reranking=True
)
```

### Tool Issues

#### Problem: Tool Returns "Unsupported Operation" Error

**Symptoms:**
```
Error: Unsupported operation: kg_builder
```

**Solutions:**
1. Use correct operation name:
```python
# ❌ Wrong operation name
await tool.run(op="kg_builder", ...)

# ✅ Correct - use tool's registered operations
await tool.run(op="build_from_text", ...)
```

2. Check available operations:
```python
print(tool.input_schema())  # Shows available operations
```

## Getting Help

If you're still experiencing issues:

1. Check the [API Reference](./API_REFERENCE.md)
2. Review [Configuration Guide](./CONFIGURATION_GUIDE.md)
3. See [Performance Guide](./PERFORMANCE_GUIDE.md)
4. Open an issue on GitHub with:
   - Error message
   - Minimal reproduction code
   - Environment details (Python version, OS)
   - Configuration settings

## Performance Benchmarks

Expected performance for reference:

- **CSV Import**: 100-300 rows/second
- **JSON Import**: 100-250 records/second
- **Text Reranking**: 50-100ms
- **Hybrid Reranking**: 150-300ms
- **Schema Cache Hit**: <1ms
- **Query Optimization**: 40-70% improvement
- **Knowledge Fusion**: 10-40 entities/second

If your performance is significantly worse, review the solutions above.