# Schema Mapping Configuration Guide

This guide explains how to configure schema mappings for importing structured data (CSV, JSON) into knowledge graphs.

## Table of Contents

1. [Overview](#overview)
2. [Basic Concepts](#basic-concepts)
3. [Entity Mapping](#entity-mapping)
4. [Relation Mapping](#relation-mapping)
5. [Property Transformations](#property-transformations)
6. [Complete Examples](#complete-examples)
7. [Best Practices](#best-practices)

## Overview

Schema mapping allows you to declaratively map structured data columns to knowledge graph entities and relations. This eliminates the need for custom code for each data source.

**Key Benefits:**

- **Declarative**: Define mappings in configuration, not code
- **Flexible**: Support complex transformations (rename, type cast, compute)
- **Reusable**: Same mapping works across multiple data sources
- **Type-safe**: Validation ensures data consistency

## Basic Concepts

### SchemaMapping

The `SchemaMapping` class is the container for all mappings:

```python
from aiecs.application.knowledge_graph.builder.schema_mapping import SchemaMapping

mapping = SchemaMapping(
    entity_mappings=[...],    # How to create entities
    relation_mappings=[...],  # How to create relations
    description="My data mapping"
)
```

### EntityMapping

Maps source columns to entity types:

```python
from aiecs.application.knowledge_graph.builder.schema_mapping import EntityMapping

entity_mapping = EntityMapping(
    source_columns=["id", "name", "age"],
    entity_type="Person",
    property_mapping={"id": "id", "name": "name", "age": "age"},
    id_column="id"
)
```

### RelationMapping

Maps source columns to relations between entities:

```python
from aiecs.application.knowledge_graph.builder.schema_mapping import RelationMapping

relation_mapping = RelationMapping(
    source_columns=["emp_id", "dept_id"],
    relation_type="WORKS_IN",
    source_entity_column="emp_id",
    target_entity_column="dept_id"
)
```

## Entity Mapping

### Simple Entity Mapping

Map columns directly to entity properties:

```python
EntityMapping(
    source_columns=["id", "name", "email"],
    entity_type="Person",
    property_mapping={
        "id": "id",
        "name": "name",
        "email": "email"
    },
    id_column="id"
)
```

### Entity Mapping with ID Column

Specify which column to use as entity ID:

```python
EntityMapping(
    source_columns=["employee_id", "full_name", "department"],
    entity_type="Employee",
    property_mapping={
        "employee_id": "id",
        "full_name": "name",
        "department": "dept"
    },
    id_column="employee_id"  # Use employee_id as entity ID
)
```

### Multiple Entity Types from Same Row

You can create multiple entities from a single row:

```python
mapping = SchemaMapping(
    entity_mappings=[
        # Create Employee entity
        EntityMapping(
            source_columns=["emp_id", "emp_name"],
            entity_type="Employee",
            property_mapping={"emp_id": "id", "emp_name": "name"},
            id_column="emp_id"
        ),
        # Create Department entity from same row
        EntityMapping(
            source_columns=["dept_id", "dept_name"],
            entity_type="Department",
            property_mapping={"dept_id": "id", "dept_name": "name"},
            id_column="dept_id"
        )
    ]
)
```

## Relation Mapping

### Basic Relation Mapping

Create relations between entities:

```python
RelationMapping(
    source_columns=["person_id", "company_id"],
    relation_type="WORKS_FOR",
    source_entity_column="person_id",
    target_entity_column="company_id"
)
```

### Relation with Properties

Add properties to relations:

```python
RelationMapping(
    source_columns=["person_id", "company_id", "role", "since"],
    relation_type="WORKS_FOR",
    source_entity_column="person_id",
    target_entity_column="company_id",
    property_mapping={
        "role": "position",
        "since": "start_date"
    }
)
```

## Property Transformations

Transformations allow you to modify values during import.

### Transformation Types

1. **RENAME**: Rename a column to a property
2. **TYPE_CAST**: Convert value to different type
3. **COMPUTE**: Compute value from multiple columns
4. **CONSTANT**: Use a constant value
5. **SKIP**: Skip this column

### RENAME Transformation

Simply rename a column:

```python
from aiecs.application.knowledge_graph.builder.schema_mapping import (
    PropertyTransformation,
    TransformationType
)

transformation = PropertyTransformation(
    transformation_type=TransformationType.RENAME,
    source_column="full_name",
    target_property="name"
)
```

### TYPE_CAST Transformation

Convert string to integer, float, boolean, etc.:

```python
from aiecs.domain.knowledge_graph.schema.property_schema import PropertyType

transformation = PropertyTransformation(
    transformation_type=TransformationType.TYPE_CAST,
    source_column="age_str",
    target_property="age",
    target_type=PropertyType.INTEGER
)
```

**Supported Types:**

- `PropertyType.STRING`
- `PropertyType.INTEGER`
- `PropertyType.FLOAT`
- `PropertyType.BOOLEAN`
- `PropertyType.LIST` (from JSON string or comma-separated)
- `PropertyType.DICT` (from JSON string)
- `PropertyType.ANY`

### COMPUTE Transformation

Compute values from multiple columns:

```python
# Concatenate first and last name
transformation = PropertyTransformation(
    transformation_type=TransformationType.COMPUTE,
    source_column="first_name",
    target_property="full_name",
    compute_function="concat_space",
    compute_args=["last_name"]
)

# Sum multiple columns
transformation = PropertyTransformation(
    transformation_type=TransformationType.COMPUTE,
    source_column="price1",
    target_property="total_price",
    compute_function="sum",
    compute_args=["price2", "price3"]
)
```

**Available Compute Functions:**

- `concat`: Concatenate strings
- `concat_space`: Concatenate with space separator
- `concat_comma`: Concatenate with comma separator
- `sum`: Sum numeric values
- `avg` / `average`: Average numeric values
- `max`: Maximum value
- `min`: Minimum value

### CONSTANT Transformation

Use a constant value:

```python
transformation = PropertyTransformation(
    transformation_type=TransformationType.CONSTANT,
    target_property="status",
    constant_value="active"
)
```

### SKIP Transformation

Skip a column (don't import it):

```python
transformation = PropertyTransformation(
    transformation_type=TransformationType.SKIP,
    target_property="internal_id"
)
```

## Complete Examples

### Example 1: Employee Data

**CSV Structure:**

```csv
emp_id,name,email,dept_id,dept_name,role,salary
E001,Alice Smith,alice@example.com,D001,Engineering,Engineer,100000
E002,Bob Jones,bob@example.com,D001,Engineering,Manager,120000
```

**Mapping:**

```python
from aiecs.application.knowledge_graph.builder.schema_mapping import (
    SchemaMapping,
    EntityMapping,
    RelationMapping,
    PropertyTransformation,
    TransformationType
)
from aiecs.domain.knowledge_graph.schema.property_schema import PropertyType

mapping = SchemaMapping(
    entity_mappings=[
        # Employee entity
        EntityMapping(
            source_columns=["emp_id", "name", "email", "salary"],
            entity_type="Employee",
            property_mapping={
                "emp_id": "id",
                "name": "name",
                "email": "email"
            },
            transformations=[
                PropertyTransformation(
                    transformation_type=TransformationType.TYPE_CAST,
                    source_column="salary",
                    target_property="salary",
                    target_type=PropertyType.INTEGER
                )
            ],
            id_column="emp_id"
        ),
        # Department entity
        EntityMapping(
            source_columns=["dept_id", "dept_name"],
            entity_type="Department",
            property_mapping={"dept_id": "id", "dept_name": "name"},
            id_column="dept_id"
        )
    ],
    relation_mappings=[
        RelationMapping(
            source_columns=["emp_id", "dept_id", "role"],
            relation_type="WORKS_IN",
            source_entity_column="emp_id",
            target_entity_column="dept_id",
            property_mapping={"role": "position"}
        )
    ]
)
```

### Example 2: Product Catalog

**JSON Structure:**

```json
[
  {
    "product_id": "P001",
    "product_name": "Laptop",
    "category": "Electronics",
    "price": "999.99",
    "in_stock": "true"
  }
]
```

**Mapping:**

```python
mapping = SchemaMapping(
    entity_mappings=[
        EntityMapping(
            source_columns=["product_id", "product_name", "category", "price", "in_stock"],
            entity_type="Product",
            property_mapping={"product_id": "id", "product_name": "name"},
            transformations=[
                PropertyTransformation(
                    transformation_type=TransformationType.TYPE_CAST,
                    source_column="price",
                    target_property="price",
                    target_type=PropertyType.FLOAT
                ),
                PropertyTransformation(
                    transformation_type=TransformationType.TYPE_CAST,
                    source_column="in_stock",
                    target_property="available",
                    target_type=PropertyType.BOOLEAN
                ),
                PropertyTransformation(
                    transformation_type=TransformationType.RENAME,
                    source_column="category",
                    target_property="category"
                )
            ],
            id_column="product_id"
        )
    ]
)
```

### Example 3: Complex Transformations

**CSV with computed fields:**

```csv
first_name,last_name,birth_year,score1,score2,score3
John,Doe,1990,85,90,88
Jane,Smith,1985,92,88,95
```

**Mapping with computed full name and average score:**

```python
mapping = SchemaMapping(
    entity_mappings=[
        EntityMapping(
            source_columns=["first_name", "last_name", "birth_year", "score1", "score2", "score3"],
            entity_type="Student",
            transformations=[
                # Compute full name
                PropertyTransformation(
                    transformation_type=TransformationType.COMPUTE,
                    source_column="first_name",
                    target_property="full_name",
                    compute_function="concat_space",
                    compute_args=["last_name"]
                ),
                # Compute average score
                PropertyTransformation(
                    transformation_type=TransformationType.COMPUTE,
                    source_column="score1",
                    target_property="avg_score",
                    compute_function="avg",
                    compute_args=["score2", "score3"]
                ),
                # Calculate age from birth year
                PropertyTransformation(
                    transformation_type=TransformationType.COMPUTE,
                    source_column="birth_year",
                    target_property="age",
                    compute_function="subtract",  # Would need to implement
                    compute_args=["2024"]  # Current year
                )
            ],
            id_column="first_name"  # Use first_name as ID (not recommended for production)
        )
    ]
)
```

## Best Practices

### 1. Always Specify ID Columns

```python
# ✅ Good
EntityMapping(
    source_columns=["id", "name"],
    entity_type="Person",
    id_column="id"  # Explicit ID column
)

# ❌ Avoid (uses first column as ID, less clear)
EntityMapping(
    source_columns=["id", "name"],
    entity_type="Person"
)
```

### 2. Use Type Casting for Numeric Data

```python
# ✅ Good - CSV reads as string, cast to integer
PropertyTransformation(
    transformation_type=TransformationType.TYPE_CAST,
    source_column="age_str",
    target_property="age",
    target_type=PropertyType.INTEGER
)

# ❌ Avoid - Leaves as string
property_mapping={"age_str": "age"}
```

### 3. Validate Mappings Before Use

```python
mapping = SchemaMapping(...)

# Validate before importing
errors = mapping.validate()
if errors:
    print(f"Mapping errors: {errors}")
    # Fix errors before proceeding
else:
    # Safe to use
    pipeline = StructuredDataPipeline(mapping=mapping, graph_store=store)
```

### 4. Handle Missing Columns Gracefully

The pipeline will skip missing columns, but you can add validation:

```python
# Check required columns exist
required_columns = set()
for entity_mapping in mapping.entity_mappings:
    required_columns.update(entity_mapping.source_columns)
for relation_mapping in mapping.relation_mappings:
    required_columns.update(relation_mapping.source_columns)

# Validate CSV has all required columns
csv_columns = set(df.columns)
missing = required_columns - csv_columns
if missing:
    raise ValueError(f"Missing required columns: {missing}")
```

### 5. Use Transformations for Data Cleaning

```python
# Clean phone numbers
PropertyTransformation(
    transformation_type=TransformationType.COMPUTE,
    source_column="phone_raw",
    target_property="phone",
    compute_function="clean_phone"  # Custom function
)

# Normalize text
PropertyTransformation(
    transformation_type=TransformationType.TYPE_CAST,
    source_column="name_raw",
    target_property="name",
    target_type=PropertyType.STRING
)
# Then apply lowercase normalization in post-processing
```

### 6. Document Your Mappings

```python
mapping = SchemaMapping(
    entity_mappings=[...],
    relation_mappings=[...],
    description="Employee and department mapping for HR system import"
)
```

## Common Patterns

### Pattern 1: One Entity Per Row

```python
# Simple 1:1 mapping
EntityMapping(
    source_columns=["id", "name"],
    entity_type="Person",
    property_mapping={"id": "id", "name": "name"},
    id_column="id"
)
```

### Pattern 2: Multiple Entities Per Row

```python
# Create both Employee and Department from same row
EntityMapping(
    source_columns=["emp_id", "emp_name", "dept_id", "dept_name"],
    entity_type="Employee",
    ...
),
EntityMapping(
    source_columns=["emp_id", "emp_name", "dept_id", "dept_name"],
    entity_type="Department",
    ...
)
```

### Pattern 3: Relations from Same Row

```python
# Create relation between entities created in same row
RelationMapping(
    source_columns=["emp_id", "dept_id"],
    relation_type="WORKS_IN",
    source_entity_column="emp_id",
    target_entity_column="dept_id"
)
```

### Pattern 4: Nested JSON

For nested JSON structures, flatten first or use multiple mappings:

```json
{
  "employee": {
    "id": "E001",
    "name": "Alice"
  },
  "department": {
    "id": "D001",
    "name": "Engineering"
  }
}
```

Flatten to:

```python
# Flatten in preprocessing or use JSON path extraction
EntityMapping(
    source_columns=["employee_id", "employee_name", "dept_id", "dept_name"],
    ...
)
```

## Troubleshooting

### Issue: Entities Not Created

**Check:**

1. Are source columns present in data?
2. Is `id_column` specified and present?
3. Are transformations failing silently? (Check warnings in ImportResult)

### Issue: Relations Not Created

**Check:**

1. Are source and target entity columns present?
2. Do the entity IDs exist in the graph?
3. Are entity mappings creating entities before relations?

### Issue: Type Casting Fails

**Check:**

1. Are values in correct format? (e.g., "123" not "abc" for INTEGER)
2. Use `skip_errors=False` to see detailed errors
3. Add data validation before import

### Issue: Computed Values Wrong

**Check:**

1. Are all source columns present?
2. Are values numeric for sum/avg/max/min?
3. Check compute function name spelling

## Next Steps

- See [StructuredDataPipeline Usage Examples](./STRUCTURED_DATA_PIPELINE.md) for how to use mappings
- See [CSV-to-Graph Tutorial](./examples/csv_to_graph_tutorial.md) for complete CSV example
- See [JSON-to-Graph Tutorial](./examples/json_to_graph_tutorial.md) for complete JSON example
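
## Appendix: Pre-Import Validation Sketch

Several troubleshooting checks in this guide ("Are source columns present in data?", "Are values in correct format?") can be run on the raw data before any mapping is applied. The following is a minimal plain-Python sketch of that idea using only the standard library. It is not part of the `aiecs` API: the helper names `find_missing_columns` and `find_uncastable_rows` and the sample CSV are hypothetical and exist only for illustration.

```python
import csv
import io


def find_missing_columns(header, required_columns):
    """Return the required column names absent from the header, sorted."""
    return sorted(set(required_columns) - set(header))


def find_uncastable_rows(rows, column, cast):
    """Return (row_number, value) pairs whose value in `column` fails `cast`.

    Row numbers are 1-based and count data rows only (the header is excluded).
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        value = row.get(column)
        try:
            cast(value)
        except (TypeError, ValueError):
            # TypeError covers a missing value (None); ValueError covers bad text.
            problems.append((row_number, value))
    return problems


# Hypothetical sample data: the "email" column is absent and one salary is not numeric.
SAMPLE_CSV = """emp_id,name,dept_id,salary
E001,Alice Smith,D001,100000
E002,Bob Jones,D001,n/a
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
header = reader.fieldnames

# Columns an (assumed) Employee mapping would list in source_columns.
missing = find_missing_columns(header, ["emp_id", "name", "email", "dept_id"])

# Values that an INTEGER type cast would be expected to reject.
bad_salaries = find_uncastable_rows(rows, "salary", int)

print("Missing columns:", missing)
print("Uncastable salary values:", bad_salaries)
```

Running checks like these first keeps data problems separate from mapping problems: if both lists come back empty, any remaining import failure is more likely to be in the mapping configuration itself.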