# Data Transformer Tool Configuration Guide

## Overview

The Data Transformer Tool provides data cleaning and preprocessing, feature engineering and encoding, normalization and standardization, transformation pipelines, and missing value handling. The tool integrates with pandas_tool for core operations and supports several transformation types (cleaning operations, transformation operations, encoding operations, feature engineering) and missing value strategies (drop, mean, median, mode, forward_fill, backward_fill, interpolate, constant).

The tool can be configured via environment variables using the `DATA_TRANSFORMER_` prefix or through programmatic configuration when initializing the tool.

## Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The Data Transformer Tool reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools.

### Setting Up .env Files

**1. Install python-dotenv:**

```bash
pip install python-dotenv
```

**2. Create a `.env` file in your project root:**

```bash
# .env file in your project root
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10
```

**3. Load the .env file in your application:**

```python
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool

# The tool will automatically use the environment variables
data_transformer = DataTransformerTool()
```

### Multiple Environment Files

You can use different `.env` files for different environments:

```python
import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool
data_transformer = DataTransformerTool()
```

**Example `.env.production`:**

```bash
# Production settings - optimized for robust transformations
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20
```

**Example `.env.development`:**

```bash
# Development settings - optimized for testing and debugging
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5
```

### Best Practices for .env Files

1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`:

   ```gitignore
   # .gitignore
   .env
   .env.local
   .env.*.local
   .env.production
   .env.staging
   ```

2. **Provide a template** - Create `.env.example` with documented dummy values:

   ```bash
   # .env.example
   # Data Transformer Tool Configuration

   # Standard deviation threshold for outlier detection
   DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0

   # Default strategy for handling missing values
   DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean

   # Whether to enable transformation pipeline caching
   DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true

   # Maximum number of categories for one-hot encoding
   DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10
   ```

3. **Document your variables** - Add comments explaining each setting
4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports
5. **Format values correctly**:
   - Strings: Plain text: `mean`, `median`, `mode`
   - Floats: Decimal numbers: `3.0`, `2.5`
   - Integers: Plain numbers: `10`, `20`
   - Booleans: `true` or `false`

## Configuration Options

### 1. Outlier STD Threshold

**Environment Variable:** `DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD`

**Type:** Float

**Default:** `3.0`

**Description:** Standard deviation threshold for outlier detection. During data cleaning, values whose Z-score magnitude exceeds this threshold are treated as outliers.

**Common Values:**
- `2.0` - Strict outlier detection (more outliers detected)
- `2.5` - Moderate outlier detection
- `3.0` - Standard outlier detection (default)
- `3.5` - Lenient outlier detection (fewer outliers detected)

**Example:**

```bash
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
```

**Threshold Note:** Lower values detect more outliers; higher values are more lenient.

### 2. Default Missing Strategy

**Environment Variable:** `DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY`

**Type:** String

**Default:** `"mean"`

**Description:** Default strategy for handling missing values when no specific strategy is provided. This determines how missing values are imputed or handled.

**Supported Strategies:**
- `drop` - Drop rows/columns with missing values
- `mean` - Fill with mean value (default)
- `median` - Fill with median value
- `mode` - Fill with most frequent value
- `forward_fill` - Forward fill missing values
- `backward_fill` - Backward fill missing values
- `interpolate` - Interpolate missing values
- `constant` - Fill with constant value

**Example:**

```bash
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
```

**Strategy Note:** Choose based on your data characteristics and domain knowledge.

### 3. Enable Pipeline Caching

**Environment Variable:** `DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING`

**Type:** Boolean

**Default:** `True`

**Description:** Whether to enable transformation pipeline caching. Caching improves performance for repeated transformations but uses additional memory.

**Values:**
- `true` - Enable pipeline caching (default)
- `false` - Disable pipeline caching

**Example:**

```bash
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
```

**Caching Note:** Enable for better performance, disable to save memory.

### 4. Max One Hot Categories

**Environment Variable:** `DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES`

**Type:** Integer

**Default:** `10`

**Description:** Maximum number of categories for one-hot encoding. Columns with more unique values will use alternative encoding methods to prevent excessive dimensionality.

**Common Values:**
- `5` - Conservative encoding (few categories)
- `10` - Standard encoding (default)
- `20` - Liberal encoding (many categories)
- `50` - Very liberal encoding (maximum categories)

**Example:**

```bash
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20
```

**Categories Note:** Higher values allow more categories but increase dimensionality.

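
To make the outlier threshold and the one-hot category limit more concrete, here is a minimal standard-library sketch of the two rules as this guide describes them. It is not the tool's implementation: the helper names `zscore_outlier_indices` and `fits_one_hot` are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def zscore_outlier_indices(values, threshold=3.0):
    """Return positions whose absolute Z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return []  # constant column: no value can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def fits_one_hot(values, max_categories=10):
    """True when the number of distinct values is within the one-hot limit."""
    return len(set(values)) <= max_categories


readings = [10.0] * 19 + [100.0]  # the last reading has a Z-score of about 4.36
print(zscore_outlier_indices(readings, threshold=3.0))  # [19]
print(zscore_outlier_indices(readings, threshold=5.0))  # []
print(fits_one_hot(["red", "green", "blue"], max_categories=2))  # False
```

Raising the threshold from 3.0 to 5.0 stops flagging the extreme reading, which matches the threshold note above: higher values are more lenient.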
## Usage Examples

### Example 1: Basic Environment Configuration

```bash
# Set basic data transformation parameters
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10

# Run your application
python app.py
```

### Example 2: Robust Production Configuration

```bash
# Optimized for robust data transformations
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20
```

### Example 3: Development Configuration

```bash
# Development-friendly settings
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5
```

### Example 4: Programmatic Configuration

```python
from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool

# Initialize with custom configuration
data_transformer = DataTransformerTool(config={
    'outlier_std_threshold': 3.0,
    'default_missing_strategy': 'mean',
    'enable_pipeline_caching': True,
    'max_one_hot_categories': 10
})
```

### Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

```bash
# Set environment defaults
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
```

```python
from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool

# Override for specific instance
data_transformer = DataTransformerTool(config={
    'default_missing_strategy': 'median',  # This overrides the environment variable
    'enable_pipeline_caching': False       # This overrides the environment variable
})
```

## Configuration Priority

When the Data Transformer Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

1. **Programmatic config** - Values passed to the constructor
2. **Environment variables** - Values set via `DATA_TRANSFORMER_*` variables
3. **Default values** - Built-in defaults as specified above

## Data Type Parsing

### String Values

Strings should be provided as plain text without quotes:

```bash
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mode
```

### Float Values

Floats should be provided as decimal numbers:

```bash
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
```

### Integer Values

Integers should be provided as whole numbers without a decimal point:

```bash
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20
```

### Boolean Values

Booleans should be provided as lowercase `true` or `false`:

```bash
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
```

## Validation

### Automatic Type Validation

Pydantic automatically validates configuration values:

- `outlier_std_threshold` must be a positive float
- `default_missing_strategy` must be a valid strategy string
- `enable_pipeline_caching` must be a boolean
- `max_one_hot_categories` must be a positive integer

### Runtime Validation

When transforming data, the tool validates:

1. **Outlier threshold** - Threshold must be reasonable for the data distribution
2. **Missing strategy** - Strategy must be appropriate for the data type
3. **Category limits** - Category limits must be reasonable for encoding
4. **Data compatibility** - Data must be compatible with transformation operations
5. **Memory requirements** - Transformations must not exceed memory limits

## Transformation Types

The Data Transformer Tool supports various transformation types:

### Cleaning Operations

- **Remove Duplicates** - Remove duplicate rows or columns
- **Fill Missing** - Fill missing values using various strategies
- **Remove Outliers** - Remove statistical outliers

### Transformation Operations

- **Normalize** - Min-max normalization
- **Standardize** - Z-score standardization
- **Log Transform** - Logarithmic transformation
- **Box-Cox** - Box-Cox power transformation

### Encoding Operations

- **One-Hot Encode** - One-hot encoding for categorical variables
- **Label Encode** - Label encoding for ordinal variables
- **Target Encode** - Target encoding for high-cardinality variables

### Feature Engineering

- **Polynomial Features** - Generate polynomial features
- **Interaction Features** - Create feature interactions
- **Binning** - Create bins for continuous variables
- **Aggregation** - Aggregate features by groups

## Missing Value Strategies

### Statistical Strategies

- **Mean** - Fill with mean value (default)
- **Median** - Fill with median value
- **Mode** - Fill with most frequent value

### Interpolation Strategies

- **Forward Fill** - Forward fill missing values
- **Backward Fill** - Backward fill missing values
- **Interpolate** - Linear interpolation

### Other Strategies

- **Drop** - Drop rows/columns with missing values
- **Constant** - Fill with constant value

## Operations Supported

The Data Transformer Tool supports the following data transformation operations:

### Basic Transformations

- `transform_data` - Apply comprehensive data transformations
- `clean_data` - Clean and preprocess data
- `handle_missing_values` - Handle missing values
- `remove_outliers` - Remove statistical outliers
- `remove_duplicates` - Remove duplicate records

### Feature Engineering

- `engineer_features` - Engineer new features
- `create_polynomial_features` - Create polynomial features
- `create_interaction_features` - Create feature interactions
- `create_bins` - Create bins for continuous variables
- `aggregate_features` - Aggregate features by groups

### Encoding Operations

- `encode_categorical` - Encode categorical variables
- `one_hot_encode` - One-hot encode categorical variables
- `label_encode` - Label encode ordinal variables
- `target_encode` - Target encode high-cardinality variables
- `handle_high_cardinality` - Handle high-cardinality categorical variables

### Normalization and Scaling

- `normalize_data` - Normalize data to [0,1] range
- `standardize_data` - Standardize data to mean=0, std=1
- `robust_scale` - Robust scaling using median and IQR
- `min_max_scale` - Min-max scaling
- `z_score_scale` - Z-score standardization

### Pipeline Operations

- `create_pipeline` - Create transformation pipeline
- `fit_pipeline` - Fit transformation pipeline
- `transform_pipeline` - Apply transformation pipeline
- `inverse_transform` - Inverse transform data
- `save_pipeline` - Save transformation pipeline
- `load_pipeline` - Load transformation pipeline

### Advanced Operations

- `log_transform` - Apply logarithmic transformation
- `box_cox_transform` - Apply Box-Cox transformation
- `power_transform` - Apply power transformation
- `quantile_transform` - Apply quantile transformation
- `robust_transform` - Apply robust transformation

## Troubleshooting

### Issue: Outlier removal too aggressive

**Error:** Too many data points removed as outliers

**Solutions:**

```bash
# Increase outlier threshold
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.5
```

```python
# Or use domain-specific outlier detection
data_transformer.remove_outliers(data, method='domain_specific')
```

### Issue: Missing value strategy fails

**Error:** Missing value imputation errors

**Solutions:**

```bash
# Change missing strategy
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
```

```python
# Or specify strategy per column
data_transformer.handle_missing_values(data, strategies={'col1': 'mean', 'col2': 'median'})
```

### Issue: One-hot encoding creates too many columns

**Error:** Excessive dimensionality from one-hot encoding

**Solutions:**

```bash
# Reduce max categories
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5
```

```python
# Or use alternative encoding
data_transformer.encode_categorical(data, method='target_encoding')
```

### Issue: Pipeline caching issues

**Error:** Pipeline caching problems

**Solutions:**

```bash
# Disable caching
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
```

```python
# Or clear cache
data_transformer.clear_pipeline_cache()
```

### Issue: Transformation performance issues

**Error:** Slow transformation operations

**Solutions:**

1. Enable pipeline caching
2. Use appropriate data types
3. Process data in chunks
4. Optimize transformation order

### Issue: Memory usage exceeded

**Error:** Out of memory during transformations

**Solutions:**

```bash
# Disable caching
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false

# Reduce max categories
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5
```

```python
# Process data in chunks
data_transformer.transform_data(data, chunk_size=10000)
```

### Issue: Data type compatibility

**Error:** Data type incompatibility with transformations

**Solutions:**

1. Check data types before transformation
2. Convert data types appropriately
3. Use compatible transformation methods
4. Validate data structure

## Best Practices

### Performance Optimization

1. **Pipeline Caching** - Enable caching for repeated transformations
2. **Category Management** - Set appropriate category limits
3. **Chunk Processing** - Process large datasets in chunks
4. **Memory Management** - Monitor memory usage during transformations
5. **Transformation Order** - Optimize transformation sequence

### Error Handling

1. **Graceful Degradation** - Handle transformation failures gracefully
2. **Validation** - Validate data before transformation
3. **Fallback Strategies** - Provide fallback transformation methods
4. **Error Logging** - Log errors for debugging and monitoring
5. **User Feedback** - Provide clear error messages

### Security

1. **Data Privacy** - Ensure data privacy during transformations
2. **Access Control** - Control access to transformation results
3. **Audit Logging** - Log transformation activities
4. **Data Sanitization** - Sanitize sensitive data
5. **Compliance** - Ensure compliance with data regulations

### Resource Management

1. **Memory Monitoring** - Monitor memory usage during transformations
2. **Processing Time** - Set reasonable timeouts
3. **Storage Optimization** - Optimize result storage
4. **Cleanup** - Clean up temporary files
5. **Resource Limits** - Set appropriate resource limits

### Integration

1. **Tool Dependencies** - Ensure required tools are available
2. **API Compatibility** - Maintain API compatibility
3. **Error Propagation** - Properly propagate errors
4. **Logging Integration** - Integrate with logging systems
5. **Monitoring** - Monitor tool performance and usage

### Development vs Production

**Development:**

```bash
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5
```

**Production:**

```bash
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20
```

### Error Handling

Always wrap transformation operations in try-except blocks:

```python
import pandas as pd

from aiecs.tools.statistics.data_transformer_tool import (
    DataTransformerTool,
    DataTransformerError,
    TransformationError,
)

data_transformer = DataTransformerTool()

# Small placeholder dataset; replace with your own DataFrame
df = pd.DataFrame({'age': [25, 32, None], 'city': ['NY', 'LA', 'NY']})

try:
    transformed_data = data_transformer.transform_data(
        data=df,
        transformations=['normalize', 'encode_categorical'],
        missing_strategy='mean'
    )
except TransformationError as e:
    print(f"Transformation error: {e}")
except DataTransformerError as e:
    print(f"Data transformer error: {e}")
except Exception as e:
    # Last resort: anything the tool-specific handlers did not catch
    print(f"Unexpected error: {e}")
```

## Dependencies

### Core Dependencies

```bash
# Install core dependencies
pip install pydantic python-dotenv

# Install data processing dependencies
pip install pandas numpy scikit-learn

# Install transformation dependencies
pip install scipy statsmodels
```

### Optional Dependencies

```bash
# For advanced transformations
pip install category-encoders feature-engine

# For feature selection
pip install sklearn-feature-selection

# For advanced scaling
pip install scikit-learn-extra

# For pipeline optimization
pip install optuna hyperopt
```

### Verification

```python
# Test dependency availability
try:
    import pandas
    import numpy
    import sklearn
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test transformation availability
try:
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    from sklearn.impute import SimpleImputer
    print("Transformation dependencies available")
except ImportError:
    print("Transformation dependencies not available")

# Test advanced encoding availability
try:
    import category_encoders
    print("Advanced encoding available")
except ImportError:
    print("Advanced encoding not available")

# Test feature engineering availability
try:
    import feature_engine
    print("Feature engineering available")
except ImportError:
    print("Feature engineering not available")
```

## Related Documentation

- Tool implementation details in the source code
- Pandas tool documentation for core data operations
- Statistics tool documentation for statistical analysis
- Main aiecs documentation for architecture overview

## Support

For issues or questions about Data Transformer Tool configuration:

- Check the tool source code for implementation details
- Review pandas tool documentation for core data operations
- Consult the main aiecs documentation for architecture overview
- Test with simple datasets first to isolate configuration vs. transformation issues
- Verify data compatibility and format requirements
- Check transformation parameters and strategies
- Ensure proper encoding and scaling settings
- Validate data quality and structure requirements
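
When separating configuration problems from transformation problems, it can help to print the settings a process would end up with before constructing the tool at all. The sketch below mirrors the priority order described in this guide (programmatic values, then `DATA_TRANSFORMER_*` variables, then defaults). It is a simplified stand-in rather than the tool's pydantic-based logic, and `parse_bool`, `SETTINGS`, and `resolve_settings` are hypothetical names.

```python
import os

PREFIX = "DATA_TRANSFORMER_"


def parse_bool(text):
    """Accept only the `true` / `false` spellings used in this guide."""
    lowered = text.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"expected 'true' or 'false', got {text!r}")
    return lowered == "true"


# Option name -> (parser, default), taken from the Configuration Options section
SETTINGS = {
    "OUTLIER_STD_THRESHOLD": (float, 3.0),
    "DEFAULT_MISSING_STRATEGY": (str, "mean"),
    "ENABLE_PIPELINE_CACHING": (parse_bool, True),
    "MAX_ONE_HOT_CATEGORIES": (int, 10),
}


def resolve_settings(overrides=None, environ=None):
    """Resolve each option as: overrides > environment > default."""
    environ = os.environ if environ is None else environ
    overrides = overrides or {}
    resolved = {}
    for name, (parser, default) in SETTINGS.items():
        key = name.lower()
        if key in overrides:
            resolved[key] = overrides[key]
        elif PREFIX + name in environ:
            resolved[key] = parser(environ[PREFIX + name])
        else:
            resolved[key] = default
    return resolved


# A stand-in environment keeps the demo independent of the real process
fake_env = {
    "DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY": "median",
    "DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING": "false",
}
print(resolve_settings({"enable_pipeline_caching": True}, fake_env))
```

Calling `resolve_settings()` with no arguments reads the real `os.environ`, so a malformed value such as `ENABLE_PIPELINE_CACHING=maybe` surfaces as a `ValueError` before any data is transformed.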