Data Transformer Tool Configuration Guide
Overview
The Data Transformer Tool provides data cleaning and preprocessing, feature engineering and encoding, normalization and standardization, missing value handling, and transformation pipelines. It integrates with pandas_tool for core operations. Supported transformation types are cleaning, transformation, encoding, and feature engineering operations; supported missing value strategies are drop, mean, median, mode, forward_fill, backward_fill, interpolate, and constant. Configure the tool either through environment variables with the DATA_TRANSFORMER_ prefix or programmatically when initializing it.
Using .env Files in Your Project
When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Data Transformer Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.
Setting Up .env Files
1. Install python-dotenv:
pip install python-dotenv
2. Create a .env file in your project root:
# .env file in your project root
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10
3. Load the .env file in your application:
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv
# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()
# Now import and use aiecs tools
from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool
# The tool will automatically use the environment variables
data_transformer = DataTransformerTool()
Multiple Environment Files
You can use different .env files for different environments:
import os
from dotenv import load_dotenv
# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')
if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')
from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool
data_transformer = DataTransformerTool()
Example .env.production:
# Production settings - optimized for robust transformations
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20
Example .env.development:
# Development settings - optimized for testing and debugging
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5
Best Practices for .env Files
Never commit .env files to version control - Add .env to your .gitignore:
# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging
Provide a template - Create .env.example with documented dummy values:
# .env.example
# Data Transformer Tool Configuration
# Standard deviation threshold for outlier detection
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
# Default strategy for handling missing values
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
# Whether to enable transformation pipeline caching
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
# Maximum number of categories for one-hot encoding
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10
Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format values correctly:
Strings: Plain text: mean, median, mode
Floats: Decimal numbers: 3.0, 2.5
Integers: Plain numbers: 10, 20
Booleans: true or false
Configuration Options
1. Outlier STD Threshold
Environment Variable: DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD
Type: Float
Default: 3.0
Description: Standard deviation threshold for outlier detection. Values beyond this threshold are considered outliers using the Z-score method during data cleaning operations.
Common Values:
2.0 - Strict outlier detection (more outliers detected)
2.5 - Moderate outlier detection
3.0 - Standard outlier detection (default)
3.5 - Lenient outlier detection (fewer outliers detected)
Example:
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
Threshold Note: Lower values detect more outliers, higher values are more lenient.
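To make the effect of the threshold concrete, here is a minimal sketch of Z-score outlier detection in plain Python. The function name find_outliers and the use of the population standard deviation are assumptions for illustration; the tool's internal implementation may differ.

```python
from statistics import mean, pstdev


def find_outliers(values, std_threshold=3.0):
    """Return the values whose absolute Z-score exceeds std_threshold.

    Hypothetical helper: illustrates the Z-score method only.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > std_threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]

# A strict threshold flags the extreme reading ...
strict = find_outliers(readings, std_threshold=2.0)
# ... while a lenient threshold lets it through.
lenient = find_outliers(readings, std_threshold=3.5)
```

With very small samples a single extreme value inflates the standard deviation itself, so its Z-score can stay below a lenient threshold, which is why the same reading passes at 3.5 here.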
2. Default Missing Strategy
Environment Variable: DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY
Type: String
Default: "mean"
Description: Default strategy for handling missing values when no specific strategy is provided. This determines how missing values are imputed or handled.
Supported Strategies:
drop - Drop rows/columns with missing values
mean - Fill with mean value (default)
median - Fill with median value
mode - Fill with most frequent value
forward_fill - Forward fill missing values
backward_fill - Backward fill missing values
interpolate - Interpolate missing values
constant - Fill with constant value
Example:
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
Strategy Note: Choose based on your data characteristics and domain knowledge.
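The following sketch shows what most of these strategies do to a single column, using None to mark missing entries. It is a plain-Python illustration under stated assumptions: fill_missing is a hypothetical helper, not the tool's API, and the interpolate strategy is left out here.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="mean", constant=0):
    """Return a copy of values with None entries handled per strategy.

    Hypothetical helper for illustration only.
    """
    present = [v for v in values if v is not None]
    if strategy == "drop":
        return present
    if strategy in ("mean", "median", "mode"):
        fill = {"mean": mean, "median": median, "mode": mode}[strategy](present)
        return [fill if v is None else v for v in values]
    if strategy == "constant":
        return [constant if v is None else v for v in values]
    if strategy in ("forward_fill", "backward_fill"):
        # Backward fill is forward fill applied to the reversed sequence.
        ordered = values if strategy == "forward_fill" else values[::-1]
        filled, last = [], None
        for v in ordered:
            last = v if v is not None else last
            filled.append(last)
        return filled if strategy == "forward_fill" else filled[::-1]
    raise ValueError(f"Unsupported strategy: {strategy}")


column = [1, None, 3, None, 8]
by_median = fill_missing(column, "median")
by_forward_fill = fill_missing(column, "forward_fill")
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed columns.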
3. Enable Pipeline Caching
Environment Variable: DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING
Type: Boolean
Default: True
Description: Whether to enable transformation pipeline caching. Caching improves performance for repeated transformations but uses additional memory.
Values:
true - Enable pipeline caching (default)
false - Disable pipeline caching
Example:
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
Caching Note: Enable for better performance, disable to save memory.
4. Max One Hot Categories
Environment Variable: DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES
Type: Integer
Default: 10
Description: Maximum number of categories for one-hot encoding. Columns with more unique values will use alternative encoding methods to prevent excessive dimensionality.
Common Values:
5 - Conservative encoding (few categories)
10 - Standard encoding (default)
20 - Liberal encoding (many categories)
50 - Very liberal encoding (maximum categories)
Example:
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20
Categories Note: Higher values allow more categories but increase dimensionality.
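As a rough illustration of how a category limit can gate one-hot encoding, the sketch below one-hot encodes a column only when its number of unique values is within the limit. The encode_column helper and the choice of label encoding as the fallback are assumptions; this guide does not state which alternative encoding the tool picks.

```python
def encode_column(values, max_one_hot_categories=10):
    """One-hot encode when cardinality is within the limit, else label encode.

    Hypothetical helper; the label-encoding fallback is an assumption.
    """
    categories = sorted(set(values))
    if len(categories) <= max_one_hot_categories:
        # One 0/1 indicator column per category.
        return {f"is_{c}": [int(v == c) for v in values] for c in categories}
    # Too many categories: a single integer column avoids a column explosion.
    index = {c: i for i, c in enumerate(categories)}
    return {"label": [index[v] for v in values]}


low_cardinality = encode_column(["a", "b", "a"], max_one_hot_categories=2)
high_cardinality = encode_column(["a", "b", "c"], max_one_hot_categories=2)
```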
Usage Examples
Example 1: Basic Environment Configuration
# Set basic data transformation parameters
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10
# Run your application
python app.py
Example 2: Robust Production Configuration
# Optimized for robust data transformations
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20
Example 3: Development Configuration
# Development-friendly settings
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5
Example 4: Programmatic Configuration
from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool
# Initialize with custom configuration
data_transformer = DataTransformerTool(config={
    'outlier_std_threshold': 3.0,
    'default_missing_strategy': 'mean',
    'enable_pipeline_caching': True,
    'max_one_hot_categories': 10
})
Example 5: Mixed Configuration
Environment variables are used as defaults, but can be overridden programmatically:
# In the shell: set environment defaults
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
# In Python: override for a specific instance
from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool
data_transformer = DataTransformerTool(config={
    'default_missing_strategy': 'median',  # This overrides the environment variable
    'enable_pipeline_caching': False  # This overrides the environment variable
})
Configuration Priority
When the Data Transformer Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):
1. Programmatic config - Values passed to the constructor
2. Environment variables - Values set via DATA_TRANSFORMER_* variables
3. Default values - Built-in defaults as specified above
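The resolution order can be sketched as a simple lookup chain. This is an illustration only: resolve_setting, DEFAULTS, and ENV_PREFIX are hypothetical names, and the sketch returns the raw environment string without type conversion.

```python
import os

# Built-in defaults as listed in this guide.
DEFAULTS = {
    "outlier_std_threshold": 3.0,
    "default_missing_strategy": "mean",
    "enable_pipeline_caching": True,
    "max_one_hot_categories": 10,
}
ENV_PREFIX = "DATA_TRANSFORMER_"


def resolve_setting(name, config=None, environ=None):
    """Return (value, source) for a setting, highest priority first."""
    config = config or {}
    environ = os.environ if environ is None else environ
    if name in config:  # 1. programmatic config
        return config[name], "programmatic"
    env_key = ENV_PREFIX + name.upper()
    if env_key in environ:  # 2. environment variable (raw string)
        return environ[env_key], "environment"
    return DEFAULTS[name], "default"  # 3. built-in default


# A fake environment keeps the example independent of the real process state.
fake_env = {"DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY": "median"}
overridden = resolve_setting(
    "default_missing_strategy", {"default_missing_strategy": "mode"}, fake_env
)
from_env = resolve_setting("default_missing_strategy", {}, fake_env)
fallback = resolve_setting("max_one_hot_categories", {}, fake_env)
```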
Data Type Parsing
String Values
Strings should be provided as plain text without quotes:
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mode
Float Values
Floats should be provided as decimal numbers:
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
Integer Values
Integers should be provided as whole numbers without a decimal point:
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20
Boolean Values
Booleans should be provided as lowercase strings:
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
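Environment variables always arrive as strings, so each one has to be converted to its target type. The sketch below shows one strict way to do that; parse_bool, PARSERS, and parse_env_value are hypothetical names, and the tool's own parsing may accept more spellings than this.

```python
def parse_bool(raw):
    """Parse 'true'/'false' explicitly.

    Calling bool("false") would return True, because any non-empty
    string is truthy, so booleans need their own parser.
    """
    normalized = raw.strip().lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    raise ValueError(f"Expected 'true' or 'false', got {raw!r}")


# One parser per variable, matching the types documented above.
PARSERS = {
    "DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD": float,
    "DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY": str,
    "DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING": parse_bool,
    "DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES": int,
}


def parse_env_value(name, raw):
    """Convert the raw string for a known variable to its typed value."""
    return PARSERS[name](raw)


caching = parse_env_value("DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING", "false")
threshold = parse_env_value("DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD", "2.5")
```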
Validation
Automatic Type Validation
Pydantic automatically validates configuration values:
outlier_std_threshold must be a positive float
default_missing_strategy must be a valid strategy string
enable_pipeline_caching must be a boolean
max_one_hot_categories must be a positive integer
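The four rules above can be expressed as explicit checks. The sketch below uses a standard-library dataclass as a stand-in for the Pydantic model so that it runs without extra dependencies; TransformerSettings is a hypothetical name and not the tool's actual settings class.

```python
from dataclasses import dataclass

VALID_STRATEGIES = {
    "drop", "mean", "median", "mode",
    "forward_fill", "backward_fill", "interpolate", "constant",
}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


@dataclass
class TransformerSettings:
    """Stand-in for the real settings model; mirrors the documented rules."""
    outlier_std_threshold: float = 3.0
    default_missing_strategy: str = "mean"
    enable_pipeline_caching: bool = True
    max_one_hot_categories: int = 10

    def __post_init__(self):
        if not _is_number(self.outlier_std_threshold) or self.outlier_std_threshold <= 0:
            raise ValueError("outlier_std_threshold must be a positive float")
        if self.default_missing_strategy not in VALID_STRATEGIES:
            raise ValueError("default_missing_strategy must be a valid strategy")
        if not isinstance(self.enable_pipeline_caching, bool):
            raise ValueError("enable_pipeline_caching must be a boolean")
        if (not isinstance(self.max_one_hot_categories, int)
                or isinstance(self.max_one_hot_categories, bool)
                or self.max_one_hot_categories <= 0):
            raise ValueError("max_one_hot_categories must be a positive integer")


settings = TransformerSettings(default_missing_strategy="median")
```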
Runtime Validation
When transforming data, the tool validates:
Outlier threshold - Threshold must be reasonable for the data distribution
Missing strategy - Strategy must be appropriate for the data type
Category limits - Category limits must be reasonable for encoding
Data compatibility - Data must be compatible with transformation operations
Memory requirements - Transformations must not exceed memory limits
Transformation Types
The Data Transformer Tool supports various transformation types:
Cleaning Operations
Remove Duplicates - Remove duplicate rows or columns
Fill Missing - Fill missing values using various strategies
Remove Outliers - Remove statistical outliers
Transformation Operations
Normalize - Min-max normalization
Standardize - Z-score standardization
Log Transform - Logarithmic transformation
Box-Cox - Box-Cox power transformation
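For reference, the first two operations reduce to short formulas: min-max normalization maps x to (x - min) / (max - min), and Z-score standardization maps x to (x - mean) / std. The sketch below implements both in plain Python; the function names are illustrative and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


def z_score_standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


normalized = min_max_normalize([2, 4, 6])
standardized = z_score_standardize([2, 4, 6])
```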
Encoding Operations
One-Hot Encode - One-hot encoding for categorical variables
Label Encode - Label encoding for ordinal variables
Target Encode - Target encoding for high-cardinality variables
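Target encoding is the least self-explanatory of the three, so here is a minimal sketch of its simplest form: each category is replaced by the mean of the target over the rows in that category. The target_encode helper is hypothetical, and it applies no smoothing or cross-validation, which practical implementations usually add to limit target leakage.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target of its rows (no smoothing)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    means = {c: totals[c] / counts[c] for c in totals}
    return [means[c] for c in categories]


encoded = target_encode(["a", "b", "a", "b"], [1, 0, 0, 0])
```

Because the result is a single numeric column regardless of how many categories exist, this approach suits high-cardinality variables where one-hot encoding would add too many columns.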
Feature Engineering
Polynomial Features - Generate polynomial features
Interaction Features - Create feature interactions
Binning - Create bins for continuous variables
Aggregation - Aggregate features by groups
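As one concrete example from this list, binning can be sketched as assigning each value to one of several equal-width intervals. The equal_width_bins helper is hypothetical; the tool may support other binning schemes, such as quantile-based bins, that this sketch does not cover.

```python
def equal_width_bins(values, n_bins):
    """Return a bin index in 0..n_bins-1 for each value, using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant column: everything in one bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


age_bins = equal_width_bins([18, 25, 40, 62, 80], n_bins=3)
```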
Missing Value Strategies
Statistical Strategies
Mean - Fill with mean value (default)
Median - Fill with median value
Mode - Fill with most frequent value
Interpolation Strategies
Forward Fill - Forward fill missing values
Backward Fill - Backward fill missing values
Interpolate - Linear interpolation
Other Strategies
Drop - Drop rows/columns with missing values
Constant - Fill with constant value
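Linear interpolation fills each gap with evenly spaced values between its nearest known neighbors. The sketch below is a plain-Python illustration with a hypothetical interpolate_linear helper; it assumes evenly spaced positions and leaves leading or trailing gaps untouched, since they have a neighbor on only one side.

```python
def interpolate_linear(values):
    """Fill None entries that lie between two known values by linear interpolation."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of consecutive known positions and fill the gap between them.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


filled = interpolate_linear([1, None, None, 4, None])
```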
Operations Supported
The Data Transformer Tool supports comprehensive data transformation operations:
Basic Transformations
transform_data - Apply comprehensive data transformations
clean_data - Clean and preprocess data
handle_missing_values - Handle missing values
remove_outliers - Remove statistical outliers
remove_duplicates - Remove duplicate records
Feature Engineering
engineer_features - Engineer new features
create_polynomial_features - Create polynomial features
create_interaction_features - Create feature interactions
create_bins - Create bins for continuous variables
aggregate_features - Aggregate features by groups
Encoding Operations
encode_categorical - Encode categorical variables
one_hot_encode - One-hot encode categorical variables
label_encode - Label encode ordinal variables
target_encode - Target encode high-cardinality variables
handle_high_cardinality - Handle high-cardinality categorical variables
Normalization and Scaling
normalize_data - Normalize data to [0,1] range
standardize_data - Standardize data to mean=0, std=1
robust_scale - Robust scaling using median and IQR
min_max_scale - Min-max scaling
z_score_scale - Z-score standardization
Pipeline Operations
create_pipeline - Create transformation pipeline
fit_pipeline - Fit transformation pipeline
transform_pipeline - Apply transformation pipeline
inverse_transform - Inverse transform data
save_pipeline - Save transformation pipeline
load_pipeline - Load transformation pipeline
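The fit, transform, and inverse-transform cycle can be illustrated with a small self-contained pipeline. The classes below (LogStep, MinMaxStep, Pipeline) are hypothetical and only show the general pattern; they are not the tool's pipeline API and do not cover saving or loading.

```python
import math


class LogStep:
    """Stateless step: log(1 + x) forward, exp(x) - 1 backward."""

    def fit(self, values):
        return self

    def transform(self, values):
        return [math.log1p(v) for v in values]

    def inverse_transform(self, values):
        return [math.expm1(v) for v in values]


class MinMaxStep:
    """Stateful step: learns min and max during fit, then rescales to [0, 1]."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0
        return [(v - self.lo) / span for v in values]

    def inverse_transform(self, values):
        span = (self.hi - self.lo) or 1.0
        return [v * span + self.lo for v in values]


class Pipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        # Each step is fitted on the output of the steps before it.
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values

    def inverse_transform(self, values):
        # Undo the steps in reverse order.
        for step in reversed(self.steps):
            values = step.inverse_transform(values)
        return values


train = [1.0, 10.0, 100.0]
pipeline = Pipeline([LogStep(), MinMaxStep()]).fit(train)
scaled = pipeline.transform(train)
restored = pipeline.inverse_transform(scaled)
```

Fitting once and reusing the fitted steps is what makes results on new data consistent with the training data, and it is also the state that pipeline caching would keep around.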
Advanced Operations
log_transform - Apply logarithmic transformation
box_cox_transform - Apply Box-Cox transformation
power_transform - Apply power transformation
quantile_transform - Apply quantile transformation
robust_transform - Apply robust transformation
Troubleshooting
Issue: Outlier removal too aggressive
Error: Too many data points removed as outliers
Solutions:
# Increase outlier threshold
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.5
# Or use domain-specific outlier detection
data_transformer.remove_outliers(data, method='domain_specific')
Issue: Missing value strategy fails
Error: Missing value imputation errors
Solutions:
# Change missing strategy
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
# Or specify strategy per column
data_transformer.handle_missing_values(data, strategies={'col1': 'mean', 'col2': 'median'})
Issue: One-hot encoding creates too many columns
Error: Excessive dimensionality from one-hot encoding
Solutions:
# Reduce max categories
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5
# Or use alternative encoding
data_transformer.encode_categorical(data, method='target_encoding')
Issue: Pipeline caching issues
Error: Pipeline caching problems
Solutions:
# Disable caching
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
# Or clear cache
data_transformer.clear_pipeline_cache()
Issue: Transformation performance issues
Error: Slow transformation operations
Solutions:
Enable pipeline caching
Use appropriate data types
Process data in chunks
Optimize transformation order
Issue: Memory usage exceeded
Error: Out of memory during transformations
Solutions:
# Disable caching
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
# Reduce max categories
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5
# Process data in chunks
data_transformer.transform_data(data, chunk_size=10000)
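If you need to chunk the data yourself, the pattern can be sketched as follows. The helpers iter_chunks and transform_in_chunks are hypothetical and operate on plain lists rather than DataFrames.

```python
def iter_chunks(rows, chunk_size):
    """Yield consecutive slices of rows, each with at most chunk_size items."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def transform_in_chunks(rows, transform, chunk_size=10000):
    """Apply transform to each chunk and concatenate the results in order."""
    output = []
    for chunk in iter_chunks(rows, chunk_size):
        output.extend(transform(chunk))
    return output


doubled = transform_in_chunks(
    list(range(5)), lambda chunk: [x * 2 for x in chunk], chunk_size=2
)
```

Chunking is only safe as-is for row-wise transformations. Operations that depend on whole-column statistics, such as normalization, need those statistics computed over the full dataset first; otherwise each chunk is scaled differently.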
Issue: Data type compatibility
Error: Data type incompatibility with transformations
Solutions:
Check data types before transformation
Convert data types appropriately
Use compatible transformation methods
Validate data structure
Best Practices
Performance Optimization
Pipeline Caching - Enable caching for repeated transformations
Category Management - Set appropriate category limits
Chunk Processing - Process large datasets in chunks
Memory Management - Monitor memory usage during transformations
Transformation Order - Optimize transformation sequence
Error Handling
Graceful Degradation - Handle transformation failures gracefully
Validation - Validate data before transformation
Fallback Strategies - Provide fallback transformation methods
Error Logging - Log errors for debugging and monitoring
User Feedback - Provide clear error messages
Security
Data Privacy - Ensure data privacy during transformations
Access Control - Control access to transformation results
Audit Logging - Log transformation activities
Data Sanitization - Sanitize sensitive data
Compliance - Ensure compliance with data regulations
Resource Management
Memory Monitoring - Monitor memory usage during transformations
Processing Time - Set reasonable timeouts
Storage Optimization - Optimize result storage
Cleanup - Clean up temporary files
Resource Limits - Set appropriate resource limits
Integration
Tool Dependencies - Ensure required tools are available
API Compatibility - Maintain API compatibility
Error Propagation - Properly propagate errors
Logging Integration - Integrate with logging systems
Monitoring - Monitor tool performance and usage
Development vs Production
Development:
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5
Production:
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20
Error Handling
Always wrap transformation operations in try-except blocks:
from aiecs.tools.statistics.data_transformer_tool import (
    DataTransformerTool,
    DataTransformerError,
    TransformationError,
)
data_transformer = DataTransformerTool()
# df is assumed to be an existing pandas DataFrame
try:
    transformed_data = data_transformer.transform_data(
        data=df,
        transformations=['normalize', 'encode_categorical'],
        missing_strategy='mean'
    )
except TransformationError as e:
    print(f"Transformation error: {e}")
except DataTransformerError as e:
    print(f"Data transformer error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Dependencies
Core Dependencies
# Install core dependencies
pip install pydantic python-dotenv
# Install data processing dependencies
pip install pandas numpy scikit-learn
# Install transformation dependencies
pip install scipy statsmodels
Optional Dependencies
# For advanced transformations
pip install category-encoders feature-engine
# For feature selection
pip install sklearn-feature-selection
# For advanced scaling
pip install scikit-learn-extra
# For pipeline optimization
pip install optuna hyperopt
Verification
# Test dependency availability
try:
    import pandas
    import numpy
    import sklearn
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")
# Test transformation availability
try:
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    from sklearn.impute import SimpleImputer
    print("Transformation dependencies available")
except ImportError:
    print("Transformation dependencies not available")
# Test advanced encoding availability
try:
    import category_encoders
    print("Advanced encoding available")
except ImportError:
    print("Advanced encoding not available")
# Test feature engineering availability
try:
    import feature_engine
    print("Feature engineering available")
except ImportError:
    print("Feature engineering not available")
Support
For issues or questions about Data Transformer Tool configuration:
Check the tool source code for implementation details
Review pandas tool documentation for core data operations
Consult the main aiecs documentation for architecture overview
Test with simple datasets first to isolate configuration vs. transformation issues
Verify data compatibility and format requirements
Check transformation parameters and strategies
Ensure proper encoding and scaling settings
Validate data quality and structure requirements