Data Transformer Tool Configuration Guide

Overview

The Data Transformer Tool provides data cleaning and preprocessing, feature engineering and encoding, normalization and standardization, missing value handling, and reusable transformation pipelines. It integrates with pandas_tool for core operations. It supports four groups of transformations (cleaning, transformation, encoding, and feature engineering) and eight missing value strategies (drop, mean, median, mode, forward_fill, backward_fill, interpolate, constant). Configure it through environment variables with the DATA_TRANSFORMER_ prefix, or programmatically when initializing the tool.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Data Transformer Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool

# The tool will automatically use the environment variables
data_transformer = DataTransformerTool()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool
data_transformer = DataTransformerTool()

Example .env.production:

# Production settings - optimized for robust transformations
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20

Example .env.development:

# Development settings - optimized for testing and debugging
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5

Best Practices for .env Files

Never commit .env files to version control - Add .env to your .gitignore:

# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging
.env.development

Provide a template - Create .env.example with documented dummy values:

# .env.example
# Data Transformer Tool Configuration

# Standard deviation threshold for outlier detection
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0

# Default strategy for handling missing values
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean

# Whether to enable transformation pipeline caching
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true

# Maximum number of categories for one-hot encoding
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10

Document your variables - Add comments explaining each setting.

Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports.

Format values correctly:
- Strings: plain text without quotes (mean, median, mode)
- Floats: decimal numbers (3.0, 2.5)
- Integers: whole numbers (10, 20)
- Booleans: lowercase true or false
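
To make the formatting rules concrete, the sketch below parses .env-style text into a dictionary. It is a simplified stand-in written for illustration, not python-dotenv's actual implementation (which also handles quoting and other syntax); parse_env_text is a hypothetical helper. The point it shows is that every value arrives as a plain string, which is why the formats above matter.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments (simplified)."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        result[key.strip()] = value.strip()
    return result


sample = """
# Data Transformer Tool Configuration
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
"""

parsed = parse_env_text(sample)
print(parsed)  # every value is still a string at this stage
```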

Configuration Options

1. Outlier STD Threshold

Environment Variable: DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD

Type: Float

Default: 3.0

Description: Standard deviation threshold for outlier detection. During data cleaning, the Z-score method treats a value as an outlier when it lies more than this many standard deviations from the mean.

Common Values:

2.0 - Strict outlier detection (more outliers detected)
2.5 - Moderate outlier detection
3.0 - Standard outlier detection (default)
3.5 - Lenient outlier detection (fewer outliers detected)

Example:

export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5

Threshold Note: Lower values flag more points as outliers; higher values are more lenient.
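
The effect of the threshold can be sketched with the standard library alone. This is an illustration of the Z-score method in general, not the tool's own code: zscore_outlier_mask is a hypothetical helper, and the use of the population standard deviation is an assumption (the tool may use the sample standard deviation).

```python
from statistics import mean, pstdev


def zscore_outlier_mask(values, threshold=3.0):
    """Return True for each value whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [False] * len(values)  # constant data has no outliers
    return [abs(v - mu) / sigma > threshold for v in values]


readings = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8, 50.0]

strict = zscore_outlier_mask(readings, threshold=2.5)
standard = zscore_outlier_mask(readings, threshold=3.0)
print(sum(strict), sum(standard))
```

With only nine points, the value 50.0 has a Z-score of about 2.83, so it is flagged at 2.5 but not at 3.0. On small samples a single extreme value inflates the standard deviation, which caps how large its own Z-score can get; a lower threshold may be needed there.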

2. Default Missing Strategy

Environment Variable: DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY

Type: String

Default: "mean"

Description: Default strategy for handling missing values when no specific strategy is provided. This determines how missing values are imputed or handled.

Supported Strategies:

drop - Drop rows/columns with missing values
mean - Fill with mean value (default)
median - Fill with median value
mode - Fill with most frequent value
forward_fill - Forward fill missing values
backward_fill - Backward fill missing values
interpolate - Interpolate missing values
constant - Fill with constant value

Example:

export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median

Strategy Note: Choose based on your data characteristics and domain knowledge.
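
As a rough illustration of what most of these strategies do, the following sketch applies them to a single column represented as a Python list with None for missing entries. fill_missing is a hypothetical helper, not the tool's API, and interpolate is left out for brevity.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="mean", constant=0.0):
    """Return a new list with None entries handled by the given strategy."""
    present = [v for v in values if v is not None]
    if strategy == "drop":
        return present
    if strategy in ("mean", "median", "mode"):
        stat = {"mean": mean, "median": median, "mode": mode}[strategy](present)
        return [stat if v is None else v for v in values]
    if strategy == "constant":
        return [constant if v is None else v for v in values]
    if strategy == "forward_fill":
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            filled.append(last)  # a leading gap stays None
        return filled
    if strategy == "backward_fill":
        # Backward fill is forward fill applied to the reversed column.
        return fill_missing(values[::-1], "forward_fill")[::-1]
    raise ValueError(f"unsupported strategy: {strategy}")


column = [1.0, None, 3.0, 3.0, None, 8.0]
for name in ("drop", "mean", "median", "mode",
             "forward_fill", "backward_fill", "constant"):
    print(name, fill_missing(column, name))
```

Note how mean (3.75) is pulled upward by the value 8.0 while median (3.0) is not, which is one reason median is often preferred for skewed data.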

3. Enable Pipeline Caching

Environment Variable: DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING

Type: Boolean

Default: True

Description: Whether to enable transformation pipeline caching. Caching improves performance for repeated transformations but uses additional memory.

Values:

true - Enable pipeline caching (default)
false - Disable pipeline caching

Example:

export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true

Caching Note: Enable caching for better performance; disable it to save memory.
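
The trade-off between speed and memory can be sketched as a small memoization layer. This is a conceptual illustration only: PipelineCache, its cache key, and fit_min_max are hypothetical and do not describe how the tool caches pipelines internally.

```python
class PipelineCache:
    """Memoize fitted pipeline parameters, keyed by steps and input data."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self._store = {}
        self.hits = 0

    def get_or_fit(self, steps, data, fit):
        if not self.enabled:
            return fit(steps, data)  # always recompute, store nothing
        key = (tuple(steps), tuple(data))
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = fit(steps, data)
        return self._store[key]


def fit_min_max(steps, data):
    """Stand-in for an expensive fitting step."""
    return {"min": min(data), "max": max(data)}


cache = PipelineCache(enabled=True)
first = cache.get_or_fit(["normalize"], [2.0, 4.0, 10.0], fit_min_max)
second = cache.get_or_fit(["normalize"], [2.0, 4.0, 10.0], fit_min_max)
print(cache.hits)  # 1: the second call reused the stored result

uncached = PipelineCache(enabled=False)
uncached.get_or_fit(["normalize"], [2.0, 4.0, 10.0], fit_min_max)
```

Every stored entry stays in memory until the cache is cleared, which is the cost that disabling caching avoids.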

4. Max One Hot Categories

Environment Variable: DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES

Type: Integer

Default: 10

Description: Maximum number of categories for one-hot encoding. Columns with more unique values will use alternative encoding methods to prevent excessive dimensionality.

Common Values:

5 - Conservative encoding (few categories)
10 - Standard encoding (default)
20 - Liberal encoding (many categories)
50 - Very liberal encoding (very many categories)

Example:

export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20

Categories Note: Higher values allow more categories but increase dimensionality.
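
The cardinality check described above might look like the following sketch. choose_encoding and one_hot are hypothetical helpers; whether the limit is inclusive, and which alternative encoding the tool falls back to, are assumptions not confirmed by this guide.

```python
def choose_encoding(values, max_one_hot_categories=10):
    """Pick one-hot encoding only when the column has few enough unique values."""
    n_unique = len(set(values))
    # Assumption: a column with exactly the limit still qualifies for one-hot.
    return "one_hot" if n_unique <= max_one_hot_categories else "alternative"


def one_hot(values):
    """One indicator column per category, in sorted category order."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


colors = ["red", "green", "red", "blue"]
user_ids = [f"user_{i}" for i in range(50)]

print(choose_encoding(colors))    # 3 unique values
print(choose_encoding(user_ids))  # 50 unique values
print(one_hot(colors)[0])
```

One-hot encoding the 50 user IDs would add 50 columns for a single feature, which is the dimensionality growth the limit is meant to prevent.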

Usage Examples

Example 1: Basic Environment Configuration

# Set basic data transformation parameters
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10

# Run your application
python app.py

Example 2: Robust Production Configuration

# Optimized for robust data transformations
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20

Example 3: Development Configuration

# Development-friendly settings
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5

Example 4: Programmatic Configuration

from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool

# Initialize with custom configuration
data_transformer = DataTransformerTool(config={
    'outlier_std_threshold': 3.0,
    'default_missing_strategy': 'mean',
    'enable_pipeline_caching': True,
    'max_one_hot_categories': 10
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

# Shell: set environment defaults
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true

# Python: override for a specific instance
from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool

data_transformer = DataTransformerTool(config={
    'default_missing_strategy': 'median',  # Overrides the environment variable
    'enable_pipeline_caching': False  # Overrides the environment variable
})

Configuration Priority

When the Data Transformer Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

1. Programmatic config - Values passed to the constructor
2. Environment variables - Values set via DATA_TRANSFORMER_* variables
3. Default values - Built-in defaults as specified above
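
This three-layer lookup can be sketched with collections.ChainMap, which searches its mappings from left to right. The sketch illustrates the priority order only and is not the tool's implementation; resolve_config and env_layer are hypothetical names, and environment values are left as raw strings here to keep the focus on precedence.

```python
from collections import ChainMap

PREFIX = "DATA_TRANSFORMER_"

DEFAULTS = {
    "outlier_std_threshold": 3.0,
    "default_missing_strategy": "mean",
    "enable_pipeline_caching": True,
    "max_one_hot_categories": 10,
}


def env_layer(environ):
    """Map DATA_TRANSFORMER_* variables to lowercase option names."""
    return {
        key[len(PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }


def resolve_config(overrides, environ):
    """Constructor overrides win over environment values, which win over defaults."""
    return dict(ChainMap(overrides, env_layer(environ), DEFAULTS))


environ = {
    "DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY": "mode",
    "DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES": "20",
    "UNRELATED_VARIABLE": "ignored",
}
config = resolve_config({"default_missing_strategy": "median"}, environ)
print(config)
```

In a real application, os.environ would be passed as the environ argument.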

Data Type Parsing

String Values

Strings should be provided as plain text without quotes:

export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mode

Float Values

Floats should be provided as decimal numbers:

export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5

Integer Values

Integers should be provided as whole numbers without a decimal point:

export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20

Boolean Values

Booleans should be provided as lowercase true or false:

export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
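
Because environment variables are always strings, each one has to be converted to its target type. The sketch below shows one plausible way to do that; parse_bool and parse_settings are hypothetical, and the accepted boolean spellings are an assumption (the tool's own parser may accept more).

```python
def parse_bool(raw):
    """Convert 'true'/'false' to a bool; bool('false') would wrongly be True."""
    text = raw.strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"expected true or false, got {raw!r}")


PARSERS = {
    "DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD": float,
    "DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY": str,
    "DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING": parse_bool,
    "DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES": int,
}


def parse_settings(environ):
    """Parse only the known variables, leaving everything else untouched."""
    return {
        name: PARSERS[name](raw)
        for name, raw in environ.items()
        if name in PARSERS
    }


settings = parse_settings({
    "DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD": "2.5",
    "DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY": "median",
    "DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING": "false",
    "DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES": "20",
})
print(settings)
```

Note that int("10.0") raises ValueError in Python, which is why integer settings should not carry a decimal point.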

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

outlier_std_threshold must be a positive float
default_missing_strategy must be a valid strategy string
enable_pipeline_caching must be a boolean
max_one_hot_categories must be a positive integer
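
The four rules above can be expressed in a few lines. To keep the example free of third-party imports, this sketch uses a standard-library dataclass as a stand-in for the Pydantic model; TransformerSettings and its checks are hypothetical and only mirror the rules listed here.

```python
from dataclasses import dataclass

VALID_STRATEGIES = {
    "drop", "mean", "median", "mode",
    "forward_fill", "backward_fill", "interpolate", "constant",
}


@dataclass(frozen=True)
class TransformerSettings:
    outlier_std_threshold: float = 3.0
    default_missing_strategy: str = "mean"
    enable_pipeline_caching: bool = True
    max_one_hot_categories: int = 10

    def __post_init__(self):
        threshold = self.outlier_std_threshold
        if not isinstance(threshold, float) or threshold <= 0:
            raise ValueError("outlier_std_threshold must be a positive float")
        if self.default_missing_strategy not in VALID_STRATEGIES:
            raise ValueError("default_missing_strategy must be a valid strategy")
        if not isinstance(self.enable_pipeline_caching, bool):
            raise ValueError("enable_pipeline_caching must be a boolean")
        categories = self.max_one_hot_categories
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(categories, bool) or not isinstance(categories, int) or categories <= 0:
            raise ValueError("max_one_hot_categories must be a positive integer")


ok = TransformerSettings(outlier_std_threshold=2.5, default_missing_strategy="median")
print(ok)

try:
    TransformerSettings(default_missing_strategy="average")
except ValueError as exc:
    print(f"rejected: {exc}")
```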

Runtime Validation

When transforming data, the tool validates:

Outlier threshold - Threshold must be reasonable for the data distribution
Missing strategy - Strategy must be appropriate for the data type
Category limits - Category limits must be reasonable for encoding
Data compatibility - Data must be compatible with transformation operations
Memory requirements - Transformations must not exceed memory limits

Transformation Types

The Data Transformer Tool supports various transformation types:

Cleaning Operations

Remove Duplicates - Remove duplicate rows or columns
Fill Missing - Fill missing values using various strategies
Remove Outliers - Remove statistical outliers

Transformation Operations

Normalize - Min-max normalization
Standardize - Z-score standardization
Log Transform - Logarithmic transformation
Box-Cox - Box-Cox power transformation
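
For reference, the first three transformations reduce to short formulas. The sketch below implements them on plain lists with the standard library; the function names are hypothetical, the population standard deviation is an assumption, and the use of log(1 + x) for the log transform is one common convention rather than something this guide specifies. Box-Cox is omitted because it requires fitting a power parameter.

```python
import math
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale to mean 0 and standard deviation 1: (x - mean) / std."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """Compress large values with log(1 + x); requires x > -1."""
    return [math.log1p(v) for v in values]


data = [2.0, 4.0, 6.0, 10.0]
print(min_max_normalize(data))  # [0.0, 0.25, 0.5, 1.0]
print(standardize(data))
print(log_transform(data))
```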

Encoding Operations

One-Hot Encode - One-hot encoding for categorical variables
Label Encode - Label encoding for ordinal variables
Target Encode - Target encoding for high-cardinality variables
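
Label and target encoding are easy to confuse, so here is a minimal sketch of both on plain lists. label_encode and target_encode are hypothetical helpers; the target encoder uses the unsmoothed per-category mean, which is the simplest variant and not necessarily what the tool applies.

```python
from collections import defaultdict


def label_encode(values):
    """Map each category to an integer, in sorted category order."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means


cities = ["paris", "rome", "paris", "oslo"]
prices = [10.0, 20.0, 30.0, 40.0]

labels, label_map = label_encode(cities)
encoded, city_means = target_encode(cities, prices)
print(labels)   # [1, 2, 1, 0]
print(encoded)  # [20.0, 20.0, 20.0, 40.0]
```

Target encoding always yields a single numeric column regardless of cardinality, which is why it suits high-cardinality variables. Because it uses the target, the means should be computed on training data only to avoid leakage.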

Feature Engineering

Polynomial Features - Generate polynomial features
Interaction Features - Create feature interactions
Binning - Create bins for continuous variables
Aggregation - Aggregate features by groups
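
Two of these techniques are sketched below on a single row of numeric features. polynomial_features and assign_bin are hypothetical helpers; the naming of generated features ("a*b") and the bin-edge convention are illustrative choices, not the tool's.

```python
from bisect import bisect_right
from itertools import combinations_with_replacement


def polynomial_features(row, degree=2):
    """Add every product of `degree` features (squares and interactions)."""
    names = sorted(row)
    out = dict(row)
    for combo in combinations_with_replacement(names, degree):
        value = 1.0
        for name in combo:
            value *= row[name]
        out["*".join(combo)] = value
    return out


def assign_bin(value, edges):
    """Return the bin index: 0 below the first edge, len(edges) at or above the last."""
    return bisect_right(edges, value)


row = {"a": 2.0, "b": 3.0}
print(polynomial_features(row))
# {'a': 2.0, 'b': 3.0, 'a*a': 4.0, 'a*b': 6.0, 'b*b': 9.0}

age_edges = [18, 40, 65]
print([assign_bin(age, age_edges) for age in (10, 18, 25, 70)])  # [0, 1, 1, 3]
```

The "a*b" term is the interaction feature; "a*a" and "b*b" are the pure polynomial terms.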

Missing Value Strategies

Statistical Strategies

Mean - Fill with mean value (default)
Median - Fill with median value
Mode - Fill with most frequent value

Interpolation Strategies

Forward Fill - Forward fill missing values
Backward Fill - Backward fill missing values
Interpolate - Linear interpolation
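
Linear interpolation fills each gap with evenly spaced values between its two known neighbors. The following sketch assumes evenly spaced observations and leaves leading and trailing gaps unfilled; linear_interpolate is a hypothetical helper, and the tool's handling of edge gaps may differ.

```python
def linear_interpolate(values):
    """Fill None entries that lie between two known values."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            out[i] = out[left] + fraction * (out[right] - out[left])
    return out


series = [None, 10.0, None, 20.0, None, None, None, 40.0]
print(linear_interpolate(series))
# [None, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0]
```

Unlike forward or backward fill, interpolation needs a known value on both sides of a gap, so the leading None stays missing.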

Other Strategies

Drop - Drop rows/columns with missing values
Constant - Fill with constant value

Operations Supported

The Data Transformer Tool supports comprehensive data transformation operations:

Basic Transformations

transform_data - Apply comprehensive data transformations
clean_data - Clean and preprocess data
handle_missing_values - Handle missing values
remove_outliers - Remove statistical outliers
remove_duplicates - Remove duplicate records

Feature Engineering

engineer_features - Engineer new features
create_polynomial_features - Create polynomial features
create_interaction_features - Create feature interactions
create_bins - Create bins for continuous variables
aggregate_features - Aggregate features by groups

Encoding Operations

encode_categorical - Encode categorical variables
one_hot_encode - One-hot encode categorical variables
label_encode - Label encode ordinal variables
target_encode - Target encode high-cardinality variables
handle_high_cardinality - Handle high-cardinality categorical variables

Normalization and Scaling

normalize_data - Normalize data to [0,1] range
standardize_data - Standardize data to mean=0, std=1
robust_scale - Robust scaling using median and IQR
min_max_scale - Min-max scaling
z_score_scale - Z-score standardization
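
Robust scaling is the least familiar entry in this list, so a short sketch may help. It centers on the median and divides by the interquartile range (IQR), which keeps a single extreme value from distorting the scale. robust_scale here is a standalone illustration, not the tool's method, and the "inclusive" quartile convention is an assumption (quartile definitions vary between libraries).

```python
from statistics import median, quantiles


def robust_scale(values):
    """Scale as (x - median) / IQR, where IQR = Q3 - Q1."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    center = median(values)
    if iqr == 0:
        return [0.0 for _ in values]
    return [(v - center) / iqr for v in values]


data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(robust_scale(data))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The four typical values land in a narrow band around zero while the outlier remains clearly separated. Min-max scaling of the same data would squeeze the typical values into roughly the bottom 3% of the [0,1] range.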

Pipeline Operations

create_pipeline - Create transformation pipeline
fit_pipeline - Fit transformation pipeline
transform_pipeline - Apply transformation pipeline
inverse_transform - Inverse transform data
save_pipeline - Save transformation pipeline
load_pipeline - Load transformation pipeline
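
The fit, transform, and inverse-transform life cycle can be illustrated with a small pipeline built from two steps. The classes below are hypothetical and written only to show the pattern; they do not reflect the signatures of create_pipeline, fit_pipeline, or the other operations listed above.

```python
import math


class Log1pStep:
    """Stateless step: log(1 + x) forward, exp(x) - 1 backward."""

    def fit(self, values):
        return self

    def transform(self, values):
        return [math.log1p(v) for v in values]

    def inverse_transform(self, values):
        return [math.expm1(v) for v in values]


class MinMaxStep:
    """Stateful step: learns min and max during fit."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0
        return [(v - self.lo) / span for v in values]

    def inverse_transform(self, values):
        span = (self.hi - self.lo) or 1.0
        return [v * span + self.lo for v in values]


class Pipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        # Each step is fitted on the output of the previous step.
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values

    def inverse_transform(self, values):
        # Undo the steps in reverse order.
        for step in reversed(self.steps):
            values = step.inverse_transform(values)
        return values


data = [0.0, 9.0, 99.0]
pipeline = Pipeline([Log1pStep(), MinMaxStep()]).fit(data)
scaled = pipeline.transform(data)
restored = pipeline.inverse_transform(scaled)
print(scaled)
print(restored)
```

Fitting once and reusing the fitted pipeline on new data guarantees that training and inference data are scaled with the same learned parameters.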

Advanced Operations

log_transform - Apply logarithmic transformation
box_cox_transform - Apply Box-Cox transformation
power_transform - Apply power transformation
quantile_transform - Apply quantile transformation
robust_transform - Apply robust transformation

Troubleshooting

Issue: Outlier removal too aggressive

Error: Too many data points removed as outliers

Solutions:

# Increase outlier threshold
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.5

# Or use domain-specific outlier detection
data_transformer.remove_outliers(data, method='domain_specific')

Issue: Missing value strategy fails

Error: Missing value imputation errors

Solutions:

# Change missing strategy
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median

# Or specify strategy per column
data_transformer.handle_missing_values(data, strategies={'col1': 'mean', 'col2': 'median'})

Issue: One-hot encoding creates too many columns

Error: Excessive dimensionality from one-hot encoding

Solutions:

# Reduce max categories
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5

# Or use alternative encoding
data_transformer.encode_categorical(data, method='target_encoding')

Issue: Pipeline caching issues

Error: Pipeline caching problems

Solutions:

# Disable caching
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false

# Or clear cache
data_transformer.clear_pipeline_cache()

Issue: Transformation performance issues

Error: Slow transformation operations

Solutions:

Enable pipeline caching
Use appropriate data types
Process data in chunks
Optimize transformation order

Issue: Memory usage exceeded

Error: Out of memory during transformations

Solutions:

# Disable caching
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false

# Reduce max categories
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5

# Process data in chunks
data_transformer.transform_data(data, chunk_size=10000)
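
Chunked processing keeps only one slice of the data in flight at a time. The sketch below shows the general pattern on a plain list; iter_chunks and transform_in_chunks are hypothetical helpers and say nothing about how the chunk_size parameter above is implemented.

```python
def iter_chunks(rows, chunk_size):
    """Yield consecutive slices of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def transform_in_chunks(rows, transform, chunk_size=10000):
    """Apply a row-wise transform chunk by chunk and collect the results."""
    out = []
    for chunk in iter_chunks(rows, chunk_size):
        out.extend(transform(chunk))
    return out


def double(chunk):
    return [x * 2 for x in chunk]


rows = list(range(10))
print(list(iter_chunks(rows, 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(transform_in_chunks(rows, double, chunk_size=4))
```

This pattern is only safe for row-wise transformations. Anything that depends on whole-column statistics (min, max, mean, category sets) should be fitted on the full data, or on a representative sample, before the chunks are transformed; otherwise each chunk would be scaled differently.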

Issue: Data type compatibility

Error: Data type incompatibility with transformations

Solutions:

Check data types before transformation
Convert data types appropriately
Use compatible transformation methods
Validate data structure

Best Practices

Performance Optimization

Pipeline Caching - Enable caching for repeated transformations
Category Management - Set appropriate category limits
Chunk Processing - Process large datasets in chunks
Memory Management - Monitor memory usage during transformations
Transformation Order - Optimize transformation sequence

Error Handling

Graceful Degradation - Handle transformation failures gracefully
Validation - Validate data before transformation
Fallback Strategies - Provide fallback transformation methods
Error Logging - Log errors for debugging and monitoring
User Feedback - Provide clear error messages

Security

Data Privacy - Ensure data privacy during transformations
Access Control - Control access to transformation results
Audit Logging - Log transformation activities
Data Sanitization - Sanitize sensitive data
Compliance - Ensure compliance with data regulations

Resource Management

Memory Monitoring - Monitor memory usage during transformations
Processing Time - Set reasonable timeouts
Storage Optimization - Optimize result storage
Cleanup - Clean up temporary files
Resource Limits - Set appropriate resource limits

Integration

Tool Dependencies - Ensure required tools are available
API Compatibility - Maintain API compatibility
Error Propagation - Properly propagate errors
Logging Integration - Integrate with logging systems
Monitoring - Monitor tool performance and usage

Development vs Production

Development:

DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5

Production:

DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20

Error Handling

Always wrap transformation operations in try-except blocks:

from aiecs.tools.statistics.data_transformer_tool import (
    DataTransformerTool,
    DataTransformerError,
    TransformationError,
)

data_transformer = DataTransformerTool()
# df is assumed to be a pandas DataFrame loaded elsewhere in your application

try:
    transformed_data = data_transformer.transform_data(
        data=df,
        transformations=['normalize', 'encode_categorical'],
        missing_strategy='mean'
    )
except TransformationError as e:
    print(f"Transformation error: {e}")
except DataTransformerError as e:
    print(f"Data transformer error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Dependencies

Core Dependencies

# Install core dependencies
pip install pydantic python-dotenv

# Install data processing dependencies
pip install pandas numpy scikit-learn

# Install transformation dependencies
pip install scipy statsmodels

Optional Dependencies

# For advanced transformations
pip install category-encoders feature-engine

# For feature selection: no extra package is needed;
# scikit-learn (installed above) includes the sklearn.feature_selection module

# For advanced scaling
pip install scikit-learn-extra

# For pipeline optimization
pip install optuna hyperopt

Verification

# Test dependency availability
try:
    import pandas
    import numpy
    import sklearn
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test transformation availability
try:
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    from sklearn.impute import SimpleImputer
    print("Transformation dependencies available")
except ImportError:
    print("Transformation dependencies not available")

# Test advanced encoding availability
try:
    import category_encoders
    print("Advanced encoding available")
except ImportError:
    print("Advanced encoding not available")

# Test feature engineering availability
try:
    import feature_engine
    print("Feature engineering available")
except ImportError:
    print("Feature engineering not available")

Support

For issues or questions about Data Transformer Tool configuration:

Check the tool source code for implementation details
Review pandas tool documentation for core data operations
Consult the main aiecs documentation for architecture overview
Test with simple datasets first to isolate configuration vs. transformation issues
Verify data compatibility and format requirements
Check transformation parameters and strategies
Ensure proper encoding and scaling settings
Validate data quality and structure requirements
