Data Transformer Tool Configuration Guide

Overview

The Data Transformer Tool is an advanced data transformation tool that provides comprehensive data transformation capabilities with data cleaning and preprocessing, feature engineering and encoding, normalization and standardization, transformation pipelines, and missing value handling. It can clean and preprocess data, engineer features, transform and normalize data, and build transformation pipelines. The tool integrates with pandas_tool for core operations and supports various transformation types (cleaning operations, transformation operations, encoding operations, feature engineering) and missing value strategies (drop, mean, median, mode, forward_fill, backward_fill, interpolate, constant). The tool can be configured via environment variables using the DATA_TRANSFORMER_ prefix or through programmatic configuration when initializing the tool.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Data Transformer Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool

# The tool will automatically use the environment variables
data_transformer = DataTransformerTool()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool
data_transformer = DataTransformerTool()

Example .env.production:

# Production settings - optimized for robust transformations
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20

Example .env.development:

# Development settings - optimized for testing and debugging
DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5

Best Practices for .env Files

  1. Never commit .env files to version control - Add .env to your .gitignore:

    # .gitignore
    .env
    .env.local
    .env.*.local
    .env.production
    .env.staging
    
  2. Provide a template - Create .env.example with documented dummy values:

    # .env.example
    # Data Transformer Tool Configuration
    
    # Standard deviation threshold for outlier detection
    DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
    
    # Default strategy for handling missing values
    DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
    
    # Whether to enable transformation pipeline caching
    DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
    
    # Maximum number of categories for one-hot encoding
    DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10
    
  3. Document your variables - Add comments explaining each setting

  4. Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports

  5. Format values correctly:

    • Strings: Plain text: mean, median, mode

    • Floats: Decimal numbers: 3.0, 2.5

    • Integers: Plain numbers: 10, 20

    • Booleans: true or false

Configuration Options

1. Outlier STD Threshold

Environment Variable: DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD

Type: Float

Default: 3.0

Description: Standard deviation threshold for outlier detection. Values beyond this threshold are considered outliers using the Z-score method during data cleaning operations.

Common Values:

  • 2.0 - Strict outlier detection (more outliers detected)

  • 2.5 - Moderate outlier detection

  • 3.0 - Standard outlier detection (default)

  • 3.5 - Lenient outlier detection (fewer outliers detected)

Example:

export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5

Threshold Note: Lower values detect more outliers, higher values are more lenient.

2. Default Missing Strategy

Environment Variable: DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY

Type: String

Default: "mean"

Description: Default strategy for handling missing values when no specific strategy is provided. This determines how missing values are imputed or handled.

Supported Strategies:

  • drop - Drop rows/columns with missing values

  • mean - Fill with mean value (default)

  • median - Fill with median value

  • mode - Fill with most frequent value

  • forward_fill - Forward fill missing values

  • backward_fill - Backward fill missing values

  • interpolate - Interpolate missing values

  • constant - Fill with constant value

Example:

export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median

Strategy Note: Choose based on your data characteristics and domain knowledge.

3. Enable Pipeline Caching

Environment Variable: DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING

Type: Boolean

Default: True

Description: Whether to enable transformation pipeline caching. Caching improves performance for repeated transformations but uses additional memory.

Values:

  • true - Enable pipeline caching (default)

  • false - Disable pipeline caching

Example:

export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true

Caching Note: Enable for better performance, disable to save memory.

4. Max One Hot Categories

Environment Variable: DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES

Type: Integer

Default: 10

Description: Maximum number of categories for one-hot encoding. Columns with more unique values will use alternative encoding methods to prevent excessive dimensionality.

Common Values:

  • 5 - Conservative encoding (few categories)

  • 10 - Standard encoding (default)

  • 20 - Liberal encoding (many categories)

  • 50 - Very liberal encoding (maximum categories)

Example:

export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20

Categories Note: Higher values allow more categories but increase dimensionality.

Usage Examples

Example 1: Basic Environment Configuration

# Set basic data transformation parameters
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10

# Run your application
python app.py

Example 2: Robust Production Configuration

# Optimized for robust data transformations
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20

Example 3: Development Configuration

# Development-friendly settings
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5

Example 4: Programmatic Configuration

from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool

# Initialize with custom configuration
data_transformer = DataTransformerTool(config={
    'outlier_std_threshold': 3.0,
    'default_missing_strategy': 'mean',
    'enable_pipeline_caching': True,
    'max_one_hot_categories': 10
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

# Set environment defaults
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
# Override for specific instance
data_transformer = DataTransformerTool(config={
    'default_missing_strategy': 'median',  # This overrides the environment variable
    'enable_pipeline_caching': False  # This overrides the environment variable
})

Configuration Priority

When the Data Transformer Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

  1. Programmatic config - Values passed to the constructor

  2. Environment variables - Values set via DATA_TRANSFORMER_* variables

  3. Default values - Built-in defaults as specified above

Data Type Parsing

String Values

Strings should be provided as plain text without quotes:

export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mode

Float Values

Floats should be provided as decimal numbers:

export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5

Integer Values

Integers should be provided as numeric strings:

export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=10
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20

Boolean Values

Booleans should be provided as lowercase strings:

export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

  • outlier_std_threshold must be a positive float

  • default_missing_strategy must be a valid strategy string

  • enable_pipeline_caching must be a boolean

  • max_one_hot_categories must be a positive integer

Runtime Validation

When transforming data, the tool validates:

  1. Outlier threshold - Threshold must be reasonable for the data distribution

  2. Missing strategy - Strategy must be appropriate for the data type

  3. Category limits - Category limits must be reasonable for encoding

  4. Data compatibility - Data must be compatible with transformation operations

  5. Memory requirements - Transformations must not exceed memory limits

Transformation Types

The Data Transformer Tool supports various transformation types:

Cleaning Operations

  • Remove Duplicates - Remove duplicate rows or columns

  • Fill Missing - Fill missing values using various strategies

  • Remove Outliers - Remove statistical outliers

Transformation Operations

  • Normalize - Min-max normalization

  • Standardize - Z-score standardization

  • Log Transform - Logarithmic transformation

  • Box-Cox - Box-Cox power transformation

Encoding Operations

  • One-Hot Encode - One-hot encoding for categorical variables

  • Label Encode - Label encoding for ordinal variables

  • Target Encode - Target encoding for high-cardinality variables

Feature Engineering

  • Polynomial Features - Generate polynomial features

  • Interaction Features - Create feature interactions

  • Binning - Create bins for continuous variables

  • Aggregation - Aggregate features by groups

Missing Value Strategies

Statistical Strategies

  • Mean - Fill with mean value (default)

  • Median - Fill with median value

  • Mode - Fill with most frequent value

Interpolation Strategies

  • Forward Fill - Forward fill missing values

  • Backward Fill - Backward fill missing values

  • Interpolate - Linear interpolation

Other Strategies

  • Drop - Drop rows/columns with missing values

  • Constant - Fill with constant value

Operations Supported

The Data Transformer Tool supports comprehensive data transformation operations:

Basic Transformations

  • transform_data - Apply comprehensive data transformations

  • clean_data - Clean and preprocess data

  • handle_missing_values - Handle missing values

  • remove_outliers - Remove statistical outliers

  • remove_duplicates - Remove duplicate records

Feature Engineering

  • engineer_features - Engineer new features

  • create_polynomial_features - Create polynomial features

  • create_interaction_features - Create feature interactions

  • create_bins - Create bins for continuous variables

  • aggregate_features - Aggregate features by groups

Encoding Operations

  • encode_categorical - Encode categorical variables

  • one_hot_encode - One-hot encode categorical variables

  • label_encode - Label encode ordinal variables

  • target_encode - Target encode high-cardinality variables

  • handle_high_cardinality - Handle high-cardinality categorical variables

Normalization and Scaling

  • normalize_data - Normalize data to [0,1] range

  • standardize_data - Standardize data to mean=0, std=1

  • robust_scale - Robust scaling using median and IQR

  • min_max_scale - Min-max scaling

  • z_score_scale - Z-score standardization

Pipeline Operations

  • create_pipeline - Create transformation pipeline

  • fit_pipeline - Fit transformation pipeline

  • transform_pipeline - Apply transformation pipeline

  • inverse_transform - Inverse transform data

  • save_pipeline - Save transformation pipeline

  • load_pipeline - Load transformation pipeline

Advanced Operations

  • log_transform - Apply logarithmic transformation

  • box_cox_transform - Apply Box-Cox transformation

  • power_transform - Apply power transformation

  • quantile_transform - Apply quantile transformation

  • robust_transform - Apply robust transformation

Troubleshooting

Issue: Outlier removal too aggressive

Error: Too many data points removed as outliers

Solutions:

# Increase outlier threshold
export DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.5

# Or use domain-specific outlier detection
data_transformer.remove_outliers(data, method='domain_specific')

Issue: Missing value strategy fails

Error: Missing value imputation errors

Solutions:

# Change missing strategy
export DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median

# Or specify strategy per column
data_transformer.handle_missing_values(data, strategies={'col1': 'mean', 'col2': 'median'})

Issue: One-hot encoding creates too many columns

Error: Excessive dimensionality from one-hot encoding

Solutions:

# Reduce max categories
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5

# Or use alternative encoding
data_transformer.encode_categorical(data, method='target_encoding')

Issue: Pipeline caching issues

Error: Pipeline caching problems

Solutions:

# Disable caching
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false

# Or clear cache
data_transformer.clear_pipeline_cache()

Issue: Transformation performance issues

Error: Slow transformation operations

Solutions:

  1. Enable pipeline caching

  2. Use appropriate data types

  3. Process data in chunks

  4. Optimize transformation order

Issue: Memory usage exceeded

Error: Out of memory during transformations

Solutions:

# Disable caching
export DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false

# Reduce max categories
export DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5

# Process data in chunks
data_transformer.transform_data(data, chunk_size=10000)

Issue: Data type compatibility

Error: Data type incompatibility with transformations

Solutions:

  1. Check data types before transformation

  2. Convert data types appropriately

  3. Use compatible transformation methods

  4. Validate data structure

Best Practices

Performance Optimization

  1. Pipeline Caching - Enable caching for repeated transformations

  2. Category Management - Set appropriate category limits

  3. Chunk Processing - Process large datasets in chunks

  4. Memory Management - Monitor memory usage during transformations

  5. Transformation Order - Optimize transformation sequence

Error Handling

  1. Graceful Degradation - Handle transformation failures gracefully

  2. Validation - Validate data before transformation

  3. Fallback Strategies - Provide fallback transformation methods

  4. Error Logging - Log errors for debugging and monitoring

  5. User Feedback - Provide clear error messages

Security

  1. Data Privacy - Ensure data privacy during transformations

  2. Access Control - Control access to transformation results

  3. Audit Logging - Log transformation activities

  4. Data Sanitization - Sanitize sensitive data

  5. Compliance - Ensure compliance with data regulations

Resource Management

  1. Memory Monitoring - Monitor memory usage during transformations

  2. Processing Time - Set reasonable timeouts

  3. Storage Optimization - Optimize result storage

  4. Cleanup - Clean up temporary files

  5. Resource Limits - Set appropriate resource limits

Integration

  1. Tool Dependencies - Ensure required tools are available

  2. API Compatibility - Maintain API compatibility

  3. Error Propagation - Properly propagate errors

  4. Logging Integration - Integrate with logging systems

  5. Monitoring - Monitor tool performance and usage

Development vs Production

Development:

DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=3.0
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=mean
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=false
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=5

Production:

DATA_TRANSFORMER_OUTLIER_STD_THRESHOLD=2.5
DATA_TRANSFORMER_DEFAULT_MISSING_STRATEGY=median
DATA_TRANSFORMER_ENABLE_PIPELINE_CACHING=true
DATA_TRANSFORMER_MAX_ONE_HOT_CATEGORIES=20

Error Handling

Always wrap transformation operations in try-except blocks:

from aiecs.tools.statistics.data_transformer_tool import DataTransformerTool, DataTransformerError, TransformationError

data_transformer = DataTransformerTool()

try:
    transformed_data = data_transformer.transform_data(
        data=df,
        transformations=['normalize', 'encode_categorical'],
        missing_strategy='mean'
    )
except TransformationError as e:
    print(f"Transformation error: {e}")
except DataTransformerError as e:
    print(f"Data transformer error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Dependencies

Core Dependencies

# Install core dependencies
pip install pydantic python-dotenv

# Install data processing dependencies
pip install pandas numpy scikit-learn

# Install transformation dependencies
pip install scipy statsmodels

Optional Dependencies

# For advanced transformations
pip install category-encoders feature-engine

# For feature selection
pip install sklearn-feature-selection

# For advanced scaling
pip install scikit-learn-extra

# For pipeline optimization
pip install optuna hyperopt

Verification

# Test dependency availability
try:
    import pandas
    import numpy
    import sklearn
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test transformation availability
try:
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    from sklearn.impute import SimpleImputer
    print("Transformation dependencies available")
except ImportError:
    print("Transformation dependencies not available")

# Test advanced encoding availability
try:
    import category_encoders
    print("Advanced encoding available")
except ImportError:
    print("Advanced encoding not available")

# Test feature engineering availability
try:
    import feature_engine
    print("Feature engineering available")
except ImportError:
    print("Feature engineering not available")

Support

For issues or questions about Data Transformer Tool configuration:

  • Check the tool source code for implementation details

  • Review pandas tool documentation for core data operations

  • Consult the main aiecs documentation for architecture overview

  • Test with simple datasets first to isolate configuration vs. transformation issues

  • Verify data compatibility and format requirements

  • Check transformation parameters and strategies

  • Ensure proper encoding and scaling settings

  • Validate data quality and structure requirements