Model Trainer Tool Configuration Guide

Overview

The Model Trainer Tool is an AutoML and machine learning model training tool. It provides automatic model selection for classification and regression, hyperparameter tuning, model evaluation and comparison, feature importance analysis, and model explanation support. It supports several model types (logistic regression, linear regression, random forest, gradient boosting) and task types (classification, regression, clustering). The tool can be configured through environment variables that use the MODEL_TRAINER_ prefix, or programmatically when the tool is initialized.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Model Trainer Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
MODEL_TRAINER_TEST_SIZE=0.2
MODEL_TRAINER_RANDOM_STATE=42
MODEL_TRAINER_CV_FOLDS=5
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
MODEL_TRAINER_MAX_TUNING_ITERATIONS=20

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool

# The tool will automatically use the environment variables
model_trainer = ModelTrainerTool()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool
model_trainer = ModelTrainerTool()

Example .env.production:

# Production settings - optimized for robust model training
MODEL_TRAINER_TEST_SIZE=0.2
MODEL_TRAINER_RANDOM_STATE=42
MODEL_TRAINER_CV_FOLDS=10
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
MODEL_TRAINER_MAX_TUNING_ITERATIONS=50

Example .env.development:

# Development settings - optimized for testing and debugging
MODEL_TRAINER_TEST_SIZE=0.3
MODEL_TRAINER_RANDOM_STATE=123
MODEL_TRAINER_CV_FOLDS=3
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
MODEL_TRAINER_MAX_TUNING_ITERATIONS=10

Best Practices for .env Files

Never commit .env files to version control - Add .env to your .gitignore:

# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging

Provide a template - Create .env.example with documented dummy values:

# .env.example
# Model Trainer Tool Configuration

# Proportion of data to use for testing
MODEL_TRAINER_TEST_SIZE=0.2

# Random state for reproducibility
MODEL_TRAINER_RANDOM_STATE=42

# Number of cross-validation folds
MODEL_TRAINER_CV_FOLDS=5

# Whether to enable hyperparameter tuning
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false

# Maximum number of hyperparameter tuning iterations
MODEL_TRAINER_MAX_TUNING_ITERATIONS=20

Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format values correctly:
- Floats: decimal numbers such as 0.2 or 0.3
- Integers: plain whole numbers such as 42, 5, or 20
- Booleans: true or false

Configuration Options

1. Test Size

Environment Variable: MODEL_TRAINER_TEST_SIZE

Type: Float

Default: 0.2

Description: Proportion of data to use for testing. This determines how much of the dataset is reserved for model evaluation.

Common Values:

0.1 - Small test set (10% for testing)
0.2 - Standard test set (20% for testing, default)
0.3 - Large test set (30% for testing)
0.4 - Very large test set (40% for testing)

Example:

export MODEL_TRAINER_TEST_SIZE=0.2

Test Size Note: A larger test set gives a more reliable performance estimate but leaves less data for training.
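To make the trade-off concrete, the sketch below computes how many rows end up in each split for a given test_size. It assumes the test count is rounded up, which is how scikit-learn's train_test_split behaves; whether the tool splits the same way is an assumption, and split_counts is a hypothetical helper that is not part of aiecs.

```python
import math


def split_counts(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the remaining rows are
    used for training, mirroring scikit-learn's train_test_split.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    if n_train == 0:
        raise ValueError("test_size leaves no rows for training")
    return n_train, n_test


# Compare the documented common values on a 1000-row dataset
for size in (0.1, 0.2, 0.3, 0.4):
    n_train, n_test = split_counts(1000, size)
    print(f"test_size={size}: train={n_train}, test={n_test}")
```

On small datasets the rounding matters: a 7-row dataset with test_size=0.5 gets 4 test rows and only 3 training rows.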

2. Random State

Environment Variable: MODEL_TRAINER_RANDOM_STATE

Type: Integer

Default: 42

Description: Random state for reproducibility. This ensures consistent results across runs by controlling random number generation.

Common Values:

42 - Standard random state (default)
123 - Alternative random state
0 - Zero random state
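None - Unseeded, so results are not reproducible; because the documented type is a non-negative integer, confirm that the tool accepts this value before relying on it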

Example:

export MODEL_TRAINER_RANDOM_STATE=42

Random State Note: Use the same random state for reproducible results across experiments.
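The effect of a fixed seed can be illustrated with the standard library alone. The sketch below is not the tool's implementation; it only shows the general principle that a generator seeded with the same value produces the same shuffle, which is what makes a train/test split repeatable. The helper name shuffled_indices is hypothetical.

```python
import random


def shuffled_indices(n: int, seed):
    """Return the indices 0..n-1 shuffled by a dedicated, seeded generator."""
    rng = random.Random(seed)  # seed=None draws from OS entropy instead
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)
other = shuffled_indices(10, seed=123)

print("same seed, same order:", first == second)
print("seed 42: ", first)
print("seed 123:", other)
```

Using a dedicated generator rather than the module-level random functions keeps the result independent of any other code that consumes random numbers in the same process.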

3. CV Folds

Environment Variable: MODEL_TRAINER_CV_FOLDS

Type: Integer

Default: 5

Description: Number of cross-validation folds for model evaluation. This determines how many times the data is split for cross-validation.

Common Values:

3 - Minimal cross-validation (3 folds)
5 - Standard cross-validation (5 folds, default)
10 - Comprehensive cross-validation (10 folds)
20 - Extensive cross-validation (20 folds)

Example:

export MODEL_TRAINER_CV_FOLDS=5

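CV Folds Note: More folds train each model on a larger share of the data and average the score over more splits, but the number of model fits, and so the compute time, grows with the fold count.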
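As a rough illustration of what a fold count means, the following sketch partitions sample indices into k contiguous validation folds. It is a plain, unshuffled k-fold written with the standard library; the tool's actual splitting strategy (shuffling, stratification) is not documented here, so treat kfold_indices as a hypothetical stand-in.

```python
def kfold_indices(n_samples: int, n_folds: int):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold CV."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds absorb the remainder, one row each
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


for train_idx, val_idx in kfold_indices(10, 3):
    print(f"train on {len(train_idx)} rows, validate on {len(val_idx)} rows")
```

Every row is used for validation exactly once, and each extra fold adds one more model fit.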

4. Enable Hyperparameter Tuning

Environment Variable: MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING

Type: Boolean

Default: False

Description: Whether to enable hyperparameter tuning. When enabled, the tool will automatically search for optimal hyperparameters.

Values:

true - Enable hyperparameter tuning
false - Disable hyperparameter tuning (default)

Example:

export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true

Tuning Note: Hyperparameter tuning can improve model performance but significantly increases training time.

5. Max Tuning Iterations

Environment Variable: MODEL_TRAINER_MAX_TUNING_ITERATIONS

Type: Integer

Default: 20

Description: Maximum number of hyperparameter tuning iterations. This limits how many different hyperparameter combinations are tested.

Common Values:

10 - Quick tuning (10 iterations)
20 - Standard tuning (20 iterations, default)
50 - Comprehensive tuning (50 iterations)
100 - Extensive tuning (100 iterations)

Example:

export MODEL_TRAINER_MAX_TUNING_ITERATIONS=50

Iterations Note: More iterations may find better hyperparameters but take much longer.
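A back-of-the-envelope estimate helps when choosing this value. The sketch below assumes that every tuning iteration is scored with cross-validation using the configured fold count, followed by one final refit on the full training set, as scikit-learn's randomized search does. Whether the tool tunes this way is an assumption, and estimated_model_fits is a hypothetical helper.

```python
def estimated_model_fits(max_tuning_iterations: int, cv_folds: int, refit: bool = True) -> int:
    """Estimate how many times a model is trained during tuning.

    Assumption: each candidate is evaluated on every CV fold, and the best
    candidate is optionally refit once on the full training set.
    """
    fits = max_tuning_iterations * cv_folds
    return fits + 1 if refit else fits


development = estimated_model_fits(max_tuning_iterations=10, cv_folds=3)
production = estimated_model_fits(max_tuning_iterations=50, cv_folds=10)
print("development fits:", development)
print("production fits:", production)
```

Under these assumptions the cost is the product of the two settings, so raising both at once multiplies training time rather than adding to it.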

Usage Examples

Example 1: Basic Environment Configuration

# Set basic model training parameters
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_RANDOM_STATE=42
export MODEL_TRAINER_CV_FOLDS=5
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=20

# Run your application
python app.py

Example 2: Production Configuration with Tuning

# Optimized for production with hyperparameter tuning
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_RANDOM_STATE=42
export MODEL_TRAINER_CV_FOLDS=10
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=50

Example 3: Development Configuration

# Development-friendly settings
export MODEL_TRAINER_TEST_SIZE=0.3
export MODEL_TRAINER_RANDOM_STATE=123
export MODEL_TRAINER_CV_FOLDS=3
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=10

Example 4: Programmatic Configuration

from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool

# Initialize with custom configuration
model_trainer = ModelTrainerTool(config={
    'test_size': 0.2,
    'random_state': 42,
    'cv_folds': 5,
    'enable_hyperparameter_tuning': False,
    'max_tuning_iterations': 20
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

# In your shell: set environment defaults
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false

# In your Python code: override for a specific instance
from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool

model_trainer = ModelTrainerTool(config={
    'test_size': 0.3,  # This overrides the environment variable
    'enable_hyperparameter_tuning': True  # This overrides the environment variable
})

Configuration Priority

When the Model Trainer Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

Programmatic config - Values passed to the constructor
Environment variables - Values set via MODEL_TRAINER_* variables
Default values - Built-in defaults as specified above
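The resolution order above can be sketched as a three-layer merge. This is an illustration of the precedence rule only, not the tool's code: resolve_config is a hypothetical function, and environment values are left as raw strings because type conversion is a separate step.

```python
import os

# Built-in defaults as documented in the Configuration Options section
DEFAULTS = {
    "test_size": 0.2,
    "random_state": 42,
    "cv_folds": 5,
    "enable_hyperparameter_tuning": False,
    "max_tuning_iterations": 20,
}
ENV_PREFIX = "MODEL_TRAINER_"


def resolve_config(overrides=None, environ=None):
    """Merge defaults, environment variables, and constructor overrides.

    Later layers win: defaults < environment variables < programmatic config.
    """
    environ = os.environ if environ is None else environ
    resolved = dict(DEFAULTS)
    for key in DEFAULTS:
        env_name = ENV_PREFIX + key.upper()
        if env_name in environ:
            resolved[key] = environ[env_name]  # raw string, not yet parsed
    resolved.update(overrides or {})
    return resolved


# A fake environment keeps the example independent of the real process state
fake_env = {"MODEL_TRAINER_TEST_SIZE": "0.25", "MODEL_TRAINER_CV_FOLDS": "10"}
print(resolve_config({"test_size": 0.3}, environ=fake_env))
```

In this example test_size comes from the programmatic override, cv_folds from the environment, and the remaining keys from the defaults.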

Data Type Parsing

Float Values

Floats should be provided as decimal numbers:

export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_TEST_SIZE=0.3

Integer Values

Integers should be provided as whole numbers without a decimal point:

export MODEL_TRAINER_RANDOM_STATE=42
export MODEL_TRAINER_CV_FOLDS=5
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=20

Boolean Values

Booleans should be provided as the lowercase strings true or false:

export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
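Environment variables always arrive as strings, so each one has to be converted to its documented type. The sketch below shows one strict way to do that with the standard library. It is not the tool's parser: parse_bool and parse_env are hypothetical helpers, and the tool's own parsing (handled by Pydantic) may accept more spellings than the true and false documented here.

```python
def parse_bool(raw: str) -> bool:
    """Convert 'true'/'false' to a bool, rejecting anything else."""
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"not a boolean: {raw!r}")


# One converter per documented variable
PARSERS = {
    "MODEL_TRAINER_TEST_SIZE": float,
    "MODEL_TRAINER_RANDOM_STATE": int,
    "MODEL_TRAINER_CV_FOLDS": int,
    "MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING": parse_bool,
    "MODEL_TRAINER_MAX_TUNING_ITERATIONS": int,
}


def parse_env(environ):
    """Return typed values for every known variable present in environ."""
    return {
        name: parser(environ[name])
        for name, parser in PARSERS.items()
        if name in environ
    }


sample = {
    "MODEL_TRAINER_TEST_SIZE": "0.3",
    "MODEL_TRAINER_CV_FOLDS": "5",
    "MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING": "true",
}
print(parse_env(sample))
```

With this strict approach a value such as 5.0 for an integer setting raises ValueError instead of being silently truncated.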

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

test_size must be a float between 0 and 1
random_state must be a non-negative integer
cv_folds must be a positive integer
enable_hyperparameter_tuning must be a boolean
max_tuning_iterations must be a positive integer
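The tool relies on Pydantic for these checks. To show what the rules amount to without depending on Pydantic, the sketch below restates them with a standard-library dataclass. TrainerSettings is a hypothetical class, and treating the test_size bounds as exclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainerSettings:
    """Stand-in for the tool's settings model, with the documented defaults."""

    test_size: float = 0.2
    random_state: int = 42
    cv_folds: int = 5
    enable_hyperparameter_tuning: bool = False
    max_tuning_iterations: int = 20

    def __post_init__(self):
        # Assumption: 0 and 1 themselves are rejected for test_size
        if not 0.0 < self.test_size < 1.0:
            raise ValueError("test_size must be a float between 0 and 1")
        if self.random_state < 0:
            raise ValueError("random_state must be a non-negative integer")
        if self.cv_folds < 1:
            raise ValueError("cv_folds must be a positive integer")
        if not isinstance(self.enable_hyperparameter_tuning, bool):
            raise ValueError("enable_hyperparameter_tuning must be a boolean")
        if self.max_tuning_iterations < 1:
            raise ValueError("max_tuning_iterations must be a positive integer")


print(TrainerSettings())

try:
    TrainerSettings(test_size=1.5)
except ValueError as exc:
    print("rejected:", exc)
```

Validating at construction time means a bad value fails immediately at startup rather than partway through a long training run.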

Runtime Validation

When training models, the tool validates:

Test size - Test size must be reasonable for the dataset
Cross-validation - CV folds must be appropriate for data size
Hyperparameter tuning - Tuning iterations must be reasonable
Data compatibility - Data must be compatible with model training
Memory requirements - Training must not exceed memory limits
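The cross-validation check can be approximated before training starts. The sketch below flags a fold count that is too large for the dataset or for the rarest class. It assumes stratified folds for classification, where each class should appear in every fold; check_cv_folds is a hypothetical pre-check, not the tool's own validation.

```python
from collections import Counter


def check_cv_folds(labels, cv_folds):
    """Return a list of problems that would make k-fold CV fail or mislead."""
    problems = []
    if cv_folds > len(labels):
        problems.append(
            f"cv_folds={cv_folds} exceeds the {len(labels)} available samples"
        )
    if labels:
        # Find the least frequent class and how often it occurs
        rarest, count = min(Counter(labels).items(), key=lambda item: item[1])
        if count < cv_folds:
            problems.append(
                f"class {rarest!r} has only {count} samples, "
                f"fewer than cv_folds={cv_folds}"
            )
    return problems


labels = ["a"] * 8 + ["b"] * 2
print(check_cv_folds(labels, 5))
print(check_cv_folds(labels, 2))
```

An empty list means the fold count passes both checks; otherwise lowering MODEL_TRAINER_CV_FOLDS or collecting more data for the rare class are the usual remedies.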

Model Types

The Model Trainer Tool supports various model types:

Classification Models

Logistic Regression - Linear classification model
Random Forest Classifier - Ensemble classification model
Gradient Boosting Classifier - Gradient boosting classification

Regression Models

Linear Regression - Linear regression model
Random Forest Regressor - Ensemble regression model
Gradient Boosting Regressor - Gradient boosting regression

Auto Selection

Auto - Automatically select best model type

Task Types

Classification

Binary classification
Multi-class classification
Multi-label classification

Regression

Linear regression
Non-linear regression
Time series regression

Clustering

K-means clustering
Hierarchical clustering
Density-based clustering

Operations Supported

The Model Trainer Tool supports comprehensive machine learning operations:

Basic Training

train_model - Train machine learning models
train_classifier - Train classification models
train_regressor - Train regression models
auto_train - Automatically train best model
train_multiple_models - Train multiple model types

Hyperparameter Tuning

tune_hyperparameters - Perform hyperparameter tuning
grid_search - Grid search hyperparameter optimization
random_search - Random search hyperparameter optimization
bayesian_optimization - Bayesian hyperparameter optimization
optimize_model - Optimize model hyperparameters

Model Evaluation

evaluate_model - Evaluate model performance
cross_validate - Perform cross-validation
compare_models - Compare multiple models
generate_metrics - Generate performance metrics
create_evaluation_report - Create comprehensive evaluation report

Feature Analysis

analyze_feature_importance - Analyze feature importance
select_features - Select important features
rank_features - Rank features by importance
generate_feature_report - Generate feature analysis report
visualize_features - Visualize feature importance

Model Management

save_model - Save trained models
load_model - Load saved models
export_model - Export models in various formats
create_model_pipeline - Create model training pipeline
deploy_model - Deploy models for inference

Advanced Operations

explain_model - Generate model explanations
create_model_report - Create comprehensive model report
validate_model - Validate model performance
optimize_model_size - Optimize model for deployment
benchmark_models - Benchmark model performance

Troubleshooting

Issue: Model training fails

Error: TrainingError during model training

Solutions:

Check data quality and format
Verify feature engineering
Check memory availability
Validate hyperparameters

Issue: Hyperparameter tuning takes too long

Error: Tuning process is very slow

Solutions:

# Reduce tuning iterations
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=10

# Disable tuning for development
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false

Issue: Cross-validation errors

Error: CV validation fails

Solutions:

# In your shell: reduce CV folds
export MODEL_TRAINER_CV_FOLDS=3

# In your Python code: check data size and quality
model_trainer.validate_data(data)

Issue: Memory usage exceeded

Error: Out of memory during training

Solutions:

Reduce dataset size
Use simpler models
Disable hyperparameter tuning
Process data in batches

Issue: Poor model performance

Error: Low model accuracy/scores

Solutions:

Enable hyperparameter tuning
Increase CV folds for better evaluation
Check feature engineering
Try different model types

Issue: Non-reproducible results

Error: Results vary between runs

Solutions:

# In your shell: set a fixed random state
export MODEL_TRAINER_RANDOM_STATE=42

# In your Python code: ensure consistent data preprocessing
model_trainer.set_random_state(42)

Issue: Test set too small/large

Error: Unreliable model evaluation

Solutions:

# Adjust test size
export MODEL_TRAINER_TEST_SIZE=0.2

# Or use cross-validation instead
export MODEL_TRAINER_CV_FOLDS=10

Best Practices

Performance Optimization

Test Size Selection - Choose appropriate test size for dataset
CV Folds - Use appropriate number of CV folds
Hyperparameter Tuning - Enable only when needed
Model Selection - Choose appropriate model types
Feature Engineering - Optimize feature selection

Error Handling

Graceful Degradation - Handle training failures gracefully
Validation - Validate data before training
Fallback Strategies - Provide fallback model types
Error Logging - Log errors for debugging and monitoring
User Feedback - Provide clear error messages

Security

Data Privacy - Ensure data privacy during training
Model Security - Secure trained models
Access Control - Control access to training results
Audit Logging - Log training activities
Compliance - Ensure compliance with regulations

Resource Management

Memory Monitoring - Monitor memory usage during training
Processing Time - Set reasonable timeouts
Storage Optimization - Optimize model storage
Cleanup - Clean up temporary files
Resource Limits - Set appropriate resource limits

Integration

Tool Dependencies - Ensure required tools are available
API Compatibility - Maintain API compatibility
Error Propagation - Properly propagate errors
Logging Integration - Integrate with logging systems
Monitoring - Monitor tool performance and usage

Development vs Production

Development:

MODEL_TRAINER_TEST_SIZE=0.3
MODEL_TRAINER_RANDOM_STATE=123
MODEL_TRAINER_CV_FOLDS=3
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
MODEL_TRAINER_MAX_TUNING_ITERATIONS=10

Production:

MODEL_TRAINER_TEST_SIZE=0.2
MODEL_TRAINER_RANDOM_STATE=42
MODEL_TRAINER_CV_FOLDS=10
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
MODEL_TRAINER_MAX_TUNING_ITERATIONS=50

Error Handling

Always wrap training operations in try-except blocks:

from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool, ModelTrainerError, TrainingError

# X_train and y_train are assumed to hold your prepared features and labels
model_trainer = ModelTrainerTool()

try:
    model = model_trainer.train_model(
        X_train=X_train,
        y_train=y_train,
        model_type='auto',
        task_type='classification'
    )
except TrainingError as e:
    print(f"Training error: {e}")
except ModelTrainerError as e:
    print(f"Model trainer error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Dependencies

Core Dependencies

# Install core dependencies
pip install pydantic python-dotenv

# Install data processing dependencies
pip install pandas numpy

# Install machine learning dependencies
pip install scikit-learn xgboost lightgbm

Optional Dependencies

# For hyperparameter tuning
pip install optuna hyperopt scikit-optimize

# For model explanation
pip install shap lime

# For advanced models
pip install catboost

# For model deployment (pickle is part of the Python standard library and needs no install)
pip install joblib

Verification

# Test dependency availability
try:
    import pandas
    import numpy
    import sklearn
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test ML libraries availability
try:
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    print("Scikit-learn available")
except ImportError:
    print("Scikit-learn not available")

# Test hyperparameter tuning availability
try:
    import optuna
    print("Hyperparameter tuning available")
except ImportError:
    print("Hyperparameter tuning not available")

# Test model explanation availability
try:
    import shap
    import lime
    print("Model explanation available")
except ImportError:
    print("Model explanation not available")

Support

For issues or questions about Model Trainer Tool configuration:

Check the tool source code for implementation details
Review statistics tool documentation for statistical analysis
Consult the main aiecs documentation for architecture overview
Test with simple datasets first to isolate configuration vs. training issues
Verify data compatibility and format requirements
Check model type and task type settings
Ensure proper hyperparameter tuning configuration
Validate data quality and preprocessing requirements
