Model Trainer Tool Configuration Guide
Overview
The Model Trainer Tool is an AutoML and machine learning model training tool. It automatically selects models for classification and regression, tunes hyperparameters, evaluates and compares models, analyzes feature importance, and supports model explanation. It supports several model types (logistic regression, linear regression, random forest, gradient boosting) and task types (classification, regression, clustering). The tool can be configured through environment variables with the MODEL_TRAINER_ prefix or programmatically when it is initialized.
Using .env Files in Your Project
When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Model Trainer Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.
Setting Up .env Files
1. Install python-dotenv:
pip install python-dotenv
2. Create a .env file in your project root:
# .env file in your project root
MODEL_TRAINER_TEST_SIZE=0.2
MODEL_TRAINER_RANDOM_STATE=42
MODEL_TRAINER_CV_FOLDS=5
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
MODEL_TRAINER_MAX_TUNING_ITERATIONS=20
3. Load the .env file in your application:
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv
# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()
# Now import and use aiecs tools
from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool
# The tool will automatically use the environment variables
model_trainer = ModelTrainerTool()
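Because the tool only sees variables that are already in the process environment, it can help to check what is actually loaded before constructing the tool. The following is a minimal sketch using only the standard library; collect_prefixed_settings is a hypothetical helper written for this guide, not part of aiecs:

```python
import os

PREFIX = "MODEL_TRAINER_"


def collect_prefixed_settings(environ=None, prefix=PREFIX):
    """Return {setting_name: raw_string} for every variable with the prefix.

    Hypothetical debugging helper: it only inspects the environment and does
    not reproduce how the tool itself parses or validates values.
    """
    environ = os.environ if environ is None else environ
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


if __name__ == "__main__":
    # Run after load_dotenv() to confirm the .env file was picked up.
    for name, raw in sorted(collect_prefixed_settings().items()):
        print(f"{name} = {raw!r}")
```

If this prints nothing after load_dotenv(), the .env file was probably not found or the variables do not use the MODEL_TRAINER_ prefix.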
Multiple Environment Files
You can use different .env files for different environments:
import os
from dotenv import load_dotenv
# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')
if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')
from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool
model_trainer = ModelTrainerTool()
Example .env.production:
# Production settings - optimized for robust model training
MODEL_TRAINER_TEST_SIZE=0.2
MODEL_TRAINER_RANDOM_STATE=42
MODEL_TRAINER_CV_FOLDS=10
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
MODEL_TRAINER_MAX_TUNING_ITERATIONS=50
Example .env.development:
# Development settings - optimized for testing and debugging
MODEL_TRAINER_TEST_SIZE=0.3
MODEL_TRAINER_RANDOM_STATE=123
MODEL_TRAINER_CV_FOLDS=3
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
MODEL_TRAINER_MAX_TUNING_ITERATIONS=10
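The environment-to-file selection can also be written as a lookup table, which keeps the fallback in one place. This is a sketch under the same assumptions as the example above (an APP_ENV variable and the three file names); select_env_file is a hypothetical helper:

```python
import os

# File names taken from the examples in this guide.
ENV_FILES = {
    "production": ".env.production",
    "staging": ".env.staging",
    "development": ".env.development",
}


def select_env_file(app_env=None):
    """Return the .env file name for an environment, defaulting to development."""
    name = app_env if app_env is not None else os.getenv("APP_ENV", "development")
    return ENV_FILES.get(name, ".env.development")
```

The returned path can then be passed to load_dotenv() before any aiecs import.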
Best Practices for .env Files
Never commit .env files to version control - Add .env to your .gitignore:
# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging
Provide a template - Create .env.example with documented dummy values:
# .env.example
# Model Trainer Tool Configuration
# Proportion of data to use for testing
MODEL_TRAINER_TEST_SIZE=0.2
# Random state for reproducibility
MODEL_TRAINER_RANDOM_STATE=42
# Number of cross-validation folds
MODEL_TRAINER_CV_FOLDS=5
# Whether to enable hyperparameter tuning
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
# Maximum number of hyperparameter tuning iterations
MODEL_TRAINER_MAX_TUNING_ITERATIONS=20
Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format values correctly:
Floats: Decimal numbers: 0.2, 0.3
Integers: Plain numbers: 42, 5, 20
Booleans: true or false
Configuration Options
1. Test Size
Environment Variable: MODEL_TRAINER_TEST_SIZE
Type: Float
Default: 0.2
Description: Proportion of data to use for testing. This determines how much of the dataset is reserved for model evaluation.
Common Values:
0.1 - Small test set (10% for testing)
0.2 - Standard test set (20% for testing, default)
0.3 - Large test set (30% for testing)
0.4 - Very large test set (40% for testing)
Example:
export MODEL_TRAINER_TEST_SIZE=0.2
Test Size Note: Larger test sets provide more reliable evaluation but less training data.
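To make the trade-off concrete, the split sizes can be computed directly from the proportion. This sketch assumes the tool follows the scikit-learn convention of rounding the test count up; that is an assumption about the implementation, and split_counts is a hypothetical helper:

```python
import math


def split_counts(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a dataset of n_samples rows.

    Assumes the test count is rounded up, as scikit-learn's
    train_test_split does; the tool itself may behave differently.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    return n_train, n_test
```

For example, 101 samples with a test size of 0.2 leave 80 rows for training and 21 for evaluation under this convention.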
2. Random State
Environment Variable: MODEL_TRAINER_RANDOM_STATE
Type: Integer
Default: 42
Description: Random state for reproducibility. This ensures consistent results across runs by controlling random number generation.
Common Values:
42 - Standard random state (default)
123 - Alternative random state
0 - Zero random state
None - Truly random (not reproducible)
Example:
export MODEL_TRAINER_RANDOM_STATE=42
Random State Note: Use the same random state for reproducible results across experiments.
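The effect of a fixed random state can be illustrated with the standard library alone. This is a sketch of the general idea (a seeded generator produces the same shuffle every time), not the tool's internal code; shuffled_indices is a hypothetical helper:

```python
import random


def shuffled_indices(n_samples, random_state=None):
    """Return row indices in shuffled order.

    With an integer random_state the order is identical on every call;
    with None the generator is seeded from system entropy.
    """
    rng = random.Random(random_state)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(10, random_state=42)
second = shuffled_indices(10, random_state=42)
print(first == second)  # same seed, same order
```

Any split or sampling step derived from such a shuffle is reproducible as long as the seed and the input data stay the same.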
3. CV Folds
Environment Variable: MODEL_TRAINER_CV_FOLDS
Type: Integer
Default: 5
Description: Number of cross-validation folds for model evaluation. This determines how many times the data is split for cross-validation.
Common Values:
3 - Minimal cross-validation (3 folds)
5 - Standard cross-validation (5 folds, default)
10 - Comprehensive cross-validation (10 folds)
20 - Extensive cross-validation (20 folds)
Example:
export MODEL_TRAINER_CV_FOLDS=5
CV Folds Note: More folds give a more stable performance estimate but take longer to compute, because the model is trained once per fold.
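How the fold count partitions the data can be sketched in plain Python. This illustrates k-fold splitting in general (contiguous folds, no shuffling) and is not taken from the tool's source; kfold_indices is a hypothetical helper:

```python
def kfold_indices(n_samples, cv_folds=5):
    """Return a list of (train_indices, test_indices) pairs, one per fold.

    Folds are contiguous and differ in size by at most one sample.
    """
    if cv_folds < 2 or cv_folds > n_samples:
        raise ValueError("cv_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, cv_folds)
    folds = []
    start = 0
    for i in range(cv_folds):
        # The first `extra` folds absorb the remainder, one sample each.
        size = base + (1 if i < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds
```

Each sample appears in exactly one test fold, so every additional fold means one more full training run on slightly more data.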
4. Enable Hyperparameter Tuning
Environment Variable: MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING
Type: Boolean
Default: False
Description: Whether to enable hyperparameter tuning. When enabled, the tool will automatically search for optimal hyperparameters.
Values:
true - Enable hyperparameter tuning
false - Disable hyperparameter tuning (default)
Example:
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
Tuning Note: Hyperparameter tuning can improve model performance but significantly increases training time.
5. Max Tuning Iterations
Environment Variable: MODEL_TRAINER_MAX_TUNING_ITERATIONS
Type: Integer
Default: 20
Description: Maximum number of hyperparameter tuning iterations. This limits how many different hyperparameter combinations are tested.
Common Values:
10 - Quick tuning (10 iterations)
20 - Standard tuning (20 iterations, default)
50 - Comprehensive tuning (50 iterations)
100 - Extensive tuning (100 iterations)
Example:
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=50
Iterations Note: More iterations may find better hyperparameters but take much longer.
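The role of an iteration cap is easiest to see in a random search, where each iteration evaluates one sampled combination. The sketch below is a generic illustration, not the tool's tuning algorithm; random_search, the objective callable, and the search space are all hypothetical:

```python
import random


def random_search(objective, search_space, max_tuning_iterations=20, random_state=42):
    """Evaluate at most max_tuning_iterations random combinations.

    objective: callable taking a params dict and returning a score (higher is better).
    search_space: dict mapping parameter name to a list of candidate values.
    """
    if max_tuning_iterations < 1:
        raise ValueError("max_tuning_iterations must be a positive integer")
    rng = random.Random(random_state)
    best_params, best_score = None, float("-inf")
    for _ in range(max_tuning_iterations):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Since every iteration costs one objective evaluation (typically a full cross-validated training run), total tuning time grows roughly linearly with the cap.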
Usage Examples
Example 1: Basic Environment Configuration
# Set basic model training parameters
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_RANDOM_STATE=42
export MODEL_TRAINER_CV_FOLDS=5
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=20
# Run your application
python app.py
Example 2: Production Configuration with Tuning
# Optimized for production with hyperparameter tuning
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_RANDOM_STATE=42
export MODEL_TRAINER_CV_FOLDS=10
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=50
Example 3: Development Configuration
# Development-friendly settings
export MODEL_TRAINER_TEST_SIZE=0.3
export MODEL_TRAINER_RANDOM_STATE=123
export MODEL_TRAINER_CV_FOLDS=3
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=10
Example 4: Programmatic Configuration
from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool
# Initialize with custom configuration
model_trainer = ModelTrainerTool(config={
    'test_size': 0.2,
    'random_state': 42,
    'cv_folds': 5,
    'enable_hyperparameter_tuning': False,
    'max_tuning_iterations': 20
})
Example 5: Mixed Configuration
Environment variables are used as defaults, but can be overridden programmatically:
# Set environment defaults (shell)
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
# Override for a specific instance (Python)
from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool
model_trainer = ModelTrainerTool(config={
    'test_size': 0.3,  # This overrides the environment variable
    'enable_hyperparameter_tuning': True  # This overrides the environment variable
})
Configuration Priority
When the Model Trainer Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):
1. Programmatic config - Values passed to the constructor
2. Environment variables - Values set via MODEL_TRAINER_* variables
3. Default values - Built-in defaults as specified above
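The resolution order can be sketched as a three-level lookup. This is an illustration of the priority rule only, using the defaults listed in this guide; resolve_raw_config is a hypothetical helper and does not reproduce the tool's actual loading code:

```python
import os

# Defaults as documented in the Configuration Options section.
DEFAULTS = {
    "test_size": 0.2,
    "random_state": 42,
    "cv_folds": 5,
    "enable_hyperparameter_tuning": False,
    "max_tuning_iterations": 20,
}


def resolve_raw_config(config=None, environ=None, prefix="MODEL_TRAINER_"):
    """Return {name: (value, source)} following the documented priority.

    Environment values are returned as raw strings; type conversion is a
    separate step that this sketch deliberately leaves out.
    """
    environ = os.environ if environ is None else environ
    config = config or {}
    resolved = {}
    for name, default in DEFAULTS.items():
        env_key = prefix + name.upper()
        if name in config:
            resolved[name] = (config[name], "programmatic")
        elif env_key in environ:
            resolved[name] = (environ[env_key], "environment")
        else:
            resolved[name] = (default, "default")
    return resolved
```

Recording the source next to each value makes it straightforward to log where a surprising setting came from.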
Data Type Parsing
Float Values
Floats should be provided as decimal numbers:
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_TEST_SIZE=0.3
Integer Values
Integers should be provided as numeric strings:
export MODEL_TRAINER_RANDOM_STATE=42
export MODEL_TRAINER_CV_FOLDS=5
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=20
Boolean Values
Booleans should be provided as lowercase strings:
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
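Environment variables always arrive as strings, so each one has to be converted to its documented type. The sketch below shows one strict way to do that with the standard library; it is not the tool's parser (which may accept more spellings), and parse_bool and parse_settings are hypothetical helpers. Note that bool("false") is True in Python, which is why booleans need an explicit mapping:

```python
def parse_bool(raw):
    """Convert 'true'/'false' (any letter case) to a bool; reject anything else."""
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {raw!r}")


# One converter per documented setting.
PARSERS = {
    "test_size": float,
    "random_state": int,
    "cv_folds": int,
    "enable_hyperparameter_tuning": parse_bool,
    "max_tuning_iterations": int,
}


def parse_settings(raw_settings):
    """Convert {name: raw_string} to typed values, ignoring unknown names."""
    return {
        name: PARSERS[name](raw)
        for name, raw in raw_settings.items()
        if name in PARSERS
    }
```

A value such as MODEL_TRAINER_CV_FOLDS=5.0 raises ValueError in this sketch, which surfaces formatting mistakes early instead of silently truncating.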
Validation
Automatic Type Validation
Pydantic automatically validates configuration values:
test_size must be a float between 0 and 1
random_state must be a non-negative integer
cv_folds must be a positive integer
enable_hyperparameter_tuning must be a boolean
max_tuning_iterations must be a positive integer
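The constraints above can be mirrored in a small standalone check, for example to validate settings in tests without constructing the tool. This sketch uses a standard-library dataclass as a stand-in for the Pydantic model; TrainerSettings is a hypothetical class, and treating the 0 and 1 bounds as exclusive is an assumption:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainerSettings:
    """Stand-in for the tool's settings; field names follow this guide."""

    test_size: float = 0.2
    random_state: int = 42
    cv_folds: int = 5
    enable_hyperparameter_tuning: bool = False
    max_tuning_iterations: int = 20

    def __post_init__(self):
        # Bounds assumed exclusive: 0 or 1 would leave an empty split.
        if not 0.0 < self.test_size < 1.0:
            raise ValueError("test_size must be between 0 and 1")
        if self.random_state < 0:
            raise ValueError("random_state must be a non-negative integer")
        if self.cv_folds < 1:
            raise ValueError("cv_folds must be a positive integer")
        if not isinstance(self.enable_hyperparameter_tuning, bool):
            raise ValueError("enable_hyperparameter_tuning must be a boolean")
        if self.max_tuning_iterations < 1:
            raise ValueError("max_tuning_iterations must be a positive integer")
```

Constructing TrainerSettings(test_size=1.5) raises ValueError immediately, which is the same fail-fast behavior the automatic validation is meant to provide.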
Runtime Validation
When training models, the tool validates:
Test size - Test size must be reasonable for the dataset
Cross-validation - CV folds must be appropriate for data size
Hyperparameter tuning - Tuning iterations must be reasonable
Data compatibility - Data must be compatible with model training
Memory requirements - Training must not exceed memory limits
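Some of these runtime checks depend on the dataset rather than on the settings alone. The sketch below shows what a pre-flight check for the first two items might look like; it assumes cross-validation runs on the training portion and that the test count is rounded up, both of which are assumptions, and check_split_feasibility is a hypothetical helper:

```python
import math


def check_split_feasibility(n_samples, test_size=0.2, cv_folds=5):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    if n_test < 1 or n_train < 1:
        problems.append(
            f"test_size={test_size} leaves {n_train} training and {n_test} test samples"
        )
    if cv_folds > n_train:
        problems.append(
            f"cv_folds={cv_folds} exceeds the {n_train} available training samples"
        )
    return problems
```

Running such a check before training turns an obscure mid-training failure into a clear message about which setting to adjust.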
Model Types
The Model Trainer Tool supports various model types:
Classification Models
Logistic Regression - Linear classification model
Random Forest Classifier - Ensemble classification model
Gradient Boosting Classifier - Gradient boosting classification
Regression Models
Linear Regression - Linear regression model
Random Forest Regressor - Ensemble regression model
Gradient Boosting Regressor - Gradient boosting regression
Auto Selection
Auto - Automatically select best model type
Task Types
Classification
Binary classification
Multi-class classification
Multi-label classification
Regression
Linear regression
Non-linear regression
Time series regression
Clustering
K-means clustering
Hierarchical clustering
Density-based clustering
Operations Supported
The Model Trainer Tool supports comprehensive machine learning operations:
Basic Training
train_model - Train machine learning models
train_classifier - Train classification models
train_regressor - Train regression models
auto_train - Automatically train best model
train_multiple_models - Train multiple model types
Hyperparameter Tuning
tune_hyperparameters - Perform hyperparameter tuning
grid_search - Grid search hyperparameter optimization
random_search - Random search hyperparameter optimization
bayesian_optimization - Bayesian hyperparameter optimization
optimize_model - Optimize model hyperparameters
Model Evaluation
evaluate_model - Evaluate model performance
cross_validate - Perform cross-validation
compare_models - Compare multiple models
generate_metrics - Generate performance metrics
create_evaluation_report - Create comprehensive evaluation report
Feature Analysis
analyze_feature_importance - Analyze feature importance
select_features - Select important features
rank_features - Rank features by importance
generate_feature_report - Generate feature analysis report
visualize_features - Visualize feature importance
Model Management
save_model - Save trained models
load_model - Load saved models
export_model - Export models in various formats
create_model_pipeline - Create model training pipeline
deploy_model - Deploy models for inference
Advanced Operations
explain_model - Generate model explanations
create_model_report - Create comprehensive model report
validate_model - Validate model performance
optimize_model_size - Optimize model for deployment
benchmark_models - Benchmark model performance
Troubleshooting
Issue: Model training fails
Error: TrainingError during model training
Solutions:
Check data quality and format
Verify feature engineering
Check memory availability
Validate hyperparameters
Issue: Hyperparameter tuning takes too long
Error: Tuning process is very slow
Solutions:
# Reduce tuning iterations
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=10
# Disable tuning for development
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
Issue: Cross-validation errors
Error: CV validation fails
Solutions:
# Reduce CV folds (shell)
export MODEL_TRAINER_CV_FOLDS=3
# Check data size and quality (Python)
model_trainer.validate_data(data)
Issue: Memory usage exceeded
Error: Out of memory during training
Solutions:
Reduce dataset size
Use simpler models
Disable hyperparameter tuning
Process data in batches
Issue: Poor model performance
Error: Low model accuracy/scores
Solutions:
Enable hyperparameter tuning
Increase CV folds for better evaluation
Check feature engineering
Try different model types
Issue: Non-reproducible results
Error: Results vary between runs
Solutions:
# Set fixed random state (shell)
export MODEL_TRAINER_RANDOM_STATE=42
# Ensure consistent data preprocessing (Python)
model_trainer.set_random_state(42)
Issue: Test set too small/large
Error: Unreliable model evaluation
Solutions:
# Adjust test size
export MODEL_TRAINER_TEST_SIZE=0.2
# Or use cross-validation instead
export MODEL_TRAINER_CV_FOLDS=10
Best Practices
Performance Optimization
Test Size Selection - Choose appropriate test size for dataset
CV Folds - Use appropriate number of CV folds
Hyperparameter Tuning - Enable only when needed
Model Selection - Choose appropriate model types
Feature Engineering - Optimize feature selection
Error Handling
Graceful Degradation - Handle training failures gracefully
Validation - Validate data before training
Fallback Strategies - Provide fallback model types
Error Logging - Log errors for debugging and monitoring
User Feedback - Provide clear error messages
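Graceful degradation, fallback model types, and error logging can be combined in one small wrapper. The sketch below is generic: train_with_fallback and the train_fn callable are hypothetical, and train_fn stands in for whatever training call your application makes:

```python
import logging

logger = logging.getLogger("model_trainer_example")


def train_with_fallback(train_fn, model_types):
    """Try each model type in order and return (model_type, model) for the first success.

    train_fn: callable taking a model type name and returning a trained model.
    Every failure is logged; if all candidates fail, a RuntimeError lists them.
    """
    errors = {}
    for model_type in model_types:
        try:
            return model_type, train_fn(model_type)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            logger.warning("Training with %s failed: %s", model_type, exc)
            errors[model_type] = exc
    raise RuntimeError(f"All model types failed: {sorted(errors)}")
```

Ordering the candidates from most to least capable means the application degrades to a simpler model instead of failing outright.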
Security
Data Privacy - Ensure data privacy during training
Model Security - Secure trained models
Access Control - Control access to training results
Audit Logging - Log training activities
Compliance - Ensure compliance with regulations
Resource Management
Memory Monitoring - Monitor memory usage during training
Processing Time - Set reasonable timeouts
Storage Optimization - Optimize model storage
Cleanup - Clean up temporary files
Resource Limits - Set appropriate resource limits
Integration
Tool Dependencies - Ensure required tools are available
API Compatibility - Maintain API compatibility
Error Propagation - Properly propagate errors
Logging Integration - Integrate with logging systems
Monitoring - Monitor tool performance and usage
Development vs Production
Development:
MODEL_TRAINER_TEST_SIZE=0.3
MODEL_TRAINER_RANDOM_STATE=123
MODEL_TRAINER_CV_FOLDS=3
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
MODEL_TRAINER_MAX_TUNING_ITERATIONS=10
Production:
MODEL_TRAINER_TEST_SIZE=0.2
MODEL_TRAINER_RANDOM_STATE=42
MODEL_TRAINER_CV_FOLDS=10
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
MODEL_TRAINER_MAX_TUNING_ITERATIONS=50
Error Handling
Always wrap training operations in try-except blocks:
from aiecs.tools.statistics.model_trainer_tool import (
    ModelTrainerTool,
    ModelTrainerError,
    TrainingError,
)
model_trainer = ModelTrainerTool()
# X_train and y_train are your prepared training features and labels
try:
    model = model_trainer.train_model(
        X_train=X_train,
        y_train=y_train,
        model_type='auto',
        task_type='classification'
    )
except TrainingError as e:
    print(f"Training error: {e}")
except ModelTrainerError as e:
    print(f"Model trainer error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Dependencies
Core Dependencies
# Install core dependencies
pip install pydantic python-dotenv
# Install data processing dependencies
pip install pandas numpy
# Install machine learning dependencies
pip install scikit-learn xgboost lightgbm
Optional Dependencies
# For hyperparameter tuning
pip install optuna hyperopt scikit-optimize
# For model explanation
pip install shap lime
# For advanced models
pip install catboost
# For model deployment (pickle is part of the Python standard library and needs no install)
pip install joblib
Verification
# Test dependency availability
try:
    import pandas
    import numpy
    import sklearn
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")
# Test ML libraries availability
try:
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    print("Scikit-learn available")
except ImportError:
    print("Scikit-learn not available")
# Test hyperparameter tuning availability
try:
    import optuna
    print("Hyperparameter tuning available")
except ImportError:
    print("Hyperparameter tuning not available")
# Test model explanation availability
try:
    import shap
    import lime
    print("Model explanation available")
except ImportError:
    print("Model explanation not available")
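The same availability check can be done without importing the heavy libraries, which is faster and reports every missing package at once. This is a sketch using importlib from the standard library; missing_modules is a hypothetical helper, and the names passed to it are import names (for example sklearn, not scikit-learn):

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    core = ["pandas", "numpy", "sklearn", "pydantic"]
    optional = ["optuna", "shap", "lime"]
    print("Missing core modules:", missing_modules(core))
    print("Missing optional modules:", missing_modules(optional))
```

An empty list for the core group means the basic training path should import cleanly; missing optional modules only affect tuning and explanation features.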
Support
For issues or questions about Model Trainer Tool configuration:
Check the tool source code for implementation details
Review statistics tool documentation for statistical analysis
Consult the main aiecs documentation for architecture overview
Test with simple datasets first to isolate configuration vs. training issues
Verify data compatibility and format requirements
Check model type and task type settings
Ensure proper hyperparameter tuning configuration
Validate data quality and preprocessing requirements