# Model Trainer Tool Configuration Guide

## Overview

The Model Trainer Tool is an AutoML and machine learning model training tool. It selects models automatically for classification and regression, tunes hyperparameters, evaluates and compares models, analyzes feature importance, and supports model explanations. Supported model types are logistic regression, linear regression, random forest, and gradient boosting; supported task types are classification, regression, and clustering. Configure the tool through environment variables with the `MODEL_TRAINER_` prefix, or pass a configuration programmatically when you initialize it.

## Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The Model Trainer Tool reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools.

### Setting Up .env Files

**1. Install python-dotenv:**

```bash
pip install python-dotenv
```

**2. Create a `.env` file in your project root:**

```bash
# .env file in your project root
MODEL_TRAINER_TEST_SIZE=0.2
MODEL_TRAINER_RANDOM_STATE=42
MODEL_TRAINER_CV_FOLDS=5
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
MODEL_TRAINER_MAX_TUNING_ITERATIONS=20
```

**3. Load the .env file in your application:**

```python
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool

# The tool will automatically use the environment variables
model_trainer = ModelTrainerTool()
```

### Multiple Environment Files

You can use different `.env` files for different environments:

```python
import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool
model_trainer = ModelTrainerTool()
```

**Example `.env.production`:**
```bash
# Production settings - optimized for robust model training
MODEL_TRAINER_TEST_SIZE=0.2
MODEL_TRAINER_RANDOM_STATE=42
MODEL_TRAINER_CV_FOLDS=10
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
MODEL_TRAINER_MAX_TUNING_ITERATIONS=50
```

**Example `.env.development`:**
```bash
# Development settings - optimized for testing and debugging
MODEL_TRAINER_TEST_SIZE=0.3
MODEL_TRAINER_RANDOM_STATE=123
MODEL_TRAINER_CV_FOLDS=3
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
MODEL_TRAINER_MAX_TUNING_ITERATIONS=10
```

### Best Practices for .env Files

1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`:
   ```gitignore
   # .gitignore
   .env
   .env.local
   .env.*.local
   .env.production
   .env.staging
   ```

2. **Provide a template** - Create `.env.example` with documented dummy values:
   ```bash
   # .env.example
   # Model Trainer Tool Configuration
   
   # Proportion of data to use for testing
   MODEL_TRAINER_TEST_SIZE=0.2
   
   # Random state for reproducibility
   MODEL_TRAINER_RANDOM_STATE=42
   
   # Number of cross-validation folds
   MODEL_TRAINER_CV_FOLDS=5
   
   # Whether to enable hyperparameter tuning
   MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
   
   # Maximum number of hyperparameter tuning iterations
   MODEL_TRAINER_MAX_TUNING_ITERATIONS=20
   ```

3. **Document your variables** - Add comments explaining each setting

4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports

5. **Format values correctly**:
   - Floats: decimal numbers such as `0.2`, `0.3`
   - Integers: plain numbers such as `42`, `5`, `20`
   - Booleans: `true` or `false`

## Configuration Options

### 1. Test Size

**Environment Variable:** `MODEL_TRAINER_TEST_SIZE`

**Type:** Float

**Default:** `0.2`

**Description:** Proportion of data to use for testing. This determines how much of the dataset is reserved for model evaluation.

**Common Values:**
- `0.1` - Small test set (10% for testing)
- `0.2` - Standard test set (20% for testing, default)
- `0.3` - Large test set (30% for testing)
- `0.4` - Very large test set (40% for testing)

**Example:**
```bash
export MODEL_TRAINER_TEST_SIZE=0.2
```

**Test Size Note:** Larger test sets provide more reliable evaluation but less training data.
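
To make the trade-off concrete, the following sketch computes how many samples land in each split for a given `test_size`. The helper `split_counts` is hypothetical and not part of aiecs, and it assumes the test share is rounded up; the tool's own rounding may differ.

```python
import math
from typing import Tuple


def split_counts(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test) for a fractional test size (hypothetical helper)."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    n_test = math.ceil(n_samples * test_size)  # assumption: round the test share up
    n_train = n_samples - n_test
    if n_train == 0:
        raise ValueError("test_size leaves no samples for training")
    return n_train, n_test


# Show how the common values divide a 1000-row dataset
for size in (0.1, 0.2, 0.3, 0.4):
    n_train, n_test = split_counts(1000, size)
    print(f"test_size={size}: {n_train} train / {n_test} test")
```

On small datasets the same fraction leaves very few test rows, which is when the evaluation becomes noisy.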

### 2. Random State

**Environment Variable:** `MODEL_TRAINER_RANDOM_STATE`

**Type:** Integer

**Default:** `42`

**Description:** Random state for reproducibility. This ensures consistent results across runs by controlling random number generation.

**Common Values:**
- `42` - Standard random state (default)
- `123` - Alternative random state
- `0` - Zero random state
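- `None` - Unseeded, so results differ between runs. The Validation section requires a non-negative integer, so confirm that your version accepts an unset value before relying on this.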

**Example:**
```bash
export MODEL_TRAINER_RANDOM_STATE=42
```

**Random State Note:** Use the same random state for reproducible results across experiments.
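
The effect of a fixed seed can be illustrated with the standard library alone. This sketch is not the tool's implementation; `shuffled_indices` is a hypothetical helper that stands in for any seeded shuffling step, such as a train/test split.

```python
import random
from typing import List, Optional


def shuffled_indices(n_samples: int, random_state: Optional[int]) -> List[int]:
    """Shuffle row indices with a dedicated, seeded random generator."""
    rng = random.Random(random_state)  # None seeds from system entropy
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(10, random_state=42)
second = shuffled_indices(10, random_state=42)
print(first == second)  # True: the same seed gives the same order
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the seed from being disturbed by other code that draws random numbers.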

### 3. CV Folds

**Environment Variable:** `MODEL_TRAINER_CV_FOLDS`

**Type:** Integer

**Default:** `5`

**Description:** Number of cross-validation folds for model evaluation. This determines how many times the data is split for cross-validation.

**Common Values:**
- `3` - Minimal cross-validation (3 folds)
- `5` - Standard cross-validation (5 folds, default)
- `10` - Comprehensive cross-validation (10 folds)
- `20` - Extensive cross-validation (20 folds)

**Example:**
```bash
export MODEL_TRAINER_CV_FOLDS=5
```

**CV Folds Note:** More folds give a more stable performance estimate but take longer to compute, because the model is trained once per fold.
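
As a rough illustration of what the fold count controls, the sketch below generates train/validation index pairs for k-fold cross-validation without shuffling. `kfold_indices` is a hypothetical helper, not an aiecs function, and it assumes at least two folds and no more folds than samples.

```python
from typing import Iterator, List, Tuple


def kfold_indices(n_samples: int, cv_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each fold."""
    if cv_folds < 2 or cv_folds > n_samples:
        raise ValueError("cv_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, cv_folds)
    start = 0
    for fold in range(cv_folds):
        # The first `extra` folds take one additional sample each
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


for train_idx, val_idx in kfold_indices(n_samples=10, cv_folds=3):
    print(len(train_idx), len(val_idx))
```

Each sample appears in exactly one validation fold, and the number of model fits equals `cv_folds`, which is where the extra compute time comes from.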

### 4. Enable Hyperparameter Tuning

**Environment Variable:** `MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING`

**Type:** Boolean

**Default:** `False`

**Description:** Whether to enable hyperparameter tuning. When enabled, the tool will automatically search for optimal hyperparameters.

**Values:**
- `true` - Enable hyperparameter tuning
- `false` - Disable hyperparameter tuning (default)

**Example:**
```bash
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
```

**Tuning Note:** Hyperparameter tuning can improve model performance but significantly increases training time.

### 5. Max Tuning Iterations

**Environment Variable:** `MODEL_TRAINER_MAX_TUNING_ITERATIONS`

**Type:** Integer

**Default:** `20`

**Description:** Maximum number of hyperparameter tuning iterations. This limits how many different hyperparameter combinations are tested.

**Common Values:**
- `10` - Quick tuning (10 iterations)
- `20` - Standard tuning (20 iterations, default)
- `50` - Comprehensive tuning (50 iterations)
- `100` - Extensive tuning (100 iterations)

**Example:**
```bash
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=50
```

**Iterations Note:** More iterations may find better hyperparameters but take much longer.
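
One way to picture the iteration cap is a random search that stops after a fixed number of candidates. The sketch below is an assumption about the general technique, not the tool's tuning code; `random_search`, `toy_score`, and the parameter names in `space` are all hypothetical.

```python
import random
from typing import Any, Callable, Dict, List, Tuple


def random_search(
    param_space: Dict[str, List[Any]],
    score_fn: Callable[[Dict[str, Any]], float],
    max_tuning_iterations: int,
    random_state: int,
) -> Tuple[Dict[str, Any], float]:
    """Sample at most max_tuning_iterations candidates and keep the best one."""
    rng = random.Random(random_state)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for _ in range(max_tuning_iterations):
        candidate = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


def toy_score(params: Dict[str, Any]) -> float:
    """Stand-in objective that peaks at max_depth=5, learning_rate=0.1."""
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)


space = {"max_depth": [2, 3, 5, 8], "learning_rate": [0.01, 0.1, 0.3]}
best, score = random_search(space, toy_score, max_tuning_iterations=20, random_state=42)
print(best, score)
```

In a real run `score_fn` would train and cross-validate a model, so the total cost grows roughly with `max_tuning_iterations` times `cv_folds`.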

## Usage Examples

### Example 1: Basic Environment Configuration

```bash
# Set basic model training parameters
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_RANDOM_STATE=42
export MODEL_TRAINER_CV_FOLDS=5
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=20

# Run your application
python app.py
```

### Example 2: Production Configuration with Tuning

```bash
# Optimized for production with hyperparameter tuning
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_RANDOM_STATE=42
export MODEL_TRAINER_CV_FOLDS=10
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=50
```

### Example 3: Development Configuration

```bash
# Development-friendly settings
export MODEL_TRAINER_TEST_SIZE=0.3
export MODEL_TRAINER_RANDOM_STATE=123
export MODEL_TRAINER_CV_FOLDS=3
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=10
```

### Example 4: Programmatic Configuration

```python
from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool

# Initialize with custom configuration
model_trainer = ModelTrainerTool(config={
    'test_size': 0.2,
    'random_state': 42,
    'cv_folds': 5,
    'enable_hyperparameter_tuning': False,
    'max_tuning_iterations': 20
})
```

### Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

```bash
# Set environment defaults
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
```

```python
# Override for specific instance
model_trainer = ModelTrainerTool(config={
    'test_size': 0.3,  # This overrides the environment variable
    'enable_hyperparameter_tuning': True  # This overrides the environment variable
})
```

## Configuration Priority

When the Model Trainer Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

1. **Programmatic config** - Values passed to the constructor
2. **Environment variables** - Values set via `MODEL_TRAINER_*` variables
3. **Default values** - Built-in defaults as specified above
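
The resolution order can be sketched as a layered merge. This is a minimal illustration under stated assumptions, not the tool's actual loader: `resolve_config`, `_coerce`, and `DEFAULTS` are hypothetical names, and the boolean handling assumes only `true` and `false` are used.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Defaults taken from the Configuration Options section
DEFAULTS: Dict[str, Any] = {
    "test_size": 0.2,
    "random_state": 42,
    "cv_folds": 5,
    "enable_hyperparameter_tuning": False,
    "max_tuning_iterations": 20,
}
ENV_PREFIX = "MODEL_TRAINER_"


def _coerce(raw: str, default: Any) -> Any:
    """Convert an environment string to the type of the default value."""
    if isinstance(default, bool):  # check bool first: bool is a subclass of int
        return raw.strip().lower() == "true"
    return type(default)(raw)


def resolve_config(
    config: Optional[Mapping[str, Any]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical resolver: defaults < environment variables < programmatic config."""
    environ = os.environ if environ is None else environ
    resolved = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw is not None:
            resolved[key] = _coerce(raw, default)
    resolved.update(config or {})  # programmatic values win
    return resolved


fake_env = {"MODEL_TRAINER_TEST_SIZE": "0.25", "MODEL_TRAINER_CV_FOLDS": "10"}
settings = resolve_config(config={"test_size": 0.3}, environ=fake_env)
print(settings["test_size"], settings["cv_folds"], settings["random_state"])  # 0.3 10 42
```

Here `test_size` comes from the programmatic config, `cv_folds` from the environment, and `random_state` from the defaults.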

## Data Type Parsing

### Float Values

Floats should be provided as decimal numbers:

```bash
export MODEL_TRAINER_TEST_SIZE=0.2
export MODEL_TRAINER_TEST_SIZE=0.3
```

### Integer Values

Integers should be provided as numeric strings:

```bash
export MODEL_TRAINER_RANDOM_STATE=42
export MODEL_TRAINER_CV_FOLDS=5
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=20
```

### Boolean Values

Booleans should be provided as lowercase strings:

```bash
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
```
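
Environment variables always arrive as strings, so each value has to be converted before use. The sketch below shows why booleans need explicit handling; `parse_bool` is a hypothetical helper, and the tool's own parser may accept more spellings than the two shown here.

```python
def parse_bool(raw: str) -> bool:
    """Strictly parse 'true' or 'false', ignoring case and surrounding whitespace."""
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {raw!r}")


print(float("0.2"))         # 0.2
print(int("42"))            # 42
print(bool("false"))        # True: any non-empty string is truthy
print(parse_bool("false"))  # False
```

Calling `bool()` on the raw string is a common mistake, since it would silently turn `false` into an enabled setting.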

## Validation

### Automatic Type Validation

Pydantic automatically validates configuration values:

- `test_size` must be a float between 0 and 1
- `random_state` must be a non-negative integer
- `cv_folds` must be a positive integer
- `enable_hyperparameter_tuning` must be a boolean
- `max_tuning_iterations` must be a positive integer
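
The rules above can be approximated with a standard-library dataclass. This is a stand-in for illustration only, not the tool's Pydantic schema: `TrainerSettings` is a hypothetical class, and it assumes the bounds on `test_size` are exclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainerSettings:
    """Hypothetical mirror of the documented settings and their constraints."""

    test_size: float = 0.2
    random_state: int = 42
    cv_folds: int = 5
    enable_hyperparameter_tuning: bool = False
    max_tuning_iterations: int = 20

    def __post_init__(self) -> None:
        if not 0.0 < self.test_size < 1.0:  # assumption: bounds are exclusive
            raise ValueError("test_size must be between 0 and 1")
        if self.random_state < 0:
            raise ValueError("random_state must be a non-negative integer")
        if self.cv_folds < 1:
            raise ValueError("cv_folds must be a positive integer")
        if not isinstance(self.enable_hyperparameter_tuning, bool):
            raise ValueError("enable_hyperparameter_tuning must be a boolean")
        if self.max_tuning_iterations < 1:
            raise ValueError("max_tuning_iterations must be a positive integer")


print(TrainerSettings(cv_folds=10))

try:
    TrainerSettings(test_size=1.5)
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Unlike this stand-in, a Pydantic model would typically also coerce compatible input types, for example the string `"5"` to the integer `5`.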

### Runtime Validation

When training models, the tool validates:

1. **Test size** - Test size must be reasonable for the dataset
2. **Cross-validation** - CV folds must be appropriate for data size
3. **Hyperparameter tuning** - Tuning iterations must be reasonable
4. **Data compatibility** - Data must be compatible with model training
5. **Memory requirements** - Training must not exceed memory limits
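
What "reasonable for the dataset" means is not specified here, but a pre-flight check along the following lines shows the kind of relationship involved. `check_training_plan` is a hypothetical helper with assumed thresholds; the tool's real checks may be stricter or different.

```python
from typing import List


def check_training_plan(n_samples: int, test_size: float, cv_folds: int) -> List[str]:
    """Return a list of problems found; an empty list means the plan looks usable."""
    problems: List[str] = []
    n_test = round(n_samples * test_size)
    n_train = n_samples - n_test
    if n_test < 1:
        problems.append("test split would be empty")
    if n_train < cv_folds:
        # Each fold needs at least one training sample to validate on
        problems.append(f"only {n_train} training samples for {cv_folds} folds")
    return problems


print(check_training_plan(n_samples=1000, test_size=0.2, cv_folds=5))  # []
print(check_training_plan(n_samples=8, test_size=0.4, cv_folds=10))
```

Running a check like this before training gives a clearer message than a failure deep inside cross-validation.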

## Model Types

The Model Trainer Tool supports various model types:

### Classification Models
- **Logistic Regression** - Linear classification model
- **Random Forest Classifier** - Ensemble classification model
- **Gradient Boosting Classifier** - Gradient boosting classification

### Regression Models
- **Linear Regression** - Linear regression model
- **Random Forest Regressor** - Ensemble regression model
- **Gradient Boosting Regressor** - Gradient boosting regression

### Auto Selection
- **Auto** - Automatically select best model type

## Task Types

### Classification
- Binary classification
- Multi-class classification
- Multi-label classification

### Regression
- Linear regression
- Non-linear regression
- Time series regression

### Clustering
- K-means clustering
- Hierarchical clustering
- Density-based clustering

## Operations Supported

The Model Trainer Tool supports comprehensive machine learning operations:

### Basic Training
- `train_model` - Train machine learning models
- `train_classifier` - Train classification models
- `train_regressor` - Train regression models
- `auto_train` - Automatically train best model
- `train_multiple_models` - Train multiple model types

### Hyperparameter Tuning
- `tune_hyperparameters` - Perform hyperparameter tuning
- `grid_search` - Grid search hyperparameter optimization
- `random_search` - Random search hyperparameter optimization
- `bayesian_optimization` - Bayesian hyperparameter optimization
- `optimize_model` - Optimize model hyperparameters

### Model Evaluation
- `evaluate_model` - Evaluate model performance
- `cross_validate` - Perform cross-validation
- `compare_models` - Compare multiple models
- `generate_metrics` - Generate performance metrics
- `create_evaluation_report` - Create comprehensive evaluation report

### Feature Analysis
- `analyze_feature_importance` - Analyze feature importance
- `select_features` - Select important features
- `rank_features` - Rank features by importance
- `generate_feature_report` - Generate feature analysis report
- `visualize_features` - Visualize feature importance

### Model Management
- `save_model` - Save trained models
- `load_model` - Load saved models
- `export_model` - Export models in various formats
- `create_model_pipeline` - Create model training pipeline
- `deploy_model` - Deploy models for inference

### Advanced Operations
- `explain_model` - Generate model explanations
- `create_model_report` - Create comprehensive model report
- `validate_model` - Validate model performance
- `optimize_model_size` - Optimize model for deployment
- `benchmark_models` - Benchmark model performance

## Troubleshooting

### Issue: Model training fails

**Error:** `TrainingError` during model training

**Solutions:**
1. Check data quality and format
2. Verify feature engineering
3. Check memory availability
4. Validate hyperparameters

### Issue: Hyperparameter tuning takes too long

**Error:** Tuning process is very slow

**Solutions:**
```bash
# Reduce tuning iterations
export MODEL_TRAINER_MAX_TUNING_ITERATIONS=10

# Disable tuning for development
export MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
```

### Issue: Cross-validation errors

**Error:** CV validation fails

**Solutions:**
```bash
# Reduce CV folds
export MODEL_TRAINER_CV_FOLDS=3

# Then check data size and quality from Python, for example:
#   model_trainer.validate_data(data)
```

### Issue: Memory usage exceeded

**Error:** Out of memory during training

**Solutions:**
1. Reduce dataset size
2. Use simpler models
3. Disable hyperparameter tuning
4. Process data in batches

### Issue: Poor model performance

**Error:** Low model accuracy/scores

**Solutions:**
1. Enable hyperparameter tuning
2. Increase CV folds to get a more reliable estimate of performance
3. Check feature engineering
4. Try different model types

### Issue: Non-reproducible results

**Error:** Results vary between runs

**Solutions:**
```bash
# Set fixed random state
export MODEL_TRAINER_RANDOM_STATE=42

# Ensure consistent data preprocessing; from Python, for example:
#   model_trainer.set_random_state(42)
```

### Issue: Test set too small/large

**Error:** Unreliable model evaluation

**Solutions:**
```bash
# Adjust test size
export MODEL_TRAINER_TEST_SIZE=0.2

# Or use cross-validation instead
export MODEL_TRAINER_CV_FOLDS=10
```

## Best Practices

### Performance Optimization

1. **Test Size Selection** - Choose appropriate test size for dataset
2. **CV Folds** - Use appropriate number of CV folds
3. **Hyperparameter Tuning** - Enable only when needed
4. **Model Selection** - Choose appropriate model types
5. **Feature Engineering** - Optimize feature selection

### Error Handling

1. **Graceful Degradation** - Handle training failures gracefully
2. **Validation** - Validate data before training
3. **Fallback Strategies** - Provide fallback model types
4. **Error Logging** - Log errors for debugging and monitoring
5. **User Feedback** - Provide clear error messages
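
A fallback strategy with error logging might look like the sketch below. It is generic on purpose: `train_with_fallback` and `fake_train` are hypothetical, and the model type strings are placeholders rather than confirmed values accepted by the tool.

```python
import logging
from typing import Any, Callable, Optional, Sequence

logger = logging.getLogger("model_trainer_example")


def train_with_fallback(train_fn: Callable[[str], Any], model_types: Sequence[str]) -> Any:
    """Try each model type in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for model_type in model_types:
        try:
            return train_fn(model_type)
        except Exception as exc:  # log and move on to the next candidate
            logger.warning("Training %s failed: %s", model_type, exc)
            last_error = exc
    raise RuntimeError("All candidate model types failed") from last_error


def fake_train(model_type: str) -> str:
    """Stand-in for a real training call; fails for one model type."""
    if model_type == "gradient_boosting":
        raise MemoryError("not enough memory for boosting")
    return f"trained {model_type}"


result = train_with_fallback(fake_train, ["gradient_boosting", "random_forest"])
print(result)  # trained random_forest
```

In practice `train_fn` would wrap the tool's training call, and the candidate list would run from the preferred model to the cheapest one.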

### Security

1. **Data Privacy** - Ensure data privacy during training
2. **Model Security** - Secure trained models
3. **Access Control** - Control access to training results
4. **Audit Logging** - Log training activities
5. **Compliance** - Ensure compliance with regulations

### Resource Management

1. **Memory Monitoring** - Monitor memory usage during training
2. **Processing Time** - Set reasonable timeouts
3. **Storage Optimization** - Optimize model storage
4. **Cleanup** - Clean up temporary files
5. **Resource Limits** - Set appropriate resource limits

### Integration

1. **Tool Dependencies** - Ensure required tools are available
2. **API Compatibility** - Maintain API compatibility
3. **Error Propagation** - Properly propagate errors
4. **Logging Integration** - Integrate with logging systems
5. **Monitoring** - Monitor tool performance and usage

### Development vs Production

**Development:**
```bash
MODEL_TRAINER_TEST_SIZE=0.3
MODEL_TRAINER_RANDOM_STATE=123
MODEL_TRAINER_CV_FOLDS=3
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=false
MODEL_TRAINER_MAX_TUNING_ITERATIONS=10
```

**Production:**
```bash
MODEL_TRAINER_TEST_SIZE=0.2
MODEL_TRAINER_RANDOM_STATE=42
MODEL_TRAINER_CV_FOLDS=10
MODEL_TRAINER_ENABLE_HYPERPARAMETER_TUNING=true
MODEL_TRAINER_MAX_TUNING_ITERATIONS=50
```

### Error Handling in Code

Always wrap training operations in try-except blocks:

```python
from aiecs.tools.statistics.model_trainer_tool import ModelTrainerTool, ModelTrainerError, TrainingError

model_trainer = ModelTrainerTool()

# X_train and y_train are assumed to hold your prepared features and targets
try:
    model = model_trainer.train_model(
        X_train=X_train,
        y_train=y_train,
        model_type='auto',
        task_type='classification'
    )
except TrainingError as e:
    print(f"Training error: {e}")
except ModelTrainerError as e:
    print(f"Model trainer error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Dependencies

### Core Dependencies

```bash
# Install core dependencies
pip install pydantic python-dotenv

# Install data processing dependencies
pip install pandas numpy scikit-learn

# Install gradient boosting libraries (scikit-learn is installed above)
pip install xgboost lightgbm
```

### Optional Dependencies

```bash
# For hyperparameter tuning
pip install optuna hyperopt scikit-optimize

# For model explanation
pip install shap lime

# For advanced models
pip install catboost

# For model deployment (pickle ships with the Python standard library)
pip install joblib
```

### Verification

```python
# Test dependency availability
try:
    import pandas
    import numpy
    import sklearn
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test ML libraries availability
try:
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    print("Scikit-learn available")
except ImportError:
    print("Scikit-learn not available")

# Test hyperparameter tuning availability
try:
    import optuna
    print("Hyperparameter tuning available")
except ImportError:
    print("Hyperparameter tuning not available")

# Test model explanation availability
try:
    import shap
    import lime
    print("Model explanation available")
except ImportError:
    print("Model explanation not available")
```

## Related Documentation

- Tool implementation details in the source code
- Statistics tool documentation for statistical analysis
- Data transformer tool documentation for feature engineering
- Main aiecs documentation for architecture overview

## Support

For issues or questions about Model Trainer Tool configuration:
- Check the tool source code for implementation details
- Review statistics tool documentation for statistical analysis
- Consult the main aiecs documentation for architecture overview
- Test with simple datasets first to isolate configuration vs. training issues
- Verify data compatibility and format requirements
- Check model type and task type settings
- Ensure proper hyperparameter tuning configuration
- Validate data quality and preprocessing requirements