# Statistical Analyzer Tool Configuration Guide

## Overview

The Statistical Analyzer Tool is an advanced statistical analysis and hypothesis testing tool that provides comprehensive statistical analysis with descriptive and inferential statistics, hypothesis testing (t-test, ANOVA, chi-square), regression analysis, time series analysis, and correlation and causality analysis. It can perform hypothesis testing, conduct regression analysis, analyze time series, and perform correlation and causal analysis. The tool integrates with stats_tool for core statistical operations and supports various analysis types (descriptive, t_test, anova, chi_square, linear_regression, logistic_regression, correlation, time_series). The tool can be configured via environment variables using the `STATISTICAL_ANALYZER_` prefix or through programmatic configuration when initializing the tool.

## Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The Statistical Analyzer Tool reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools.

### Setting Up .env Files

**1. Install python-dotenv:**

```bash
pip install python-dotenv
```

**2. Create a `.env` file in your project root:**

```bash
# .env file in your project root
STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.05
STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.95
STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=true
```

**3. Load the .env file in your application:**

```python
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.statistical_analyzer_tool import StatisticalAnalyzerTool

# The tool will automatically use the environment variables
statistical_analyzer = StatisticalAnalyzerTool()
```

### Multiple Environment Files

You can use different `.env` files for different environments:

```python
import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.statistical_analyzer_tool import StatisticalAnalyzerTool
statistical_analyzer = StatisticalAnalyzerTool()
```

**Example `.env.production`:**
```bash
# Production settings - optimized for rigorous statistical analysis
STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.01
STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.99
STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=true
```

**Example `.env.development`:**
```bash
# Development settings - optimized for testing and debugging
STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.05
STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.95
STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=false
```

### Best Practices for .env Files

1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`:
   ```gitignore
   # .gitignore
   .env
   .env.local
   .env.*.local
   .env.production
   .env.staging
   ```

2. **Provide a template** - Create `.env.example` with documented dummy values:
   ```bash
   # .env.example
   # Statistical Analyzer Tool Configuration
   
   # Significance level for hypothesis testing
   STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.05
   
   # Confidence level for statistical intervals
   STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.95
   
   # Whether to calculate effect sizes in analyses
   STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=true
   ```

3. **Document your variables** - Add comments explaining each setting

4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports

5. **Format values correctly**:
   - Floats: Decimal numbers: `0.05`, `0.95`
   - Booleans: `true` or `false`

## Configuration Options

### 1. Significance Level

**Environment Variable:** `STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL`

**Type:** Float

**Default:** `0.05`

**Description:** Significance level (alpha) for hypothesis testing. This determines the threshold for rejecting the null hypothesis.

**Common Values:**
- `0.01` - Very strict significance (1% level)
- `0.05` - Standard significance (5% level, default)
- `0.10` - Lenient significance (10% level)
- `0.001` - Extremely strict significance (0.1% level)

**Example:**
```bash
export STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.01
```

**Significance Note:** Lower values are more strict and require stronger evidence to reject the null hypothesis.

### 2. Confidence Level

**Environment Variable:** `STATISTICAL_ANALYZER_CONFIDENCE_LEVEL`

**Type:** Float

**Default:** `0.95`

**Description:** Confidence level for statistical intervals and confidence intervals. This determines the probability that the true parameter lies within the calculated interval.

**Common Values:**
- `0.90` - 90% confidence level
- `0.95` - 95% confidence level (default)
- `0.99` - 99% confidence level
- `0.999` - 99.9% confidence level

**Example:**
```bash
export STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.99
```

**Confidence Note:** Higher confidence levels provide wider intervals but greater certainty.

### 3. Enable Effect Size

**Environment Variable:** `STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE`

**Type:** Boolean

**Default:** `True`

**Description:** Whether to calculate effect sizes in statistical analyses. Effect sizes provide information about the practical significance of results.

**Values:**
- `true` - Enable effect size calculation (default)
- `false` - Disable effect size calculation

**Example:**
```bash
export STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=true
```

**Effect Size Note:** Effect sizes help interpret the practical significance of statistical results.

## Usage Examples

### Example 1: Basic Environment Configuration

```bash
# Set basic statistical analysis parameters
export STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.05
export STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.95
export STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=true

# Run your application
python app.py
```

### Example 2: Rigorous Analysis Configuration

```bash
# Optimized for rigorous statistical analysis
export STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.01
export STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.99
export STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=true
```

### Example 3: Development Configuration

```bash
# Development-friendly settings
export STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.05
export STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.95
export STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=false
```

### Example 4: Programmatic Configuration

```python
from aiecs.tools.statistics.statistical_analyzer_tool import StatisticalAnalyzerTool

# Initialize with custom configuration
statistical_analyzer = StatisticalAnalyzerTool(config={
    'significance_level': 0.05,
    'confidence_level': 0.95,
    'enable_effect_size': True
})
```

### Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

```bash
# Set environment defaults
export STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.05
export STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=true
```

```python
# Override for specific instance
statistical_analyzer = StatisticalAnalyzerTool(config={
    'significance_level': 0.01,  # This overrides the environment variable
    'enable_effect_size': False  # This overrides the environment variable
})
```

## Configuration Priority

When the Statistical Analyzer Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

1. **Programmatic config** - Values passed to the constructor
2. **Environment variables** - Values set via `STATISTICAL_ANALYZER_*` variables
3. **Default values** - Built-in defaults as specified above

## Data Type Parsing

### Float Values

Floats should be provided as decimal numbers:

```bash
export STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.05
export STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.95
```

### Boolean Values

Booleans should be provided as lowercase strings:

```bash
export STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=true
export STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=false
```

## Validation

### Automatic Type Validation

Pydantic automatically validates configuration values:

- `significance_level` must be a float between 0 and 1
- `confidence_level` must be a float between 0 and 1
- `enable_effect_size` must be a boolean

### Runtime Validation

When performing statistical analyses, the tool validates:

1. **Significance level** - Level must be appropriate for the analysis type
2. **Confidence level** - Level must be reasonable for interval estimation
3. **Data compatibility** - Data must be compatible with statistical tests
4. **Sample size** - Sample size must be adequate for the analysis
5. **Assumptions** - Data must meet test assumptions

## Analysis Types

The Statistical Analyzer Tool supports various analysis types:

### Descriptive Statistics
- **Descriptive** - Basic descriptive statistics (mean, median, std, etc.)
- **Summary statistics** - Comprehensive data summaries
- **Distribution analysis** - Distribution characteristics

### Hypothesis Testing
- **T-test** - Student's t-test for means
- **ANOVA** - Analysis of variance
- **Chi-square** - Chi-square test for independence

### Regression Analysis
- **Linear Regression** - Linear regression analysis
- **Logistic Regression** - Logistic regression analysis
- **Multiple Regression** - Multiple variable regression

### Correlation Analysis
- **Correlation** - Correlation analysis
- **Partial Correlation** - Partial correlation analysis
- **Causality** - Causal analysis

### Time Series Analysis
- **Time Series** - Time series analysis
- **Trend Analysis** - Trend detection and analysis
- **Seasonality** - Seasonal pattern analysis

## Operations Supported

The Statistical Analyzer Tool supports comprehensive statistical analysis operations:

### Basic Analysis
- `analyze_data` - Perform comprehensive statistical analysis
- `descriptive_statistics` - Generate descriptive statistics
- `summary_statistics` - Create data summaries
- `distribution_analysis` - Analyze data distributions
- `correlation_analysis` - Perform correlation analysis

### Hypothesis Testing
- `t_test` - Perform t-tests
- `anova_test` - Perform ANOVA tests
- `chi_square_test` - Perform chi-square tests
- `mann_whitney_test` - Perform Mann-Whitney U tests
- `wilcoxon_test` - Perform Wilcoxon signed-rank tests

### Regression Analysis
- `linear_regression` - Perform linear regression
- `logistic_regression` - Perform logistic regression
- `multiple_regression` - Perform multiple regression
- `polynomial_regression` - Perform polynomial regression
- `ridge_regression` - Perform ridge regression

### Advanced Analysis
- `time_series_analysis` - Perform time series analysis
- `causal_analysis` - Perform causal analysis
- `effect_size_analysis` - Calculate effect sizes
- `power_analysis` - Perform statistical power analysis
- `meta_analysis` - Perform meta-analysis

### Statistical Tests
- `normality_tests` - Test for normality
- `homogeneity_tests` - Test for homogeneity of variance
- `independence_tests` - Test for independence
- `stationarity_tests` - Test for stationarity
- `cointegration_tests` - Test for cointegration

### Reporting Operations
- `generate_report` - Generate statistical analysis report
- `create_summary` - Create analysis summary
- `export_results` - Export analysis results
- `visualize_results` - Create result visualizations
- `interpret_results` - Provide result interpretations

## Troubleshooting

### Issue: Statistical test assumptions not met

**Error:** Test assumptions violated

**Solutions:**
1. Check data normality
2. Verify homogeneity of variance
3. Use non-parametric alternatives
4. Transform data if needed

### Issue: Insufficient sample size

**Error:** Sample size too small for analysis

**Solutions:**
1. Increase sample size
2. Use appropriate tests for small samples
3. Adjust significance level
4. Consider effect size requirements

### Issue: Multiple comparison problems

**Error:** Multiple testing issues

**Solutions:**
1. Apply Bonferroni correction
2. Use FDR correction
3. Adjust significance level
4. Use appropriate post-hoc tests

### Issue: Non-normal data

**Error:** Data not normally distributed

**Solutions:**
1. Use non-parametric tests
2. Transform data
3. Use robust statistical methods
4. Check for outliers

### Issue: Missing data

**Error:** Missing values in analysis

**Solutions:**
1. Handle missing data appropriately
2. Use complete case analysis
3. Apply imputation methods
4. Use maximum likelihood estimation

### Issue: Correlation vs causation

**Error:** Confusing correlation with causation

**Solutions:**
1. Use causal analysis methods
2. Control for confounding variables
3. Apply appropriate statistical techniques
4. Consider experimental design

### Issue: Effect size interpretation

**Error:** Effect size calculation or interpretation issues

**Solutions:**
```bash
# Enable effect size calculation
export STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=true

# Use appropriate effect size measures
statistical_analyzer.calculate_effect_size(data, measure='cohens_d')
```

## Best Practices

### Statistical Rigor

1. **Significance Level** - Choose appropriate significance level
2. **Effect Size** - Always report effect sizes
3. **Assumptions** - Check test assumptions
4. **Multiple Testing** - Account for multiple comparisons
5. **Sample Size** - Ensure adequate sample size

### Error Handling

1. **Graceful Degradation** - Handle analysis failures gracefully
2. **Validation** - Validate data before analysis
3. **Fallback Methods** - Provide alternative analysis methods
4. **Error Logging** - Log errors for debugging and monitoring
5. **User Feedback** - Provide clear error messages

### Security

1. **Data Privacy** - Ensure data privacy during analysis
2. **Access Control** - Control access to analysis results
3. **Audit Logging** - Log analysis activities
4. **Data Sanitization** - Sanitize sensitive data
5. **Compliance** - Ensure compliance with regulations

### Resource Management

1. **Memory Monitoring** - Monitor memory usage during analysis
2. **Processing Time** - Set reasonable timeouts
3. **Storage Optimization** - Optimize result storage
4. **Cleanup** - Clean up temporary files
5. **Resource Limits** - Set appropriate resource limits

### Integration

1. **Tool Dependencies** - Ensure required tools are available
2. **API Compatibility** - Maintain API compatibility
3. **Error Propagation** - Properly propagate errors
4. **Logging Integration** - Integrate with logging systems
5. **Monitoring** - Monitor tool performance and usage

### Development vs Production

**Development:**
```bash
STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.05
STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.95
STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=false
```

**Production:**
```bash
STATISTICAL_ANALYZER_SIGNIFICANCE_LEVEL=0.01
STATISTICAL_ANALYZER_CONFIDENCE_LEVEL=0.99
STATISTICAL_ANALYZER_ENABLE_EFFECT_SIZE=true
```

### Error Handling

Always wrap statistical analysis operations in try-except blocks:

```python
from aiecs.tools.statistics.statistical_analyzer_tool import StatisticalAnalyzerTool, StatisticalAnalyzerError, AnalysisError

statistical_analyzer = StatisticalAnalyzerTool()

try:
    result = statistical_analyzer.analyze_data(
        data=df,
        analysis_type='t_test',
        significance_level=0.05
    )
except AnalysisError as e:
    print(f"Analysis error: {e}")
except StatisticalAnalyzerError as e:
    print(f"Statistical analyzer error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Dependencies

### Core Dependencies

```bash
# Install core dependencies
pip install pydantic python-dotenv

# Install data processing dependencies
pip install pandas numpy scipy

# Install statistical analysis dependencies
pip install scipy statsmodels
```

### Optional Dependencies

```bash
# For advanced statistical analysis
pip install scikit-learn

# For time series analysis
pip install statsmodels

# For effect size calculations
pip install pingouin

# For power analysis
pip install statsmodels
```

### Verification

```python
# Test dependency availability
try:
    import pandas
    import numpy
    import scipy
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test statistical libraries availability
try:
    from scipy import stats
    import statsmodels
    print("Statistical libraries available")
except ImportError:
    print("Statistical libraries not available")

# Test advanced analysis availability
try:
    import pingouin
    print("Advanced statistical analysis available")
except ImportError:
    print("Advanced statistical analysis not available")

# Test time series analysis availability
try:
    from statsmodels.tsa import seasonal
    print("Time series analysis available")
except ImportError:
    print("Time series analysis not available")
```

## Related Documentation

- Tool implementation details in the source code
- Stats tool documentation for core statistical operations
- Statistics tool documentation for statistical analysis
- Main aiecs documentation for architecture overview

## Support

For issues or questions about Statistical Analyzer Tool configuration:
- Check the tool source code for implementation details
- Review stats tool documentation for core statistical operations
- Consult the main aiecs documentation for architecture overview
- Test with simple datasets first to isolate configuration vs. analysis issues
- Verify data compatibility and format requirements
- Check significance and confidence level settings
- Ensure proper statistical test assumptions
- Validate data quality and statistical requirements