# Stats Tool Configuration Guide

## Overview

The Stats Tool provides statistical analysis for several data formats: SPSS (.sav, .por), SAS (.sas7bdat), CSV, Excel, JSON, Parquet, and Feather files. It supports descriptive statistics, hypothesis testing (t-tests, ANOVA), correlation analysis, regression analysis, and advanced statistical operations. The tool can be configured through environment variables that use the `STATS_TOOL_` prefix, or programmatically when initializing the tool.

## Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The Stats Tool reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools.

### Setting Up .env Files

**1. Install python-dotenv:**

```bash
pip install python-dotenv
```

**2. Create a `.env` file in your project root:**

```bash
# .env file in your project root
STATS_TOOL_MAX_FILE_SIZE_MB=200
STATS_TOOL_ALLOWED_EXTENSIONS=[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]
```

**3. Load the .env file in your application:**

```python
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.task_tools.stats_tool import StatsTool

# The tool will automatically use the environment variables
stats_tool = StatsTool()
```

### Multiple Environment Files

You can use different `.env` files for different environments:

```python
import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.task_tools.stats_tool import StatsTool
stats_tool = StatsTool()
```

**Example `.env.production`:**
```bash
# Production settings - optimized for large datasets
STATS_TOOL_MAX_FILE_SIZE_MB=500
STATS_TOOL_ALLOWED_EXTENSIONS=[".sav",".sas7bdat",".csv",".xlsx",".parquet"]
```

**Example `.env.development`:**
```bash
# Development settings - more permissive for testing
STATS_TOOL_MAX_FILE_SIZE_MB=100
STATS_TOOL_ALLOWED_EXTENSIONS=[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]
```

### Best Practices for .env Files

1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`:
   ```gitignore
   # .gitignore
   .env
   .env.local
   .env.*.local
   .env.production
   .env.staging
   ```

2. **Provide a template** - Create `.env.example` with documented dummy values:
   ```bash
   # .env.example
   # Stats Tool Configuration
   
   # Maximum file size in megabytes
   STATS_TOOL_MAX_FILE_SIZE_MB=200
   
   # Allowed file extensions (JSON array)
   STATS_TOOL_ALLOWED_EXTENSIONS=[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]
   ```

3. **Document your variables** - Add comments explaining each setting

4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports

5. **Format complex types correctly**:
   - Integers: plain numbers without quotes or units, e.g. `200`, `500`
   - Lists: JSON arrays with double-quoted strings, e.g. `[".sav",".csv",".xlsx"]`

## Configuration Options

### 1. Max File Size MB

**Environment Variable:** `STATS_TOOL_MAX_FILE_SIZE_MB`

**Type:** Integer

**Default:** `200`

**Description:** Maximum file size in megabytes for data files. This prevents memory issues with extremely large datasets and ensures reasonable processing times.

**Common Values:**
- `50` - Small datasets (development)
- `100` - Medium datasets (testing)
- `200` - Large datasets (default)
- `500` - Very large datasets (production)
- `1000` - Massive datasets (enterprise)

**Example:**
```bash
export STATS_TOOL_MAX_FILE_SIZE_MB=500
```

**Memory Note:** Larger values allow processing bigger files but use more memory. Adjust based on available system resources.
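
To make the limit concrete, the sketch below checks a file's size on disk against a megabyte limit before loading it. The helper name `exceeds_size_limit` is hypothetical and not part of aiecs, and the conversion of 1 MB = 1024 * 1024 bytes is an assumption; the tool itself may count megabytes differently.

```python
import os

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def exceeds_size_limit(path: str, max_file_size_mb: int) -> bool:
    """Hypothetical helper: True if the file at `path` is larger than the limit."""
    return os.path.getsize(path) > max_file_size_mb * BYTES_PER_MB
```

Running a check like this before calling the tool lets an application reject oversized uploads early instead of waiting for the tool to raise an error.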

### 2. Allowed Extensions

**Environment Variable:** `STATS_TOOL_ALLOWED_EXTENSIONS`

**Type:** List[str]

**Default:** `['.sav', '.sas7bdat', '.por', '.csv', '.xlsx', '.xls', '.json', '.parquet', '.feather']`

**Description:** List of allowed file extensions for statistical analysis. This is a security feature that prevents processing of unauthorized file types.

**Format:** JSON array string with double quotes

**Supported Formats:**
- `.sav` - SPSS data files
- `.sas7bdat` - SAS data files
- `.por` - SPSS portable files
- `.csv` - Comma-separated values
- `.xlsx` - Excel 2007+ files
- `.xls` - Excel 97-2003 files
- `.json` - JSON data files
- `.parquet` - Apache Parquet files
- `.feather` - Feather format files

**Example:**
```bash
# Allow all supported formats
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]'

# Restrict to common formats only
export STATS_TOOL_ALLOWED_EXTENSIONS='[".csv",".xlsx",".json"]'

# SPSS/SAS only
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".por"]'
```

**Security Note:** Only allow extensions that your application actually needs to process.
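
An allow-list check of this kind might look like the following sketch. The function `is_allowed_extension` is a hypothetical illustration, not the tool's implementation, and the case-insensitive comparison is an assumption.

```python
from pathlib import Path
from typing import Iterable


def is_allowed_extension(path: str, allowed_extensions: Iterable[str]) -> bool:
    """Hypothetical helper: compare the file's final suffix against an allow-list."""
    allowed = {ext.lower() for ext in allowed_extensions}
    # Path.suffix returns only the last suffix, so "report.csv.exe" yields ".exe"
    return Path(path).suffix.lower() in allowed
```

Only the final suffix is compared, so a name such as `report.csv.exe` is rejected when `.exe` is not in the list.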

## Usage Examples

### Example 1: Basic Environment Configuration

```bash
# Set custom file size limit
export STATS_TOOL_MAX_FILE_SIZE_MB=500
export STATS_TOOL_ALLOWED_EXTENSIONS='[".csv",".xlsx",".sav"]'

# Run your application
python app.py
```

### Example 2: Production Configuration

```bash
# Production settings - optimized for large datasets
export STATS_TOOL_MAX_FILE_SIZE_MB=1000
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".csv",".xlsx",".parquet"]'
```

### Example 3: Development Configuration

```bash
# Development settings - permissive for testing
export STATS_TOOL_MAX_FILE_SIZE_MB=100
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]'
```

### Example 4: Programmatic Configuration

```python
from aiecs.tools.task_tools.stats_tool import StatsTool

# Initialize with custom configuration
stats_tool = StatsTool(config={
    'max_file_size_mb': 500,
    'allowed_extensions': ['.sav', '.sas7bdat', '.csv', '.xlsx']
})
```

### Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

```bash
# Set environment defaults
export STATS_TOOL_MAX_FILE_SIZE_MB=200
export STATS_TOOL_ALLOWED_EXTENSIONS='[".csv",".xlsx"]'
```

```python
from aiecs.tools.task_tools.stats_tool import StatsTool

# Override for specific instance
stats_tool = StatsTool(config={
    'max_file_size_mb': 500,  # This overrides the environment variable
    'allowed_extensions': ['.sav', '.sas7bdat']  # This overrides the environment variable
})
```

## Configuration Priority

When the Stats Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

1. **Programmatic config** - Values passed to the constructor
2. **Environment variables** - Values set via `STATS_TOOL_*` variables
3. **Default values** - Built-in defaults as specified above
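
The resolution order above can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's actual code: `resolve_config`, `DEFAULTS`, and `ENV_PREFIX` are hypothetical names, and environment values are assumed to be parsed as JSON.

```python
import json
from typing import Any, Dict, Mapping, Optional

# Defaults as documented in this guide
DEFAULTS: Dict[str, Any] = {
    "max_file_size_mb": 200,
    "allowed_extensions": [".sav", ".sas7bdat", ".por", ".csv", ".xlsx",
                           ".xls", ".json", ".parquet", ".feather"],
}
ENV_PREFIX = "STATS_TOOL_"


def resolve_config(config: Optional[Mapping[str, Any]],
                   environ: Mapping[str, str]) -> Dict[str, Any]:
    """Hypothetical sketch: defaults < environment variables < programmatic config."""
    resolved = dict(DEFAULTS)                      # 3. lowest priority
    for key in DEFAULTS:
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw is not None:
            resolved[key] = json.loads(raw)        # 2. environment overrides defaults
    resolved.update(config or {})                  # 1. constructor values win
    return resolved
```

Passing `os.environ` as `environ` mirrors the real setup, while passing a plain dictionary makes the precedence easy to check in a test.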

## Data Type Parsing

### Integer Values

Integers should be provided as numeric strings:

```bash
export STATS_TOOL_MAX_FILE_SIZE_MB=200
```

### List Values

Lists must be provided as JSON arrays with double quotes:

```bash
# Correct
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".csv",".xlsx"]'

# Incorrect (will not parse)
export STATS_TOOL_ALLOWED_EXTENSIONS=".sav,.sas7bdat,.csv,.xlsx"
```

**Important:** Use single quotes for the shell, double quotes for JSON:
```bash
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".csv",".xlsx"]'
# Outer single quotes: keep the shell from interpreting the brackets and double quotes
# Inner double quotes: required by JSON around each string in the array
```
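
To see why the comma-separated form fails, the sketch below parses a value with Python's standard `json` module. It assumes the tool treats list values as JSON, as described above; `parse_extension_list` is a hypothetical helper, not an aiecs function.

```python
import json
from typing import List


def parse_extension_list(raw: str) -> List[str]:
    """Hypothetical helper: parse a JSON array of strings, else raise ValueError."""
    value = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
        raise ValueError(f"expected a JSON array of strings, got: {raw!r}")
    return value
```

A value such as `.sav,.csv` or `['.sav']` (single quotes inside the array) is not valid JSON, so `json.loads` rejects it before any further validation runs.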

## Validation

### Automatic Type Validation

Pydantic automatically validates configuration values:

- `max_file_size_mb` must be a positive integer
- `allowed_extensions` must be a list of strings

### Runtime Validation

When processing data, the tool validates:

1. **File extensions** - Files must have allowed extensions
2. **File size limits** - Files must not exceed max_file_size_mb
3. **Data structure** - Input data must be valid for statistical analysis
4. **Variable existence** - Referenced variables must exist in datasets
5. **Data types** - Statistical operations validate appropriate data types

## Operations Supported

The Stats Tool supports comprehensive statistical analysis operations:

### Data Loading and Inspection
- `read_data` - Load data from various file formats
- `describe` - Generate descriptive statistics
- Support for SPSS, SAS, CSV, Excel, JSON, Parquet, and Feather formats

### Descriptive Statistics
- **Basic statistics** - Mean, median, mode, standard deviation, variance
- **Distribution measures** - Skewness, kurtosis
- **Percentiles** - Custom percentile calculations
- **Summary statistics** - Comprehensive data summaries

### Hypothesis Testing
- **t-tests** - Independent and paired t-tests
- **ANOVA** - One-way and two-way analysis of variance
- **Chi-square tests** - Goodness of fit and independence tests
- **Mann-Whitney U test** - Non-parametric alternative to t-test
- **Kruskal-Wallis test** - Non-parametric alternative to ANOVA

### Correlation Analysis
- **Pearson correlation** - Linear correlation coefficient
- **Spearman correlation** - Rank-based correlation
- **Kendall's tau** - Alternative rank correlation
- **Partial correlation** - Controlling for other variables

### Regression Analysis
- **Linear regression** - Simple and multiple linear regression
- **Logistic regression** - Binary and multinomial logistic regression
- **Polynomial regression** - Non-linear relationship modeling
- **Ridge/Lasso regression** - Regularized regression methods

### Advanced Statistical Operations
- **Factor analysis** - Dimensionality reduction
- **Cluster analysis** - K-means and hierarchical clustering
- **Principal component analysis (PCA)** - Data transformation
- **Time series analysis** - Trend and seasonal analysis
- **Survival analysis** - Time-to-event analysis

### Data Transformation
- **Scaling and normalization** - Standard, MinMax, Robust scaling
- **Missing value handling** - Imputation and deletion strategies
- **Outlier detection** - Statistical and machine learning methods
- **Data encoding** - Categorical variable encoding

## Troubleshooting

### Issue: File format not supported

**Error:** `Unsupported file format: .xyz`

**Solutions:**
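1. If the file is in a supported format whose extension is missing from the allowed list, add it back, e.g. `export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".csv",".parquet"]'`. Allowing an extension the tool cannot read (such as `.xyz`) does not make the file loadable.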
2. Convert file to supported format
3. Check file extension spelling

### Issue: File too large

**Error:** `File size exceeds maximum limit`

**Solutions:**
```bash
# Increase file size limit
export STATS_TOOL_MAX_FILE_SIZE_MB=1000

# Or process file in chunks
# Use sampling for large datasets
```

### Issue: Missing dependencies

**Error:** `ModuleNotFoundError: No module named 'pyreadstat'`

**Solutions:**
```bash
# Install required dependencies
pip install pyreadstat scipy statsmodels

# For SPSS files
pip install pyreadstat

# For advanced statistics
pip install scipy statsmodels scikit-learn
```

### Issue: Memory errors

**Error:** `MemoryError` or system becomes unresponsive

**Solutions:**
```bash
# Reduce file size limit
export STATS_TOOL_MAX_FILE_SIZE_MB=100

# Process data in chunks
# Use sampling techniques
# Increase system memory
```
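
The "process data in chunks" advice can be sketched with the standard library alone. The example below streams one numeric CSV column row by row and keeps a running mean and variance using Welford's algorithm, so the whole file never has to fit in memory. The function name `streaming_mean_std` is hypothetical and the code runs outside the Stats Tool; it only illustrates the technique.

```python
import csv
import math
from typing import Tuple


def streaming_mean_std(csv_path: str, column: str) -> Tuple[int, float, float]:
    """Return (count, mean, sample standard deviation) for one numeric column,
    reading one row at a time (Welford's online algorithm)."""
    count, mean, m2 = 0, 0.0, 0.0
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            cell = (row[column] or "").strip()
            if not cell:
                continue  # skip missing values
            value = float(cell)
            count += 1
            delta = value - mean
            mean += delta / count
            m2 += delta * (value - mean)  # accumulates squared deviations
    std = math.sqrt(m2 / (count - 1)) if count > 1 else float("nan")
    return count, mean, std
```

Empty cells are skipped here, which is one possible missing-value policy; a column name that does not exist raises `KeyError`.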

### Issue: Statistical computation errors

**Error:** `AnalysisError: Statistical computation failed`

**Solutions:**
1. Check data quality and missing values
2. Verify variable types and distributions
3. Ensure sufficient sample size
4. Check for outliers and extreme values
5. Validate statistical assumptions

### Issue: Variable not found

**Error:** `Variables not found in dataset: ['variable_name']`

**Solutions:**
1. Check variable names (case-sensitive)
2. Use `read_data` to inspect available variables
3. Verify column names in the dataset
4. Check for typos in variable names
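
Because variable names are case-sensitive, a quick header comparison often reveals the mismatch. The sketch below handles CSV files only and uses a hypothetical helper, `suggest_variable_names`, that is not part of aiecs: for each requested name missing from the header, it lists header names that differ only by case or surrounding whitespace.

```python
import csv
from typing import Dict, List


def suggest_variable_names(csv_path: str, requested: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: map each missing name to near-matching header names."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])  # first row, or [] for an empty file
    suggestions: Dict[str, List[str]] = {}
    for name in requested:
        if name in header:
            continue  # exact match, nothing to report
        key = name.strip().lower()
        suggestions[name] = [col for col in header if col.strip().lower() == key]
    return suggestions
```

An empty suggestion list means the name has no near match in the header at all, which usually points to a typo or the wrong dataset.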

### Issue: List parsing error

**Error:** Configuration parsing fails for `allowed_extensions`

**Solution:**
```bash
# Use proper JSON array syntax
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".csv",".xlsx"]'

# NOT: [.sav,.sas7bdat,.csv,.xlsx] or .sav,.sas7bdat,.csv,.xlsx
```

### Issue: SPSS file reading errors

**Error:** `Error reading SPSS file`

**Solutions:**
1. Verify file is not corrupted
2. Check file encoding
3. Ensure pyreadstat is properly installed
4. Try converting to CSV format
5. Check file permissions

## Best Practices

### Data Quality

1. **Data validation** - Always validate data before analysis
2. **Missing value handling** - Implement appropriate strategies
3. **Outlier detection** - Identify and handle outliers appropriately
4. **Data types** - Ensure correct data types for statistical operations
5. **Sample size** - Verify adequate sample sizes for tests

### Statistical Analysis

1. **Assumption checking** - Verify statistical assumptions before tests
2. **Multiple testing** - Apply corrections for multiple comparisons
3. **Effect sizes** - Report effect sizes alongside p-values
4. **Confidence intervals** - Include confidence intervals in results
5. **Interpretation** - Provide clear interpretation of results
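
As an illustration of the multiple-testing point, the sketch below adjusts a list of p-values with the Bonferroni and Holm procedures in plain Python. The function names are hypothetical and the code is independent of the Stats Tool; whether the tool applies any correction itself is not stated in this guide.

```python
from typing import List, Sequence


def bonferroni_adjust(p_values: Sequence[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjustment; results are returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted
```

An adjusted p-value is then compared with the usual significance level; Holm is never more conservative than Bonferroni on the same input.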

### Performance

1. **File size management** - Use appropriate file size limits
2. **Memory optimization** - Process large datasets in chunks
3. **Caching** - Cache results for repeated analyses
4. **Sampling** - Use sampling for exploratory analysis
5. **Parallel processing** - Use parallel processing for large datasets

### Security

1. **File validation** - Validate file types and sizes
2. **Path sanitization** - Sanitize file paths to prevent directory traversal
3. **Access control** - Implement proper file access controls
4. **Data privacy** - Handle sensitive data appropriately
5. **Audit logging** - Log statistical operations for compliance
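
Path sanitization can be sketched with `pathlib`: resolve the user-supplied path against a fixed data directory and reject anything that ends up outside it. The helper `resolve_inside` is a hypothetical example, not an aiecs API, and it assumes the application keeps all input files under a single base directory.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Hypothetical helper: resolve `user_path` under `base_dir`, refusing escapes."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()  # collapses ".." and follows symlinks
    try:
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes the data directory: {user_path!r}") from None
    return candidate
```

Absolute paths and `..` segments both fail the `relative_to` check unless the resolved result still lies inside the base directory.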

### Development vs Production

**Development:**
```bash
STATS_TOOL_MAX_FILE_SIZE_MB=100
STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]'
```

**Production:**
```bash
STATS_TOOL_MAX_FILE_SIZE_MB=500
STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".csv",".xlsx",".parquet"]'
```

### Error Handling

Always wrap statistical operations in try-except blocks:

```python
from aiecs.tools.task_tools.stats_tool import StatsTool, StatsToolError, FileOperationError, AnalysisError

stats_tool = StatsTool()

try:
    result = stats_tool.ttest("data.csv", "var1", "var2")
except FileOperationError as e:
    print(f"File operation failed: {e}")
except AnalysisError as e:
    print(f"Statistical analysis failed: {e}")
except StatsToolError as e:
    print(f"Stats tool error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Dependencies

### Core Dependencies

```bash
# Install core statistical dependencies
pip install pandas numpy scipy

# Install optional dependencies for advanced features
pip install statsmodels scikit-learn
```

### SPSS/SAS Support

```bash
# Install pyreadstat for SPSS and SAS files
pip install pyreadstat

# Verify installation
python -c "import pyreadstat; print('pyreadstat installed successfully')"
```

### Excel Support

```bash
# Install openpyxl for Excel files
pip install openpyxl

# For older Excel files
pip install xlrd
```

### Parquet/Feather Support

```bash
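# Install pyarrow for Parquet files
pip install pyarrow

# pandas also reads Feather files through pyarrow, so no extra package is needed.
# The older feather-format package is a legacy wrapper around pyarrow.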
```

### Verification

```python
# Test dependency availability
try:
    import pandas as pd
    import numpy as np
    import scipy
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

try:
    import pyreadstat
    print("SPSS/SAS support available")
except ImportError:
    print("SPSS/SAS support not available")

try:
    import statsmodels
    print("Advanced statistics available")
except ImportError:
    print("Advanced statistics not available")
```

## Statistical Interpretation Guide

### Effect Sizes

**Cohen's d (t-tests):**
- 0.2 = Small effect
- 0.5 = Medium effect
- 0.8 = Large effect
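
For reference, Cohen's d for two independent groups is the difference in means divided by the pooled standard deviation. The sketch below computes it with the standard library; `cohens_d` is a hypothetical helper, and the pooled-variance form shown here is an assumption that may differ from what the tool reports.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d for two independent samples, using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
```

For example, `cohens_d([2, 4, 6], [1, 3, 5])` gives 0.5: the means differ by 1 and the pooled standard deviation is 2, which the table above classes as a medium effect.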

**Cramér's V (chi-square):**
- 0.1 = Small effect
- 0.3 = Medium effect
- 0.5 = Large effect

**R² (regression):**
- 0.02 = Small effect
- 0.13 = Medium effect
- 0.26 = Large effect
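
These R² benchmarks line up with Cohen's f² conventions (0.02, 0.15, 0.35) through the relation f² = R² / (1 - R²). The short sketch below performs the conversion; `r2_to_f2` is a hypothetical helper name.

```python
def r2_to_f2(r_squared: float) -> float:
    """Convert R² to Cohen's f² using f² = R² / (1 - R²)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in the range [0, 1)")
    return r_squared / (1.0 - r_squared)
```

Rounded to two decimals, R² values of 0.02, 0.13, and 0.26 map to f² values of 0.02, 0.15, and 0.35.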

### P-value Interpretation

- p < 0.001 = Highly significant
- p < 0.01 = Very significant
- p < 0.05 = Significant
- p < 0.1 = Marginally significant
- p ≥ 0.1 = Not significant

### Sample Size Guidelines

- **t-tests:** Minimum 30 per group
- **ANOVA:** Minimum 20 per group
- **Correlation:** Minimum 30 observations
- **Regression:** Minimum 10 observations per predictor

## Related Documentation

- Tool implementation details in the source code
- Pandas documentation: https://pandas.pydata.org/docs/
- SciPy documentation: https://docs.scipy.org/
- Statsmodels documentation: https://www.statsmodels.org/
- Pyreadstat documentation: https://ofajardo.github.io/pyreadstat_documentation/
- Main aiecs documentation for architecture overview

## Support

For issues or questions about Stats Tool configuration:
- Check the tool source code for implementation details
- Review statistical method documentation for specific operations
- Consult the main aiecs documentation for architecture overview
- Test with small datasets first to isolate configuration vs. data issues
- Monitor memory usage and file size limits
- Validate statistical assumptions and data quality
- Check dependency installation and compatibility