Stats Tool Configuration Guide

Overview

The Stats Tool provides statistical analysis for several data formats: SPSS (.sav, .por), SAS (.sas7bdat), CSV, Excel, JSON, Parquet, and Feather. It supports descriptive statistics, hypothesis testing (t-tests, ANOVA), correlation analysis, regression analysis, and advanced statistical operations. Configure it through environment variables with the STATS_TOOL_ prefix, or programmatically when initializing the tool.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Stats Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
STATS_TOOL_MAX_FILE_SIZE_MB=200
STATS_TOOL_ALLOWED_EXTENSIONS=[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.task_tools.stats_tool import StatsTool

# The tool will automatically use the environment variables
stats_tool = StatsTool()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.task_tools.stats_tool import StatsTool
stats_tool = StatsTool()

Example .env.production:

# Production settings - optimized for large datasets
STATS_TOOL_MAX_FILE_SIZE_MB=500
STATS_TOOL_ALLOWED_EXTENSIONS=[".sav",".sas7bdat",".csv",".xlsx",".parquet"]

Example .env.development:

# Development settings - more permissive for testing
STATS_TOOL_MAX_FILE_SIZE_MB=100
STATS_TOOL_ALLOWED_EXTENSIONS=[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]

Best Practices for .env Files

Never commit .env files to version control - Add .env to your .gitignore:

# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging

Provide a template - Create .env.example with documented dummy values:

# .env.example
# Stats Tool Configuration

# Maximum file size in megabytes
STATS_TOOL_MAX_FILE_SIZE_MB=200

# Allowed file extensions (JSON array)
STATS_TOOL_ALLOWED_EXTENSIONS=[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]

Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format complex types correctly:
- Integers: plain numbers, such as 200 or 500
- Lists: JSON arrays with double-quoted strings, such as [".sav",".csv",".xlsx"]

Configuration Options

1. Max File Size MB

Environment Variable: STATS_TOOL_MAX_FILE_SIZE_MB

Type: Integer

Default: 200

Description: Maximum file size in megabytes for data files. This prevents memory issues with extremely large datasets and ensures reasonable processing times.

Common Values:

50 - Small datasets (development)
100 - Medium datasets (testing)
200 - Large datasets (default)
500 - Very large datasets (production)
1000 - Massive datasets (enterprise)

Example:

export STATS_TOOL_MAX_FILE_SIZE_MB=500

Memory Note: Larger values allow processing bigger files but use more memory. Adjust based on available system resources.

2. Allowed Extensions

Environment Variable: STATS_TOOL_ALLOWED_EXTENSIONS

Type: List[str]

Default: ['.sav', '.sas7bdat', '.por', '.csv', '.xlsx', '.xls', '.json', '.parquet', '.feather']

Description: List of allowed file extensions for statistical analysis. This is a security feature that prevents processing of unauthorized file types.

Format: JSON array string with double quotes

Supported Formats:

.sav - SPSS data files
.sas7bdat - SAS data files
.por - SPSS portable files
.csv - Comma-separated values
.xlsx - Excel 2007+ files
.xls - Excel 97-2003 files
.json - JSON data files
.parquet - Apache Parquet files
.feather - Feather format files

Example:

# Allow all supported formats
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]'

# Restrict to common formats only
export STATS_TOOL_ALLOWED_EXTENSIONS='[".csv",".xlsx",".json"]'

# SPSS/SAS only
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".por"]'

Security Note: Only allow extensions that your application actually needs to process.

Usage Examples

Example 1: Basic Environment Configuration

# Set custom file size limit
export STATS_TOOL_MAX_FILE_SIZE_MB=500
export STATS_TOOL_ALLOWED_EXTENSIONS='[".csv",".xlsx",".sav"]'

# Run your application
python app.py

Example 2: Production Configuration

# Production settings - optimized for large datasets
export STATS_TOOL_MAX_FILE_SIZE_MB=1000
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".csv",".xlsx",".parquet"]'

Example 3: Development Configuration

# Development settings - permissive for testing
export STATS_TOOL_MAX_FILE_SIZE_MB=100
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]'

Example 4: Programmatic Configuration

from aiecs.tools.task_tools.stats_tool import StatsTool

# Initialize with custom configuration
stats_tool = StatsTool(config={
    'max_file_size_mb': 500,
    'allowed_extensions': ['.sav', '.sas7bdat', '.csv', '.xlsx']
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

# Shell: set environment defaults
export STATS_TOOL_MAX_FILE_SIZE_MB=200
export STATS_TOOL_ALLOWED_EXTENSIONS='[".csv",".xlsx"]'

# Python: override for a specific instance
from aiecs.tools.task_tools.stats_tool import StatsTool

stats_tool = StatsTool(config={
    'max_file_size_mb': 500,  # This overrides STATS_TOOL_MAX_FILE_SIZE_MB
    'allowed_extensions': ['.sav', '.sas7bdat']  # This overrides STATS_TOOL_ALLOWED_EXTENSIONS
})

Configuration Priority

When the Stats Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

Programmatic config - Values passed to the constructor
Environment variables - Values set via STATS_TOOL_* variables
Default values - Built-in defaults as specified above
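
The resolution order above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's actual implementation: resolve_config, DEFAULTS, and ENV_PREFIX are hypothetical names, and only the STATS_TOOL_ prefix, the two setting names, and their defaults come from this guide.

```python
import json
import os

# Built-in defaults as documented in this guide.
DEFAULTS = {
    "max_file_size_mb": 200,
    "allowed_extensions": [".sav", ".sas7bdat", ".por", ".csv", ".xlsx",
                           ".xls", ".json", ".parquet", ".feather"],
}
ENV_PREFIX = "STATS_TOOL_"


def resolve_config(overrides=None, environ=None):
    """Hypothetical helper: defaults < environment < programmatic config."""
    environ = os.environ if environ is None else environ
    resolved = dict(DEFAULTS)
    for key in DEFAULTS:
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw is not None:
            # Both documented settings are valid JSON: an integer or an array.
            resolved[key] = json.loads(raw)
    resolved.update(overrides or {})
    return resolved


env = {"STATS_TOOL_MAX_FILE_SIZE_MB": "500"}
print(resolve_config(environ=env)["max_file_size_mb"])
print(resolve_config({"max_file_size_mb": 50}, environ=env)["max_file_size_mb"])
```

The first call picks up the environment value and the second shows a constructor-style override winning over it.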

Data Type Parsing

Integer Values

Integers should be provided as numeric strings:

export STATS_TOOL_MAX_FILE_SIZE_MB=200

List Values

Lists must be provided as JSON arrays with double quotes:

# Correct
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".csv",".xlsx"]'

# Incorrect (will not parse)
export STATS_TOOL_ALLOWED_EXTENSIONS=".sav,.sas7bdat,.csv,.xlsx"

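Important: Wrap the whole value in single quotes for the shell, and use double quotes around each string inside the JSON array:

export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".csv",".xlsx"]'
# Outer single quotes: consumed by the shell, so the JSON reaches the process intact
# Inner double quotes: required by JSON around each string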
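
To see why the comma-separated form fails, it helps to run the values through a JSON decoder. The sketch below assumes the list setting is decoded as JSON, as the format rule above implies; parse_extensions is a hypothetical helper, not a function exported by the tool.

```python
import json


def parse_extensions(raw):
    """Hypothetical helper: decode a JSON array of extension strings."""
    value = json.loads(raw)  # raises json.JSONDecodeError on non-JSON input
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("expected a JSON array of strings")
    return value


print(parse_extensions('[".sav",".csv"]'))

# Comma-separated text, unquoted items, and single-quoted items are all invalid JSON.
for bad in (".sav,.csv", "[.sav,.csv]", "['.sav','.csv']"):
    try:
        parse_extensions(bad)
    except ValueError as exc:  # JSONDecodeError is a subclass of ValueError
        print(f"rejected {bad!r}: {type(exc).__name__}")
```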

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

max_file_size_mb must be a positive integer
allowed_extensions must be a list of strings

Runtime Validation

When processing data, the tool validates:

File extensions - Files must have allowed extensions
File size limits - Files must not exceed max_file_size_mb
Data structure - Input data must be valid for statistical analysis
Variable existence - Referenced variables must exist in datasets
Data types - Statistical operations validate appropriate data types
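
The first two checks can be approximated before a file is ever handed to the tool. This is a minimal sketch: check_file is a hypothetical helper, and the case-insensitive extension match is an assumption rather than documented behavior.

```python
import tempfile
from pathlib import Path


def check_file(path, allowed_extensions, max_file_size_mb):
    """Hypothetical pre-flight check for extension and size."""
    path = Path(path)
    suffix = path.suffix.lower()  # assumption: extensions compare case-insensitively
    if suffix not in allowed_extensions:
        raise ValueError(f"Unsupported file format: {suffix}")
    size_mb = path.stat().st_size / (1024 * 1024)
    if size_mb > max_file_size_mb:
        raise ValueError(f"File size {size_mb:.1f} MB exceeds {max_file_size_mb} MB")
    return path


# Demonstrate with a throwaway 2 KB CSV file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.csv"
    sample.write_bytes(b"x" * 2048)
    print(check_file(sample, [".csv"], max_file_size_mb=1).name)
```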

Operations Supported

The Stats Tool supports comprehensive statistical analysis operations:

Data Loading and Inspection

read_data - Load data from various file formats
describe - Generate descriptive statistics
Support for SPSS, SAS, CSV, Excel, JSON, Parquet, and Feather formats

Descriptive Statistics

Basic statistics - Mean, median, mode, standard deviation, variance
Distribution measures - Skewness, kurtosis
Percentiles - Custom percentile calculations
Summary statistics - Comprehensive data summaries

Hypothesis Testing

t-tests - Independent and paired t-tests
ANOVA - One-way and two-way analysis of variance
Chi-square tests - Goodness of fit and independence tests
Mann-Whitney U test - Non-parametric alternative to t-test
Kruskal-Wallis test - Non-parametric alternative to ANOVA

Correlation Analysis

Pearson correlation - Linear correlation coefficient
Spearman correlation - Rank-based correlation
Kendall’s tau - Alternative rank correlation
Partial correlation - Controlling for other variables

Regression Analysis

Linear regression - Simple and multiple linear regression
Logistic regression - Binary and multinomial logistic regression
Polynomial regression - Non-linear relationship modeling
Ridge/Lasso regression - Regularized regression methods

Advanced Statistical Operations

Factor analysis - Dimensionality reduction
Cluster analysis - K-means and hierarchical clustering
Principal component analysis (PCA) - Data transformation
Time series analysis - Trend and seasonal analysis
Survival analysis - Time-to-event analysis

Data Transformation

Scaling and normalization - Standard, MinMax, Robust scaling
Missing value handling - Imputation and deletion strategies
Outlier detection - Statistical and machine learning methods
Data encoding - Categorical variable encoding

Troubleshooting

Issue: File format not supported

Error: Unsupported file format: .xyz

Solutions:

Add extension to allowed list: export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".csv",".xyz"]'
Convert file to supported format
Check file extension spelling

Issue: File too large

Error: File size exceeds maximum limit

Solutions:

# Increase file size limit
export STATS_TOOL_MAX_FILE_SIZE_MB=1000

# Or process file in chunks
# Use sampling for large datasets
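
As an illustration of processing a file in chunks, the sketch below computes a column mean one row at a time, so memory use stays constant regardless of file size. It uses only the standard library and is independent of the Stats Tool; streaming_mean is a hypothetical helper. With pandas, the chunksize parameter of read_csv gives a similar streaming pattern.

```python
import csv
import io


def streaming_mean(rows, column):
    """Hypothetical helper: mean of one column, reading one row at a time."""
    count = 0
    mean = 0.0
    for row in rows:
        cell = row.get(column, "")
        if cell is None or cell.strip() == "":
            continue  # skip missing values
        count += 1
        mean += (float(cell) - mean) / count  # running-mean update
    if count == 0:
        raise ValueError(f"no numeric values in column {column!r}")
    return mean


# An in-memory file stands in for a large CSV on disk.
data = io.StringIO("score\n10\n20\n\n30\n")
print(streaming_mean(csv.DictReader(data), "score"))
```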

Issue: Missing dependencies

Error: ModuleNotFoundError: No module named 'pyreadstat'

Solutions:

# For SPSS and SAS files
pip install pyreadstat

# For advanced statistics
pip install scipy statsmodels scikit-learn

Issue: Memory errors

Error: MemoryError or system becomes unresponsive

Solutions:

# Reduce file size limit
export STATS_TOOL_MAX_FILE_SIZE_MB=100

# Process data in chunks
# Use sampling techniques
# Increase system memory

Issue: Statistical computation errors

Error: AnalysisError: Statistical computation failed

Solutions:

Check data quality and missing values
Verify variable types and distributions
Ensure sufficient sample size
Check for outliers and extreme values
Validate statistical assumptions

Issue: Variable not found

Error: Variables not found in dataset: ['variable_name']

Solutions:

Check variable names (case-sensitive)
Use read_data to inspect available variables
Verify column names in the dataset
Check for typos in variable names

Issue: List parsing error

Error: Configuration parsing fails for allowed_extensions

Solution:

# Use proper JSON array syntax
export STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".csv",".xlsx"]'

# NOT: [.sav,.sas7bdat,.csv,.xlsx] or .sav,.sas7bdat,.csv,.xlsx

Issue: SPSS file reading errors

Error: Error reading SPSS file

Solutions:

Verify file is not corrupted
Check file encoding
Ensure pyreadstat is properly installed
Try converting to CSV format
Check file permissions

Best Practices

Data Quality

Data validation - Always validate data before analysis
Missing value handling - Implement appropriate strategies
Outlier detection - Identify and handle outliers appropriately
Data types - Ensure correct data types for statistical operations
Sample size - Verify adequate sample sizes for tests

Statistical Analysis

Assumption checking - Verify statistical assumptions before tests
Multiple testing - Apply corrections for multiple comparisons
Effect sizes - Report effect sizes alongside p-values
Confidence intervals - Include confidence intervals in results
Interpretation - Provide clear interpretation of results
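
For the multiple-testing point, one common correction is Holm's step-down method. The sketch below is a plain-Python illustration. This guide does not state which correction, if any, the tool applies, and holm_adjust is a hypothetical name.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


print(holm_adjust([0.01, 0.04, 0.03]))
```

An adjusted value is then compared against the usual significance level, for example 0.05.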

Performance

File size management - Use appropriate file size limits
Memory optimization - Process large datasets in chunks
Caching - Cache results for repeated analyses
Sampling - Use sampling for exploratory analysis
Parallel processing - Use parallel processing for large datasets

Security

File validation - Validate file types and sizes
Path sanitization - Sanitize file paths to prevent directory traversal
Access control - Implement proper file access controls
Data privacy - Handle sensitive data appropriately
Audit logging - Log statistical operations for compliance
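
Path sanitization can be sketched with pathlib: resolve the requested path against a fixed base directory and refuse anything that lands outside it. resolve_inside and the data_root directory are hypothetical names used only for illustration.

```python
from pathlib import Path


def resolve_inside(base_dir, user_path):
    """Hypothetical helper: resolve user_path and require it to stay under base_dir."""
    base = Path(base_dir).resolve()
    target = (base / user_path).resolve()
    # relative_to raises ValueError when target is not inside base.
    target.relative_to(base)
    return target


base = Path("data_root")
print(resolve_inside(base, "survey/results.csv"))
for attempt in ("../secrets.csv", "/etc/passwd"):
    try:
        resolve_inside(base, attempt)
    except ValueError:
        print(f"rejected: {attempt}")
```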

Development vs Production

Development:

STATS_TOOL_MAX_FILE_SIZE_MB=100
STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".por",".csv",".xlsx",".xls",".json",".parquet",".feather"]'

Production:

STATS_TOOL_MAX_FILE_SIZE_MB=500
STATS_TOOL_ALLOWED_EXTENSIONS='[".sav",".sas7bdat",".csv",".xlsx",".parquet"]'

Error Handling

Always wrap statistical operations in try-except blocks:

from aiecs.tools.task_tools.stats_tool import StatsTool, StatsToolError, FileOperationError, AnalysisError

stats_tool = StatsTool()

try:
    result = stats_tool.ttest("data.csv", "var1", "var2")
except FileOperationError as e:
    print(f"File operation failed: {e}")
except AnalysisError as e:
    print(f"Statistical analysis failed: {e}")
except StatsToolError as e:
    print(f"Stats tool error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Dependencies

Core Dependencies

# Install core statistical dependencies
pip install pandas numpy scipy

# Install optional dependencies for advanced features
pip install statsmodels scikit-learn

SPSS/SAS Support

# Install pyreadstat for SPSS and SAS files
pip install pyreadstat

# Verify installation
python -c "import pyreadstat; print('pyreadstat installed successfully')"

Excel Support

# Install openpyxl for Excel files
pip install openpyxl

# For older Excel files
pip install xlrd

Parquet/Feather Support

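# Install pyarrow for Parquet files
pip install pyarrow

# pyarrow also reads and writes Feather files; feather-format is an
# older wrapper around it, needed only if your code imports it directly
pip install feather-format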

Verification

# Test dependency availability
try:
    import pandas as pd
    import numpy as np
    import scipy
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

try:
    import pyreadstat
    print("SPSS/SAS support available")
except ImportError:
    print("SPSS/SAS support not available")

try:
    import statsmodels
    print("Advanced statistics available")
except ImportError:
    print("Advanced statistics not available")

Statistical Interpretation Guide

Effect Sizes

Cohen’s d (t-tests):

0.2 = Small effect
0.5 = Medium effect
0.8 = Large effect

Cramer’s V (chi-square):

0.1 = Small effect
0.3 = Medium effect
0.5 = Large effect

R² (regression):

0.02 = Small effect
0.13 = Medium effect
0.26 = Large effect
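
As a worked example of the first table, Cohen's d for two independent groups is the difference in means divided by the pooled standard deviation. The sketch below uses only the standard library. The function names are hypothetical, and the "negligible" label for values under 0.2 is an addition for illustration.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def label_effect(d):
    """Map |d| onto the conventional thresholds listed above."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"  # label added for illustration


treatment = [2, 4, 6, 8]
control = [1, 3, 5, 7]
d = cohens_d(treatment, control)
print(f"d = {d:.2f} ({label_effect(d)})")
```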

P-value Interpretation

p < 0.001 = Highly significant
p < 0.01 = Very significant
p < 0.05 = Significant
p < 0.1 = Marginally significant
p ≥ 0.1 = Not significant

Sample Size Guidelines

t-tests: Minimum 30 per group
ANOVA: Minimum 20 per group
Correlation: Minimum 30 observations
Regression: Minimum 10 observations per predictor

Support

For issues or questions about Stats Tool configuration:

Check the tool source code for implementation details
Review statistical method documentation for specific operations
Consult the main aiecs documentation for architecture overview
Test with small datasets first to isolate configuration vs. data issues
Monitor memory usage and file size limits
Validate statistical assumptions and data quality
Check dependency installation and compatibility
