AI Data Analysis Orchestrator Configuration Guide

Overview

The AI Data Analysis Orchestrator coordinates multiple foundation tools to provide natural-language-driven analysis, automated workflow orchestration, and end-to-end analysis execution. It supports six analysis modes (exploratory, diagnostic, predictive, prescriptive, comparative, causal) and coordinates foundation tools including data_loader, data_profiler, data_transformer, data_visualizer, statistical_analyzer, and model_trainer. Configure it either through environment variables that use the AI_DATA_ORCHESTRATOR_ prefix or programmatically when initializing the tool.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The AI Data Analysis Orchestrator reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=10
AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.ai_data_analysis_orchestrator import AIDataAnalysisOrchestrator

# The tool will automatically use the environment variables
orchestrator = AIDataAnalysisOrchestrator()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.ai_data_analysis_orchestrator import AIDataAnalysisOrchestrator
orchestrator = AIDataAnalysisOrchestrator()
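
The branching above can also be written as a lookup table, which keeps the file names in one place. This is a minimal sketch; the helper name env_file_for and the ENV_FILES table are hypothetical and are not part of aiecs or python-dotenv.

```python
import os

# Hypothetical lookup table: APP_ENV value -> .env file to load.
ENV_FILES = {
    "production": ".env.production",
    "staging": ".env.staging",
    "development": ".env.development",
}


def env_file_for(app_env: str) -> str:
    """Return the .env file for app_env, falling back to development."""
    return ENV_FILES.get(app_env, ENV_FILES["development"])


if __name__ == "__main__":
    # Pass this path to load_dotenv() before importing aiecs tools.
    print(env_file_for(os.getenv("APP_ENV", "development")))
```

Unknown APP_ENV values fall back to the development file, which matches the else branch of the original example.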

Example .env.production:

# Production settings - optimized for performance and reliability
AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20
AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true

Example .env.development:

# Development settings - optimized for testing and debugging
AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=5
AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=false
AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=local
AI_DATA_ORCHESTRATOR_ENABLE_CACHING=false

Best Practices for .env Files

  1. Never commit .env files to version control - Add .env to your .gitignore:

    # .gitignore
    .env
    .env.local
    .env.*.local
    .env.production
    .env.staging
    
  2. Provide a template - Create .env.example with documented dummy values:

    # .env.example
    # AI Data Analysis Orchestrator Configuration
    
    # Default analysis mode to use
    AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
    
    # Maximum number of analysis iterations
    AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=10
    
    # Whether to enable automatic workflow generation
    AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
    
    # Default AI provider to use
    AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
    
    # Whether to enable result caching
    AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true
    
  3. Document your variables - Add comments explaining each setting

  4. Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports

  5. Format values correctly:

    • Strings: Plain text: exploratory, openai

    • Integers: Plain numbers: 10, 20

    • Booleans: true or false

Configuration Options

1. Default Mode

Environment Variable: AI_DATA_ORCHESTRATOR_DEFAULT_MODE

Type: String

Default: "exploratory"

Description: Default analysis mode to use for data analysis operations. This mode is used when no specific mode is specified in the analysis request.

Supported Modes:

  • exploratory - Exploratory data analysis (default)

  • diagnostic - Diagnostic analysis

  • predictive - Predictive analysis

  • prescriptive - Prescriptive analysis

  • comparative - Comparative analysis

  • causal - Causal analysis

Example:

export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=predictive

Mode Note: Set this to the mode you request most often. It applies only when an analysis request does not specify a mode of its own.

2. Max Iterations

Environment Variable: AI_DATA_ORCHESTRATOR_MAX_ITERATIONS

Type: Integer

Default: 10

Description: Maximum number of analysis iterations that can be performed in a single analysis workflow. This controls the depth and complexity of analysis operations.

Common Values:

  • 5 - Quick analysis (basic insights)

  • 10 - Standard analysis (default, balanced)

  • 20 - Deep analysis (comprehensive insights)

  • 50 - Maximum analysis (exhaustive exploration)

Example:

export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20

Iteration Note: Higher values provide more comprehensive analysis but may increase processing time and resource usage.
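
To picture what an iteration cap does, the sketch below runs a step function until it reports completion or the cap is reached. It illustrates the general idea only and is not the orchestrator's actual loop; run_bounded and the step callable are hypothetical names.

```python
from typing import Callable


def run_bounded(step: Callable[[int], bool], max_iterations: int = 10) -> int:
    """Call step(i) until it returns True (done) or the cap is reached.

    Returns the number of iterations actually performed.
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be a positive integer")
    for i in range(max_iterations):
        if step(i):
            return i + 1  # finished early, before hitting the cap
    return max_iterations  # cap reached without the step reporting completion
```

A lower cap bounds the worst-case cost of a run; a higher cap only matters when the analysis has not already finished.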

3. Enable Auto Workflow

Environment Variable: AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW

Type: Boolean

Default: True

Description: Whether to enable automatic workflow generation. When enabled, the orchestrator automatically designs analysis workflows based on the data and requirements.

Values:

  • true - Enable auto workflow (default)

  • false - Disable auto workflow

Example:

export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true

Workflow Note: Auto workflow provides intelligent analysis design but may require more computational resources.

4. Default AI Provider

Environment Variable: AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER

Type: String

Default: "openai"

Description: Default AI provider to use for analysis operations. This provider is used when no specific provider is specified in the request.

Supported Providers:

  • openai - OpenAI API (default)

  • anthropic - Anthropic Claude

  • google - Google AI

  • local - Local AI model

Example:

export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=anthropic

Provider Note: Ensure the selected provider is properly configured with API keys and credentials.

5. Enable Caching

Environment Variable: AI_DATA_ORCHESTRATOR_ENABLE_CACHING

Type: Boolean

Default: True

Description: Whether to enable result caching. When enabled, analysis results are cached to improve performance for similar requests.

Values:

  • true - Enable caching (default)

  • false - Disable caching

Example:

export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true

Caching Note: Caching improves performance but requires additional memory and storage.
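
The trade-off in the note above can be seen in a small in-memory cache keyed by the request parameters. This is an illustrative sketch, not the orchestrator's cache: ResultCache and cache_key are hypothetical names, and it reuses results only for identical requests.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def cache_key(request: Dict[str, Any]) -> str:
    """Build a stable key: equal parameters give the same key in any dict order."""
    payload = json.dumps(request, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    """In-memory result cache that can be switched off, like ENABLE_CACHING."""

    def __init__(self, enabled: bool = True) -> None:
        self.enabled = enabled
        self._store: Dict[str, Any] = {}

    def get_or_compute(
        self, request: Dict[str, Any], compute: Callable[[Dict[str, Any]], Any]
    ) -> Any:
        if not self.enabled:
            return compute(request)  # caching disabled: always recompute
        key = cache_key(request)
        if key not in self._store:
            self._store[key] = compute(request)  # first request: compute and keep
        return self._store[key]
```

Every stored result stays in memory until the object is discarded, which is the memory cost the note refers to.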

Usage Examples

Example 1: Basic Environment Configuration

# Set basic analysis parameters
export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=10
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true

# Run your application
python app.py

Example 2: High-Performance Configuration

# Optimized for comprehensive analysis
export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true

Example 3: Development Configuration

# Development-friendly settings
export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=5
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=false
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=local
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=false

Example 4: Programmatic Configuration

from aiecs.tools.statistics.ai_data_analysis_orchestrator import AIDataAnalysisOrchestrator

# Initialize with custom configuration
orchestrator = AIDataAnalysisOrchestrator(config={
    'default_mode': 'exploratory',
    'max_iterations': 10,
    'enable_auto_workflow': True,
    'default_ai_provider': 'openai',
    'enable_caching': True
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

# Shell: set environment defaults
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=10
export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory

# Python: override the defaults for a specific instance
from aiecs.tools.statistics.ai_data_analysis_orchestrator import AIDataAnalysisOrchestrator

orchestrator = AIDataAnalysisOrchestrator(config={
    'max_iterations': 20,  # This overrides the environment variable
    'default_mode': 'predictive'  # This overrides the environment variable
})

Configuration Priority

When the AI Data Analysis Orchestrator is initialized, configuration values are resolved in the following order (highest to lowest priority):

  1. Programmatic config - Values passed to the constructor

  2. Environment variables - Values set via AI_DATA_ORCHESTRATOR_* variables

  3. Default values - Built-in defaults as specified above
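
The resolution order above can be sketched as a single lookup per setting. This is a minimal illustration of the stated priority, not the tool's implementation; resolve_config is a hypothetical name, and values taken from the environment are left as raw strings to keep the example short.

```python
import os
from typing import Any, Dict, Mapping, Optional

PREFIX = "AI_DATA_ORCHESTRATOR_"

# Built-in defaults as listed in this guide.
DEFAULTS: Dict[str, Any] = {
    "default_mode": "exploratory",
    "max_iterations": 10,
    "enable_auto_workflow": True,
    "default_ai_provider": "openai",
    "enable_caching": True,
}


def resolve_config(
    config: Optional[Mapping[str, Any]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Resolve each setting: programmatic config, then environment, then default."""
    config = {} if config is None else config
    environ = os.environ if environ is None else environ
    resolved: Dict[str, Any] = {}
    for name, default in DEFAULTS.items():
        env_name = PREFIX + name.upper()
        if name in config:
            resolved[name] = config[name]        # 1. programmatic config
        elif env_name in environ:
            resolved[name] = environ[env_name]   # 2. environment (raw string here)
        else:
            resolved[name] = default             # 3. built-in default
    return resolved
```

For example, with AI_DATA_ORCHESTRATOR_DEFAULT_MODE=causal in the environment and config={'default_mode': 'predictive'}, the resolved mode is predictive.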

Data Type Parsing

String Values

Strings should be provided as plain text without quotes:

export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai

Integer Values

Integers should be provided as plain digits, without quotes or decimal points:

export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=10
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20

Boolean Values

Booleans should be provided as lowercase strings:

export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=false
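
Environment variables always arrive as strings, so each value has to be converted to its declared type. The sketch below shows one strict way to do that for the formats described in this section. parse_bool and parse_int are hypothetical helpers, and the tool's own parsing may accept more spellings than this.

```python
def parse_bool(raw: str) -> bool:
    """Accept only the lowercase spellings recommended in this guide."""
    value = raw.strip()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {raw!r}")


def parse_int(raw: str) -> int:
    """Convert a plain-digit string such as '10' to an int."""
    return int(raw.strip())  # raises ValueError for input such as '10.5' or 'ten'
```

Rejecting anything other than true or false surfaces a typo immediately instead of letting it silently become a default.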

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

  • default_mode must be a valid analysis mode string

  • max_iterations must be a positive integer

  • enable_auto_workflow must be a boolean

  • default_ai_provider must be a valid provider string

  • enable_caching must be a boolean
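
The rules above can be approximated in plain Python to check a config dictionary before passing it to the constructor. This sketch uses only the standard library and is not the tool's Pydantic schema. validate_config is a hypothetical helper, and the allowed values are copied from the lists in this guide.

```python
from typing import Any, List, Mapping

VALID_MODES = {
    "exploratory", "diagnostic", "predictive",
    "prescriptive", "comparative", "causal",
}
VALID_PROVIDERS = {"openai", "anthropic", "google", "local"}


def validate_config(cfg: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []

    mode = cfg.get("default_mode")
    if not isinstance(mode, str) or mode not in VALID_MODES:
        errors.append(f"default_mode: unsupported value {mode!r}")

    iterations = cfg.get("max_iterations")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(iterations, bool) or not isinstance(iterations, int) or iterations <= 0:
        errors.append("max_iterations: must be a positive integer")

    for flag in ("enable_auto_workflow", "enable_caching"):
        if not isinstance(cfg.get(flag), bool):
            errors.append(f"{flag}: must be a boolean")

    provider = cfg.get("default_ai_provider")
    if not isinstance(provider, str) or provider not in VALID_PROVIDERS:
        errors.append(f"default_ai_provider: unsupported value {provider!r}")

    return errors
```

Collecting all problems in one pass, rather than raising on the first, makes it easier to fix a misconfigured file in one go.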

Runtime Validation

When performing analysis, the tool validates:

  1. Analysis mode - Mode must be supported

  2. Iteration limits - Analysis must not exceed max iterations

  3. AI provider availability - Provider must be configured

  4. Workflow constraints - Auto workflow must be properly configured

  5. Caching requirements - Cache must be accessible if enabled

Analysis Modes

The AI Data Analysis Orchestrator supports various analysis modes:

Basic Modes

  • Exploratory - Initial data exploration and discovery

  • Diagnostic - Root cause analysis and problem diagnosis

  • Predictive - Future trend prediction and forecasting

  • Prescriptive - Actionable recommendations and solutions

Advanced Modes

  • Comparative - Compare different datasets or time periods

  • Causal - Identify cause-and-effect relationships

AI Providers

Supported Providers

  • OpenAI - OpenAI API integration

  • Anthropic - Anthropic Claude integration

  • Google - Google AI integration

  • Local - Local AI model integration

Provider Configuration

Each provider requires specific configuration:

OpenAI:

export OPENAI_API_KEY=your-api-key
export OPENAI_ORG_ID=your-org-id  # Optional

Anthropic:

export ANTHROPIC_API_KEY=your-api-key

Google:

export GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json
export GOOGLE_CLOUD_PROJECT=your-project-id

Local:

export LOCAL_MODEL_PATH=path/to/model
export LOCAL_MODEL_TYPE=llama2  # or other model type
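
A quick pre-flight check can report which of the variables listed above are missing for a given provider. This is a minimal sketch: missing_provider_vars is a hypothetical helper, the variable names come from this section, and OPENAI_ORG_ID is left out because it is optional.

```python
import os
from typing import Dict, List, Mapping, Optional

# Required variables per provider, as listed in this guide.
REQUIRED_ENV: Dict[str, List[str]] = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "google": ["GOOGLE_APPLICATION_CREDENTIALS", "GOOGLE_CLOUD_PROJECT"],
    "local": ["LOCAL_MODEL_PATH", "LOCAL_MODEL_TYPE"],
}


def missing_provider_vars(
    provider: str, environ: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the required variables that are unset or empty for provider."""
    environ = os.environ if environ is None else environ
    if provider not in REQUIRED_ENV:
        raise ValueError(f"unknown provider: {provider!r}")
    return [name for name in REQUIRED_ENV[provider] if not environ.get(name)]
```

Running this at startup, after load_dotenv(), turns a missing credential into a clear message before any analysis request is made.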

Operations Supported

The AI Data Analysis Orchestrator supports comprehensive data analysis operations:

Basic Analysis

  • analyze_data - Perform comprehensive data analysis

  • exploratory_analysis - Perform exploratory data analysis

  • diagnostic_analysis - Perform diagnostic analysis

  • predictive_analysis - Perform predictive analysis

  • prescriptive_analysis - Perform prescriptive analysis

Advanced Analysis

  • comparative_analysis - Compare different datasets

  • causal_analysis - Identify causal relationships

  • workflow_analysis - Execute custom analysis workflows

  • iterative_analysis - Perform iterative analysis with feedback

Workflow Management

  • design_workflow - Design analysis workflows

  • execute_workflow - Execute analysis workflows

  • optimize_workflow - Optimize workflow performance

  • cache_workflow - Cache workflow results

Tool Coordination

  • coordinate_tools - Coordinate multiple analysis tools

  • integrate_results - Integrate results from multiple tools

  • validate_analysis - Validate analysis results

  • generate_report - Generate comprehensive analysis reports

Troubleshooting

Issue: AI Provider not available

Error: OrchestratorError when calling AI providers

Solutions:

# Check provider configuration
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai

# Verify API keys
export OPENAI_API_KEY=your-valid-api-key

# Test with local provider
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=local

Issue: Analysis workflow fails

Error: WorkflowError during workflow execution

Solutions:

  1. Check foundation tool availability

  2. Verify data accessibility

  3. Check workflow configuration

  4. Validate analysis parameters

Issue: Max iterations exceeded

Error: Analysis exceeds maximum iterations

Solutions:

# Increase max iterations
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20

# Or optimize analysis workflow
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true

Issue: Caching problems

Error: Cache operations fail

Solutions:

# Disable caching for testing
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=false

# Check cache directory permissions
# Verify cache configuration

Issue: Auto workflow issues

Error: Auto workflow generation fails

Solutions:

# Disable auto workflow for testing
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=false

# Check AI provider configuration
# Verify workflow templates

Issue: Foundation tool errors

Error: Foundation tool operations fail

Solutions:

  1. Check tool availability and dependencies

  2. Verify data format compatibility

  3. Check tool configuration

  4. Validate input data

Best Practices

Performance Optimization

  1. Iteration Management - Set appropriate max iterations

  2. Caching Strategy - Enable caching for repeated analyses

  3. Workflow Optimization - Use auto workflow for efficiency

  4. Provider Selection - Choose providers based on task requirements

  5. Resource Management - Monitor memory and CPU usage

Error Handling

  1. Graceful Degradation - Handle tool failures gracefully

  2. Retry Logic - Implement retry for transient failures

  3. Fallback Strategies - Provide fallback analysis methods

  4. Error Logging - Log errors for debugging and monitoring

  5. User Feedback - Provide clear error messages
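
The retry item above can be sketched as a small wrapper with exponential backoff. This is a generic illustration, not something aiecs provides: retry is a hypothetical helper, and which exceptions count as transient is an assumption you supply through retry_on.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on the given exceptions with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

Pass a narrow retry_on tuple so that configuration errors fail immediately instead of being retried.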

Security

  1. API Key Management - Secure storage of API keys

  2. Data Privacy - Ensure data privacy in analysis

  3. Access Control - Control access to analysis tools

  4. Audit Logging - Log analysis activities for compliance

  5. Data Validation - Validate input data before analysis

Resource Management

  1. Memory Usage - Monitor memory consumption during analysis

  2. API Rate Limits - Respect provider rate limits

  3. Cost Management - Monitor and control analysis costs

  4. Processing Time - Set reasonable timeouts

  5. Cleanup - Clean up temporary files and resources

Integration

  1. Tool Dependencies - Ensure required tools are available

  2. API Compatibility - Maintain API compatibility

  3. Error Propagation - Properly propagate errors

  4. Logging Integration - Integrate with logging systems

  5. Monitoring - Monitor tool performance and usage

Development vs Production

Development:

AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=5
AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=false
AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=local
AI_DATA_ORCHESTRATOR_ENABLE_CACHING=false

Production:

AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20
AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true

Error Handling Example

Always wrap analysis operations in try-except blocks:

from aiecs.tools.statistics.ai_data_analysis_orchestrator import AIDataAnalysisOrchestrator, OrchestratorError, WorkflowError

orchestrator = AIDataAnalysisOrchestrator()

try:
    result = orchestrator.analyze_data(
        data_source="dataset.csv",
        analysis_mode="exploratory",
        max_iterations=10
    )
except WorkflowError as e:
    print(f"Workflow error: {e}")
except OrchestratorError as e:
    print(f"Orchestrator error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Dependencies

Core Dependencies

# Install core dependencies
pip install pydantic python-dotenv pandas

# Install AI provider dependencies
pip install openai anthropic google-cloud-aiplatform

# Install analysis dependencies
pip install numpy scipy scikit-learn matplotlib seaborn

Optional Dependencies

# For advanced analysis
pip install plotly dash streamlit

# For machine learning
pip install xgboost lightgbm catboost

# For statistical analysis
pip install statsmodels pingouin

# For data processing
pip install dask vaex

Verification

# Test dependency availability
try:
    import pydantic
    import pandas
    import numpy
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test AI provider availability
try:
    import openai
    print("OpenAI available")
except ImportError:
    print("OpenAI not available")

try:
    import anthropic
    print("Anthropic available")
except ImportError:
    print("Anthropic not available")

# Test analysis tool availability
try:
    from aiecs.tools.statistics.data_loader import DataLoader
    from aiecs.tools.statistics.data_profiler import DataProfiler
    print("Foundation tools available")
except ImportError:
    print("Foundation tools not available")
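
The same availability check can be done without importing each package, using importlib.util.find_spec from the standard library. This is a compact alternative sketch; check_modules is a hypothetical helper, and it is meant for top-level package names only, because looking up a dotted name imports its parent package.

```python
from importlib.util import find_spec
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for module, available in check_modules(
        ["pydantic", "pandas", "numpy", "openai", "anthropic"]
    ).items():
        print(f"{module}: {'available' if available else 'missing'}")
```

Because nothing is imported, this check has no side effects and stays fast even for heavy packages.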

Support

For issues or questions about AI Data Analysis Orchestrator configuration:

  • Check the tool source code for implementation details

  • Review foundation tool documentation for specific features

  • Consult the main aiecs documentation for architecture overview

  • Test with simple datasets first to isolate configuration vs. analysis issues

  • Monitor API rate limits and costs

  • Verify AI provider configuration and credentials

  • Ensure proper iteration and workflow limits

  • Check foundation tool availability and configuration

  • Validate analysis mode and provider compatibility