# AI Data Analysis Orchestrator Configuration Guide

## Overview

The AI Data Analysis Orchestrator is a powerful tool that coordinates multiple foundation tools to provide natural language driven analysis, automated workflow orchestration, multi-tool coordination, and comprehensive analysis execution. It supports various analysis modes (exploratory, diagnostic, predictive, prescriptive, comparative, causal) and coordinates foundation tools including data_loader, data_profiler, data_transformer, data_visualizer, statistical_analyzer, and model_trainer.

The tool can be configured via environment variables using the `AI_DATA_ORCHESTRATOR_` prefix or through programmatic configuration when initializing the tool.

## Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The AI Data Analysis Orchestrator reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools.

### Setting Up .env Files

**1. Install python-dotenv:**

```bash
pip install python-dotenv
```

**2. Create a `.env` file in your project root:**

```bash
# .env file in your project root
AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=10
AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true
```

**3.
Load the .env file in your application:**

```python
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.ai_data_analysis_orchestrator import AIDataAnalysisOrchestrator

# The tool will automatically use the environment variables
orchestrator = AIDataAnalysisOrchestrator()
```

### Multiple Environment Files

You can use different `.env` files for different environments:

```python
import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.ai_data_analysis_orchestrator import AIDataAnalysisOrchestrator
orchestrator = AIDataAnalysisOrchestrator()
```

**Example `.env.production`:**

```bash
# Production settings - optimized for performance and reliability
AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20
AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true
```

**Example `.env.development`:**

```bash
# Development settings - optimized for testing and debugging
AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=5
AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=false
AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=local
AI_DATA_ORCHESTRATOR_ENABLE_CACHING=false
```

### Best Practices for .env Files

1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`:
   ```gitignore
   # .gitignore
   .env
   .env.local
   .env.*.local
   .env.production
   .env.staging
   ```

2.
   **Provide a template** - Create `.env.example` with documented dummy values:
   ```bash
   # .env.example
   # AI Data Analysis Orchestrator Configuration

   # Default analysis mode to use
   AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory

   # Maximum number of analysis iterations
   AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=10

   # Whether to enable automatic workflow generation
   AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true

   # Default AI provider to use
   AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai

   # Whether to enable result caching
   AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true
   ```

3. **Document your variables** - Add comments explaining each setting

4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports

5. **Format values correctly**:
   - Strings: Plain text: `exploratory`, `openai`
   - Integers: Plain numbers: `10`, `20`
   - Booleans: `true` or `false`

## Configuration Options

### 1. Default Mode

**Environment Variable:** `AI_DATA_ORCHESTRATOR_DEFAULT_MODE`

**Type:** String

**Default:** `"exploratory"`

**Description:** Default analysis mode to use for data analysis operations. This mode is used when no specific mode is specified in the analysis request.

**Supported Modes:**
- `exploratory` - Exploratory data analysis (default)
- `diagnostic` - Diagnostic analysis
- `predictive` - Predictive analysis
- `prescriptive` - Prescriptive analysis
- `comparative` - Comparative analysis
- `causal` - Causal analysis

**Example:**

```bash
export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=predictive
```

**Mode Note:** Choose the mode that best fits your typical analysis requirements.

### 2. Max Iterations

**Environment Variable:** `AI_DATA_ORCHESTRATOR_MAX_ITERATIONS`

**Type:** Integer

**Default:** `10`

**Description:** Maximum number of analysis iterations that can be performed in a single analysis workflow. This controls the depth and complexity of analysis operations.
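
As a rough illustration of how an integer setting like this might be resolved from the environment, the sketch below uses only the standard library. It is not the orchestrator's actual loading code: `read_max_iterations` is a hypothetical helper, and the positivity check simply mirrors the "positive integer" rule stated in the Validation section of this guide.

```python
import os


def read_max_iterations(env=None, default=10):
    """Hypothetical helper: resolve the iteration cap from environment-style data."""
    env = os.environ if env is None else env
    raw = env.get("AI_DATA_ORCHESTRATOR_MAX_ITERATIONS")
    if raw is None or raw.strip() == "":
        # Variable not set: fall back to the documented default.
        return default
    value = int(raw)  # raises ValueError for non-numeric text such as "ten"
    if value <= 0:
        raise ValueError("max_iterations must be a positive integer")
    return value


# Explicit dictionaries are used here so the example does not depend on your shell.
print(read_max_iterations({}))
print(read_max_iterations({"AI_DATA_ORCHESTRATOR_MAX_ITERATIONS": "20"}))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test; in an application you would call it with no arguments.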

**Common Values:**
- `5` - Quick analysis (basic insights)
- `10` - Standard analysis (default, balanced)
- `20` - Deep analysis (comprehensive insights)
- `50` - Maximum analysis (exhaustive exploration)

**Example:**

```bash
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20
```

**Iteration Note:** Higher values provide more comprehensive analysis but may increase processing time and resource usage.

### 3. Enable Auto Workflow

**Environment Variable:** `AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW`

**Type:** Boolean

**Default:** `True`

**Description:** Whether to enable automatic workflow generation. When enabled, the orchestrator automatically designs analysis workflows based on the data and requirements.

**Values:**
- `true` - Enable auto workflow (default)
- `false` - Disable auto workflow

**Example:**

```bash
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
```

**Workflow Note:** Auto workflow provides intelligent analysis design but may require more computational resources.

### 4. Default AI Provider

**Environment Variable:** `AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER`

**Type:** String

**Default:** `"openai"`

**Description:** Default AI provider to use for analysis operations. This provider is used when no specific provider is specified in the request.

**Supported Providers:**
- `openai` - OpenAI API (default)
- `anthropic` - Anthropic Claude
- `google` - Google AI
- `local` - Local AI model

**Example:**

```bash
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=anthropic
```

**Provider Note:** Ensure the selected provider is properly configured with API keys and credentials.

### 5. Enable Caching

**Environment Variable:** `AI_DATA_ORCHESTRATOR_ENABLE_CACHING`

**Type:** Boolean

**Default:** `True`

**Description:** Whether to enable result caching. When enabled, analysis results are cached to improve performance for similar requests.
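
To make the idea of result caching concrete, here is a minimal in-memory sketch. It is an assumption-laden illustration rather than the orchestrator's real cache: `ResultCache`, `make_key`, `get_or_compute`, and `fake_analysis` are all hypothetical names, and the guide does not say how the tool actually keys or stores cached results.

```python
import hashlib
import json


class ResultCache:
    """Hypothetical in-memory cache keyed by the contents of an analysis request."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self._store = {}

    @staticmethod
    def make_key(request):
        # Serialize with sorted keys so equal requests produce the same hash.
        payload = json.dumps(request, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, request, compute):
        if not self.enabled:
            # Caching disabled: always run the analysis.
            return compute(request)
        key = self.make_key(request)
        if key not in self._store:
            self._store[key] = compute(request)
        return self._store[key]


calls = []


def fake_analysis(request):
    # Stand-in for an expensive analysis; records each real execution.
    calls.append(request["mode"])
    return {"summary": f"{request['mode']} analysis of {request['data_source']}"}


cache = ResultCache(enabled=True)
request = {"data_source": "dataset.csv", "mode": "exploratory"}
first = cache.get_or_compute(request, fake_analysis)
second = cache.get_or_compute(dict(request), fake_analysis)  # equal request, served from cache
print(first == second, len(calls))
```

With caching enabled the second call reuses the stored result, which is the memory-for-speed trade-off the setting controls.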

**Values:**
- `true` - Enable caching (default)
- `false` - Disable caching

**Example:**

```bash
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true
```

**Caching Note:** Caching improves performance but requires additional memory and storage.

## Usage Examples

### Example 1: Basic Environment Configuration

```bash
# Set basic analysis parameters
export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=10
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true

# Run your application
python app.py
```

### Example 2: High-Performance Configuration

```bash
# Optimized for comprehensive analysis
export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true
```

### Example 3: Development Configuration

```bash
# Development-friendly settings
export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=5
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=false
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=local
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=false
```

### Example 4: Programmatic Configuration

```python
from aiecs.tools.statistics.ai_data_analysis_orchestrator import AIDataAnalysisOrchestrator

# Initialize with custom configuration
orchestrator = AIDataAnalysisOrchestrator(config={
    'default_mode': 'exploratory',
    'max_iterations': 10,
    'enable_auto_workflow': True,
    'default_ai_provider': 'openai',
    'enable_caching': True
})
```

### Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

```bash
# Set environment defaults
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=10
export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
```

```python
from aiecs.tools.statistics.ai_data_analysis_orchestrator import AIDataAnalysisOrchestrator

# Override for
# specific instance
orchestrator = AIDataAnalysisOrchestrator(config={
    'max_iterations': 20,  # This overrides the environment variable
    'default_mode': 'predictive'  # This overrides the environment variable
})
```

## Configuration Priority

When the AI Data Analysis Orchestrator is initialized, configuration values are resolved in the following order (highest to lowest priority):

1. **Programmatic config** - Values passed to the constructor
2. **Environment variables** - Values set via `AI_DATA_ORCHESTRATOR_*` variables
3. **Default values** - Built-in defaults as specified above

## Data Type Parsing

### String Values

Strings should be provided as plain text without quotes:

```bash
export AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
```

### Integer Values

Integers should be provided as numeric strings:

```bash
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=10
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20
```

### Boolean Values

Booleans should be provided as lowercase strings:

```bash
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=false
```

## Validation

### Automatic Type Validation

Pydantic automatically validates configuration values:

- `default_mode` must be a valid analysis mode string
- `max_iterations` must be a positive integer
- `enable_auto_workflow` must be a boolean
- `default_ai_provider` must be a valid provider string
- `enable_caching` must be a boolean

### Runtime Validation

When performing analysis, the tool validates:

1. **Analysis mode** - Mode must be supported
2. **Iteration limits** - Analysis must not exceed max iterations
3. **AI provider availability** - Provider must be configured
4. **Workflow constraints** - Auto workflow must be properly configured
5.
   **Caching requirements** - Cache must be accessible if enabled

## Analysis Modes

The AI Data Analysis Orchestrator supports various analysis modes:

### Basic Modes
- **Exploratory** - Initial data exploration and discovery
- **Diagnostic** - Root cause analysis and problem diagnosis
- **Predictive** - Future trend prediction and forecasting
- **Prescriptive** - Actionable recommendations and solutions

### Advanced Modes
- **Comparative** - Compare different datasets or time periods
- **Causal** - Identify cause-and-effect relationships

## AI Providers

### Supported Providers
- **OpenAI** - OpenAI API integration
- **Anthropic** - Anthropic Claude integration
- **Google** - Google AI integration
- **Local** - Local AI model integration

### Provider Configuration

Each provider requires specific configuration:

**OpenAI:**

```bash
export OPENAI_API_KEY=your-api-key
export OPENAI_ORG_ID=your-org-id  # Optional
```

**Anthropic:**

```bash
export ANTHROPIC_API_KEY=your-api-key
```

**Google:**

```bash
export GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json
export GOOGLE_CLOUD_PROJECT=your-project-id
```

**Local:**

```bash
export LOCAL_MODEL_PATH=path/to/model
export LOCAL_MODEL_TYPE=llama2  # or other model type
```

## Operations Supported

The AI Data Analysis Orchestrator supports comprehensive data analysis operations:

### Basic Analysis
- `analyze_data` - Perform comprehensive data analysis
- `exploratory_analysis` - Perform exploratory data analysis
- `diagnostic_analysis` - Perform diagnostic analysis
- `predictive_analysis` - Perform predictive analysis
- `prescriptive_analysis` - Perform prescriptive analysis

### Advanced Analysis
- `comparative_analysis` - Compare different datasets
- `causal_analysis` - Identify causal relationships
- `workflow_analysis` - Execute custom analysis workflows
- `iterative_analysis` - Perform iterative analysis with feedback

### Workflow Management
- `design_workflow` - Design analysis workflows
- `execute_workflow` - Execute
  analysis workflows
- `optimize_workflow` - Optimize workflow performance
- `cache_workflow` - Cache workflow results

### Tool Coordination
- `coordinate_tools` - Coordinate multiple analysis tools
- `integrate_results` - Integrate results from multiple tools
- `validate_analysis` - Validate analysis results
- `generate_report` - Generate comprehensive analysis reports

## Troubleshooting

### Issue: AI Provider not available

**Error:** `OrchestratorError` when calling AI providers

**Solutions:**

```bash
# Check provider configuration
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai

# Verify API keys
export OPENAI_API_KEY=your-valid-api-key

# Test with local provider
export AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=local
```

### Issue: Analysis workflow fails

**Error:** `WorkflowError` during workflow execution

**Solutions:**
1. Check foundation tool availability
2. Verify data accessibility
3. Check workflow configuration
4. Validate analysis parameters

### Issue: Max iterations exceeded

**Error:** Analysis exceeds maximum iterations

**Solutions:**

```bash
# Increase max iterations
export AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20

# Or optimize analysis workflow
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
```

### Issue: Caching problems

**Error:** Cache operations fail

**Solutions:**

```bash
# Disable caching for testing
export AI_DATA_ORCHESTRATOR_ENABLE_CACHING=false

# Check cache directory permissions
# Verify cache configuration
```

### Issue: Auto workflow issues

**Error:** Auto workflow generation fails

**Solutions:**

```bash
# Disable auto workflow for testing
export AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=false

# Check AI provider configuration
# Verify workflow templates
```

### Issue: Foundation tool errors

**Error:** Foundation tool operations fail

**Solutions:**
1. Check tool availability and dependencies
2. Verify data format compatibility
3. Check tool configuration
4. Validate input data

## Best Practices

### Performance Optimization

1.
   **Iteration Management** - Set appropriate max iterations
2. **Caching Strategy** - Enable caching for repeated analyses
3. **Workflow Optimization** - Use auto workflow for efficiency
4. **Provider Selection** - Choose providers based on task requirements
5. **Resource Management** - Monitor memory and CPU usage

### Error Handling

1. **Graceful Degradation** - Handle tool failures gracefully
2. **Retry Logic** - Implement retry for transient failures
3. **Fallback Strategies** - Provide fallback analysis methods
4. **Error Logging** - Log errors for debugging and monitoring
5. **User Feedback** - Provide clear error messages

### Security

1. **API Key Management** - Secure storage of API keys
2. **Data Privacy** - Ensure data privacy in analysis
3. **Access Control** - Control access to analysis tools
4. **Audit Logging** - Log analysis activities for compliance
5. **Data Validation** - Validate input data before analysis

### Resource Management

1. **Memory Usage** - Monitor memory consumption during analysis
2. **API Rate Limits** - Respect provider rate limits
3. **Cost Management** - Monitor and control analysis costs
4. **Processing Time** - Set reasonable timeouts
5. **Cleanup** - Clean up temporary files and resources

### Integration

1. **Tool Dependencies** - Ensure required tools are available
2. **API Compatibility** - Maintain API compatibility
3. **Error Propagation** - Properly propagate errors
4. **Logging Integration** - Integrate with logging systems
5.
   **Monitoring** - Monitor tool performance and usage

### Development vs Production

**Development:**

```bash
AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=5
AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=false
AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=local
AI_DATA_ORCHESTRATOR_ENABLE_CACHING=false
```

**Production:**

```bash
AI_DATA_ORCHESTRATOR_DEFAULT_MODE=exploratory
AI_DATA_ORCHESTRATOR_MAX_ITERATIONS=20
AI_DATA_ORCHESTRATOR_ENABLE_AUTO_WORKFLOW=true
AI_DATA_ORCHESTRATOR_DEFAULT_AI_PROVIDER=openai
AI_DATA_ORCHESTRATOR_ENABLE_CACHING=true
```

### Error Handling

Always wrap analysis operations in try-except blocks:

```python
from aiecs.tools.statistics.ai_data_analysis_orchestrator import AIDataAnalysisOrchestrator, OrchestratorError, WorkflowError

orchestrator = AIDataAnalysisOrchestrator()

try:
    result = orchestrator.analyze_data(
        data_source="dataset.csv",
        analysis_mode="exploratory",
        max_iterations=10
    )
except WorkflowError as e:
    print(f"Workflow error: {e}")
except OrchestratorError as e:
    print(f"Orchestrator error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Dependencies

### Core Dependencies

```bash
# Install core dependencies
pip install pydantic python-dotenv pandas

# Install AI provider dependencies
pip install openai anthropic google-cloud-aiplatform

# Install analysis dependencies
pip install numpy scipy scikit-learn matplotlib seaborn
```

### Optional Dependencies

```bash
# For advanced analysis
pip install plotly dash streamlit

# For machine learning
pip install xgboost lightgbm catboost

# For statistical analysis
pip install statsmodels pingouin

# For data processing
pip install dask vaex
```

### Verification

```python
# Test dependency availability
try:
    import pydantic
    import pandas
    import numpy
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test AI provider availability
try:
    import openai
    print("OpenAI available")
except ImportError:
    print("OpenAI not available")

try:
    import anthropic
    print("Anthropic available")
except ImportError:
    print("Anthropic not available")

# Test analysis tool availability
try:
    from aiecs.tools.statistics.data_loader import DataLoader
    from aiecs.tools.statistics.data_profiler import DataProfiler
    print("Foundation tools available")
except ImportError:
    print("Foundation tools not available")
```

## Related Documentation

- Tool implementation details in the source code
- Foundation tools documentation (data_loader, data_profiler, etc.)
- AIECS client documentation for AI operations
- Main aiecs documentation for architecture overview

## Support

For issues or questions about AI Data Analysis Orchestrator configuration:

- Check the tool source code for implementation details
- Review foundation tool documentation for specific features
- Consult the main aiecs documentation for architecture overview
- Test with simple datasets first to isolate configuration vs. analysis issues
- Monitor API rate limits and costs
- Verify AI provider configuration and credentials
- Ensure proper iteration and workflow limits
- Check foundation tool availability and configuration
- Validate analysis mode and provider compatibility
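
For the credential check suggested in the list above, a small standalone diagnostic along these lines may help. It only inspects environment-style data and is not part of aiecs: `missing_provider_vars` is a hypothetical helper, the variable names come from the Provider Configuration section of this guide, and treating exactly these variables as required is an assumption (`OPENAI_ORG_ID` is left out because the guide marks it optional).

```python
import os

# Variable names taken from the Provider Configuration section of this guide.
# Which of them are strictly required is an assumption made for this sketch.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "google": ["GOOGLE_APPLICATION_CREDENTIALS", "GOOGLE_CLOUD_PROJECT"],
    "local": ["LOCAL_MODEL_PATH", "LOCAL_MODEL_TYPE"],
}


def missing_provider_vars(provider, env=None):
    """Hypothetical helper: list the variables that are unset or empty for a provider."""
    env = os.environ if env is None else env
    if provider not in REQUIRED_VARS:
        raise ValueError(f"Unknown provider: {provider!r}")
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]


# An explicit dictionary stands in for the real environment in this example.
example_env = {"OPENAI_API_KEY": "your-api-key"}
for name in ("openai", "google"):
    missing = missing_provider_vars(name, example_env)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{name}: {status}")
```

In an application, call `missing_provider_vars(provider)` without the second argument to check the live process environment after `load_dotenv()` has run.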