# Data Loader Tool Configuration Guide ## Overview The Data Loader Tool is a universal data loading tool that provides comprehensive data loading capabilities with auto-detection of file formats, multiple loading strategies (full, streaming, chunked, lazy), data quality validation on load, schema inference and validation, and support for CSV, Excel, JSON, Parquet, and other formats. It can load data from multiple file formats, auto-detect data formats and schemas, handle large datasets with streaming, and validate data quality on load. The tool integrates with pandas_tool for core data operations and supports various data source types (CSV, Excel, JSON, Parquet, Feather, HDF5, Stata, SAS, SPSS) and loading strategies (full_load, streaming, chunked, lazy, incremental). The tool can be configured via environment variables using the `DATA_LOADER_` prefix or through programmatic configuration when initializing the tool. ## Using .env Files in Your Project When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The Data Loader Tool reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools. ### Setting Up .env Files **1. Install python-dotenv:** ```bash pip install python-dotenv ``` **2. Create a `.env` file in your project root:** ```bash # .env file in your project root DATA_LOADER_MAX_FILE_SIZE_MB=500 DATA_LOADER_DEFAULT_CHUNK_SIZE=10000 DATA_LOADER_MAX_MEMORY_USAGE_MB=2000 DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true DATA_LOADER_ENABLE_QUALITY_VALIDATION=true DATA_LOADER_DEFAULT_ENCODING=utf-8 ``` **3. Load the .env file in your application:** ```python # main.py or app.py - at the top of your entry point from dotenv import load_dotenv # Load environment variables from .env file # This must be done BEFORE importing aiecs tools load_dotenv() # Now import and use aiecs tools from aiecs.tools.statistics.data_loader_tool import DataLoaderTool # The tool will automatically use the environment variables data_loader = DataLoaderTool() ``` ### Multiple Environment Files You can use different `.env` files for different environments: ```python import os from dotenv import load_dotenv # Load environment-specific configuration env = os.getenv('APP_ENV', 'development') if env == 'production': load_dotenv('.env.production') elif env == 'staging': load_dotenv('.env.staging') else: load_dotenv('.env.development') from aiecs.tools.statistics.data_loader_tool import DataLoaderTool data_loader = DataLoaderTool() ``` **Example `.env.production`:** ```bash # Production settings - optimized for large datasets DATA_LOADER_MAX_FILE_SIZE_MB=1000 DATA_LOADER_DEFAULT_CHUNK_SIZE=50000 DATA_LOADER_MAX_MEMORY_USAGE_MB=4000 DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true DATA_LOADER_ENABLE_QUALITY_VALIDATION=true DATA_LOADER_DEFAULT_ENCODING=utf-8 ``` **Example `.env.development`:** ```bash # Development settings - optimized for testing and debugging DATA_LOADER_MAX_FILE_SIZE_MB=100 DATA_LOADER_DEFAULT_CHUNK_SIZE=1000 DATA_LOADER_MAX_MEMORY_USAGE_MB=500 DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true DATA_LOADER_ENABLE_QUALITY_VALIDATION=false DATA_LOADER_DEFAULT_ENCODING=utf-8 ``` ### Best Practices for .env Files 1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`: ```gitignore # .gitignore .env .env.local .env.*.local .env.production .env.staging ``` 2. **Provide a template** - Create `.env.example` with documented dummy values: ```bash # .env.example # Data Loader Tool Configuration # Maximum file size in megabytes DATA_LOADER_MAX_FILE_SIZE_MB=500 # Default chunk size for chunked loading DATA_LOADER_DEFAULT_CHUNK_SIZE=10000 # Maximum memory usage in megabytes DATA_LOADER_MAX_MEMORY_USAGE_MB=2000 # Whether to enable automatic schema inference DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true # Whether to enable data quality validation DATA_LOADER_ENABLE_QUALITY_VALIDATION=true # Default text encoding for file operations DATA_LOADER_DEFAULT_ENCODING=utf-8 ``` 3. **Document your variables** - Add comments explaining each setting 4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports 5. **Format values correctly**: - Strings: Plain text: `utf-8`, `latin-1` - Integers: Plain numbers: `500`, `10000`, `2000` - Booleans: `true` or `false` ## Configuration Options ### 1. Max File Size MB **Environment Variable:** `DATA_LOADER_MAX_FILE_SIZE_MB` **Type:** Integer **Default:** `500` **Description:** Maximum file size in megabytes that can be loaded. Files larger than this will trigger chunked or streaming loading strategies to prevent memory issues. **Common Values:** - `100` - Small files only (development/testing) - `500` - Standard files (default, balanced) - `1000` - Large files (production) - `2000` - Very large files (high-memory systems) **Example:** ```bash export DATA_LOADER_MAX_FILE_SIZE_MB=1000 ``` **Size Note:** Larger values allow bigger files but require more memory. ### 2. Default Chunk Size **Environment Variable:** `DATA_LOADER_DEFAULT_CHUNK_SIZE` **Type:** Integer **Default:** `10000` **Description:** Default chunk size for chunked loading operations. This determines how many rows are processed at once when using chunked loading strategies. **Common Values:** - `1000` - Small chunks (low memory usage) - `10000` - Standard chunks (default, balanced) - `50000` - Large chunks (high performance) - `100000` - Very large chunks (maximum performance) **Example:** ```bash export DATA_LOADER_DEFAULT_CHUNK_SIZE=50000 ``` **Chunk Note:** Larger chunks improve performance but use more memory. ### 3. Max Memory Usage MB **Environment Variable:** `DATA_LOADER_MAX_MEMORY_USAGE_MB` **Type:** Integer **Default:** `2000` **Description:** Maximum memory usage in megabytes for data loading operations. This helps prevent out-of-memory errors by controlling memory consumption. **Common Values:** - `500` - Low memory systems - `2000` - Standard systems (default) - `4000` - High memory systems - `8000` - Very high memory systems **Example:** ```bash export DATA_LOADER_MAX_MEMORY_USAGE_MB=4000 ``` **Memory Note:** Higher values allow larger datasets but require more system memory. ### 4. Enable Schema Inference **Environment Variable:** `DATA_LOADER_ENABLE_SCHEMA_INFERENCE` **Type:** Boolean **Default:** `True` **Description:** Whether to enable automatic schema inference when loading data. Schema inference automatically detects data types and structure. **Values:** - `true` - Enable schema inference (default) - `false` - Disable schema inference **Example:** ```bash export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true ``` **Schema Note:** Schema inference improves data quality but may slow down loading. ### 5. Enable Quality Validation **Environment Variable:** `DATA_LOADER_ENABLE_QUALITY_VALIDATION` **Type:** Boolean **Default:** `True` **Description:** Whether to enable data quality validation during loading. Quality validation checks for missing values, data type consistency, and other quality issues. **Values:** - `true` - Enable quality validation (default) - `false` - Disable quality validation **Example:** ```bash export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true ``` **Quality Note:** Quality validation improves data reliability but may slow down loading. ### 6. Default Encoding **Environment Variable:** `DATA_LOADER_DEFAULT_ENCODING` **Type:** String **Default:** `"utf-8"` **Description:** Default text encoding for file operations. This is used when no specific encoding is provided for text-based file formats. **Common Encodings:** - `utf-8` - Unicode UTF-8 (default, most common) - `latin-1` - Latin-1 (ISO-8859-1) - `cp1252` - Windows-1252 - `ascii` - ASCII (7-bit) **Example:** ```bash export DATA_LOADER_DEFAULT_ENCODING=utf-8 ``` **Encoding Note:** UTF-8 is recommended for international text, Latin-1 for legacy systems. ## Usage Examples ### Example 1: Basic Environment Configuration ```bash # Set basic data loading parameters export DATA_LOADER_MAX_FILE_SIZE_MB=500 export DATA_LOADER_DEFAULT_CHUNK_SIZE=10000 export DATA_LOADER_MAX_MEMORY_USAGE_MB=2000 export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true export DATA_LOADER_DEFAULT_ENCODING=utf-8 # Run your application python app.py ``` ### Example 2: High-Performance Configuration ```bash # Optimized for large datasets and high performance export DATA_LOADER_MAX_FILE_SIZE_MB=1000 export DATA_LOADER_DEFAULT_CHUNK_SIZE=50000 export DATA_LOADER_MAX_MEMORY_USAGE_MB=4000 export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true export DATA_LOADER_DEFAULT_ENCODING=utf-8 ``` ### Example 3: Development Configuration ```bash # Development-friendly settings export DATA_LOADER_MAX_FILE_SIZE_MB=100 export DATA_LOADER_DEFAULT_CHUNK_SIZE=1000 export DATA_LOADER_MAX_MEMORY_USAGE_MB=500 export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true export DATA_LOADER_ENABLE_QUALITY_VALIDATION=false export DATA_LOADER_DEFAULT_ENCODING=utf-8 ``` ### Example 4: Programmatic Configuration ```python from aiecs.tools.statistics.data_loader_tool import DataLoaderTool # Initialize with custom configuration data_loader = DataLoaderTool(config={ 'max_file_size_mb': 500, 'default_chunk_size': 10000, 'max_memory_usage_mb': 2000, 'enable_schema_inference': True, 'enable_quality_validation': True, 'default_encoding': 'utf-8' }) ``` ### Example 5: Mixed Configuration Environment variables are used as defaults, but can be overridden programmatically: ```bash # Set environment defaults export DATA_LOADER_DEFAULT_CHUNK_SIZE=10000 export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true ``` ```python # Override for specific instance data_loader = DataLoaderTool(config={ 'default_chunk_size': 50000, # This overrides the environment variable 'enable_quality_validation': False # This overrides the environment variable }) ``` ## Configuration Priority When the Data Loader Tool is initialized, configuration values are resolved in the following order (highest to lowest priority): 1. **Programmatic config** - Values passed to the constructor 2. **Environment variables** - Values set via `DATA_LOADER_*` variables 3. **Default values** - Built-in defaults as specified above ## Data Type Parsing ### String Values Strings should be provided as plain text without quotes: ```bash export DATA_LOADER_DEFAULT_ENCODING=utf-8 export DATA_LOADER_DEFAULT_ENCODING=latin-1 ``` ### Integer Values Integers should be provided as numeric strings: ```bash export DATA_LOADER_MAX_FILE_SIZE_MB=500 export DATA_LOADER_DEFAULT_CHUNK_SIZE=10000 export DATA_LOADER_MAX_MEMORY_USAGE_MB=2000 ``` ### Boolean Values Booleans should be provided as lowercase strings: ```bash export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true export DATA_LOADER_ENABLE_QUALITY_VALIDATION=false ``` ## Validation ### Automatic Type Validation Pydantic automatically validates configuration values: - `max_file_size_mb` must be a positive integer - `default_chunk_size` must be a positive integer - `max_memory_usage_mb` must be a positive integer - `enable_schema_inference` must be a boolean - `enable_quality_validation` must be a boolean - `default_encoding` must be a non-empty string ### Runtime Validation When loading data, the tool validates: 1. **File size** - Files must not exceed maximum size limit 2. **Memory usage** - Operations must not exceed memory limits 3. **Chunk size** - Chunk size must be reasonable for the dataset 4. **Encoding** - Encoding must be valid for the file format 5. **Schema compatibility** - Schema inference must be compatible with data 6. **Quality standards** - Data must meet quality validation criteria ## Data Source Types The Data Loader Tool supports various data source types: ### Text Formats - **CSV** - Comma-separated values - **JSON** - JavaScript Object Notation - **Excel** - Microsoft Excel files (.xlsx, .xls) ### Binary Formats - **Parquet** - Columnar storage format - **Feather** - Fast binary format - **HDF5** - Hierarchical Data Format ### Statistical Formats - **Stata** - Stata data files - **SAS** - SAS data files - **SPSS** - SPSS data files ### Auto-Detection - **AUTO** - Automatically detect file format ## Loading Strategies ### Full Load - Load entire dataset into memory - Fastest for small to medium datasets - Requires sufficient memory ### Streaming - Process data in continuous stream - Memory-efficient for large datasets - Slower but handles unlimited size ### Chunked - Process data in fixed-size chunks - Balanced memory usage and performance - Good for large datasets ### Lazy - Load data on-demand - Minimal initial memory usage - Slower access but memory-efficient ### Incremental - Load data in incremental batches - Good for ongoing data processing - Maintains processing state ## Operations Supported The Data Loader Tool supports comprehensive data loading operations: ### Basic Loading - `load_data` - Load data from various file formats - `load_csv` - Load CSV files with options - `load_excel` - Load Excel files with sheet selection - `load_json` - Load JSON files with structure handling - `load_parquet` - Load Parquet files efficiently ### Advanced Loading - `load_chunked` - Load data in chunks for large files - `load_streaming` - Stream data for memory efficiency - `load_lazy` - Lazy load data on-demand - `load_incremental` - Incremental data loading - `auto_detect_format` - Automatically detect file format ### Schema Operations - `infer_schema` - Infer data schema automatically - `validate_schema` - Validate data against schema - `apply_schema` - Apply schema to loaded data - `get_schema_info` - Get detailed schema information ### Quality Operations - `validate_quality` - Validate data quality - `check_missing_values` - Check for missing values - `validate_data_types` - Validate data type consistency - `generate_quality_report` - Generate data quality report ### Utility Operations - `get_file_info` - Get file information and metadata - `estimate_memory_usage` - Estimate memory requirements - `check_file_compatibility` - Check file format compatibility - `optimize_loading_strategy` - Optimize loading strategy ## Troubleshooting ### Issue: File too large error **Error:** File exceeds maximum size limit **Solutions:** ```bash # Increase file size limit export DATA_LOADER_MAX_FILE_SIZE_MB=1000 # Or use chunked loading data_loader.load_data(file_path, strategy='chunked') ``` ### Issue: Memory usage exceeded **Error:** Memory usage exceeds limit **Solutions:** ```bash # Increase memory limit export DATA_LOADER_MAX_MEMORY_USAGE_MB=4000 # Or reduce chunk size export DATA_LOADER_DEFAULT_CHUNK_SIZE=5000 ``` ### Issue: Schema inference fails **Error:** Schema inference errors **Solutions:** ```bash # Disable schema inference export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=false # Or provide explicit schema data_loader.load_data(file_path, schema=explicit_schema) ``` ### Issue: Quality validation fails **Error:** Data quality validation errors **Solutions:** ```bash # Disable quality validation export DATA_LOADER_ENABLE_QUALITY_VALIDATION=false # Or fix data quality issues data_loader.validate_quality(data) ``` ### Issue: Encoding problems **Error:** Text encoding errors **Solutions:** ```bash # Set correct encoding export DATA_LOADER_DEFAULT_ENCODING=latin-1 # Or specify encoding per file data_loader.load_data(file_path, encoding='utf-8') ``` ### Issue: Chunked loading performance **Error:** Slow chunked loading **Solutions:** ```bash # Increase chunk size export DATA_LOADER_DEFAULT_CHUNK_SIZE=50000 # Or use streaming strategy data_loader.load_data(file_path, strategy='streaming') ``` ### Issue: File format not supported **Error:** Unsupported file format **Solutions:** 1. Check file format compatibility 2. Use auto-detection 3. Convert file to supported format 4. Install required dependencies ## Best Practices ### Performance Optimization 1. **Chunk Size Tuning** - Optimize chunk size for your data 2. **Memory Management** - Monitor memory usage and set appropriate limits 3. **Format Selection** - Choose efficient formats (Parquet, Feather) 4. **Strategy Selection** - Use appropriate loading strategy 5. **Schema Optimization** - Disable schema inference when not needed ### Error Handling 1. **Graceful Degradation** - Handle loading failures gracefully 2. **Validation** - Validate data before processing 3. **Fallback Strategies** - Provide fallback loading methods 4. **Error Logging** - Log errors for debugging and monitoring 5. **User Feedback** - Provide clear error messages ### Security 1. **File Validation** - Validate file paths and permissions 2. **Content Validation** - Validate file content before loading 3. **Access Control** - Control access to data files 4. **Audit Logging** - Log data loading activities 5. **Data Privacy** - Ensure data privacy and compliance ### Resource Management 1. **Memory Monitoring** - Monitor memory usage during loading 2. **File Cleanup** - Clean up temporary files 3. **Connection Management** - Manage database connections efficiently 4. **Processing Time** - Set reasonable timeouts 5. **Storage Optimization** - Optimize data storage formats ### Integration 1. **Tool Dependencies** - Ensure required tools are available 2. **API Compatibility** - Maintain API compatibility 3. **Error Propagation** - Properly propagate errors 4. **Logging Integration** - Integrate with logging systems 5. **Monitoring** - Monitor tool performance and usage ### Development vs Production **Development:** ```bash DATA_LOADER_MAX_FILE_SIZE_MB=100 DATA_LOADER_DEFAULT_CHUNK_SIZE=1000 DATA_LOADER_MAX_MEMORY_USAGE_MB=500 DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true DATA_LOADER_ENABLE_QUALITY_VALIDATION=false DATA_LOADER_DEFAULT_ENCODING=utf-8 ``` **Production:** ```bash DATA_LOADER_MAX_FILE_SIZE_MB=1000 DATA_LOADER_DEFAULT_CHUNK_SIZE=50000 DATA_LOADER_MAX_MEMORY_USAGE_MB=4000 DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true DATA_LOADER_ENABLE_QUALITY_VALIDATION=true DATA_LOADER_DEFAULT_ENCODING=utf-8 ``` ### Error Handling Always wrap data loading operations in try-except blocks: ```python from aiecs.tools.statistics.data_loader_tool import DataLoaderTool, DataLoaderError, FileFormatError, SchemaValidationError, DataQualityError data_loader = DataLoaderTool() try: data = data_loader.load_data( file_path='data.csv', source_type='csv', strategy='full_load' ) except FileFormatError as e: print(f"File format error: {e}") except SchemaValidationError as e: print(f"Schema validation error: {e}") except DataQualityError as e: print(f"Data quality error: {e}") except DataLoaderError as e: print(f"Data loader error: {e}") except Exception as e: print(f"Unexpected error: {e}") ``` ## Dependencies ### Core Dependencies ```bash # Install core dependencies pip install pydantic python-dotenv # Install data processing dependencies pip install pandas numpy # Install file format dependencies pip install openpyxl xlrd pyarrow fastparquet ``` ### Optional Dependencies ```bash # For statistical formats pip install pyreadstat # For HDF5 support pip install h5py # For Feather support pip install feather-format # For advanced data types pip install pandas-stubs ``` ### Verification ```python # Test dependency availability try: import pandas import numpy print("Core dependencies available") except ImportError as e: print(f"Missing dependency: {e}") # Test format support try: import openpyxl print("Excel support available") except ImportError: print("Excel support not available") try: import pyarrow print("Parquet support available") except ImportError: print("Parquet support not available") # Test statistical format support try: import pyreadstat print("Statistical format support available") except ImportError: print("Statistical format support not available") ``` ## Related Documentation - Tool implementation details in the source code - Pandas tool documentation for core data operations - Statistics tools documentation for data analysis - Main aiecs documentation for architecture overview ## Support For issues or questions about Data Loader Tool configuration: - Check the tool source code for implementation details - Review pandas tool documentation for core data operations - Consult the main aiecs documentation for architecture overview - Test with simple files first to isolate configuration vs. loading issues - Verify file format compatibility and dependencies - Check memory and file size limits - Ensure proper encoding and schema settings - Validate data quality and structure requirements