Data Loader Tool Configuration Guide
Overview
The Data Loader Tool is a general-purpose data loading tool. It auto-detects file formats, infers and validates schemas, validates data quality on load, and handles large datasets through several loading strategies (full_load, streaming, chunked, lazy, incremental). Supported source types are CSV, Excel, JSON, Parquet, Feather, HDF5, Stata, SAS, and SPSS. The tool integrates with pandas_tool for core data operations. It can be configured through environment variables with the DATA_LOADER_ prefix or programmatically when the tool is initialized.
Using .env Files in Your Project
When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Data Loader Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.
Setting Up .env Files
1. Install python-dotenv:
pip install python-dotenv
2. Create a .env file in your project root:
# .env file in your project root
DATA_LOADER_MAX_FILE_SIZE_MB=500
DATA_LOADER_DEFAULT_CHUNK_SIZE=10000
DATA_LOADER_MAX_MEMORY_USAGE_MB=2000
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
DATA_LOADER_DEFAULT_ENCODING=utf-8
3. Load the .env file in your application:
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv
# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()
# Now import and use aiecs tools
from aiecs.tools.statistics.data_loader_tool import DataLoaderTool
# The tool will automatically use the environment variables
data_loader = DataLoaderTool()
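To confirm that the settings actually reached the process before the tool is imported, a quick check along these lines can help. This is a minimal sketch using only the standard library; `visible_data_loader_settings` is a hypothetical helper name, not part of aiecs.

```python
import os


def visible_data_loader_settings(environ=None):
    """Return every DATA_LOADER_* variable found in the mapping (default: os.environ)."""
    env = os.environ if environ is None else environ
    prefix = "DATA_LOADER_"
    return {name: value for name, value in sorted(env.items()) if name.startswith(prefix)}


if __name__ == "__main__":
    # Run this after load_dotenv() to see what the tool will be able to read.
    for name, value in visible_data_loader_settings().items():
        print(f"{name}={value}")
```

If the output is empty, the .env file was most likely not loaded, or it was loaded from a different working directory.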
Multiple Environment Files
You can use different .env files for different environments:
import os
from dotenv import load_dotenv
# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')
if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')
from aiecs.tools.statistics.data_loader_tool import DataLoaderTool
data_loader = DataLoaderTool()
Example .env.production:
# Production settings - optimized for large datasets
DATA_LOADER_MAX_FILE_SIZE_MB=1000
DATA_LOADER_DEFAULT_CHUNK_SIZE=50000
DATA_LOADER_MAX_MEMORY_USAGE_MB=4000
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
DATA_LOADER_DEFAULT_ENCODING=utf-8
Example .env.development:
# Development settings - optimized for testing and debugging
DATA_LOADER_MAX_FILE_SIZE_MB=100
DATA_LOADER_DEFAULT_CHUNK_SIZE=1000
DATA_LOADER_MAX_MEMORY_USAGE_MB=500
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
DATA_LOADER_ENABLE_QUALITY_VALIDATION=false
DATA_LOADER_DEFAULT_ENCODING=utf-8
Best Practices for .env Files
Never commit .env files to version control - Add .env to your .gitignore:
# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging
Provide a template - Create .env.example with documented dummy values:
# .env.example
# Data Loader Tool Configuration
# Maximum file size in megabytes
DATA_LOADER_MAX_FILE_SIZE_MB=500
# Default chunk size for chunked loading
DATA_LOADER_DEFAULT_CHUNK_SIZE=10000
# Maximum memory usage in megabytes
DATA_LOADER_MAX_MEMORY_USAGE_MB=2000
# Whether to enable automatic schema inference
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
# Whether to enable data quality validation
DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
# Default text encoding for file operations
DATA_LOADER_DEFAULT_ENCODING=utf-8
Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format values correctly:
Strings - Plain text: utf-8, latin-1
Integers - Plain numbers: 500, 10000, 2000
Booleans - true or false
Configuration Options
1. Max File Size MB
Environment Variable: DATA_LOADER_MAX_FILE_SIZE_MB
Type: Integer
Default: 500
Description: Maximum file size in megabytes that can be loaded. Files larger than this will trigger chunked or streaming loading strategies to prevent memory issues.
Common Values:
100 - Small files only (development/testing)
500 - Standard files (default, balanced)
1000 - Large files (production)
2000 - Very large files (high-memory systems)
Example:
export DATA_LOADER_MAX_FILE_SIZE_MB=1000
Size Note: Larger values allow bigger files but require more memory.
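The size threshold described above can be pictured as a simple comparison. The sketch below is an illustration only: `pick_strategy` is a hypothetical helper, the fallback to `chunked` (rather than `streaming`) is an arbitrary choice, and treating one megabyte as 1024 * 1024 bytes is an assumption about how the limit is measured.

```python
import os


def pick_strategy(file_path: str, max_file_size_mb: int = 500) -> str:
    """Return 'full_load' for files within the limit, otherwise 'chunked'."""
    size_mb = os.path.getsize(file_path) / (1024 * 1024)  # assumption: 1 MB = 1024 * 1024 bytes
    return "full_load" if size_mb <= max_file_size_mb else "chunked"
```

A call such as `pick_strategy("data.csv", max_file_size_mb=1000)` returns a strategy name that could then be passed on as the `strategy` argument.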
2. Default Chunk Size
Environment Variable: DATA_LOADER_DEFAULT_CHUNK_SIZE
Type: Integer
Default: 10000
Description: Default chunk size for chunked loading operations. This determines how many rows are processed at once when using chunked loading strategies.
Common Values:
1000 - Small chunks (low memory usage)
10000 - Standard chunks (default, balanced)
50000 - Large chunks (high performance)
100000 - Very large chunks (maximum performance)
Example:
export DATA_LOADER_DEFAULT_CHUNK_SIZE=50000
Chunk Note: Larger chunks improve performance but use more memory.
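To make the effect of the chunk size concrete, here is a minimal standard-library sketch of chunked CSV reading. It is not the tool's implementation; `iter_csv_chunks` is a hypothetical name, and rows are returned as plain dictionaries rather than DataFrames.

```python
import csv
from itertools import islice
from typing import Dict, Iterator, List


def iter_csv_chunks(file_path: str, chunk_size: int = 10000,
                    encoding: str = "utf-8") -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_size rows, so only one chunk is in memory at a time."""
    if chunk_size <= 0:  # checked when iteration starts, because this is a generator
        raise ValueError("chunk_size must be a positive integer")
    with open(file_path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))  # take the next chunk_size rows
            if not chunk:
                break
            yield chunk
```

With pandas, the comparable mechanism is the `chunksize` argument of `pandas.read_csv`, which returns an iterator of DataFrames instead of a single DataFrame.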
3. Max Memory Usage MB
Environment Variable: DATA_LOADER_MAX_MEMORY_USAGE_MB
Type: Integer
Default: 2000
Description: Maximum memory usage in megabytes for data loading operations. This helps prevent out-of-memory errors by controlling memory consumption.
Common Values:
500 - Low memory systems
2000 - Standard systems (default)
4000 - High memory systems
8000 - Very high memory systems
Example:
export DATA_LOADER_MAX_MEMORY_USAGE_MB=4000
Memory Note: Higher values allow larger datasets but require more system memory.
4. Enable Schema Inference
Environment Variable: DATA_LOADER_ENABLE_SCHEMA_INFERENCE
Type: Boolean
Default: True
Description: Whether to enable automatic schema inference when loading data. Schema inference automatically detects data types and structure.
Values:
true - Enable schema inference (default)
false - Disable schema inference
Example:
export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
Schema Note: Schema inference improves data quality but may slow down loading.
5. Enable Quality Validation
Environment Variable: DATA_LOADER_ENABLE_QUALITY_VALIDATION
Type: Boolean
Default: True
Description: Whether to enable data quality validation during loading. Quality validation checks for missing values, data type consistency, and other quality issues.
Values:
true - Enable quality validation (default)
false - Disable quality validation
Example:
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
Quality Note: Quality validation improves data reliability but may slow down loading.
6. Default Encoding
Environment Variable: DATA_LOADER_DEFAULT_ENCODING
Type: String
Default: "utf-8"
Description: Default text encoding for file operations. This is used when no specific encoding is provided for text-based file formats.
Common Encodings:
utf-8 - Unicode UTF-8 (default, most common)
latin-1 - Latin-1 (ISO-8859-1)
cp1252 - Windows-1252
ascii - ASCII (7-bit)
Example:
export DATA_LOADER_DEFAULT_ENCODING=utf-8
Encoding Note: UTF-8 is recommended for international text, Latin-1 for legacy systems.
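When the encoding of incoming files is unknown, one common approach is to try a short list of candidates in order. The sketch below assumes nothing about the tool itself; `read_text_with_fallback` is a hypothetical helper built on the standard library.

```python
def read_text_with_fallback(file_path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used) for the first encoding that decodes the whole file."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            with open(file_path, encoding=encoding) as handle:
                return handle.read(), encoding
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next candidate
    raise last_error
```

Note that latin-1 maps every possible byte to a character, so it never raises a decode error. A successful decode with it shows only that the bytes were readable, not that the guess was correct, which is why it belongs last in the list.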
Usage Examples
Example 1: Basic Environment Configuration
# Set basic data loading parameters
export DATA_LOADER_MAX_FILE_SIZE_MB=500
export DATA_LOADER_DEFAULT_CHUNK_SIZE=10000
export DATA_LOADER_MAX_MEMORY_USAGE_MB=2000
export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
export DATA_LOADER_DEFAULT_ENCODING=utf-8
# Run your application
python app.py
Example 2: High-Performance Configuration
# Optimized for large datasets and high performance
export DATA_LOADER_MAX_FILE_SIZE_MB=1000
export DATA_LOADER_DEFAULT_CHUNK_SIZE=50000
export DATA_LOADER_MAX_MEMORY_USAGE_MB=4000
export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
export DATA_LOADER_DEFAULT_ENCODING=utf-8
Example 3: Development Configuration
# Development-friendly settings
export DATA_LOADER_MAX_FILE_SIZE_MB=100
export DATA_LOADER_DEFAULT_CHUNK_SIZE=1000
export DATA_LOADER_MAX_MEMORY_USAGE_MB=500
export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=false
export DATA_LOADER_DEFAULT_ENCODING=utf-8
Example 4: Programmatic Configuration
from aiecs.tools.statistics.data_loader_tool import DataLoaderTool
# Initialize with custom configuration
data_loader = DataLoaderTool(config={
    'max_file_size_mb': 500,
    'default_chunk_size': 10000,
    'max_memory_usage_mb': 2000,
    'enable_schema_inference': True,
    'enable_quality_validation': True,
    'default_encoding': 'utf-8'
})
Example 5: Mixed Configuration
Environment variables are used as defaults, but can be overridden programmatically:
# Shell: set environment defaults
export DATA_LOADER_DEFAULT_CHUNK_SIZE=10000
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
# Python: override the defaults for a specific instance
from aiecs.tools.statistics.data_loader_tool import DataLoaderTool
data_loader = DataLoaderTool(config={
    'default_chunk_size': 50000,  # Overrides the environment variable
    'enable_quality_validation': False  # Overrides the environment variable
})
Configuration Priority
When the Data Loader Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):
1. Programmatic config - Values passed to the constructor
2. Environment variables - Values set via DATA_LOADER_* variables
3. Default values - Built-in defaults as specified above
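The precedence order can be sketched as a three-layer merge. The code below is an illustration under stated assumptions, not the tool's actual resolution logic: `resolve_config` and `_coerce` are hypothetical helpers, and the defaults are copied from the Configuration Options section.

```python
import os

DEFAULTS = {
    "max_file_size_mb": 500,
    "default_chunk_size": 10000,
    "max_memory_usage_mb": 2000,
    "enable_schema_inference": True,
    "enable_quality_validation": True,
    "default_encoding": "utf-8",
}


def _coerce(raw, default):
    """Convert an environment string to the type of the matching default."""
    if isinstance(default, bool):  # test bool before int: bool is a subclass of int
        return raw.strip().lower() == "true"
    if isinstance(default, int):
        return int(raw)
    return raw


def resolve_config(overrides=None, environ=None):
    """Merge defaults, then DATA_LOADER_* variables, then programmatic overrides."""
    env = os.environ if environ is None else environ
    resolved = dict(DEFAULTS)  # lowest priority
    for key, default in DEFAULTS.items():
        raw = env.get("DATA_LOADER_" + key.upper())
        if raw is not None:
            resolved[key] = _coerce(raw, default)  # middle priority
    resolved.update(overrides or {})  # highest priority
    return resolved
```

For example, with `DATA_LOADER_DEFAULT_CHUNK_SIZE=10000` in the environment, `resolve_config({"default_chunk_size": 50000})` yields a chunk size of 50000 while every other key keeps its environment or default value.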
Data Type Parsing
String Values
Strings should be provided as plain text without quotes:
export DATA_LOADER_DEFAULT_ENCODING=utf-8
export DATA_LOADER_DEFAULT_ENCODING=latin-1
Integer Values
Integers should be provided as numeric strings:
export DATA_LOADER_MAX_FILE_SIZE_MB=500
export DATA_LOADER_DEFAULT_CHUNK_SIZE=10000
export DATA_LOADER_MAX_MEMORY_USAGE_MB=2000
Boolean Values
Booleans should be provided as lowercase strings:
export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=false
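Because every environment variable arrives as a string, the conversions above can be pictured with two small parsers. The tool's real parsing is handled by Pydantic and may accept other spellings; this stricter sketch, with the hypothetical names `parse_bool` and `parse_positive_int`, covers only the formats documented here.

```python
def parse_bool(raw: str) -> bool:
    """Accept 'true' or 'false' in any letter case; reject everything else."""
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {raw!r}")


def parse_positive_int(raw: str) -> int:
    """Convert a numeric string such as '500' and insist that it is greater than zero."""
    number = int(raw.strip())  # int() raises ValueError for non-numeric input
    if number <= 0:
        raise ValueError(f"expected a positive integer, got {raw!r}")
    return number
```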
Validation
Automatic Type Validation
Pydantic automatically validates configuration values:
max_file_size_mb must be a positive integer
default_chunk_size must be a positive integer
max_memory_usage_mb must be a positive integer
enable_schema_inference must be a boolean
enable_quality_validation must be a boolean
default_encoding must be a non-empty string
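The rules above can be expressed as a small settings class. The tool itself relies on Pydantic; the sketch below uses a standard-library dataclass as a stand-in so that it runs without third-party packages, and `LoaderSettings` is a hypothetical name rather than the tool's real configuration class.

```python
from dataclasses import dataclass


@dataclass
class LoaderSettings:
    max_file_size_mb: int = 500
    default_chunk_size: int = 10000
    max_memory_usage_mb: int = 2000
    enable_schema_inference: bool = True
    enable_quality_validation: bool = True
    default_encoding: str = "utf-8"

    def __post_init__(self):
        for name in ("max_file_size_mb", "default_chunk_size", "max_memory_usage_mb"):
            value = getattr(self, name)
            # bool is a subclass of int, so rule it out explicitly
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer")
        for name in ("enable_schema_inference", "enable_quality_validation"):
            if not isinstance(getattr(self, name), bool):
                raise ValueError(f"{name} must be a boolean")
        if not isinstance(self.default_encoding, str) or not self.default_encoding:
            raise ValueError("default_encoding must be a non-empty string")
```

Constructing `LoaderSettings(default_chunk_size=0)` raises a `ValueError`, which mirrors the kind of failure to expect when an invalid value is supplied.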
Runtime Validation
When loading data, the tool validates:
File size - Files must not exceed maximum size limit
Memory usage - Operations must not exceed memory limits
Chunk size - Chunk size must be reasonable for the dataset
Encoding - Encoding must be valid for the file format
Schema compatibility - Schema inference must be compatible with data
Quality standards - Data must meet quality validation criteria
Data Source Types
The Data Loader Tool supports various data source types:
Text and Spreadsheet Formats
CSV - Comma-separated values
JSON - JavaScript Object Notation
Excel - Microsoft Excel files (.xlsx, .xls)
Binary Formats
Parquet - Columnar storage format
Feather - Fast binary format
HDF5 - Hierarchical Data Format
Statistical Formats
Stata - Stata data files
SAS - SAS data files
SPSS - SPSS data files
Auto-Detection
AUTO - Automatically detect file format
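One simple way to auto-detect a format is to look at the file extension. The sketch below is an assumption about how such detection could work, not a description of the tool's detection logic. `guess_format` and the extension table are hypothetical, and apart from `csv` the returned format names are illustrative.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a format name.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".xlsx": "excel",
    ".xls": "excel",
    ".parquet": "parquet",
    ".feather": "feather",
    ".h5": "hdf5",
    ".hdf5": "hdf5",
    ".dta": "stata",
    ".sas7bdat": "sas",
    ".sav": "spss",
}


def guess_format(file_path: str) -> str:
    """Return the format name for a path based on its (case-insensitive) extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unsupported or unknown extension: {suffix!r}") from None
```

Extension checks are cheap but can be fooled by misnamed files, so a more robust detector would also inspect the file contents.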
Loading Strategies
Full Load
Load entire dataset into memory
Fastest for small to medium datasets
Requires sufficient memory
Streaming
Process data in continuous stream
Memory-efficient for large datasets
Slower, but can handle files larger than available memory
Chunked
Process data in fixed-size chunks
Balanced memory usage and performance
Good for large datasets
Lazy
Load data on-demand
Minimal initial memory usage
Slower access but memory-efficient
Incremental
Load data in incremental batches
Good for ongoing data processing
Maintains processing state
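The difference between a full load and streaming is easiest to see in code. The sketch below streams a CSV file row by row and keeps only a running total, so memory use stays roughly constant regardless of file size. It uses the standard library only, and `stream_column_sum` is a hypothetical helper, not one of the tool's operations.

```python
import csv


def stream_column_sum(file_path, column, encoding="utf-8"):
    """Return (row_count, total) for a numeric column without holding the file in memory."""
    rows = 0
    total = 0.0
    with open(file_path, newline="", encoding=encoding) as handle:
        for row in csv.DictReader(handle):  # one row at a time
            total += float(row[column])
            rows += 1
    return rows, total
```

A full load would instead read every row into memory first and aggregate afterwards, which is faster for small files but scales with the size of the data.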
Operations Supported
The Data Loader Tool supports comprehensive data loading operations:
Basic Loading
load_data - Load data from various file formats
load_csv - Load CSV files with options
load_excel - Load Excel files with sheet selection
load_json - Load JSON files with structure handling
load_parquet - Load Parquet files efficiently
Advanced Loading
load_chunked - Load data in chunks for large files
load_streaming - Stream data for memory efficiency
load_lazy - Lazy load data on-demand
load_incremental - Incremental data loading
auto_detect_format - Automatically detect file format
Schema Operations
infer_schema - Infer data schema automatically
validate_schema - Validate data against schema
apply_schema - Apply schema to loaded data
get_schema_info - Get detailed schema information
Quality Operations
validate_quality - Validate data quality
check_missing_values - Check for missing values
validate_data_types - Validate data type consistency
generate_quality_report - Generate data quality report
Utility Operations
get_file_info - Get file information and metadata
estimate_memory_usage - Estimate memory requirements
check_file_compatibility - Check file format compatibility
optimize_loading_strategy - Optimize loading strategy
Troubleshooting
Issue: File too large error
Error: File exceeds maximum size limit
Solutions:
# Shell: increase the file size limit
export DATA_LOADER_MAX_FILE_SIZE_MB=1000
# Python: or use chunked loading
data_loader.load_data(file_path, strategy='chunked')
Issue: Memory usage exceeded
Error: Memory usage exceeds limit
Solutions:
# Increase memory limit
export DATA_LOADER_MAX_MEMORY_USAGE_MB=4000
# Or reduce chunk size
export DATA_LOADER_DEFAULT_CHUNK_SIZE=5000
Issue: Schema inference fails
Error: Schema inference errors
Solutions:
# Shell: disable schema inference
export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=false
# Python: or provide an explicit schema
data_loader.load_data(file_path, schema=explicit_schema)
Issue: Quality validation fails
Error: Data quality validation errors
Solutions:
# Shell: disable quality validation
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=false
# Python: or inspect the reported quality issues so they can be fixed at the source
data_loader.validate_quality(data)
Issue: Encoding problems
Error: Text encoding errors
Solutions:
# Shell: set the correct default encoding
export DATA_LOADER_DEFAULT_ENCODING=latin-1
# Python: or specify the encoding per file
data_loader.load_data(file_path, encoding='utf-8')
Issue: Chunked loading performance
Symptom: Chunked loading is slow
Solutions:
# Shell: increase the chunk size
export DATA_LOADER_DEFAULT_CHUNK_SIZE=50000
# Python: or use the streaming strategy
data_loader.load_data(file_path, strategy='streaming')
Issue: File format not supported
Error: Unsupported file format
Solutions:
Check file format compatibility
Use auto-detection
Convert file to supported format
Install required dependencies
Best Practices
Performance Optimization
Chunk Size Tuning - Optimize chunk size for your data
Memory Management - Monitor memory usage and set appropriate limits
Format Selection - Choose efficient formats (Parquet, Feather)
Strategy Selection - Use appropriate loading strategy
Schema Optimization - Disable schema inference when not needed
Error Handling
Graceful Degradation - Handle loading failures gracefully
Validation - Validate data before processing
Fallback Strategies - Provide fallback loading methods
Error Logging - Log errors for debugging and monitoring
User Feedback - Provide clear error messages
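The fallback and logging practices above can be combined in a small wrapper that tries several loaders in turn. This is a generic sketch: `load_with_fallback` is a hypothetical helper, and each loader is assumed to be any callable that takes a file path, for example a lambda that wraps a call with a particular strategy.

```python
def load_with_fallback(loaders, file_path):
    """Try each (name, loader) pair in order and return (name, result) for the first success."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader(file_path)
        except Exception as exc:  # broad on purpose: any failure triggers the next fallback
            errors.append(f"{name}: {exc}")
    # Every attempt failed, so report all collected errors together.
    raise RuntimeError("all loading strategies failed: " + "; ".join(errors))
```

Collecting the individual error messages keeps the final exception useful for debugging, since it shows why each strategy was rejected.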
Security
File Validation - Validate file paths and permissions
Content Validation - Validate file content before loading
Access Control - Control access to data files
Audit Logging - Log data loading activities
Data Privacy - Ensure data privacy and compliance
Resource Management
Memory Monitoring - Monitor memory usage during loading
File Cleanup - Clean up temporary files
Connection Management - Manage database connections efficiently
Processing Time - Set reasonable timeouts
Storage Optimization - Optimize data storage formats
Integration
Tool Dependencies - Ensure required tools are available
API Compatibility - Maintain API compatibility
Error Propagation - Properly propagate errors
Logging Integration - Integrate with logging systems
Monitoring - Monitor tool performance and usage
Development vs Production
Development:
DATA_LOADER_MAX_FILE_SIZE_MB=100
DATA_LOADER_DEFAULT_CHUNK_SIZE=1000
DATA_LOADER_MAX_MEMORY_USAGE_MB=500
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
DATA_LOADER_ENABLE_QUALITY_VALIDATION=false
DATA_LOADER_DEFAULT_ENCODING=utf-8
Production:
DATA_LOADER_MAX_FILE_SIZE_MB=1000
DATA_LOADER_DEFAULT_CHUNK_SIZE=50000
DATA_LOADER_MAX_MEMORY_USAGE_MB=4000
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
DATA_LOADER_DEFAULT_ENCODING=utf-8
Error Handling
Always wrap data loading operations in try-except blocks:
from aiecs.tools.statistics.data_loader_tool import (
    DataLoaderTool,
    DataLoaderError,
    FileFormatError,
    SchemaValidationError,
    DataQualityError,
)
data_loader = DataLoaderTool()
try:
    data = data_loader.load_data(
        file_path='data.csv',
        source_type='csv',
        strategy='full_load'
    )
# Handle the specific errors before the general DataLoaderError
except FileFormatError as e:
    print(f"File format error: {e}")
except SchemaValidationError as e:
    print(f"Schema validation error: {e}")
except DataQualityError as e:
    print(f"Data quality error: {e}")
except DataLoaderError as e:
    print(f"Data loader error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Dependencies
Core Dependencies
# Install core dependencies
pip install pydantic python-dotenv
# Install data processing dependencies
pip install pandas numpy
# Install file format dependencies
pip install openpyxl xlrd pyarrow fastparquet
Optional Dependencies
# For statistical formats
pip install pyreadstat
# For HDF5 support
pip install h5py
# For Feather support
pip install feather-format
# For static type checking of pandas code (type stubs)
pip install pandas-stubs
Verification
# Test dependency availability
try:
    import pandas
    import numpy
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")
# Test format support
try:
    import openpyxl
    print("Excel support available")
except ImportError:
    print("Excel support not available")
try:
    import pyarrow
    print("Parquet support available")
except ImportError:
    print("Parquet support not available")
# Test statistical format support
try:
    import pyreadstat
    print("Statistical format support available")
except ImportError:
    print("Statistical format support not available")
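The same check can be written more compactly without importing each package. The sketch below uses `importlib.util.find_spec` from the standard library; `check_dependencies` is a hypothetical helper, and the module list simply mirrors the packages named in this section.

```python
from importlib.util import find_spec

# Module name -> the feature it enables, taken from the install commands above.
OPTIONAL_MODULES = {
    "pandas": "core data operations",
    "numpy": "core data operations",
    "openpyxl": "Excel support",
    "pyarrow": "Parquet support",
    "pyreadstat": "statistical format support",
    "h5py": "HDF5 support",
}


def check_dependencies(modules=OPTIONAL_MODULES):
    """Return {module_name: True/False} depending on whether the module can be found."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        status = "available" if available else "missing"
        print(f"{name} ({OPTIONAL_MODULES[name]}): {status}")
```

`find_spec` only locates the module, so a package that is installed but broken at import time would still be reported as available.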
Support
For issues or questions about Data Loader Tool configuration:
Check the tool source code for implementation details
Review pandas tool documentation for core data operations
Consult the main aiecs documentation for architecture overview
Test with simple files first to isolate configuration vs. loading issues
Verify file format compatibility and dependencies
Check memory and file size limits
Ensure proper encoding and schema settings
Validate data quality and structure requirements