Data Loader Tool Configuration Guide

Overview

The Data Loader Tool loads data from many file formats through a single interface. It auto-detects file formats, infers and validates schemas, validates data quality on load, and handles large datasets through streaming and chunked loading. Supported source types are CSV, Excel, JSON, Parquet, Feather, HDF5, Stata, SAS, and SPSS; supported loading strategies are full_load, streaming, chunked, lazy, and incremental. The tool integrates with pandas_tool for core data operations. Configure it through environment variables with the DATA_LOADER_ prefix, or programmatically when initializing the tool.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Data Loader Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
DATA_LOADER_MAX_FILE_SIZE_MB=500
DATA_LOADER_DEFAULT_CHUNK_SIZE=10000
DATA_LOADER_MAX_MEMORY_USAGE_MB=2000
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
DATA_LOADER_DEFAULT_ENCODING=utf-8

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.data_loader_tool import DataLoaderTool

# The tool will automatically use the environment variables
data_loader = DataLoaderTool()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.data_loader_tool import DataLoaderTool
data_loader = DataLoaderTool()

Example .env.production:

# Production settings - optimized for large datasets
DATA_LOADER_MAX_FILE_SIZE_MB=1000
DATA_LOADER_DEFAULT_CHUNK_SIZE=50000
DATA_LOADER_MAX_MEMORY_USAGE_MB=4000
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
DATA_LOADER_DEFAULT_ENCODING=utf-8

Example .env.development:

# Development settings - optimized for testing and debugging
DATA_LOADER_MAX_FILE_SIZE_MB=100
DATA_LOADER_DEFAULT_CHUNK_SIZE=1000
DATA_LOADER_MAX_MEMORY_USAGE_MB=500
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
DATA_LOADER_ENABLE_QUALITY_VALIDATION=false
DATA_LOADER_DEFAULT_ENCODING=utf-8

Best Practices for .env Files

Never commit .env files to version control - Add .env to your .gitignore:

# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging

Provide a template - Create .env.example with documented dummy values:

# .env.example
# Data Loader Tool Configuration

# Maximum file size in megabytes
DATA_LOADER_MAX_FILE_SIZE_MB=500

# Default chunk size for chunked loading
DATA_LOADER_DEFAULT_CHUNK_SIZE=10000

# Maximum memory usage in megabytes
DATA_LOADER_MAX_MEMORY_USAGE_MB=2000

# Whether to enable automatic schema inference
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true

# Whether to enable data quality validation
DATA_LOADER_ENABLE_QUALITY_VALIDATION=true

# Default text encoding for file operations
DATA_LOADER_DEFAULT_ENCODING=utf-8

Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format values correctly:
- Strings: plain text without quotes, such as utf-8 or latin-1
- Integers: plain numbers, such as 500, 10000, or 2000
- Booleans: true or false

Configuration Options

1. Max File Size MB

Environment Variable: DATA_LOADER_MAX_FILE_SIZE_MB

Type: Integer

Default: 500

Description: Maximum file size in megabytes that can be loaded. Files larger than this will trigger chunked or streaming loading strategies to prevent memory issues.

Common Values:

100 - Small files only (development/testing)
500 - Standard files (default, balanced)
1000 - Large files (production)
2000 - Very large files (high-memory systems)

Example:

export DATA_LOADER_MAX_FILE_SIZE_MB=1000

Size Note: Larger values allow bigger files but require more memory.
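
The size check described above can be sketched with the standard library alone. This is a minimal illustration, not the tool's implementation: pick_strategy is a hypothetical helper, and the rule "fall back to chunked when the file exceeds the limit" is an assumption based on the description in this section.

```python
import os
import tempfile


def pick_strategy(path: str, max_file_size_mb: int = 500) -> str:
    """Return 'full_load' when the file fits under the limit, else 'chunked'.

    Hypothetical helper: the real tool may choose differently.
    """
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return "full_load" if size_mb <= max_file_size_mb else "chunked"


# Demonstration with a temporary 2 MB file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (2 * 1024 * 1024))
    demo_path = tmp.name

small_limit_choice = pick_strategy(demo_path, max_file_size_mb=1)
default_choice = pick_strategy(demo_path)  # default limit of 500 MB
os.remove(demo_path)

print(small_limit_choice, default_choice)
```

Running a check like this before loading makes the effect of DATA_LOADER_MAX_FILE_SIZE_MB visible without touching the data itself.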

2. Default Chunk Size

Environment Variable: DATA_LOADER_DEFAULT_CHUNK_SIZE

Type: Integer

Default: 10000

Description: Default chunk size for chunked loading operations. This determines how many rows are processed at once when using chunked loading strategies.

Common Values:

1000 - Small chunks (low memory usage)
10000 - Standard chunks (default, balanced)
50000 - Large chunks (high performance)
100000 - Very large chunks (maximum performance)

Example:

export DATA_LOADER_DEFAULT_CHUNK_SIZE=50000

Chunk Note: Larger chunks improve performance but use more memory.
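
To get a feel for the trade-off, the number of chunks for a given dataset can be worked out directly. The helper below is a hypothetical sketch, and the row count is an invented figure used only for the arithmetic.

```python
import math


def chunk_count(total_rows: int, chunk_size: int = 10000) -> int:
    """Number of chunks needed to cover total_rows rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return math.ceil(total_rows / chunk_size)


total_rows = 1_250_000  # illustrative dataset size
for size in (1000, 10000, 50000, 100000):
    print(f"chunk size {size:>6}: {chunk_count(total_rows, size)} chunks")
```

Fewer, larger chunks mean less per-chunk overhead, at the cost of holding more rows in memory at once.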

3. Max Memory Usage MB

Environment Variable: DATA_LOADER_MAX_MEMORY_USAGE_MB

Type: Integer

Default: 2000

Description: Maximum memory usage in megabytes for data loading operations. This helps prevent out-of-memory errors by controlling memory consumption.

Common Values:

500 - Low memory systems
2000 - Standard systems (default)
4000 - High memory systems
8000 - Very high memory systems

Example:

export DATA_LOADER_MAX_MEMORY_USAGE_MB=4000

Memory Note: Higher values allow larger datasets but require more system memory.

4. Enable Schema Inference

Environment Variable: DATA_LOADER_ENABLE_SCHEMA_INFERENCE

Type: Boolean

Default: True

Description: Whether to enable automatic schema inference when loading data. Schema inference automatically detects data types and structure.

Values:

true - Enable schema inference (default)
false - Disable schema inference

Example:

export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true

Schema Note: Schema inference produces more accurate column types but may slow down loading.

5. Enable Quality Validation

Environment Variable: DATA_LOADER_ENABLE_QUALITY_VALIDATION

Type: Boolean

Default: True

Description: Whether to enable data quality validation during loading. Quality validation checks for missing values, data type consistency, and other quality issues.

Values:

true - Enable quality validation (default)
false - Disable quality validation

Example:

export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true

Quality Note: Quality validation improves data reliability but may slow down loading.

6. Default Encoding

Environment Variable: DATA_LOADER_DEFAULT_ENCODING

Type: String

Default: "utf-8"

Description: Default text encoding for file operations. This is used when no specific encoding is provided for text-based file formats.

Common Encodings:

utf-8 - Unicode UTF-8 (default, most common)
latin-1 - Latin-1 (ISO-8859-1)
cp1252 - Windows-1252
ascii - ASCII (7-bit)

Example:

export DATA_LOADER_DEFAULT_ENCODING=utf-8

Encoding Note: UTF-8 is recommended for international text; Latin-1 and cp1252 are mainly useful for files produced by legacy systems.

Usage Examples

Example 1: Basic Environment Configuration

# Set basic data loading parameters
export DATA_LOADER_MAX_FILE_SIZE_MB=500
export DATA_LOADER_DEFAULT_CHUNK_SIZE=10000
export DATA_LOADER_MAX_MEMORY_USAGE_MB=2000
export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
export DATA_LOADER_DEFAULT_ENCODING=utf-8

# Run your application
python app.py

Example 2: High-Performance Configuration

# Optimized for large datasets and high performance
export DATA_LOADER_MAX_FILE_SIZE_MB=1000
export DATA_LOADER_DEFAULT_CHUNK_SIZE=50000
export DATA_LOADER_MAX_MEMORY_USAGE_MB=4000
export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
export DATA_LOADER_DEFAULT_ENCODING=utf-8

Example 3: Development Configuration

# Development-friendly settings
export DATA_LOADER_MAX_FILE_SIZE_MB=100
export DATA_LOADER_DEFAULT_CHUNK_SIZE=1000
export DATA_LOADER_MAX_MEMORY_USAGE_MB=500
export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=false
export DATA_LOADER_DEFAULT_ENCODING=utf-8

Example 4: Programmatic Configuration

from aiecs.tools.statistics.data_loader_tool import DataLoaderTool

# Initialize with custom configuration
data_loader = DataLoaderTool(config={
    'max_file_size_mb': 500,
    'default_chunk_size': 10000,
    'max_memory_usage_mb': 2000,
    'enable_schema_inference': True,
    'enable_quality_validation': True,
    'default_encoding': 'utf-8'
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

# In your shell: set environment defaults
export DATA_LOADER_DEFAULT_CHUNK_SIZE=10000
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=true

# In Python: override for a specific instance
from aiecs.tools.statistics.data_loader_tool import DataLoaderTool

data_loader = DataLoaderTool(config={
    'default_chunk_size': 50000,  # Overrides the environment variable
    'enable_quality_validation': False  # Overrides the environment variable
})

Configuration Priority

When the Data Loader Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

Programmatic config - Values passed to the constructor
Environment variables - Values set via DATA_LOADER_* variables
Default values - Built-in defaults as specified above
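
The resolution order above can be illustrated with a small standalone function. This is a sketch of the layering idea only; resolve_config is a hypothetical name and not part of aiecs, and values read from the environment are left as raw strings here.

```python
import os

# Built-in defaults as listed under Configuration Options.
DEFAULTS = {
    "max_file_size_mb": 500,
    "default_chunk_size": 10000,
    "max_memory_usage_mb": 2000,
    "enable_schema_inference": True,
    "enable_quality_validation": True,
    "default_encoding": "utf-8",
}


def resolve_config(overrides=None, environ=None):
    """Layer defaults, then DATA_LOADER_* variables, then explicit overrides."""
    environ = os.environ if environ is None else environ
    resolved = dict(DEFAULTS)
    for key in DEFAULTS:
        env_name = "DATA_LOADER_" + key.upper()
        if env_name in environ:
            resolved[key] = environ[env_name]  # raw string, not yet parsed
    resolved.update(overrides or {})
    return resolved


fake_env = {
    "DATA_LOADER_DEFAULT_CHUNK_SIZE": "10000",
    "DATA_LOADER_DEFAULT_ENCODING": "latin-1",
}
config = resolve_config({"default_chunk_size": 50000}, fake_env)
print(config["default_chunk_size"], config["default_encoding"], config["max_file_size_mb"])
```

The chunk size comes from the programmatic override, the encoding from the environment, and the file size limit from the defaults.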

Data Type Parsing

String Values

Strings should be provided as plain text without quotes:

export DATA_LOADER_DEFAULT_ENCODING=utf-8
export DATA_LOADER_DEFAULT_ENCODING=latin-1

Integer Values

Integers should be provided as numeric strings:

export DATA_LOADER_MAX_FILE_SIZE_MB=500
export DATA_LOADER_DEFAULT_CHUNK_SIZE=10000
export DATA_LOADER_MAX_MEMORY_USAGE_MB=2000

Boolean Values

Booleans should be provided as lowercase strings:

export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=false
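
Environment variables always arrive as strings, so each value has to be converted before use. The following sketch shows one way to do that conversion; parse_bool and parse_int are hypothetical helpers, and the tool's own parsing may accept more spellings than this strict version does.

```python
def parse_bool(raw: str) -> bool:
    """Convert 'true' or 'false' (any letter case) to a bool."""
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {raw!r}")


def parse_int(raw: str) -> int:
    """Convert a numeric string such as '10000' to an int."""
    return int(raw.strip())


print(parse_int("10000"), parse_bool("true"), parse_bool("false"))
```

A value such as "yes" raises ValueError in this sketch, which surfaces typos early instead of silently treating them as false.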

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

max_file_size_mb must be a positive integer
default_chunk_size must be a positive integer
max_memory_usage_mb must be a positive integer
enable_schema_inference must be a boolean
enable_quality_validation must be a boolean
default_encoding must be a non-empty string
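
The rules above can be expressed in a few lines. The sketch below uses a standard-library dataclass as a stand-in for the Pydantic model so that it runs without extra dependencies; LoaderSettings is a hypothetical name, and the real model may report errors differently.

```python
from dataclasses import dataclass


@dataclass
class LoaderSettings:
    """Stand-in for the tool's configuration model (assumed field names)."""
    max_file_size_mb: int = 500
    default_chunk_size: int = 10000
    max_memory_usage_mb: int = 2000
    enable_schema_inference: bool = True
    enable_quality_validation: bool = True
    default_encoding: str = "utf-8"

    def __post_init__(self):
        for name in ("max_file_size_mb", "default_chunk_size", "max_memory_usage_mb"):
            value = getattr(self, name)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer")
        for name in ("enable_schema_inference", "enable_quality_validation"):
            if not isinstance(getattr(self, name), bool):
                raise ValueError(f"{name} must be a boolean")
        if not isinstance(self.default_encoding, str) or not self.default_encoding:
            raise ValueError("default_encoding must be a non-empty string")


settings = LoaderSettings(default_chunk_size=50000)
print(settings)
```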

Runtime Validation

When loading data, the tool validates:

File size - Files must not exceed maximum size limit
Memory usage - Operations must not exceed memory limits
Chunk size - Chunk size must be reasonable for the dataset
Encoding - Encoding must be valid for the file format
Schema compatibility - Schema inference must be compatible with data
Quality standards - Data must meet quality validation criteria

Data Source Types

The Data Loader Tool supports various data source types:

Text and Spreadsheet Formats

CSV - Comma-separated values
JSON - JavaScript Object Notation
Excel - Microsoft Excel files (.xlsx, .xls)

Binary Formats

Parquet - Columnar storage format
Feather - Fast binary format
HDF5 - Hierarchical Data Format

Statistical Formats

Stata - Stata data files
SAS - SAS data files
SPSS - SPSS data files

Auto-Detection

AUTO - Automatically detect file format
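
How the tool detects formats internally is not described here. A common approach is to map the file extension to a format, which the sketch below illustrates. The mapping, the format labels, and guess_format are all assumptions made for the example.

```python
from pathlib import Path

# Assumed extension-to-format mapping; labels are illustrative.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".xlsx": "excel",
    ".xls": "excel",
    ".parquet": "parquet",
    ".feather": "feather",
    ".h5": "hdf5",
    ".hdf5": "hdf5",
    ".dta": "stata",
    ".sas7bdat": "sas",
    ".sav": "spss",
}


def guess_format(path: str) -> str:
    """Guess a format label from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TO_FORMAT:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return EXTENSION_TO_FORMAT[suffix]


print(guess_format("reports/sales.CSV"), guess_format("survey.sav"))
```

Extension-based detection is cheap but can be fooled by misnamed files, so passing an explicit source type remains the safer choice when the format is known.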

Loading Strategies

Full Load

Load entire dataset into memory
Fastest for small to medium datasets
Requires sufficient memory

Streaming

Process data in continuous stream
Memory-efficient for large datasets
Slower, but can handle datasets larger than available memory

Chunked

Process data in fixed-size chunks
Balanced memory usage and performance
Good for large datasets
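
The idea behind chunked processing can be shown with the standard library: rows are pulled from an iterator in fixed-size groups, so only one chunk is held in memory at a time. iter_chunks is a hypothetical helper and the CSV content is invented sample data.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# In-memory CSV used in place of a real file.
sample = io.StringIO("id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n")
reader = csv.DictReader(sample)
chunk_sizes = [len(chunk) for chunk in iter_chunks(reader, chunk_size=2)]
print(chunk_sizes)
```

Five rows with a chunk size of 2 produce two full chunks and one final partial chunk.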

Lazy

Load data on-demand
Minimal initial memory usage
Slower access but memory-efficient
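
Lazy loading defers the expensive read until the data is first accessed. The sketch below shows the pattern with functools.cached_property; LazyDataset and expensive_load are hypothetical names, and the tool's own lazy strategy may work differently.

```python
from functools import cached_property


class LazyDataset:
    """Wraps a loader callable and runs it only on first access."""

    def __init__(self, loader):
        self._loader = loader

    @cached_property
    def rows(self):
        # Executed once; the result is cached on the instance afterwards.
        return self._loader()


load_calls = []


def expensive_load():
    load_calls.append(1)  # record each real load
    return [{"id": 1}, {"id": 2}]


dataset = LazyDataset(expensive_load)
calls_before_access = len(load_calls)
first = dataset.rows
second = dataset.rows
calls_after_access = len(load_calls)
print(calls_before_access, calls_after_access)
```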

Incremental

Load data in incremental batches
Good for ongoing data processing
Maintains processing state
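
"Maintains processing state" means the loader remembers how far it has read, so each call continues where the previous one stopped. A minimal sketch of that idea, using a hypothetical IncrementalReader over an in-memory sequence:

```python
class IncrementalReader:
    """Returns successive batches and tracks the current position."""

    def __init__(self, rows, batch_size):
        self._rows = list(rows)
        self._batch_size = batch_size
        self.position = 0  # index of the next unread row

    def next_batch(self):
        batch = self._rows[self.position:self.position + self._batch_size]
        self.position += len(batch)
        return batch


reader = IncrementalReader(range(5), batch_size=2)
batches = [reader.next_batch() for _ in range(4)]
print(batches, reader.position)
```

An empty batch signals that no new rows are available yet; persisting the position between runs would let processing resume after a restart.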

Operations Supported

The Data Loader Tool supports comprehensive data loading operations:

Basic Loading

load_data - Load data from various file formats
load_csv - Load CSV files with options
load_excel - Load Excel files with sheet selection
load_json - Load JSON files with structure handling
load_parquet - Load Parquet files efficiently

Advanced Loading

load_chunked - Load data in chunks for large files
load_streaming - Stream data for memory efficiency
load_lazy - Lazy load data on-demand
load_incremental - Incremental data loading
auto_detect_format - Automatically detect file format

Schema Operations

infer_schema - Infer data schema automatically
validate_schema - Validate data against schema
apply_schema - Apply schema to loaded data
get_schema_info - Get detailed schema information
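
To make the idea of schema inference concrete, the sketch below infers a simple type for each column from raw string values. It is a standalone illustration, not the tool's infer_schema operation; the function names and the three type labels are assumptions.

```python
def infer_type(values):
    """Return 'integer', 'float', or 'string' for a column of raw strings."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def infer_schema_from_rows(rows):
    """Map each column name to its inferred type."""
    columns = list(rows[0].keys()) if rows else []
    return {column: infer_type([row[column] for row in rows]) for column in columns}


rows = [
    {"id": "1", "price": "9.5", "name": "apple"},
    {"id": "2", "price": "12", "name": "pear"},
]
schema = infer_schema_from_rows(rows)
print(schema)
```

Every value in a column has to be examined, which is why inference adds to loading time on large files.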

Quality Operations

validate_quality - Validate data quality
check_missing_values - Check for missing values
validate_data_types - Validate data type consistency
generate_quality_report - Generate data quality report
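
A missing-value check, one of the quality operations listed above, can be sketched in plain Python. count_missing is a hypothetical helper, and treating empty strings and None as missing is an assumption made for this example.

```python
def count_missing(rows, missing=("", None)):
    """Count missing values per column across a list of row dicts."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value in missing:
                counts[column] += 1
    return counts


rows = [
    {"id": "1", "email": ""},
    {"id": None, "email": "a@example.com"},
    {"id": "3", "email": "b@example.com"},
]
missing_counts = count_missing(rows)
print(missing_counts)
```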

Utility Operations

get_file_info - Get file information and metadata
estimate_memory_usage - Estimate memory requirements
check_file_compatibility - Check file format compatibility
optimize_loading_strategy - Optimize loading strategy

Troubleshooting

Issue: File too large error

Error: File exceeds maximum size limit

Solutions:

# Increase file size limit
export DATA_LOADER_MAX_FILE_SIZE_MB=1000

# Or use chunked loading
data_loader.load_data(file_path, strategy='chunked')

Issue: Memory usage exceeded

Error: Memory usage exceeds limit

Solutions:

# Increase memory limit
export DATA_LOADER_MAX_MEMORY_USAGE_MB=4000

# Or reduce chunk size
export DATA_LOADER_DEFAULT_CHUNK_SIZE=5000

Issue: Schema inference fails

Error: Schema inference errors

Solutions:

# Disable schema inference
export DATA_LOADER_ENABLE_SCHEMA_INFERENCE=false

# Or provide explicit schema
data_loader.load_data(file_path, schema=explicit_schema)

Issue: Quality validation fails

Error: Data quality validation errors

Solutions:

# Disable quality validation
export DATA_LOADER_ENABLE_QUALITY_VALIDATION=false

# Or inspect the reported issues so they can be fixed in the data
data_loader.validate_quality(data)

Issue: Encoding problems

Error: Text encoding errors

Solutions:

# Set correct encoding
export DATA_LOADER_DEFAULT_ENCODING=latin-1

# Or specify encoding per file
data_loader.load_data(file_path, encoding='utf-8')
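
When the encoding of a file is unknown, one approach is to try candidates in order and keep the first that decodes cleanly. The sketch below works on raw bytes with the standard library; decode_with_fallback is a hypothetical helper and the candidate order is an assumption. Latin-1 goes last because it accepts any byte sequence and therefore never fails.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes data."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


legacy_bytes = "caf\u00e9".encode("latin-1")  # not valid UTF-8
text, used = decode_with_fallback(legacy_bytes)
print(text, used)
```

A successful decode does not prove the guess is right, only that the bytes are valid in that encoding, so the result is worth spot-checking.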

Issue: Chunked loading performance

Symptom: Chunked loading is slow

Solutions:

# Increase chunk size
export DATA_LOADER_DEFAULT_CHUNK_SIZE=50000

# Or use streaming strategy
data_loader.load_data(file_path, strategy='streaming')

Issue: File format not supported

Error: Unsupported file format

Solutions:

Check file format compatibility
Use auto-detection
Convert file to supported format
Install required dependencies

Best Practices

Performance Optimization

Chunk Size Tuning - Optimize chunk size for your data
Memory Management - Monitor memory usage and set appropriate limits
Format Selection - Choose efficient formats (Parquet, Feather)
Strategy Selection - Use appropriate loading strategy
Schema Optimization - Disable schema inference when not needed

Error Handling

Graceful Degradation - Handle loading failures gracefully
Validation - Validate data before processing
Fallback Strategies - Provide fallback loading methods
Error Logging - Log errors for debugging and monitoring
User Feedback - Provide clear error messages
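
The fallback idea above can be sketched as a loop over candidate loaders that stops at the first success and collects the failures for logging. load_with_fallback and the two stand-in loaders are hypothetical; in practice each callable would wrap a real loading call with a different strategy.

```python
def load_with_fallback(loaders):
    """Try (name, callable) pairs in order; return (name, result) of the first success."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # collect the failure and try the next loader
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All loading attempts failed: " + "; ".join(errors))


def failing_full_load():
    raise MemoryError("dataset too large for a full load")


def working_chunked():
    return ["row1", "row2"]


used, data = load_with_fallback([
    ("full_load", failing_full_load),
    ("chunked", working_chunked),
])
print(used, data)
```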

Security

File Validation - Validate file paths and permissions
Content Validation - Validate file content before loading
Access Control - Control access to data files
Audit Logging - Log data loading activities
Data Privacy - Ensure data privacy and compliance

Resource Management

Memory Monitoring - Monitor memory usage during loading
File Cleanup - Clean up temporary files
Connection Management - Manage database connections efficiently
Processing Time - Set reasonable timeouts
Storage Optimization - Optimize data storage formats

Integration

Tool Dependencies - Ensure required tools are available
API Compatibility - Maintain API compatibility
Error Propagation - Properly propagate errors
Logging Integration - Integrate with logging systems
Monitoring - Monitor tool performance and usage

Development vs Production

Development:

DATA_LOADER_MAX_FILE_SIZE_MB=100
DATA_LOADER_DEFAULT_CHUNK_SIZE=1000
DATA_LOADER_MAX_MEMORY_USAGE_MB=500
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
DATA_LOADER_ENABLE_QUALITY_VALIDATION=false
DATA_LOADER_DEFAULT_ENCODING=utf-8

Production:

DATA_LOADER_MAX_FILE_SIZE_MB=1000
DATA_LOADER_DEFAULT_CHUNK_SIZE=50000
DATA_LOADER_MAX_MEMORY_USAGE_MB=4000
DATA_LOADER_ENABLE_SCHEMA_INFERENCE=true
DATA_LOADER_ENABLE_QUALITY_VALIDATION=true
DATA_LOADER_DEFAULT_ENCODING=utf-8

Error Handling

Always wrap data loading operations in try-except blocks:

from aiecs.tools.statistics.data_loader_tool import (
    DataLoaderTool,
    DataLoaderError,
    FileFormatError,
    SchemaValidationError,
    DataQualityError,
)

data_loader = DataLoaderTool()

try:
    data = data_loader.load_data(
        file_path='data.csv',
        source_type='csv',
        strategy='full_load'
    )
except FileFormatError as e:
    print(f"File format error: {e}")
except SchemaValidationError as e:
    print(f"Schema validation error: {e}")
except DataQualityError as e:
    print(f"Data quality error: {e}")
except DataLoaderError as e:
    print(f"Data loader error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Dependencies

Core Dependencies

# Install core dependencies
pip install pydantic python-dotenv

# Install data processing dependencies
pip install pandas numpy

# Install file format dependencies
pip install openpyxl xlrd pyarrow fastparquet

Optional Dependencies

# For statistical formats
pip install pyreadstat

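# For HDF5 support (reading HDF5 through pandas.read_hdf additionally requires PyTables: pip install tables)
pip install h5py

# For Feather support (pandas reads Feather through pyarrow, listed above; feather-format is a thin wrapper around it)
pip install feather-format

# For static type checking of pandas code (type stubs only, no runtime effect)
pip install pandas-stubs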

Verification

# Test dependency availability
try:
    import pandas
    import numpy
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test format support
try:
    import openpyxl
    print("Excel support available")
except ImportError:
    print("Excel support not available")

try:
    import pyarrow
    print("Parquet support available")
except ImportError:
    print("Parquet support not available")

# Test statistical format support
try:
    import pyreadstat
    print("Statistical format support available")
except ImportError:
    print("Statistical format support not available")
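
As an alternative to importing each package, availability can be checked without loading the modules by using importlib.util.find_spec. The sketch below is one way to build a compact report; the feature labels and the OPTIONAL_PACKAGES mapping are illustrative.

```python
from importlib.util import find_spec

# Feature label -> top-level module name providing it.
OPTIONAL_PACKAGES = {
    "Excel (.xlsx)": "openpyxl",
    "Parquet": "pyarrow",
    "Statistical formats": "pyreadstat",
}


def available(module_name: str) -> bool:
    """True when a top-level module can be found, without importing it."""
    return find_spec(module_name) is not None


report = {feature: available(module) for feature, module in OPTIONAL_PACKAGES.items()}
for feature, is_available in report.items():
    status = "available" if is_available else "not available"
    print(f"{feature} support {status}")
```

Because nothing is imported, this check is fast and free of import side effects, though it cannot detect a package that is installed but broken.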

Support

For issues or questions about Data Loader Tool configuration:

Check the tool source code for implementation details
Review pandas tool documentation for core data operations
Consult the main aiecs documentation for architecture overview
Test with simple files first to isolate configuration vs. loading issues
Verify file format compatibility and dependencies
Check memory and file size limits
Ensure proper encoding and schema settings
Validate data quality and structure requirements
