Pandas Tool Configuration Guide

Overview

The Pandas Tool provides comprehensive data processing capabilities using pandas, supporting 30+ operations for data manipulation, analysis, and transformation. It can handle CSV, JSON, and Excel files with configurable processing parameters. It can be configured via environment variables using the PANDAS_TOOL_ prefix or through programmatic configuration when initializing the tool.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Pandas Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_DEFAULT_AGG={"numeric":"mean","object":"count"}
PANDAS_TOOL_CHUNK_SIZE=10000
PANDAS_TOOL_MAX_CSV_SIZE=1000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.task_tools.pandas_tool import PandasTool

# The tool will automatically use the environment variables
pandas_tool = PandasTool()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.task_tools.pandas_tool import PandasTool
pandas_tool = PandasTool()

Example .env.production:

# Production settings - optimized for large datasets
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_CHUNK_SIZE=50000
PANDAS_TOOL_MAX_CSV_SIZE=5000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx"]

Example .env.development:

# Development settings - smaller chunks for testing
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_CHUNK_SIZE=1000
PANDAS_TOOL_MAX_CSV_SIZE=100000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]

Best Practices for .env Files

  1. Never commit .env files to version control - Add .env to your .gitignore:

    # .gitignore
    .env
    .env.local
    .env.*.local
    .env.production
    .env.staging
    
  2. Provide a template - Create .env.example with documented dummy values:

    # .env.example
    # Pandas Tool Configuration
    
    # CSV delimiter character
    PANDAS_TOOL_CSV_DELIMITER=,
    
    # File encoding
    PANDAS_TOOL_ENCODING=utf-8
    
    # Default aggregation functions (JSON object)
    PANDAS_TOOL_DEFAULT_AGG={"numeric":"mean","object":"count"}
    
    # Chunk size for large file processing
    PANDAS_TOOL_CHUNK_SIZE=10000
    
    # Threshold for chunked CSV processing
    PANDAS_TOOL_MAX_CSV_SIZE=1000000
    
    # Allowed file extensions (JSON array)
    PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]
    
  3. Document your variables - Add comments explaining each setting

  4. Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports

  5. Format complex types correctly:

    • Strings: Plain text: ,, utf-8

    • Integers: Plain numbers: 10000, 1000000

    • Dictionaries: JSON object format: {"key":"value"}

    • Lists: JSON array format: [".csv",".xlsx"]

Configuration Options

1. CSV Delimiter

Environment Variable: PANDAS_TOOL_CSV_DELIMITER

Type: String

Default: ","

Description: Delimiter character used for CSV file parsing and writing. This affects how CSV files are read and written.

Common Values:

  • , - Comma (default, standard CSV)

  • ; - Semicolon (European CSV format)

  • \t - Tab (TSV format)

  • | - Pipe (alternative delimiter)

  • - Space (space-separated values)

Example:

export PANDAS_TOOL_CSV_DELIMITER=";"

Note: Use semicolon for European CSV files that use comma as decimal separator.

2. Encoding

Environment Variable: PANDAS_TOOL_ENCODING

Type: String

Default: "utf-8"

Description: Character encoding used for file operations. This ensures proper handling of international characters and special symbols.

Common Encodings:

  • utf-8 - Unicode (default, most common)

  • utf-16 - Unicode 16-bit

  • latin-1 - Western European

  • cp1252 - Windows-1252

  • iso-8859-1 - ISO Latin-1

Example:

export PANDAS_TOOL_ENCODING="utf-8"

Note: Use utf-8 unless you have specific requirements for legacy file formats.

3. Default Aggregation

Environment Variable: PANDAS_TOOL_DEFAULT_AGG

Type: Dict[str, str]

Default: {"numeric": "mean", "object": "count"}

Description: Default aggregation functions applied to different data types during groupby operations. This provides sensible defaults for statistical operations.

Format: JSON object with string keys and values

Available Functions:

  • mean - Average value

  • sum - Total sum

  • count - Count of non-null values

  • min - Minimum value

  • max - Maximum value

  • std - Standard deviation

  • median - Median value

  • first - First value

  • last - Last value

Example:

export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"sum","object":"count","datetime":"first"}'

4. Chunk Size

Environment Variable: PANDAS_TOOL_CHUNK_SIZE

Type: Integer

Default: 10000

Description: Number of rows to process at a time when reading large files. This helps manage memory usage for large datasets.

Common Values:

  • 1000 - Small chunks (development/testing)

  • 10000 - Default (balanced)

  • 50000 - Large chunks (production)

  • 100000 - Very large chunks (high-memory systems)

Example:

export PANDAS_TOOL_CHUNK_SIZE=50000

Performance Note: Larger chunks use more memory but process faster. Adjust based on available RAM.

5. Max CSV Size

Environment Variable: PANDAS_TOOL_MAX_CSV_SIZE

Type: Integer

Default: 1000000

Description: Threshold (in characters) for determining when to use chunked processing for CSV files. Files larger than this will be processed in chunks.

Common Values:

  • 100000 - Small threshold (100KB)

  • 1000000 - Default (1MB)

  • 10000000 - Large threshold (10MB)

  • 100000000 - Very large threshold (100MB)

Example:

export PANDAS_TOOL_MAX_CSV_SIZE=5000000

Memory Management: Lower values trigger chunked processing earlier, using less memory but potentially slower processing.

6. Allowed File Extensions

Environment Variable: PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS

Type: List[str]

Default: ['.csv', '.xlsx', '.json']

Description: List of allowed file extensions for data processing. This is a security feature that prevents processing of unauthorized file types.

Format: JSON array string with double quotes

Supported Formats:

  • .csv - Comma-separated values

  • .xlsx - Excel spreadsheets

  • .json - JSON data files

  • .xls - Legacy Excel format (if supported)

Example:

# Strict - Only CSV and Excel
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

# Lenient - All supported formats
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'

Security Note: Only allow extensions that your application actually needs to process.

Usage Examples

Example 1: Basic Environment Configuration

# Set custom processing parameters
export PANDAS_TOOL_CSV_DELIMITER=";"
export PANDAS_TOOL_CHUNK_SIZE=50000
export PANDAS_TOOL_MAX_CSV_SIZE=5000000
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

# Run your application
python app.py

Example 2: High-Performance Configuration

# Optimized for large datasets
export PANDAS_TOOL_CHUNK_SIZE=100000
export PANDAS_TOOL_MAX_CSV_SIZE=10000000
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"sum","object":"count"}'

Example 3: Memory-Constrained Configuration

# Conservative memory usage
export PANDAS_TOOL_CHUNK_SIZE=1000
export PANDAS_TOOL_MAX_CSV_SIZE=100000
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'

Example 4: Programmatic Configuration

from aiecs.tools.task_tools.pandas_tool import PandasTool

# Initialize with custom configuration
pandas_tool = PandasTool(config={
    'csv_delimiter': ';',
    'encoding': 'utf-8',
    'chunk_size': 50000,
    'max_csv_size': 5000000,
    'allowed_file_extensions': ['.csv', '.xlsx']
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

# Set environment defaults
export PANDAS_TOOL_CHUNK_SIZE=10000
# Override for specific instance
pandas_tool = PandasTool(config={
    'chunk_size': 50000  # This overrides the environment variable
})

Configuration Priority

When the Pandas Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

  1. Programmatic config - Values passed to the constructor

  2. Environment variables - Values set via PANDAS_TOOL_* variables

  3. Default values - Built-in defaults as specified above

Data Type Parsing

String Values

Strings should be provided as plain text without quotes:

export PANDAS_TOOL_CSV_DELIMITER=,
export PANDAS_TOOL_ENCODING=utf-8

Integer Values

Integers should be provided as numeric strings:

export PANDAS_TOOL_CHUNK_SIZE=10000
export PANDAS_TOOL_MAX_CSV_SIZE=1000000

Dictionary Values

Dictionaries must be provided as JSON objects with double quotes:

# Correct
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'

# Incorrect (will not parse)
export PANDAS_TOOL_DEFAULT_AGG="numeric:mean,object:count"

List Values

Lists must be provided as JSON arrays with double quotes:

# Correct
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'

# Incorrect (will not parse)
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=".csv,.xlsx,.json"

Important: Use single quotes for the shell, double quotes for JSON:

export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'
#                                      ^                    ^
#                                      Single quotes for shell
#                                         ^      ^
#                                         Double quotes for JSON

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

  • csv_delimiter must be a non-empty string

  • encoding must be a valid encoding string

  • default_agg must be a dictionary with string keys and values

  • chunk_size must be a positive integer

  • max_csv_size must be a positive integer

  • allowed_file_extensions must be a list of strings

Runtime Validation

When processing data, the tool validates:

  1. File extensions - Must be in allowed_file_extensions list

  2. Data structure - Input records must be valid for DataFrame creation

  3. Column existence - Referenced columns must exist in DataFrames

  4. Data types - Type conversions are validated

  5. Query syntax - Filter conditions are validated

Operations Supported

The Pandas Tool supports 30+ operations across multiple categories:

Data Reading/Writing

  • read_csv - Read CSV string into DataFrame

  • read_json - Read JSON string into DataFrame

  • read_file - Read data from file (CSV, Excel, JSON)

  • write_file - Write DataFrame to file

Descriptive Statistics

  • summary - Compute summary statistics

  • describe - Compute descriptive statistics for columns

  • value_counts - Compute value counts for columns

Filtering and Selection

  • filter - Filter DataFrame based on condition

  • select_columns - Select specified columns

  • drop_columns - Drop specified columns

  • drop_duplicates - Drop duplicate rows

  • dropna - Drop rows/columns with missing values

Grouping and Aggregation

  • groupby - Group DataFrame and apply aggregations

  • pivot_table - Create pivot table

Merging and Concatenation

  • merge - Merge two DataFrames

  • concat - Concatenate multiple DataFrames

Data Transformation

  • sort_values - Sort DataFrame by columns

  • rename_columns - Rename DataFrame columns

  • replace_values - Replace values in DataFrame

  • fill_na - Fill missing values

  • astype - Convert column types

  • apply - Apply function to columns/rows

Data Reshaping

  • melt - Melt DataFrame to long format

  • pivot - Pivot DataFrame to wide format

  • stack - Stack DataFrame columns into rows

  • unstack - Unstack DataFrame rows into columns

Data Cleaning

  • strip_strings - Strip whitespace from string columns

  • to_numeric - Convert columns to numeric type

  • to_datetime - Convert columns to datetime type

Statistical Computations

  • mean - Compute mean of numeric columns

  • sum - Compute sum of numeric columns

  • count - Compute count of non-null values

  • min - Compute minimum values

  • max - Compute maximum values

Window Functions

  • rolling - Apply rolling window function

Sampling and Viewing

  • head - Return first n rows

  • tail - Return last n rows

  • sample - Return random sample of rows

Troubleshooting

Issue: CSV parsing fails

Error: Failed to read CSV: ParserError

Solutions:

  1. Check delimiter: export PANDAS_TOOL_CSV_DELIMITER=";"

  2. Check encoding: export PANDAS_TOOL_ENCODING="utf-8"

  3. Verify CSV format and structure

Issue: Memory errors with large files

Error: MemoryError or system becomes unresponsive

Solutions:

# Reduce chunk size
export PANDAS_TOOL_CHUNK_SIZE=1000

# Lower CSV size threshold
export PANDAS_TOOL_MAX_CSV_SIZE=100000

Issue: File extension not allowed

Error: Unsupported file type

Solution:

# Add the extension to allowed list
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'

Issue: Dictionary parsing error

Error: Configuration parsing fails for default_agg

Solution:

# Use proper JSON object syntax
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'

# NOT: {"numeric":mean,"object":count} or numeric:mean,object:count

Issue: List parsing error

Error: Configuration parsing fails for allowed_file_extensions

Solution:

# Use proper JSON array syntax
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

# NOT: [.csv,.xlsx] or .csv,.xlsx

Issue: Encoding errors

Error: UnicodeDecodeError when reading files

Solutions:

  1. Try different encoding: export PANDAS_TOOL_ENCODING="latin-1"

  2. Check file encoding: file -i filename.csv

  3. Use UTF-8 with BOM: export PANDAS_TOOL_ENCODING="utf-8-sig"

Issue: Performance issues

Causes: Large datasets, inefficient chunk sizes

Solutions:

  1. Increase chunk size: export PANDAS_TOOL_CHUNK_SIZE=50000

  2. Increase CSV threshold: export PANDAS_TOOL_MAX_CSV_SIZE=5000000

  3. Use appropriate data types

  4. Consider data preprocessing

Issue: Column not found

Error: Columns not found: ['column_name']

Solutions:

  1. Check column names (case-sensitive)

  2. Use head() operation to inspect data structure

  3. Verify data format and headers

Best Practices

Performance Optimization

  1. Tune chunk sizes - Match chunk_size to available memory

  2. Set appropriate thresholds - Use max_csv_size to trigger chunking

  3. Use efficient data types - Convert to appropriate types early

  4. Filter early - Apply filters before expensive operations

  5. Cache results - Leverage BaseTool’s built-in caching

Memory Management

  1. Monitor memory usage - Watch for memory spikes

  2. Use chunked processing - For files larger than max_csv_size

  3. Process in batches - For multiple large files

  4. Clean up DataFrames - Delete large objects when done

  5. Use appropriate dtypes - Avoid object dtypes when possible

Data Quality

  1. Validate input data - Check for missing values and outliers

  2. Handle encoding properly - Use correct encoding for your data

  3. Sanitize data - Clean data before processing

  4. Use consistent delimiters - Match delimiter to data format

  5. Handle missing values - Use fill_na() or dropna() appropriately

Security

  1. Limit file extensions - Only allow necessary formats

  2. Validate file paths - Prevent directory traversal

  3. Sanitize queries - Validate filter conditions

  4. Monitor resource usage - Prevent DoS attacks

  5. Use appropriate permissions - Limit file system access

Development vs Production

Development:

PANDAS_TOOL_CHUNK_SIZE=1000
PANDAS_TOOL_MAX_CSV_SIZE=100000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'

Production:

PANDAS_TOOL_CHUNK_SIZE=50000
PANDAS_TOOL_MAX_CSV_SIZE=5000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

Error Handling

Always wrap pandas operations in try-except blocks:

from aiecs.tools.task_tools.pandas_tool import (
    PandasTool, 
    DataFrameError, 
    InputValidationError,
    ValidationError
)

pandas_tool = PandasTool()

try:
    result = pandas_tool.read_csv(csv_data)
except DataFrameError as e:
    print(f"DataFrame operation failed: {e}")
except InputValidationError as e:
    print(f"Input validation failed: {e}")
except ValidationError as e:
    print(f"Validation failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Support

For issues or questions about Pandas Tool configuration:

  • Check the tool source code for implementation details

  • Review pandas documentation for specific functionality

  • Consult the main aiecs documentation for architecture overview

  • Test with small datasets first to isolate configuration vs. data issues

  • Monitor memory and performance metrics during processing