Pandas Tool Configuration Guide

Overview

The Pandas Tool provides comprehensive data processing capabilities using pandas, supporting 30+ operations for data manipulation, analysis, and transformation. It can handle CSV, JSON, and Excel files with configurable processing parameters. It can be configured via environment variables using the PANDAS_TOOL_ prefix or through programmatic configuration when initializing the tool.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Pandas Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_DEFAULT_AGG={"numeric":"mean","object":"count"}
PANDAS_TOOL_CHUNK_SIZE=10000
PANDAS_TOOL_MAX_CSV_SIZE=1000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.task_tools.pandas_tool import PandasTool

# The tool will automatically use the environment variables
pandas_tool = PandasTool()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.task_tools.pandas_tool import PandasTool
pandas_tool = PandasTool()

Example .env.production:

# Production settings - optimized for large datasets
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_CHUNK_SIZE=50000
PANDAS_TOOL_MAX_CSV_SIZE=5000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx"]

Example .env.development:

# Development settings - smaller chunks for testing
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_CHUNK_SIZE=1000
PANDAS_TOOL_MAX_CSV_SIZE=100000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]

Best Practices for .env Files

Never commit .env files to version control - Add .env to your .gitignore:

# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging

Provide a template - Create .env.example with documented dummy values:

# .env.example
# Pandas Tool Configuration

# CSV delimiter character
PANDAS_TOOL_CSV_DELIMITER=,

# File encoding
PANDAS_TOOL_ENCODING=utf-8

# Default aggregation functions (JSON object)
PANDAS_TOOL_DEFAULT_AGG={"numeric":"mean","object":"count"}

# Chunk size for large file processing
PANDAS_TOOL_CHUNK_SIZE=10000

# Threshold for chunked CSV processing
PANDAS_TOOL_MAX_CSV_SIZE=1000000

# Allowed file extensions (JSON array)
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]

Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format complex types correctly:
- Strings: Plain text: ,, utf-8
- Integers: Plain numbers: 10000, 1000000
- Dictionaries: JSON object format: {"key":"value"}
- Lists: JSON array format: [".csv",".xlsx"]

Configuration Options

1. CSV Delimiter

Environment Variable: PANDAS_TOOL_CSV_DELIMITER

Type: String

Default: ","

Description: Delimiter character used for CSV file parsing and writing. This affects how CSV files are read and written.

Common Values:

, - Comma (default, standard CSV)
; - Semicolon (European CSV format)
\t - Tab (TSV format)
| - Pipe (alternative delimiter)
- Space (space-separated values)

Example:

export PANDAS_TOOL_CSV_DELIMITER=";"

Note: Use semicolon for European CSV files that use comma as decimal separator.

2. Encoding

Environment Variable: PANDAS_TOOL_ENCODING

Type: String

Default: "utf-8"

Description: Character encoding used for file operations. This ensures proper handling of international characters and special symbols.

Common Encodings:

utf-8 - Unicode (default, most common)
utf-16 - Unicode 16-bit
latin-1 - Western European
cp1252 - Windows-1252
iso-8859-1 - ISO Latin-1

Example:

export PANDAS_TOOL_ENCODING="utf-8"

Note: Use utf-8 unless you have specific requirements for legacy file formats.

3. Default Aggregation

Environment Variable: PANDAS_TOOL_DEFAULT_AGG

Type: Dict[str, str]

Default: {"numeric": "mean", "object": "count"}

Description: Default aggregation functions applied to different data types during groupby operations. This provides sensible defaults for statistical operations.

Format: JSON object with string keys and values

Available Functions:

mean - Average value
sum - Total sum
count - Count of non-null values
min - Minimum value
max - Maximum value
std - Standard deviation
median - Median value
first - First value
last - Last value

Example:

export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"sum","object":"count","datetime":"first"}'

4. Chunk Size

Environment Variable: PANDAS_TOOL_CHUNK_SIZE

Type: Integer

Default: 10000

Description: Number of rows to process at a time when reading large files. This helps manage memory usage for large datasets.

Common Values:

1000 - Small chunks (development/testing)
10000 - Default (balanced)
50000 - Large chunks (production)
100000 - Very large chunks (high-memory systems)

Example:

export PANDAS_TOOL_CHUNK_SIZE=50000

Performance Note: Larger chunks use more memory but process faster. Adjust based on available RAM.

5. Max CSV Size

Environment Variable: PANDAS_TOOL_MAX_CSV_SIZE

Type: Integer

Default: 1000000

Description: Threshold (in characters) for determining when to use chunked processing for CSV files. Files larger than this will be processed in chunks.

Common Values:

100000 - Small threshold (100KB)
1000000 - Default (1MB)
10000000 - Large threshold (10MB)
100000000 - Very large threshold (100MB)

Example:

export PANDAS_TOOL_MAX_CSV_SIZE=5000000

Memory Management: Lower values trigger chunked processing earlier, using less memory but potentially slower processing.

6. Allowed File Extensions

Environment Variable: PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS

Type: List[str]

Default: ['.csv', '.xlsx', '.json']

Description: List of allowed file extensions for data processing. This is a security feature that prevents processing of unauthorized file types.

Format: JSON array string with double quotes

Supported Formats:

.csv - Comma-separated values
.xlsx - Excel spreadsheets
.json - JSON data files
.xls - Legacy Excel format (if supported)

Example:

# Strict - Only CSV and Excel
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

# Lenient - All supported formats
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'

Security Note: Only allow extensions that your application actually needs to process.

Usage Examples

Example 1: Basic Environment Configuration

# Set custom processing parameters
export PANDAS_TOOL_CSV_DELIMITER=";"
export PANDAS_TOOL_CHUNK_SIZE=50000
export PANDAS_TOOL_MAX_CSV_SIZE=5000000
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

# Run your application
python app.py

Example 2: High-Performance Configuration

# Optimized for large datasets
export PANDAS_TOOL_CHUNK_SIZE=100000
export PANDAS_TOOL_MAX_CSV_SIZE=10000000
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"sum","object":"count"}'

Example 3: Memory-Constrained Configuration

# Conservative memory usage
export PANDAS_TOOL_CHUNK_SIZE=1000
export PANDAS_TOOL_MAX_CSV_SIZE=100000
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'

Example 4: Programmatic Configuration

from aiecs.tools.task_tools.pandas_tool import PandasTool

# Initialize with custom configuration
pandas_tool = PandasTool(config={
    'csv_delimiter': ';',
    'encoding': 'utf-8',
    'chunk_size': 50000,
    'max_csv_size': 5000000,
    'allowed_file_extensions': ['.csv', '.xlsx']
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

# Set environment defaults
export PANDAS_TOOL_CHUNK_SIZE=10000

# Override for specific instance
pandas_tool = PandasTool(config={
    'chunk_size': 50000  # This overrides the environment variable
})

Configuration Priority

When the Pandas Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

Programmatic config - Values passed to the constructor
Environment variables - Values set via PANDAS_TOOL_* variables
Default values - Built-in defaults as specified above

Data Type Parsing

String Values

Strings should be provided as plain text without quotes:

export PANDAS_TOOL_CSV_DELIMITER=,
export PANDAS_TOOL_ENCODING=utf-8

Integer Values

Integers should be provided as numeric strings:

export PANDAS_TOOL_CHUNK_SIZE=10000
export PANDAS_TOOL_MAX_CSV_SIZE=1000000

Dictionary Values

Dictionaries must be provided as JSON objects with double quotes:

# Correct
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'

# Incorrect (will not parse)
export PANDAS_TOOL_DEFAULT_AGG="numeric:mean,object:count"

List Values

Lists must be provided as JSON arrays with double quotes:

# Correct
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'

# Incorrect (will not parse)
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=".csv,.xlsx,.json"

Important: Use single quotes for the shell, double quotes for JSON:

export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'
#                                      ^                    ^
#                                      Single quotes for shell
#                                         ^      ^
#                                         Double quotes for JSON

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

csv_delimiter must be a non-empty string
encoding must be a valid encoding string
default_agg must be a dictionary with string keys and values
chunk_size must be a positive integer
max_csv_size must be a positive integer
allowed_file_extensions must be a list of strings

Runtime Validation

When processing data, the tool validates:

File extensions - Must be in allowed_file_extensions list
Data structure - Input records must be valid for DataFrame creation
Column existence - Referenced columns must exist in DataFrames
Data types - Type conversions are validated
Query syntax - Filter conditions are validated

Operations Supported

The Pandas Tool supports 30+ operations across multiple categories:

Data Reading/Writing

read_csv - Read CSV string into DataFrame
read_json - Read JSON string into DataFrame
read_file - Read data from file (CSV, Excel, JSON)
write_file - Write DataFrame to file

Descriptive Statistics

summary - Compute summary statistics
describe - Compute descriptive statistics for columns
value_counts - Compute value counts for columns

Filtering and Selection

filter - Filter DataFrame based on condition
select_columns - Select specified columns
drop_columns - Drop specified columns
drop_duplicates - Drop duplicate rows
dropna - Drop rows/columns with missing values

Grouping and Aggregation

groupby - Group DataFrame and apply aggregations
pivot_table - Create pivot table

Merging and Concatenation

merge - Merge two DataFrames
concat - Concatenate multiple DataFrames

Data Transformation

sort_values - Sort DataFrame by columns
rename_columns - Rename DataFrame columns
replace_values - Replace values in DataFrame
fill_na - Fill missing values
astype - Convert column types
apply - Apply function to columns/rows

Data Reshaping

melt - Melt DataFrame to long format
pivot - Pivot DataFrame to wide format
stack - Stack DataFrame columns into rows
unstack - Unstack DataFrame rows into columns

Data Cleaning

strip_strings - Strip whitespace from string columns
to_numeric - Convert columns to numeric type
to_datetime - Convert columns to datetime type

Statistical Computations

mean - Compute mean of numeric columns
sum - Compute sum of numeric columns
count - Compute count of non-null values
min - Compute minimum values
max - Compute maximum values

Window Functions

rolling - Apply rolling window function

Sampling and Viewing

head - Return first n rows
tail - Return last n rows
sample - Return random sample of rows

Troubleshooting

Issue: CSV parsing fails

Error: Failed to read CSV: ParserError

Solutions:

Check delimiter: export PANDAS_TOOL_CSV_DELIMITER=";"
Check encoding: export PANDAS_TOOL_ENCODING="utf-8"
Verify CSV format and structure

Issue: Memory errors with large files

Error: MemoryError or system becomes unresponsive

Solutions:

# Reduce chunk size
export PANDAS_TOOL_CHUNK_SIZE=1000

# Lower CSV size threshold
export PANDAS_TOOL_MAX_CSV_SIZE=100000

Issue: File extension not allowed

Error: Unsupported file type

Solution:

# Add the extension to allowed list
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'

Issue: Dictionary parsing error

Error: Configuration parsing fails for default_agg

Solution:

# Use proper JSON object syntax
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'

# NOT: {"numeric":mean,"object":count} or numeric:mean,object:count

Issue: List parsing error

Error: Configuration parsing fails for allowed_file_extensions

Solution:

# Use proper JSON array syntax
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

# NOT: [.csv,.xlsx] or .csv,.xlsx

Issue: Encoding errors

Error: UnicodeDecodeError when reading files

Solutions:

Try different encoding: export PANDAS_TOOL_ENCODING="latin-1"
Check file encoding: file -i filename.csv
Use UTF-8 with BOM: export PANDAS_TOOL_ENCODING="utf-8-sig"

Issue: Performance issues

Causes: Large datasets, inefficient chunk sizes

Solutions:

Increase chunk size: export PANDAS_TOOL_CHUNK_SIZE=50000
Increase CSV threshold: export PANDAS_TOOL_MAX_CSV_SIZE=5000000
Use appropriate data types
Consider data preprocessing

Issue: Column not found

Error: Columns not found: ['column_name']

Solutions:

Check column names (case-sensitive)
Use head() operation to inspect data structure
Verify data format and headers

Best Practices

Performance Optimization

Tune chunk sizes - Match chunk_size to available memory
Set appropriate thresholds - Use max_csv_size to trigger chunking
Use efficient data types - Convert to appropriate types early
Filter early - Apply filters before expensive operations
Cache results - Leverage BaseTool’s built-in caching

Memory Management

Monitor memory usage - Watch for memory spikes
Use chunked processing - For files larger than max_csv_size
Process in batches - For multiple large files
Clean up DataFrames - Delete large objects when done
Use appropriate dtypes - Avoid object dtypes when possible

Data Quality

Validate input data - Check for missing values and outliers
Handle encoding properly - Use correct encoding for your data
Sanitize data - Clean data before processing
Use consistent delimiters - Match delimiter to data format
Handle missing values - Use fill_na() or dropna() appropriately

Security

Limit file extensions - Only allow necessary formats
Validate file paths - Prevent directory traversal
Sanitize queries - Validate filter conditions
Monitor resource usage - Prevent DoS attacks
Use appropriate permissions - Limit file system access

Development vs Production

Development:

PANDAS_TOOL_CHUNK_SIZE=1000
PANDAS_TOOL_MAX_CSV_SIZE=100000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'

Production:

PANDAS_TOOL_CHUNK_SIZE=50000
PANDAS_TOOL_MAX_CSV_SIZE=5000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

Error Handling

Always wrap pandas operations in try-except blocks:

from aiecs.tools.task_tools.pandas_tool import (
    PandasTool, 
    DataFrameError, 
    InputValidationError,
    ValidationError
)

pandas_tool = PandasTool()

try:
    result = pandas_tool.read_csv(csv_data)
except DataFrameError as e:
    print(f"DataFrame operation failed: {e}")
except InputValidationError as e:
    print(f"Input validation failed: {e}")
except ValidationError as e:
    print(f"Validation failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Support

For issues or questions about Pandas Tool configuration:

Check the tool source code for implementation details
Review pandas documentation for specific functionality
Consult the main aiecs documentation for architecture overview
Test with small datasets first to isolate configuration vs. data issues
Monitor memory and performance metrics during processing

Pandas Tool Configuration Guide

Overview

Using .env Files in Your Project

Setting Up .env Files

Multiple Environment Files

Best Practices for .env Files

Configuration Options

1. CSV Delimiter

2. Encoding

3. Default Aggregation

4. Chunk Size

5. Max CSV Size

6. Allowed File Extensions

Usage Examples

Example 1: Basic Environment Configuration

Example 2: High-Performance Configuration

Example 3: Memory-Constrained Configuration

Example 4: Programmatic Configuration

Example 5: Mixed Configuration

Configuration Priority

Data Type Parsing

String Values

Integer Values

Dictionary Values

List Values

Validation

Automatic Type Validation

Runtime Validation

Operations Supported

Data Reading/Writing

Descriptive Statistics

Filtering and Selection

Grouping and Aggregation

Merging and Concatenation

Data Transformation

Data Reshaping

Data Cleaning

Statistical Computations

Window Functions

Sampling and Viewing

Troubleshooting

Issue: CSV parsing fails

Issue: Memory errors with large files

Issue: File extension not allowed

Issue: Dictionary parsing error

Issue: List parsing error

Issue: Encoding errors

Issue: Performance issues

Issue: Column not found

Best Practices

Performance Optimization

Memory Management

Data Quality

Security

Development vs Production

Error Handling

Related Documentation

Support