Pandas Tool Configuration Guide
Overview
The Pandas Tool provides comprehensive data processing capabilities using pandas, supporting 30+ operations for data manipulation, analysis, and transformation. It can handle CSV, JSON, and Excel files with configurable processing parameters. It can be configured via environment variables using the PANDAS_TOOL_ prefix or through programmatic configuration when initializing the tool.
Using .env Files in Your Project
When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Pandas Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.
Setting Up .env Files
1. Install python-dotenv:
pip install python-dotenv
2. Create a .env file in your project root:
# .env file in your project root
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_DEFAULT_AGG={"numeric":"mean","object":"count"}
PANDAS_TOOL_CHUNK_SIZE=10000
PANDAS_TOOL_MAX_CSV_SIZE=1000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]
3. Load the .env file in your application:
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv
# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()
# Now import and use aiecs tools
from aiecs.tools.task_tools.pandas_tool import PandasTool
# The tool will automatically use the environment variables
pandas_tool = PandasTool()
Multiple Environment Files
You can use different .env files for different environments:
import os
from dotenv import load_dotenv
# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')
if env == 'production':
load_dotenv('.env.production')
elif env == 'staging':
load_dotenv('.env.staging')
else:
load_dotenv('.env.development')
from aiecs.tools.task_tools.pandas_tool import PandasTool
pandas_tool = PandasTool()
Example .env.production:
# Production settings - optimized for large datasets
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_CHUNK_SIZE=50000
PANDAS_TOOL_MAX_CSV_SIZE=5000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx"]
Example .env.development:
# Development settings - smaller chunks for testing
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_CHUNK_SIZE=1000
PANDAS_TOOL_MAX_CSV_SIZE=100000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]
Best Practices for .env Files
Never commit .env files to version control - Add
.envto your.gitignore:# .gitignore .env .env.local .env.*.local .env.production .env.staging
Provide a template - Create
.env.examplewith documented dummy values:# .env.example # Pandas Tool Configuration # CSV delimiter character PANDAS_TOOL_CSV_DELIMITER=, # File encoding PANDAS_TOOL_ENCODING=utf-8 # Default aggregation functions (JSON object) PANDAS_TOOL_DEFAULT_AGG={"numeric":"mean","object":"count"} # Chunk size for large file processing PANDAS_TOOL_CHUNK_SIZE=10000 # Threshold for chunked CSV processing PANDAS_TOOL_MAX_CSV_SIZE=1000000 # Allowed file extensions (JSON array) PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]
Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format complex types correctly:
Strings: Plain text:
,,utf-8Integers: Plain numbers:
10000,1000000Dictionaries: JSON object format:
{"key":"value"}Lists: JSON array format:
[".csv",".xlsx"]
Configuration Options
1. CSV Delimiter
Environment Variable: PANDAS_TOOL_CSV_DELIMITER
Type: String
Default: ","
Description: Delimiter character used for CSV file parsing and writing. This affects how CSV files are read and written.
Common Values:
,- Comma (default, standard CSV);- Semicolon (European CSV format)\t- Tab (TSV format)|- Pipe (alternative delimiter)- Space (space-separated values)
Example:
export PANDAS_TOOL_CSV_DELIMITER=";"
Note: Use semicolon for European CSV files that use comma as decimal separator.
2. Encoding
Environment Variable: PANDAS_TOOL_ENCODING
Type: String
Default: "utf-8"
Description: Character encoding used for file operations. This ensures proper handling of international characters and special symbols.
Common Encodings:
utf-8- Unicode (default, most common)utf-16- Unicode 16-bitlatin-1- Western Europeancp1252- Windows-1252iso-8859-1- ISO Latin-1
Example:
export PANDAS_TOOL_ENCODING="utf-8"
Note: Use utf-8 unless you have specific requirements for legacy file formats.
3. Default Aggregation
Environment Variable: PANDAS_TOOL_DEFAULT_AGG
Type: Dict[str, str]
Default: {"numeric": "mean", "object": "count"}
Description: Default aggregation functions applied to different data types during groupby operations. This provides sensible defaults for statistical operations.
Format: JSON object with string keys and values
Available Functions:
mean- Average valuesum- Total sumcount- Count of non-null valuesmin- Minimum valuemax- Maximum valuestd- Standard deviationmedian- Median valuefirst- First valuelast- Last value
Example:
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"sum","object":"count","datetime":"first"}'
4. Chunk Size
Environment Variable: PANDAS_TOOL_CHUNK_SIZE
Type: Integer
Default: 10000
Description: Number of rows to process at a time when reading large files. This helps manage memory usage for large datasets.
Common Values:
1000- Small chunks (development/testing)10000- Default (balanced)50000- Large chunks (production)100000- Very large chunks (high-memory systems)
Example:
export PANDAS_TOOL_CHUNK_SIZE=50000
Performance Note: Larger chunks use more memory but process faster. Adjust based on available RAM.
5. Max CSV Size
Environment Variable: PANDAS_TOOL_MAX_CSV_SIZE
Type: Integer
Default: 1000000
Description: Threshold (in characters) for determining when to use chunked processing for CSV files. Files larger than this will be processed in chunks.
Common Values:
100000- Small threshold (100KB)1000000- Default (1MB)10000000- Large threshold (10MB)100000000- Very large threshold (100MB)
Example:
export PANDAS_TOOL_MAX_CSV_SIZE=5000000
Memory Management: Lower values trigger chunked processing earlier, using less memory but potentially slower processing.
6. Allowed File Extensions
Environment Variable: PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS
Type: List[str]
Default: ['.csv', '.xlsx', '.json']
Description: List of allowed file extensions for data processing. This is a security feature that prevents processing of unauthorized file types.
Format: JSON array string with double quotes
Supported Formats:
.csv- Comma-separated values.xlsx- Excel spreadsheets.json- JSON data files.xls- Legacy Excel format (if supported)
Example:
# Strict - Only CSV and Excel
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'
# Lenient - All supported formats
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'
Security Note: Only allow extensions that your application actually needs to process.
Usage Examples
Example 1: Basic Environment Configuration
# Set custom processing parameters
export PANDAS_TOOL_CSV_DELIMITER=";"
export PANDAS_TOOL_CHUNK_SIZE=50000
export PANDAS_TOOL_MAX_CSV_SIZE=5000000
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'
# Run your application
python app.py
Example 2: High-Performance Configuration
# Optimized for large datasets
export PANDAS_TOOL_CHUNK_SIZE=100000
export PANDAS_TOOL_MAX_CSV_SIZE=10000000
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"sum","object":"count"}'
Example 3: Memory-Constrained Configuration
# Conservative memory usage
export PANDAS_TOOL_CHUNK_SIZE=1000
export PANDAS_TOOL_MAX_CSV_SIZE=100000
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'
Example 4: Programmatic Configuration
from aiecs.tools.task_tools.pandas_tool import PandasTool
# Initialize with custom configuration
pandas_tool = PandasTool(config={
'csv_delimiter': ';',
'encoding': 'utf-8',
'chunk_size': 50000,
'max_csv_size': 5000000,
'allowed_file_extensions': ['.csv', '.xlsx']
})
Example 5: Mixed Configuration
Environment variables are used as defaults, but can be overridden programmatically:
# Set environment defaults
export PANDAS_TOOL_CHUNK_SIZE=10000
# Override for specific instance
pandas_tool = PandasTool(config={
'chunk_size': 50000 # This overrides the environment variable
})
Configuration Priority
When the Pandas Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):
Programmatic config - Values passed to the constructor
Environment variables - Values set via
PANDAS_TOOL_*variablesDefault values - Built-in defaults as specified above
Data Type Parsing
String Values
Strings should be provided as plain text without quotes:
export PANDAS_TOOL_CSV_DELIMITER=,
export PANDAS_TOOL_ENCODING=utf-8
Integer Values
Integers should be provided as numeric strings:
export PANDAS_TOOL_CHUNK_SIZE=10000
export PANDAS_TOOL_MAX_CSV_SIZE=1000000
Dictionary Values
Dictionaries must be provided as JSON objects with double quotes:
# Correct
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'
# Incorrect (will not parse)
export PANDAS_TOOL_DEFAULT_AGG="numeric:mean,object:count"
List Values
Lists must be provided as JSON arrays with double quotes:
# Correct
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'
# Incorrect (will not parse)
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=".csv,.xlsx,.json"
Important: Use single quotes for the shell, double quotes for JSON:
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'
# ^ ^
# Single quotes for shell
# ^ ^
# Double quotes for JSON
Validation
Automatic Type Validation
Pydantic automatically validates configuration values:
csv_delimitermust be a non-empty stringencodingmust be a valid encoding stringdefault_aggmust be a dictionary with string keys and valueschunk_sizemust be a positive integermax_csv_sizemust be a positive integerallowed_file_extensionsmust be a list of strings
Runtime Validation
When processing data, the tool validates:
File extensions - Must be in
allowed_file_extensionslistData structure - Input records must be valid for DataFrame creation
Column existence - Referenced columns must exist in DataFrames
Data types - Type conversions are validated
Query syntax - Filter conditions are validated
Operations Supported
The Pandas Tool supports 30+ operations across multiple categories:
Data Reading/Writing
read_csv- Read CSV string into DataFrameread_json- Read JSON string into DataFrameread_file- Read data from file (CSV, Excel, JSON)write_file- Write DataFrame to file
Descriptive Statistics
summary- Compute summary statisticsdescribe- Compute descriptive statistics for columnsvalue_counts- Compute value counts for columns
Filtering and Selection
filter- Filter DataFrame based on conditionselect_columns- Select specified columnsdrop_columns- Drop specified columnsdrop_duplicates- Drop duplicate rowsdropna- Drop rows/columns with missing values
Grouping and Aggregation
groupby- Group DataFrame and apply aggregationspivot_table- Create pivot table
Merging and Concatenation
merge- Merge two DataFramesconcat- Concatenate multiple DataFrames
Data Transformation
sort_values- Sort DataFrame by columnsrename_columns- Rename DataFrame columnsreplace_values- Replace values in DataFramefill_na- Fill missing valuesastype- Convert column typesapply- Apply function to columns/rows
Data Reshaping
melt- Melt DataFrame to long formatpivot- Pivot DataFrame to wide formatstack- Stack DataFrame columns into rowsunstack- Unstack DataFrame rows into columns
Data Cleaning
strip_strings- Strip whitespace from string columnsto_numeric- Convert columns to numeric typeto_datetime- Convert columns to datetime type
Statistical Computations
mean- Compute mean of numeric columnssum- Compute sum of numeric columnscount- Compute count of non-null valuesmin- Compute minimum valuesmax- Compute maximum values
Window Functions
rolling- Apply rolling window function
Sampling and Viewing
head- Return first n rowstail- Return last n rowssample- Return random sample of rows
Troubleshooting
Issue: CSV parsing fails
Error: Failed to read CSV: ParserError
Solutions:
Check delimiter:
export PANDAS_TOOL_CSV_DELIMITER=";"Check encoding:
export PANDAS_TOOL_ENCODING="utf-8"Verify CSV format and structure
Issue: Memory errors with large files
Error: MemoryError or system becomes unresponsive
Solutions:
# Reduce chunk size
export PANDAS_TOOL_CHUNK_SIZE=1000
# Lower CSV size threshold
export PANDAS_TOOL_MAX_CSV_SIZE=100000
Issue: File extension not allowed
Error: Unsupported file type
Solution:
# Add the extension to allowed list
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'
Issue: Dictionary parsing error
Error: Configuration parsing fails for default_agg
Solution:
# Use proper JSON object syntax
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'
# NOT: {"numeric":mean,"object":count} or numeric:mean,object:count
Issue: List parsing error
Error: Configuration parsing fails for allowed_file_extensions
Solution:
# Use proper JSON array syntax
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'
# NOT: [.csv,.xlsx] or .csv,.xlsx
Issue: Encoding errors
Error: UnicodeDecodeError when reading files
Solutions:
Try different encoding:
export PANDAS_TOOL_ENCODING="latin-1"Check file encoding:
file -i filename.csvUse UTF-8 with BOM:
export PANDAS_TOOL_ENCODING="utf-8-sig"
Issue: Performance issues
Causes: Large datasets, inefficient chunk sizes
Solutions:
Increase chunk size:
export PANDAS_TOOL_CHUNK_SIZE=50000Increase CSV threshold:
export PANDAS_TOOL_MAX_CSV_SIZE=5000000Use appropriate data types
Consider data preprocessing
Issue: Column not found
Error: Columns not found: ['column_name']
Solutions:
Check column names (case-sensitive)
Use
head()operation to inspect data structureVerify data format and headers
Best Practices
Performance Optimization
Tune chunk sizes - Match
chunk_sizeto available memorySet appropriate thresholds - Use
max_csv_sizeto trigger chunkingUse efficient data types - Convert to appropriate types early
Filter early - Apply filters before expensive operations
Cache results - Leverage BaseTool’s built-in caching
Memory Management
Monitor memory usage - Watch for memory spikes
Use chunked processing - For files larger than
max_csv_sizeProcess in batches - For multiple large files
Clean up DataFrames - Delete large objects when done
Use appropriate dtypes - Avoid object dtypes when possible
Data Quality
Validate input data - Check for missing values and outliers
Handle encoding properly - Use correct encoding for your data
Sanitize data - Clean data before processing
Use consistent delimiters - Match delimiter to data format
Handle missing values - Use
fill_na()ordropna()appropriately
Security
Limit file extensions - Only allow necessary formats
Validate file paths - Prevent directory traversal
Sanitize queries - Validate filter conditions
Monitor resource usage - Prevent DoS attacks
Use appropriate permissions - Limit file system access
Development vs Production
Development:
PANDAS_TOOL_CHUNK_SIZE=1000
PANDAS_TOOL_MAX_CSV_SIZE=100000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'
Production:
PANDAS_TOOL_CHUNK_SIZE=50000
PANDAS_TOOL_MAX_CSV_SIZE=5000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'
Error Handling
Always wrap pandas operations in try-except blocks:
from aiecs.tools.task_tools.pandas_tool import (
PandasTool,
DataFrameError,
InputValidationError,
ValidationError
)
pandas_tool = PandasTool()
try:
result = pandas_tool.read_csv(csv_data)
except DataFrameError as e:
print(f"DataFrame operation failed: {e}")
except InputValidationError as e:
print(f"Input validation failed: {e}")
except ValidationError as e:
print(f"Validation failed: {e}")
except Exception as e:
print(f"Unexpected error: {e}")
Support
For issues or questions about Pandas Tool configuration:
Check the tool source code for implementation details
Review pandas documentation for specific functionality
Consult the main aiecs documentation for architecture overview
Test with small datasets first to isolate configuration vs. data issues
Monitor memory and performance metrics during processing