# Pandas Tool Configuration Guide

## Overview

The Pandas Tool provides comprehensive data processing capabilities using pandas, supporting 30+ operations for data manipulation, analysis, and transformation. It can handle CSV, JSON, and Excel files with configurable processing parameters. It can be configured via environment variables using the `PANDAS_TOOL_` prefix or through programmatic configuration when initializing the tool.

## Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The Pandas Tool reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools.

### Setting Up .env Files

**1. Install python-dotenv:**

```bash
pip install python-dotenv
```

**2. Create a `.env` file in your project root:**

```bash
# .env file in your project root
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_DEFAULT_AGG={"numeric":"mean","object":"count"}
PANDAS_TOOL_CHUNK_SIZE=10000
PANDAS_TOOL_MAX_CSV_SIZE=1000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]
```

**3. Load the .env file in your application:**

```python
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.task_tools.pandas_tool import PandasTool

# The tool will automatically use the environment variables
pandas_tool = PandasTool()
```

### Multiple Environment Files

You can use different `.env` files for different environments:

```python
import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.task_tools.pandas_tool import PandasTool
pandas_tool = PandasTool()
```

**Example `.env.production`:**
```bash
# Production settings - optimized for large datasets
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_CHUNK_SIZE=50000
PANDAS_TOOL_MAX_CSV_SIZE=5000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx"]
```

**Example `.env.development`:**
```bash
# Development settings - smaller chunks for testing
PANDAS_TOOL_CSV_DELIMITER=,
PANDAS_TOOL_ENCODING=utf-8
PANDAS_TOOL_CHUNK_SIZE=1000
PANDAS_TOOL_MAX_CSV_SIZE=100000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]
```

### Best Practices for .env Files

1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`:
   ```gitignore
   # .gitignore
   .env
   .env.local
   .env.*.local
   .env.production
   .env.staging
   ```

2. **Provide a template** - Create `.env.example` with documented dummy values:
   ```bash
   # .env.example
   # Pandas Tool Configuration
   
   # CSV delimiter character
   PANDAS_TOOL_CSV_DELIMITER=,
   
   # File encoding
   PANDAS_TOOL_ENCODING=utf-8
   
   # Default aggregation functions (JSON object)
   PANDAS_TOOL_DEFAULT_AGG={"numeric":"mean","object":"count"}
   
   # Chunk size for large file processing
   PANDAS_TOOL_CHUNK_SIZE=10000
   
   # Threshold for chunked CSV processing
   PANDAS_TOOL_MAX_CSV_SIZE=1000000
   
   # Allowed file extensions (JSON array)
   PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=[".csv",".xlsx",".json"]
   ```

3. **Document your variables** - Add comments explaining each setting

4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports

5. **Format complex types correctly**:
   - Strings: Plain text: `,`, `utf-8`
   - Integers: Plain numbers: `10000`, `1000000`
   - Dictionaries: JSON object format: `{"key":"value"}`
   - Lists: JSON array format: `[".csv",".xlsx"]`

## Configuration Options

### 1. CSV Delimiter

**Environment Variable:** `PANDAS_TOOL_CSV_DELIMITER`

**Type:** String

**Default:** `","`

**Description:** Delimiter character used for CSV file parsing and writing. This affects how CSV files are read and written.

**Common Values:**
- `,` - Comma (default, standard CSV)
- `;` - Semicolon (European CSV format)
- `\t` - Tab (TSV format)
- `|` - Pipe (alternative delimiter)
- ` ` - Space (space-separated values)

**Example:**
```bash
export PANDAS_TOOL_CSV_DELIMITER=";"
```

**Note:** Use semicolon for European CSV files that use comma as decimal separator.

### 2. Encoding

**Environment Variable:** `PANDAS_TOOL_ENCODING`

**Type:** String

**Default:** `"utf-8"`

**Description:** Character encoding used for file operations. This ensures proper handling of international characters and special symbols.

**Common Encodings:**
- `utf-8` - Unicode (default, most common)
- `utf-16` - Unicode 16-bit
- `latin-1` - Western European
- `cp1252` - Windows-1252
- `iso-8859-1` - ISO Latin-1

**Example:**
```bash
export PANDAS_TOOL_ENCODING="utf-8"
```

**Note:** Use `utf-8` unless you have specific requirements for legacy file formats.

### 3. Default Aggregation

**Environment Variable:** `PANDAS_TOOL_DEFAULT_AGG`

**Type:** Dict[str, str]

**Default:** `{"numeric": "mean", "object": "count"}`

**Description:** Default aggregation functions applied to different data types during groupby operations. This provides sensible defaults for statistical operations.

**Format:** JSON object with string keys and values

**Available Functions:**
- `mean` - Average value
- `sum` - Total sum
- `count` - Count of non-null values
- `min` - Minimum value
- `max` - Maximum value
- `std` - Standard deviation
- `median` - Median value
- `first` - First value
- `last` - Last value

**Example:**
```bash
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"sum","object":"count","datetime":"first"}'
```

### 4. Chunk Size

**Environment Variable:** `PANDAS_TOOL_CHUNK_SIZE`

**Type:** Integer

**Default:** `10000`

**Description:** Number of rows to process at a time when reading large files. This helps manage memory usage for large datasets.

**Common Values:**
- `1000` - Small chunks (development/testing)
- `10000` - Default (balanced)
- `50000` - Large chunks (production)
- `100000` - Very large chunks (high-memory systems)

**Example:**
```bash
export PANDAS_TOOL_CHUNK_SIZE=50000
```

**Performance Note:** Larger chunks use more memory but process faster. Adjust based on available RAM.

### 5. Max CSV Size

**Environment Variable:** `PANDAS_TOOL_MAX_CSV_SIZE`

**Type:** Integer

**Default:** `1000000`

**Description:** Threshold (in characters) for determining when to use chunked processing for CSV files. Files larger than this will be processed in chunks.

**Common Values:**
- `100000` - Small threshold (100KB)
- `1000000` - Default (1MB)
- `10000000` - Large threshold (10MB)
- `100000000` - Very large threshold (100MB)

**Example:**
```bash
export PANDAS_TOOL_MAX_CSV_SIZE=5000000
```

**Memory Management:** Lower values trigger chunked processing earlier, using less memory but potentially slower processing.

### 6. Allowed File Extensions

**Environment Variable:** `PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS`

**Type:** List[str]

**Default:** `['.csv', '.xlsx', '.json']`

**Description:** List of allowed file extensions for data processing. This is a security feature that prevents processing of unauthorized file types.

**Format:** JSON array string with double quotes

**Supported Formats:**
- `.csv` - Comma-separated values
- `.xlsx` - Excel spreadsheets
- `.json` - JSON data files
- `.xls` - Legacy Excel format (if supported)

**Example:**
```bash
# Strict - Only CSV and Excel
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

# Lenient - All supported formats
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'
```

**Security Note:** Only allow extensions that your application actually needs to process.

## Usage Examples

### Example 1: Basic Environment Configuration

```bash
# Set custom processing parameters
export PANDAS_TOOL_CSV_DELIMITER=";"
export PANDAS_TOOL_CHUNK_SIZE=50000
export PANDAS_TOOL_MAX_CSV_SIZE=5000000
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

# Run your application
python app.py
```

### Example 2: High-Performance Configuration

```bash
# Optimized for large datasets
export PANDAS_TOOL_CHUNK_SIZE=100000
export PANDAS_TOOL_MAX_CSV_SIZE=10000000
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"sum","object":"count"}'
```

### Example 3: Memory-Constrained Configuration

```bash
# Conservative memory usage
export PANDAS_TOOL_CHUNK_SIZE=1000
export PANDAS_TOOL_MAX_CSV_SIZE=100000
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'
```

### Example 4: Programmatic Configuration

```python
from aiecs.tools.task_tools.pandas_tool import PandasTool

# Initialize with custom configuration
pandas_tool = PandasTool(config={
    'csv_delimiter': ';',
    'encoding': 'utf-8',
    'chunk_size': 50000,
    'max_csv_size': 5000000,
    'allowed_file_extensions': ['.csv', '.xlsx']
})
```

### Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

```bash
# Set environment defaults
export PANDAS_TOOL_CHUNK_SIZE=10000
```

```python
# Override for specific instance
pandas_tool = PandasTool(config={
    'chunk_size': 50000  # This overrides the environment variable
})
```

## Configuration Priority

When the Pandas Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

1. **Programmatic config** - Values passed to the constructor
2. **Environment variables** - Values set via `PANDAS_TOOL_*` variables
3. **Default values** - Built-in defaults as specified above

## Data Type Parsing

### String Values

Strings should be provided as plain text without quotes:

```bash
export PANDAS_TOOL_CSV_DELIMITER=,
export PANDAS_TOOL_ENCODING=utf-8
```

### Integer Values

Integers should be provided as numeric strings:

```bash
export PANDAS_TOOL_CHUNK_SIZE=10000
export PANDAS_TOOL_MAX_CSV_SIZE=1000000
```

### Dictionary Values

Dictionaries must be provided as JSON objects with double quotes:

```bash
# Correct
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'

# Incorrect (will not parse)
export PANDAS_TOOL_DEFAULT_AGG="numeric:mean,object:count"
```

### List Values

Lists must be provided as JSON arrays with double quotes:

```bash
# Correct
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'

# Incorrect (will not parse)
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS=".csv,.xlsx,.json"
```

**Important:** Use single quotes for the shell, double quotes for JSON:
```bash
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'
#                                      ^                    ^
#                                      Single quotes for shell
#                                         ^      ^
#                                         Double quotes for JSON
```

## Validation

### Automatic Type Validation

Pydantic automatically validates configuration values:

- `csv_delimiter` must be a non-empty string
- `encoding` must be a valid encoding string
- `default_agg` must be a dictionary with string keys and values
- `chunk_size` must be a positive integer
- `max_csv_size` must be a positive integer
- `allowed_file_extensions` must be a list of strings

### Runtime Validation

When processing data, the tool validates:

1. **File extensions** - Must be in `allowed_file_extensions` list
2. **Data structure** - Input records must be valid for DataFrame creation
3. **Column existence** - Referenced columns must exist in DataFrames
4. **Data types** - Type conversions are validated
5. **Query syntax** - Filter conditions are validated

## Operations Supported

The Pandas Tool supports 30+ operations across multiple categories:

### Data Reading/Writing
- `read_csv` - Read CSV string into DataFrame
- `read_json` - Read JSON string into DataFrame
- `read_file` - Read data from file (CSV, Excel, JSON)
- `write_file` - Write DataFrame to file

### Descriptive Statistics
- `summary` - Compute summary statistics
- `describe` - Compute descriptive statistics for columns
- `value_counts` - Compute value counts for columns

### Filtering and Selection
- `filter` - Filter DataFrame based on condition
- `select_columns` - Select specified columns
- `drop_columns` - Drop specified columns
- `drop_duplicates` - Drop duplicate rows
- `dropna` - Drop rows/columns with missing values

### Grouping and Aggregation
- `groupby` - Group DataFrame and apply aggregations
- `pivot_table` - Create pivot table

### Merging and Concatenation
- `merge` - Merge two DataFrames
- `concat` - Concatenate multiple DataFrames

### Data Transformation
- `sort_values` - Sort DataFrame by columns
- `rename_columns` - Rename DataFrame columns
- `replace_values` - Replace values in DataFrame
- `fill_na` - Fill missing values
- `astype` - Convert column types
- `apply` - Apply function to columns/rows

### Data Reshaping
- `melt` - Melt DataFrame to long format
- `pivot` - Pivot DataFrame to wide format
- `stack` - Stack DataFrame columns into rows
- `unstack` - Unstack DataFrame rows into columns

### Data Cleaning
- `strip_strings` - Strip whitespace from string columns
- `to_numeric` - Convert columns to numeric type
- `to_datetime` - Convert columns to datetime type

### Statistical Computations
- `mean` - Compute mean of numeric columns
- `sum` - Compute sum of numeric columns
- `count` - Compute count of non-null values
- `min` - Compute minimum values
- `max` - Compute maximum values

### Window Functions
- `rolling` - Apply rolling window function

### Sampling and Viewing
- `head` - Return first n rows
- `tail` - Return last n rows
- `sample` - Return random sample of rows

## Troubleshooting

### Issue: CSV parsing fails

**Error:** `Failed to read CSV: ParserError`

**Solutions:**
1. Check delimiter: `export PANDAS_TOOL_CSV_DELIMITER=";"`
2. Check encoding: `export PANDAS_TOOL_ENCODING="utf-8"`
3. Verify CSV format and structure

### Issue: Memory errors with large files

**Error:** `MemoryError` or system becomes unresponsive

**Solutions:**
```bash
# Reduce chunk size
export PANDAS_TOOL_CHUNK_SIZE=1000

# Lower CSV size threshold
export PANDAS_TOOL_MAX_CSV_SIZE=100000
```

### Issue: File extension not allowed

**Error:** `Unsupported file type`

**Solution:**
```bash
# Add the extension to allowed list
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'
```

### Issue: Dictionary parsing error

**Error:** Configuration parsing fails for `default_agg`

**Solution:**
```bash
# Use proper JSON object syntax
export PANDAS_TOOL_DEFAULT_AGG='{"numeric":"mean","object":"count"}'

# NOT: {"numeric":mean,"object":count} or numeric:mean,object:count
```

### Issue: List parsing error

**Error:** Configuration parsing fails for `allowed_file_extensions`

**Solution:**
```bash
# Use proper JSON array syntax
export PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'

# NOT: [.csv,.xlsx] or .csv,.xlsx
```

### Issue: Encoding errors

**Error:** `UnicodeDecodeError` when reading files

**Solutions:**
1. Try different encoding: `export PANDAS_TOOL_ENCODING="latin-1"`
2. Check file encoding: `file -i filename.csv`
3. Use UTF-8 with BOM: `export PANDAS_TOOL_ENCODING="utf-8-sig"`

### Issue: Performance issues

**Causes:** Large datasets, inefficient chunk sizes

**Solutions:**
1. Increase chunk size: `export PANDAS_TOOL_CHUNK_SIZE=50000`
2. Increase CSV threshold: `export PANDAS_TOOL_MAX_CSV_SIZE=5000000`
3. Use appropriate data types
4. Consider data preprocessing

### Issue: Column not found

**Error:** `Columns not found: ['column_name']`

**Solutions:**
1. Check column names (case-sensitive)
2. Use `head()` operation to inspect data structure
3. Verify data format and headers

## Best Practices

### Performance Optimization

1. **Tune chunk sizes** - Match `chunk_size` to available memory
2. **Set appropriate thresholds** - Use `max_csv_size` to trigger chunking
3. **Use efficient data types** - Convert to appropriate types early
4. **Filter early** - Apply filters before expensive operations
5. **Cache results** - Leverage BaseTool's built-in caching

### Memory Management

1. **Monitor memory usage** - Watch for memory spikes
2. **Use chunked processing** - For files larger than `max_csv_size`
3. **Process in batches** - For multiple large files
4. **Clean up DataFrames** - Delete large objects when done
5. **Use appropriate dtypes** - Avoid object dtypes when possible

### Data Quality

1. **Validate input data** - Check for missing values and outliers
2. **Handle encoding properly** - Use correct encoding for your data
3. **Sanitize data** - Clean data before processing
4. **Use consistent delimiters** - Match delimiter to data format
5. **Handle missing values** - Use `fill_na()` or `dropna()` appropriately

### Security

1. **Limit file extensions** - Only allow necessary formats
2. **Validate file paths** - Prevent directory traversal
3. **Sanitize queries** - Validate filter conditions
4. **Monitor resource usage** - Prevent DoS attacks
5. **Use appropriate permissions** - Limit file system access

### Development vs Production

**Development:**
```bash
PANDAS_TOOL_CHUNK_SIZE=1000
PANDAS_TOOL_MAX_CSV_SIZE=100000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx",".json"]'
```

**Production:**
```bash
PANDAS_TOOL_CHUNK_SIZE=50000
PANDAS_TOOL_MAX_CSV_SIZE=5000000
PANDAS_TOOL_ALLOWED_FILE_EXTENSIONS='[".csv",".xlsx"]'
```

### Error Handling

Always wrap pandas operations in try-except blocks:

```python
from aiecs.tools.task_tools.pandas_tool import (
    PandasTool, 
    DataFrameError, 
    InputValidationError,
    ValidationError
)

pandas_tool = PandasTool()

try:
    result = pandas_tool.read_csv(csv_data)
except DataFrameError as e:
    print(f"DataFrame operation failed: {e}")
except InputValidationError as e:
    print(f"Input validation failed: {e}")
except ValidationError as e:
    print(f"Validation failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Related Documentation

- Tool implementation details in the source code
- Pandas documentation: https://pandas.pydata.org/docs/
- NumPy documentation: https://numpy.org/doc/
- Main aiecs documentation for architecture overview

## Support

For issues or questions about Pandas Tool configuration:
- Check the tool source code for implementation details
- Review pandas documentation for specific functionality
- Consult the main aiecs documentation for architecture overview
- Test with small datasets first to isolate configuration vs. data issues
- Monitor memory and performance metrics during processing