Office Tool Configuration Guide

Overview

The Office Tool reads, writes, and extracts text from Microsoft Office formats (DOCX, PPTX, XLSX) and PDF files, and can run OCR on common image formats. Configure it with environment variables that use the OFFICE_TOOL_ prefix, or programmatically when initializing the tool.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Office Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
OFFICE_TOOL_MAX_FILE_SIZE_MB=100
OFFICE_TOOL_DEFAULT_FONT=Arial
OFFICE_TOOL_DEFAULT_FONT_SIZE=12
OFFICE_TOOL_ALLOWED_EXTENSIONS=[".docx",".pptx",".xlsx",".pdf",".png",".jpg",".jpeg",".tiff",".bmp",".gif"]

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.task_tools.office_tool import OfficeTool

# The tool will automatically use the environment variables
office_tool = OfficeTool()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.task_tools.office_tool import OfficeTool
office_tool = OfficeTool()

Example .env.production:

# Production settings - strict limits for security
OFFICE_TOOL_MAX_FILE_SIZE_MB=50
OFFICE_TOOL_DEFAULT_FONT=Arial
OFFICE_TOOL_DEFAULT_FONT_SIZE=11
OFFICE_TOOL_ALLOWED_EXTENSIONS=[".docx",".pptx",".xlsx",".pdf"]

Example .env.development:

# Development settings - relaxed limits for testing
OFFICE_TOOL_MAX_FILE_SIZE_MB=200
OFFICE_TOOL_DEFAULT_FONT=Calibri
OFFICE_TOOL_DEFAULT_FONT_SIZE=12
OFFICE_TOOL_ALLOWED_EXTENSIONS=[".docx",".pptx",".xlsx",".pdf",".png",".jpg",".jpeg",".tiff",".bmp",".gif"]

Best Practices for .env Files

  1. Never commit .env files to version control - Add .env to your .gitignore:

    # .gitignore
    .env
    .env.local
    .env.*.local
    .env.production
    .env.staging
    
  2. Provide a template - Create .env.example with documented dummy values:

    # .env.example
    # Office Tool Configuration
    
    # Maximum file size in megabytes
    OFFICE_TOOL_MAX_FILE_SIZE_MB=100
    
    # Default font for documents
    OFFICE_TOOL_DEFAULT_FONT=Arial
    
    # Default font size in points
    OFFICE_TOOL_DEFAULT_FONT_SIZE=12
    
    # Allowed document file extensions (JSON array)
    OFFICE_TOOL_ALLOWED_EXTENSIONS=[".docx",".pptx",".xlsx",".pdf",".png",".jpg",".jpeg",".tiff",".bmp",".gif"]
    
  3. Document your variables - Add comments explaining each setting

  4. Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports

  5. Format complex types correctly:

    • Integers: plain numbers, e.g. 100, 12

    • Strings: plain text, e.g. Arial, Calibri

    • Lists: JSON array format with double quotes, e.g. [".docx",".pdf"]

Configuration Options

1. Max File Size (MB)

Environment Variable: OFFICE_TOOL_MAX_FILE_SIZE_MB

Type: Integer

Default: 100

Description: Maximum allowed file size in megabytes. Files larger than this limit will be rejected during validation for security and performance reasons.

Common Values:

  • 10 - Conservative limit for public APIs

  • 50 - Moderate limit for web applications

  • 100 - Default (balanced)

  • 200 - Generous limit for internal tools

  • 500 - Large files for enterprise applications

Example:

export OFFICE_TOOL_MAX_FILE_SIZE_MB=50

Security Note: Keep this value as low as practical for your use case to prevent memory exhaustion and DoS attacks.

2. Default Font

Environment Variable: OFFICE_TOOL_DEFAULT_FONT

Type: String

Default: "Arial"

Description: Default font to use when creating DOCX documents. This font will be applied to the Normal style of generated documents.

Common Fonts:

  • Arial - Default, widely available

  • Calibri - Modern Microsoft default

  • Times New Roman - Traditional serif font

  • Verdana - Web-friendly sans-serif

  • Helvetica - Classic sans-serif

Example:

export OFFICE_TOOL_DEFAULT_FONT=Calibri

Note: Ensure the specified font is installed on the system where documents will be opened; otherwise, a fallback font will be used.

3. Default Font Size

Environment Variable: OFFICE_TOOL_DEFAULT_FONT_SIZE

Type: Integer

Default: 12

Description: Default font size in points to use when creating DOCX documents. This size will be applied to the Normal style of generated documents.

Common Sizes:

  • 10 - Small, compact text

  • 11 - Common for business documents

  • 12 - Default, standard size

  • 14 - Large, easy to read

  • 16 - Headings or emphasis

Example:

export OFFICE_TOOL_DEFAULT_FONT_SIZE=11

4. Allowed Extensions

Environment Variable: OFFICE_TOOL_ALLOWED_EXTENSIONS

Type: List[str]

Default: ['.docx', '.pptx', '.xlsx', '.pdf', '.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif']

Description: List of allowed file extensions for document processing. This is a critical security feature that prevents processing of unauthorized or potentially malicious file types.

Format: JSON array string with double quotes

Supported Formats:

  • .docx - Microsoft Word documents

  • .pptx - Microsoft PowerPoint presentations

  • .xlsx - Microsoft Excel spreadsheets

  • .pdf - PDF documents

  • .png, .jpg, .jpeg - Image formats (for OCR)

  • .tiff, .bmp, .gif - Additional image formats

Example:

# Strict - Only Office formats
export OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pptx",".xlsx"]'

# Moderate - Office and PDF
export OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pptx",".xlsx",".pdf"]'

# Lenient - All supported formats
export OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pptx",".xlsx",".pdf",".png",".jpg",".jpeg",".tiff",".bmp",".gif"]'

Security Note: Only allow extensions that your application actually needs to process. Images should only be included if OCR functionality is required.
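
The check this setting implies can be sketched in a few lines. The helper below is hypothetical (is_extension_allowed is not part of aiecs), and it assumes a case-insensitive comparison, which the tool itself may or may not perform.

```python
import json
from pathlib import Path


def is_extension_allowed(filename: str, allowed_json: str) -> bool:
    """Hypothetical helper: check a file's suffix against a JSON allow-list."""
    allowed = {ext.lower() for ext in json.loads(allowed_json)}
    # Path.suffix returns only the final extension, e.g. ".gz" for "a.tar.gz"
    return Path(filename).suffix.lower() in allowed


strict = '[".docx",".pptx",".xlsx"]'
print(is_extension_allowed("report.DOCX", strict))  # True (case-insensitive by assumption)
print(is_extension_allowed("scan.png", strict))     # False
```

Running a check like this before handing a file to the tool makes it easy to see which uploads a stricter list would reject.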

Usage Examples

Example 1: Basic Environment Configuration

# Set custom limits and fonts
export OFFICE_TOOL_MAX_FILE_SIZE_MB=50
export OFFICE_TOOL_DEFAULT_FONT=Calibri
export OFFICE_TOOL_DEFAULT_FONT_SIZE=11
export OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pptx",".xlsx",".pdf"]'

# Run your application
python app.py

Example 2: Security-Focused Configuration

# Strict limits for public-facing applications
export OFFICE_TOOL_MAX_FILE_SIZE_MB=20
export OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pdf"]'
export OFFICE_TOOL_DEFAULT_FONT=Arial
export OFFICE_TOOL_DEFAULT_FONT_SIZE=12

Example 3: High-Capacity Configuration

# Optimized for internal high-volume processing
export OFFICE_TOOL_MAX_FILE_SIZE_MB=500
export OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pptx",".xlsx",".pdf",".png",".jpg"]'
export OFFICE_TOOL_DEFAULT_FONT=Calibri
export OFFICE_TOOL_DEFAULT_FONT_SIZE=11

Example 4: Programmatic Configuration

from aiecs.tools.task_tools.office_tool import OfficeTool

# Initialize with custom configuration
office_tool = OfficeTool(config={
    'max_file_size_mb': 75,
    'default_font': 'Calibri',
    'default_font_size': 11,
    'allowed_extensions': ['.docx', '.pptx', '.xlsx', '.pdf']
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

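# Shell: set environment defaults
export OFFICE_TOOL_MAX_FILE_SIZE_MB=100
export OFFICE_TOOL_DEFAULT_FONT=Arial

# Python: override for a specific instance
from aiecs.tools.task_tools.office_tool import OfficeTool

office_tool = OfficeTool(config={
    'max_file_size_mb': 50,  # Overrides the environment value of 100
    'default_font': 'Calibri'  # Overrides the environment value of Arial
})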

Configuration Priority

When the Office Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

  1. Programmatic config - Values passed to the constructor

  2. Environment variables - Values set via OFFICE_TOOL_* variables

  3. Default values - Built-in defaults as specified above
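
The resolution order above can be illustrated with a small stand-alone sketch. It is not the tool's implementation: resolve_config and DEFAULTS are hypothetical names, and the environment is passed in as a plain dict so the example has no side effects.

```python
import json
import os

# Built-in defaults as listed in this guide
DEFAULTS = {
    "max_file_size_mb": 100,
    "default_font": "Arial",
    "default_font_size": 12,
    "allowed_extensions": [".docx", ".pptx", ".xlsx", ".pdf", ".png",
                           ".jpg", ".jpeg", ".tiff", ".bmp", ".gif"],
}


def resolve_config(overrides=None, environ=None, prefix="OFFICE_TOOL_"):
    """Hypothetical resolver: constructor config > environment > defaults."""
    environ = os.environ if environ is None else environ
    resolved = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        raw = environ.get(prefix + key.upper())
        if raw is None:
            continue  # no environment value: keep the default
        if isinstance(default, int):
            resolved[key] = int(raw)
        elif isinstance(default, list):
            resolved[key] = json.loads(raw)
        else:
            resolved[key] = raw
    resolved.update(overrides or {})  # programmatic values win last
    return resolved


env = {"OFFICE_TOOL_MAX_FILE_SIZE_MB": "100", "OFFICE_TOOL_DEFAULT_FONT": "Calibri"}
config = resolve_config({"max_file_size_mb": 50}, env)
print(config["max_file_size_mb"], config["default_font"], config["default_font_size"])
# 50 Calibri 12
```

Here the file size comes from the programmatic override, the font from the environment, and the font size from the built-in default.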

Data Type Parsing

Integer Values

Integers should be provided as numeric strings:

export OFFICE_TOOL_MAX_FILE_SIZE_MB=100
export OFFICE_TOOL_DEFAULT_FONT_SIZE=12

String Values

Strings should be provided as plain text without quotes:

export OFFICE_TOOL_DEFAULT_FONT=Arial

List Values

Lists must be provided as JSON array strings with double quotes:

# Correct
export OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pdf"]'

# Incorrect (will not parse)
export OFFICE_TOOL_ALLOWED_EXTENSIONS=".docx,.pdf"

Important: Use single quotes for the shell, double quotes for JSON:

export OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pptx",".xlsx"]'
#                                      ^                          ^
#                                      Single quotes for shell
#                                         ^      ^      ^
#                                         Double quotes for JSON
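
To see why the JSON form matters, the sketch below parses candidate values with json.loads. This is an assumption about how the value is decoded, and parse_extensions is a hypothetical helper, but it reproduces the accepted and rejected forms described above.

```python
import json


def parse_extensions(raw: str) -> list:
    """Hypothetical parser: accept only a JSON array of strings."""
    value = json.loads(raw)  # raises json.JSONDecodeError on non-JSON input
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("expected a JSON array of strings")
    return value


print(parse_extensions('[".docx",".pdf"]'))

for bad in (".docx,.pdf", "['.docx','.pdf']"):
    try:
        parse_extensions(bad)
    except ValueError as exc:  # JSONDecodeError is a subclass of ValueError
        print(f"rejected {bad!r}: {exc}")
```

Both the comma-separated form and the single-quoted form fail because neither is valid JSON.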

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

  • max_file_size_mb must be a positive integer

  • default_font must be a non-empty string

  • default_font_size must be a positive integer

  • allowed_extensions must be a list of strings
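
The rules above can be mirrored in plain Python to check values before they reach the tool. The sketch uses a standard-library dataclass rather than Pydantic so that it has no dependencies; OfficeConfigSketch is a hypothetical stand-in, not the tool's own settings class.

```python
from dataclasses import dataclass, field


@dataclass
class OfficeConfigSketch:
    """Hypothetical stand-in for the tool's settings model (not the aiecs class)."""
    max_file_size_mb: int = 100
    default_font: str = "Arial"
    default_font_size: int = 12
    allowed_extensions: list = field(default_factory=lambda: [".docx", ".pdf"])

    def __post_init__(self):
        for name in ("max_file_size_mb", "default_font_size"):
            value = getattr(self, name)
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer")
        if not isinstance(self.default_font, str) or not self.default_font.strip():
            raise ValueError("default_font must be a non-empty string")
        exts = self.allowed_extensions
        if not isinstance(exts, list) or not all(isinstance(e, str) for e in exts):
            raise ValueError("allowed_extensions must be a list of strings")


OfficeConfigSketch(max_file_size_mb=50)      # accepted
try:
    OfficeConfigSketch(default_font_size=0)  # rejected
except ValueError as exc:
    print(exc)  # default_font_size must be a positive integer
```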

File Validation

When processing documents, the tool validates:

  1. File existence - File must exist at the specified path

  2. File extension - Must be in allowed_extensions list

  3. File size - Must not exceed max_file_size_mb limit

  4. Path traversal - Prevents directory traversal attacks (../, ~, %)

  5. Allowed directories - Files must be in allowed locations (cwd, /tmp, ./data, ./uploads)

  6. Document structure - Validates document integrity before processing
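
Several of these checks can be approximated in a pre-flight function, which is useful for rejecting bad uploads before the tool is called. The sketch below covers existence, traversal markers, allowed directories, and size; check_file is a hypothetical helper, and the tool's actual rules may differ in detail.

```python
import tempfile
from pathlib import Path


def check_file(path, max_mb, allowed_dirs):
    """Hypothetical pre-flight check; returns the resolved path or raises ValueError."""
    raw = str(path)
    # Reject the traversal markers named in the list above
    if ".." in Path(raw).parts or raw.startswith("~") or "%" in raw:
        raise ValueError("suspicious path")
    resolved = Path(raw).resolve()
    if not resolved.is_file():
        raise ValueError("file does not exist")
    # The file must sit somewhere below one of the allowed directories
    roots = [Path(d).resolve() for d in allowed_dirs]
    if not any(root in resolved.parents for root in roots):
        raise ValueError("path not in allowed directories")
    size_mb = resolved.stat().st_size / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(f"file too large: {size_mb:.1f}MB, max {max_mb}MB")
    return resolved


with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "sample.docx"
    sample.write_bytes(b"x" * 1024)
    print(check_file(sample, max_mb=1, allowed_dirs=[workdir]).name)  # sample.docx
```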

Security Validation

The tool includes multiple security layers:

  • Extension whitelist prevents processing unauthorized file types

  • File size limits prevent memory exhaustion

  • Path validation prevents directory traversal attacks

  • Content sanitization removes control characters and enforces limits

  • Directory restrictions limit file access to safe locations
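
As an illustration of the content sanitization layer, the sketch below strips control characters and caps the text length. The function name, the choice to keep tabs and line breaks, and the 10,000-character cap are all assumptions made for the example, not the tool's actual values.

```python
import re

# Control characters except tab (\x09), newline (\x0a), and carriage return (\x0d)
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize_text(text: str, max_length: int = 10_000) -> str:
    """Hypothetical sanitizer: drop control characters, then cap the length."""
    return _CONTROL_CHARS.sub("", text)[:max_length]


print(repr(sanitize_text("Hello\x00 World\x07\nSecond line")))
# 'Hello World\nSecond line'
```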

Dependencies Setup

The Office Tool requires several external dependencies for full functionality:

Required Python Packages

pip install pandas pdfplumber pytesseract python-docx python-pptx pillow tika

Apache Tika Setup

Tika is used as a fallback for text extraction from various formats:

Requirements:

  • Java 11 or higher (Java 8 is no longer supported in Tika 3.x)

Automatic (recommended):

# Tika will download automatically on first use
# Downloads Java server to ~/.tika-server.jar

Manual:

# Download the Tika server JAR manually (version 3.2.2+ recommended for security fixes)
# Since Tika 2.x the runnable server is published as tika-server-standard
wget https://repo1.maven.org/maven2/org/apache/tika/tika-server-standard/3.2.2/tika-server-standard-3.2.2.jar
export TIKA_SERVER_JAR=/path/to/tika-server-standard-3.2.2.jar

Tesseract OCR Setup

Required for extracting text from images:

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-chi-sim

macOS:

brew install tesseract
brew install tesseract-lang  # For additional languages

Windows: Download the installer from https://github.com/UB-Mannheim/tesseract/wiki

Verify installation:

tesseract --version

Language Data

For multi-language OCR support:

# English (usually included)
sudo apt-get install tesseract-ocr-eng

# Chinese Simplified
sudo apt-get install tesseract-ocr-chi-sim

# Chinese Traditional
sudo apt-get install tesseract-ocr-chi-tra

# Spanish
sudo apt-get install tesseract-ocr-spa

# French
sudo apt-get install tesseract-ocr-fra

Java Runtime

Required for Apache Tika:

# Ubuntu/Debian
sudo apt-get install openjdk-11-jre

# macOS
brew install openjdk@11

# Verify
java -version
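
The command-line checks in this section can also be scripted, for example in a startup or CI step. The sketch below only tests whether the executables are on PATH by using shutil.which; it does not verify versions, and find_missing_binaries is a hypothetical helper name.

```python
import shutil


def find_missing_binaries(names=("tesseract", "java")):
    """Return the executables from `names` that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = find_missing_binaries()
if missing:
    print("Missing executables:", ", ".join(missing))
else:
    print("tesseract and java are both on PATH")
```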

Operations Supported

The Office Tool supports the following operations:

1. Read DOCX

Read content and optionally tables from Word documents.

content = office_tool.read_docx('document.docx', include_tables=True)
# Returns: {'paragraphs': [...], 'tables': [...]}

2. Write DOCX

Create Word documents with text and optional tables.

result = office_tool.write_docx(
    text="Hello World\nSecond line",
    output_path='output.docx',
    table_data=[['Header1', 'Header2'], ['Row1Col1', 'Row1Col2']]
)
# Returns: {'success': True, 'file_path': 'output.docx'}

3. Read PPTX

Extract text content from PowerPoint presentations.

slides = office_tool.read_pptx('presentation.pptx')
# Returns: ['Slide 1 text', 'Slide 2 text', ...]

4. Write PPTX

Create PowerPoint presentations with text slides.

result = office_tool.write_pptx(
    slides=['Slide 1 content', 'Slide 2 content'],
    output_path='output.pptx',
    image_path='logo.png'  # Optional image for first slide
)
# Returns: {'success': True, 'file_path': 'output.pptx'}

5. Read XLSX

Read data from Excel spreadsheets.

data = office_tool.read_xlsx('spreadsheet.xlsx', sheet_name='Sheet1')
# Returns: [{'col1': val1, 'col2': val2}, ...]

6. Write XLSX

Create Excel spreadsheets from data.

result = office_tool.write_xlsx(
    data=[{'Name': 'John', 'Age': 30}, {'Name': 'Jane', 'Age': 25}],
    output_path='output.xlsx',
    sheet_name='Data'
)
# Returns: {'success': True, 'file_path': 'output.xlsx'}

7. Extract Text

Universal text extraction from various formats.

text = office_tool.extract_text('document.pdf')
# Works with: .docx, .pptx, .xlsx, .pdf, .png, .jpg, .jpeg, .tiff, .bmp, .gif
# Returns: extracted text as string

Troubleshooting

Issue: File size validation fails

Error: File too large: 150.3MB, max 100MB

Solution:

# Increase max file size limit
export OFFICE_TOOL_MAX_FILE_SIZE_MB=200

Issue: Extension not allowed

Error: Extension '.doc' not allowed

Solution:

# Add the extension only if it is safe and supported.
# The legacy .doc format is NOT supported; convert the file to .docx instead.
export OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pptx",".xlsx",".pdf"]'

Issue: Tika server not starting

Error: Failed to extract text with Tika

Solutions:

  1. Check Java installation: java -version

  2. Clear the cached Tika JAR: rm -f ~/.tika-server.jar

  3. TIKA_LOG_PATH is already set by the tool, so no action is needed

  4. Check internet connection (first run downloads Tika)

Issue: Tesseract not found

Error: Failed to extract image text

Solution:

# Install Tesseract
sudo apt-get install tesseract-ocr  # Ubuntu/Debian
brew install tesseract              # macOS

# Verify installation
tesseract --version

Issue: Font not found in generated documents

Cause: Specified font not installed on the system

Solution:

# Use a standard font available on all systems
export OFFICE_TOOL_DEFAULT_FONT=Arial

# Or install metric-compatible substitutes (Liberation fonts) system-wide
# Ubuntu/Debian
sudo apt-get install fonts-liberation

# macOS - fonts usually pre-installed

Issue: Path not in allowed directories

Error: Path not in allowed directories

Solution: Files must be in one of the allowed locations:

  • Current working directory and subdirectories

  • /tmp directory

  • ./data directory

  • ./uploads directory

Move files to an allowed location or adjust your working directory.

Issue: List parsing error

Error: Configuration parsing fails for allowed_extensions

Solution:

# Use proper JSON array syntax with double quotes
export OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pdf"]'

# NOT: ['.docx','.pdf'] or .docx,.pdf

Issue: Memory issues with large documents

Cause: Large documents consume too much memory

Solutions:

  1. Reduce max_file_size_mb limit

  2. Process documents in chunks

  3. Increase system memory

  4. Use streaming processing for very large files
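
Chunked processing can be as simple as walking the extracted content in fixed-size groups. The sketch below is generic Python, not an aiecs feature: chunked is a hypothetical helper, and the paragraph list stands in for the 'paragraphs' value returned by read_docx. It bounds the memory used by downstream processing, not by the initial read.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in for the 'paragraphs' list that read_docx returns
paragraphs = [f"Paragraph {i}" for i in range(1, 8)]
for chunk in chunked(paragraphs, 3):
    print(len(chunk), chunk[0])
```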

Issue: Corrupted document

Error: Invalid DOCX/PPTX/XLSX structure

Cause: Corrupted or malformed document file

Solutions:

  1. Try opening the file in Microsoft Office/LibreOffice

  2. Repair the document using office software

  3. Re-export/re-save the document

  4. Check if file was properly uploaded/transferred

Best Practices

Security

  1. Minimize allowed extensions - Only allow file types you actually need

  2. Set conservative file size limits - Use the smallest practical value

  3. Validate file content - Tool automatically validates document structure

  4. Sanitize output - Tool sanitizes text to remove control characters

  5. Restrict file paths - Tool enforces directory restrictions

  6. Monitor file operations - Log all document processing activities
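
One way to log document operations is a small decorator around the calls your application makes. The sketch below uses only the standard logging module; the audited decorator, the office_audit logger name, and the read_document placeholder are all hypothetical, and in a real application the wrapped function would call a method such as office_tool.read_docx.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("office_audit")  # hypothetical logger name


def audited(func):
    """Log the start, success, or failure of a document operation."""
    @functools.wraps(func)
    def wrapper(path, *args, **kwargs):
        logger.info("start %s path=%s", func.__name__, path)
        try:
            result = func(path, *args, **kwargs)
        except Exception:
            logger.exception("failed %s path=%s", func.__name__, path)
            raise
        logger.info("done %s path=%s", func.__name__, path)
        return result
    return wrapper


@audited
def read_document(path):
    # Placeholder body; a real application would call office_tool.read_docx(path) here
    return {"paragraphs": ["stub"], "path": path}


read_document("document.docx")
```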

Performance

  1. Set appropriate size limits - Balance between usability and performance

  2. Cache results - Leverage BaseTool’s built-in caching

  3. Process in batches - For multiple documents, use batch processing

  4. Monitor memory usage - Large documents can consume significant memory

  5. Use appropriate fonts - Standard fonts render faster

Document Quality

  1. Use standard fonts - Arial, Calibri, Times New Roman

  2. Appropriate font sizes - 10-12pt for body text, 14-16pt for headings

  3. Sanitize input - Tool automatically sanitizes text

  4. Validate structure - Tool validates before processing

  5. Handle tables carefully - Ensure consistent column counts
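
For the last point, ragged rows can be padded before the data is passed as table_data to write_docx. The helper below is a hypothetical sketch that pads short rows with empty strings; whether padding or rejecting ragged input is the right choice depends on your data.

```python
def normalize_table(rows, fill=""):
    """Pad short rows so every row has as many cells as the widest one."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


table = [["Header1", "Header2", "Header3"], ["Row1Col1"], ["Row2Col1", "Row2Col2"]]
for row in normalize_table(table):
    print(row)
```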

Development vs Production

Development:

OFFICE_TOOL_MAX_FILE_SIZE_MB=200
OFFICE_TOOL_DEFAULT_FONT=Calibri
OFFICE_TOOL_DEFAULT_FONT_SIZE=12
OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pptx",".xlsx",".pdf",".png",".jpg"]'

Production:

OFFICE_TOOL_MAX_FILE_SIZE_MB=50
OFFICE_TOOL_DEFAULT_FONT=Arial
OFFICE_TOOL_DEFAULT_FONT_SIZE=11
OFFICE_TOOL_ALLOWED_EXTENSIONS='[".docx",".pptx",".xlsx",".pdf"]'

Error Handling

Always wrap office operations in try-except blocks:

from aiecs.tools.task_tools.office_tool import (
    OfficeTool, 
    FileOperationError, 
    SecurityError,
    ContentValidationError
)

office_tool = OfficeTool()

try:
    content = office_tool.read_docx('document.docx')
except FileOperationError as e:
    print(f"File operation failed: {e}")
except SecurityError as e:
    print(f"Security validation failed: {e}")
except ContentValidationError as e:
    print(f"Document validation failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Support

For issues or questions about Office Tool configuration:

  • Check the tool source code for implementation details

  • Review library-specific documentation for document formats

  • Consult the main aiecs documentation for architecture overview

  • Test with simple documents first to isolate configuration vs. document issues

  • Check dependency installation and versions