Document Parser Tool Configuration Guide
Overview
The Document Parser Tool provides comprehensive capabilities for parsing various document formats from URLs and files, including PDF, DOCX, XLSX, PPTX, TXT, HTML, RTF, CSV, JSON, XML, Markdown, and images. It supports multiple parsing strategies (text only, structured, full content, metadata only) and output formats (text, JSON, Markdown, HTML). The tool integrates with ScraperTool for URL downloading, OfficeTool for Office document parsing, and ImageTool for image OCR. It also supports cloud storage integration with Google Cloud Storage (GCS). The tool can be configured via environment variables using the DOC_PARSER_ prefix or through programmatic configuration when initializing the tool.
Using .env Files in Your Project
When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Document Parser Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.
Setting Up .env Files
1. Install python-dotenv:
pip install python-dotenv
2. Create a .env file in your project root:
# .env file in your project root
DOC_PARSER_USER_AGENT=DocumentParser/1.0
DOC_PARSER_MAX_FILE_SIZE=52428800
DOC_PARSER_TEMP_DIR=/path/to/temp
DOC_PARSER_DEFAULT_ENCODING=utf-8
DOC_PARSER_TIMEOUT=30
DOC_PARSER_MAX_PAGES=1000
DOC_PARSER_ENABLE_CLOUD_STORAGE=true
DOC_PARSER_GCS_BUCKET_NAME=aiecs-documents
DOC_PARSER_GCS_PROJECT_ID=your-project-id
3. Load the .env file in your application:
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv
# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()
# Now import and use aiecs tools
from aiecs.tools.docs.document_parser_tool import DocumentParserTool
# The tool will automatically use the environment variables
parser_tool = DocumentParserTool()
Multiple Environment Files
You can use different .env files for different environments:
import os
from dotenv import load_dotenv
# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')
if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')
from aiecs.tools.docs.document_parser_tool import DocumentParserTool
parser_tool = DocumentParserTool()
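The branching above can also be written as a lookup table, which keeps the file names in one place. This is a minimal sketch using only the standard library; `ENV_FILES` and `select_env_file` are hypothetical names introduced for illustration, not part of aiecs or python-dotenv.

```python
import os

# Hypothetical mapping from APP_ENV values to .env file names.
ENV_FILES = {
    "production": ".env.production",
    "staging": ".env.staging",
    "development": ".env.development",
}


def select_env_file(app_env=None):
    """Return the .env file name for the given environment name."""
    if app_env is None:
        app_env = os.getenv("APP_ENV", "development")
    # Unknown names fall back to the development file, like the else branch.
    return ENV_FILES.get(app_env, ".env.development")


print(select_env_file("production"))
print(select_env_file("qa"))
```

The returned name can then be passed to load_dotenv() before any aiecs imports.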
Example .env.production:
# Production settings - optimized for performance and cloud storage
DOC_PARSER_USER_AGENT=DocumentParser/2.0
DOC_PARSER_MAX_FILE_SIZE=104857600
DOC_PARSER_TEMP_DIR=/app/temp/parser
DOC_PARSER_DEFAULT_ENCODING=utf-8
DOC_PARSER_TIMEOUT=60
DOC_PARSER_MAX_PAGES=2000
DOC_PARSER_ENABLE_CLOUD_STORAGE=true
DOC_PARSER_GCS_BUCKET_NAME=prod-aiecs-documents
DOC_PARSER_GCS_PROJECT_ID=production-project-id
Example .env.development:
# Development settings - more permissive for testing
DOC_PARSER_USER_AGENT=DocumentParser/Dev/1.0
DOC_PARSER_MAX_FILE_SIZE=10485760
DOC_PARSER_TEMP_DIR=./temp/parser
DOC_PARSER_DEFAULT_ENCODING=utf-8
DOC_PARSER_TIMEOUT=15
DOC_PARSER_MAX_PAGES=100
DOC_PARSER_ENABLE_CLOUD_STORAGE=false
DOC_PARSER_GCS_BUCKET_NAME=dev-aiecs-documents
DOC_PARSER_GCS_PROJECT_ID=development-project-id
Best Practices for .env Files
Never commit .env files to version control - Add .env to your .gitignore:
# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging
Provide a template - Create .env.example with documented dummy values:
# .env.example
# Document Parser Tool Configuration
# User agent for HTTP requests
DOC_PARSER_USER_AGENT=DocumentParser/1.0
# Maximum file size in bytes (50MB)
DOC_PARSER_MAX_FILE_SIZE=52428800
# Temporary directory for document processing
DOC_PARSER_TEMP_DIR=/path/to/temp
# Default encoding for text files
DOC_PARSER_DEFAULT_ENCODING=utf-8
# Timeout for HTTP requests in seconds
DOC_PARSER_TIMEOUT=30
# Maximum number of pages to process
DOC_PARSER_MAX_PAGES=1000
# Whether to enable cloud storage integration
DOC_PARSER_ENABLE_CLOUD_STORAGE=true
# Google Cloud Storage bucket name
DOC_PARSER_GCS_BUCKET_NAME=aiecs-documents
# Google Cloud Storage project ID (optional)
DOC_PARSER_GCS_PROJECT_ID=your-project-id
Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format values correctly:
Strings: Plain text: utf-8, /path/to/dir
Integers: Plain numbers: 52428800, 30
Booleans: true or false
Configuration Options
1. User Agent
Environment Variable: DOC_PARSER_USER_AGENT
Type: String
Default: "DocumentParser/1.0"
Description: User agent string used for HTTP requests when downloading documents from URLs. This helps identify the tool to web servers and may be required by some sites.
Example:
export DOC_PARSER_USER_AGENT="DocumentParser/2.0"
Common Values:
DocumentParser/1.0 - Default identifier
DocumentParser/2.0 - Updated version
CustomParser/1.0 - Custom identifier
Mozilla/5.0 (compatible; DocumentParser/1.0) - Browser-like identifier
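To show where a user agent string ends up, here is a minimal sketch that attaches one to an HTTP request using the standard library's urllib. The Document Parser Tool downloads through ScraperTool, so this is an illustration of the header only, not the tool's own code; `build_request` is a hypothetical helper, and no network call is made.

```python
import urllib.request


def build_request(url, user_agent="DocumentParser/1.0"):
    """Create a request object that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/document.pdf", "DocumentParser/2.0")
# urllib stores header names in capitalized form ("User-agent").
print(req.get_header("User-agent"))
```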
2. Max File Size
Environment Variable: DOC_PARSER_MAX_FILE_SIZE
Type: Integer
Default: 50 * 1024 * 1024 (50MB)
Description: Maximum file size in bytes for document processing. Files larger than this will be rejected to prevent memory issues and processing timeouts.
Common Values:
10 * 1024 * 1024 - 10MB (small files)
50 * 1024 * 1024 - 50MB (default)
100 * 1024 * 1024 - 100MB (large files)
500 * 1024 * 1024 - 500MB (very large files)
Example:
export DOC_PARSER_MAX_FILE_SIZE=104857600
Memory Note: Larger values allow bigger files but use more memory during processing.
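The environment variable takes the value in bytes, so the products listed above must be multiplied out before they are exported. A small sketch of that arithmetic follows; `megabytes_to_bytes` is a hypothetical helper name.

```python
def megabytes_to_bytes(megabytes):
    """Convert a size in MB (1 MB = 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


for size_mb in (10, 50, 100, 500):
    print(f"{size_mb}MB -> DOC_PARSER_MAX_FILE_SIZE={megabytes_to_bytes(size_mb)}")
```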
3. Temp Directory
Environment Variable: DOC_PARSER_TEMP_DIR
Type: String
Default: os.path.join(tempfile.gettempdir(), 'document_parser')
Description: Temporary directory used for document processing operations. This directory stores downloaded files, intermediate processing results, and temporary artifacts.
Example:
export DOC_PARSER_TEMP_DIR="/app/temp/parser"
Security Note: Ensure the directory has appropriate permissions and is not accessible via web servers.
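One way to follow the security note is to create the directory ahead of time with owner-only permissions. The sketch below assumes a POSIX system (permission bits behave differently on Windows); `ensure_private_dir` is a hypothetical helper, and the path is built under a throwaway temporary directory for the demonstration.

```python
import os
import tempfile


def ensure_private_dir(path):
    """Create `path` if needed and restrict it to the owning user."""
    os.makedirs(path, exist_ok=True)
    # 0o700: read, write, and traverse for the owner only.
    os.chmod(path, 0o700)
    return path


base = tempfile.mkdtemp()
temp_dir = ensure_private_dir(os.path.join(base, "document_parser"))
print(temp_dir, os.access(temp_dir, os.W_OK))
```

The resulting path can then be exported as DOC_PARSER_TEMP_DIR.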
4. Default Encoding
Environment Variable: DOC_PARSER_DEFAULT_ENCODING
Type: String
Default: "utf-8"
Description: Default text encoding for processing text files. This encoding is used when the file encoding cannot be automatically detected.
Supported Encodings:
utf-8 - UTF-8 encoding (default, most common)
utf-16 - UTF-16 encoding
ascii - ASCII encoding
latin-1 - Latin-1 encoding
cp1252 - Windows-1252 encoding
iso-8859-1 - ISO-8859-1 encoding
Example:
export DOC_PARSER_DEFAULT_ENCODING=utf-8
Encoding Note: UTF-8 is recommended for international text support.
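The role of a default encoding can be illustrated with a decode-and-fall-back routine. This is a sketch under stated assumptions, not the tool's detection logic: `decode_with_fallback` is a hypothetical helper, and the choice of latin-1 as the fallback is an assumption (it accepts any byte sequence, so it never raises).

```python
import codecs


def decode_with_fallback(data, default_encoding="utf-8", fallback="latin-1"):
    """Decode bytes with the default encoding, then with the fallback."""
    # Raises LookupError early if the encoding name is not known to Python.
    codecs.lookup(default_encoding)
    try:
        return data.decode(default_encoding)
    except UnicodeDecodeError:
        return data.decode(fallback)


print(decode_with_fallback("café".encode("utf-8")))
print(decode_with_fallback("café".encode("latin-1")))
```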
5. Timeout
Environment Variable: DOC_PARSER_TIMEOUT
Type: Integer
Default: 30
Description: Timeout in seconds for HTTP requests when downloading documents from URLs. This prevents hanging requests and improves reliability.
Common Values:
15 - 15 seconds (fast connections)
30 - 30 seconds (default)
60 - 60 seconds (slow connections)
120 - 120 seconds (very slow connections)
Example:
export DOC_PARSER_TIMEOUT=60
Network Note: Increase timeout for slower networks or large files.
6. Max Pages
Environment Variable: DOC_PARSER_MAX_PAGES
Type: Integer
Default: 1000
Description: Maximum number of pages to process for large documents (especially PDFs). This prevents excessive processing time and memory usage.
Common Values:
100 - 100 pages (small documents)
1000 - 1000 pages (default)
2000 - 2000 pages (large documents)
5000 - 5000 pages (very large documents)
Example:
export DOC_PARSER_MAX_PAGES=2000
Performance Note: Higher values allow larger documents but increase processing time.
7. Enable Cloud Storage
Environment Variable: DOC_PARSER_ENABLE_CLOUD_STORAGE
Type: Boolean
Default: True
Description: Whether to enable cloud storage integration for document retrieval and caching. When enabled, the tool can store and retrieve documents from Google Cloud Storage.
Values:
true - Enable cloud storage (default)
false - Disable cloud storage
Example:
export DOC_PARSER_ENABLE_CLOUD_STORAGE=true
Cloud Note: Requires proper GCS configuration and credentials.
8. GCS Bucket Name
Environment Variable: DOC_PARSER_GCS_BUCKET_NAME
Type: String
Default: "aiecs-documents"
Description: Google Cloud Storage bucket name for storing and retrieving documents. This bucket is used for document caching and cloud-based processing.
Example:
export DOC_PARSER_GCS_BUCKET_NAME="my-document-bucket"
Bucket Requirements:
Bucket must exist and be accessible
Proper permissions must be configured
Bucket name must be globally unique
9. GCS Project ID
Environment Variable: DOC_PARSER_GCS_PROJECT_ID
Type: Optional[String]
Default: None
Description: Google Cloud Storage project ID for authentication and billing. This is optional if using default project credentials.
Example:
export DOC_PARSER_GCS_PROJECT_ID="my-gcp-project"
Authentication Note: Can be omitted when using default project credentials or a service account.
Usage Examples
Example 1: Basic Environment Configuration
# Set basic parsing parameters
export DOC_PARSER_USER_AGENT="MyParser/1.0"
export DOC_PARSER_MAX_FILE_SIZE=104857600
export DOC_PARSER_TEMP_DIR="/app/temp/parser"
export DOC_PARSER_TIMEOUT=60
# Run your application
python app.py
Example 2: Cloud Storage Configuration
# Enable cloud storage with GCS
export DOC_PARSER_ENABLE_CLOUD_STORAGE=true
export DOC_PARSER_GCS_BUCKET_NAME="my-document-bucket"
export DOC_PARSER_GCS_PROJECT_ID="my-gcp-project"
export DOC_PARSER_MAX_FILE_SIZE=209715200
export DOC_PARSER_MAX_PAGES=2000
Example 3: Development Configuration
# Development-friendly settings
export DOC_PARSER_USER_AGENT="DevParser/1.0"
export DOC_PARSER_MAX_FILE_SIZE=10485760
export DOC_PARSER_TEMP_DIR="./temp/parser"
export DOC_PARSER_TIMEOUT=15
export DOC_PARSER_MAX_PAGES=100
export DOC_PARSER_ENABLE_CLOUD_STORAGE=false
Example 4: Programmatic Configuration
from aiecs.tools.docs.document_parser_tool import DocumentParserTool
# Initialize with custom configuration
parser_tool = DocumentParserTool(config={
    'user_agent': 'MyParser/2.0',
    'max_file_size': 104857600,
    'temp_dir': '/app/temp/parser',
    'default_encoding': 'utf-8',
    'timeout': 60,
    'max_pages': 2000,
    'enable_cloud_storage': True,
    'gcs_bucket_name': 'my-document-bucket',
    'gcs_project_id': 'my-gcp-project'
})
Example 5: Mixed Configuration
Environment variables are used as defaults, but can be overridden programmatically:
# Shell: set environment defaults
export DOC_PARSER_MAX_FILE_SIZE=52428800
export DOC_PARSER_ENABLE_CLOUD_STORAGE=true
# Python: override for a specific instance
from aiecs.tools.docs.document_parser_tool import DocumentParserTool
parser_tool = DocumentParserTool(config={
    'max_file_size': 104857600,  # This overrides the environment variable
    'enable_cloud_storage': False  # This overrides the environment variable
})
Configuration Priority
When the Document Parser Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):
1. Programmatic config - Values passed to the constructor
2. Environment variables - Values set via DOC_PARSER_* variables
3. Default values - Built-in defaults as specified above
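The resolution order can be sketched in a few lines. This is an illustration of the precedence rule only, not the tool's implementation (which relies on Pydantic settings); `DEFAULTS` and `resolve` are hypothetical names, and only two integer settings are modelled.

```python
import os

# Built-in defaults for two of the settings documented above.
DEFAULTS = {"max_file_size": 50 * 1024 * 1024, "timeout": 30}


def resolve(name, config=None, environ=None):
    """Resolve one integer setting: config dict, then environment, then default."""
    config = config or {}
    environ = os.environ if environ is None else environ
    if name in config:  # 1. programmatic config
        return config[name]
    env_key = "DOC_PARSER_" + name.upper()
    if env_key in environ:  # 2. environment variable
        return int(environ[env_key])
    return DEFAULTS[name]  # 3. built-in default


fake_env = {"DOC_PARSER_TIMEOUT": "60"}
print(resolve("timeout", {"timeout": 15}, fake_env))
print(resolve("timeout", {}, fake_env))
print(resolve("timeout", {}, {}))
```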
Data Type Parsing
String Values
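Strings are provided as plain text. In a shell, quotes are only needed when the value contains spaces or special characters, and the shell removes them before the value reaches the tool: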
export DOC_PARSER_USER_AGENT=DocumentParser/1.0
export DOC_PARSER_TEMP_DIR=/path/to/temp
Integer Values
Integers should be provided as numeric strings:
export DOC_PARSER_MAX_FILE_SIZE=52428800
export DOC_PARSER_TIMEOUT=30
Boolean Values
Booleans should be provided as lowercase strings:
export DOC_PARSER_ENABLE_CLOUD_STORAGE=true
Optional Values
Optional values can be omitted or set to an empty string:
# Omit optional value
# DOC_PARSER_GCS_PROJECT_ID not set
# Or set to empty string
export DOC_PARSER_GCS_PROJECT_ID=""
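Environment variables always arrive as strings, so each type described above implies a conversion step. The following sketch shows one plausible set of conversions; in the tool itself this work is done by Pydantic, and the helper names here are hypothetical. Treating only true and false as valid booleans is an assumption taken from the examples above.

```python
def parse_env_int(raw):
    """Convert a numeric string such as '52428800' to an int."""
    return int(raw.strip())


def parse_env_bool(raw):
    """Convert 'true' or 'false' (any letter case) to a bool."""
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {raw!r}")


def parse_env_optional(raw):
    """Treat a missing or empty value as None, otherwise return the text."""
    if raw is None or raw.strip() == "":
        return None
    return raw.strip()


env = {"DOC_PARSER_TIMEOUT": "30", "DOC_PARSER_ENABLE_CLOUD_STORAGE": "true"}
print(parse_env_int(env["DOC_PARSER_TIMEOUT"]))
print(parse_env_bool(env["DOC_PARSER_ENABLE_CLOUD_STORAGE"]))
print(parse_env_optional(env.get("DOC_PARSER_GCS_PROJECT_ID")))
```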
Validation
Automatic Type Validation
Pydantic’s BaseSettings automatically validates configuration values:
user_agent must be a non-empty string
max_file_size must be a positive integer
temp_dir must be a non-empty string
default_encoding must be a valid encoding string
timeout must be a positive integer
max_pages must be a positive integer
enable_cloud_storage must be a boolean
gcs_bucket_name must be a non-empty string
gcs_project_id must be a string or None
Runtime Validation
When processing documents, the tool validates:
Directory accessibility - Temp directory must be writable
File size limits - Files must not exceed max_file_size
Network connectivity - URLs must be accessible within timeout
Cloud storage - GCS bucket must be accessible if enabled
Document format - Document type must be supported
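Two of the runtime checks listed above, directory accessibility and the file size limit, can be approximated with the standard library. This is a sketch of the idea rather than the tool's own validation code; both function names are hypothetical.

```python
import os
import tempfile


def check_temp_dir_writable(path):
    """Return True if `path` is an existing directory the process can write to."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


def check_file_size(path, max_file_size):
    """Return True if the file at `path` does not exceed `max_file_size` bytes."""
    return os.path.getsize(path) <= max_file_size


with tempfile.TemporaryDirectory() as temp_dir:
    sample = os.path.join(temp_dir, "sample.bin")
    with open(sample, "wb") as handle:
        handle.write(b"x" * 2048)  # a 2 KB sample file
    print(check_temp_dir_writable(temp_dir))
    print(check_file_size(sample, 1024))
    print(check_file_size(sample, 4096))
```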
Document Types
The Document Parser Tool supports various document types:
Office Documents
PDF - Portable Document Format
DOCX - Microsoft Word documents
XLSX - Microsoft Excel spreadsheets
PPTX - Microsoft PowerPoint presentations
Text Documents
TXT - Plain text files
HTML - HyperText Markup Language
RTF - Rich Text Format
Markdown - Markdown format
Data Documents
CSV - Comma-Separated Values
JSON - JavaScript Object Notation
XML - Extensible Markup Language
Media Documents
Image - Various image formats (PNG, JPG, etc.)
Unknown Documents
Unknown - Unrecognized document types
Parsing Strategies
Text Only
Purpose - Extract plain text content only
Use Cases - Text analysis, content indexing
Output - Clean text without formatting
Structured
Purpose - Extract structured content with metadata
Use Cases - Data extraction, content organization
Output - Structured data with headings, lists, tables
Full Content
Purpose - Extract all content including formatting
Use Cases - Complete document analysis, content preservation
Output - Rich content with formatting and structure
Metadata Only
Purpose - Extract document metadata only
Use Cases - Document indexing, cataloging
Output - Document properties and metadata
Output Formats
Text
Format - Plain text output
Use Cases - Simple text processing, analysis
Features - Clean, readable text
JSON
Format - Structured JSON output
Use Cases - API integration, data processing
Features - Structured data with metadata
Markdown
Format - Markdown formatted output
Use Cases - Documentation, web content
Features - Preserves formatting and structure
HTML
Format - HTML formatted output
Use Cases - Web display, rich content
Features - Rich formatting and styling
Cloud Storage
Google Cloud Storage Integration
The Document Parser Tool supports Google Cloud Storage for:
Document Caching - Store frequently accessed documents
Large File Processing - Process files too large for local storage
Distributed Processing - Share documents across multiple instances
Backup and Recovery - Backup processed documents
GCS Configuration
Required Setup:
Create a GCS bucket
Configure authentication (service account or default credentials)
Set appropriate permissions
Configure the tool with bucket name and project ID
Authentication Methods:
Service Account Key
Default Application Credentials
Workload Identity
User Account Credentials
Cloud Storage Benefits
Scalability - Handle large volumes of documents
Reliability - High availability and durability
Performance - Fast access to cached documents
Cost Efficiency - Pay only for storage used
Operations Supported
The Document Parser Tool supports comprehensive document parsing operations:
Document Detection
detect_document_type - Auto-detect document type from URL or file
validate_document - Validate document format and accessibility
get_document_info - Get document metadata and properties
Document Download
download_document - Download document from URL
download_with_retry - Download with retry logic
validate_download - Validate downloaded document
Document Parsing
parse_document - Parse document with specified strategy
parse_text_only - Extract text content only
parse_structured - Extract structured content
parse_full_content - Extract all content with formatting
parse_metadata_only - Extract metadata only
Content Processing
extract_text - Extract plain text from document
extract_tables - Extract table data
extract_images - Extract images and media
extract_metadata - Extract document metadata
chunk_content - Split content into manageable chunks
Output Generation
generate_text_output - Generate plain text output
generate_json_output - Generate JSON output
generate_markdown_output - Generate Markdown output
generate_html_output - Generate HTML output
Cloud Storage Operations
store_document - Store document in cloud storage
retrieve_document - Retrieve document from cloud storage
cache_document - Cache document for faster access
cleanup_cache - Clean up cached documents
Batch Operations
batch_parse - Parse multiple documents
batch_download - Download multiple documents
batch_extract - Extract content from multiple documents
batch_convert - Convert multiple documents to different formats
Troubleshooting
Issue: Directory not accessible
Error: PermissionError when accessing temp directory
Solutions:
# Set accessible directory
export DOC_PARSER_TEMP_DIR="/accessible/temp/path"
# Or create directory with proper permissions
mkdir -p /path/to/directory
chmod 755 /path/to/directory
Issue: File too large
Error: DocumentParserError for files exceeding size limit
Solutions:
# Increase file size limit
export DOC_PARSER_MAX_FILE_SIZE=104857600
# Or use cloud storage for large files
export DOC_PARSER_ENABLE_CLOUD_STORAGE=true
Issue: Download timeout
Error: DownloadError for slow downloads
Solutions:
# Increase timeout
export DOC_PARSER_TIMEOUT=60
# Or check network connectivity
ping example.com
Issue: Parsing fails
Error: ParseError during document parsing
Solutions:
Check document format support
Verify document is not corrupted
Try a different parsing strategy
Check file encoding
Issue: Cloud storage not working
Error: GCS integration fails
Solutions:
Verify GCS credentials
Check bucket permissions
Ensure bucket exists
Verify project ID
# Disable cloud storage if not needed
export DOC_PARSER_ENABLE_CLOUD_STORAGE=false
Issue: Encoding errors
Error: Text encoding issues
Solutions:
# Set appropriate encoding
export DOC_PARSER_DEFAULT_ENCODING=utf-8
# Or try different encoding
export DOC_PARSER_DEFAULT_ENCODING=latin-1
Issue: Memory errors with large documents
Error: MemoryError during processing
Solutions:
# Reduce max pages
export DOC_PARSER_MAX_PAGES=500
# Or reduce file size limit
export DOC_PARSER_MAX_FILE_SIZE=26214400
Best Practices
Performance Optimization
File Size Management - Set appropriate file size limits
Timeout Configuration - Configure timeouts based on network speed
Cloud Storage Usage - Use cloud storage for large files
Caching Strategy - Implement document caching
Batch Processing - Use batch operations for multiple documents
Error Handling
Graceful Degradation - Handle parsing failures gracefully
Retry Logic - Implement retry for network operations
Fallback Strategies - Provide fallback parsing methods
Error Logging - Log errors for debugging
User Feedback - Provide clear error messages
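The retry recommendation can be sketched as a small wrapper with exponential backoff. This is a generic illustration, not the tool's download_with_retry operation; the `retry` helper, its parameters, and the `flaky_download` stand-in are all hypothetical.

```python
import time


def retry(operation, attempts=3, base_delay=0.5, retry_on=(OSError,), sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted.

    The wait doubles after each failure: base_delay, 2 * base_delay, ...
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_download():
    """Stand-in for a network call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return b"document bytes"


print(retry(flaky_download, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the wrapper easy to test, since a test can record the delays instead of waiting.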
Security
File Validation - Validate files before processing
Size Limits - Enforce file size limits
Access Control - Control access to temp directories
Cloud Security - Secure cloud storage access
Input Sanitization - Sanitize user inputs
Resource Management
Memory Usage - Monitor memory consumption
Disk Space - Manage temp directory space
Network Usage - Optimize network requests
Processing Time - Set reasonable processing limits
Cleanup - Regular cleanup of temp files
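Regular cleanup of the temp directory can be as simple as deleting files past a certain age. The sketch below is one way to do that with the standard library; `remove_old_files` is a hypothetical helper, and the one-hour threshold is an arbitrary value chosen for the demonstration.

```python
import os
import tempfile
import time


def remove_old_files(directory, max_age_seconds, now=None):
    """Delete regular files in `directory` last modified more than `max_age_seconds` ago."""
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return sorted(removed)


with tempfile.TemporaryDirectory() as temp_dir:
    old_path = os.path.join(temp_dir, "old.tmp")
    new_path = os.path.join(temp_dir, "new.tmp")
    for path in (old_path, new_path):
        with open(path, "w") as handle:
            handle.write("data")
    # Backdate one file by two hours so it exceeds the one-hour threshold.
    two_hours_ago = time.time() - 7200
    os.utime(old_path, (two_hours_ago, two_hours_ago))
    print(remove_old_files(temp_dir, max_age_seconds=3600))
```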
Integration
Tool Dependencies - Ensure required tools are available
API Compatibility - Maintain API compatibility
Error Propagation - Properly propagate errors
Logging Integration - Integrate with logging systems
Monitoring - Monitor tool performance
Development vs Production
Development:
DOC_PARSER_USER_AGENT=DevParser/1.0
DOC_PARSER_MAX_FILE_SIZE=10485760
DOC_PARSER_TEMP_DIR=./temp/parser
DOC_PARSER_TIMEOUT=15
DOC_PARSER_MAX_PAGES=100
DOC_PARSER_ENABLE_CLOUD_STORAGE=false
Production:
DOC_PARSER_USER_AGENT=DocumentParser/2.0
DOC_PARSER_MAX_FILE_SIZE=104857600
DOC_PARSER_TEMP_DIR=/app/temp/parser
DOC_PARSER_TIMEOUT=60
DOC_PARSER_MAX_PAGES=2000
DOC_PARSER_ENABLE_CLOUD_STORAGE=true
DOC_PARSER_GCS_BUCKET_NAME=prod-documents
DOC_PARSER_GCS_PROJECT_ID=production-project
Error Handling
Always wrap document parsing operations in try-except blocks:
from aiecs.tools.docs.document_parser_tool import (
    DocumentParserTool,
    DocumentParserError,
    UnsupportedDocumentError,
    DownloadError,
    ParseError,
)
parser_tool = DocumentParserTool()
try:
    result = parser_tool.parse_document(
        source="https://example.com/document.pdf",
        strategy="full_content",
        output_format="json"
    )
except UnsupportedDocumentError as e:
    print(f"Unsupported document type: {e}")
except DownloadError as e:
    print(f"Download failed: {e}")
except ParseError as e:
    print(f"Parsing failed: {e}")
except DocumentParserError as e:
    print(f"Document parser error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
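The order of the except clauses above matters: Python uses the first clause that matches, so specific errors must come before more general ones. The sketch below demonstrates this with stand-in classes. It assumes that DocumentParserError is the base class of the other errors, which the ordering above suggests but this guide does not state; the classes defined here are hypothetical substitutes, not the aiecs ones.

```python
# Hypothetical stand-ins that mirror the assumed hierarchy.
class DocumentParserError(Exception):
    pass


class DownloadError(DocumentParserError):
    pass


class ParseError(DocumentParserError):
    pass


def classify(error):
    """Raise `error` and report which except clause handled it."""
    try:
        raise error
    except DownloadError:
        return "download"
    except ParseError:
        return "parse"
    except DocumentParserError:
        return "parser"
    except Exception:
        return "unexpected"


for exc in (DownloadError("x"), ParseError("x"), DocumentParserError("x"), KeyError("x")):
    print(type(exc).__name__, "->", classify(exc))
```

If the DocumentParserError clause came first, it would also catch DownloadError and ParseError, and the more specific handlers would never run.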
Dependencies
Core Dependencies
# Install core dependencies
pip install pydantic pydantic-settings python-dotenv httpx
# Install document processing dependencies
pip install python-docx openpyxl python-pptx
# Install PDF processing dependencies
pip install PyPDF2 pdfplumber
# Install image processing dependencies
pip install pillow pytesseract
Optional Dependencies
# For cloud storage
pip install google-cloud-storage
# For advanced PDF processing
pip install pdfminer.six
# For HTML processing
pip install beautifulsoup4 lxml
# For Excel processing
pip install xlrd xlsxwriter
Verification
# Test dependency availability
try:
    import pydantic
    from pydantic_settings import BaseSettings
    import httpx
    import docx
    import PyPDF2
    import PIL
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")
# Test external tool availability
try:
    from aiecs.tools.scraper_tool import ScraperTool
    from aiecs.tools.task_tools.office_tool import OfficeTool
    from aiecs.tools.task_tools.image_tool import ImageTool
    print("External tools available")
except ImportError as e:
    print(f"External tool not available: {e}")
# Test cloud storage availability
try:
    from google.cloud import storage
    print("Cloud storage available")
except ImportError:
    print("Cloud storage not available")
Support
For issues or questions about Document Parser Tool configuration:
Check the tool source code for implementation details
Review external tool documentation for specific features
Consult the main aiecs documentation for architecture overview
Test with simple documents first to isolate configuration vs. parsing issues
Monitor directory permissions and disk space
Verify network connectivity and timeouts
Check cloud storage configuration and credentials
Ensure proper file size and page limits
Validate document format support