Document Parser Tool - Modern High-Performance Document Parsing Component

Overview

Document Parser Tool is a modern, standardized high-performance document parsing component that can parse specified documents (URLs, local files, or cloud storage) through AI instructions. The component first determines the document type, then selects an appropriate parsing strategy based on the type, and finally passes the parsing results to AI for intelligent analysis.

🏗️ New Directory Structure

Document processing tools have been moved to a dedicated docs directory for clearer structure:

aiecs/tools/
├── docs/                              # 📁 Document processing tools directory
│   ├── __init__.py                    # Document tools module initialization
│   ├── document_parser_tool.py        # 🔧 Core document parser tool
│   └── ai_document_orchestrator.py    # 🤖 AI intelligent orchestrator
├── task_tools/                        # 📁 Other task tools
│   ├── chart_tool.py
│   ├── scraper_tool.py
│   └── ...
└── __init__.py                        # Main tool registration

Core Features

1. Intelligent Document Type Detection

Multiple Detection Mechanisms: File extension, MIME type, content feature detection
High Accuracy: Combines multiple detection methods, provides confidence scores
Supported Formats: PDF, DOCX, XLSX, PPTX, TXT, HTML, RTF, CSV, JSON, XML, Markdown, images, etc.

2. Multi-Source Document Retrieval

Cloud Storage Support: Google Cloud Storage, AWS S3, Azure Blob Storage
URL Download: Supports direct download from HTTP/HTTPS links
Local Files: Process documents in local file system
Storage ID: Supports UUID or custom storage identifiers

3. Diverse Parsing Strategies

TEXT_ONLY: Pure text extraction
STRUCTURED: Structured content parsing
FULL_CONTENT: Complete content extraction (default)
METADATA_ONLY: Extract metadata only

4. AI Intelligent Orchestration

Multiple AI Provider Support: OpenAI, Google Vertex AI, xAI
Intelligent Processing Modes: Summarize, information extraction, analysis, translation, classification, Q&A
Batch Processing: Supports concurrent processing of multiple documents
Custom Workflows: Can create custom processing flows

5.1 High-Performance Architecture

Async Processing: Supports asynchronous operations and concurrent processing
Caching Mechanism: Intelligent caching of parsing results
Error Handling: Comprehensive error handling and retry mechanisms
Resource Management: Automatic cleanup of temporary files

Architecture Design

Component Architecture

Document Parsing Component (aiecs/tools/docs/)
├── DocumentParserTool           # Core parser tool
│   ├── Document type detector
│   ├── Content parser
│   ├── Metadata extractor
│   └── Output formatter
│
├── AIDocumentOrchestrator       # AI intelligent orchestrator
│   ├── AI provider management
│   ├── Processing template system
│   ├── Batch processing engine
│   └── Result post-processor
│
└── Dependent Tool Integration
    ├── ScraperTool             # URL download
    ├── OfficeTool              # Office document processing
    └── ImageTool               # Image OCR

Workflow

graph TD
    A[Input: URL/File Path/Cloud Storage] --> B[Document Type Detection]
    B --> C{Detection Result}
    C -->|Success| D[Download/Load Document]
    C -->|Failure| E[Error Handling]
    D --> F[Select Parsing Strategy]
    F --> G[Document Parsing]
    G --> H[AI Intelligent Processing]
    H --> I[Result Formatting]
    I --> J[Output Result]

Usage Methods

1. Basic Document Parsing (New Import Path)

# Use new import path
from aiecs.tools.docs.document_parser_tool import DocumentParserTool

# Initialize parser
parser = DocumentParserTool()

# Parse document (supports multiple sources)
result = parser.parse_document(
    source="https://example.com/document.pdf",  # URL
    strategy="full_content",
    output_format="json",
    extract_metadata=True
)

print(f"Document type: {result['document_type']}")
print(f"Content preview: {result['content'][:200]}...")

1.1 Cloud Storage Document Parsing

# Configure cloud storage support
config = {
    "enable_cloud_storage": True,
    "gcs_bucket_name": "my-documents",
    "gcs_project_id": "my-project"
}

parser = DocumentParserTool(config)

# Support multiple cloud storage formats
cloud_sources = [
    "gs://my-bucket/documents/report.pdf",        # Google Cloud Storage
    "s3://my-bucket/files/presentation.pptx",     # AWS S3  
    "azure://my-container/data/contract.docx",    # Azure Blob
    "cloud://shared/documents/analysis.xlsx",     # Generic cloud storage
    "doc_123456789abcdef",                        # Storage ID
    "a1b2c3d4-e5f6-7890-abcd-ef1234567890"      # UUID storage ID
]

for source in cloud_sources:
    try:
        result = parser.parse_document(source=source)
        print(f"✓ Successfully parsed: {source}")
    except Exception as e:
        print(f"✗ Parsing failed: {source} - {e}")

2. Document Type Detection

# Detect document type
detection_result = parser.detect_document_type(
    source="https://example.com/unknown_document",
    download_sample=True
)

print(f"Detected type: {detection_result['detected_type']}")
print(f"Confidence: {detection_result['confidence']}")
print(f"Detection methods: {detection_result['detection_methods']}")

3. AI Intelligent Analysis (New Import Path)

from aiecs.tools.docs.ai_document_orchestrator import AIDocumentOrchestrator

# Initialize AI orchestrator
orchestrator = AIDocumentOrchestrator()

# AI document analysis
result = orchestrator.process_document(
    source="document.pdf",
    processing_mode="summarize",
    ai_provider="openai"
)

print(f"AI Summary: {result['ai_result']['ai_response']}")

4. Batch Processing

# Batch process multiple documents
batch_result = orchestrator.batch_process_documents(
    sources=[
        "doc1.pdf",
        "https://example.com/doc2.docx",
        "gs://bucket/doc3.txt"  # Cloud storage support
    ],
    processing_mode="analyze",
    max_concurrent=3
)

print(f"Processing successful: {batch_result['successful_documents']}")
print(f"Processing failed: {batch_result['failed_documents']}")

5. Custom Processing Flow

# Create custom processor
custom_analyzer = orchestrator.create_custom_processor(
    system_prompt="You are a professional legal document analyst",
    user_prompt_template="Analyze the following legal document and extract key information: {content}"
)

# Use custom processor
result = custom_analyzer("legal_document.pdf")

Configuration Options

DocumentParserTool Configuration

config = {
    "max_file_size": 50 * 1024 * 1024,  # 50MB
    "timeout": 30,
    "default_encoding": "utf-8",
    "max_pages": 1000,
    # Cloud storage configuration
    "enable_cloud_storage": True,
    "gcs_bucket_name": "aiecs-documents",
    "gcs_project_id": "my-project"
}

parser = DocumentParserTool(config)

AIDocumentOrchestrator Configuration

config = {
    "default_ai_provider": "openai",
    "max_chunk_size": 4000,
    "max_concurrent_requests": 5,
    "default_temperature": 0.1,
    "max_tokens": 2000
}

orchestrator = AIDocumentOrchestrator(config)

Supported Document Formats

Format	Extensions	Parser	Features
PDF	.pdf	OfficeTool + Custom	Text extraction, page splitting
Word	.docx, .doc	OfficeTool	Paragraphs, styles, tables
Excel	.xlsx, .xls	OfficeTool	Worksheets, cell data
PowerPoint	.pptx, .ppt	OfficeTool	Slides, text, images
Plain Text	.txt	Built-in	Encoding detection, line splitting
HTML	.html, .htm	BeautifulSoup	Structured parsing, tag extraction
Markdown	.md, .markdown	Built-in	Title extraction, structured
CSV	.csv	Pandas	Table data, column analysis
JSON	.json	Built-in	Structured data parsing
XML	.xml	Built-in	Hierarchical structure parsing
Images	.jpg, .png, .gif	ImageTool	OCR text recognition

Cloud Storage Support

Supported Cloud Storage Formats

Google Cloud Storage: gs://bucket/path/file.pdf
AWS S3: s3://bucket/path/file.pdf
Azure Blob Storage: azure://container/path/file.pdf
Generic Cloud Storage: cloud://path/file.pdf
Storage ID: doc_123456789abcdef
UUID Identifier: a1b2c3d4-e5f6-7890-abcd-ef1234567890

Cloud Storage Configuration Examples

# Google Cloud Storage
gcs_config = {
    "enable_cloud_storage": True,
    "gcs_bucket_name": "my-gcs-bucket",
    "gcs_project_id": "my-gcp-project",
    "gcs_location": "US"
}

# AWS S3 (via compatible interface)
s3_config = {
    "enable_cloud_storage": True,
    "gcs_bucket_name": "my-s3-bucket",
    "gcs_project_id": "aws-compat-project"
}

parser = DocumentParserTool(gcs_config)

AI Processing Modes

1. Document Summarization (SUMMARIZE)

Generate concise, informative summaries
Highlight key points and themes
Support multiple length settings

2. Information Extraction (EXTRACT_INFO)

Extract specific information based on specified criteria
Structured data output
Support custom extraction rules

3. Content Analysis (ANALYZE)

Deep content analysis
Topic identification, sentiment analysis
Structure and organization analysis

4. Document Translation (TRANSLATE)

Multi-language translation support
Preserve original format
Context-aware translation

5. Document Classification (CLASSIFY)

Automatic document classification
Confidence scoring
Custom classification system

6. Q&A System (ANSWER_QUESTIONS)

Answer questions based on document content
Cite relevant paragraphs
Support complex reasoning

Performance Optimization

1. Caching Strategy

Document parsing result caching
AI response caching
Type detection result caching

2. Concurrent Processing

Asynchronous I/O operations
Parallel processing of multiple documents
Resource pool management

3. Memory Management

Chunk processing for large documents
Automatic cleanup of temporary files
Memory usage monitoring

4. Error Handling

Intelligent retry mechanism
Degradation processing strategy
Detailed error logging

Error Handling

Common Error Types

DocumentParserError: Basic parsing error
UnsupportedDocumentError: Unsupported document type
DownloadError: Document download failure
ParseError: Parsing process error
AIProviderError: AI service error
ProcessingError: Processing flow error

Error Handling Example

try:
    result = parser.parse_document(source="problematic_doc.pdf")
except UnsupportedDocumentError as e:
    print(f"Unsupported document type: {e}")
except DownloadError as e:
    print(f"Download failed: {e}")
except ParseError as e:
    print(f"Parsing failed: {e}")
except Exception as e:
    print(f"Unknown error: {e}")

Migration Guide

Migrating from Old Version

If you previously used the old import path, update as follows:

# Old import path (deprecated)
# from aiecs.tools.task_tools.document_parser_tool import DocumentParserTool
# from aiecs.tools.task_tools.ai_document_orchestrator import AIDocumentOrchestrator

# New import path
from aiecs.tools.docs.document_parser_tool import DocumentParserTool
from aiecs.tools.docs.ai_document_orchestrator import AIDocumentOrchestrator

# Or use lazy loading
from aiecs.tools.docs import document_parser_tool, ai_document_orchestrator

Batch Update Script

# Script to batch update import paths
find . -name "*.py" -exec sed -i 's/from aiecs\.tools\.task_tools\.document_parser_tool/from aiecs.tools.docs.document_parser_tool/g' {} \;
find . -name "*.py" -exec sed -i 's/from aiecs\.tools\.task_tools\.ai_document_orchestrator/from aiecs.tools.docs.ai_document_orchestrator/g' {} \;

Extension Development

1. Add New Document Format Support

# Add new parsing method in DocumentParserTool
def _parse_new_format(self, file_path: str, strategy: ParsingStrategy):
    # Implement parsing logic for new format
    pass

2. Custom AI Processing Templates

# Add new processing template
orchestrator.processing_templates["custom_mode"] = {
    "system_prompt": "Custom system prompt",
    "user_prompt_template": "Custom user prompt template: {content}"
}

3. Integrate New AI Provider

# Extend AI provider support
def _call_custom_ai_provider(self, prompt: str, params: Dict):
    # Implement custom AI provider call
    pass

Best Practices

1. Document Processing

Detect document type first, then select processing strategy
Use chunk processing for large documents
Set reasonable timeout values

2. AI Processing

Select appropriate AI model based on document content
Use caching to avoid duplicate processing
Set reasonable concurrency limits

3. Error Handling

Implement comprehensive error handling logic
Record detailed processing logs
Provide user-friendly error messages

4. Performance Optimization

Use async processing to improve concurrency performance
Reasonably configure caching strategies
Monitor resource usage

Out-of-the-Box Check

Run the following code to verify if the system is ready to use out of the box:

def system_readiness_check():
    """System readiness check"""
    
    print("🔍 AIECS Document Processing System Readiness Check")
    print("=" * 50)
    
    try:
        # 1. Import test
        from aiecs.tools.docs.document_parser_tool import DocumentParserTool
        from aiecs.tools.docs.ai_document_orchestrator import AIDocumentOrchestrator
        print("✓ Module import successful")
        
        # 2. Initialization test
        parser = DocumentParserTool()
        orchestrator = AIDocumentOrchestrator()
        print("✓ Tool initialization successful")
        
        # 3. Feature check
        print(f"✓ Cloud storage support: {parser.settings.enable_cloud_storage}")
        print(f"✓ AI provider: {orchestrator.settings.default_ai_provider}")
        print(f"✓ Concurrency limit: {orchestrator.settings.max_concurrent_requests}")
        
        # 4. Path check
        source_types = [
            ("Local file", "/tmp/test.txt"),
            ("HTTP URL", "https://example.com/file.pdf"),
            ("Cloud storage GCS", "gs://bucket/file.pdf"),
            ("Cloud storage S3", "s3://bucket/file.pdf"),
            ("Storage ID", "doc_123456")
        ]
        
        for name, source in source_types:
            can_handle = (
                os.path.exists(source) or
                parser._is_url(source) or
                parser._is_cloud_storage_path(source) or
                parser._is_storage_id(source)
            )
            print(f"✓ {name} support: {source}")
        
        print("\n🎉 System fully ready, ready to use out of the box!")
        return True
        
    except Exception as e:
        print(f"✗ System check failed: {e}")
        return False

# Run readiness check
if __name__ == "__main__":
    system_readiness_check()

Quick Start

See the complete quick start guide: docs/TOOLS_USED_INSTRUCTION/DOCUMENT_PARSER_QUICK_START.md

Example Code

Basic usage example: examples/document_processing_example.py
Cloud storage example: examples/cloud_storage_document_example.py

Future Plans

Enhanced Document Format Support
- More Office format support
- Intelligent chart and table recognition
- Complex layout document processing
AI Capability Expansion
- Multimodal document understanding
- Document structure reconstruction
- Intelligent document generation
Performance Optimization
- Distributed processing support
- Streaming processing capability
- Edge computing support
Enterprise Features
- Permission control and security auditing
- Large-scale batch processing
- Integrated monitoring and alerting