Document Parser Tool - Modern High-Performance Document Parsing Component
Overview
Document Parser Tool is a modern, standardized high-performance document parsing component that can parse specified documents (URLs, local files, or cloud storage) through AI instructions. The component first determines the document type, then selects an appropriate parsing strategy based on the type, and finally passes the parsing results to AI for intelligent analysis.
🏗️ New Directory Structure
Document processing tools have been moved to a dedicated docs directory for clearer structure:
aiecs/tools/
├── docs/ # 📁 Document processing tools directory
│ ├── __init__.py # Document tools module initialization
│ ├── document_parser_tool.py # 🔧 Core document parser tool
│ └── ai_document_orchestrator.py # 🤖 AI intelligent orchestrator
├── task_tools/ # 📁 Other task tools
│ ├── chart_tool.py
│ ├── scraper_tool.py
│ └── ...
└── __init__.py # Main tool registration
Core Features
1. Intelligent Document Type Detection
Multiple Detection Mechanisms: File extension, MIME type, content feature detection
High Accuracy: Combines multiple detection methods, provides confidence scores
Supported Formats: PDF, DOCX, XLSX, PPTX, TXT, HTML, RTF, CSV, JSON, XML, Markdown, images, etc.
2. Multi-Source Document Retrieval
Cloud Storage Support: Google Cloud Storage, AWS S3, Azure Blob Storage
URL Download: Supports direct download from HTTP/HTTPS links
Local Files: Process documents in local file system
Storage ID: Supports UUID or custom storage identifiers
3. Diverse Parsing Strategies
TEXT_ONLY: Pure text extraction
STRUCTURED: Structured content parsing
FULL_CONTENT: Complete content extraction (default)
METADATA_ONLY: Extract metadata only
4. AI Intelligent Orchestration
Multiple AI Provider Support: OpenAI, Google Vertex AI, xAI
Intelligent Processing Modes: Summarize, information extraction, analysis, translation, classification, Q&A
Batch Processing: Supports concurrent processing of multiple documents
Custom Workflows: Can create custom processing flows
5.1 High-Performance Architecture
Async Processing: Supports asynchronous operations and concurrent processing
Caching Mechanism: Intelligent caching of parsing results
Error Handling: Comprehensive error handling and retry mechanisms
Resource Management: Automatic cleanup of temporary files
Architecture Design
Component Architecture
Document Parsing Component (aiecs/tools/docs/)
├── DocumentParserTool # Core parser tool
│ ├── Document type detector
│ ├── Content parser
│ ├── Metadata extractor
│ └── Output formatter
│
├── AIDocumentOrchestrator # AI intelligent orchestrator
│ ├── AI provider management
│ ├── Processing template system
│ ├── Batch processing engine
│ └── Result post-processor
│
└── Dependent Tool Integration
├── ScraperTool # URL download
├── OfficeTool # Office document processing
└── ImageTool # Image OCR
Workflow
graph TD
A[Input: URL/File Path/Cloud Storage] --> B[Document Type Detection]
B --> C{Detection Result}
C -->|Success| D[Download/Load Document]
C -->|Failure| E[Error Handling]
D --> F[Select Parsing Strategy]
F --> G[Document Parsing]
G --> H[AI Intelligent Processing]
H --> I[Result Formatting]
I --> J[Output Result]
Usage Methods
1. Basic Document Parsing (New Import Path)
# Use new import path
from aiecs.tools.docs.document_parser_tool import DocumentParserTool
# Initialize parser
parser = DocumentParserTool()
# Parse document (supports multiple sources)
result = parser.parse_document(
source="https://example.com/document.pdf", # URL
strategy="full_content",
output_format="json",
extract_metadata=True
)
print(f"Document type: {result['document_type']}")
print(f"Content preview: {result['content'][:200]}...")
1.1 Cloud Storage Document Parsing
# Configure cloud storage support
config = {
"enable_cloud_storage": True,
"gcs_bucket_name": "my-documents",
"gcs_project_id": "my-project"
}
parser = DocumentParserTool(config)
# Support multiple cloud storage formats
cloud_sources = [
"gs://my-bucket/documents/report.pdf", # Google Cloud Storage
"s3://my-bucket/files/presentation.pptx", # AWS S3
"azure://my-container/data/contract.docx", # Azure Blob
"cloud://shared/documents/analysis.xlsx", # Generic cloud storage
"doc_123456789abcdef", # Storage ID
"a1b2c3d4-e5f6-7890-abcd-ef1234567890" # UUID storage ID
]
for source in cloud_sources:
try:
result = parser.parse_document(source=source)
print(f"✓ Successfully parsed: {source}")
except Exception as e:
print(f"✗ Parsing failed: {source} - {e}")
2. Document Type Detection
# Detect document type
detection_result = parser.detect_document_type(
source="https://example.com/unknown_document",
download_sample=True
)
print(f"Detected type: {detection_result['detected_type']}")
print(f"Confidence: {detection_result['confidence']}")
print(f"Detection methods: {detection_result['detection_methods']}")
3. AI Intelligent Analysis (New Import Path)
from aiecs.tools.docs.ai_document_orchestrator import AIDocumentOrchestrator
# Initialize AI orchestrator
orchestrator = AIDocumentOrchestrator()
# AI document analysis
result = orchestrator.process_document(
source="document.pdf",
processing_mode="summarize",
ai_provider="openai"
)
print(f"AI Summary: {result['ai_result']['ai_response']}")
4. Batch Processing
# Batch process multiple documents
batch_result = orchestrator.batch_process_documents(
sources=[
"doc1.pdf",
"https://example.com/doc2.docx",
"gs://bucket/doc3.txt" # Cloud storage support
],
processing_mode="analyze",
max_concurrent=3
)
print(f"Processing successful: {batch_result['successful_documents']}")
print(f"Processing failed: {batch_result['failed_documents']}")
5. Custom Processing Flow
# Create custom processor
custom_analyzer = orchestrator.create_custom_processor(
system_prompt="You are a professional legal document analyst",
user_prompt_template="Analyze the following legal document and extract key information: {content}"
)
# Use custom processor
result = custom_analyzer("legal_document.pdf")
Configuration Options
DocumentParserTool Configuration
config = {
"max_file_size": 50 * 1024 * 1024, # 50MB
"timeout": 30,
"default_encoding": "utf-8",
"max_pages": 1000,
# Cloud storage configuration
"enable_cloud_storage": True,
"gcs_bucket_name": "aiecs-documents",
"gcs_project_id": "my-project"
}
parser = DocumentParserTool(config)
AIDocumentOrchestrator Configuration
config = {
"default_ai_provider": "openai",
"max_chunk_size": 4000,
"max_concurrent_requests": 5,
"default_temperature": 0.1,
"max_tokens": 2000
}
orchestrator = AIDocumentOrchestrator(config)
Supported Document Formats
Format |
Extensions |
Parser |
Features |
|---|---|---|---|
OfficeTool + Custom |
Text extraction, page splitting |
||
Word |
.docx, .doc |
OfficeTool |
Paragraphs, styles, tables |
Excel |
.xlsx, .xls |
OfficeTool |
Worksheets, cell data |
PowerPoint |
.pptx, .ppt |
OfficeTool |
Slides, text, images |
Plain Text |
.txt |
Built-in |
Encoding detection, line splitting |
HTML |
.html, .htm |
BeautifulSoup |
Structured parsing, tag extraction |
Markdown |
.md, .markdown |
Built-in |
Title extraction, structured |
CSV |
.csv |
Pandas |
Table data, column analysis |
JSON |
.json |
Built-in |
Structured data parsing |
XML |
.xml |
Built-in |
Hierarchical structure parsing |
Images |
.jpg, .png, .gif |
ImageTool |
OCR text recognition |
Cloud Storage Support
Supported Cloud Storage Formats
Google Cloud Storage:
gs://bucket/path/file.pdfAWS S3:
s3://bucket/path/file.pdfAzure Blob Storage:
azure://container/path/file.pdfGeneric Cloud Storage:
cloud://path/file.pdfStorage ID:
doc_123456789abcdefUUID Identifier:
a1b2c3d4-e5f6-7890-abcd-ef1234567890
Cloud Storage Configuration Examples
# Google Cloud Storage
gcs_config = {
"enable_cloud_storage": True,
"gcs_bucket_name": "my-gcs-bucket",
"gcs_project_id": "my-gcp-project",
"gcs_location": "US"
}
# AWS S3 (via compatible interface)
s3_config = {
"enable_cloud_storage": True,
"gcs_bucket_name": "my-s3-bucket",
"gcs_project_id": "aws-compat-project"
}
parser = DocumentParserTool(gcs_config)
AI Processing Modes
1. Document Summarization (SUMMARIZE)
Generate concise, informative summaries
Highlight key points and themes
Support multiple length settings
2. Information Extraction (EXTRACT_INFO)
Extract specific information based on specified criteria
Structured data output
Support custom extraction rules
3. Content Analysis (ANALYZE)
Deep content analysis
Topic identification, sentiment analysis
Structure and organization analysis
4. Document Translation (TRANSLATE)
Multi-language translation support
Preserve original format
Context-aware translation
5. Document Classification (CLASSIFY)
Automatic document classification
Confidence scoring
Custom classification system
6. Q&A System (ANSWER_QUESTIONS)
Answer questions based on document content
Cite relevant paragraphs
Support complex reasoning
Performance Optimization
1. Caching Strategy
Document parsing result caching
AI response caching
Type detection result caching
2. Concurrent Processing
Asynchronous I/O operations
Parallel processing of multiple documents
Resource pool management
3. Memory Management
Chunk processing for large documents
Automatic cleanup of temporary files
Memory usage monitoring
4. Error Handling
Intelligent retry mechanism
Degradation processing strategy
Detailed error logging
Error Handling
Common Error Types
DocumentParserError: Basic parsing error
UnsupportedDocumentError: Unsupported document type
DownloadError: Document download failure
ParseError: Parsing process error
AIProviderError: AI service error
ProcessingError: Processing flow error
Error Handling Example
try:
result = parser.parse_document(source="problematic_doc.pdf")
except UnsupportedDocumentError as e:
print(f"Unsupported document type: {e}")
except DownloadError as e:
print(f"Download failed: {e}")
except ParseError as e:
print(f"Parsing failed: {e}")
except Exception as e:
print(f"Unknown error: {e}")
Migration Guide
Migrating from Old Version
If you previously used the old import path, update as follows:
# Old import path (deprecated)
# from aiecs.tools.task_tools.document_parser_tool import DocumentParserTool
# from aiecs.tools.task_tools.ai_document_orchestrator import AIDocumentOrchestrator
# New import path
from aiecs.tools.docs.document_parser_tool import DocumentParserTool
from aiecs.tools.docs.ai_document_orchestrator import AIDocumentOrchestrator
# Or use lazy loading
from aiecs.tools.docs import document_parser_tool, ai_document_orchestrator
Batch Update Script
# Script to batch update import paths
find . -name "*.py" -exec sed -i 's/from aiecs\.tools\.task_tools\.document_parser_tool/from aiecs.tools.docs.document_parser_tool/g' {} \;
find . -name "*.py" -exec sed -i 's/from aiecs\.tools\.task_tools\.ai_document_orchestrator/from aiecs.tools.docs.ai_document_orchestrator/g' {} \;
Extension Development
1. Add New Document Format Support
# Add new parsing method in DocumentParserTool
def _parse_new_format(self, file_path: str, strategy: ParsingStrategy):
# Implement parsing logic for new format
pass
2. Custom AI Processing Templates
# Add new processing template
orchestrator.processing_templates["custom_mode"] = {
"system_prompt": "Custom system prompt",
"user_prompt_template": "Custom user prompt template: {content}"
}
3. Integrate New AI Provider
# Extend AI provider support
def _call_custom_ai_provider(self, prompt: str, params: Dict):
# Implement custom AI provider call
pass
Best Practices
1. Document Processing
Detect document type first, then select processing strategy
Use chunk processing for large documents
Set reasonable timeout values
2. AI Processing
Select appropriate AI model based on document content
Use caching to avoid duplicate processing
Set reasonable concurrency limits
3. Error Handling
Implement comprehensive error handling logic
Record detailed processing logs
Provide user-friendly error messages
4. Performance Optimization
Use async processing to improve concurrency performance
Reasonably configure caching strategies
Monitor resource usage
Out-of-the-Box Check
Run the following code to verify if the system is ready to use out of the box:
def system_readiness_check():
"""System readiness check"""
print("🔍 AIECS Document Processing System Readiness Check")
print("=" * 50)
try:
# 1. Import test
from aiecs.tools.docs.document_parser_tool import DocumentParserTool
from aiecs.tools.docs.ai_document_orchestrator import AIDocumentOrchestrator
print("✓ Module import successful")
# 2. Initialization test
parser = DocumentParserTool()
orchestrator = AIDocumentOrchestrator()
print("✓ Tool initialization successful")
# 3. Feature check
print(f"✓ Cloud storage support: {parser.settings.enable_cloud_storage}")
print(f"✓ AI provider: {orchestrator.settings.default_ai_provider}")
print(f"✓ Concurrency limit: {orchestrator.settings.max_concurrent_requests}")
# 4. Path check
source_types = [
("Local file", "/tmp/test.txt"),
("HTTP URL", "https://example.com/file.pdf"),
("Cloud storage GCS", "gs://bucket/file.pdf"),
("Cloud storage S3", "s3://bucket/file.pdf"),
("Storage ID", "doc_123456")
]
for name, source in source_types:
can_handle = (
os.path.exists(source) or
parser._is_url(source) or
parser._is_cloud_storage_path(source) or
parser._is_storage_id(source)
)
print(f"✓ {name} support: {source}")
print("\n🎉 System fully ready, ready to use out of the box!")
return True
except Exception as e:
print(f"✗ System check failed: {e}")
return False
# Run readiness check
if __name__ == "__main__":
system_readiness_check()
Quick Start
See the complete quick start guide: docs/TOOLS_USED_INSTRUCTION/DOCUMENT_PARSER_QUICK_START.md
Example Code
Basic usage example:
examples/document_processing_example.pyCloud storage example:
examples/cloud_storage_document_example.py
Future Plans
Enhanced Document Format Support
More Office format support
Intelligent chart and table recognition
Complex layout document processing
AI Capability Expansion
Multimodal document understanding
Document structure reconstruction
Intelligent document generation
Performance Optimization
Distributed processing support
Streaming processing capability
Edge computing support
Enterprise Features
Permission control and security auditing
Large-scale batch processing
Integrated monitoring and alerting