Document Parser Tool - Quick Start Guide
🚀 Ready to Use
The document parsing component works out of the box: developers can import these tools directly into their projects.
📁 New Directory Structure
aiecs/tools/
├── docs/                              # Document processing tools directory
│   ├── __init__.py                    # Document tools module initialization
│   ├── document_parser_tool.py        # Core document parser tool
│   └── ai_document_orchestrator.py    # AI intelligent orchestrator
├── task_tools/                        # Other task tools
│   ├── chart_tool.py
│   ├── scraper_tool.py
│   └── ...
└── __init__.py                        # Main tool registration
🔧 Installation and Configuration
1. Basic Installation
# Project already includes all necessary dependencies
pip install -e .
# Or install from PyPI
pip install aiecs
2. Environment Variable Configuration (Optional)
# Document parser configuration
export DOC_PARSER_enable_cloud_storage=true
export DOC_PARSER_gcs_bucket_name=your-bucket-name
export DOC_PARSER_gcs_project_id=your-project-id
# AI orchestrator configuration
export AI_DOC_ORCHESTRATOR_default_ai_provider=openai
export AI_DOC_ORCHESTRATOR_max_chunk_size=4000
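The variables above follow a prefix-plus-field naming pattern. As a rough sketch of how such prefixed variables could be gathered into a settings dictionary: the helper `collect_prefixed_env` below is hypothetical and not part of aiecs, and this guide does not show how the library itself reads its environment.

```python
import os


def collect_prefixed_env(prefix, environ=None):
    """Return {field: value} for every variable whose name starts with prefix.

    Hypothetical helper for illustration only; not an aiecs API.
    Values stay as strings, exactly as the shell exports them.
    """
    environ = os.environ if environ is None else environ
    return {
        name[len(prefix):]: value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


# A plain dict stands in for os.environ so the example is reproducible.
example_env = {
    "DOC_PARSER_enable_cloud_storage": "true",
    "DOC_PARSER_gcs_bucket_name": "your-bucket-name",
    "AI_DOC_ORCHESTRATOR_max_chunk_size": "4000",
}
parser_settings = collect_prefixed_env("DOC_PARSER_", example_env)
print(parser_settings)
```

Note that every value arrives as a string, so a real loader would still need to convert `"true"` and `"4000"` to a boolean and an integer.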
💻 Basic Usage
1. Import Tools (New Path)
# Import document processing tools from docs directory
from aiecs.tools.docs.document_parser_tool import DocumentParserTool
from aiecs.tools.docs.ai_document_orchestrator import AIDocumentOrchestrator
# Or use lazy loading
from aiecs.tools.docs import document_parser_tool, ai_document_orchestrator
2. Quick Start Example
#!/usr/bin/env python3
"""
Document processing quick start example
"""
def quick_start_example():
    # 1. Initialize tools
    from aiecs.tools.docs.document_parser_tool import DocumentParserTool
    from aiecs.tools.docs.ai_document_orchestrator import AIDocumentOrchestrator

    parser = DocumentParserTool()
    orchestrator = AIDocumentOrchestrator()

    # 2. Process local document
    result = orchestrator.process_document(
        source="test_document.txt",
        processing_mode="summarize"
    )
    print("AI Summary:", result['ai_result']['ai_response'])


if __name__ == "__main__":
    quick_start_example()
3. Supported Document Sources
# Support multiple document sources
sources = [
    "/path/to/local/file.pdf",           # Local file
    "https://example.com/document.pdf",  # URL link
    "gs://bucket/document.pdf",          # Google Cloud Storage
    "s3://bucket/document.pdf",          # AWS S3
    "azure://container/document.pdf",    # Azure Blob
    "doc_123456789",                     # Storage ID
]

# DocumentParserTool is imported as shown under "Import Tools"
parser = DocumentParserTool()

for source in sources:
    try:
        result = parser.parse_document(source=source)
        print(f"✓ Successfully parsed: {source}")
    except Exception as e:
        print(f"✗ Parsing failed: {source} - {e}")
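To make the source kinds listed above concrete, here is a minimal sketch of how a source string could be classified by its scheme. The function `classify_source` and the storage ID pattern are assumptions made for illustration (the pattern is inferred only from the `doc_123456789` example); the parser's own detection logic may differ.

```python
import re
from urllib.parse import urlparse

# Schemes taken from the source list in this guide.
CLOUD_SCHEMES = {"gs", "s3", "azure"}
# Assumed storage ID shape, based only on the "doc_123456789" example.
STORAGE_ID_PATTERN = re.compile(r"^doc_\d+$")


def classify_source(source):
    """Label a source string as 'cloud', 'url', 'storage_id', or 'local'.

    Hypothetical helper for illustration only; not an aiecs API.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in CLOUD_SCHEMES:
        return "cloud"
    if scheme in {"http", "https"}:
        return "url"
    if STORAGE_ID_PATTERN.match(source):
        return "storage_id"
    # Anything else is treated as a local filesystem path.
    return "local"


for example in ["/path/to/local/file.pdf", "gs://bucket/document.pdf", "doc_123456789"]:
    print(example, "->", classify_source(example))
```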
🌐 Cloud Storage Configuration
Google Cloud Storage
config = {
    "enable_cloud_storage": True,
    "gcs_bucket_name": "my-documents",
    "gcs_project_id": "my-project-id"
}
parser = DocumentParserTool(config)
Process Cloud Storage Documents
# Directly process documents in cloud storage
cloud_doc = "gs://my-bucket/reports/annual_report.pdf"
result = orchestrator.process_document(
    source=cloud_doc,
    processing_mode="extract_info",
    processing_params={
        "extraction_criteria": "Financial data, key metrics, conclusions"
    }
)
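A cloud path such as `gs://my-bucket/reports/annual_report.pdf` combines a scheme, a bucket, and an object key. The sketch below shows one way to pull those parts out with the standard library before passing a path to the orchestrator; `split_cloud_uri` is a hypothetical helper, not something aiecs provides.

```python
from urllib.parse import urlparse


def split_cloud_uri(uri):
    """Split 'scheme://bucket/key' into a (scheme, bucket, key) tuple.

    Hypothetical helper for illustration only; not an aiecs API.
    """
    parsed = urlparse(uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"Not a cloud storage URI: {uri!r}")
    # The path keeps its leading slash, which is not part of the object key.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


scheme, bucket, key = split_cloud_uri("gs://my-bucket/reports/annual_report.pdf")
print(scheme, bucket, key)
```

Validating the path up front like this gives a clearer error than a failed download later in the pipeline.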
🎯 Real-World Application Examples
1. Batch Process Documents
def batch_process_documents():
    orchestrator = AIDocumentOrchestrator()
    documents = [
        "gs://docs/report1.pdf",
        "gs://docs/report2.pdf",
        "s3://legal/contract.docx"
    ]
    result = orchestrator.batch_process_documents(
        sources=documents,
        processing_mode="analyze",
        max_concurrent=3
    )
    print(f"Successfully processed: {result['successful_documents']}")
    return result


# Run batch processing
batch_result = batch_process_documents()
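The `max_concurrent` parameter above caps how many documents are handled at once. As an illustration of that general idea (not of how aiecs implements it), the sketch below bounds concurrency with a thread pool and collects failures instead of stopping on the first one. `run_batch`, `fake_process`, and every result key other than `successful_documents` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_process(source):
    """Stand-in for a real per-document call; fails on .docx to show error capture."""
    if source.endswith(".docx"):
        raise ValueError("unsupported in this demo")
    return f"processed {source}"


def run_batch(sources, worker, max_concurrent=3):
    """Run worker over sources using at most max_concurrent threads.

    Hypothetical helper for illustration only; not an aiecs API.
    """
    successes, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {pool.submit(worker, src): src for src in sources}
        for future in as_completed(futures):
            src = futures[future]
            try:
                successes[src] = future.result()
            except Exception as exc:
                # Record the failure and keep going with the other documents.
                failures[src] = str(exc)
    return {
        "successful_documents": len(successes),
        "failed_documents": len(failures),
        "results": successes,
        "errors": failures,
    }


summary = run_batch(
    ["gs://docs/report1.pdf", "gs://docs/report2.pdf", "s3://legal/contract.docx"],
    fake_process,
)
print(summary["successful_documents"], "succeeded,", summary["failed_documents"], "failed")
```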
2. Custom AI Analysis
def custom_document_analysis():
    orchestrator = AIDocumentOrchestrator()
    # Create custom analyzer
    legal_analyzer = orchestrator.create_custom_processor(
        system_prompt="You are a professional legal document analyst",
        user_prompt_template="Analyze the following legal document and extract key clauses: {content}"
    )
    # Use custom analyzer
    result = legal_analyzer("contract.pdf")
    return result


# Run custom analysis
analysis_result = custom_document_analysis()
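The custom processor above pairs a fixed system prompt with a template containing a `{content}` placeholder. A minimal sketch of that pattern, assuming the processor is a closure over the two prompts: `make_processor` and `echo_model` are hypothetical names, the model call is a stub, and unlike the real analyzer (which is called with a document path) this version takes already-extracted text.

```python
def make_processor(system_prompt, user_prompt_template, call_model):
    """Return a callable that fills the template and forwards both prompts.

    Hypothetical factory for illustration only; not an aiecs API.
    """
    def processor(content):
        # Substitute the document text into the {content} placeholder.
        user_prompt = user_prompt_template.format(content=content)
        return call_model(system_prompt, user_prompt)

    return processor


def echo_model(system_prompt, user_prompt):
    """Stand-in for a real AI call: returns the prompts it was given."""
    return {"system": system_prompt, "user": user_prompt}


legal_sketch = make_processor(
    "You are a professional legal document analyst",
    "Analyze the following legal document and extract key clauses: {content}",
    echo_model,
)
output = legal_sketch("Clause 1: Payment is due within 30 days.")
print(output["user"])
```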
3. Real-time Document Processing
import asyncio


async def realtime_document_processing():
    orchestrator = AIDocumentOrchestrator()
    # Asynchronously process multiple documents
    tasks = [
        orchestrator.process_document_async(
            source=doc,
            processing_mode="summarize"
        )
        for doc in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
    ]
    results = await asyncio.gather(*tasks)
    return results


# Run async processing
async_results = asyncio.run(realtime_document_processing())
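With a plain `asyncio.gather`, the first exception propagates to the caller and the other results are lost to it. One possible variation, sketched here with a stub coroutine instead of the real orchestrator, limits concurrency with a semaphore and returns exceptions alongside results. `gather_limited` and `fake_summarize` are hypothetical names.

```python
import asyncio


async def fake_summarize(source):
    """Stand-in coroutine; raises for one input to show error capture."""
    await asyncio.sleep(0)
    if source == "doc2.pdf":
        raise RuntimeError("simulated failure")
    return f"summary of {source}"


async def gather_limited(sources, worker, limit=2):
    """Run worker over sources with at most `limit` running at a time.

    Hypothetical helper for illustration only; not an aiecs API.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(source):
        async with semaphore:
            return await worker(source)

    # return_exceptions=True keeps one failure from discarding the rest;
    # results come back in the same order as the inputs.
    return await asyncio.gather(
        *(guarded(source) for source in sources),
        return_exceptions=True,
    )


results = asyncio.run(
    gather_limited(["doc1.pdf", "doc2.pdf", "doc3.pdf"], fake_summarize)
)
for item in results:
    print("error:" if isinstance(item, Exception) else "ok:", item)
```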
🔍 Troubleshooting
Common Issues and Solutions
1. Import Errors
# Wrong old path
# from aiecs.tools.task_tools.document_parser_tool import DocumentParserTool
# Correct new path
from aiecs.tools.docs.document_parser_tool import DocumentParserTool
2. Permission Issues
# If encountering temporary file permission issues
export TMPDIR=/tmp/aiecs_temp
mkdir -p "$TMPDIR"
chmod 755 "$TMPDIR"
3. Cloud Storage Configuration
# Ensure cloud storage configuration is correct
config = {
    "enable_cloud_storage": True,
    "gcs_bucket_name": "your-bucket",
    "gcs_project_id": "your-project"
}
# Test configuration
parser = DocumentParserTool(config)
print("Cloud storage configuration:", parser.settings.enable_cloud_storage)
📊 Feature Checklist
Run the following code to check if all features are working:
def system_check():
    """System feature check"""
    print("🔍 AIECS Document Processing System Check")
    print("=" * 40)

    # 1. Import test
    try:
        from aiecs.tools.docs.document_parser_tool import DocumentParserTool
        from aiecs.tools.docs.ai_document_orchestrator import AIDocumentOrchestrator
        print("✓ Module import successful")
    except ImportError as e:
        print(f"✗ Module import failed: {e}")
        return

    # 2. Initialization test
    try:
        parser = DocumentParserTool()
        orchestrator = AIDocumentOrchestrator()
        print("✓ Tool initialization successful")
    except Exception as e:
        print(f"✗ Tool initialization failed: {e}")
        return

    # 3. Configuration test
    print(f"✓ Cloud storage support: {parser.settings.enable_cloud_storage}")
    print(f"✓ Temporary directory: {parser.settings.temp_dir}")
    print(f"✓ AI provider: {orchestrator.settings.default_ai_provider}")

    # 4. Feature test
    test_sources = [
        ("Local path", "/tmp/test.txt"),
        ("HTTP URL", "https://example.com/file.pdf"),
        ("Cloud storage", "gs://bucket/file.pdf"),
        ("Storage ID", "doc_123456")
    ]
    for name, source in test_sources:
        is_supported = (
            not parser._is_url(source) or
            parser._is_cloud_storage_path(source) or
            parser._is_storage_id(source)
        )
        status = "✓" if is_supported else "✗"
        print(f"{status} {name} support: {source}")

    print("\n🎉 System check completed!")


# Run system check
system_check()
🚀 Production Deployment Recommendations
1. Performance Configuration
# Recommended production environment configuration
production_config = {
    "max_file_size": 100 * 1024 * 1024,  # 100MB
    "timeout": 120,                      # 2 minute timeout
    "max_concurrent_requests": 10,       # Concurrent request limit
    "enable_cloud_storage": True,        # Enable cloud storage
    "max_chunk_size": 8000               # AI processing chunk size
}
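The `max_chunk_size` setting bounds how much text goes to the AI in one piece. The sketch below shows one simple way to split text under such a limit, assuming the limit counts characters (the guide does not say whether the library counts characters or tokens). `split_into_chunks` is a hypothetical helper that breaks at whitespace where it can, and it collapses runs of whitespace as a side effect.

```python
def split_into_chunks(text, max_chunk_size):
    """Split text into pieces of at most max_chunk_size characters.

    Hypothetical helper for illustration only; not an aiecs API.
    Words are packed greedily; a single word longer than the limit is cut.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    chunks, current = [], ""
    for word in text.split():
        # Hard-split any word that cannot fit in a chunk by itself.
        while len(word) > max_chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chunk_size])
            word = word[max_chunk_size:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


demo_chunks = split_into_chunks("alpha beta gamma delta", 11)
print(demo_chunks)
```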
2. Error Handling
import logging

from aiecs.tools.docs.ai_document_orchestrator import AIDocumentOrchestrator

logger = logging.getLogger(__name__)


def robust_document_processing(source):
    """Process a document and return a status dict instead of raising."""
    try:
        orchestrator = AIDocumentOrchestrator()
        result = orchestrator.process_document(
            source=source,
            processing_mode="summarize"
        )
        return {"status": "success", "result": result}
    except Exception as e:
        logger.error(f"Document processing failed: {source} - {e}")
        return {"status": "error", "error": str(e)}
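Transient failures such as network timeouts often succeed on a second attempt, so one common extension of the pattern above is to retry with exponential backoff before reporting an error. This is an addition for illustration, not something the guide's tools are documented to do. `with_retries` and `flaky` are hypothetical names, and the `sleep` parameter exists only so the example runs without real delays.

```python
import time


def with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); after each failure wait base_delay * 2**n, then retry.

    Hypothetical helper for illustration only; not an aiecs API.
    Returns the same status-dict shape as the error handling example.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return {"status": "success", "result": operation()}
        except Exception as exc:
            last_error = exc
            # No point in waiting after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return {"status": "error", "error": str(last_error)}


calls = {"count": 0}


def flaky():
    """Stand-in operation that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary")
    return "ok"


delays = []  # Records requested waits instead of actually sleeping.
outcome = with_retries(flaky, attempts=3, base_delay=0.5, sleep=delays.append)
print(outcome, delays)
```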
3. Monitoring and Logging
import logging
# Configure detailed logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
# Enable debug logging for specific modules
logging.getLogger('aiecs.tools.docs').setLevel(logging.DEBUG)
📚 More Resources
Complete API Documentation: docs/TOOLS_USED_INSTRUCTION/DOCUMENT_PARSER_TOOL.md
Example Code: examples/document_processing_example.py
Cloud Storage Examples: examples/cloud_storage_document_example.py
Tool Architecture Guide: docs/TOOLS_USED_INSTRUCTION/TOOL_SPECIAL_SPECIAL_INSTRUCTIONS.md
🎯 Summary
The document parsing component now:
✅ Ready to Use - Can be directly used in projects
✅ Clear Structure - Document tools live in their own docs directory
✅ Complete Features - Supports multiple document sources and AI processing modes
✅ High Performance - Async processing, intelligent caching, concurrency control
✅ Easy to Extend - Supports custom processing workflows and AI providers
Developers can use this component directly to build their own AI document processing applications.