# Tools Architecture

This directory contains tools that provide various functionalities for the application. The tools architecture has been refactored to separate business logic from performance optimization concerns and uses a layered architecture to organize different types of tools.

## Directory Structure

```
app/tools/
├── __init__.py              # Tool registry and discovery mechanism
├── base_tool.py             # Base tool class
├── temp_file_manager.py     # Temporary file management tool
├── README.md                # This document
├── task_tools/              # Task-oriented tools
│   ├── __init__.py
│   ├── chart_tool.py        # Chart and visualization tools
│   ├── classfire_tool.py    # Classification and categorization tools
│   ├── image_tool.py        # Image processing tools
│   ├── office_tool.py       # Office document processing tools
│   ├── pandas_tool.py       # Data analysis and processing tools
│   ├── report_tool.py       # Report generation tools
│   ├── research_tool.py     # Research and information gathering tools
│   ├── scraper_tool.py      # Web scraping tools
│   ├── search_api.py        # Search engine API integration tools
│   └── stats_tool.py        # Statistical analysis tools
├── general_tools/           # General tools (reserved)
├── rag_tools/               # RAG-related tools (reserved)
└── out_source/              # External integration tools (reserved)
```

## New Architecture

The new architecture includes the following components:

1. **Tool Executor** (`app/core/tool_executor.py`): A centralized execution framework that handles the following cross-cutting concerns:
   - Input validation
   - Caching
   - Concurrency
   - Error handling
   - Performance optimization
   - Logging
2. **Base Tool Class** (`app/tools/base_tool.py`): A base class that all tools should inherit from, providing:
   - Integration with the tool executor
   - Schema-based input validation
   - Standardized error handling
   - Automatic schema discovery
3. **Tool Registry** (`app/tools/__init__.py`): Handles tool registration and retrieval:
   - Tool registration
   - Tool retrieval
   - Automatic tool discovery
   - Layered module imports
4.
   **Layered Tool Organization**:
   - **task_tools**: Specialized task-oriented tools for specific business scenarios
   - **general_tools**: General tools providing basic functionality
   - **rag_tools**: RAG (Retrieval-Augmented Generation) related tools
   - **out_source**: External service integration tools

## Tool Categories

### Task Tools

Located in the `task_tools/` directory, containing tools specialized for specific tasks:

- **chart_tool**: Chart generation and data visualization
- **classfire_tool**: Data classification and categorization
- **image_tool**: Image processing and manipulation
- **office_tool**: Office document processing (Word, Excel, PowerPoint)
- **pandas_tool**: Data analysis and DataFrame operations
- **report_tool**: Report generation and formatting
- **research_tool**: Research and information gathering
- **scraper_tool**: Web data scraping
- **search_api**: Search engine API integration
- **stats_tool**: Statistical analysis and computation

## Using Base Tool Class

To create a new tool, inherit from the `BaseTool` class and implement your business logic methods:

```python
from typing import Dict, Any, Optional

from pydantic import BaseModel, Field

from aiecs.tools import register_tool
from aiecs.tools.base_tool import BaseTool


@register_tool("my_tool")
class MyTool(BaseTool):
    """My tool description"""

    # Define input schema for operations
    class OperationSchema(BaseModel):
        """Operation schema"""
        param1: str = Field(description="Parameter 1")
        param2: int = Field(description="Parameter 2")

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize tool"""
        super().__init__(config)
        # Additional initialization

    def operation(self, param1: str, param2: int) -> Dict[str, Any]:
        """
        Implement your business logic here

        Args:
            param1: Parameter 1
            param2: Parameter 2

        Returns:
            Operation result
        """
        # Your business logic
        return {"result": f"Processing {param1} and {param2}"}
```

## Using Decorators for Performance Optimization

The tool executor
provides several decorators that you can use to add performance optimizations to the methods of a tool class:

```python
from typing import Any, Dict

from aiecs.tools.base_tool import BaseTool
from aiecs.tools.tool_executor import cache_result, run_in_executor, measure_execution_time


class MyTool(BaseTool):
    @cache_result()  # Cache the result of this method
    def cached_operation(self, param1: str) -> Dict[str, Any]:
        # This result will be cached based on param1
        return {"result": f"Cached result {param1}"}

    @run_in_executor  # Run this method in a thread pool
    def cpu_intensive_operation(self, param1: str) -> Dict[str, Any]:
        # This method will be executed in a separate thread
        return {"result": f"CPU-intensive result {param1}"}

    @measure_execution_time  # Record the execution time of this method
    def monitored_operation(self, param1: str) -> Dict[str, Any]:
        # The execution time of this method will be recorded
        return {"result": f"Monitored result {param1}"}
```

## Migrating Existing Tools

To migrate existing tools to the new architecture:

1. Make your tool class inherit from `BaseTool`
2. Define Pydantic schemas for your operations
3. Remove any custom caching, validation, or error handling code
4. Use decorators for performance optimization
5.
   Update the `run` method to use the base class implementation

### Before:

```python
from aiecs.tools import register_tool


@register_tool("example")
class ExampleTool:
    def __init__(self):
        self._cache = {}

    def run(self, op: str, **kwargs):
        if op == "operation":
            return self.operation(**kwargs)
        else:
            raise ValueError(f"Unsupported operation: {op}")

    def operation(self, param1: str, param2: int):
        # Custom caching
        cache_key = f"{param1}_{param2}"
        if cache_key in self._cache:
            return self._cache[cache_key]

        # Custom validation
        if not isinstance(param1, str):
            raise ValueError("param1 must be a string")
        if not isinstance(param2, int):
            raise ValueError("param2 must be an integer")

        # Business logic
        result = {"result": f"Processing {param1} and {param2}"}

        # Cache result
        self._cache[cache_key] = result
        return result
```

### After:

```python
from typing import Dict, Any, Optional

from pydantic import BaseModel, Field

from aiecs.tools import register_tool
from aiecs.tools.base_tool import BaseTool
from aiecs.tools.tool_executor import cache_result


@register_tool("example")
class ExampleTool(BaseTool):
    """Example tool"""

    class OperationSchema(BaseModel):
        """Operation schema"""
        param1: str = Field(description="Parameter 1")
        param2: int = Field(description="Parameter 2")

    @cache_result()
    def operation(self, param1: str, param2: int) -> Dict[str, Any]:
        """
        Process parameters

        Args:
            param1: Parameter 1
            param2: Parameter 2

        Returns:
            Operation result
        """
        # Focus only on business logic
        return {"result": f"Processing {param1} and {param2}"}
```

## Benefits of the New Architecture

The new architecture provides several benefits:

1. **Separation of Concerns**: Business logic is separated from cross-cutting concerns like caching, validation, and error handling.
2. **Reduced Duplication**: Common functionality is implemented once in the tool executor and base tool, rather than being duplicated across individual tools.
3.
   **Consistent Behavior**: All tools behave consistently in terms of validation, error handling, and performance optimization.
4. **Improved Maintainability**: Tools are easier to maintain because they focus only on specific business logic.
5. **Enhanced Performance**: The tool executor provides optimized implementations of caching, concurrency, and other performance features.
6. **Better Testing**: Business logic can be tested independently of cross-cutting concerns.
7. **Easier Onboarding**: New developers can focus on implementing business logic without worrying about performance optimization details.

## Usage Examples

```python
# Get tool instance
from aiecs.tools import get_tool

# Get chart tool
chart_tool = get_tool("chart")

# Use tool
result = chart_tool.run(
    "visualize",
    file_path="data.csv",
    plot_type="histogram",
    x="age",
    title="Age Distribution"
)

# Or call method directly
result = chart_tool.visualize(
    file_path="data.csv",
    plot_type="histogram",
    x="age",
    title="Age Distribution"
)
```

## Multi-Task Service Integration

The tool system is fully integrated with the MultiTaskTools service in `app/services/multi_task/tools.py`:

```python
import asyncio

from aiecs.services.multi_task.tools import MultiTaskTools

# Initialize multi-task tools service
multi_tools = MultiTaskTools()

# Get all available tools
available_tools = multi_tools.get_available_tools()
print("Available tools:", available_tools)

# Get operations for a specific tool
chart_operations = multi_tools.get_available_operations("chart")
print("Chart tool operations:", chart_operations)

# Get operation details
operation_info = multi_tools.get_operation_info("chart.visualize")
print("Operation info:", operation_info)


# Execute tool operation (execute_tool is awaited, so call it from a coroutine)
async def main():
    return await multi_tools.execute_tool(
        "chart", "visualize",
        file_path="data.csv",
        plot_type="histogram",
        x="age"
    )

result = asyncio.run(main())
```

## Task Tool Usage Examples

### Data Processing Pipeline

```python
from aiecs.tools import get_tool

# 1. Data analysis tool
pandas_tool = get_tool("pandas")
df_result = pandas_tool.read_csv(file_path="data.csv")

# 2. Statistical analysis tool
stats_tool = get_tool("stats")
stats_result = stats_tool.descriptive_stats(data=df_result["data"])

# 3. Chart generation tool
chart_tool = get_tool("chart")
chart_result = chart_tool.visualize(
    data=df_result["data"],
    plot_type="histogram",
    x="age"
)

# 4. Report generation tool
report_tool = get_tool("report")
report_result = report_tool.generate_report(
    data=stats_result,
    charts=[chart_result],
    template="statistical_summary"
)
```

### Research and Information Gathering

```python
from aiecs.tools import get_tool

# Research tool
research_tool = get_tool("research")
research_result = research_tool.search_papers(
    query="machine learning",
    max_results=10
)

# Web scraping tool
scraper_tool = get_tool("scraper")
web_data = scraper_tool.scrape_url(
    url="https://example.com",
    selectors=["h1", "p"]
)

# Search API tool
search_tool = get_tool("search_api")
search_results = search_tool.web_search(
    query="artificial intelligence trends 2024",
    num_results=5
)
```

### Office Document Processing

```python
from aiecs.tools import get_tool

# Office tool
office_tool = get_tool("office")

# Process Excel file
excel_result = office_tool.read_excel(
    file_path="data.xlsx",
    sheet_name="Sheet1"
)

# Generate Word report
# (report_result and chart_result come from the data processing pipeline above)
word_result = office_tool.create_word_document(
    content=report_result["content"],
    template="business_report"
)

# Create PowerPoint presentation
ppt_result = office_tool.create_presentation(
    slides_data=chart_result["charts"],
    template="data_analysis"
)
```

## Tool Discovery and Registration

The system automatically discovers and registers all tools:

```python
from aiecs.tools import discover_tools, get_tool, list_tools

# List all registered tools
all_tools = list_tools()
print("Registered tools:", all_tools)

# Manually trigger tool discovery (usually not needed, system does this automatically)
discover_tools("aiecs.tools")

# View tools by category
task_tools = [tool for tool in all_tools if "task_tools" in str(type(get_tool(tool)))]
print("Task tools:", task_tools)
```

## Best Practices

### 1. Tool Composition

Combine multiple tools to complete complex tasks:

```python
from aiecs.tools import get_tool


def data_analysis_pipeline(csv_file: str):
    """Complete data analysis pipeline"""
    # Data loading and cleaning
    pandas_tool = get_tool("pandas")
    data = pandas_tool.read_csv(csv_file)
    cleaned_data = pandas_tool.clean_data(data["data"])

    # Statistical analysis
    stats_tool = get_tool("stats")
    statistics = stats_tool.comprehensive_analysis(cleaned_data["data"])

    # Visualization
    chart_tool = get_tool("chart")
    charts = chart_tool.create_dashboard(
        data=cleaned_data["data"],
        chart_types=["histogram", "boxplot", "correlation"]
    )

    # Generate report
    report_tool = get_tool("report")
    final_report = report_tool.generate_comprehensive_report(
        data=statistics,
        visualizations=charts,
        template="data_analysis"
    )
    return final_report
```

### 2. Error Handling

Use appropriate error handling:

```python
from aiecs.tools import get_tool
from aiecs.tools.tool_executor import ToolExecutionError

try:
    tool = get_tool("pandas")
    result = tool.read_csv("nonexistent.csv")
except ToolExecutionError as e:
    print(f"Tool execution error: {e}")
except ValueError as e:
    print(f"Tool does not exist: {e}")
```

### 3. Asynchronous Operations

Use asynchronous execution for time-consuming operations:

```python
import asyncio

from aiecs.services.multi_task.tools import MultiTaskTools


async def async_data_processing():
    multi_tools = MultiTaskTools()

    # Execute multiple operations in parallel
    tasks = [
        multi_tools.execute_tool("scraper", "scrape_url", url="https://site1.com"),
        multi_tools.execute_tool("scraper", "scrape_url", url="https://site2.com"),
        multi_tools.execute_tool("research", "search_papers", query="AI")
    ]
    results = await asyncio.gather(*tasks)
    return results
```

## Extending the Tool System

### Adding New Task Tools

1. Create a new tool file in the `task_tools/` directory
2. Inherit from the `BaseTool` class
3.
   Register using the `@register_tool` decorator
4. Add import in `task_tools/__init__.py`

```python
# task_tools/my_new_tool.py
from aiecs.tools import register_tool
from aiecs.tools.base_tool import BaseTool


@register_tool("my_new_tool")
class MyNewTool(BaseTool):
    """New tool description"""

    def my_operation(self, param: str) -> dict:
        """Operation description"""
        return {"result": f"Processing {param}"}
```

### Creating New Tool Categories

1. Create a new directory under `app/tools/`
2. Add an `__init__.py` file
3. Add import in the main `__init__.py`
4. Tools will be automatically discovered and registered

## Special Tool Usage Instructions

### Image Tool

The Image Tool provides comprehensive image processing capabilities, including loading, OCR text recognition, metadata extraction, resizing, and filter application.

#### System Dependency Requirements

**Important**: The Image Tool requires the Tesseract OCR engine and the system libraries that Pillow depends on to be installed at the system level.

#### 1. Tesseract OCR Engine

**Ubuntu/Debian systems**:

```bash
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
```

**macOS systems**:

```bash
brew install tesseract
```

**Verify installation**:

```bash
tesseract --version
```

#### 2. Pillow Image Processing Library System Dependencies

**Ubuntu/Debian systems**:

```bash
# Basic image processing libraries
sudo apt-get install libjpeg-dev zlib1g-dev libpng-dev libtiff-dev libwebp-dev libopenjp2-7-dev

# Complete image processing libraries (recommended)
sudo apt-get install libimageio-dev libfreetype6-dev liblcms2-dev libtiff5-dev libjpeg8-dev libopenjp2-7-dev libwebp-dev libharfbuzz-dev libfribidi-dev libxcb1-dev
```

**macOS systems**:

```bash
brew install libjpeg zlib libpng libtiff webp openjpeg freetype lcms2
```

**Verify installation**:

```bash
python -c "import PIL; print('Pillow version:', PIL.__version__)"
```

#### 3. Multi-language OCR Support

**Install additional language packs**:

```bash
# Ubuntu/Debian systems
sudo apt-get install tesseract-ocr-chi-sim  # Simplified Chinese
sudo apt-get install tesseract-ocr-chi-tra  # Traditional Chinese
sudo apt-get install tesseract-ocr-fra      # French
sudo apt-get install tesseract-ocr-deu      # German
sudo apt-get install tesseract-ocr-jpn      # Japanese
sudo apt-get install tesseract-ocr-kor      # Korean
sudo apt-get install tesseract-ocr-rus      # Russian
sudo apt-get install tesseract-ocr-spa      # Spanish
```

**View installed language packs**:

```bash
tesseract --list-langs
```

**Using multi-language OCR**:

```python
# `tool` is an ImageTool instance (see the usage examples for this tool)

# English OCR
text = tool.ocr("/path/to/image.jpg", lang='eng')

# Chinese OCR
text = tool.ocr("/path/to/image.jpg", lang='chi_sim')

# Japanese OCR
text = tool.ocr("/path/to/image.jpg", lang='jpn')
```

#### Features

1. **Image Loading**: Supports multiple formats (JPG, PNG, BMP, TIFF, GIF)
2. **OCR Text Recognition**: Text extraction based on Tesseract engine
3. **Metadata Extraction**: Get image dimensions, mode, and EXIF information
4. **Image Resizing**: High-quality resizing
5.
   **Filter Effects**: Blur, sharpen, edge enhancement, and other effects

#### Usage Examples

```python
from aiecs.tools.task_tools.image_tool import ImageTool

# Initialize tool
tool = ImageTool()

# Load image information
result = tool.load("/path/to/image.jpg")
print(f"Size: {result['size']}, Mode: {result['mode']}")

# OCR text recognition
text = tool.ocr("/path/to/image.png", lang='eng')
print(f"Recognized text: {text}")

# Extract metadata
metadata = tool.metadata("/path/to/image.jpg", include_exif=True)
print(f"EXIF info: {metadata.get('exif', {})}")

# Resize image
tool.resize("/path/to/input.jpg", "/path/to/output.jpg", 800, 600)

# Apply filter
tool.filter("/path/to/input.jpg", "/path/to/blurred.jpg", "blur")
```

#### Security Features

- File extension whitelist validation
- File size limits (default 50MB)
- Path normalization and security checks
- Complete error handling and logging

### ClassFire Tool (Text Classification and Keyword Extraction Tool)

The ClassFire Tool provides text classification, keyword extraction, and text summarization capabilities, supporting both English and Chinese text processing.

#### Model Dependency Requirements

**Important**: The ClassFire Tool requires downloading and installing the following models to function properly.

#### 1. spaCy Model Dependencies

**Models Used**:

- **English Model**: `en_core_web_sm` - Used for part-of-speech tagging, named entity recognition, and keyword extraction for English text
- **Chinese Model**: `zh_core_web_sm` - Used for part-of-speech tagging, named entity recognition, and keyword extraction for Chinese text

**Installation Method**:

```bash
# Install using Poetry environment
poetry run python -m spacy download en_core_web_sm
poetry run python -m spacy download zh_core_web_sm

# Or install using pip
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl
pip install https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.7.0/zh_core_web_sm-3.7.0-py3-none-any.whl
```

**Usage Reasons**:

- **Part-of-Speech Tagging**: Identifies nouns, verbs, adjectives, etc., for keyword extraction
- **Named Entity Recognition**: Identifies entities like person names, place names, organization names, improving keyword quality
- **Language Detection**: Automatically detects text language and selects appropriate processing strategy
- **Text Preprocessing**: Provides standardized text processing pipeline

#### 2. Transformers Model Dependencies

**Models Used**:

- **English Summarization Model**: `facebook/bart-large-cnn` - Used for English text summarization
- **Multilingual Summarization Model**: `t5-base` - Used for Chinese text summarization

**Model Download**:

```bash
# Models will be automatically downloaded to ~/.cache/huggingface/hub/ on first use
# No manual installation needed, but ensure network connection is available
```

**Installation Verification**:

```python
from transformers import pipeline

# Test English summarization model
summarizer_en = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer_en("Your text here...", max_length=100, min_length=30)

# Test multilingual summarization model
summarizer_zh = pipeline("summarization", model="t5-base")
result = summarizer_zh("您的中文文本...", max_new_tokens=50, min_new_tokens=10)
```

**Usage Reasons**:

- **High-Quality Summarization**: BART and T5 are widely used summarization models
- **Multilingual Support**: `t5-base` is the model used for Chinese text; note that the original T5 checkpoints were pre-trained mainly on English data, so summary quality for Chinese may be limited
- **Configurable Length**: Supports custom summary length and minimum length
- **Asynchronous Processing**: Supports asynchronous calls, improving processing efficiency

#### 3. NLTK Data Package Dependencies

**Required Data Packages**:

- `stopwords` - Stopword data for keyword filtering
- `punkt` - Sentence tokenizer for text preprocessing
- `wordnet` - Lexical database for word similarity calculation
- `averaged_perceptron_tagger` - Part-of-speech tagger

**Automatic Download**:

```bash
# Use the provided script to automatically download all NLP data
poetry run python aiecs/scripts/download_nlp_data.py
```

#### Features

1. **Text Classification**: Text classification based on pre-trained models
2. **Keyword Extraction**: Supports RAKE (English) and spaCy (English/Chinese) keyword extraction
3. **Text Summarization**: Supports English and Chinese text summarization
4. **Language Detection**: Automatically detects text language
5.
   **Asynchronous Processing**: Supports asynchronous calls, improving performance

#### Usage Examples

```python
import asyncio

from aiecs.tools.task_tools.classfire_tool import ClassifierTool


async def main():
    # Initialize tool
    tool = ClassifierTool()

    # Text classification
    result = await tool.classify("This is a positive review about the product.")
    print(f"Classification result: {result}")

    # Keyword extraction
    keywords = await tool.extract_keywords("Natural language processing is important.", top_k=5)
    print(f"Keywords: {keywords}")

    # Text summarization
    summary = await tool.summarize("Your long text here...", max_length=100)
    print(f"Summary: {summary}")

    # Chinese processing
    chinese_keywords = await tool.extract_keywords("自然语言处理是人工智能的重要领域。", top_k=3)
    print(f"Chinese keywords: {chinese_keywords}")


asyncio.run(main())
```

#### Performance Optimization

- **Model Caching**: Models are cached after first load, improving subsequent call speed
- **Asynchronous Processing**: All main features support asynchronous calls
- **Memory Management**: Supports model unloading and reloading to save memory
- **Error Handling**: Comprehensive error handling and fallback mechanisms

#### Notes

- **First Use**: Models will be automatically downloaded on first use, which may take some time
- **Network Requirements**: Network connection required to download Transformers models
- **Memory Requirements**: Loading the models requires additional memory
- **Language Support**: Currently mainly supports English and Chinese, limited support for other languages

### Office Tool (Office Document Processing Tool)

The Office Tool provides comprehensive office document processing capabilities, supporting reading, writing, and conversion of various document formats, including Word, PowerPoint, Excel, PDF, and image files.

#### System Dependency Requirements

**Important**: The Office Tool requires Java Runtime Environment and Tesseract OCR engine to function properly.

#### 1. Java Runtime Environment (Required)

**Purpose**: Apache Tika document parsing library requires Java Runtime Environment.

**Ubuntu/Debian systems**:

```bash
# Install OpenJDK 11 (recommended)
sudo apt-get update
sudo apt-get install openjdk-11-jdk

# Or install OpenJDK 17
sudo apt-get install openjdk-17-jdk

# Verify installation
java -version
javac -version
```

**macOS systems**:

```bash
# Install using Homebrew
brew install openjdk@11

# Or install OpenJDK 17
brew install openjdk@17
```

**Environment Variable Setup**:

```bash
# Set JAVA_HOME environment variable
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# Or for OpenJDK 17
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

# Add to ~/.bashrc or ~/.zshrc
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
```

#### 2. Tesseract OCR Engine (Required for OCR Functionality)

**Purpose**: Text recognition functionality in image files.

**Ubuntu/Debian systems**:

```bash
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng

# Chinese OCR support
sudo apt-get install tesseract-ocr-chi-sim  # Simplified Chinese
sudo apt-get install tesseract-ocr-chi-tra  # Traditional Chinese
```

**macOS systems**:

```bash
brew install tesseract
```

**Verify installation**:

```bash
tesseract --version
tesseract --list-langs
```

#### 3. Python Package Dependencies

**Core Document Processing Libraries**:

- **pandas** (>=2.2.3) - Excel file data processing
- **openpyxl** (>=3.1.5) - Excel file read/write
- **python-docx** (>=1.1.2) - Word document processing
- **python-pptx** (>=1.0.2) - PowerPoint document processing
- **pdfplumber** (>=0.11.7) - PDF text extraction

**Content Parsing Libraries**:

- **tika** (>=3.2.2) - Universal document parsing (requires Java 11+)
- **pytesseract** (>=0.3.13) - OCR text recognition
- **Pillow** (>=11.2.1) - Image processing

#### Features

1. **Document Reading**: Supports DOCX, PPTX, XLSX, PDF formats
2.
   **Document Writing**: Create and edit Word, PowerPoint, Excel documents
3. **Text Extraction**: Extract text content from various document formats
4. **OCR Functionality**: Recognize text from image files
5. **Multi-format Support**: Process legacy Office documents and other formats

#### Usage Examples

```python
from aiecs.tools.task_tools.office_tool import OfficeTool

# Initialize tool
tool = OfficeTool()

# Read Word document
docx_content = tool.read_docx("/path/to/document.docx")
print(f"Document content: {docx_content['text']}")

# Read Excel file
xlsx_data = tool.read_xlsx("/path/to/spreadsheet.xlsx")
print(f"Spreadsheet data: {xlsx_data}")

# Extract text (supports multiple formats)
text = tool.extract_text("/path/to/document.pdf")
print(f"Extracted text: {text}")

# Create Word document
tool.write_docx("Hello World!", "/path/to/output.docx")

# Create PowerPoint presentation
slides = ["Title Slide", "Content Page 1", "Content Page 2"]
tool.write_pptx(slides, "/path/to/presentation.pptx")
```

#### OCR Functionality

**Supported Image Formats**:

- PNG, JPG, JPEG, TIFF, BMP, GIF

**Language Support**:

- English (eng)
- Simplified Chinese (chi_sim)
- Traditional Chinese (chi_tra)

**Usage Example**:

```python
# Extract text from image
image_text = tool.extract_text("/path/to/image.png")
print(f"Recognized text: {image_text}")
```

#### Performance Optimization

- **Tika Caching**: Tika JAR file will be downloaded and cached on first use
- **Memory Management**: Pay attention to memory usage when processing large files
- **Concurrency Limits**: Recommend limiting the number of documents processed simultaneously
- **Error Handling**: Comprehensive error handling and fallback mechanisms

#### Notes

- **Java Version**: Requires Java 11 or higher (Tika 3.x requirement)
- **Memory Requirements**: Tika requires sufficient memory when processing large files
- **File Size**: Default maximum file size is 100MB
- **Encoding Issues**: Some documents may have encoding issues
- **OCR
  Accuracy**: Image quality affects OCR recognition accuracy

### Stats Tool (Statistical Analysis Tool)

The Stats Tool provides comprehensive statistical analysis capabilities, supporting various statistical tests, data preprocessing, regression analysis, time series analysis, and other advanced statistical functions.

#### System Dependency Requirements

**Important**: The Stats Tool requires system-level C libraries to support reading special file formats, particularly SAS, SPSS, and Stata files.

#### 1. pyreadstat System Dependencies (Special File Format Support)

**Purpose**: Read and write SAS, SPSS, Stata files (.sav, .sas7bdat, .por formats)

**Ubuntu/Debian systems**:

```bash
# Install libreadstat development library
sudo apt-get update
sudo apt-get install libreadstat-dev

# Install build tools (if not already installed)
sudo apt-get install build-essential python3-dev

# Reinstall pyreadstat
pip install --no-cache-dir --force-reinstall pyreadstat
```

**macOS systems**:

```bash
# Install using Homebrew
brew install readstat

# Reinstall pyreadstat
pip install --no-cache-dir --force-reinstall pyreadstat
```

**CentOS/RHEL systems**:

```bash
# Install development tools
sudo yum groupinstall "Development Tools"
sudo yum install python3-devel

# Install readstat library (may need to compile from source)
# Or use conda to install
conda install -c conda-forge readstat
```

**Verify installation**:

```python
# No data file is needed here; a successful import is the check
try:
    import pyreadstat
    print("pyreadstat version:", pyreadstat.__version__)
    print("pyreadstat installed successfully")
except ImportError as e:
    print("pyreadstat installation failed:", e)
```

#### 2. Excel File Support System Dependencies

**Purpose**: Read and write Excel files (.xlsx, .xls formats)

**Ubuntu/Debian systems**:

```bash
# Install system libraries required by openpyxl
sudo apt-get install libxml2-dev libxslt1-dev

# Verify installation
python -c "import openpyxl; print('openpyxl available')"
```

**macOS systems**:

```bash
# Usually no additional installation is needed because the system already includes the required libraries.
# If they are missing, install them with Homebrew:
brew install libxml2 libxslt
```

#### 3. Python Package Dependencies

**Core Statistical Libraries**:

- **pandas** (>=2.2.3) - Data processing and analysis
- **numpy** (>=2.2.6) - Numerical computation
- **scipy** (>=1.15.3) - Scientific computing and statistical functions
- **scikit-learn** (>=1.5.0) - Machine learning library (data preprocessing)
- **statsmodels** (>=0.14.4) - Statistical models and tests

**Special File Format Support**:

- **pyreadstat** (>=1.2.9) - SAS, SPSS, Stata file support
- **openpyxl** (>=3.1.5) - Excel file support

**Configuration Management**:

- **pydantic** (>=2.11.5) - Data validation
- **pydantic-settings** (>=2.9.1) - Settings management

#### Features

1. **Descriptive Statistics**: Basic statistics, skewness, kurtosis, percentiles
2. **Hypothesis Testing**: t-tests, chi-square tests, ANOVA, non-parametric tests
3. **Correlation Analysis**: Pearson, Spearman, Kendall correlation coefficients
4. **Regression Analysis**: OLS, Logit, Probit, Poisson regression
5. **Time Series**: ARIMA, SARIMA models and forecasting
6. **Data Preprocessing**: Standardization, missing value handling, data cleaning
7.
   **Multi-format Support**: CSV, Excel, JSON, Parquet, Feather, SAS, SPSS, Stata

#### Supported File Formats

| Format | Extension | Dependency Library | System Requirements |
|--------|-----------|-------------------|---------------------|
| **CSV** | `.csv` | pandas | None |
| **Excel** | `.xlsx`, `.xls` | openpyxl | libxml2, libxslt |
| **JSON** | `.json` | pandas | None |
| **Parquet** | `.parquet` | pandas | None |
| **Feather** | `.feather` | pandas | None |
| **SPSS** | `.sav`, `.por` | pyreadstat | libreadstat |
| **SAS** | `.sas7bdat` | pyreadstat | libreadstat |

#### Usage Examples

```python
from aiecs.tools import get_tool

# Get statistics tool
stats_tool = get_tool("stats")

# Read data
data_info = stats_tool.read_data("data.sav")  # SPSS file
print(f"Number of variables: {len(data_info['variables'])}")
print(f"Number of observations: {data_info['observations']}")

# Descriptive statistics
desc_stats = stats_tool.describe(
    file_path="data.sav",
    variables=["age", "income", "education"],
    include_percentiles=True,
    percentiles=[0.1, 0.9]
)

# t-test
ttest_result = stats_tool.ttest(
    file_path="data.sav",
    var1="group1_score",
    var2="group2_score",
    equal_var=True
)

# Correlation analysis
correlation = stats_tool.correlation(
    file_path="data.sav",
    variables=["var1", "var2", "var3"],
    method="pearson"
)

# Regression analysis
regression = stats_tool.regression(
    file_path="data.sav",
    formula="y ~ x1 + x2 + x3",
    regression_type="ols"
)

# Data preprocessing
preprocessed = stats_tool.preprocess(
    file_path="data.sav",
    variables=["var1", "var2"],
    operation="scale",
    scaler_type="standard"
)
```

#### Environment Variable Configuration

Stats Tool can be configured via the following environment variables:

```bash
# Maximum file size limit (MB)
export STATS_TOOL_MAX_FILE_SIZE_MB=200

# Allowed file extensions
export STATS_TOOL_ALLOWED_EXTENSIONS=".sav,.sas7bdat,.por,.csv,.xlsx,.xls,.json,.parquet,.feather"
```

#### Troubleshooting

#### pyreadstat Installation Issues

**Problem**:
`ImportError: No module named 'pyreadstat'` or compilation errors

**Solution**:

```bash
# 1. Install system dependencies
sudo apt-get install libreadstat-dev build-essential python3-dev

# 2. Reinstall
pip uninstall -y pyreadstat
pip install --no-cache-dir pyreadstat

# 3. Verify installation
python -c "import pyreadstat; print('Success')"
```

**Problem**: `OSError: libreadstat.so: cannot open shared object file`

**Solution**:

```bash
# Check library file location
ldconfig -p | grep readstat

# If not found, reinstall system library
sudo apt-get install --reinstall libreadstat0
```

#### Excel File Reading Issues

**Problem**: `ImportError: No module named 'openpyxl'`

**Solution**:

```bash
# Install openpyxl
pip install openpyxl

# Install system dependencies
sudo apt-get install libxml2-dev libxslt1-dev
```

#### Memory Usage Issues

**Problem**: Insufficient memory when processing large files

**Solution**:

```python
# Use nrows parameter to limit number of rows read
data_info = stats_tool.read_data("large_file.csv", nrows=10000)

# Alternatively, raise the file size limit by setting the environment variable
# in the shell before starting the application:
#   export STATS_TOOL_MAX_FILE_SIZE_MB=500
```

#### File Permission Issues

**Problem**: Cannot read file

**Solution**:

```bash
# Check file permissions
ls -la data.sav

# Modify permissions
chmod 644 data.sav

# Check file path
python -c "import os; print(os.path.exists('data.sav'))"
```

### Report Tool (Multi-format Report Generation Tool)

The Report Tool provides comprehensive report generation capabilities, supporting HTML, Excel, PowerPoint, Word, Markdown, image, and PDF format report generation.

#### System Dependency Requirements

**Important**: Some features of the Report Tool require system-level graphics libraries and font libraries.

#### 1. WeasyPrint System Dependencies (Required for PDF Functionality)

**Purpose**: HTML to PDF functionality requires WeasyPrint system-level dependencies.
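
A quick way to see whether native libraries of this kind are already visible to the dynamic loader is to look them up from Python. The sketch below is only illustrative: the helper `missing_system_libs` and the library names in `ASSUMED_LIBS` are assumptions chosen for this example (they are not part of the Report Tool), and a missing entry only suggests that the corresponding system package may still need to be installed.

```python
from ctypes.util import find_library

# Assumed library names for illustration; adjust them to your platform.
ASSUMED_LIBS = ("cairo", "pango-1.0", "gdk_pixbuf-2.0", "ffi")


def missing_system_libs(names=ASSUMED_LIBS, finder=find_library):
    """Return the names for which the loader lookup finds nothing."""
    return [name for name in names if finder(name) is None]


missing = missing_system_libs()
if missing:
    print("Not found by the dynamic loader:", ", ".join(missing))
else:
    print("All assumed libraries were found")
```

The `finder` parameter exists only so the lookup can be swapped out in tests; in normal use the default `ctypes.util.find_library` is enough.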
**Ubuntu/Debian systems**:

```bash
# Install system libraries required by WeasyPrint
sudo apt-get update
sudo apt-get install libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev libffi-dev shared-mime-info

# Complete installation (recommended)
sudo apt-get install libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev libffi-dev shared-mime-info libxml2-dev libxslt1-dev
```

**macOS systems**:

```bash
# Install using Homebrew
brew install cairo pango gdk-pixbuf libffi
```

**Verify installation**:

```bash
# Check system libraries
pkg-config --modversion cairo
pkg-config --modversion pango
```

#### 2. Matplotlib System Dependencies (Required for Chart Functionality)

**Purpose**: Chart generation requires font and image processing libraries.

**Ubuntu/Debian systems**:

```bash
# Install system libraries required by Matplotlib
sudo apt-get install libfreetype6-dev libpng-dev libjpeg-dev libtiff-dev libwebp-dev

# Chinese font support
sudo apt-get install fonts-wqy-zenhei fonts-wqy-microhei
```

**macOS systems**:

```bash
# Install using Homebrew
brew install freetype libpng libjpeg libtiff webp
```

**Verify installation**:

```bash
python -c "import matplotlib.pyplot as plt; plt.figure(); print('Matplotlib working')"
```

#### 3. Python Package Dependencies

**Core Report Generation Libraries**:

- **jinja2** (>=3.1.6) - Template engine
- **weasyprint** (>=65.1) - HTML to PDF
- **matplotlib** (>=3.10.3) - Chart generation
- **bleach** (>=6.2.0) - HTML sanitization
- **markdown** (>=3.8) - Markdown processing

**Document Processing Libraries**:

- **pandas** (>=2.2.3) - Data processing
- **openpyxl** (>=3.1.5) - Excel file processing
- **python-docx** (>=1.1.2) - Word document processing
- **python-pptx** (>=1.0.2) - PowerPoint document processing

#### Features

1. **HTML Reports**: Generated using the Jinja2 template engine ✅
2. **PDF Reports**: HTML to PDF conversion using WeasyPrint ⚠️ **Temporarily Disabled**
3. **Excel Reports**: Multi-sheet Excel file generation ✅
4. **PowerPoint Reports**: Custom slide presentations ✅
5. **Word Reports**: Styled Word documents ✅
6. **Markdown Reports**: Markdown format reports ✅
7. **Image Reports**: Chart generation using Matplotlib ✅

#### Usage Examples

```python
from aiecs.tools.task_tools.report_tool import ReportTool

# Initialize tool
tool = ReportTool()

# `data`, `df`, `summary_df`, and `chart_data` are assumed to be defined by the caller

# Generate HTML report
html_result = tool.generate_html(
    template_path="report_template.html",
    context={"title": "Monthly Report", "data": data},
    output_path="/path/to/report.html"
)

# Generate PDF report (⚠️ Temporarily disabled - requires system dependencies and code modification)
# pdf_result = tool.generate_pdf(
#     html=html_content,
#     output_path="/path/to/report.pdf",
#     page_size="A4"
# )

# Generate Excel report
excel_result = tool.generate_excel(
    sheets={"Data": df, "Summary": summary_df},
    output_path="/path/to/report.xlsx"
)

# Generate chart
chart_result = tool.generate_image(
    chart_type="bar",
    data=chart_data,
    output_path="/path/to/chart.png",
    title="Sales Data"
)
```

#### PDF Functionality Notes

**⚠️ Important Notice**: HTML to PDF functionality is **temporarily disabled** due to WeasyPrint system dependency issues.

**Current Status**:

- PDF generation is completely unavailable
- Calling `generate_pdf()` raises an error
- Enabling it requires manually installing system dependencies and modifying the code

**Reason for Disabling**:

- The system-level graphics libraries required by WeasyPrint are missing
- Deployment environment complexity makes dependency installation difficult
- Disabling PDF output keeps the other features stable

**How to Enable**:

1. **Install WeasyPrint system dependencies**:
   ```bash
   sudo apt-get install libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev libffi-dev shared-mime-info
   ```
2. **Modify code**:
   - Uncomment the `from weasyprint import HTML` import statement
   - Uncomment the implementation code in the `generate_pdf` method
   - Remove the statements that raise the error
3. **Verify installation**: Ensure all system libraries are correctly installed

**Supported Features** (after enabling):

- HTML to PDF conversion
- Custom page sizes
- CSS style support
- Template variable substitution

**Alternatives**:

- Use `generate_html()` to generate HTML reports
- Print to PDF manually from a browser
- Use other PDF generation tools

#### Performance Optimization

- **Template Caching**: Jinja2 templates are automatically cached
- **Temporary File Management**: Automatic cleanup of temporary files
- **Batch Generation**: Supports parallel generation of multiple reports
- **Memory Management**: Optimized for large file processing

#### Notes

- **WeasyPrint Dependencies**: PDF functionality requires complete system library support
- **Font Support**: Chart generation requires system font libraries
- **Template Security**: HTML content is sanitized automatically to prevent XSS attacks
- **File Size**: Watch memory usage when processing large files
- **Concurrency Limits**: Limit the number of reports generated simultaneously

---

## Scraper Tool (Web Scraping Tool)

### Feature Overview

The Scraper Tool is a web scraping tool that supports multiple HTTP clients, JavaScript rendering, HTML parsing, and advanced crawling.

**Main Features**:

- **HTTP Requests**: Supports httpx, urllib, and other clients
- **JavaScript Rendering**: Uses Playwright for dynamic content scraping
- **HTML Parsing**: Uses BeautifulSoup and lxml for content parsing
- **Advanced Crawling**: Integrates Scrapy for complex crawling projects
- **Multi-format Output**: Supports text, JSON, HTML, Markdown, CSV output

### Special Dependency Instructions

#### 1. Playwright Browser Dependencies

**Purpose**: JavaScript rendering functionality (`render()` method)

**Dependency Contents**:

- **Python Package**: `playwright` (already installed)
- **Browser Binaries**: Chromium, Firefox, WebKit
- **System Dependencies**: System libraries required for browser operation

**Installation Steps**:

1. **Download Browsers**:
   ```bash
   cd /home/coder1/python-middleware-dev
   poetry run playwright install
   ```
2. **Install System Dependencies**:
   ```bash
   # Method 1: Use Playwright automatic installation
   poetry run playwright install-deps

   # Method 2: Manual installation (Ubuntu/Debian)
   sudo apt-get install libatk1.0-0 \
       libatk-bridge2.0-0 \
       libcups2 \
       libxkbcommon0 \
       libatspi2.0-0 \
       libxcomposite1 \
       libxdamage1 \
       libxfixes3 \
       libxrandr2 \
       libgbm1 \
       libasound2

   # Method 3: Root account installation (recommended)
   # Temporarily install the playwright Python package
   pip install playwright
   # Run the playwright command to install system dependencies
   python -m playwright install-deps
   # Once the dependencies are installed, uninstall the temporary package to keep the root environment clean
   pip uninstall playwright -y
   ```

**Browser Storage Location**:

- **Path**: `~/.cache/ms-playwright/`
- **Size**: Approximately 400-500MB (all browsers)
- **Contains**: Chromium, Firefox, WebKit, FFMPEG

**Feature Support**:

- **Page Rendering**: Waits for JavaScript execution to complete
- **Element Waiting**: Waits for specific CSS selectors
- **Page Scrolling**: Scrolls to bottom of page
- **Screenshot Functionality**: Saves page screenshots
- **Multi-browser**: Supports Chromium, Firefox, WebKit

#### 2. Scrapy Advanced Crawling Dependencies

**Purpose**: Advanced crawling functionality (`crawl_scrapy()` method)

**Dependency Contents**:

- **Python Package**: `scrapy` (needs to be installed)
- **Project Structure**: Requires a complete Scrapy project

**Installation Steps**:

```bash
cd /home/coder1/python-middleware-dev
poetry add scrapy
```

**Feature Support**:

- **Project-based Crawling**: Supports complete Scrapy project structure
- **Data Pipelines**: Data cleaning, deduplication, storage
- **Middlewares**: Request/response processing
- **Scheduler**: Intelligent request scheduling
- **Monitoring**: Detailed logging and statistics

#### 3. Other Dependencies

**Python Package Dependencies**:

- **httpx**: Asynchronous HTTP client
- **beautifulsoup4**: HTML/XML parsing
- **lxml**: Fast XML and HTML processing

**System Dependencies**:

- **Network Connection**: Needed to download browsers and access target websites
- **Memory**: Browser operation requires sufficient memory
- **Disk Space**: Browser files take approximately 500MB

### Usage Examples

#### Basic HTTP Requests (No Browser Required)

```python
import asyncio

from aiecs.tools.scraper_tool import ScraperTool

async def main():
    scraper = ScraperTool()

    # Use fetch method for HTTP requests
    result = await scraper.fetch("https://example.com")

    # Access content
    html_content = result.get("content", "")

asyncio.run(main())
```

#### JavaScript Rendering (Requires Playwright)

```python
# Need to install Playwright browsers first
# Run inside an async function, with `scraper` created as in the previous example
result = await scraper.render(
    url="https://spa-app.com",
    wait_time=5,
    screenshot=True
)
```

#### Advanced Crawling (Requires Scrapy)

```python
# Need to install Scrapy first
result = scraper.crawl_scrapy(
    project_path="/path/to/scrapy/project",
    spider_name="my_spider",
    output_path="output.json"
)
```

### Feature Classification

| Feature Type | Method Name | Requires Browser | Requires Scrapy | Dependencies |
|--------------|-------------|------------------|-----------------|--------------|
| **Basic HTTP** | `get_httpx()` | ❌ Not required | ❌ Not required | httpx |
| **Basic HTTP** | `get_urllib()` | ❌ Not required | ❌ Not required | urllib |
| **HTML Parsing** | `parse_html()` | ❌ Not required | ❌ Not required | BeautifulSoup |
| **JavaScript Rendering** | `render()` | ✅ Required | ❌ Not required | Playwright + browsers |
| **Advanced Crawling** | `crawl_scrapy()` | ❌ Not required | ✅ Required | Scrapy |

### Notes

#### Playwright Related

- **Browser Download**: First use requires downloading browsers (approximately 500MB)
- **System Dependencies**: Requires installation of system-level graphics libraries
- **Memory Usage**: Browser operation requires sufficient memory
- **Network Requirements**: Network connection required to download browsers

#### Scrapy Related

- **Project Structure**: Requires a complete Scrapy project directory
- **Spider Definition**: Requires pre-defined crawling logic
- **Output Format**: Supports multiple output formats (JSON, CSV, XML)

#### General Notes

- **Network Limits**: Comply with each website's robots.txt and access frequency limits
- **Legal Compliance**: Ensure scraping complies with relevant laws and regulations
- **Resource Management**: Keep the number of concurrent requests under control
- **Error Handling**: Implement appropriate retry and error handling mechanisms

### Troubleshooting

#### Playwright Issues

```bash
# Check if browsers are installed
poetry run playwright install --list

# Reinstall browsers
poetry run playwright install --force

# Check system dependencies
poetry run playwright install-deps
```

#### Scrapy Issues

```bash
# Check if Scrapy is installed
poetry run scrapy --version

# Create test project
poetry run scrapy startproject test_project
```

#### Network Issues

- **Proxy Settings**: Configure HTTP proxy
- **Timeout Settings**: Adjust request timeout duration
- **Retry Mechanism**: Implement automatic retry logic
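
The timeout and retry suggestions above can be sketched as a small wrapper around any async fetch call. This is a minimal illustration using only the standard library, not part of the Scraper Tool: `fetch_with_retry` and its parameters (`attempts`, `timeout`, `base_delay`, `retry_on`) are hypothetical names, and the set of exceptions worth retrying depends on the HTTP client in use.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Tuple, Type


async def fetch_with_retry(
    fetch: Callable[[str], Awaitable[Any]],
    url: str,
    attempts: int = 3,
    timeout: float = 10.0,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (asyncio.TimeoutError, OSError),
) -> Any:
    """Call `fetch(url)` with a per-attempt timeout and exponential backoff.

    Hypothetical helper: `fetch` is any coroutine function that takes a URL.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            # Abort a single attempt if it runs longer than `timeout` seconds
            return await asyncio.wait_for(fetch(url), timeout=timeout)
        except retry_on as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
                await asyncio.sleep(base_delay * 2 ** attempt)

    raise RuntimeError(f"{url} failed after {attempts} attempts") from last_error
```

Assuming `scraper.fetch` behaves as in the usage example earlier, it could be passed in as `await fetch_with_retry(scraper.fetch, "https://example.com")`. Errors outside `retry_on` propagate immediately, so client-specific exception types would need to be added to that tuple.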