Tools Architecture

This directory contains tools that provide various functionalities for the application. The tools architecture has been refactored to separate business logic from performance optimization concerns and uses a layered architecture to organize different types of tools.

Directory Structure

app/tools/
├── __init__.py              # Tool registry and discovery mechanism
├── base_tool.py            # Base tool class
├── temp_file_manager.py    # Temporary file management tool
├── README.md               # This document
├── task_tools/             # Task-oriented tools
│   ├── __init__.py
│   ├── chart_tool.py       # Chart and visualization tools
│   ├── classfire_tool.py   # Classification and categorization tools
│   ├── image_tool.py       # Image processing tools
│   ├── office_tool.py      # Office document processing tools
│   ├── pandas_tool.py      # Data analysis and processing tools
│   ├── report_tool.py      # Report generation tools
│   ├── research_tool.py    # Research and information gathering tools
│   ├── scraper_tool.py     # Web scraping tools
│   ├── search_api.py       # Search engine API integration tools
│   └── stats_tool.py       # Statistical analysis tools
├── general_tools/          # General tools (reserved)
├── rag_tools/             # RAG-related tools (reserved)
└── out_source/            # External integration tools (reserved)

New Architecture

The new architecture includes the following components:

  1. Tool Executor (app/core/tool_executor.py): A centralized execution framework that handles the following cross-cutting concerns:

    • Input validation

    • Caching

    • Concurrency

    • Error handling

    • Performance optimization

    • Logging

  2. Base Tool Class (app/tools/base_tool.py): A base class that all tools should inherit from, providing:

    • Integration with the tool executor

    • Schema-based input validation

    • Standardized error handling

    • Automatic schema discovery

  3. Tool Registry (app/tools/__init__.py): Handles tool registration and retrieval:

    • Tool registration

    • Tool retrieval

    • Automatic tool discovery

    • Layered module imports

  4. Layered Tool Organization:

    • task_tools: Specialized task-oriented tools for specific business scenarios

    • general_tools: General tools providing basic functionality

    • rag_tools: RAG (Retrieval-Augmented Generation) related tools

    • out_source: External service integration tools

Tool Categories

Task Tools

Located in the task_tools/ directory, containing tools specialized for specific tasks:

  • chart_tool: Chart generation and data visualization

  • classfire_tool: Data classification and categorization

  • image_tool: Image processing and manipulation

  • office_tool: Office document processing (Word, Excel, PowerPoint)

  • pandas_tool: Data analysis and DataFrame operations

  • report_tool: Report generation and formatting

  • research_tool: Research and information gathering

  • scraper_tool: Web data scraping

  • search_api: Search engine API integration

  • stats_tool: Statistical analysis and computation

Using Base Tool Class

To create a new tool, inherit from the BaseTool class and implement your business logic methods:

from typing import Dict, Any, Optional
from pydantic import BaseModel, Field

from aiecs.tools import register_tool
from aiecs.tools.base_tool import BaseTool

@register_tool("my_tool")
class MyTool(BaseTool):
    """My tool description"""

    # Define input schema for operations
    class OperationSchema(BaseModel):
        """Operation schema"""
        param1: str = Field(description="Parameter 1")
        param2: int = Field(description="Parameter 2")

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize tool"""
        super().__init__(config)
        # Additional initialization

    def operation(self, param1: str, param2: int) -> Dict[str, Any]:
        """
        Implement your business logic here

        Args:
            param1: Parameter 1
            param2: Parameter 2

        Returns:
            Operation result
        """
        # Your business logic
        return {"result": f"Processing {param1} and {param2}"}

Using Decorators for Performance Optimization

The tool executor provides several decorators that you can use to add performance optimizations to methods:

from aiecs.tools.tool_executor import cache_result, run_in_executor, measure_execution_time

@cache_result()  # Cache the result of this method
def cached_operation(self, param1: str) -> Dict[str, Any]:
    # This result will be cached based on param1
    return {"result": f"Cached result {param1}"}

@run_in_executor  # Run this method in a thread pool
def cpu_intensive_operation(self, param1: str) -> Dict[str, Any]:
    # This method will be executed in a separate thread
    return {"result": f"CPU-intensive result {param1}"}

@measure_execution_time  # Record the execution time of this method
def monitored_operation(self, param1: str) -> Dict[str, Any]:
    # The execution time of this method will be recorded
    return {"result": f"Monitored result {param1}"}

Migrating Existing Tools

To migrate existing tools to the new architecture:

  1. Make your tool class inherit from BaseTool

  2. Define Pydantic schemas for your operations

  3. Remove any custom caching, validation, or error handling code

  4. Use decorators for performance optimization

  5. Update the run method to use the base class implementation

Before:

@register_tool("example")
class ExampleTool:
    def __init__(self):
        self._cache = {}

    def run(self, op: str, **kwargs):
        if op == "operation":
            return self.operation(**kwargs)
        else:
            raise ValueError(f"Unsupported operation: {op}")

    def operation(self, param1: str, param2: int):
        # Custom caching
        cache_key = f"{param1}_{param2}"
        if cache_key in self._cache:
            return self._cache[cache_key]

        # Custom validation
        if not isinstance(param1, str):
            raise ValueError("param1 must be a string")
        if not isinstance(param2, int):
            raise ValueError("param2 must be an integer")

        # Business logic
        result = {"result": f"Processing {param1} and {param2}"}

        # Cache result
        self._cache[cache_key] = result

        return result

After:

from typing import Dict, Any, Optional
from pydantic import BaseModel, Field

from aiecs.tools import register_tool
from aiecs.tools.base_tool import BaseTool
from aiecs.tools.tool_executor import cache_result

@register_tool("example")
class ExampleTool(BaseTool):
    """Example tool"""

    class OperationSchema(BaseModel):
        """Operation schema"""
        param1: str = Field(description="Parameter 1")
        param2: int = Field(description="Parameter 2")

    @cache_result()
    def operation(self, param1: str, param2: int) -> Dict[str, Any]:
        """
        Process parameters

        Args:
            param1: Parameter 1
            param2: Parameter 2

        Returns:
            Operation result
        """
        # Focus only on business logic
        return {"result": f"Processing {param1} and {param2}"}

Benefits of the New Architecture

The new architecture provides several benefits:

  1. Separation of Concerns: Business logic is separated from cross-cutting concerns like caching, validation, and error handling.

  2. Reduced Duplication: Common functionality is implemented once in the tool executor and base tool, rather than being duplicated across individual tools.

  3. Consistent Behavior: All tools behave consistently in terms of validation, error handling, and performance optimization.

  4. Improved Maintainability: Tools are easier to maintain because they focus only on specific business logic.

  5. Enhanced Performance: The tool executor provides optimized implementations of caching, concurrency, and other performance features.

  6. Better Testing: Business logic can be tested independently of cross-cutting concerns.

  7. Easier Onboarding: New developers can focus on implementing business logic without worrying about performance optimization details.

Usage Examples

# Get tool instance
from aiecs.tools import get_tool

# Get chart tool
chart_tool = get_tool("chart")

# Use tool
result = chart_tool.run("visualize",
    file_path="data.csv",
    plot_type="histogram",
    x="age",
    title="Age Distribution"
)

# Or call method directly
result = chart_tool.visualize(
    file_path="data.csv",
    plot_type="histogram",
    x="age",
    title="Age Distribution"
)

Multi-Task Service Integration

The tool system is fully integrated with the MultiTaskTools service in app/services/multi_task/tools.py:

from aiecs.services.multi_task.tools import MultiTaskTools

# Initialize multi-task tools service
multi_tools = MultiTaskTools()

# Get all available tools
available_tools = multi_tools.get_available_tools()
print("Available tools:", available_tools)

# Get operations for a specific tool
chart_operations = multi_tools.get_available_operations("chart")
print("Chart tool operations:", chart_operations)

# Get operation details
operation_info = multi_tools.get_operation_info("chart.visualize")
print("Operation info:", operation_info)

# Execute tool operation
result = await multi_tools.execute_tool(
    "chart",
    "visualize",
    file_path="data.csv",
    plot_type="histogram",
    x="age"
)

Task Tool Usage Examples

Data Processing Pipeline

from aiecs.tools import get_tool

# 1. Data analysis tool
pandas_tool = get_tool("pandas")
df_result = pandas_tool.read_csv(file_path="data.csv")

# 2. Statistical analysis tool
stats_tool = get_tool("stats")
stats_result = stats_tool.descriptive_stats(data=df_result["data"])

# 3. Chart generation tool
chart_tool = get_tool("chart")
chart_result = chart_tool.visualize(
    data=df_result["data"],
    plot_type="histogram",
    x="age"
)

# 4. Report generation tool
report_tool = get_tool("report")
report_result = report_tool.generate_report(
    data=stats_result,
    charts=[chart_result],
    template="statistical_summary"
)

Research and Information Gathering

# Research tool
research_tool = get_tool("research")
research_result = research_tool.search_papers(
    query="machine learning",
    max_results=10
)

# Web scraping tool
scraper_tool = get_tool("scraper")
web_data = scraper_tool.scrape_url(
    url="https://example.com",
    selectors=["h1", "p"]
)

# Search API tool
search_tool = get_tool("search_api")
search_results = search_tool.web_search(
    query="artificial intelligence trends 2024",
    num_results=5
)

Office Document Processing

# Office tool
office_tool = get_tool("office")

# Process Excel file
excel_result = office_tool.read_excel(
    file_path="data.xlsx",
    sheet_name="Sheet1"
)

# Generate Word report
word_result = office_tool.create_word_document(
    content=report_result["content"],
    template="business_report"
)

# Create PowerPoint presentation
ppt_result = office_tool.create_presentation(
    slides_data=chart_result["charts"],
    template="data_analysis"
)

Tool Discovery and Registration

The system automatically discovers and registers all tools:

from aiecs.tools import list_tools, discover_tools

# List all registered tools
all_tools = list_tools()
print("Registered tools:", all_tools)

# Manually trigger tool discovery (usually not needed, system does this automatically)
discover_tools("aiecs.tools")

# View tools by category
task_tools = [tool for tool in all_tools if "task_tools" in str(type(get_tool(tool)))]
print("Task tools:", task_tools)

Best Practices

1. Tool Composition

Combine multiple tools to complete complex tasks:

def data_analysis_pipeline(csv_file: str):
    """Complete data analysis pipeline"""

    # Data loading and cleaning
    pandas_tool = get_tool("pandas")
    data = pandas_tool.read_csv(csv_file)
    cleaned_data = pandas_tool.clean_data(data["data"])

    # Statistical analysis
    stats_tool = get_tool("stats")
    statistics = stats_tool.comprehensive_analysis(cleaned_data["data"])

    # Visualization
    chart_tool = get_tool("chart")
    charts = chart_tool.create_dashboard(
        data=cleaned_data["data"],
        chart_types=["histogram", "boxplot", "correlation"]
    )

    # Generate report
    report_tool = get_tool("report")
    final_report = report_tool.generate_comprehensive_report(
        data=statistics,
        visualizations=charts,
        template="data_analysis"
    )

    return final_report

2. Error Handling

Use appropriate error handling:

from aiecs.tools import get_tool
from aiecs.tools.tool_executor import ToolExecutionError

try:
    tool = get_tool("pandas")
    result = tool.read_csv("nonexistent.csv")
except ToolExecutionError as e:
    print(f"Tool execution error: {e}")
except ValueError as e:
    print(f"Tool does not exist: {e}")

3. Asynchronous Operations

Use asynchronous execution for time-consuming operations:

import asyncio
from aiecs.services.multi_task.tools import MultiTaskTools

async def async_data_processing():
    multi_tools = MultiTaskTools()

    # Execute multiple operations in parallel
    tasks = [
        multi_tools.execute_tool("scraper", "scrape_url", url="https://site1.com"),
        multi_tools.execute_tool("scraper", "scrape_url", url="https://site2.com"),
        multi_tools.execute_tool("research", "search_papers", query="AI")
    ]

    results = await asyncio.gather(*tasks)
    return results

Extending the Tool System

Adding New Task Tools

  1. Create a new tool file in the task_tools/ directory

  2. Inherit from the BaseTool class

  3. Register using the @register_tool decorator

  4. Add import in task_tools/__init__.py

# task_tools/my_new_tool.py
from aiecs.tools import register_tool
from aiecs.tools.base_tool import BaseTool

@register_tool("my_new_tool")
class MyNewTool(BaseTool):
    """New tool description"""

    def my_operation(self, param: str) -> dict:
        """Operation description"""
        return {"result": f"Processing {param}"}

Creating New Tool Categories

  1. Create a new directory under app/tools/

  2. Add an __init__.py file

  3. Add import in the main __init__.py

  4. Tools will be automatically discovered and registered

Special Tool Usage Instructions

Image Tool

The Image Tool provides comprehensive image processing capabilities, including loading, OCR text recognition, metadata extraction, resizing, and filter application.

System Dependency Requirements

Important: The Image Tool requires system-level Tesseract OCR engine and Pillow image processing library system dependencies.

1. Tesseract OCR Engine

Ubuntu/Debian systems:

sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng

macOS systems:

brew install tesseract

Verify installation:

tesseract --version

2. Pillow Image Processing Library System Dependencies

Ubuntu/Debian systems:

# Basic image processing libraries
sudo apt-get install libjpeg-dev zlib1g-dev libpng-dev libtiff-dev libwebp-dev libopenjp2-7-dev

# Complete image processing libraries (recommended)
sudo apt-get install libimageio-dev libfreetype6-dev liblcms2-dev libtiff5-dev libjpeg8-dev libopenjp2-7-dev libwebp-dev libharfbuzz-dev libfribidi-dev libxcb1-dev

macOS systems:

brew install libjpeg zlib libpng libtiff webp openjpeg freetype lcms2

Verify installation:

python -c "from PIL import Image; print('PIL version:', Image.__version__)"

3. Multi-language OCR Support

Install additional language packs:

# Ubuntu/Debian systems
sudo apt-get install tesseract-ocr-chi-sim    # Simplified Chinese
sudo apt-get install tesseract-ocr-chi-tra    # Traditional Chinese
sudo apt-get install tesseract-ocr-fra        # French
sudo apt-get install tesseract-ocr-deu        # German
sudo apt-get install tesseract-ocr-jpn        # Japanese
sudo apt-get install tesseract-ocr-kor        # Korean
sudo apt-get install tesseract-ocr-rus        # Russian
sudo apt-get install tesseract-ocr-spa        # Spanish

View installed language packs:

tesseract --list-langs

Using multi-language OCR:

# English OCR
text = tool.ocr("/path/to/image.jpg", lang='eng')

# Chinese OCR
text = tool.ocr("/path/to/image.jpg", lang='chi_sim')

# Japanese OCR
text = tool.ocr("/path/to/image.jpg", lang='jpn')

Features

  1. Image Loading: Supports multiple formats (JPG, PNG, BMP, TIFF, GIF)

  2. OCR Text Recognition: Text extraction based on Tesseract engine

  3. Metadata Extraction: Get image dimensions, mode, and EXIF information

  4. Image Resizing: High-quality resizing

  5. Filter Effects: Blur, sharpen, edge enhancement, and other effects

Usage Examples

from aiecs.tools.task_tools.image_tool import ImageTool

# Initialize tool
tool = ImageTool()

# Load image information
result = tool.load("/path/to/image.jpg")
print(f"Size: {result['size']}, Mode: {result['mode']}")

# OCR text recognition
text = tool.ocr("/path/to/image.png", lang='eng')
print(f"Recognized text: {text}")

# Extract metadata
metadata = tool.metadata("/path/to/image.jpg", include_exif=True)
print(f"EXIF info: {metadata.get('exif', {})}")

# Resize image
tool.resize("/path/to/input.jpg", "/path/to/output.jpg", 800, 600)

# Apply filter
tool.filter("/path/to/input.jpg", "/path/to/blurred.jpg", "blur")

Security Features

  • File extension whitelist validation

  • File size limits (default 50MB)

  • Path normalization and security checks

  • Complete error handling and logging

ClassFire Tool (Text Classification and Keyword Extraction Tool)

The ClassFire Tool provides powerful text classification, keyword extraction, and text summarization capabilities, supporting both English and Chinese text processing.

Model Dependency Requirements

Important: The ClassFire Tool requires downloading and installing the following models to function properly.

1. spaCy Model Dependencies

Models Used:

  • English Model: en_core_web_sm - Used for part-of-speech tagging, named entity recognition, and keyword extraction for English text

  • Chinese Model: zh_core_web_sm - Used for part-of-speech tagging, named entity recognition, and keyword extraction for Chinese text

Installation Method:

# Install using Poetry environment
poetry run python -m spacy download en_core_web_sm
poetry run python -m spacy download zh_core_web_sm

# Or install using pip
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl
pip install https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.7.0/zh_core_web_sm-3.7.0-py3-none-any.whl

Usage Reasons:

  • Part-of-Speech Tagging: Identifies nouns, verbs, adjectives, etc., for keyword extraction

  • Named Entity Recognition: Identifies entities like person names, place names, organization names, improving keyword quality

  • Language Detection: Automatically detects text language and selects appropriate processing strategy

  • Text Preprocessing: Provides standardized text processing pipeline

2. Transformers Model Dependencies

Models Used:

  • English Summarization Model: facebook/bart-large-cnn - Used for English text summarization

  • Multilingual Summarization Model: t5-base - Used for Chinese text summarization

Model Download:

# Models will be automatically downloaded to ~/.cache/huggingface/hub/ on first use
# No manual installation needed, but ensure network connection is available

Installation Verification:

from transformers import pipeline

# Test English summarization model
summarizer_en = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer_en("Your text here...", max_length=100, min_length=30)

# Test multilingual summarization model
summarizer_zh = pipeline("summarization", model="t5-base")
result = summarizer_zh("您的中文文本...", max_new_tokens=50, min_new_tokens=10)

Usage Reasons:

  • High-Quality Summarization: BART and T5 are state-of-the-art summarization models

  • Multilingual Support: T5 supports multiple languages, including Chinese

  • Configurable Length: Supports custom summary length and minimum length

  • Asynchronous Processing: Supports asynchronous calls, improving processing efficiency

3. NLTK Data Package Dependencies

Required Data Packages:

  • stopwords - Stopword data for keyword filtering

  • punkt - Sentence tokenizer for text preprocessing

  • wordnet - Lexical database for word similarity calculation

  • averaged_perceptron_tagger - Part-of-speech tagger

Automatic Download:

# Use the provided script to automatically download all NLP data
poetry run python aiecs/scripts/download_nlp_data.py

Features

  1. Text Classification: Text classification based on pre-trained models

  2. Keyword Extraction: Supports RAKE (English) and spaCy (English/Chinese) keyword extraction

  3. Text Summarization: Supports English and Chinese text summarization

  4. Language Detection: Automatically detects text language

  5. Asynchronous Processing: Supports asynchronous calls, improving performance

Usage Examples

from aiecs.tools.task_tools.classfire_tool import ClassifierTool

# Initialize tool
tool = ClassifierTool()

# Text classification
result = await tool.classify("This is a positive review about the product.")
print(f"Classification result: {result}")

# Keyword extraction
keywords = await tool.extract_keywords("Natural language processing is important.", top_k=5)
print(f"Keywords: {keywords}")

# Text summarization
summary = await tool.summarize("Your long text here...", max_length=100)
print(f"Summary: {summary}")

# Chinese processing
chinese_keywords = await tool.extract_keywords("自然语言处理是人工智能的重要领域。", top_k=3)
print(f"Chinese keywords: {chinese_keywords}")

Performance Optimization

  • Model Caching: Models are cached after first load, improving subsequent call speed

  • Asynchronous Processing: All main features support asynchronous calls

  • Memory Management: Supports model unloading and reloading to save memory

  • Error Handling: Comprehensive error handling and fallback mechanisms

Notes

  • First Use: Models will be automatically downloaded on first use, which may take some time

  • Network Requirements: Network connection required to download Transformers models

  • Memory Requirements: Model loading requires certain memory space

  • Language Support: Currently mainly supports English and Chinese, limited support for other languages

Office Tool (Office Document Processing Tool)

The Office Tool provides comprehensive office document processing capabilities, supporting reading, writing, and conversion of various document formats, including Word, PowerPoint, Excel, PDF, and image files.

System Dependency Requirements

Important: The Office Tool requires Java Runtime Environment and Tesseract OCR engine to function properly.

1. Java Runtime Environment (Required)

Purpose: Apache Tika document parsing library requires Java Runtime Environment.

Ubuntu/Debian systems:

# Install OpenJDK 11 (recommended)
sudo apt-get update
sudo apt-get install openjdk-11-jdk

# Or install OpenJDK 17
sudo apt-get install openjdk-17-jdk

# Verify installation
java -version
javac -version

macOS systems:

# Install using Homebrew
brew install openjdk@11

# Or install OpenJDK 17
brew install openjdk@17

Environment Variable Setup:

# Set JAVA_HOME environment variable
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
# Or for OpenJDK 17
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

# Add to ~/.bashrc or ~/.zshrc
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc

2. Tesseract OCR Engine (Required for OCR Functionality)

Purpose: Text recognition functionality in image files.

Ubuntu/Debian systems:

sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng

# Chinese OCR support
sudo apt-get install tesseract-ocr-chi-sim    # Simplified Chinese
sudo apt-get install tesseract-ocr-chi-tra    # Traditional Chinese

macOS systems:

brew install tesseract

Verify installation:

tesseract --version
tesseract --list-langs

3. Python Package Dependencies

Core Document Processing Libraries:

  • pandas (>=2.2.3) - Excel file data processing

  • openpyxl (>=3.1.5) - Excel file read/write

  • python-docx (>=1.1.2) - Word document processing

  • python-pptx (>=1.0.2) - PowerPoint document processing

  • pdfplumber (>=0.11.7) - PDF text extraction

Content Parsing Libraries:

  • tika (>=3.2.2) - Universal document parsing (requires Java 11+)

  • pytesseract (>=0.3.13) - OCR text recognition

  • Pillow (>=11.2.1) - Image processing

Features

  1. Document Reading: Supports DOCX, PPTX, XLSX, PDF formats

  2. Document Writing: Create and edit Word, PowerPoint, Excel documents

  3. Text Extraction: Extract text content from various document formats

  4. OCR Functionality: Recognize text from image files

  5. Multi-format Support: Process legacy Office documents and other formats

Usage Examples

from aiecs.tools.task_tools.office_tool import OfficeTool

# Initialize tool
tool = OfficeTool()

# Read Word document
docx_content = tool.read_docx("/path/to/document.docx")
print(f"Document content: {docx_content['text']}")

# Read Excel file
xlsx_data = tool.read_xlsx("/path/to/spreadsheet.xlsx")
print(f"Spreadsheet data: {xlsx_data}")

# Extract text (supports multiple formats)
text = tool.extract_text("/path/to/document.pdf")
print(f"Extracted text: {text}")

# Create Word document
tool.write_docx("Hello World!", "/path/to/output.docx")

# Create PowerPoint presentation
slides = ["Title Slide", "Content Page 1", "Content Page 2"]
tool.write_pptx(slides, "/path/to/presentation.pptx")

OCR Functionality

Supported Image Formats:

  • PNG, JPG, JPEG, TIFF, BMP, GIF

Language Support:

  • English (eng)

  • Simplified Chinese (chi_sim)

  • Traditional Chinese (chi_tra)

Usage Example:

# Extract text from image
image_text = tool.extract_text("/path/to/image.png")
print(f"Recognized text: {image_text}")

Performance Optimization

  • Tika Caching: Tika JAR file will be downloaded and cached on first use

  • Memory Management: Pay attention to memory usage when processing large files

  • Concurrency Limits: Recommend limiting the number of documents processed simultaneously

  • Error Handling: Comprehensive error handling and fallback mechanisms

Notes

  • Java Version: Requires Java 11 or higher (Tika 3.x requirement)

  • Memory Requirements: Tika requires sufficient memory when processing large files

  • File Size: Default maximum file size is 100MB

  • Encoding Issues: Some documents may have encoding issues

  • OCR Accuracy: Image quality affects OCR recognition accuracy

Stats Tool (Statistical Analysis Tool)

The Stats Tool provides comprehensive statistical analysis capabilities, supporting various statistical tests, data preprocessing, regression analysis, time series analysis, and other advanced statistical functions.

System Dependency Requirements

Important: The Stats Tool requires system-level C libraries to support reading special file formats, particularly SAS, SPSS, and Stata files.

1. pyreadstat System Dependencies (Special File Format Support)

Purpose: Read and write SAS, SPSS, Stata files (.sav, .sas7bdat, .por formats)

Ubuntu/Debian systems:

# Install libreadstat development library
sudo apt-get update
sudo apt-get install libreadstat-dev

# Install build tools (if not already installed)
sudo apt-get install build-essential python3-dev

# Reinstall pyreadstat
pip install --no-cache-dir --force-reinstall pyreadstat

macOS systems:

# Install using Homebrew
brew install readstat

# Reinstall pyreadstat
pip install --no-cache-dir --force-reinstall pyreadstat

CentOS/RHEL systems:

# Install development tools
sudo yum groupinstall "Development Tools"
sudo yum install python3-devel

# Install readstat library (may need to compile from source)
# Or use conda to install
conda install -c conda-forge readstat

Verify installation:

import pyreadstat
print("pyreadstat version:", pyreadstat.__version__)

# Test read functionality
try:
    # No actual file needed here, just testing import
    print("pyreadstat installed successfully")
except Exception as e:
    print("pyreadstat installation failed:", e)

2. Excel File Support System Dependencies

Purpose: Read and write Excel files (.xlsx, .xls formats)

Ubuntu/Debian systems:

# Install system libraries required by openpyxl
sudo apt-get install libxml2-dev libxslt1-dev

# Verify installation
python -c "import openpyxl; print('openpyxl available')"

macOS systems:

# Usually no additional installation needed, system already includes required libraries
brew install libxml2 libxslt

3. Python Package Dependencies

Core Statistical Libraries:

  • pandas (>=2.2.3) - Data processing and analysis

  • numpy (>=2.2.6) - Numerical computation

  • scipy (>=1.15.3) - Scientific computing and statistical functions

  • scikit-learn (>=1.5.0) - Machine learning library (data preprocessing)

  • statsmodels (>=0.14.4) - Statistical models and tests

Special File Format Support:

  • pyreadstat (>=1.2.9) - SAS, SPSS, Stata file support

  • openpyxl (>=3.1.5) - Excel file support

Configuration Management:

  • pydantic (>=2.11.5) - Data validation

  • pydantic-settings (>=2.9.1) - Settings management

Features

  1. Descriptive Statistics: Basic statistics, skewness, kurtosis, percentiles

  2. Hypothesis Testing: t-tests, chi-square tests, ANOVA, non-parametric tests

  3. Correlation Analysis: Pearson, Spearman, Kendall correlation coefficients

  4. Regression Analysis: OLS, Logit, Probit, Poisson regression

  5. Time Series: ARIMA, SARIMA models and forecasting

  6. Data Preprocessing: Standardization, missing value handling, data cleaning

  7. Multi-format Support: CSV, Excel, JSON, Parquet, Feather, SAS, SPSS, Stata

Supported File Formats

Format

Extension

Dependency Library

System Requirements

CSV

.csv

pandas

None

Excel

.xlsx, .xls

openpyxl

libxml2, libxslt

JSON

.json

pandas

None

Parquet

.parquet

pandas

None

Feather

.feather

pandas

None

SPSS

.sav, .por

pyreadstat

libreadstat

SAS

.sas7bdat

pyreadstat

libreadstat

Usage Examples

from aiecs.tools import get_tool

# Get statistics tool
stats_tool = get_tool("stats")

# Read data
data_info = stats_tool.read_data("data.sav")  # SPSS file
print(f"Number of variables: {len(data_info['variables'])}")
print(f"Number of observations: {data_info['observations']}")

# Descriptive statistics
desc_stats = stats_tool.describe(
    file_path="data.sav",
    variables=["age", "income", "education"],
    include_percentiles=True,
    percentiles=[0.1, 0.9]
)

# t-test
ttest_result = stats_tool.ttest(
    file_path="data.sav",
    var1="group1_score",
    var2="group2_score",
    equal_var=True
)

# Correlation analysis
correlation = stats_tool.correlation(
    file_path="data.sav",
    variables=["var1", "var2", "var3"],
    method="pearson"
)

# Regression analysis
regression = stats_tool.regression(
    file_path="data.sav",
    formula="y ~ x1 + x2 + x3",
    regression_type="ols"
)

# Data preprocessing
preprocessed = stats_tool.preprocess(
    file_path="data.sav",
    variables=["var1", "var2"],
    operation="scale",
    scaler_type="standard"
)

Environment Variable Configuration

Stats Tool can be configured via the following environment variables:

# Maximum file size limit (MB)
export STATS_TOOL_MAX_FILE_SIZE_MB=200

# Allowed file extensions
export STATS_TOOL_ALLOWED_EXTENSIONS=".sav,.sas7bdat,.por,.csv,.xlsx,.xls,.json,.parquet,.feather"

Troubleshooting

pyreadstat Installation Issues

Problem: ImportError: No module named 'pyreadstat' or compilation errors

Solution:

# 1. Install system dependencies
sudo apt-get install libreadstat-dev build-essential python3-dev

# 2. Reinstall
pip uninstall pyreadstat
pip install --no-cache-dir pyreadstat

# 3. Verify installation
python -c "import pyreadstat; print('Success')"

Problem: OSError: libreadstat.so: cannot open shared object file

Solution:

# Check library file location
ldconfig -p | grep readstat

# If not found, reinstall system library
sudo apt-get install --reinstall libreadstat0

Excel File Reading Issues

Problem: ImportError: No module named 'openpyxl'

Solution:

# Install openpyxl
pip install openpyxl

# Install system dependencies
sudo apt-get install libxml2-dev libxslt1-dev

Memory Usage Issues

Problem: Insufficient memory when processing large files

Solution:

# Use nrows parameter to limit number of rows read
data_info = stats_tool.read_data("large_file.csv", nrows=10000)

# Adjust environment variable
export STATS_TOOL_MAX_FILE_SIZE_MB=500

File Permission Issues

Problem: Cannot read file

Solution:

# Check file permissions
ls -la data.sav

# Modify permissions
chmod 644 data.sav

# Check file path
python -c "import os; print(os.path.exists('data.sav'))"

Report Tool (Multi-format Report Generation Tool)

The Report Tool provides comprehensive report generation capabilities, supporting HTML, Excel, PowerPoint, Word, Markdown, image, and PDF format report generation.

System Dependency Requirements

Important: Some features of the Report Tool require system-level graphics libraries and font libraries.

1. WeasyPrint System Dependencies (Required for PDF Functionality)

Purpose: HTML to PDF functionality requires WeasyPrint system-level dependencies.

Ubuntu/Debian systems:

# Install system libraries required by WeasyPrint
sudo apt-get update
sudo apt-get install libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev libffi-dev shared-mime-info

# Complete installation (recommended)
sudo apt-get install libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev libffi-dev shared-mime-info libxml2-dev libxslt1-dev

macOS systems:

# Install using Homebrew
brew install cairo pango gdk-pixbuf libffi

Verify installation:

# Check system libraries
pkg-config --modversion cairo
pkg-config --modversion pango

2. Matplotlib System Dependencies (Required for Chart Functionality)

Purpose: Chart generation functionality requires font and image processing libraries.

Ubuntu/Debian systems:

# Install system libraries required by Matplotlib
sudo apt-get install libfreetype6-dev libpng-dev libjpeg-dev libtiff-dev libwebp-dev

# Chinese font support
sudo apt-get install fonts-wqy-zenhei fonts-wqy-microhei

macOS systems:

# Install using Homebrew
brew install freetype libpng libjpeg libtiff webp

Verify installation:

python -c "import matplotlib.pyplot as plt; plt.figure(); print('Matplotlib working')"

3. Python Package Dependencies

Core Report Generation Libraries:

  • jinja2 (>=3.1.6) - Template engine

  • weasyprint (>=65.1) - HTML to PDF

  • matplotlib (>=3.10.3) - Chart generation

  • bleach (>=6.2.0) - HTML sanitization

  • markdown (>=3.8) - Markdown processing

Document Processing Libraries:

  • pandas (>=2.2.3) - Data processing

  • openpyxl (>=3.1.5) - Excel file processing

  • python-docx (>=1.1.2) - Word document processing

  • python-pptx (>=1.0.2) - PowerPoint document processing

Features

  1. HTML Reports: Generated using Jinja2 template engine ✅

  2. PDF Reports: HTML to PDF conversion using WeasyPrint ⚠️ Temporarily Disabled

  3. Excel Reports: Multi-sheet Excel file generation ✅

  4. PowerPoint Reports: Custom slide presentations ✅

  5. Word Reports: Styled Word documents ✅

  6. Markdown Reports: Markdown format reports ✅

  7. Image Reports: Chart generation using Matplotlib ✅

Usage Examples

from aiecs.tools.task_tools.report_tool import ReportTool

# Initialize tool
tool = ReportTool()

# Generate HTML report
html_result = tool.generate_html(
    template_path="report_template.html",
    context={"title": "Monthly Report", "data": data},
    output_path="/path/to/report.html"
)

# Generate PDF report (⚠️ Temporarily disabled - requires system dependencies and code modification)
# pdf_result = tool.generate_pdf(
#     html=html_content,
#     output_path="/path/to/report.pdf",
#     page_size="A4"
# )

# Generate Excel report
excel_result = tool.generate_excel(
    sheets={"Data": df, "Summary": summary_df},
    output_path="/path/to/report.xlsx"
)

# Generate chart
chart_result = tool.generate_image(
    chart_type="bar",
    data=chart_data,
    output_path="/path/to/chart.png",
    title="Sales Data"
)

PDF Functionality Notes

⚠️ Important Notice: HTML to PDF functionality is temporarily disabled due to WeasyPrint system dependency issues.

Current Status:

  • PDF generation functionality is completely unavailable

  • Calling generate_pdf() method will throw an error

  • Requires manual installation of system dependencies and code modification to enable

Disable Reason:

  • Missing system-level graphics libraries required by WeasyPrint

  • Deployment environment complexity makes dependency installation difficult

  • To ensure stability of other features

Enable Method:

  1. Install WeasyPrint system dependencies:

    sudo apt-get install libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev libffi-dev shared-mime-info
    
  2. Modify code:

    • Uncomment from weasyprint import HTML import statement

    • Uncomment implementation code in generate_pdf method

    • Remove error throwing statements

  3. Verify installation: Ensure all system libraries are correctly installed

Supported Features (after enabling):

  • HTML to PDF conversion

  • Custom page sizes

  • CSS style support

  • Template variable substitution

Alternatives:

  • Use generate_html() to generate HTML reports

  • Use browser to manually print to PDF

  • Use other PDF generation tools

Performance Optimization

  • Template Caching: Jinja2 templates are automatically cached

  • Temporary File Management: Automatic cleanup of temporary files

  • Batch Generation: Supports parallel generation of multiple reports

  • Memory Management: Optimized for large file processing

Notes

  • WeasyPrint Dependencies: PDF functionality requires complete system library support

  • Font Support: Chart generation requires system font libraries

  • Template Security: Automatic HTML content sanitization to prevent XSS attacks

  • File Size: Pay attention to memory usage when processing large files

  • Concurrency Limits: Recommend limiting the number of reports generated simultaneously


Scraper Tool (Web Scraping Tool)

Feature Overview

The Scraper Tool is a powerful web scraping tool that supports multiple HTTP clients, JavaScript rendering, HTML parsing, and advanced crawling functionality.

Main Features:

  • HTTP Requests: Supports httpx, urllib, and other clients

  • JavaScript Rendering: Uses Playwright for dynamic content scraping

  • HTML Parsing: Uses BeautifulSoup and lxml for content parsing

  • Advanced Crawling: Integrates Scrapy for complex crawling projects

  • Multi-format Output: Supports text, JSON, HTML, Markdown, CSV output

Special Dependency Instructions

1. Playwright Browser Dependencies

Purpose: JavaScript rendering functionality (render() method)

Dependency Contents:

  • Python Package: playwright (already installed)

  • Browser Binaries: Chromium, Firefox, WebKit

  • System Dependencies: System libraries required for browser operation

Installation Steps:

  1. Download Browsers:

    cd /home/coder1/python-middleware-dev
    poetry run playwright install
    
  2. Install System Dependencies:

    # Method 1: Use Playwright automatic installation
    poetry run playwright install-deps
    
    # Method 2: Manual installation (Ubuntu/Debian)
    sudo apt-get install libatk1.0-0 \
        libatk-bridge2.0-0 \
        libcups2 \
        libxkbcommon0 \
        libatspi2.0-0 \
        libxcomposite1 \
        libxdamage1 \
        libxfixes3 \
        libxrandr2 \
        libgbm1 \
        libasound2
    

    Method 3: Root account installation (recommended)

    Temporarily install playwright-python package

    pip install playwright

    Run playwright command to install system dependencies

    python -m playwright install-deps

    After dependencies are installed, uninstall temporary playwright-python package to keep root environment clean

    pip uninstall playwright -y

Browser Storage Location:

  • Path: ~/.cache/ms-playwright/

  • Size: Approximately 400-500MB (all browsers)

  • Contains: Chromium, Firefox, WebKit, FFMPEG

Feature Support:

  • Page Rendering: Waits for JavaScript execution to complete

  • Element Waiting: Waits for specific CSS selectors

  • Page Scrolling: Scrolls to bottom of page

  • Screenshot Functionality: Saves page screenshots

  • Multi-browser: Supports Chromium, Firefox, WebKit

2. Scrapy Advanced Crawling Dependencies

Purpose: Advanced crawling functionality (crawl_scrapy() method)

Dependency Contents:

  • Python Package: scrapy (needs to be installed)

  • Project Structure: Requires complete Scrapy project

Installation Steps:

cd /home/coder1/python-middleware-dev
poetry add scrapy

Feature Support:

  • Project-based Crawling: Supports complete Scrapy project structure

  • Data Pipelines: Data cleaning, deduplication, storage

  • Middlewares: Request/response processing

  • Scheduler: Intelligent request scheduling

  • Monitoring: Detailed logging and statistics

3. Other Dependencies

Python Package Dependencies:

  • httpx: Asynchronous HTTP client

  • beautifulsoup4: HTML/XML parsing

  • lxml: Fast XML and HTML processing

System Dependencies:

  • Network Connection: Download browsers and access target websites

  • Memory: Browser operation requires sufficient memory

  • Disk Space: Browser files approximately 500MB

Usage Examples

Basic HTTP Requests (No Browser Required)

from aiecs.tools.scraper_tool import ScraperTool

scraper = ScraperTool()

# Use fetch method for HTTP requests
result = await scraper.fetch("https://example.com")

# Access content
html_content = result.get("content", "")

JavaScript Rendering (Requires Playwright)

# Need to install Playwright browsers first
result = await scraper.render(
    url="https://spa-app.com",
    wait_time=5,
    screenshot=True
)

Advanced Crawling (Requires Scrapy)

# Need to install Scrapy first
result = scraper.crawl_scrapy(
    project_path="/path/to/scrapy/project",
    spider_name="my_spider",
    output_path="output.json"
)

Feature Classification

Feature Type

Method Name

Requires Browser

Requires Scrapy

Dependencies

Basic HTTP

get_httpx()

❌ Not required

❌ Not required

httpx

Basic HTTP

get_urllib()

❌ Not required

❌ Not required

urllib

HTML Parsing

parse_html()

❌ Not required

❌ Not required

BeautifulSoup

JavaScript Rendering

render()

✅ Required

❌ Not required

Playwright + browsers

Advanced Crawling

crawl_scrapy()

❌ Not required

✅ Required

Scrapy

Notes

General Notes

  • Network Limits: Comply with website robots.txt and access frequency limits

  • Legal Compliance: Ensure scraping behavior complies with relevant laws and regulations

  • Resource Management: Reasonably control concurrent request numbers

  • Error Handling: Implement appropriate retry and error handling mechanisms

Troubleshooting

Playwright Issues

# Check if browsers are installed
poetry run playwright install --list

# Reinstall browsers
poetry run playwright install --force

# Check system dependencies
poetry run playwright install-deps

Scrapy Issues

# Check if Scrapy is installed
poetry run scrapy --version

# Create test project
poetry run scrapy startproject test_project

Network Issues

  • Proxy Settings: Configure HTTP proxy

  • Timeout Settings: Adjust request timeout duration

  • Retry Mechanism: Implement automatic retry logic