Tools Architecture
This directory contains the tools that the application exposes, covering tasks such as data analysis, document processing, and web scraping. The architecture has been refactored to separate business logic from performance-optimization concerns, and it uses a layered layout to organize the different types of tools.
Directory Structure
app/tools/
├── __init__.py # Tool registry and discovery mechanism
├── base_tool.py # Base tool class
├── temp_file_manager.py # Temporary file management tool
├── README.md # This document
├── task_tools/ # Task-oriented tools
│ ├── __init__.py
│ ├── chart_tool.py # Chart and visualization tools
│ ├── classfire_tool.py # Classification and categorization tools
│ ├── image_tool.py # Image processing tools
│ ├── office_tool.py # Office document processing tools
│ ├── pandas_tool.py # Data analysis and processing tools
│ ├── report_tool.py # Report generation tools
│ ├── research_tool.py # Research and information gathering tools
│ ├── scraper_tool.py # Web scraping tools
│ ├── search_api.py # Search engine API integration tools
│ └── stats_tool.py # Statistical analysis tools
├── general_tools/ # General tools (reserved)
├── rag_tools/ # RAG-related tools (reserved)
└── out_source/ # External integration tools (reserved)
New Architecture
The new architecture includes the following components:
Tool Executor (app/core/tool_executor.py): A centralized execution framework that handles the following cross-cutting concerns:
Input validation
Caching
Concurrency
Error handling
Performance optimization
Logging
Base Tool Class (app/tools/base_tool.py): A base class that all tools should inherit from, providing:
Integration with the tool executor
Schema-based input validation
Standardized error handling
Automatic schema discovery
Tool Registry (app/tools/__init__.py): Handles tool registration and retrieval:
Tool registration
Tool retrieval
Automatic tool discovery
Layered module imports
Layered Tool Organization:
task_tools: Specialized task-oriented tools for specific business scenarios
general_tools: General tools providing basic functionality
rag_tools: RAG (Retrieval-Augmented Generation) related tools
out_source: External service integration tools
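The registry role described above can be pictured with a minimal sketch. This is not the actual content of app/tools/__init__.py: the `_REGISTRY` dictionary and the bodies of `register_tool`, `get_tool`, and `list_tools` are assumptions that only mirror the names used elsewhere in this document, and `EchoTool` is a hypothetical tool added for illustration.

```python
from typing import Callable, Dict, List, Type

# Hypothetical in-memory registry: tool name -> tool class.
_REGISTRY: Dict[str, Type] = {}


def register_tool(name: str) -> Callable[[Type], Type]:
    """Class decorator that records a tool class under `name`."""
    def decorator(cls: Type) -> Type:
        _REGISTRY[name] = cls
        return cls
    return decorator


def get_tool(name: str):
    """Return a new instance of the registered tool, or raise ValueError."""
    if name not in _REGISTRY:
        raise ValueError(f"Tool '{name}' is not registered")
    return _REGISTRY[name]()


def list_tools() -> List[str]:
    """Return the registered tool names in sorted order."""
    return sorted(_REGISTRY)


@register_tool("echo")
class EchoTool:
    """Hypothetical tool used only to exercise the registry."""

    def echo(self, text: str) -> dict:
        return {"result": text}
```

Because registration happens when the decorated class is defined, importing a tool module is enough to make the tool retrievable by name.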
Tool Categories
Task Tools
Located in the task_tools/ directory, containing tools specialized for specific tasks:
chart_tool: Chart generation and data visualization
classfire_tool: Data classification and categorization
image_tool: Image processing and manipulation
office_tool: Office document processing (Word, Excel, PowerPoint)
pandas_tool: Data analysis and DataFrame operations
report_tool: Report generation and formatting
research_tool: Research and information gathering
scraper_tool: Web data scraping
search_api: Search engine API integration
stats_tool: Statistical analysis and computation
Using Base Tool Class
To create a new tool, inherit from the BaseTool class and implement your business logic methods:
from typing import Dict, Any, Optional
from pydantic import BaseModel, Field
from aiecs.tools import register_tool
from aiecs.tools.base_tool import BaseTool

@register_tool("my_tool")
class MyTool(BaseTool):
    """My tool description"""

    # Define input schema for operations
    class OperationSchema(BaseModel):
        """Operation schema"""
        param1: str = Field(description="Parameter 1")
        param2: int = Field(description="Parameter 2")

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize tool"""
        super().__init__(config)
        # Additional initialization

    def operation(self, param1: str, param2: int) -> Dict[str, Any]:
        """
        Implement your business logic here

        Args:
            param1: Parameter 1
            param2: Parameter 2

        Returns:
            Operation result
        """
        # Your business logic
        return {"result": f"Processing {param1} and {param2}"}
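To make the division of labor concrete, here is a rough sketch of how a base class could dispatch `run("operation", ...)` to the matching method and validate the arguments first. It is an illustration under stated assumptions, not the real BaseTool: `SketchBaseTool` is a hypothetical name, and it checks plain type hints with the standard library instead of the Pydantic schemas the real class uses.

```python
import inspect
from typing import Any, Dict, get_type_hints


class SketchBaseTool:
    """Hypothetical stand-in for BaseTool; not the real implementation."""

    def run(self, op: str, **kwargs: Any) -> Any:
        method = getattr(self, op, None)
        if op.startswith("_") or not callable(method):
            raise ValueError(f"Unsupported operation: {op}")
        # Bind arguments first so missing or unexpected names fail early.
        try:
            inspect.signature(method).bind(**kwargs)
        except TypeError as exc:
            raise ValueError(f"Invalid arguments for {op}: {exc}") from exc
        # Check each argument against the method's simple type hints.
        hints = get_type_hints(method)
        for name, value in kwargs.items():
            expected = hints.get(name)
            if isinstance(expected, type) and not isinstance(value, expected):
                raise ValueError(f"{name} must be {expected.__name__}")
        return method(**kwargs)


class MyTool(SketchBaseTool):
    def operation(self, param1: str, param2: int) -> Dict[str, Any]:
        return {"result": f"Processing {param1} and {param2}"}
```

With this shape, `MyTool().run("operation", param1="a", param2=1)` reaches the business logic, while a wrong type or a missing argument is rejected before the method body runs.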
Using Decorators for Performance Optimization
The tool executor provides several decorators that you can use to add performance optimizations to methods:
from aiecs.tools.tool_executor import cache_result, run_in_executor, measure_execution_time

@cache_result()  # Cache the result of this method
def cached_operation(self, param1: str) -> Dict[str, Any]:
    # This result will be cached based on param1
    return {"result": f"Cached result {param1}"}

@run_in_executor  # Run this method in a thread pool
def cpu_intensive_operation(self, param1: str) -> Dict[str, Any]:
    # This method will be executed in a separate thread
    return {"result": f"CPU-intensive result {param1}"}

@measure_execution_time  # Record the execution time of this method
def monitored_operation(self, param1: str) -> Dict[str, Any]:
    # The execution time of this method will be recorded
    return {"result": f"Monitored result {param1}"}
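For readers curious about what a caching decorator of this kind does, the following is a minimal sketch. It is not the implementation of `cache_result`; the name `cache_result_sketch`, the cache key layout, and the `DemoTool` class are assumptions made for illustration, and the real decorator may add expiry, size limits, or a different key scheme.

```python
import functools
from typing import Any, Callable, Dict, Tuple


def cache_result_sketch() -> Callable:
    """Hypothetical stand-in for cache_result: memoize per instance and arguments."""
    def decorator(func: Callable) -> Callable:
        cache: Dict[Tuple, Any] = {}

        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            # Key on the instance identity plus the (hashable) arguments.
            # Note: id() values can be reused once an instance is garbage collected.
            key = (id(self), args, tuple(sorted(kwargs.items())))
            if key not in cache:
                cache[key] = func(self, *args, **kwargs)
            return cache[key]
        return wrapper
    return decorator


class DemoTool:
    def __init__(self) -> None:
        self.calls = 0  # counts how often the business logic really runs

    @cache_result_sketch()
    def cached_operation(self, param1: str) -> Dict[str, Any]:
        self.calls += 1
        return {"result": f"Cached result {param1}"}
```

Calling `cached_operation("a")` twice on the same instance runs the method body once; a different argument produces a new cache entry.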
Migrating Existing Tools
To migrate existing tools to the new architecture:
Make your tool class inherit from BaseTool
Define Pydantic schemas for your operations
Remove any custom caching, validation, or error handling code
Use decorators for performance optimization
Update the run method to use the base class implementation
Before:
@register_tool("example")
class ExampleTool:
    def __init__(self):
        self._cache = {}

    def run(self, op: str, **kwargs):
        if op == "operation":
            return self.operation(**kwargs)
        else:
            raise ValueError(f"Unsupported operation: {op}")

    def operation(self, param1: str, param2: int):
        # Custom caching
        cache_key = f"{param1}_{param2}"
        if cache_key in self._cache:
            return self._cache[cache_key]
        # Custom validation
        if not isinstance(param1, str):
            raise ValueError("param1 must be a string")
        if not isinstance(param2, int):
            raise ValueError("param2 must be an integer")
        # Business logic
        result = {"result": f"Processing {param1} and {param2}"}
        # Cache result
        self._cache[cache_key] = result
        return result
After:
from typing import Dict, Any, Optional
from pydantic import BaseModel, Field
from aiecs.tools import register_tool
from aiecs.tools.base_tool import BaseTool
from aiecs.tools.tool_executor import cache_result

@register_tool("example")
class ExampleTool(BaseTool):
    """Example tool"""

    class OperationSchema(BaseModel):
        """Operation schema"""
        param1: str = Field(description="Parameter 1")
        param2: int = Field(description="Parameter 2")

    @cache_result()
    def operation(self, param1: str, param2: int) -> Dict[str, Any]:
        """
        Process parameters

        Args:
            param1: Parameter 1
            param2: Parameter 2

        Returns:
            Operation result
        """
        # Focus only on business logic
        return {"result": f"Processing {param1} and {param2}"}
Benefits of the New Architecture
The new architecture provides several benefits:
Separation of Concerns: Business logic is separated from cross-cutting concerns like caching, validation, and error handling.
Reduced Duplication: Common functionality is implemented once in the tool executor and base tool, rather than being duplicated across individual tools.
Consistent Behavior: All tools behave consistently in terms of validation, error handling, and performance optimization.
Improved Maintainability: Tools are easier to maintain because they focus only on specific business logic.
Enhanced Performance: The tool executor provides optimized implementations of caching, concurrency, and other performance features.
Better Testing: Business logic can be tested independently of cross-cutting concerns.
Easier Onboarding: New developers can focus on implementing business logic without worrying about performance optimization details.
Usage Examples
# Get tool instance
from aiecs.tools import get_tool
# Get chart tool
chart_tool = get_tool("chart")
# Use tool
result = chart_tool.run("visualize",
file_path="data.csv",
plot_type="histogram",
x="age",
title="Age Distribution"
)
# Or call method directly
result = chart_tool.visualize(
file_path="data.csv",
plot_type="histogram",
x="age",
title="Age Distribution"
)
Multi-Task Service Integration
The tool system is fully integrated with the MultiTaskTools service in app/services/multi_task/tools.py:
from aiecs.services.multi_task.tools import MultiTaskTools
# Initialize multi-task tools service
multi_tools = MultiTaskTools()
# Get all available tools
available_tools = multi_tools.get_available_tools()
print("Available tools:", available_tools)
# Get operations for a specific tool
chart_operations = multi_tools.get_available_operations("chart")
print("Chart tool operations:", chart_operations)
# Get operation details
operation_info = multi_tools.get_operation_info("chart.visualize")
print("Operation info:", operation_info)
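# Execute tool operation (execute_tool is awaited, so call it from inside a coroutine)
import asyncio

async def run_chart():
    return await multi_tools.execute_tool(
        "chart",
        "visualize",
        file_path="data.csv",
        plot_type="histogram",
        x="age"
    )

result = asyncio.run(run_chart())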
Task Tool Usage Examples
Data Processing Pipeline
from aiecs.tools import get_tool
# 1. Data analysis tool
pandas_tool = get_tool("pandas")
df_result = pandas_tool.read_csv(file_path="data.csv")
# 2. Statistical analysis tool
stats_tool = get_tool("stats")
stats_result = stats_tool.descriptive_stats(data=df_result["data"])
# 3. Chart generation tool
chart_tool = get_tool("chart")
chart_result = chart_tool.visualize(
data=df_result["data"],
plot_type="histogram",
x="age"
)
# 4. Report generation tool
report_tool = get_tool("report")
report_result = report_tool.generate_report(
data=stats_result,
charts=[chart_result],
template="statistical_summary"
)
Research and Information Gathering
# Research tool
research_tool = get_tool("research")
research_result = research_tool.search_papers(
query="machine learning",
max_results=10
)
# Web scraping tool
scraper_tool = get_tool("scraper")
web_data = scraper_tool.scrape_url(
url="https://example.com",
selectors=["h1", "p"]
)
# Search API tool
search_tool = get_tool("search_api")
search_results = search_tool.web_search(
query="artificial intelligence trends 2024",
num_results=5
)
Office Document Processing
# Office tool
office_tool = get_tool("office")
# Process Excel file
excel_result = office_tool.read_excel(
file_path="data.xlsx",
sheet_name="Sheet1"
)
# Generate Word report
word_result = office_tool.create_word_document(
content=report_result["content"],
template="business_report"
)
# Create PowerPoint presentation
ppt_result = office_tool.create_presentation(
slides_data=chart_result["charts"],
template="data_analysis"
)
Tool Discovery and Registration
The system automatically discovers and registers all tools:
from aiecs.tools import discover_tools, get_tool, list_tools
# List all registered tools
all_tools = list_tools()
print("Registered tools:", all_tools)
# Manually trigger tool discovery (usually not needed, system does this automatically)
discover_tools("aiecs.tools")
# View tools by category
task_tools = [tool for tool in all_tools if "task_tools" in str(type(get_tool(tool)))]
print("Task tools:", task_tools)
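Automatic discovery of this kind usually comes down to importing every module in a package so that registration decorators get a chance to run. The sketch below shows that idea with the standard library only. `import_submodules` is a hypothetical helper, not the body of `discover_tools`, and it is demonstrated on the standard `email.mime` package because the `aiecs` package may not be installed.

```python
import importlib
import pkgutil
from typing import List


def import_submodules(package_name: str) -> List[str]:
    """Import every module below `package_name` and return their names.

    Importing has side effects: module-level code runs, which is exactly
    what allows class decorators to register tools.
    """
    package = importlib.import_module(package_name)
    found: List[str] = []
    prefix = package.__name__ + "."
    for info in pkgutil.walk_packages(package.__path__, prefix=prefix):
        importlib.import_module(info.name)
        found.append(info.name)
    return sorted(found)


if __name__ == "__main__":
    for name in import_submodules("email.mime"):
        print(name)
```

Under this assumption, calling the helper with "aiecs.tools" would import each tool module once, after which a registry lookup by name can succeed.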
Best Practices
1. Tool Composition
Combine multiple tools to complete complex tasks:
def data_analysis_pipeline(csv_file: str):
    """Complete data analysis pipeline"""
    # Data loading and cleaning
    pandas_tool = get_tool("pandas")
    data = pandas_tool.read_csv(csv_file)
    cleaned_data = pandas_tool.clean_data(data["data"])
    # Statistical analysis
    stats_tool = get_tool("stats")
    statistics = stats_tool.comprehensive_analysis(cleaned_data["data"])
    # Visualization
    chart_tool = get_tool("chart")
    charts = chart_tool.create_dashboard(
        data=cleaned_data["data"],
        chart_types=["histogram", "boxplot", "correlation"]
    )
    # Generate report
    report_tool = get_tool("report")
    final_report = report_tool.generate_comprehensive_report(
        data=statistics,
        visualizations=charts,
        template="data_analysis"
    )
    return final_report
2. Error Handling
Use appropriate error handling:
from aiecs.tools import get_tool
from aiecs.tools.tool_executor import ToolExecutionError
try:
    tool = get_tool("pandas")
    result = tool.read_csv("nonexistent.csv")
except ToolExecutionError as e:
    print(f"Tool execution error: {e}")
except ValueError as e:
    print(f"Tool does not exist: {e}")
3. Asynchronous Operations
Use asynchronous execution for time-consuming operations:
import asyncio
from aiecs.services.multi_task.tools import MultiTaskTools
async def async_data_processing():
    multi_tools = MultiTaskTools()
    # Execute multiple operations in parallel
    tasks = [
        multi_tools.execute_tool("scraper", "scrape_url", url="https://site1.com"),
        multi_tools.execute_tool("scraper", "scrape_url", url="https://site2.com"),
        multi_tools.execute_tool("research", "search_papers", query="AI")
    ]
    results = await asyncio.gather(*tasks)
    return results
Extending the Tool System
Adding New Task Tools
Create a new tool file in the task_tools/ directory
Inherit from the BaseTool class
Register using the @register_tool decorator
Add an import in task_tools/__init__.py
# task_tools/my_new_tool.py
from aiecs.tools import register_tool
from aiecs.tools.base_tool import BaseTool

@register_tool("my_new_tool")
class MyNewTool(BaseTool):
    """New tool description"""

    def my_operation(self, param: str) -> dict:
        """Operation description"""
        return {"result": f"Processing {param}"}
Creating New Tool Categories
Create a new directory under app/tools/
Add an __init__.py file
Add an import in the main __init__.py
Tools will be automatically discovered and registered
Special Tool Usage Instructions
Image Tool
The Image Tool provides comprehensive image processing capabilities, including loading, OCR text recognition, metadata extraction, resizing, and filter application.
System Dependency Requirements
Important: The Image Tool requires the Tesseract OCR engine and the system libraries that Pillow depends on to be installed at the system level.
1. Tesseract OCR Engine
Ubuntu/Debian systems:
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
macOS systems:
brew install tesseract
Verify installation:
tesseract --version
2. Pillow Image Processing Library System Dependencies
Ubuntu/Debian systems:
# Basic image processing libraries
sudo apt-get install libjpeg-dev zlib1g-dev libpng-dev libtiff-dev libwebp-dev libopenjp2-7-dev
# Complete image processing libraries (recommended)
sudo apt-get install libimageio-dev libfreetype6-dev liblcms2-dev libtiff5-dev libjpeg8-dev libopenjp2-7-dev libwebp-dev libharfbuzz-dev libfribidi-dev libxcb1-dev
macOS systems:
brew install libjpeg zlib libpng libtiff webp openjpeg freetype lcms2
Verify installation:
python -c "from PIL import Image; print('PIL version:', Image.__version__)"
3. Multi-language OCR Support
Install additional language packs:
# Ubuntu/Debian systems
sudo apt-get install tesseract-ocr-chi-sim # Simplified Chinese
sudo apt-get install tesseract-ocr-chi-tra # Traditional Chinese
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-jpn # Japanese
sudo apt-get install tesseract-ocr-kor # Korean
sudo apt-get install tesseract-ocr-rus # Russian
sudo apt-get install tesseract-ocr-spa # Spanish
View installed language packs:
tesseract --list-langs
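The same check can be scripted before calling the OCR operation. The sketch below wraps the `tesseract --list-langs` command shown above; `installed_tesseract_languages` is a hypothetical helper that is not part of the Image Tool, and the output parsing is an assumption (it keeps only lines without spaces, which drops the header line).

```python
import shutil
import subprocess
from typing import List


def installed_tesseract_languages() -> List[str]:
    """Return language codes reported by `tesseract --list-langs`, or [] if unavailable."""
    if shutil.which("tesseract") is None:
        return []
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some versions print the list to stderr, so read both streams.
    lines = (proc.stdout + "\n" + proc.stderr).splitlines()
    # Language codes such as "eng" or "chi_sim" contain no spaces; the header does.
    return [ln.strip() for ln in lines if ln.strip() and " " not in ln.strip()]


if __name__ == "__main__":
    langs = installed_tesseract_languages()
    print("chi_sim available:", "chi_sim" in langs)
```

A caller could fall back to `eng` or raise a clear error when the requested language code is missing from the returned list.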
Using multi-language OCR:
# English OCR
text = tool.ocr("/path/to/image.jpg", lang='eng')
# Chinese OCR
text = tool.ocr("/path/to/image.jpg", lang='chi_sim')
# Japanese OCR
text = tool.ocr("/path/to/image.jpg", lang='jpn')
Features
Image Loading: Supports multiple formats (JPG, PNG, BMP, TIFF, GIF)
OCR Text Recognition: Text extraction based on Tesseract engine
Metadata Extraction: Get image dimensions, mode, and EXIF information
Image Resizing: High-quality resizing
Filter Effects: Blur, sharpen, edge enhancement, and other effects
Usage Examples
from aiecs.tools.task_tools.image_tool import ImageTool
# Initialize tool
tool = ImageTool()
# Load image information
result = tool.load("/path/to/image.jpg")
print(f"Size: {result['size']}, Mode: {result['mode']}")
# OCR text recognition
text = tool.ocr("/path/to/image.png", lang='eng')
print(f"Recognized text: {text}")
# Extract metadata
metadata = tool.metadata("/path/to/image.jpg", include_exif=True)
print(f"EXIF info: {metadata.get('exif', {})}")
# Resize image
tool.resize("/path/to/input.jpg", "/path/to/output.jpg", 800, 600)
# Apply filter
tool.filter("/path/to/input.jpg", "/path/to/blurred.jpg", "blur")
Security Features
File extension whitelist validation
File size limits (default 50MB)
Path normalization and security checks
Complete error handling and logging
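The first three checks in this list can be sketched as a single validation step. This is an illustration only: `validate_image_path` and the two constants are hypothetical names, the extension set is inferred from the formats listed under Features (with `.jpeg` added as an assumption), and the real tool may apply further rules.

```python
import os
from pathlib import Path

# Assumed whitelist, derived from the formats listed under Features.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".gif"}
# 50MB default limit, as stated in the list above.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def validate_image_path(path: str) -> Path:
    """Normalize `path`, then apply the extension and size checks."""
    resolved = Path(os.path.abspath(os.path.normpath(path)))
    if resolved.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Extension not allowed: {resolved.suffix!r}")
    if not resolved.is_file():
        raise FileNotFoundError(f"No such file: {resolved}")
    if resolved.stat().st_size > MAX_FILE_SIZE_BYTES:
        raise ValueError("File exceeds the size limit")
    return resolved
```

Running the checks before any image library touches the file keeps failures cheap and gives the caller one place to catch validation errors.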
ClassFire Tool (Text Classification and Keyword Extraction Tool)
The ClassFire Tool provides powerful text classification, keyword extraction, and text summarization capabilities, supporting both English and Chinese text processing.
Model Dependency Requirements
Important: The ClassFire Tool requires downloading and installing the following models to function properly.
1. spaCy Model Dependencies
Models Used:
English Model: en_core_web_sm - Used for part-of-speech tagging, named entity recognition, and keyword extraction for English text
Chinese Model: zh_core_web_sm - Used for part-of-speech tagging, named entity recognition, and keyword extraction for Chinese text
Installation Method:
# Install using Poetry environment
poetry run python -m spacy download en_core_web_sm
poetry run python -m spacy download zh_core_web_sm
# Or install using pip
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl
pip install https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.7.0/zh_core_web_sm-3.7.0-py3-none-any.whl
Why these models are used:
Part-of-Speech Tagging: Identifies nouns, verbs, adjectives, etc., for keyword extraction
Named Entity Recognition: Identifies entities like person names, place names, organization names, improving keyword quality
Language Detection: Automatically detects text language and selects appropriate processing strategy
Text Preprocessing: Provides standardized text processing pipeline
2. Transformers Model Dependencies
Models Used:
English Summarization Model: facebook/bart-large-cnn - Used for English text summarization
Multilingual Summarization Model: t5-base - Used for Chinese text summarization
Model Download:
# Models will be automatically downloaded to ~/.cache/huggingface/hub/ on first use
# No manual installation needed, but ensure network connection is available
Installation Verification:
from transformers import pipeline
# Test English summarization model
summarizer_en = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer_en("Your text here...", max_length=100, min_length=30)
# Test multilingual summarization model
summarizer_zh = pipeline("summarization", model="t5-base")
result = summarizer_zh("您的中文文本...", max_new_tokens=50, min_new_tokens=10)
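Why these models are used:
High-Quality Summarization: BART and T5 are widely used pretrained sequence-to-sequence models for abstractive summarization
Multilingual Support: t5-base handles the non-English path; note that the original T5 checkpoints were pretrained mostly on English text, so Chinese summary quality may be limited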
Configurable Length: Supports custom summary length and minimum length
Asynchronous Processing: Supports asynchronous calls, improving processing efficiency
3. NLTK Data Package Dependencies
Required Data Packages:
stopwords - Stopword data for keyword filtering
punkt - Sentence tokenizer for text preprocessing
wordnet - Lexical database for word similarity calculation
averaged_perceptron_tagger - Part-of-speech tagger
Automatic Download:
# Use the provided script to automatically download all NLP data
poetry run python aiecs/scripts/download_nlp_data.py
Features
Text Classification: Text classification based on pre-trained models
Keyword Extraction: Supports RAKE (English) and spaCy (English/Chinese) keyword extraction
Text Summarization: Supports English and Chinese text summarization
Language Detection: Automatically detects text language
Asynchronous Processing: Supports asynchronous calls, improving performance
Usage Examples
import asyncio
from aiecs.tools.task_tools.classfire_tool import ClassifierTool

async def main():
    # Initialize tool
    tool = ClassifierTool()
    # Text classification
    result = await tool.classify("This is a positive review about the product.")
    print(f"Classification result: {result}")
    # Keyword extraction
    keywords = await tool.extract_keywords("Natural language processing is important.", top_k=5)
    print(f"Keywords: {keywords}")
    # Text summarization
    summary = await tool.summarize("Your long text here...", max_length=100)
    print(f"Summary: {summary}")
    # Chinese processing
    chinese_keywords = await tool.extract_keywords("自然语言处理是人工智能的重要领域。", top_k=3)
    print(f"Chinese keywords: {chinese_keywords}")

# The tool methods are awaited, so run them inside an event loop
asyncio.run(main())
Performance Optimization
Model Caching: Models are cached after first load, improving subsequent call speed
Asynchronous Processing: All main features support asynchronous calls
Memory Management: Supports model unloading and reloading to save memory
Error Handling: Comprehensive error handling and fallback mechanisms
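Model caching and unloading, as described in this list, can be approximated with `functools.lru_cache`. The sketch below is an assumption about the general pattern, not the tool's code: `get_model`, `unload_models`, and `LOAD_COUNT` are hypothetical names, and a lambda stands in for an expensive load such as `spacy.load(name)`.

```python
import functools
from typing import Callable, Dict

# Only here to make the caching visible in the example.
LOAD_COUNT: Dict[str, int] = {"n": 0}


@functools.lru_cache(maxsize=None)
def get_model(name: str) -> Callable[[str], str]:
    """Hypothetical loader: build the model once per name, then reuse it."""
    LOAD_COUNT["n"] += 1
    # Placeholder for an expensive load such as spacy.load(name).
    return lambda text: f"[{name}] {text}"


def unload_models() -> None:
    """Drop every cached model so its memory can be reclaimed."""
    get_model.cache_clear()
```

The first call for a given name pays the loading cost, later calls return the same object, and `unload_models()` forces a reload on the next request.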
Notes
First Use: Models are downloaded automatically on first use, which may take some time
Network Requirements: A network connection is required to download the Transformers models
Memory Requirements: Loading the models requires sufficient free memory
Language Support: English and Chinese are the main supported languages; support for other languages is limited
Office Tool (Office Document Processing Tool)
The Office Tool provides comprehensive office document processing capabilities, supporting reading, writing, and conversion of various document formats, including Word, PowerPoint, Excel, PDF, and image files.
System Dependency Requirements
Important: The Office Tool requires a Java Runtime Environment and the Tesseract OCR engine to function properly.
1. Java Runtime Environment (Required)
Purpose: Apache Tika document parsing library requires Java Runtime Environment.
Ubuntu/Debian systems:
# Install OpenJDK 11 (recommended)
sudo apt-get update
sudo apt-get install openjdk-11-jdk
# Or install OpenJDK 17
sudo apt-get install openjdk-17-jdk
# Verify installation
java -version
javac -version
macOS systems:
# Install using Homebrew
brew install openjdk@11
# Or install OpenJDK 17
brew install openjdk@17
Environment Variable Setup:
# Set JAVA_HOME environment variable
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
# Or for OpenJDK 17
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
# Add to ~/.bashrc or ~/.zshrc
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
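Since Tika needs Java 11 or newer, it can help to check the installed version from Python before the first document is parsed. The following sketch parses the output of `java -version`; both helper names are hypothetical and are not part of the Office Tool.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: "1.8.0_291" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version() -> Optional[int]:
    """Return the installed Java major version, or None if java is not on PATH."""
    if shutil.which("java") is None:
        return None
    proc = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=False
    )
    # `java -version` writes its report to stderr.
    return parse_java_major(proc.stderr + proc.stdout)


if __name__ == "__main__":
    major = java_major_version()
    if major is None or major < 11:
        print("Java 11+ not found; Tika-based parsing will likely fail")
    else:
        print(f"Java {major} detected")
```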
2. Tesseract OCR Engine (Required for OCR Functionality)
Purpose: Text recognition functionality in image files.
Ubuntu/Debian systems:
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
# Chinese OCR support
sudo apt-get install tesseract-ocr-chi-sim # Simplified Chinese
sudo apt-get install tesseract-ocr-chi-tra # Traditional Chinese
macOS systems:
brew install tesseract
Verify installation:
tesseract --version
tesseract --list-langs
3. Python Package Dependencies
Core Document Processing Libraries:
pandas (>=2.2.3) - Excel file data processing
openpyxl (>=3.1.5) - Excel file read/write
python-docx (>=1.1.2) - Word document processing
python-pptx (>=1.0.2) - PowerPoint document processing
pdfplumber (>=0.11.7) - PDF text extraction
Content Parsing Libraries:
tika (>=3.2.2) - Universal document parsing (requires Java 11+)
pytesseract (>=0.3.13) - OCR text recognition
Pillow (>=11.2.1) - Image processing
Features
Document Reading: Supports DOCX, PPTX, XLSX, PDF formats
Document Writing: Create and edit Word, PowerPoint, Excel documents
Text Extraction: Extract text content from various document formats
OCR Functionality: Recognize text from image files
Multi-format Support: Process legacy Office documents and other formats
Usage Examples
from aiecs.tools.task_tools.office_tool import OfficeTool
# Initialize tool
tool = OfficeTool()
# Read Word document
docx_content = tool.read_docx("/path/to/document.docx")
print(f"Document content: {docx_content['text']}")
# Read Excel file
xlsx_data = tool.read_xlsx("/path/to/spreadsheet.xlsx")
print(f"Spreadsheet data: {xlsx_data}")
# Extract text (supports multiple formats)
text = tool.extract_text("/path/to/document.pdf")
print(f"Extracted text: {text}")
# Create Word document
tool.write_docx("Hello World!", "/path/to/output.docx")
# Create PowerPoint presentation
slides = ["Title Slide", "Content Page 1", "Content Page 2"]
tool.write_pptx(slides, "/path/to/presentation.pptx")
OCR Functionality
Supported Image Formats:
PNG, JPG, JPEG, TIFF, BMP, GIF
Language Support:
English (eng)
Simplified Chinese (chi_sim)
Traditional Chinese (chi_tra)
Usage Example:
# Extract text from image
image_text = tool.extract_text("/path/to/image.png")
print(f"Recognized text: {image_text}")
Performance Optimization
Tika Caching: Tika JAR file will be downloaded and cached on first use
Memory Management: Pay attention to memory usage when processing large files
Concurrency Limits: Recommend limiting the number of documents processed simultaneously
Error Handling: Comprehensive error handling and fallback mechanisms
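One way to follow the concurrency recommendation is to guard each document with an `asyncio.Semaphore`. This is a sketch under assumptions: `process_with_limit` is a hypothetical helper, the default of 4 is arbitrary, and `worker` stands for any coroutine function that processes one path.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def process_with_limit(
    paths: Iterable[str],
    worker: Callable[[str], Awaitable[T]],
    max_concurrent: int = 4,
) -> List[T]:
    """Run `worker` over `paths`, allowing at most `max_concurrent` at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(path: str) -> T:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await worker(path)

    # gather preserves the input order of the results.
    return await asyncio.gather(*(guarded(p) for p in paths))
```

A synchronous call such as `tool.extract_text(path)` could be adapted with `asyncio.to_thread` inside the worker, which keeps the event loop free while still bounding how many documents are in flight.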
Notes
Java Version: Requires Java 11 or higher (Tika 3.x requirement)
Memory Requirements: Tika requires sufficient memory when processing large files
File Size: Default maximum file size is 100MB
Encoding Issues: Some documents may have encoding issues
OCR Accuracy: Image quality affects OCR recognition accuracy
Stats Tool (Statistical Analysis Tool)
The Stats Tool provides comprehensive statistical analysis capabilities, supporting various statistical tests, data preprocessing, regression analysis, time series analysis, and other advanced statistical functions.
System Dependency Requirements
Important: The Stats Tool requires system-level C libraries to support reading special file formats, particularly SAS, SPSS, and Stata files.
1. pyreadstat System Dependencies (Special File Format Support)
Purpose: Read and write SAS, SPSS, Stata files (.sav, .sas7bdat, .por formats)
Ubuntu/Debian systems:
# Install libreadstat development library
sudo apt-get update
sudo apt-get install libreadstat-dev
# Install build tools (if not already installed)
sudo apt-get install build-essential python3-dev
# Reinstall pyreadstat
pip install --no-cache-dir --force-reinstall pyreadstat
macOS systems:
# Install using Homebrew
brew install readstat
# Reinstall pyreadstat
pip install --no-cache-dir --force-reinstall pyreadstat
CentOS/RHEL systems:
# Install development tools
sudo yum groupinstall "Development Tools"
sudo yum install python3-devel
# Install readstat library (may need to compile from source)
# Or use conda to install
conda install -c conda-forge readstat
Verify installation:
# Importing inside the try block makes a failed installation visible
try:
    import pyreadstat
    print("pyreadstat version:", pyreadstat.__version__)
    print("pyreadstat installed successfully")
except ImportError as e:
    print("pyreadstat installation failed:", e)
2. Excel File Support System Dependencies
Purpose: Read and write Excel files (.xlsx, .xls formats)
Ubuntu/Debian systems:
# Install system libraries required by openpyxl
sudo apt-get install libxml2-dev libxslt1-dev
# Verify installation
python -c "import openpyxl; print('openpyxl available')"
macOS systems:
# Usually not needed, because macOS already ships these libraries; install them only if they are missing
brew install libxml2 libxslt
3. Python Package Dependencies
Core Statistical Libraries:
pandas (>=2.2.3) - Data processing and analysis
numpy (>=2.2.6) - Numerical computation
scipy (>=1.15.3) - Scientific computing and statistical functions
scikit-learn (>=1.5.0) - Machine learning library (data preprocessing)
statsmodels (>=0.14.4) - Statistical models and tests
Special File Format Support:
pyreadstat (>=1.2.9) - SAS, SPSS, Stata file support
openpyxl (>=3.1.5) - Excel file support
Configuration Management:
pydantic (>=2.11.5) - Data validation
pydantic-settings (>=2.9.1) - Settings management
Features
Descriptive Statistics: Basic statistics, skewness, kurtosis, percentiles
Hypothesis Testing: t-tests, chi-square tests, ANOVA, non-parametric tests
Correlation Analysis: Pearson, Spearman, Kendall correlation coefficients
Regression Analysis: OLS, Logit, Probit, Poisson regression
Time Series: ARIMA, SARIMA models and forecasting
Data Preprocessing: Standardization, missing value handling, data cleaning
Multi-format Support: CSV, Excel, JSON, Parquet, Feather, SAS, SPSS, Stata
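As a reminder of what the descriptive statistics mean, the sketch below computes them with the standard library only. It is not how the Stats Tool works internally (that relies on pandas, scipy, and statsmodels); `describe` here is a hypothetical function, and it uses population moments, so values may differ from library defaults that apply bias corrections.

```python
import statistics
from typing import Dict, Sequence


def describe(values: Sequence[float]) -> Dict[str, float]:
    """Mean, population std, skewness, excess kurtosis, and quartiles."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments of order 2, 3, and 4 (population versions).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean,
        "std": m2 ** 0.5,
        "skewness": m3 / m2 ** 1.5,      # 0 for symmetric data
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis, 0 for a normal distribution
        "q1": q1,
        "median": median,
        "q3": q3,
    }


if __name__ == "__main__":
    print(describe([1, 2, 3, 4, 5]))
```

Skewness is positive when the right tail is longer, which is why a sample such as 1, 1, 1, 10 yields a value above zero.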
Supported File Formats
| Format | Extension | Dependency Library | System Requirements |
|---|---|---|---|
| CSV | .csv | pandas | None |
| Excel | .xlsx, .xls | openpyxl | libxml2, libxslt |
| JSON | .json | pandas | None |
| Parquet | .parquet | pandas | None |
| Feather | .feather | pandas | None |
| SPSS | .sav, .por | pyreadstat | libreadstat |
| SAS | .sas7bdat | pyreadstat | libreadstat |
Usage Examples
from aiecs.tools import get_tool
# Get statistics tool
stats_tool = get_tool("stats")
# Read data
data_info = stats_tool.read_data("data.sav") # SPSS file
print(f"Number of variables: {len(data_info['variables'])}")
print(f"Number of observations: {data_info['observations']}")
# Descriptive statistics
desc_stats = stats_tool.describe(
file_path="data.sav",
variables=["age", "income", "education"],
include_percentiles=True,
percentiles=[0.1, 0.9]
)
# t-test
ttest_result = stats_tool.ttest(
file_path="data.sav",
var1="group1_score",
var2="group2_score",
equal_var=True
)
# Correlation analysis
correlation = stats_tool.correlation(
file_path="data.sav",
variables=["var1", "var2", "var3"],
method="pearson"
)
# Regression analysis
regression = stats_tool.regression(
file_path="data.sav",
formula="y ~ x1 + x2 + x3",
regression_type="ols"
)
# Data preprocessing
preprocessed = stats_tool.preprocess(
file_path="data.sav",
variables=["var1", "var2"],
operation="scale",
scaler_type="standard"
)
Environment Variable Configuration
Stats Tool can be configured via the following environment variables:
# Maximum file size limit (MB)
export STATS_TOOL_MAX_FILE_SIZE_MB=200
# Allowed file extensions
export STATS_TOOL_ALLOWED_EXTENSIONS=".sav,.sas7bdat,.por,.csv,.xlsx,.xls,.json,.parquet,.feather"
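The two variables above could be read along the following lines. This is a standard-library sketch, whereas the tool itself lists pydantic-settings as its settings dependency; `StatsSettingsSketch` and `load_settings` are hypothetical names, and the default of 200 MB is an assumption taken from the example value, since the real default is not stated here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple

DEFAULT_MAX_FILE_SIZE_MB = 200  # assumed default; the real default is not stated
DEFAULT_EXTENSIONS = ".sav,.sas7bdat,.por,.csv,.xlsx,.xls,.json,.parquet,.feather"


@dataclass(frozen=True)
class StatsSettingsSketch:
    max_file_size_mb: int
    allowed_extensions: Tuple[str, ...]


def load_settings(env: Mapping[str, str] = os.environ) -> StatsSettingsSketch:
    """Read the two STATS_TOOL_* variables, falling back to the defaults above."""
    size = int(env.get("STATS_TOOL_MAX_FILE_SIZE_MB", DEFAULT_MAX_FILE_SIZE_MB))
    raw = env.get("STATS_TOOL_ALLOWED_EXTENSIONS", DEFAULT_EXTENSIONS)
    # Split the comma-separated list and normalize case and whitespace.
    extensions = tuple(e.strip().lower() for e in raw.split(",") if e.strip())
    return StatsSettingsSketch(size, extensions)
```

Passing the environment as a parameter keeps the function easy to test, for example `load_settings({"STATS_TOOL_MAX_FILE_SIZE_MB": "500"})`.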
Troubleshooting
pyreadstat Installation Issues
Problem: ImportError: No module named 'pyreadstat' or compilation errors
Solution:
# 1. Install system dependencies
sudo apt-get install libreadstat-dev build-essential python3-dev
# 2. Reinstall
pip uninstall -y pyreadstat
pip install --no-cache-dir pyreadstat
# 3. Verify installation
python -c "import pyreadstat; print('Success')"
Problem: OSError: libreadstat.so: cannot open shared object file
Solution:
# Check library file location
ldconfig -p | grep readstat
# If not found, reinstall system library
sudo apt-get install --reinstall libreadstat0
Excel File Reading Issues
Problem: ImportError: No module named 'openpyxl'
Solution:
# Install openpyxl
pip install openpyxl
# Install system dependencies
sudo apt-get install libxml2-dev libxslt1-dev
Memory Usage Issues
Problem: Insufficient memory when processing large files
Solution:
# Python: use the nrows parameter to limit the number of rows read
data_info = stats_tool.read_data("large_file.csv", nrows=10000)
# Shell: raise the file size limit through the environment variable
export STATS_TOOL_MAX_FILE_SIZE_MB=500
File Permission Issues
Problem: Cannot read file
Solution:
# Check file permissions
ls -la data.sav
# Modify permissions
chmod 644 data.sav
# Check file path
python -c "import os; print(os.path.exists('data.sav'))"
Report Tool (Multi-format Report Generation Tool)
The Report Tool provides comprehensive report generation capabilities, supporting HTML, Excel, PowerPoint, Word, Markdown, image, and PDF format report generation.
System Dependency Requirements
Important: Some features of the Report Tool require system-level graphics libraries and font libraries.
1. WeasyPrint System Dependencies (Required for PDF Functionality)
Purpose: HTML to PDF functionality requires WeasyPrint system-level dependencies.
Ubuntu/Debian systems:
# Install system libraries required by WeasyPrint
sudo apt-get update
sudo apt-get install libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev libffi-dev shared-mime-info
# Complete installation (recommended)
sudo apt-get install libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev libffi-dev shared-mime-info libxml2-dev libxslt1-dev
macOS systems:
# Install using Homebrew
brew install cairo pango gdk-pixbuf libffi
Verify installation:
# Check system libraries
pkg-config --modversion cairo
pkg-config --modversion pango
2. Matplotlib System Dependencies (Required for Chart Functionality)
Purpose: Chart generation functionality requires font and image processing libraries.
Ubuntu/Debian systems:
# Install system libraries required by Matplotlib
sudo apt-get install libfreetype6-dev libpng-dev libjpeg-dev libtiff-dev libwebp-dev
# Chinese font support
sudo apt-get install fonts-wqy-zenhei fonts-wqy-microhei
macOS systems:
# Install using Homebrew
brew install freetype libpng libjpeg libtiff webp
Verify installation:
python -c "import matplotlib.pyplot as plt; plt.figure(); print('Matplotlib working')"
3. Python Package Dependencies
Core Report Generation Libraries:
jinja2 (>=3.1.6) - Template engine
weasyprint (>=65.1) - HTML to PDF
matplotlib (>=3.10.3) - Chart generation
bleach (>=6.2.0) - HTML sanitization
markdown (>=3.8) - Markdown processing
Document Processing Libraries:
pandas (>=2.2.3) - Data processing
openpyxl (>=3.1.5) - Excel file processing
python-docx (>=1.1.2) - Word document processing
python-pptx (>=1.0.2) - PowerPoint document processing
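To confirm that an environment meets these minimums, the installed versions can be read with importlib.metadata. This is a sketch under simplifying assumptions: parse_version and check_requirements are hypothetical helpers, and the comparison only looks at leading numeric components, so pre-release suffixes are ignored. The packaging library handles full version semantics if that matters.

```python
import re
from importlib import metadata
from typing import Dict, Tuple

# Minimum versions copied from the lists above
MINIMUM_VERSIONS = {
    "jinja2": "3.1.6",
    "weasyprint": "65.1",
    "matplotlib": "3.10.3",
    "bleach": "6.2.0",
    "markdown": "3.8",
    "pandas": "2.2.3",
    "openpyxl": "3.1.5",
    "python-docx": "1.1.2",
    "python-pptx": "1.0.2",
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '3.10.3' into (3, 10, 3); stop at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(requirements: Dict[str, str]) -> Dict[str, str]:
    """Report 'ok', 'missing', or a 'too old' message for each package."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= parse_version(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements(MINIMUM_VERSIONS).items():
        print(f"{package}: {status}")
```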
Features
HTML Reports: Generated using Jinja2 template engine ✅
PDF Reports: HTML to PDF conversion using WeasyPrint ⚠️ Temporarily Disabled
Excel Reports: Multi-sheet Excel file generation ✅
PowerPoint Reports: Custom slide presentations ✅
Word Reports: Styled Word documents ✅
Markdown Reports: Markdown format reports ✅
Image Reports: Chart generation using Matplotlib ✅
Usage Examples
from aiecs.tools.task_tools.report_tool import ReportTool
# Initialize tool
tool = ReportTool()
# Generate HTML report
html_result = tool.generate_html(
template_path="report_template.html",
context={"title": "Monthly Report", "data": data},
output_path="/path/to/report.html"
)
# Generate PDF report (⚠️ Temporarily disabled - requires system dependencies and code modification)
# pdf_result = tool.generate_pdf(
# html=html_content,
# output_path="/path/to/report.pdf",
# page_size="A4"
# )
# Generate Excel report
excel_result = tool.generate_excel(
sheets={"Data": df, "Summary": summary_df},
output_path="/path/to/report.xlsx"
)
# Generate chart
chart_result = tool.generate_image(
chart_type="bar",
data=chart_data,
output_path="/path/to/chart.png",
title="Sales Data"
)
PDF Functionality Notes
⚠️ Important Notice: HTML to PDF functionality is temporarily disabled due to WeasyPrint system dependency issues.
Current Status:
PDF generation functionality is completely unavailable
Calling the generate_pdf() method will throw an error
Requires manual installation of system dependencies and code modification to enable
Reasons for Disabling:
Missing system-level graphics libraries required by WeasyPrint
Deployment environment complexity makes dependency installation difficult
To ensure stability of other features
How to Enable:
Install WeasyPrint system dependencies:
sudo apt-get install libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev libffi-dev shared-mime-info
Modify code:
Uncomment the from weasyprint import HTML import statement
Uncomment the implementation code in the generate_pdf method
Remove the error-throwing statements
Verify installation: Ensure all system libraries are correctly installed
Supported Features (after enabling):
HTML to PDF conversion
Custom page sizes
CSS style support
Template variable substitution
Alternatives:
Use generate_html() to generate HTML reports
Use a browser to manually print to PDF
Use other PDF generation tools
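Callers that want PDF output when it is available, and HTML otherwise, can wrap the call in a fallback. The sketch below assumes a ReportTool-like object whose generate_pdf accepts the html, output_path, and page_size arguments shown in the commented-out usage example. The exception type raised while PDF support is disabled is not documented here, so the sketch catches Exception broadly; export_pdf_or_html is a hypothetical helper name.

```python
from pathlib import Path


def export_pdf_or_html(tool, html: str, pdf_path: str) -> str:
    """Try PDF generation; on failure, save the HTML next to the intended PDF.

    Returns the path of the file that was actually written.
    """
    try:
        tool.generate_pdf(html=html, output_path=pdf_path, page_size="A4")
        return pdf_path
    except Exception:  # assumption: any error means PDF support is unavailable
        fallback = Path(pdf_path).with_suffix(".html")
        fallback.write_text(html, encoding="utf-8")
        return str(fallback)
```

The returned HTML file can then be opened in a browser and printed to PDF manually, as described above.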
Performance Optimization
Template Caching: Jinja2 templates are automatically cached
Temporary File Management: Automatic cleanup of temporary files
Batch Generation: Supports parallel generation of multiple reports
Memory Management: Optimized for large file processing
Notes
WeasyPrint Dependencies: PDF functionality requires complete system library support
Font Support: Chart generation requires system font libraries
Template Security: Automatic HTML content sanitization to prevent XSS attacks
File Size: Pay attention to memory usage when processing large files
Concurrency Limits: Recommend limiting the number of reports generated simultaneously
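One way to apply the concurrency recommendation is to route report jobs through a bounded thread pool. This is a sketch, not the tool's built-in batch mechanism; run_report_jobs is a hypothetical helper, and each job is assumed to be a zero-argument callable that wraps a single call such as generate_excel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def run_report_jobs(jobs: Sequence[Callable[[], T]], max_concurrent: int = 2) -> List[T]:
    """Run report jobs with at most max_concurrent executing at the same time.

    Results are returned in the same order as the submitted jobs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [pool.submit(job) for job in jobs]
        return [future.result() for future in futures]
```

A job can be built with a lambda or functools.partial, for example lambda: tool.generate_excel(sheets=sheets, output_path=path). Any exception raised inside a job is re-raised when its result is collected.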
Scraper Tool (Web Scraping Tool)
Feature Overview
The Scraper Tool is a web scraping tool that supports multiple HTTP clients, JavaScript rendering, HTML parsing, and advanced crawling functionality.
Main Features:
HTTP Requests: Supports httpx, urllib, and other clients
JavaScript Rendering: Uses Playwright for dynamic content scraping
HTML Parsing: Uses BeautifulSoup and lxml for content parsing
Advanced Crawling: Integrates Scrapy for complex crawling projects
Multi-format Output: Supports text, JSON, HTML, Markdown, CSV output
Special Dependency Instructions
1. Playwright Browser Dependencies
Purpose: JavaScript rendering functionality (render() method)
Dependency Contents:
Python Package: playwright (already installed)
Browser Binaries: Chromium, Firefox, WebKit
System Dependencies: System libraries required for browser operation
Installation Steps:
Download Browsers:
cd /home/coder1/python-middleware-dev
poetry run playwright install
Install System Dependencies:
# Method 1: Use Playwright automatic installation
poetry run playwright install-deps
# Method 2: Manual installation (Ubuntu/Debian)
sudo apt-get install libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libxkbcommon0 \
    libatspi2.0-0 \
    libxcomposite1 \
    libxdamage1 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libasound2
# Method 3: Root account installation (recommended)
# Temporarily install the playwright Python package
pip install playwright
# Run the playwright command to install system dependencies
python -m playwright install-deps
# After the dependencies are installed, uninstall the temporary playwright package to keep the root environment clean
pip uninstall playwright -y
Browser Storage Location:
Path: ~/.cache/ms-playwright/
Size: Approximately 400-500MB (all browsers)
Contains: Chromium, Firefox, WebKit, FFMPEG
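To see which browsers are actually present and how much disk space they use, the cache directory can be inspected directly. The sketch below assumes the default path listed above; playwright_cache_summary is a hypothetical helper, and the optional cache_dir argument covers installations that store browsers elsewhere.

```python
from pathlib import Path
from typing import Dict, Optional


def playwright_cache_summary(cache_dir: Optional[str] = None) -> Dict[str, float]:
    """Map each browser directory in the cache to its size in MiB."""
    root = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "ms-playwright"
    if not root.is_dir():
        return {}  # nothing installed at this location
    summary = {}
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            size_bytes = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
            summary[entry.name] = round(size_bytes / (1024 * 1024), 1)
    return summary


if __name__ == "__main__":
    for name, size_mib in playwright_cache_summary().items():
        print(f"{name}: {size_mib} MiB")
```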
Feature Support:
Page Rendering: Waits for JavaScript execution to complete
Element Waiting: Waits for specific CSS selectors
Page Scrolling: Scrolls to bottom of page
Screenshot Functionality: Saves page screenshots
Multi-browser: Supports Chromium, Firefox, WebKit
2. Scrapy Advanced Crawling Dependencies
Purpose: Advanced crawling functionality (crawl_scrapy() method)
Dependency Contents:
Python Package: scrapy (needs to be installed)
Project Structure: Requires complete Scrapy project
Installation Steps:
cd /home/coder1/python-middleware-dev
poetry add scrapy
Feature Support:
Project-based Crawling: Supports complete Scrapy project structure
Data Pipelines: Data cleaning, deduplication, storage
Middlewares: Request/response processing
Scheduler: Intelligent request scheduling
Monitoring: Detailed logging and statistics
3. Other Dependencies
Python Package Dependencies:
httpx: Asynchronous HTTP client
beautifulsoup4: HTML/XML parsing
lxml: Fast XML and HTML processing
System Dependencies:
Network Connection: Download browsers and access target websites
Memory: Browser operation requires sufficient memory
Disk Space: Browser files approximately 500MB
Usage Examples
Basic HTTP Requests (No Browser Required)
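import asyncio
from aiecs.tools.task_tools.scraper_tool import ScraperTool
scraper = ScraperTool()
async def main():
    # Use fetch method for HTTP requests (await is only valid inside an async function)
    result = await scraper.fetch("https://example.com")
    # Access content
    return result.get("content", "")
html_content = asyncio.run(main())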
JavaScript Rendering (Requires Playwright)
# Need to install Playwright browsers first; run this inside an async function
result = await scraper.render(
url="https://spa-app.com",
wait_time=5,
screenshot=True
)
Advanced Crawling (Requires Scrapy)
# Need to install Scrapy first
result = scraper.crawl_scrapy(
project_path="/path/to/scrapy/project",
spider_name="my_spider",
output_path="output.json"
)
Feature Classification
| Feature Type | Method Name | Requires Browser | Requires Scrapy | Dependencies |
|---|---|---|---|---|
| Basic HTTP | | ❌ Not required | ❌ Not required | httpx |
| Basic HTTP | | ❌ Not required | ❌ Not required | urllib |
| HTML Parsing | | ❌ Not required | ❌ Not required | BeautifulSoup |
| JavaScript Rendering | render() | ✅ Required | ❌ Not required | Playwright + browsers |
| Advanced Crawling | crawl_scrapy() | ❌ Not required | ✅ Required | Scrapy |
Notes
General Notes
Network Limits: Comply with website robots.txt and access frequency limits
Legal Compliance: Ensure scraping behavior complies with relevant laws and regulations
Resource Management: Reasonably control concurrent request numbers
Error Handling: Implement appropriate retry and error handling mechanisms
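A robots.txt check can be done with the standard library before a URL is requested. This is a minimal sketch; is_allowed is a hypothetical helper, the user agent name is a placeholder, and the robots.txt text is passed in as a string so the caller decides how to download and cache it.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "my-scraper") -> bool:
    """Return True if the given robots.txt content permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "https://example.com/private/page"))
    print(is_allowed(rules, "https://example.com/public/page"))
```

RobotFileParser can also download the file itself through set_url() and read(); parsing a string keeps the sketch free of network access.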
Troubleshooting
Playwright Issues
# Check if browsers are installed
poetry run playwright install --list
# Reinstall browsers
poetry run playwright install --force
# Install any missing system dependencies
poetry run playwright install-deps
Scrapy Issues
# Check if Scrapy is installed
poetry run scrapy --version
# Create test project
poetry run scrapy startproject test_project
Network Issues
Proxy Settings: Configure HTTP proxy
Timeout Settings: Adjust request timeout duration
Retry Mechanism: Implement automatic retry logic
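The timeout and retry items can be combined in one small wrapper around any awaitable operation. The sketch below is an assumption-laden illustration rather than a built-in Scraper Tool feature: retry_async is a hypothetical helper, it retries on any exception, and it uses exponential backoff with random jitter between attempts. Real code would usually narrow the except clause to network errors.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    timeout: float = 10.0,
) -> T:
    """Run operation() with a per-attempt timeout, retrying with backoff on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception = RuntimeError("operation was never attempted")
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout)
        except Exception as error:  # sketch: retries on every error type
            last_error = error
            if attempt < attempts - 1:
                # Delay doubles each attempt; jitter avoids synchronized retries
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                await asyncio.sleep(delay)
    raise last_error
```

With the fetch method from the usage examples, a call would look like await retry_async(lambda: scraper.fetch("https://example.com")).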