Classifier Tool Configuration Guide
Overview
The Classifier Tool provides text classification, NLP operations, and analysis capabilities. It supports multiple languages (English and Chinese) and can be configured via environment variables using the CLASSIFIER_TOOL_ prefix or through programmatic configuration.
Using .env Files in Your Project
When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Classifier Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.
Setting Up .env Files
1. Install python-dotenv:
pip install python-dotenv
2. Create a .env file in your project root:
# .env file in your project root
CLASSIFIER_TOOL_MAX_WORKERS=16
CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=3600
CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=10
CLASSIFIER_TOOL_MAX_TEXT_LENGTH=10000
CLASSIFIER_TOOL_SPACY_MODEL_EN=en_core_web_sm
CLASSIFIER_TOOL_SPACY_MODEL_ZH=zh_core_web_sm
CLASSIFIER_TOOL_ALLOWED_MODELS=["en_core_web_sm","zh_core_web_sm"]
CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true
CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=100
CLASSIFIER_TOOL_RATE_LIMIT_WINDOW=60
CLASSIFIER_TOOL_USE_RAKE_FOR_ENGLISH=true
3. Load the .env file in your application:
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv
# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()
# Now import and use aiecs tools
from aiecs.tools.task_tools.classfire_tool import ClassifierTool
# The tool will automatically use the environment variables
classifier = ClassifierTool()
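To confirm that load_dotenv() ran before the tool was constructed, it can help to list the CLASSIFIER_TOOL_ variables visible to the process. The following is a minimal sketch using only the standard library; the helper name classifier_env is hypothetical and is not part of aiecs.

```python
import os

PREFIX = "CLASSIFIER_TOOL_"

def classifier_env(environ=None):
    """Return {setting_name: raw_value} for prefixed variables (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return {
        key[len(PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }

if __name__ == "__main__":
    # Call this after load_dotenv(); an empty result suggests the .env file was not loaded.
    for name, value in sorted(classifier_env().items()):
        print(f"{name} = {value}")
```

Values are returned as raw strings, exactly as they appear in the environment.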
Multiple Environment Files
You can use different .env files for different environments:
import os
from dotenv import load_dotenv
# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')
if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')
from aiecs.tools.task_tools.classfire_tool import ClassifierTool
classifier = ClassifierTool()
Example .env.production:
# Production settings - stricter limits and better models
CLASSIFIER_TOOL_MAX_WORKERS=32
CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true
CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=100
CLASSIFIER_TOOL_SPACY_MODEL_EN=en_core_web_md
CLASSIFIER_TOOL_MAX_TEXT_LENGTH=8000
Example .env.development:
# Development settings - relaxed limits for testing
CLASSIFIER_TOOL_MAX_WORKERS=4
CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=false
CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=300
CLASSIFIER_TOOL_SPACY_MODEL_EN=en_core_web_sm
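The environment selection shown earlier in this section can also be written as a lookup table, which keeps the file names in one place. This is only a sketch: env_file_for and ENV_FILES are hypothetical names, and the fallback to development mirrors the else branch of the original snippet.

```python
import os

# Mapping of APP_ENV values to the .env files used in this guide.
ENV_FILES = {
    "production": ".env.production",
    "staging": ".env.staging",
    "development": ".env.development",
}

def env_file_for(app_env=None):
    """Pick the .env file for an environment name, defaulting to development."""
    if app_env is None:
        app_env = os.getenv("APP_ENV", "development")
    # Unknown names fall back to the development file.
    return ENV_FILES.get(app_env, ".env.development")
```

The result can be passed to load_dotenv(), for example load_dotenv(env_file_for()), before any aiecs imports.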
Best Practices for .env Files
Never commit .env files to version control - Add .env to your .gitignore:
# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging
Provide a template - Create .env.example with documented dummy values:
# .env.example
# Classifier Tool Configuration
# Maximum number of worker threads
CLASSIFIER_TOOL_MAX_WORKERS=16
# Cache settings (in seconds)
CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=3600
CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=10
# Text processing limits
CLASSIFIER_TOOL_MAX_TEXT_LENGTH=10000
# SpaCy models
CLASSIFIER_TOOL_SPACY_MODEL_EN=en_core_web_sm
CLASSIFIER_TOOL_SPACY_MODEL_ZH=zh_core_web_sm
# Security settings
CLASSIFIER_TOOL_ALLOWED_MODELS=["en_core_web_sm","zh_core_web_sm"]
# Rate limiting
CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true
CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=100
CLASSIFIER_TOOL_RATE_LIMIT_WINDOW=60
# Feature flags
CLASSIFIER_TOOL_USE_RAKE_FOR_ENGLISH=true
Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format complex types correctly:
Booleans: true, false, 1, 0, yes, no
Lists: Use JSON array format with double quotes: ["item1","item2"]
Numbers: Plain integers or floats: 100, 3600
Configuration Options
1. Max Workers
Environment Variable: CLASSIFIER_TOOL_MAX_WORKERS
Type: Integer
Default: min(32, (os.cpu_count() or 4) * 2)
Description: Maximum number of worker threads for parallel processing. The default dynamically adjusts based on CPU count.
Example:
export CLASSIFIER_TOOL_MAX_WORKERS=16
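The default expression above can be evaluated on the target machine to see what value applies when the variable is unset. A small sketch follows; the function name default_max_workers and its cpu_count parameter are hypothetical, added only so the formula can be tried with different CPU counts.

```python
import os

def default_max_workers(cpu_count=None):
    """Twice the CPU count, capped at 32; 4 CPUs are assumed if the count is unknown."""
    if cpu_count is None:
        cpu_count = os.cpu_count()  # may return None on some platforms
    return min(32, (cpu_count or 4) * 2)

print(default_max_workers())
```

With 2 CPUs the formula gives 4 workers; from 16 CPUs upward the cap of 32 applies.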
2. Pipeline Cache TTL
Environment Variable: CLASSIFIER_TOOL_PIPELINE_CACHE_TTL
Type: Integer
Default: 3600 (1 hour)
Description: Time-to-live for pipeline cache in seconds. Pipelines are expensive to load, so caching improves performance.
Example:
export CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=7200 # 2 hours
3. Pipeline Cache Size
Environment Variable: CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE
Type: Integer
Default: 10
Description: Maximum number of pipeline entries to cache. Each pipeline can consume significant memory.
Example:
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=5
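To illustrate how a TTL and a size limit interact, here is a minimal cache sketch. It is not the tool's actual implementation, which this guide does not show; the class name TTLCache, the least-recently-used eviction policy, and the injectable clock are all assumptions made for the example.

```python
import time
from collections import OrderedDict

class TTLCache:
    """Illustrative cache: entries expire after `ttl` seconds, and the
    least recently used entry is evicted once `max_size` is exceeded."""

    def __init__(self, max_size=10, ttl=3600, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # entry has outlived its TTL
            return None
        self._items.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        self._items[key] = (self._clock() + self.ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict least recently used
```

Under this model, a small cache size bounds memory, while a short TTL forces pipelines to be reloaded more often.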
4. Max Text Length
Environment Variable: CLASSIFIER_TOOL_MAX_TEXT_LENGTH
Type: Integer
Default: 10000 characters
Description: Maximum allowed text length for processing. This is a security and performance constraint.
Example:
export CLASSIFIER_TOOL_MAX_TEXT_LENGTH=5000
5. SpaCy Model (English)
Environment Variable: CLASSIFIER_TOOL_SPACY_MODEL_EN
Type: String
Default: "en_core_web_sm"
Description: SpaCy model to use for English text processing.
Available Models:
en_core_web_sm - Small model (default, faster)
en_core_web_md - Medium model (more accurate)
en_core_web_lg - Large model (most accurate, slower)
Example:
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_md"
6. SpaCy Model (Chinese)
Environment Variable: CLASSIFIER_TOOL_SPACY_MODEL_ZH
Type: String
Default: "zh_core_web_sm"
Description: SpaCy model to use for Chinese text processing.
Available Models:
zh_core_web_sm - Small model (default)
zh_core_web_md - Medium model
zh_core_web_lg - Large model
Example:
export CLASSIFIER_TOOL_SPACY_MODEL_ZH="zh_core_web_md"
7. Allowed Models
Environment Variable: CLASSIFIER_TOOL_ALLOWED_MODELS
Type: List[str]
Default: ["en_core_web_sm", "zh_core_web_sm"]
Description: List of allowed spaCy models that can be loaded. This is a security feature to prevent arbitrary model loading.
Format: JSON array string
Example:
export CLASSIFIER_TOOL_ALLOWED_MODELS='["en_core_web_sm","en_core_web_md","zh_core_web_sm"]'
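An allowlist check of this kind can be sketched in a few lines. The guard below is hypothetical: the function name ensure_model_allowed and the exact error raised are assumptions, and the tool's own check may differ.

```python
DEFAULT_ALLOWED_MODELS = ["en_core_web_sm", "zh_core_web_sm"]

def ensure_model_allowed(model_name, allowed_models=DEFAULT_ALLOWED_MODELS):
    """Return model_name if it is on the allowlist, otherwise raise ValueError."""
    if model_name not in allowed_models:
        raise ValueError(
            f"Model {model_name!r} is not in allowed_models: {sorted(allowed_models)}"
        )
    return model_name
```

With the default list, en_core_web_md would be rejected until it is added to CLASSIFIER_TOOL_ALLOWED_MODELS, as in the example above.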
8. Rate Limit Enabled
Environment Variable: CLASSIFIER_TOOL_RATE_LIMIT_ENABLED
Type: Boolean
Default: true
Description: Enable or disable rate limiting for API requests.
Format: Pydantic accepts various boolean representations: true, false, 1, 0, yes, no, on, off
Example:
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=false
9. Rate Limit Requests
Environment Variable: CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS
Type: Integer
Default: 100
Description: Maximum number of requests allowed per rate limit window.
Example:
export CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=200
10. Rate Limit Window
Environment Variable: CLASSIFIER_TOOL_RATE_LIMIT_WINDOW
Type: Integer
Default: 60 seconds
Description: Time window (in seconds) for rate limiting.
Example:
export CLASSIFIER_TOOL_RATE_LIMIT_WINDOW=120 # 2 minutes
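The relationship between the request count and the window can be pictured with a sliding-window limiter. This sketch is an assumption about the general technique, not a description of the tool's internal algorithm; the class name SlidingWindowLimiter and the injectable clock are hypothetical.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `max_requests` calls within any `window` seconds (illustrative only)."""

    def __init__(self, max_requests=100, window=60, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self._clock = clock
        self._timestamps = deque()  # times of accepted requests, oldest first

    def allow(self):
        now = self._clock()
        # Drop timestamps that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False
        self._timestamps.append(now)
        return True
```

With the defaults of 100 requests and a 60-second window, the 101st call inside one minute would be refused under this model.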
11. Use RAKE for English
Environment Variable: CLASSIFIER_TOOL_USE_RAKE_FOR_ENGLISH
Type: Boolean
Default: true
Description: Use RAKE (Rapid Automatic Keyword Extraction) algorithm for English keyword extraction. If disabled, falls back to spaCy-based extraction.
Example:
export CLASSIFIER_TOOL_USE_RAKE_FOR_ENGLISH=false
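For readers unfamiliar with RAKE, the core idea is sketched below: split the text into candidate phrases at stopwords and punctuation, score each word by degree divided by frequency, and score each phrase as the sum of its word scores. This is a from-scratch illustration with a deliberately tiny stopword list; which RAKE library the tool uses internally is not stated in this guide, and the function name rake_keywords is hypothetical.

```python
import re
from collections import defaultdict

# Tiny stopword list for illustration; real implementations use a much larger one.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "is", "are", "in", "on", "to", "with", "by"}

def rake_keywords(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs sorted by descending RAKE score."""
    phrases = []
    # Punctuation always ends a candidate phrase.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                # A stopword also ends the current candidate phrase.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in phrases
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Longer phrases made of words that rarely appear alone tend to score highest, which is why RAKE favors multi-word key phrases.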
Usage Examples
Example 1: Basic Environment Configuration
# Configure for high-performance processing
export CLASSIFIER_TOOL_MAX_WORKERS=32
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=20
export CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=500
# Use larger models for better accuracy
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_md"
# Run your application
python app.py
Example 2: Development Environment
# Disable rate limiting for testing
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=false
# Use smaller cache for memory-constrained systems
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=3
export CLASSIFIER_TOOL_MAX_WORKERS=4
# Shorter cache TTL for rapid iteration
export CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=300 # 5 minutes
Example 3: Production Environment
# Strict rate limiting
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true
export CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=100
export CLASSIFIER_TOOL_RATE_LIMIT_WINDOW=60
# Optimized performance
export CLASSIFIER_TOOL_MAX_WORKERS=24
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=15
export CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=7200
# Security: limit text length
export CLASSIFIER_TOOL_MAX_TEXT_LENGTH=8000
Example 4: Programmatic Configuration
from aiecs.tools.task_tools.classfire_tool import ClassifierTool
# Initialize with custom configuration
classifier = ClassifierTool(config={
    'max_workers': 16,
    'pipeline_cache_ttl': 3600,
    'pipeline_cache_size': 10,
    'max_text_length': 5000,
    'spacy_model_en': 'en_core_web_md',
    'spacy_model_zh': 'zh_core_web_sm',
    'rate_limit_enabled': True,
    'rate_limit_requests': 200,
    'rate_limit_window': 60,
    'use_rake_for_english': True
})
Example 5: Mixed Configuration
# In the shell: set environment defaults
export CLASSIFIER_TOOL_MAX_WORKERS=20
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true
# In Python: override specific settings programmatically
from aiecs.tools.task_tools.classfire_tool import ClassifierTool
classifier = ClassifierTool(config={
    'rate_limit_enabled': False,  # Override env var
    'spacy_model_en': 'en_core_web_lg'  # Use larger model
})
Configuration Priority
Configuration values are resolved in the following order (highest to lowest priority):
1. Programmatic config - Values passed to the constructor
2. Environment variables - Values set via CLASSIFIER_TOOL_* variables
3. Default values - Built-in defaults as specified above
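The precedence order can be sketched with collections.ChainMap, which searches its mappings from left to right. This is an illustration of the ordering only, not the tool's actual resolution code; resolve_config and its parameters are hypothetical, and environment values are left as raw strings because type conversion is a separate step.

```python
import os
from collections import ChainMap

PREFIX = "CLASSIFIER_TOOL_"

def resolve_config(overrides=None, environ=None, defaults=None):
    """Merge settings so that overrides beat environment values, which beat defaults."""
    environ = os.environ if environ is None else environ
    env_values = {
        key[len(PREFIX):].lower(): value  # raw string from the environment
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }
    # ChainMap looks keys up left to right, so earlier mappings win.
    return dict(ChainMap(overrides or {}, env_values, defaults or {}))
```

Applied to the mixed configuration from Example 5, a programmatic rate_limit_enabled of False would win over the environment value, while max_workers would still come from the environment.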
Data Type Parsing
Boolean Values
Pydantic accepts multiple boolean representations:
True: true, 1, yes, on, True, TRUE
False: false, 0, no, off, False, FALSE
Example:
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=yes # Parsed as True
List Values
Lists must be provided as JSON array strings:
# Correct
export CLASSIFIER_TOOL_ALLOWED_MODELS='["en_core_web_sm","zh_core_web_sm"]'
# Incorrect (will not parse)
export CLASSIFIER_TOOL_ALLOWED_MODELS="en_core_web_sm,zh_core_web_sm"
Integer Values
Integers should be provided as numeric strings:
export CLASSIFIER_TOOL_MAX_WORKERS=16
export CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=3600
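The parsing rules in this section can be approximated with the standard library, which is useful for checking a value before deploying it. Pydantic performs the real parsing inside the tool; the helpers parse_bool and parse_list below are hypothetical and cover only the representations listed in this guide.

```python
import json

TRUE_VALUES = {"true", "1", "yes", "on"}
FALSE_VALUES = {"false", "0", "no", "off"}

def parse_bool(raw):
    """Interpret a raw environment string as a boolean, ignoring case."""
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"Not a recognized boolean: {raw!r}")

def parse_list(raw):
    """Interpret a raw environment string as a JSON array."""
    parsed = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad input
    if not isinstance(parsed, list):
        raise ValueError(f"Expected a JSON array, got: {raw!r}")
    return parsed

def parse_int(raw):
    """Interpret a raw environment string as an integer."""
    return int(raw.strip())
```

A comma-separated value such as en_core_web_sm,zh_core_web_sm fails in parse_list because it is not valid JSON, which matches the incorrect example above.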
Validation
Automatic Type Validation
Pydantic automatically validates configuration values:
Integer fields must contain valid integers
Boolean fields must contain valid boolean representations
List fields must contain valid JSON arrays
String fields accept any string value
Custom Validation
The tool includes custom validators for:
max_text_length: Applied to all text inputs
allowed_models: Checked when loading models
rate_limit_requests: Must be positive
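Two of these checks are simple enough to sketch directly. The functions below are hypothetical stand-ins written for illustration; the names, messages, and exception type are assumptions and may not match the tool's own validators.

```python
def validate_text(text, max_text_length=10000):
    """Reject text longer than max_text_length characters."""
    if len(text) > max_text_length:
        raise ValueError(
            f"Text length {len(text)} exceeds max_text_length {max_text_length}"
        )
    return text

def validate_rate_limit_requests(value):
    """Require a positive number of requests per window."""
    if value <= 0:
        raise ValueError("rate_limit_requests must be positive")
    return value
```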
Security Validation
Text inputs are validated for:
Maximum length constraints
Potentially malicious SQL injection patterns
Other security threats
Performance Tuning
Memory Optimization
For memory-constrained environments:
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=3
export CLASSIFIER_TOOL_MAX_WORKERS=4
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_sm"
Speed Optimization
For high-throughput environments:
export CLASSIFIER_TOOL_MAX_WORKERS=32
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=20
export CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=7200
Accuracy Optimization
For maximum accuracy (at the cost of speed/memory):
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_lg"
export CLASSIFIER_TOOL_SPACY_MODEL_ZH="zh_core_web_lg"
export CLASSIFIER_TOOL_ALLOWED_MODELS='["en_core_web_lg","zh_core_web_lg"]'
Model Installation
Before using specific models, ensure they are installed:
# Install spaCy models
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
python -m spacy download zh_core_web_sm
python -m spacy download zh_core_web_md
python -m spacy download zh_core_web_lg
Troubleshooting
Issue: Model not found
Error: OSError: [E050] Can't find model 'en_core_web_md'
Solution:
# Download the required model
python -m spacy download en_core_web_md
# Or set to an installed model
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_sm"
Issue: Rate limit exceeded
Error: Rate limit exceeded. Please try again later.
Solution:
# Increase rate limits
export CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=500
export CLASSIFIER_TOOL_RATE_LIMIT_WINDOW=60
# Or disable for testing
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=false
Issue: Out of memory
Cause: Too many cached pipelines or workers
Solution:
# Reduce cache and workers
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=3
export CLASSIFIER_TOOL_MAX_WORKERS=4
# Use smaller models
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_sm"
Issue: Boolean environment variable not working
Cause: Incorrect boolean format
Solution:
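# Use recognized boolean values
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true # or false, 1, 0, yes, no, on, off
# Capitalization does not matter (True and FALSE are accepted), but quote
# characters that end up inside the value itself can cause parsing to fail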
Issue: List parsing error
Cause: Invalid JSON format for list values
Solution:
# Use proper JSON array syntax
export CLASSIFIER_TOOL_ALLOWED_MODELS='["en_core_web_sm","zh_core_web_sm"]'
# Make sure to use double quotes inside the array
# Single quotes for the shell, double quotes for JSON
Best Practices
Resource Management:
Set max_workers to 2x CPU count for I/O-bound tasks
Limit pipeline_cache_size based on available memory
Use appropriate pipeline_cache_ttl for your workload
Security:
Keep rate_limit_enabled=true in production
Restrict allowed_models to only necessary models
Set conservative max_text_length limits
Performance:
Use smaller models (_sm) for faster processing
Use larger models (_lg) when accuracy is critical
Tune cache settings based on usage patterns
Language Support:
The tool auto-detects language if not specified
Pre-load models for languages you frequently use
Consider separate instances for different languages
Operations Supported
The Classifier Tool supports the following operations:
classify: Sentiment classification
tokenize: Text tokenization
pos_tag: Part-of-speech tagging
ner: Named entity recognition
lemmatize: Token lemmatization
dependency_parse: Dependency parsing
keyword_extract: Keyword/phrase extraction
summarize: Text summarization
batch_process: Batch processing of multiple texts
Support
For issues or questions about Classifier Tool configuration:
Check the tool source code for implementation details
Review spaCy documentation for model-specific information
Consult the main documentation for architecture overview