# Classifier Tool Configuration Guide

## Overview

The Classifier Tool provides text classification, NLP operations, and analysis capabilities. It supports multiple languages (English and Chinese) and can be configured via environment variables using the `CLASSIFIER_TOOL_` prefix or through programmatic configuration.

## Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The Classifier Tool reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools.

### Setting Up .env Files

**1. Install python-dotenv:**

```bash
pip install python-dotenv
```

**2. Create a `.env` file in your project root:**

```bash
# .env file in your project root
CLASSIFIER_TOOL_MAX_WORKERS=16
CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=3600
CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=10
CLASSIFIER_TOOL_MAX_TEXT_LENGTH=10000
CLASSIFIER_TOOL_SPACY_MODEL_EN=en_core_web_sm
CLASSIFIER_TOOL_SPACY_MODEL_ZH=zh_core_web_sm
CLASSIFIER_TOOL_ALLOWED_MODELS=["en_core_web_sm","zh_core_web_sm"]
CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true
CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=100
CLASSIFIER_TOOL_RATE_LIMIT_WINDOW=60
CLASSIFIER_TOOL_USE_RAKE_FOR_ENGLISH=true
```

**3. Load the .env file in your application:**

```python
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.task_tools.classfire_tool import ClassifierTool

# The tool will automatically use the environment variables
classifier = ClassifierTool()
```

### Multiple Environment Files

You can use different `.env` files for different environments:

```python
import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.task_tools.classfire_tool import ClassifierTool
classifier = ClassifierTool()
```

**Example `.env.production`:**

```bash
# Production settings - stricter limits and better models
CLASSIFIER_TOOL_MAX_WORKERS=32
CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true
CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=100
CLASSIFIER_TOOL_SPACY_MODEL_EN=en_core_web_md
CLASSIFIER_TOOL_MAX_TEXT_LENGTH=8000
```

**Example `.env.development`:**

```bash
# Development settings - relaxed limits for testing
CLASSIFIER_TOOL_MAX_WORKERS=4
CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=false
CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=300
CLASSIFIER_TOOL_SPACY_MODEL_EN=en_core_web_sm
```

### Best Practices for .env Files

1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`:

   ```gitignore
   # .gitignore
   .env
   .env.local
   .env.*.local
   .env.production
   .env.staging
   ```

2. **Provide a template** - Create `.env.example` with documented dummy values:

   ```bash
   # .env.example
   # Classifier Tool Configuration

   # Maximum number of worker threads
   CLASSIFIER_TOOL_MAX_WORKERS=16

   # Cache settings (in seconds)
   CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=3600
   CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=10

   # Text processing limits
   CLASSIFIER_TOOL_MAX_TEXT_LENGTH=10000

   # spaCy models
   CLASSIFIER_TOOL_SPACY_MODEL_EN=en_core_web_sm
   CLASSIFIER_TOOL_SPACY_MODEL_ZH=zh_core_web_sm

   # Security settings
   CLASSIFIER_TOOL_ALLOWED_MODELS=["en_core_web_sm","zh_core_web_sm"]

   # Rate limiting
   CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true
   CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=100
   CLASSIFIER_TOOL_RATE_LIMIT_WINDOW=60

   # Feature flags
   CLASSIFIER_TOOL_USE_RAKE_FOR_ENGLISH=true
   ```

3. **Document your variables** - Add comments explaining each setting

4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports

5. **Format complex types correctly**:
   - Booleans: `true`, `false`, `1`, `0`, `yes`, `no`
   - Lists: Use JSON array format with double quotes: `["item1","item2"]`
   - Numbers: Plain integers or floats: `100`, `3600`

## Configuration Options

### 1. Max Workers

**Environment Variable:** `CLASSIFIER_TOOL_MAX_WORKERS`

**Type:** Integer

**Default:** `min(32, (os.cpu_count() or 4) * 2)`

**Description:** Maximum number of worker threads for parallel processing. The default dynamically adjusts based on CPU count.

**Example:**

```bash
export CLASSIFIER_TOOL_MAX_WORKERS=16
```

### 2. Pipeline Cache TTL

**Environment Variable:** `CLASSIFIER_TOOL_PIPELINE_CACHE_TTL`

**Type:** Integer

**Default:** `3600` (1 hour)

**Description:** Time-to-live for pipeline cache in seconds. Pipelines are expensive to load, so caching improves performance.

**Example:**

```bash
export CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=7200  # 2 hours
```

### 3. Pipeline Cache Size

**Environment Variable:** `CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE`

**Type:** Integer

**Default:** `10`

**Description:** Maximum number of pipeline entries to cache. Each pipeline can consume significant memory.

**Example:**

```bash
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=5
```

### 4. Max Text Length

**Environment Variable:** `CLASSIFIER_TOOL_MAX_TEXT_LENGTH`

**Type:** Integer

**Default:** `10000` characters

**Description:** Maximum allowed text length for processing. This is a security and performance constraint.

**Example:**

```bash
export CLASSIFIER_TOOL_MAX_TEXT_LENGTH=5000
```

### 5. spaCy Model (English)

**Environment Variable:** `CLASSIFIER_TOOL_SPACY_MODEL_EN`

**Type:** String

**Default:** `"en_core_web_sm"`

**Description:** spaCy model to use for English text processing.

**Available Models:**

- `en_core_web_sm` - Small model (default, faster)
- `en_core_web_md` - Medium model (more accurate)
- `en_core_web_lg` - Large model (most accurate, slower)

**Example:**

```bash
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_md"
```

### 6. spaCy Model (Chinese)

**Environment Variable:** `CLASSIFIER_TOOL_SPACY_MODEL_ZH`

**Type:** String

**Default:** `"zh_core_web_sm"`

**Description:** spaCy model to use for Chinese text processing.

**Available Models:**

- `zh_core_web_sm` - Small model (default)
- `zh_core_web_md` - Medium model
- `zh_core_web_lg` - Large model

**Example:**

```bash
export CLASSIFIER_TOOL_SPACY_MODEL_ZH="zh_core_web_md"
```

### 7. Allowed Models

**Environment Variable:** `CLASSIFIER_TOOL_ALLOWED_MODELS`

**Type:** List[str]

**Default:** `["en_core_web_sm", "zh_core_web_sm"]`

**Description:** List of allowed spaCy models that can be loaded. This is a security feature to prevent arbitrary model loading.

**Format:** JSON array string

**Example:**

```bash
export CLASSIFIER_TOOL_ALLOWED_MODELS='["en_core_web_sm","en_core_web_md","zh_core_web_sm"]'
```

### 8. Rate Limit Enabled

**Environment Variable:** `CLASSIFIER_TOOL_RATE_LIMIT_ENABLED`

**Type:** Boolean

**Default:** `true`

**Description:** Enable or disable rate limiting for API requests.

**Format:** Pydantic accepts various boolean representations: `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`

**Example:**

```bash
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=false
```

### 9. Rate Limit Requests

**Environment Variable:** `CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS`

**Type:** Integer

**Default:** `100`

**Description:** Maximum number of requests allowed per rate limit window.

**Example:**

```bash
export CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=200
```

### 10. Rate Limit Window

**Environment Variable:** `CLASSIFIER_TOOL_RATE_LIMIT_WINDOW`

**Type:** Integer

**Default:** `60` seconds

**Description:** Time window (in seconds) for rate limiting.

**Example:**

```bash
export CLASSIFIER_TOOL_RATE_LIMIT_WINDOW=120  # 2 minutes
```

### 11. Use RAKE for English

**Environment Variable:** `CLASSIFIER_TOOL_USE_RAKE_FOR_ENGLISH`

**Type:** Boolean

**Default:** `true`

**Description:** Use the RAKE (Rapid Automatic Keyword Extraction) algorithm for English keyword extraction. If disabled, the tool falls back to spaCy-based extraction.


**Example:**

```bash
export CLASSIFIER_TOOL_USE_RAKE_FOR_ENGLISH=false
```

## Usage Examples

### Example 1: Basic Environment Configuration

```bash
# Configure for high-performance processing
export CLASSIFIER_TOOL_MAX_WORKERS=32
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=20
export CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=500

# Use larger models for better accuracy
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_md"

# Run your application
python app.py
```

### Example 2: Development Environment

```bash
# Disable rate limiting for testing
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=false

# Use smaller cache for memory-constrained systems
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=3
export CLASSIFIER_TOOL_MAX_WORKERS=4

# Shorter cache TTL for rapid iteration
export CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=300  # 5 minutes
```

### Example 3: Production Environment

```bash
# Strict rate limiting
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true
export CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=100
export CLASSIFIER_TOOL_RATE_LIMIT_WINDOW=60

# Optimized performance
export CLASSIFIER_TOOL_MAX_WORKERS=24
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=15
export CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=7200

# Security: limit text length
export CLASSIFIER_TOOL_MAX_TEXT_LENGTH=8000
```

### Example 4: Programmatic Configuration

```python
from aiecs.tools.task_tools.classfire_tool import ClassifierTool

# Initialize with custom configuration
classifier = ClassifierTool(config={
    'max_workers': 16,
    'pipeline_cache_ttl': 3600,
    'pipeline_cache_size': 10,
    'max_text_length': 5000,
    'spacy_model_en': 'en_core_web_md',
    'spacy_model_zh': 'zh_core_web_sm',
    'rate_limit_enabled': True,
    'rate_limit_requests': 200,
    'rate_limit_window': 60,
    'use_rake_for_english': True
})
```

### Example 5: Mixed Configuration

```bash
# Set environment defaults
export CLASSIFIER_TOOL_MAX_WORKERS=20
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true
```

```python
from aiecs.tools.task_tools.classfire_tool import ClassifierTool

# Override specific settings programmatically
classifier = ClassifierTool(config={
    'rate_limit_enabled': False,       # Override env var
    'spacy_model_en': 'en_core_web_lg' # Use larger model
})
```

## Configuration Priority

Configuration values are resolved in the following order (highest to lowest priority):

1. **Programmatic config** - Values passed to the constructor
2. **Environment variables** - Values set via `CLASSIFIER_TOOL_*` variables
3. **Default values** - Built-in defaults as specified above

## Data Type Parsing

### Boolean Values

Pydantic accepts multiple boolean representations:

- **True:** `true`, `1`, `yes`, `on`, `True`, `TRUE`
- **False:** `false`, `0`, `no`, `off`, `False`, `FALSE`

Example:

```bash
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=yes  # Parsed as True
```

### List Values

Lists must be provided as JSON array strings:

```bash
# Correct
export CLASSIFIER_TOOL_ALLOWED_MODELS='["en_core_web_sm","zh_core_web_sm"]'

# Incorrect (will not parse)
export CLASSIFIER_TOOL_ALLOWED_MODELS="en_core_web_sm,zh_core_web_sm"
```

### Integer Values

Integers should be provided as numeric strings:

```bash
export CLASSIFIER_TOOL_MAX_WORKERS=16
export CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=3600
```

## Validation

### Automatic Type Validation

Pydantic automatically validates configuration values:

- Integer fields must contain valid integers
- Boolean fields must contain valid boolean representations
- List fields must contain valid JSON arrays
- String fields accept any string value

### Custom Validation

The tool includes custom validators for:

- **max_text_length**: Applied to all text inputs
- **allowed_models**: Checked when loading models
- **rate_limit_requests**: Must be positive

### Security Validation

Text inputs are validated for:

- Maximum length constraints
- Potentially malicious SQL injection patterns
- Other security threats

## Performance Tuning

### Memory Optimization

For memory-constrained environments:

```bash
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=3
export CLASSIFIER_TOOL_MAX_WORKERS=4
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_sm"
```

### Speed Optimization

For high-throughput environments:

```bash
export CLASSIFIER_TOOL_MAX_WORKERS=32
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=20
export CLASSIFIER_TOOL_PIPELINE_CACHE_TTL=7200
```

### Accuracy Optimization

For maximum accuracy (at the cost of speed/memory):

```bash
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_lg"
export CLASSIFIER_TOOL_SPACY_MODEL_ZH="zh_core_web_lg"
export CLASSIFIER_TOOL_ALLOWED_MODELS='["en_core_web_lg","zh_core_web_lg"]'
```

## Model Installation

Before using specific models, ensure they are installed:

```bash
# Install spaCy models
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
python -m spacy download zh_core_web_sm
python -m spacy download zh_core_web_md
python -m spacy download zh_core_web_lg
```

## Troubleshooting

### Issue: Model not found

**Error:** `OSError: [E050] Can't find model 'en_core_web_md'`

**Solution:**

```bash
# Download the required model
python -m spacy download en_core_web_md

# Or set to an installed model
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_sm"
```

### Issue: Rate limit exceeded

**Error:** `Rate limit exceeded. Please try again later.`

**Solution:**

```bash
# Increase rate limits
export CLASSIFIER_TOOL_RATE_LIMIT_REQUESTS=500
export CLASSIFIER_TOOL_RATE_LIMIT_WINDOW=60

# Or disable for testing
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=false
```

### Issue: Out of memory

**Cause:** Too many cached pipelines or workers

**Solution:**

```bash
# Reduce cache and workers
export CLASSIFIER_TOOL_PIPELINE_CACHE_SIZE=3
export CLASSIFIER_TOOL_MAX_WORKERS=4

# Use smaller models
export CLASSIFIER_TOOL_SPACY_MODEL_EN="en_core_web_sm"
```

### Issue: Boolean environment variable not working

**Cause:** Incorrect boolean format

**Solution:**

```bash
# Use recognized boolean values
export CLASSIFIER_TOOL_RATE_LIMIT_ENABLED=true  # or false, 1, 0, yes, no
# Avoid values that contain literal quote characters, such as '"True"' or '"FALSE"'
```

### Issue: List parsing error

**Cause:** Invalid JSON format for list values

**Solution:**

```bash
# Use proper JSON array syntax
export CLASSIFIER_TOOL_ALLOWED_MODELS='["en_core_web_sm","zh_core_web_sm"]'

# Make sure to use double quotes inside the array
# Single quotes for the shell, double quotes for JSON
```

## Best Practices

1. **Resource Management:**
   - Set `max_workers` to 2x CPU count for I/O-bound tasks
   - Limit `pipeline_cache_size` based on available memory
   - Use appropriate `pipeline_cache_ttl` for your workload

2. **Security:**
   - Keep `rate_limit_enabled=true` in production
   - Restrict `allowed_models` to only necessary models
   - Set conservative `max_text_length` limits

3. **Performance:**
   - Use smaller models (`_sm`) for faster processing
   - Use larger models (`_lg`) when accuracy is critical
   - Tune cache settings based on usage patterns

4. **Language Support:**
   - The tool auto-detects language if not specified
   - Pre-load models for languages you frequently use
   - Consider separate instances for different languages

## Operations Supported

The Classifier Tool supports the following operations:

- **classify**: Sentiment classification
- **tokenize**: Text tokenization
- **pos_tag**: Part-of-speech tagging
- **ner**: Named entity recognition
- **lemmatize**: Token lemmatization
- **dependency_parse**: Dependency parsing
- **keyword_extract**: Keyword/phrase extraction
- **summarize**: Text summarization
- **batch_process**: Batch processing of multiple texts

## Related Documentation

- Tool implementation details in the source code
- NLP best practices in `TOOL_SPECIAL_SPECIAL_INSTRUCTIONS.md`
- spaCy documentation: https://spacy.io/usage

## Support

For issues or questions about Classifier Tool configuration:

- Check the tool source code for implementation details
- Review spaCy documentation for model-specific information
- Consult the main documentation for architecture overview
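
## Appendix: Resolution Order Sketch

The rules in the Configuration Priority and Data Type Parsing sections can be illustrated with a small, standard-library-only sketch. This is not the aiecs implementation (the guide states that Pydantic performs the real parsing and validation); `DEFAULTS`, `parse_env_value`, and `resolve_settings` are hypothetical names introduced only for illustration, and only a subset of the settings is modeled. It may still be useful for checking how a given environment value would be interpreted before deploying it.

```python
import json
import os

PREFIX = "CLASSIFIER_TOOL_"

# A subset of the defaults listed under Configuration Options (illustrative only).
DEFAULTS = {
    "max_workers": min(32, (os.cpu_count() or 4) * 2),
    "pipeline_cache_ttl": 3600,
    "allowed_models": ["en_core_web_sm", "zh_core_web_sm"],
    "rate_limit_enabled": True,
}

TRUE_VALUES = {"true", "1", "yes", "on"}
FALSE_VALUES = {"false", "0", "no", "off"}


def parse_env_value(raw, default):
    """Convert a raw environment string to the type of the default value."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(default, bool):
        lowered = raw.strip().lower()
        if lowered in TRUE_VALUES:
            return True
        if lowered in FALSE_VALUES:
            return False
        raise ValueError(f"not a recognized boolean: {raw!r}")
    if isinstance(default, int):
        return int(raw)
    if isinstance(default, list):
        parsed = json.loads(raw)  # JSONDecodeError is a ValueError subclass
        if not isinstance(parsed, list):
            raise ValueError(f"expected a JSON array, got: {raw!r}")
        return parsed
    return raw


def resolve_settings(overrides=None, environ=None):
    """Apply defaults, then environment variables, then programmatic overrides."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    for name, default in DEFAULTS.items():
        raw = environ.get(PREFIX + name.upper())
        if raw is not None:
            settings[name] = parse_env_value(raw, default)
    settings.update(overrides or {})
    return settings


# Usage: an explicit mapping stands in for the process environment.
example_env = {
    "CLASSIFIER_TOOL_MAX_WORKERS": "20",
    "CLASSIFIER_TOOL_RATE_LIMIT_ENABLED": "yes",
    "CLASSIFIER_TOOL_ALLOWED_MODELS": '["en_core_web_sm","en_core_web_md"]',
}
resolved = resolve_settings({"rate_limit_enabled": False}, environ=example_env)
print(resolved)
```

In this sketch the programmatic override wins over `CLASSIFIER_TOOL_RATE_LIMIT_ENABLED`, the worker count comes from the environment, and `pipeline_cache_ttl` keeps its default, which mirrors the priority order described above. A comma-separated list such as `en_core_web_sm,zh_core_web_sm` raises an error here because it is not a JSON array.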