Research Tool Configuration Guide
Overview
The Research Tool provides causal inference based on Mill’s methods, together with induction, deduction, and text summarization. It uses spaCy for natural language processing and statistical correlation analysis for quantitative comparisons. The tool can be configured through environment variables with the RESEARCH_TOOL_ prefix, or programmatically when the tool is initialized.
Using .env Files in Your Project
When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Research Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.
Setting Up .env Files
1. Install python-dotenv:
pip install python-dotenv
2. Create a .env file in your project root:
# .env file in your project root
RESEARCH_TOOL_MAX_WORKERS=16
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm","zh_core_web_sm"]
3. Load the .env file in your application:
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv
# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()
# Now import and use aiecs tools
from aiecs.tools.task_tools.research_tool import ResearchTool
# The tool will automatically use the environment variables
research_tool = ResearchTool()
Multiple Environment Files
You can use different .env files for different environments:
import os
from dotenv import load_dotenv
# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')
if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')
from aiecs.tools.task_tools.research_tool import ResearchTool
research_tool = ResearchTool()
Example .env.production:
# Production settings - optimized for performance
RESEARCH_TOOL_MAX_WORKERS=32
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm"]
Example .env.development:
# Development settings - more permissive for testing
RESEARCH_TOOL_MAX_WORKERS=4
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm","zh_core_web_sm"]
Best Practices for .env Files
Never commit .env files to version control - Add .env to your .gitignore:
# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging
Provide a template - Create .env.example with documented dummy values:
# .env.example
# Research Tool Configuration
# Maximum number of worker threads
RESEARCH_TOOL_MAX_WORKERS=16
# Default spaCy model to use
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
# Maximum text length for inputs
RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
# Allowed spaCy models (JSON array)
RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm","zh_core_web_sm"]
Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format complex types correctly:
Strings: Plain text: en_core_web_sm
Integers: Plain numbers: 16, 10000
Lists: JSON array format: ["en_core_web_sm","zh_core_web_sm"]
Configuration Options
1. Max Workers
Environment Variable: RESEARCH_TOOL_MAX_WORKERS
Type: Integer
Default: min(32, (os.cpu_count() or 4) * 2)
Description: Maximum number of worker threads for parallel processing. This affects the concurrency of operations that can be parallelized.
Common Values:
4 - Conservative (development)
8 - Balanced (small servers)
16 - High performance (production)
32 - Maximum (high-end servers)
Example:
export RESEARCH_TOOL_MAX_WORKERS=16
Performance Note: Higher values use more CPU cores but may increase memory usage. Set based on available system resources.
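The documented default can be reproduced with the standard library. The following is a minimal sketch rather than the tool's actual implementation; `default_max_workers` and `resolve_max_workers` are hypothetical helpers introduced only for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_max_workers() -> int:
    """Mirror the documented default: min(32, (os.cpu_count() or 4) * 2)."""
    return min(32, (os.cpu_count() or 4) * 2)


def resolve_max_workers(environ=None) -> int:
    """Hypothetical helper: prefer RESEARCH_TOOL_MAX_WORKERS, else the default."""
    environ = os.environ if environ is None else environ
    raw = environ.get("RESEARCH_TOOL_MAX_WORKERS")
    return int(raw) if raw else default_max_workers()


# Size a thread pool from the resolved value (a dict stands in for os.environ here)
workers = resolve_max_workers({"RESEARCH_TOOL_MAX_WORKERS": "16"})
with ThreadPoolExecutor(max_workers=workers) as pool:
    lengths = list(pool.map(len, ["alpha", "beta", "gamma"]))
```

Because the default doubles the CPU count and caps at 32, an explicit value is mainly useful when the host is shared or memory-constrained.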
2. SpaCy Model
Environment Variable: RESEARCH_TOOL_SPACY_MODEL
Type: String
Default: "en_core_web_sm"
Description: Default spaCy model to use for natural language processing. This model is used for all text analysis operations including tokenization, POS tagging, NER, and dependency parsing.
Available Models:
en_core_web_sm - English small model (default, fastest)
en_core_web_md - English medium model (better accuracy)
en_core_web_lg - English large model (best accuracy)
zh_core_web_sm - Chinese small model
zh_core_web_md - Chinese medium model
Example:
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_md
Installation Note: Models must be installed separately:
python -m spacy download en_core_web_sm
python -m spacy download zh_core_web_sm
3. Max Text Length
Environment Variable: RESEARCH_TOOL_MAX_TEXT_LENGTH
Type: Integer
Default: 10_000
Description: Maximum text length in characters for input processing. This prevents memory issues with extremely long texts and ensures reasonable processing times.
Common Values:
5_000 - Short texts (summaries, abstracts)
10_000 - Standard texts (articles, reports)
50_000 - Long texts (documents, books)
100_000 - Very long texts (research papers)
Example:
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
Memory Note: Longer texts use more memory and processing time. Adjust based on available system resources.
4. Allowed SpaCy Models
Environment Variable: RESEARCH_TOOL_ALLOWED_SPACY_MODELS
Type: List[str]
Default: ["en_core_web_sm", "zh_core_web_sm"]
Description: List of allowed spaCy models that can be used. This is a security feature that prevents loading of unauthorized or potentially malicious models.
Format: JSON array string with double quotes
Common Configurations:
English only: ["en_core_web_sm"]
Multilingual: ["en_core_web_sm", "zh_core_web_sm"]
High accuracy: ["en_core_web_lg", "zh_core_web_md"]
Example:
# English only
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm"]'
# Multilingual support
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'
Security Note: Only include models that are actually needed and have been verified as safe.
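The allowlist check can be pictured as a simple membership test performed before any model is loaded. This is a sketch under assumptions: `ensure_model_allowed` is a hypothetical function, and the error text only imitates the message shown in the Troubleshooting section.

```python
def ensure_model_allowed(model_name, allowed_models):
    """Hypothetical check: reject any model that is not on the allowlist."""
    if model_name not in allowed_models:
        raise ValueError(
            f"Invalid spaCy model '{model_name}', expected {allowed_models}"
        )
    return model_name


allowed = ["en_core_web_sm", "zh_core_web_sm"]

# An allowed model passes through unchanged
chosen = ensure_model_allowed("en_core_web_sm", allowed)

# A model outside the list is rejected before any loading is attempted
try:
    ensure_model_allowed("en_core_web_lg", allowed)
    rejected = False
except ValueError:
    rejected = True
```

Checking the name before calling a loader means an unlisted model is never read from disk.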
Usage Examples
Example 1: Basic Environment Configuration
# Set custom processing parameters
export RESEARCH_TOOL_MAX_WORKERS=16
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_md
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
# Run your application
python app.py
Example 2: High-Performance Configuration
# Optimized for large-scale processing
export RESEARCH_TOOL_MAX_WORKERS=32
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_lg
export RESEARCH_TOOL_MAX_TEXT_LENGTH=100000
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_lg"]'
Example 3: Multilingual Configuration
# Support for multiple languages
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm","de_core_news_sm"]'
Example 4: Programmatic Configuration
from aiecs.tools.task_tools.research_tool import ResearchTool
# Initialize with custom configuration
research_tool = ResearchTool(config={
    'max_workers': 16,
    'spacy_model': 'en_core_web_md',
    'max_text_length': 50000,
    'allowed_spacy_models': ['en_core_web_sm', 'en_core_web_md']
})
Example 5: Mixed Configuration
Environment variables are used as defaults, but can be overridden programmatically:
# Shell: set environment defaults
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
export RESEARCH_TOOL_MAX_WORKERS=8
# Python: override for a specific instance
from aiecs.tools.task_tools.research_tool import ResearchTool
research_tool = ResearchTool(config={
    'spacy_model': 'en_core_web_lg',  # This overrides the environment variable
    'max_workers': 16                 # This overrides the environment variable
})
Configuration Priority
When the Research Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):
Programmatic config - Values passed to the constructor
Environment variables - Values set via RESEARCH_TOOL_* variables
Default values - Built-in defaults as specified above
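The priority order can be modeled as a layered lookup in which the first layer that defines a key wins. The sketch below uses `collections.ChainMap` for this; `resolve_config` and the already-parsed `from_env` dictionary are illustrative assumptions, not the tool's internals.

```python
from collections import ChainMap

# Built-in defaults as documented in the Configuration Options section
DEFAULTS = {
    "spacy_model": "en_core_web_sm",
    "max_text_length": 10_000,
    "allowed_spacy_models": ["en_core_web_sm", "zh_core_web_sm"],
}


def resolve_config(programmatic, from_env):
    """Hypothetical resolver: programmatic > environment > defaults."""
    # ChainMap searches its mappings left to right and returns the first hit
    return dict(ChainMap(programmatic, from_env, DEFAULTS))


resolved = resolve_config(
    programmatic={"spacy_model": "en_core_web_lg"},
    from_env={"spacy_model": "en_core_web_md", "max_text_length": 50_000},
)
# spacy_model comes from the programmatic layer, max_text_length from the
# environment layer, and allowed_spacy_models falls back to the default
```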
Data Type Parsing
String Values
Strings should be provided as plain text without quotes:
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
Integer Values
Integers should be provided as numeric strings:
export RESEARCH_TOOL_MAX_WORKERS=16
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
List Values
Lists must be provided as JSON arrays with double quotes:
# Correct
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'
# Incorrect (will not parse)
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS="en_core_web_sm,zh_core_web_sm"
Important: Use single quotes for the shell, double quotes for JSON:
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'
# Outer single quotes protect the value from the shell
# Inner double quotes are required by JSON
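To see why the comma-separated form fails, it helps to run both values through a JSON parser. This sketch assumes the list value is decoded as JSON, as the format requirement above suggests; `parse_model_list` is a hypothetical helper.

```python
import json


def parse_model_list(raw):
    """Parse a JSON array of strings, as the list format above requires."""
    value = json.loads(raw)
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"expected a JSON array of strings, got: {raw!r}")
    return value


# Correct: a JSON array with double-quoted strings
good = parse_model_list('["en_core_web_sm","zh_core_web_sm"]')

# Incorrect: a bare comma-separated string is not valid JSON
try:
    parse_model_list("en_core_web_sm,zh_core_web_sm")
    bad_error = None
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    bad_error = exc
```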
Validation
Automatic Type Validation
Pydantic automatically validates configuration values:
max_workers must be a positive integer
spacy_model must be a non-empty string
max_text_length must be a positive integer
allowed_spacy_models must be a list of strings
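The four rules can be written out in plain Python to make them concrete. This is a stand-in for illustration only, not the tool's Pydantic model; `validate_config` is a hypothetical helper.

```python
def validate_config(cfg):
    """Return a list of rule violations; an empty list means the config passes."""
    errors = []
    for key in ("max_workers", "max_text_length"):
        value = cfg.get(key)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            errors.append(f"{key} must be a positive integer")
    model = cfg.get("spacy_model")
    if not isinstance(model, str) or not model:
        errors.append("spacy_model must be a non-empty string")
    allowed = cfg.get("allowed_spacy_models")
    if not isinstance(allowed, list) or not all(isinstance(m, str) for m in allowed):
        errors.append("allowed_spacy_models must be a list of strings")
    return errors


valid = validate_config({
    "max_workers": 16,
    "spacy_model": "en_core_web_sm",
    "max_text_length": 10_000,
    "allowed_spacy_models": ["en_core_web_sm"],
})
invalid = validate_config({
    "max_workers": 0,
    "spacy_model": "",
    "max_text_length": 10_000,
    "allowed_spacy_models": "en_core_web_sm",
})
```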
Runtime Validation
When processing data, the tool validates:
SpaCy model availability - Model must be installed and loadable
Model authorization - Model must be in the allowed_spacy_models list
Text length limits - Input text must not exceed max_text_length
Data structure - Input data must be valid for each operation
Statistical requirements - Sufficient data for correlation analysis
Operations Supported
The Research Tool supports comprehensive causal inference and text analysis operations:
Mill’s Methods for Causal Inference
Method of Agreement
mill_agreement - Identify common factors in positive cases
Finds attributes present in all cases with positive outcomes
Useful for identifying necessary conditions
Method of Difference
mill_difference - Identify factors present in positive cases but absent in negative cases
Compares a single positive case with a single negative case
Useful for identifying sufficient conditions
Joint Method
mill_joint - Combine agreement and difference methods
Identifies causal factors by analyzing both positive and negative cases
Most robust method for causal inference
Method of Residues
mill_residues - Identify residual causes after accounting for known causes
Removes known causal factors to find remaining causes
Useful for complex causal analysis
Method of Concomitant Variations
mill_concomitant - Analyze correlation between factor and effect variations
Uses statistical correlation analysis
Provides quantitative causal evidence
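The logic behind three of these methods fits in a few lines of plain Python. The sketch below illustrates the ideas only. It does not call the Research Tool, and the case layout (an `attrs` list plus a boolean `outcome`) and all function names are assumptions made for this example that may not match the tool's input format.

```python
from math import sqrt


def agreement(cases):
    """Method of Agreement: attributes shared by every positive case."""
    positives = [set(c["attrs"]) for c in cases if c["outcome"]]
    return set.intersection(*positives) if positives else set()


def difference(positive_case, negative_case):
    """Method of Difference: attributes in the positive case only."""
    return set(positive_case["attrs"]) - set(negative_case["attrs"])


def pearson(xs, ys):
    """Concomitant Variations: Pearson correlation of factor and effect."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equally long series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


cases = [
    {"attrs": ["ate_fish", "drank_water", "ate_salad"], "outcome": True},
    {"attrs": ["ate_fish", "drank_juice"], "outcome": True},
    {"attrs": ["drank_water", "ate_salad"], "outcome": False},
]

common = agreement(cases)                # the only shared attribute is ate_fish
only_in_positive = difference(cases[0], cases[2])
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear, so r is close to 1.0
```

A strong correlation from the last function is quantitative evidence of association, which still needs the other methods or domain knowledge before it is read as causation.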
Advanced Analysis Operations
Induction
induction - Generalize patterns from examples using spaCy-based clustering
Extracts common noun phrases and verbs
Identifies recurring patterns in text data
Deduction
deduction - Validate conclusions using spaCy dependency parsing
Checks logical consistency between premises and conclusions
Validates reasoning chains
Text Summarization
summarize - Summarize text using spaCy sentence ranking
Extracts key sentences based on keyword frequency
Produces concise summaries of long texts
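Keyword-frequency sentence ranking, the idea behind the summarization operation, can be sketched without spaCy. The version below swaps in regular expressions for tokenization and sentence splitting, so it is an approximation of the technique and not the tool's implementation; `summarize_by_frequency` and the small stopword set are illustrative assumptions.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def _keywords(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize_by_frequency(text, max_sentences=2):
    """Keep the sentences whose keywords are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_keywords(text))
    # Score each sentence by the summed document frequency of its keywords
    scores = [sum(freq[w] for w in _keywords(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore original order so the summary reads naturally
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


summary = summarize_by_frequency("Cats chase mice. Cats sleep often. Dogs bark.")
```

Summing frequencies favors longer sentences; dividing each score by the sentence's keyword count is a common variation when that bias matters.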
Troubleshooting
Issue: SpaCy model not found
Error: OSError: [E050] Can't find model 'en_core_web_sm'
Solutions:
Install the model:
python -m spacy download en_core_web_sm
Check model name:
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
Verify installation:
python -c "import spacy; spacy.load('en_core_web_sm')"
Issue: Model not in allowed list
Error: Invalid spaCy model 'model_name', expected ['allowed_models']
Solution:
# Add the model to allowed list
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","your_model"]'
Issue: Memory errors with large texts
Error: MemoryError or system becomes unresponsive
Solutions:
# Reduce text length limit
export RESEARCH_TOOL_MAX_TEXT_LENGTH=5000
# Use smaller spaCy model
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
Issue: Slow processing
Causes: Large texts, complex models, insufficient workers
Solutions:
# Increase worker count
export RESEARCH_TOOL_MAX_WORKERS=32
# Use faster model
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
# Reduce text length
export RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
Issue: Correlation analysis fails
Error: Failed to process mill_concomitant
Solutions:
Ensure sufficient data points (minimum 2 cases)
Check data types (numeric values required)
Verify factor and effect column names exist
Use appropriate statistical methods
Issue: List parsing error
Error: Configuration parsing fails for allowed_spacy_models
Solution:
# Use proper JSON array syntax
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'
# NOT: [en_core_web_sm,zh_core_web_sm] or en_core_web_sm,zh_core_web_sm
Issue: Text too long
Error: Text exceeds maximum length limit
Solutions:
# Increase text length limit
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
# Or preprocess text to reduce length
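When raising the limit is not an option, one way to preprocess is to split the input into pieces that each fit under the configured maximum. The helper below is a hypothetical, character-based sketch: it prefers whitespace breaks but does not respect sentence boundaries, and combining per-chunk results is left to the caller.

```python
def split_into_chunks(text, max_length):
    """Split text into pieces of at most max_length characters."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            # Back up to the last space inside the window, if there is one
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


pieces = split_into_chunks("aaa bbb ccc", max_length=5)
```

Each piece can then be passed to the tool separately, with `max_length` set at or below RESEARCH_TOOL_MAX_TEXT_LENGTH.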
Best Practices
Performance Optimization
Model Selection - Choose appropriate spaCy model for your needs:
en_core_web_sm - Fastest, good for basic tasks
en_core_web_md - Balanced speed and accuracy
en_core_web_lg - Best accuracy, slower processing
Worker Configuration - Match worker count to available CPU cores:
Development: 4-8 workers
Production: 16-32 workers
High-end: 32+ workers
Text Length Management - Set appropriate limits:
Short texts: 5,000 characters
Standard texts: 10,000 characters
Long texts: 50,000+ characters
Memory Management - Monitor memory usage:
Use smaller models for memory-constrained environments
Process texts in batches for very long documents
Clean up spaCy models when done
Causal Inference Best Practices
Data Quality - Ensure high-quality input data:
Consistent attribute naming
Clear outcome definitions
Sufficient sample sizes
Method Selection - Choose appropriate Mill’s method:
Agreement: When you have multiple positive cases
Difference: When comparing single positive/negative cases
Joint: For most robust causal inference
Residues: When you have known causes to exclude
Concomitant: For quantitative correlation analysis
Statistical Validation - Always validate results:
Check correlation significance (p-values)
Consider multiple causal factors
Validate with additional data
Text Analysis Best Practices
Preprocessing - Clean and prepare text data:
Remove irrelevant content
Standardize formatting
Handle special characters
Model Selection - Choose appropriate spaCy model:
Match language of your text
Consider accuracy vs. speed trade-offs
Use domain-specific models when available
Result Interpretation - Understand tool limitations:
Statistical methods provide correlations, not causation
Text analysis is probabilistic, not deterministic
Results should be validated with domain expertise
Security
Model Validation - Only use trusted spaCy models:
Download from official spaCy repository
Verify model integrity
Keep models updated
Input Sanitization - Validate input data:
Check text length limits
Validate data structures
Handle malformed inputs gracefully
Resource Limits - Prevent resource exhaustion:
Set appropriate worker limits
Monitor memory usage
Implement timeout mechanisms
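One possible timeout mechanism wraps a call in a worker thread and stops waiting after a deadline. This is a generic standard-library sketch, not something the Research Tool is documented to provide; `run_with_timeout` is a hypothetical helper, and the worker thread is not terminated when the deadline passes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a worker thread; raise TimeoutError if it takes too long.

    The worker is not killed on timeout and keeps running in the background.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        raise TimeoutError(f"operation exceeded {timeout_seconds}s") from None
    finally:
        # Do not block the caller while the worker finishes
        pool.shutdown(wait=False)


def slow_operation():
    time.sleep(0.5)
    return "done"


fast_result = run_with_timeout(lambda: 42, 1.0)

try:
    run_with_timeout(slow_operation, 0.05)
    timed_out = False
except TimeoutError:
    timed_out = True
```

Because threads cannot be forcibly stopped, a process-based approach is the usual choice when a runaway operation must actually be terminated.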
Development vs Production
Development:
RESEARCH_TOOL_MAX_WORKERS=4
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'
Production:
RESEARCH_TOOL_MAX_WORKERS=32
RESEARCH_TOOL_SPACY_MODEL=en_core_web_md
RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","en_core_web_md"]'
Error Handling
Always wrap research operations in try-except blocks:
from aiecs.tools.task_tools.research_tool import ResearchTool, ResearchToolError, FileOperationError
research_tool = ResearchTool()
try:
    # cases: your input data, prepared elsewhere
    result = research_tool.mill_agreement(cases)
except FileOperationError as e:
    print(f"File operation failed: {e}")
except ResearchToolError as e:
    print(f"Research tool error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
SpaCy Model Installation
Installing Models
# Install English models
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
# Install Chinese models
python -m spacy download zh_core_web_sm
python -m spacy download zh_core_web_md
# Install German models
python -m spacy download de_core_news_sm
python -m spacy download de_core_news_md
Verifying Installation
import spacy
# Test model loading
try:
    nlp = spacy.load("en_core_web_sm")
    print("Model loaded successfully")
except OSError:
    print("Model not found, install with: python -m spacy download en_core_web_sm")
Model Information
import spacy
# Get model information
nlp = spacy.load("en_core_web_sm")
print(f"Model: {nlp.meta['name']}")
print(f"Version: {nlp.meta['version']}")
print(f"Language: {nlp.meta['lang']}")
print(f"Pipeline: {nlp.pipe_names}")
Support
For issues or questions about Research Tool configuration:
Check the tool source code for implementation details
Review spaCy documentation for NLP functionality
Consult the main aiecs documentation for architecture overview
Test with small datasets first to isolate configuration vs. data issues
Monitor memory and CPU usage during processing
Validate spaCy model installation and compatibility