Research Tool Configuration Guide

Overview

The Research Tool provides causal inference based on Mill’s methods, together with induction, deduction, and text summarization. It uses spaCy for natural language processing and statistical correlation analysis for the concomitant variations method. Configure it through environment variables with the RESEARCH_TOOL_ prefix or programmatically when initializing the tool.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Research Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
RESEARCH_TOOL_MAX_WORKERS=16
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm","zh_core_web_sm"]

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.task_tools.research_tool import ResearchTool

# The tool will automatically use the environment variables
research_tool = ResearchTool()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.task_tools.research_tool import ResearchTool
research_tool = ResearchTool()
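
The branching above can also be written as a lookup with a fallback. This is a minimal sketch that assumes the same three file names; `resolve_env_file` and `ENV_FILES` are hypothetical names, not part of aiecs or python-dotenv.

```python
import os
from pathlib import Path

# Hypothetical mapping from APP_ENV values to the .env file names used above.
ENV_FILES = {
    "production": ".env.production",
    "staging": ".env.staging",
    "development": ".env.development",
}


def resolve_env_file(app_env=None, root="."):
    """Return the path of the .env file for the given environment name."""
    name = app_env or os.getenv("APP_ENV", "development")
    # Unknown names fall back to the development file, like the else branch above.
    return Path(root) / ENV_FILES.get(name, ".env.development")


print(resolve_env_file("staging"))
```

The returned path can be passed to load_dotenv() before any aiecs import.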

Example .env.production:

# Production settings - optimized for performance
RESEARCH_TOOL_MAX_WORKERS=32
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm"]

Example .env.development:

# Development settings - more permissive for testing
RESEARCH_TOOL_MAX_WORKERS=4
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm","zh_core_web_sm"]

Best Practices for .env Files

  1. Never commit .env files to version control - Add .env to your .gitignore:

    # .gitignore
    .env
    .env.local
    .env.*.local
    .env.production
    .env.staging
    
  2. Provide a template - Create .env.example with documented dummy values:

    # .env.example
    # Research Tool Configuration
    
    # Maximum number of worker threads
    RESEARCH_TOOL_MAX_WORKERS=16
    
    # Default spaCy model to use
    RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
    
    # Maximum text length for inputs
    RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
    
    # Allowed spaCy models (JSON array)
    RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm","zh_core_web_sm"]
    
  3. Document your variables - Add comments explaining each setting

  4. Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports

  5. Format complex types correctly:

    • Strings: Plain text: en_core_web_sm

    • Integers: Plain numbers: 16, 10000

    • Lists: JSON array format: ["en_core_web_sm","zh_core_web_sm"]

Configuration Options

1. Max Workers

Environment Variable: RESEARCH_TOOL_MAX_WORKERS

Type: Integer

Default: min(32, (os.cpu_count() or 4) * 2)

Description: Maximum number of worker threads for parallel processing. This affects the concurrency of operations that can be parallelized.
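
To see what the default resolves to on a particular machine, the documented expression can be evaluated directly. The sketch below only reproduces that expression; `default_max_workers` is an illustrative name, not a function exposed by the tool.

```python
import os


def default_max_workers():
    """Evaluate the documented default: min(32, (os.cpu_count() or 4) * 2)."""
    # os.cpu_count() may return None; the expression then assumes 4 cores.
    return min(32, (os.cpu_count() or 4) * 2)


print(default_max_workers())
```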

Common Values:

  • 4 - Conservative (development)

  • 8 - Balanced (small servers)

  • 16 - High performance (production)

  • 32 - Maximum (high-end servers)

Example:

export RESEARCH_TOOL_MAX_WORKERS=16

Performance Note: Higher values allow more work to run concurrently, but they also increase memory usage and contention for CPU cores. Size this value to the cores and memory actually available.

2. SpaCy Model

Environment Variable: RESEARCH_TOOL_SPACY_MODEL

Type: String

Default: "en_core_web_sm"

Description: Default spaCy model to use for natural language processing. This model is used for all text analysis operations including tokenization, POS tagging, NER, and dependency parsing.

Available Models:

  • en_core_web_sm - English small model (default, fastest)

  • en_core_web_md - English medium model (better accuracy)

  • en_core_web_lg - English large model (best accuracy)

  • zh_core_web_sm - Chinese small model

  • zh_core_web_md - Chinese medium model

Example:

export RESEARCH_TOOL_SPACY_MODEL=en_core_web_md

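Installation Note: Models must be installed separately, and the selected model must also appear in RESEARCH_TOOL_ALLOWED_SPACY_MODELS (see option 4 below). To install: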

python -m spacy download en_core_web_sm
python -m spacy download zh_core_web_sm

3. Max Text Length

Environment Variable: RESEARCH_TOOL_MAX_TEXT_LENGTH

Type: Integer

Default: 10_000

Description: Maximum text length in characters for input processing. This prevents memory issues with extremely long texts and ensures reasonable processing times.

Common Values:

  • 5_000 - Short texts (summaries, abstracts)

  • 10_000 - Standard texts (articles, reports)

  • 50_000 - Long texts (long documents, book chapters)

  • 100_000 - Very long texts (research papers)

Example:

export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000

Memory Note: Longer texts use more memory and processing time. Adjust based on available system resources.

4. Allowed SpaCy Models

Environment Variable: RESEARCH_TOOL_ALLOWED_SPACY_MODELS

Type: List[str]

Default: ["en_core_web_sm", "zh_core_web_sm"]

Description: List of allowed spaCy models that can be used. This is a security feature that prevents loading of unauthorized or potentially malicious models.

Format: JSON array string with double quotes

Common Configurations:

  • English only: ["en_core_web_sm"]

  • Multilingual: ["en_core_web_sm", "zh_core_web_sm"]

  • High accuracy: ["en_core_web_lg", "zh_core_web_md"]

Example:

# English only
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm"]'

# Multilingual support
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'

Security Note: Only include models that are actually needed and have been verified as safe.
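
The allow-list check amounts to a membership test performed before a model is loaded. The following is a sketch of that idea, not the tool's actual code; `ensure_model_allowed` is a hypothetical helper, and its message merely imitates the error text quoted in the Troubleshooting section.

```python
def ensure_model_allowed(model_name, allowed_models):
    """Return model_name if it is on the allow-list, otherwise raise ValueError."""
    if model_name not in allowed_models:
        raise ValueError(
            f"Invalid spaCy model '{model_name}', expected {allowed_models}"
        )
    return model_name


allowed = ["en_core_web_sm", "zh_core_web_sm"]
print(ensure_model_allowed("en_core_web_sm", allowed))

try:
    ensure_model_allowed("en_core_web_lg", allowed)
except ValueError as exc:
    print(exc)
```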

Usage Examples

Example 1: Basic Environment Configuration

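# Set custom processing parameters
export RESEARCH_TOOL_MAX_WORKERS=16
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_md
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
# en_core_web_md is not in the default allowed list, so allow it explicitly
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","en_core_web_md"]'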

# Run your application
python app.py

Example 2: High-Performance Configuration

# Optimized for large-scale processing
export RESEARCH_TOOL_MAX_WORKERS=32
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_lg
export RESEARCH_TOOL_MAX_TEXT_LENGTH=100000
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_lg"]'

Example 3: Multilingual Configuration

# Support for multiple languages
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm","de_core_news_sm"]'

Example 4: Programmatic Configuration

from aiecs.tools.task_tools.research_tool import ResearchTool

# Initialize with custom configuration
research_tool = ResearchTool(config={
    'max_workers': 16,
    'spacy_model': 'en_core_web_md',
    'max_text_length': 50000,
    'allowed_spacy_models': ['en_core_web_sm', 'en_core_web_md']
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically.

Set the defaults in the shell:

export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
export RESEARCH_TOOL_MAX_WORKERS=8

Then override them for a specific instance in Python:

from aiecs.tools.task_tools.research_tool import ResearchTool

research_tool = ResearchTool(config={
    'spacy_model': 'en_core_web_lg',  # This overrides the environment variable
    'max_workers': 16,                # This overrides the environment variable
    'allowed_spacy_models': ['en_core_web_sm', 'en_core_web_lg']  # The selected model must be allowed
})

Configuration Priority

When the Research Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

  1. Programmatic config - Values passed to the constructor

  2. Environment variables - Values set via RESEARCH_TOOL_* variables

  3. Default values - Built-in defaults as specified above
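
The resolution order can be illustrated with a layered lookup. This sketch assumes the three sources are already available as plain dictionaries; `resolve_config` is an illustrative name, and the tool's internal mechanism may differ.

```python
from collections import ChainMap


def resolve_config(programmatic, from_env, defaults):
    """Merge three config layers; earlier arguments take priority."""
    # ChainMap looks keys up in its maps from left to right, so the first hit wins.
    return dict(ChainMap(programmatic, from_env, defaults))


defaults = {"spacy_model": "en_core_web_sm", "max_workers": 8, "max_text_length": 10_000}
from_env = {"spacy_model": "en_core_web_md", "max_workers": 16}
programmatic = {"spacy_model": "en_core_web_lg"}

resolved = resolve_config(programmatic, from_env, defaults)
print(resolved["spacy_model"])      # en_core_web_lg (programmatic config)
print(resolved["max_workers"])      # 16 (environment variable)
print(resolved["max_text_length"])  # 10000 (built-in default)
```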

Data Type Parsing

String Values

Strings should be provided as plain text without quotes:

export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm

Integer Values

Integers should be provided as numeric strings:

export RESEARCH_TOOL_MAX_WORKERS=16
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000

List Values

Lists must be provided as JSON arrays with double quotes:

# Correct
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'

# Incorrect (will not parse)
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS="en_core_web_sm,zh_core_web_sm"
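
The difference between the two forms is easy to reproduce with the standard json module. The sketch below assumes list values are decoded as JSON, as this section states; `parse_env_value` is a hypothetical helper rather than the tool's own parser.

```python
import json


def parse_env_value(raw, kind):
    """Parse a raw environment string as 'str', 'int', or 'list'."""
    if kind == "int":
        return int(raw)
    if kind == "list":
        value = json.loads(raw)  # raises json.JSONDecodeError on non-JSON input
        if not isinstance(value, list):
            raise ValueError("expected a JSON array")
        return value
    return raw


print(parse_env_value('["en_core_web_sm","zh_core_web_sm"]', "list"))

try:
    parse_env_value("en_core_web_sm,zh_core_web_sm", "list")
except json.JSONDecodeError as exc:
    print(f"not valid JSON: {exc}")
```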

Important: Use single quotes for the shell, double quotes for JSON:

export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'
#                                         ^                                   ^
#                                         Single quotes for the shell
#                                           ^              ^ ^              ^
#                                           Double quotes for JSON

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

  • max_workers must be a positive integer

  • spacy_model must be a non-empty string

  • max_text_length must be a positive integer

  • allowed_spacy_models must be a list of strings
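
The four rules can be mirrored with a plain dataclass for illustration. The tool itself relies on Pydantic; this stdlib-only sketch is not its implementation, and `ResearchConfigSketch` and its default values are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchConfigSketch:
    max_workers: int = 8
    spacy_model: str = "en_core_web_sm"
    max_text_length: int = 10_000
    allowed_spacy_models: list = field(
        default_factory=lambda: ["en_core_web_sm", "zh_core_web_sm"]
    )

    def __post_init__(self):
        for name in ("max_workers", "max_text_length"):
            value = getattr(self, name)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer")
        if not isinstance(self.spacy_model, str) or not self.spacy_model:
            raise ValueError("spacy_model must be a non-empty string")
        models = self.allowed_spacy_models
        if not isinstance(models, list) or not all(isinstance(m, str) for m in models):
            raise ValueError("allowed_spacy_models must be a list of strings")


print(ResearchConfigSketch(max_workers=16))
```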

Runtime Validation

When processing data, the tool validates:

  1. SpaCy model availability - Model must be installed and loadable

  2. Model authorization - Model must be in allowed_spacy_models list

  3. Text length limits - Input text must not exceed max_text_length

  4. Data structure - Input data must be valid for each operation

  5. Statistical requirements - Sufficient data for correlation analysis

Operations Supported

The Research Tool supports comprehensive causal inference and text analysis operations:

Mill’s Methods for Causal Inference

Method of Agreement

  • mill_agreement - Identify common factors in positive cases

  • Finds attributes present in all cases with positive outcomes

  • Useful for identifying necessary conditions
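
The idea behind the Method of Agreement can be sketched as a set intersection over the positive cases. The case shape used here (a dict with "attrs" and "outcome" keys) is an assumption made for illustration, and `mill_agreement_sketch` is not the tool's own function.

```python
def mill_agreement_sketch(cases):
    """Return the attributes shared by every case with a truthy outcome."""
    positive = [set(case["attrs"]) for case in cases if case["outcome"]]
    if not positive:
        return set()
    return set.intersection(*positive)


# Assumed case shape: {"attrs": [...], "outcome": bool}
cases = [
    {"attrs": ["fertilizer", "sunlight", "daily_water"], "outcome": True},
    {"attrs": ["fertilizer", "shade", "daily_water"], "outcome": True},
    {"attrs": ["sunlight"], "outcome": False},
]
print(sorted(mill_agreement_sketch(cases)))  # ['daily_water', 'fertilizer']
```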

Method of Difference

  • mill_difference - Identify factors present in the positive case but absent from the negative case

  • Compares a single positive case with a single negative case
  • Useful for identifying sufficient conditions
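
In set terms, the Method of Difference is a set difference between the two cases. This sketch assumes each case is a dict with an "attrs" list, which is an illustrative shape rather than the tool's documented input format.

```python
def mill_difference_sketch(positive_case, negative_case):
    """Return attributes present in the positive case but absent from the negative one."""
    return set(positive_case["attrs"]) - set(negative_case["attrs"])


# Assumed case shape: {"attrs": [...], "outcome": bool}
positive = {"attrs": ["new_driver", "update_installed", "usb_hub"], "outcome": True}
negative = {"attrs": ["new_driver", "usb_hub"], "outcome": False}
print(mill_difference_sketch(positive, negative))  # {'update_installed'}
```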

Joint Method

  • mill_joint - Combine agreement and difference methods

  • Identifies causal factors by analyzing both positive and negative cases

  • Most robust method for causal inference
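
One way to read the Joint Method is: keep the attributes common to all positive cases, then discard any that also occur in a negative case. The sketch below follows that reading under the same assumed case shape ("attrs" and "outcome" keys); the tool's own combination rule may differ.

```python
def mill_joint_sketch(cases):
    """Attributes found in every positive case and in no negative case."""
    positive = [set(c["attrs"]) for c in cases if c["outcome"]]
    negative = [set(c["attrs"]) for c in cases if not c["outcome"]]
    if not positive:
        return set()
    common_to_positive = set.intersection(*positive)
    seen_in_negative = set().union(*negative)
    return common_to_positive - seen_in_negative


# Assumed case shape: {"attrs": [...], "outcome": bool}
cases = [
    {"attrs": ["a", "b", "c"], "outcome": True},
    {"attrs": ["a", "b", "d"], "outcome": True},
    {"attrs": ["b", "e"], "outcome": False},
]
print(mill_joint_sketch(cases))  # {'a'}
```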

Method of Residues

  • mill_residues - Identify residual causes after accounting for known causes

  • Removes known causal factors to find remaining causes

  • Useful for complex causal analysis

Method of Concomitant Variations

  • mill_concomitant - Analyze correlation between factor and effect variations

  • Uses statistical correlation analysis

  • Provides quantitative causal evidence
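
The quantity at the heart of this method is a correlation coefficient between the factor and the effect. Below is a dependency-free sketch of Pearson's r; the tool may use a statistics library and report additional values such as p-values, so treat `pearson_r` and the sample data as illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up sample values for a factor and an effect.
dosage = [1, 2, 3, 4, 5]
response = [2.1, 3.9, 6.2, 8.1, 9.8]
print(f"r = {pearson_r(dosage, response):.3f}")
```

A coefficient close to +1 or -1 indicates that the effect varies with the factor; it does not by itself establish causation.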

Advanced Analysis Operations

Induction

  • induction - Generalize patterns from examples using spaCy-based clustering

  • Extracts common noun phrases and verbs

  • Identifies recurring patterns in text data

Deduction

  • deduction - Validate conclusions using spaCy dependency parsing

  • Checks logical consistency between premises and conclusions

  • Validates reasoning chains

Text Summarization

  • summarize - Summarize text using spaCy sentence ranking

  • Extracts key sentences based on keyword frequency

  • Produces concise summaries of long texts
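
Keyword-frequency sentence ranking can be sketched without spaCy to show the principle. The regex-based sentence splitting and the tiny stopword list below are simplifications chosen for this example; the tool relies on spaCy for both, and `summarize_sketch` is a hypothetical name.

```python
import re
from collections import Counter

# Deliberately small stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "at", "is", "are", "it"}


def summarize_sketch(text, max_sentences=2):
    """Keep the sentences whose words are most frequent in the whole text."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)

    def score(sentence):
        # Stopwords are absent from freq, so they contribute 0.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


text = "Cats sleep a lot. Cats hunt mice at night. Dogs bark."
print(summarize_sketch(text, max_sentences=1))  # Cats hunt mice at night.
```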

Troubleshooting

Issue: SpaCy model not found

Error: OSError: [E050] Can't find model 'en_core_web_sm'

Solutions:

  1. Install the model: python -m spacy download en_core_web_sm

  2. Check model name: export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm

  3. Verify installation: python -c "import spacy; spacy.load('en_core_web_sm')"

Issue: Model not in allowed list

Error: Invalid spaCy model 'model_name', expected ['allowed_models']

Solution:

# Add the model to allowed list
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","your_model"]'

Issue: Memory errors with large texts

Error: MemoryError or system becomes unresponsive

Solutions:

# Reduce text length limit
export RESEARCH_TOOL_MAX_TEXT_LENGTH=5000

# Use smaller spaCy model
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm

Issue: Slow processing

Causes: Large texts, complex models, insufficient workers

Solutions:

# Increase worker count
export RESEARCH_TOOL_MAX_WORKERS=32

# Use faster model
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm

# Reduce text length
export RESEARCH_TOOL_MAX_TEXT_LENGTH=10000

Issue: Correlation analysis fails

Error: Failed to process mill_concomitant

Solutions:

  1. Ensure sufficient data points (at least 2 cases are required; with exactly 2 the correlation is always ±1, so use more for a meaningful result)

  2. Check data types (numeric values required)

  3. Verify factor and effect column names exist

  4. Check that neither the factor nor the effect is constant across cases, since correlation is undefined without variation

Issue: List parsing error

Error: Configuration parsing fails for allowed_spacy_models

Solution:

# Use proper JSON array syntax
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'

# NOT: [en_core_web_sm,zh_core_web_sm] or en_core_web_sm,zh_core_web_sm

Issue: Text too long

Error: Text exceeds maximum length limit

Solutions:

# Increase text length limit
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000

# Or preprocess text to reduce length
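
One preprocessing option is to split a long text into pieces that each stay under the limit. The following is a sketch using a naive regex sentence split; `split_into_chunks` is a hypothetical helper, and whether per-chunk results can be combined meaningfully depends on the operation being run.

```python
import re


def split_into_chunks(text, max_length=10_000):
    """Split text into chunks of at most max_length characters,
    breaking at sentence boundaries where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut hard.
        while len(sentence) > max_length:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_length])
            sentence = sentence[max_length:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= max_length:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One two. Three four. Five six.", max_length=20))
# ['One two. Three four.', 'Five six.']
```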

Best Practices

Performance Optimization

  1. Model Selection - Choose appropriate spaCy model for your needs:

    • en_core_web_sm - Fastest, good for basic tasks

    • en_core_web_md - Balanced speed and accuracy

    • en_core_web_lg - Best accuracy, slower processing

  2. Worker Configuration - Match worker count to available CPU cores:

    • Development: 4-8 workers

    • Production: 16-32 workers

    • High-end: 32 workers (the maximum listed under Max Workers)

  3. Text Length Management - Set appropriate limits:

    • Short texts: 5,000 characters

    • Standard texts: 10,000 characters

    • Long texts: 50,000+ characters

  4. Memory Management - Monitor memory usage:

    • Use smaller models for memory-constrained environments

    • Process texts in batches for very long documents

    • Release loaded spaCy models (drop references to them) when they are no longer needed

Causal Inference Best Practices

  1. Data Quality - Ensure high-quality input data:

    • Consistent attribute naming

    • Clear outcome definitions

    • Sufficient sample sizes

  2. Method Selection - Choose appropriate Mill’s method:

    • Agreement: When you have multiple positive cases

    • Difference: When comparing single positive/negative cases

    • Joint: For most robust causal inference

    • Residues: When you have known causes to exclude

    • Concomitant: For quantitative correlation analysis

  3. Statistical Validation - Always validate results:

    • Check correlation significance (p-values)

    • Consider multiple causal factors

    • Validate with additional data

Text Analysis Best Practices

  1. Preprocessing - Clean and prepare text data:

    • Remove irrelevant content

    • Standardize formatting

    • Handle special characters

  2. Model Selection - Choose appropriate spaCy model:

    • Match language of your text

    • Consider accuracy vs. speed trade-offs

    • Use domain-specific models when available

  3. Result Interpretation - Understand tool limitations:

    • Statistical methods provide correlations, not causation

    • Text analysis is probabilistic, not deterministic

    • Results should be validated with domain expertise

Security

  1. Model Validation - Only use trusted spaCy models:

    • Download from official spaCy repository

    • Verify model integrity

    • Keep models updated

  2. Input Sanitization - Validate input data:

    • Check text length limits

    • Validate data structures

    • Handle malformed inputs gracefully

  3. Resource Limits - Prevent resource exhaustion:

    • Set appropriate worker limits

    • Monitor memory usage

    • Implement timeout mechanisms
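
A timeout can be added around any blocking call with the standard library. This is a sketch of one approach, not a feature of the Research Tool; `run_with_timeout` is a hypothetical helper. A timed-out worker thread keeps running in the background, because Python threads cannot be stopped forcibly.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(func, *args, timeout_seconds=30.0, **kwargs):
    """Run func in a worker thread and stop waiting after timeout_seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        raise TimeoutError(f"operation exceeded {timeout_seconds} seconds") from None
    finally:
        # Do not block here; the worker thread finishes (or not) on its own.
        executor.shutdown(wait=False)


print(run_with_timeout(sum, [1, 2, 3], timeout_seconds=5.0))  # 6
```

For example, a research operation could be wrapped as run_with_timeout(research_tool.mill_agreement, cases, timeout_seconds=60).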

Development vs Production

Development:

RESEARCH_TOOL_MAX_WORKERS=4
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'

Production:

RESEARCH_TOOL_MAX_WORKERS=32
RESEARCH_TOOL_SPACY_MODEL=en_core_web_md
RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","en_core_web_md"]'

Error Handling

Always wrap research operations in try-except blocks:

from aiecs.tools.task_tools.research_tool import ResearchTool, ResearchToolError, FileOperationError

research_tool = ResearchTool()

# `cases` is your list of case records, prepared elsewhere in your application
try:
    result = research_tool.mill_agreement(cases)
except FileOperationError as e:
    print(f"File operation failed: {e}")
except ResearchToolError as e:
    print(f"Research tool error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

SpaCy Model Installation

Installing Models

# Install English models
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg

# Install Chinese models
python -m spacy download zh_core_web_sm
python -m spacy download zh_core_web_md

# Install German models
python -m spacy download de_core_news_sm
python -m spacy download de_core_news_md

Verifying Installation

import spacy

# Test model loading
try:
    nlp = spacy.load("en_core_web_sm")
    print("Model loaded successfully")
except OSError:
    print("Model not found, install with: python -m spacy download en_core_web_sm")

Model Information

import spacy

# Get model information
nlp = spacy.load("en_core_web_sm")
print(f"Model: {nlp.meta['name']}")
print(f"Version: {nlp.meta['version']}")
print(f"Language: {nlp.meta['lang']}")
print(f"Pipeline: {nlp.pipe_names}")

Support

For issues or questions about Research Tool configuration:

  • Check the tool source code for implementation details

  • Review spaCy documentation for NLP functionality

  • Consult the main aiecs documentation for architecture overview

  • Test with small datasets first to isolate configuration vs. data issues

  • Monitor memory and CPU usage during processing

  • Validate spaCy model installation and compatibility