# Research Tool Configuration Guide

## Overview

The Research Tool provides causal inference based on Mill's methods, along with induction, deduction, and text summarization. It uses spaCy for natural language processing and statistical analysis for correlation studies. Configure it through environment variables with the `RESEARCH_TOOL_` prefix, or programmatically when initializing the tool.

## Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The Research Tool reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools.

### Setting Up .env Files

**1. Install python-dotenv:**

```bash
pip install python-dotenv
```

**2. Create a `.env` file in your project root:**

```bash
# .env file in your project root
RESEARCH_TOOL_MAX_WORKERS=16
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm","zh_core_web_sm"]
```

**3. Load the .env file in your application:**

```python
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.task_tools.research_tool import ResearchTool

# The tool will automatically use the environment variables
research_tool = ResearchTool()
```
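
If the tool appears to ignore your `.env` values, it can help to confirm that the variables actually reached the process environment before any aiecs import. The following is a minimal sketch using only the standard library; `research_tool_env` is a hypothetical helper name and is not part of aiecs.

```python
import os

PREFIX = "RESEARCH_TOOL_"


def research_tool_env(environ=None):
    """Return all RESEARCH_TOOL_* variables visible in the given mapping.

    Hypothetical helper for debugging; defaults to the process environment.
    """
    environ = os.environ if environ is None else environ
    return {key: value for key, value in environ.items() if key.startswith(PREFIX)}


if __name__ == "__main__":
    # Run this after load_dotenv() to see what the tool would pick up.
    for name, value in sorted(research_tool_env().items()):
        print(f"{name}={value}")
```

An empty result after `load_dotenv()` usually means the `.env` file was not found in the working directory.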

### Multiple Environment Files

You can use different `.env` files for different environments:

```python
import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.task_tools.research_tool import ResearchTool
research_tool = ResearchTool()
```

**Example `.env.production`:**
```bash
# Production settings - optimized for performance
RESEARCH_TOOL_MAX_WORKERS=32
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm"]
```

**Example `.env.development`:**
```bash
# Development settings - more permissive for testing
RESEARCH_TOOL_MAX_WORKERS=4
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm","zh_core_web_sm"]
```

### Best Practices for .env Files

1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`:
   ```gitignore
   # .gitignore
   .env
   .env.local
   .env.*.local
   .env.development
   .env.production
   .env.staging
   ```

2. **Provide a template** - Create `.env.example` with documented dummy values:
   ```bash
   # .env.example
   # Research Tool Configuration
   
   # Maximum number of worker threads
   RESEARCH_TOOL_MAX_WORKERS=16
   
   # Default spaCy model to use
   RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
   
   # Maximum text length for inputs
   RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
   
   # Allowed spaCy models (JSON array)
   RESEARCH_TOOL_ALLOWED_SPACY_MODELS=["en_core_web_sm","zh_core_web_sm"]
   ```

3. **Document your variables** - Add comments explaining each setting

4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports

5. **Format complex types correctly**:
   - Strings: plain text, e.g. `en_core_web_sm`
   - Integers: plain numbers, e.g. `16`, `10000`
   - Lists: JSON arrays, e.g. `["en_core_web_sm","zh_core_web_sm"]`

## Configuration Options

### 1. Max Workers

**Environment Variable:** `RESEARCH_TOOL_MAX_WORKERS`

**Type:** Integer

**Default:** `min(32, (os.cpu_count() or 4) * 2)`

**Description:** Maximum number of worker threads for parallel processing. This affects the concurrency of operations that can be parallelized.

**Common Values:**
- `4` - Conservative (development)
- `8` - Balanced (small servers)
- `16` - High performance (production)
- `32` - Maximum (high-end servers)

**Example:**
```bash
export RESEARCH_TOOL_MAX_WORKERS=16
```

**Performance Note:** Higher values use more CPU cores and may increase memory usage. Set this based on available system resources.
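
To see what the documented default resolves to on a particular machine, the expression can be evaluated directly. This is a small sketch; `default_max_workers` is a hypothetical name used for illustration only.

```python
import os


def default_max_workers():
    """Evaluate the documented default: min(32, (os.cpu_count() or 4) * 2)."""
    # os.cpu_count() may return None, in which case 4 cores are assumed.
    return min(32, (os.cpu_count() or 4) * 2)


print(f"Default max_workers on this machine: {default_max_workers()}")
```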

### 2. SpaCy Model

**Environment Variable:** `RESEARCH_TOOL_SPACY_MODEL`

**Type:** String

**Default:** `"en_core_web_sm"`

**Description:** Default spaCy model to use for natural language processing. This model is used for all text analysis operations including tokenization, POS tagging, NER, and dependency parsing.

**Available Models:**
- `en_core_web_sm` - English small model (default, fastest)
- `en_core_web_md` - English medium model (better accuracy)
- `en_core_web_lg` - English large model (best accuracy)
- `zh_core_web_sm` - Chinese small model
- `zh_core_web_md` - Chinese medium model

**Example:**
```bash
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_md
```

**Installation Note:** Models must be installed separately:
```bash
python -m spacy download en_core_web_sm
python -m spacy download zh_core_web_sm
```

### 3. Max Text Length

**Environment Variable:** `RESEARCH_TOOL_MAX_TEXT_LENGTH`

**Type:** Integer

**Default:** `10_000`

**Description:** Maximum text length in characters for input processing. This prevents memory issues with extremely long texts and ensures reasonable processing times.

**Common Values:**
- `5_000` - Short texts (summaries, abstracts)
- `10_000` - Standard texts (articles, reports)
- `50_000` - Long texts (documents, books)
- `100_000` - Very long texts (research papers)

**Example:**
```bash
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
```

**Memory Note:** Longer texts use more memory and processing time. Adjust based on available system resources.

### 4. Allowed SpaCy Models

**Environment Variable:** `RESEARCH_TOOL_ALLOWED_SPACY_MODELS`

**Type:** List[str]

**Default:** `["en_core_web_sm", "zh_core_web_sm"]`

**Description:** List of allowed spaCy models that can be used. This is a security feature that prevents loading of unauthorized or potentially malicious models.

**Format:** JSON array string with double quotes

**Common Configurations:**
- **English only:** `["en_core_web_sm"]`
- **Multilingual:** `["en_core_web_sm", "zh_core_web_sm"]`
- **High accuracy:** `["en_core_web_lg", "zh_core_web_md"]`

**Example:**
```bash
# English only
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm"]'

# Multilingual support
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'
```

**Security Note:** Only include models that are actually needed and have been verified as safe.

## Usage Examples

### Example 1: Basic Environment Configuration

```bash
# Set custom processing parameters
export RESEARCH_TOOL_MAX_WORKERS=16
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_md
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000

# Run your application
python app.py
```

### Example 2: High-Performance Configuration

```bash
# Optimized for large-scale processing
export RESEARCH_TOOL_MAX_WORKERS=32
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_lg
export RESEARCH_TOOL_MAX_TEXT_LENGTH=100000
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_lg"]'
```

### Example 3: Multilingual Configuration

```bash
# Support for multiple languages
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm","de_core_news_sm"]'
```

### Example 4: Programmatic Configuration

```python
from aiecs.tools.task_tools.research_tool import ResearchTool

# Initialize with custom configuration
research_tool = ResearchTool(config={
    'max_workers': 16,
    'spacy_model': 'en_core_web_md',
    'max_text_length': 50000,
    'allowed_spacy_models': ['en_core_web_sm', 'en_core_web_md']
})
```

### Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

```bash
# Set environment defaults
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
export RESEARCH_TOOL_MAX_WORKERS=8
```

```python
from aiecs.tools.task_tools.research_tool import ResearchTool

# Override for specific instance
research_tool = ResearchTool(config={
    'spacy_model': 'en_core_web_lg',  # This overrides the environment variable
    'max_workers': 16                 # This overrides the environment variable
})
```

## Configuration Priority

When the Research Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

1. **Programmatic config** - Values passed to the constructor
2. **Environment variables** - Values set via `RESEARCH_TOOL_*` variables
3. **Default values** - Built-in defaults as specified above
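
The resolution order above can be sketched with plain dictionaries. This is an illustration of the priority rule only, not the tool's actual implementation; `resolve_setting` and `DEFAULTS` are hypothetical names, and only two of the documented settings are included.

```python
import os

# Two of the documented defaults, used here for illustration.
DEFAULTS = {"spacy_model": "en_core_web_sm", "max_text_length": 10_000}


def resolve_setting(name, config=None, environ=None):
    """Resolve one setting: constructor config, then environment, then default."""
    config = config or {}
    environ = os.environ if environ is None else environ

    # 1. Programmatic config wins.
    if name in config:
        return config[name]

    # 2. Then the RESEARCH_TOOL_* environment variable.
    env_key = "RESEARCH_TOOL_" + name.upper()
    if env_key in environ:
        # Environment values are strings; coerce to the default's type.
        return type(DEFAULTS[name])(environ[env_key])

    # 3. Finally the built-in default.
    return DEFAULTS[name]
```

For example, with `RESEARCH_TOOL_SPACY_MODEL=en_core_web_md` set in the environment, `resolve_setting("spacy_model", {"spacy_model": "en_core_web_lg"})` returns the constructor value, while `resolve_setting("spacy_model")` returns the environment value.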

## Data Type Parsing

### String Values

Strings should be provided as plain text without quotes:

```bash
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
```

### Integer Values

Integers should be provided as numeric strings:

```bash
export RESEARCH_TOOL_MAX_WORKERS=16
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
```

### List Values

Lists must be provided as JSON arrays with double quotes:

```bash
# Correct
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'

# Incorrect (will not parse)
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS="en_core_web_sm,zh_core_web_sm"
```

**Important:** Use single quotes for the shell, double quotes for JSON:
```bash
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'
# Outer single quotes: keep the shell from interpreting the brackets and double quotes.
# Inner double quotes: required by JSON around each string.
```
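
The difference between the two forms is easy to check with the standard `json` module. The sketch below assumes list values are decoded as JSON, which is what the documented format suggests; `parse_model_list` is a hypothetical helper, not the tool's own parser.

```python
import json


def parse_model_list(raw):
    """Parse a JSON array of strings, as expected for list-valued settings."""
    value = json.loads(raw)
    if not isinstance(value, list) or not all(isinstance(m, str) for m in value):
        raise ValueError(f"expected a JSON array of strings, got: {raw!r}")
    return value


# The JSON array form parses into a Python list.
print(parse_model_list('["en_core_web_sm","zh_core_web_sm"]'))

# The comma-separated form is not valid JSON and is rejected.
try:
    parse_model_list("en_core_web_sm,zh_core_web_sm")
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    print(f"rejected: {exc}")
```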

## Validation

### Automatic Type Validation

Pydantic automatically validates configuration values:

- `max_workers` must be a positive integer
- `spacy_model` must be a non-empty string
- `max_text_length` must be a positive integer
- `allowed_spacy_models` must be a list of strings

### Runtime Validation

When processing data, the tool validates:

1. **SpaCy model availability** - Model must be installed and loadable
2. **Model authorization** - Model must be in the `allowed_spacy_models` list
3. **Text length limits** - Input text must not exceed `max_text_length`
4. **Data structure** - Input data must be valid for each operation
5. **Statistical requirements** - Sufficient data for correlation analysis
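
Two of these checks, model authorization and the text length limit, are simple enough to run yourself before calling the tool. The following is a sketch under that assumption; `check_request` is a hypothetical pre-flight helper and does not reproduce the tool's internal validation or error messages.

```python
def check_request(text, model, allowed_models, max_text_length):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    if model not in allowed_models:
        problems.append(f"model {model!r} is not in the allowed list {allowed_models}")
    if len(text) > max_text_length:
        problems.append(f"text has {len(text)} characters, limit is {max_text_length}")
    return problems


issues = check_request(
    text="A short input.",
    model="en_core_web_lg",
    allowed_models=["en_core_web_sm", "zh_core_web_sm"],
    max_text_length=10_000,
)
for issue in issues:
    print(issue)
```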

## Operations Supported

The Research Tool supports comprehensive causal inference and text analysis operations:

### Mill's Methods for Causal Inference

#### Method of Agreement
- `mill_agreement` - Identify common factors in positive cases
- Finds attributes present in all cases with positive outcomes
- Useful for identifying necessary conditions
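
The idea behind the Method of Agreement can be sketched as a set intersection over the positive cases. This is a standalone illustration, not the tool's implementation; the `attrs`/`outcome` record layout and the `agreement` function are assumptions made for the example and may differ from what `mill_agreement` expects.

```python
def agreement(cases):
    """Return the attributes shared by every case with a positive outcome."""
    positives = [set(case["attrs"]) for case in cases if case["outcome"]]
    if not positives:
        return set()
    return set.intersection(*positives)


# Hypothetical case records for illustration.
cases = [
    {"attrs": ["rain", "cold", "wind"], "outcome": True},
    {"attrs": ["rain", "warm"], "outcome": True},
    {"attrs": ["sun", "cold"], "outcome": False},
]
print(agreement(cases))  # {'rain'}
```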

#### Method of Difference
- `mill_difference` - Identify factors present in positive but absent in negative cases
- Compares a single positive case with a single negative case
- Useful for identifying sufficient conditions
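
In its simplest form, the Method of Difference is a set difference between the two cases. The sketch below illustrates the principle only; `difference` is a hypothetical function and the attribute lists are made-up inputs, not the tool's required format.

```python
def difference(positive_attrs, negative_attrs):
    """Return attributes present in the positive case but absent in the negative one."""
    return set(positive_attrs) - set(negative_attrs)


# Hypothetical cases: the positive case differs from the negative one only by "fertilizer".
positive = ["water", "sunlight", "fertilizer"]
negative = ["water", "sunlight"]
print(difference(positive, negative))  # {'fertilizer'}
```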

#### Joint Method
- `mill_joint` - Combine agreement and difference methods
- Identifies causal factors by analyzing both positive and negative cases
- Most robust method for causal inference
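
One way to read the Joint Method is: keep the attributes common to all positive cases, then discard any that also appear in a negative case. The sketch below follows that reading; it is an illustration under the same assumed `attrs`/`outcome` record layout, and `joint` is a hypothetical name rather than the tool's implementation.

```python
def joint(cases):
    """Attributes shared by all positive cases and absent from every negative case."""
    positives = [set(c["attrs"]) for c in cases if c["outcome"]]
    negatives = [set(c["attrs"]) for c in cases if not c["outcome"]]
    if not positives:
        return set()
    common = set.intersection(*positives)      # agreement step
    seen_in_negatives = set().union(*negatives)  # everything observed without the effect
    return common - seen_in_negatives          # difference step


# Hypothetical case records for illustration.
cases = [
    {"attrs": ["a", "b", "c"], "outcome": True},
    {"attrs": ["a", "b", "d"], "outcome": True},
    {"attrs": ["b", "e"], "outcome": False},
]
print(joint(cases))  # {'a'}
```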

#### Method of Residues
- `mill_residues` - Identify residual causes after accounting for known causes
- Removes known causal factors to find remaining causes
- Useful for complex causal analysis
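
The Method of Residues can be illustrated by subtracting what is already explained. In this sketch, `known_causes` maps an attribute to the effects it is known to produce; that mapping, the function name `residues`, and the sample data are all assumptions for the example and do not describe the tool's input format.

```python
def residues(attrs, effects, known_causes):
    """Return (unexplained attributes, unexplained effects)."""
    explained = set()
    for attr in attrs:
        # Collect every effect already accounted for by a known cause.
        explained.update(known_causes.get(attr, ()))
    residual_attrs = set(attrs) - set(known_causes)
    residual_effects = set(effects) - explained
    return residual_attrs, residual_effects


# Hypothetical observation: A explains x and B explains y, leaving C and z.
remaining = residues(
    attrs=["A", "B", "C"],
    effects=["x", "y", "z"],
    known_causes={"A": ["x"], "B": ["y"]},
)
print(remaining)  # ({'C'}, {'z'})
```

The remaining attribute is then the candidate cause of the remaining effect.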

#### Method of Concomitant Variations
- `mill_concomitant` - Analyze correlation between factor and effect variations
- Uses statistical correlation analysis
- Provides quantitative causal evidence
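
As an illustration of concomitant variation, the sketch below computes a Pearson correlation coefficient by hand using only the standard library. Which coefficient the tool itself uses is not stated here, so Pearson is an assumption, and `pearson` is a hypothetical function name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: the effect rises steadily with the factor.
factor = [1, 2, 3, 4, 5]
effect = [2.1, 3.9, 6.2, 8.0, 9.8]
print(f"r = {pearson(factor, effect):.3f}")
```

A coefficient near +1 or -1 indicates that factor and effect vary together; it does not by itself establish causation.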

### Advanced Analysis Operations

#### Induction
- `induction` - Generalize patterns from examples using spaCy-based clustering
- Extracts common noun phrases and verbs
- Identifies recurring patterns in text data

#### Deduction
- `deduction` - Validate conclusions using spaCy dependency parsing
- Checks logical consistency between premises and conclusions
- Validates reasoning chains

#### Text Summarization
- `summarize` - Summarize text using spaCy sentence ranking
- Extracts key sentences based on keyword frequency
- Produces concise summaries of long texts
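
Keyword-frequency sentence ranking can be sketched without spaCy. The example below swaps in regular expressions for tokenization and sentence splitting and uses a crude length-based stop-word filter, so it only approximates the approach described above; `summarize_by_frequency` is a hypothetical name.

```python
import re
from collections import Counter


def summarize_by_frequency(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore words of three letters or fewer.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)


text = "Cats sleep. Cats chase mice and cats climb trees. Dogs bark."
print(summarize_by_frequency(text))
```

Because scores are summed rather than averaged, this sketch favors longer sentences; normalizing by sentence length is a common refinement.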

## Troubleshooting

### Issue: SpaCy model not found

**Error:** `OSError: [E050] Can't find model 'en_core_web_sm'`

**Solutions:**
1. Install the model: `python -m spacy download en_core_web_sm`
2. Check model name: `export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm`
3. Verify installation: `python -c "import spacy; spacy.load('en_core_web_sm')"`

### Issue: Model not in allowed list

**Error:** `Invalid spaCy model 'model_name', expected ['allowed_models']`

**Solution:**
```bash
# Add the model to allowed list
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","your_model"]'
```

### Issue: Memory errors with large texts

**Error:** `MemoryError` or system becomes unresponsive

**Solutions:**
```bash
# Reduce text length limit
export RESEARCH_TOOL_MAX_TEXT_LENGTH=5000

# Use smaller spaCy model
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
```

### Issue: Slow processing

**Causes:** Large texts, complex models, insufficient workers

**Solutions:**
```bash
# Increase worker count
export RESEARCH_TOOL_MAX_WORKERS=32

# Use faster model
export RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm

# Reduce text length
export RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
```

### Issue: Correlation analysis fails

**Error:** `Failed to process mill_concomitant`

**Solutions:**
1. Ensure sufficient data points (minimum 2 cases)
2. Check data types (numeric values required)
3. Verify factor and effect column names exist
4. Use appropriate statistical methods

### Issue: List parsing error

**Error:** Configuration parsing fails for `allowed_spacy_models`

**Solution:**
```bash
# Use proper JSON array syntax
export RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'

# NOT: [en_core_web_sm,zh_core_web_sm] or en_core_web_sm,zh_core_web_sm
```

### Issue: Text too long

**Error:** Text exceeds maximum length limit

**Solutions:**
```bash
# Increase text length limit
export RESEARCH_TOOL_MAX_TEXT_LENGTH=50000

# Or preprocess text to reduce length
```

## Best Practices

### Performance Optimization

1. **Model Selection** - Choose appropriate spaCy model for your needs:
   - `en_core_web_sm` - Fastest, good for basic tasks
   - `en_core_web_md` - Balanced speed and accuracy
   - `en_core_web_lg` - Best accuracy, slower processing

2. **Worker Configuration** - Match worker count to available CPU cores:
   - Development: 4-8 workers
   - Production: 16-32 workers
   - High-end: 32+ workers

3. **Text Length Management** - Set appropriate limits:
   - Short texts: 5,000 characters
   - Standard texts: 10,000 characters
   - Long texts: 50,000+ characters

4. **Memory Management** - Monitor memory usage:
   - Use smaller models for memory-constrained environments
   - Process texts in batches for very long documents
   - Clean up spaCy models when done
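
For documents longer than `max_text_length`, one option is to split the text into chunks before processing. The sketch below is one possible approach, cutting at spaces where it can; `split_text` is a hypothetical helper and not part of aiecs.

```python
def split_text(text, max_length):
    """Split text into chunks of at most max_length characters, preferring spaces."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        # If the window would cut a word in half, back up to the last space.
        if end < len(text) and not text[end].isspace():
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end].strip())
        start = end
    # Drop chunks that were whitespace only.
    return [chunk for chunk in chunks if chunk]


print(split_text("aaa bbb ccc ddd", 7))  # ['aaa bbb', 'ccc', 'ddd']
```

Each chunk can then be passed to the tool separately. A word longer than `max_length` is split mid-word, since there is no space to cut at.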

### Causal Inference Best Practices

1. **Data Quality** - Ensure high-quality input data:
   - Consistent attribute naming
   - Clear outcome definitions
   - Sufficient sample sizes

2. **Method Selection** - Choose appropriate Mill's method:
   - **Agreement**: When you have multiple positive cases
   - **Difference**: When comparing a single positive case against a single negative case
   - **Joint**: For most robust causal inference
   - **Residues**: When you have known causes to exclude
   - **Concomitant**: For quantitative correlation analysis

3. **Statistical Validation** - Always validate results:
   - Check correlation significance (p-values)
   - Consider multiple causal factors
   - Validate with additional data

### Text Analysis Best Practices

1. **Preprocessing** - Clean and prepare text data:
   - Remove irrelevant content
   - Standardize formatting
   - Handle special characters

2. **Model Selection** - Choose appropriate spaCy model:
   - Match language of your text
   - Consider accuracy vs. speed trade-offs
   - Use domain-specific models when available

3. **Result Interpretation** - Understand tool limitations:
   - Statistical methods provide correlations, not causation
   - Text analysis is probabilistic, not deterministic
   - Results should be validated with domain expertise

### Security

1. **Model Validation** - Only use trusted spaCy models:
   - Download from official spaCy repository
   - Verify model integrity
   - Keep models updated

2. **Input Sanitization** - Validate input data:
   - Check text length limits
   - Validate data structures
   - Handle malformed inputs gracefully

3. **Resource Limits** - Prevent resource exhaustion:
   - Set appropriate worker limits
   - Monitor memory usage
   - Implement timeout mechanisms
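
A timeout can be layered on top of any call with `concurrent.futures`. The sketch below is a generic wrapper, assuming the wrapped operation is an ordinary blocking function; `run_with_timeout` is a hypothetical helper, and the tool may offer its own mechanism.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(func, *args, timeout=30.0, **kwargs):
    """Run func in a worker thread and stop waiting after `timeout` seconds.

    The worker thread is not killed; this only bounds how long the caller waits.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        raise TimeoutError(f"operation did not finish within {timeout} seconds") from None
    finally:
        # Return immediately instead of blocking on a still-running worker.
        executor.shutdown(wait=False)


def slow_square(x):
    """Stand-in for a long-running operation."""
    time.sleep(0.01)
    return x * x


print(run_with_timeout(slow_square, 6, timeout=1.0))  # 36
```

Because the thread keeps running after a timeout, this pattern limits caller latency but does not free the resources used by the abandoned operation.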

### Development vs Production

**Development:**
```bash
RESEARCH_TOOL_MAX_WORKERS=4
RESEARCH_TOOL_SPACY_MODEL=en_core_web_sm
RESEARCH_TOOL_MAX_TEXT_LENGTH=10000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","zh_core_web_sm"]'
```

**Production:**
```bash
RESEARCH_TOOL_MAX_WORKERS=32
RESEARCH_TOOL_SPACY_MODEL=en_core_web_md
RESEARCH_TOOL_MAX_TEXT_LENGTH=50000
RESEARCH_TOOL_ALLOWED_SPACY_MODELS='["en_core_web_sm","en_core_web_md"]'
```

### Error Handling

Always wrap research operations in try-except blocks:

```python
from aiecs.tools.task_tools.research_tool import ResearchTool, ResearchToolError, FileOperationError

research_tool = ResearchTool()

# `cases` is your list of case records, prepared elsewhere in your application

try:
    result = research_tool.mill_agreement(cases)
except FileOperationError as e:
    print(f"Research operation failed: {e}")
except ResearchToolError as e:
    print(f"Research tool error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## SpaCy Model Installation

### Installing Models

```bash
# Install English models
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg

# Install Chinese models
python -m spacy download zh_core_web_sm
python -m spacy download zh_core_web_md

# Install German models
python -m spacy download de_core_news_sm
python -m spacy download de_core_news_md
```

### Verifying Installation

```python
import spacy

# Test model loading
try:
    nlp = spacy.load("en_core_web_sm")
    print("Model loaded successfully")
except OSError:
    print("Model not found, install with: python -m spacy download en_core_web_sm")
```

### Model Information

```python
import spacy

# Get model information
nlp = spacy.load("en_core_web_sm")
print(f"Model: {nlp.meta['name']}")
print(f"Version: {nlp.meta['version']}")
print(f"Language: {nlp.meta['lang']}")
print(f"Pipeline: {nlp.pipe_names}")
```

## Related Documentation

- Tool implementation details in the source code
- SpaCy documentation: https://spacy.io/
- Mill's Methods: https://en.wikipedia.org/wiki/Mill%27s_methods
- Main aiecs documentation for architecture overview

## Support

For issues or questions about Research Tool configuration:
- Check the tool source code for implementation details
- Review spaCy documentation for NLP functionality
- Consult the main aiecs documentation for architecture overview
- Test with small datasets first to isolate configuration vs. data issues
- Monitor memory and CPU usage during processing
- Validate spaCy model installation and compatibility