# Data Profiler Tool Configuration Guide

## Overview

The Data Profiler Tool is a comprehensive data profiling and quality assessment tool that provides advanced data profiling capabilities with statistical summaries and distributions, data quality issue detection, pattern and anomaly identification, preprocessing recommendations, and column-level and dataset-level analysis. It can generate statistical summaries, detect data quality issues, identify patterns and anomalies, and recommend preprocessing steps. The tool integrates with stats_tool and pandas_tool for core operations and supports various profiling levels (basic, standard, comprehensive, deep) and data quality checks (missing_values, duplicates, outliers, inconsistencies, data_types, distributions, correlations). The tool can be configured via environment variables using the `DATA_PROFILER_` prefix or through programmatic configuration when initializing the tool.

## Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The Data Profiler Tool reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools.

### Setting Up .env Files

**1. Install python-dotenv:**

```bash
pip install python-dotenv
```

**2. Create a `.env` file in your project root:**

```bash
# .env file in your project root
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
DATA_PROFILER_CORRELATION_THRESHOLD=0.7
DATA_PROFILER_MISSING_THRESHOLD=0.5
DATA_PROFILER_ENABLE_VISUALIZATIONS=true
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50
```

**3. Load the .env file in your application:**

```python
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool

# The tool will automatically use the environment variables
data_profiler = DataProfilerTool()
```

### Multiple Environment Files

You can use different `.env` files for different environments:

```python
import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool
data_profiler = DataProfilerTool()
```

**Example `.env.production`:**
```bash
# Production settings - optimized for comprehensive analysis
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
DATA_PROFILER_CORRELATION_THRESHOLD=0.8
DATA_PROFILER_MISSING_THRESHOLD=0.3
DATA_PROFILER_ENABLE_VISUALIZATIONS=true
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
```

**Example `.env.development`:**
```bash
# Development settings - optimized for testing and debugging
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
DATA_PROFILER_CORRELATION_THRESHOLD=0.7
DATA_PROFILER_MISSING_THRESHOLD=0.5
DATA_PROFILER_ENABLE_VISUALIZATIONS=false
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20
```

### Best Practices for .env Files

1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`:
   ```gitignore
   # .gitignore
   .env
   .env.local
   .env.*.local
   .env.production
   .env.staging
   ```

2. **Provide a template** - Create `.env.example` with documented dummy values:
   ```bash
   # .env.example
   # Data Profiler Tool Configuration
   
   # Default profiling depth level
   DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
   
   # Standard deviation threshold for outlier detection
   DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
   
   # Correlation threshold for identifying strong relationships
   DATA_PROFILER_CORRELATION_THRESHOLD=0.7
   
   # Missing value threshold for quality assessment
   DATA_PROFILER_MISSING_THRESHOLD=0.5
   
   # Whether to enable visualization generation
   DATA_PROFILER_ENABLE_VISUALIZATIONS=true
   
   # Maximum unique values for categorical analysis
   DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50
   ```

3. **Document your variables** - Add comments explaining each setting

4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports

5. **Format values correctly**:
   - Strings: Plain text: `standard`, `comprehensive`, `basic`
   - Floats: Decimal numbers: `3.0`, `0.7`, `0.5`
   - Integers: Plain numbers: `50`, `100`
   - Booleans: `true` or `false`

## Configuration Options

### 1. Default Profile Level

**Environment Variable:** `DATA_PROFILER_DEFAULT_PROFILE_LEVEL`

**Type:** String

**Default:** `"standard"`

**Description:** Default profiling depth level when no specific level is specified. This determines the comprehensiveness of the data profiling analysis.

**Supported Levels:**
- `basic` - Basic statistical summaries and simple quality checks
- `standard` - Standard profiling with quality assessment (default)
- `comprehensive` - Comprehensive analysis with detailed patterns
- `deep` - Deep analysis with advanced statistical methods

**Example:**
```bash
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
```

**Level Note:** Higher levels provide more detail but take longer to process.

### 2. Outlier STD Threshold

**Environment Variable:** `DATA_PROFILER_OUTLIER_STD_THRESHOLD`

**Type:** Float

**Default:** `3.0`

**Description:** Standard deviation threshold for outlier detection. Values beyond this threshold are considered outliers using the Z-score method.

**Common Values:**
- `2.0` - Strict outlier detection (more outliers detected)
- `2.5` - Moderate outlier detection
- `3.0` - Standard outlier detection (default)
- `3.5` - Lenient outlier detection (fewer outliers detected)

**Example:**
```bash
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
```

**Threshold Note:** Lower values detect more outliers, higher values are more lenient.

### 3. Correlation Threshold

**Environment Variable:** `DATA_PROFILER_CORRELATION_THRESHOLD`

**Type:** Float

**Default:** `0.7`

**Description:** Correlation threshold for identifying strong relationships between variables. Correlations above this threshold are considered significant.

**Common Values:**
- `0.5` - Moderate correlation threshold
- `0.7` - Strong correlation threshold (default)
- `0.8` - Very strong correlation threshold
- `0.9` - Extremely strong correlation threshold

**Example:**
```bash
export DATA_PROFILER_CORRELATION_THRESHOLD=0.8
```

**Correlation Note:** Higher thresholds identify only the strongest relationships.

### 4. Missing Threshold

**Environment Variable:** `DATA_PROFILER_MISSING_THRESHOLD`

**Type:** Float

**Default:** `0.5`

**Description:** Missing value threshold for quality assessment. Columns with missing values above this threshold are flagged as having quality issues.

**Common Values:**
- `0.1` - Strict missing value threshold (10% missing)
- `0.3` - Moderate missing value threshold (30% missing)
- `0.5` - Standard missing value threshold (50% missing, default)
- `0.7` - Lenient missing value threshold (70% missing)

**Example:**
```bash
export DATA_PROFILER_MISSING_THRESHOLD=0.3
```

**Missing Note:** Lower thresholds are more strict about missing values.

### 5. Enable Visualizations

**Environment Variable:** `DATA_PROFILER_ENABLE_VISUALIZATIONS`

**Type:** Boolean

**Default:** `True`

**Description:** Whether to enable visualization generation during profiling. Visualizations include histograms, correlation matrices, and distribution plots.

**Values:**
- `true` - Enable visualizations (default)
- `false` - Disable visualizations

**Example:**
```bash
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
```

**Visualization Note:** Visualizations improve analysis but may slow down profiling.

### 6. Max Unique Values Categorical

**Environment Variable:** `DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL`

**Type:** Integer

**Default:** `50`

**Description:** Maximum number of unique values for categorical analysis. Columns with more unique values are treated as text rather than categorical.

**Common Values:**
- `20` - Small categorical threshold
- `50` - Standard categorical threshold (default)
- `100` - Large categorical threshold
- `200` - Very large categorical threshold

**Example:**
```bash
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
```

**Categorical Note:** Higher values allow more categories but may impact performance.

## Usage Examples

### Example 1: Basic Environment Configuration

```bash
# Set basic data profiling parameters
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
export DATA_PROFILER_CORRELATION_THRESHOLD=0.7
export DATA_PROFILER_MISSING_THRESHOLD=0.5
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50

# Run your application
python app.py
```

### Example 2: Comprehensive Analysis Configuration

```bash
# Optimized for comprehensive data analysis
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
export DATA_PROFILER_CORRELATION_THRESHOLD=0.8
export DATA_PROFILER_MISSING_THRESHOLD=0.3
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
```

### Example 3: Development Configuration

```bash
# Development-friendly settings
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
export DATA_PROFILER_CORRELATION_THRESHOLD=0.7
export DATA_PROFILER_MISSING_THRESHOLD=0.5
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20
```

### Example 4: Programmatic Configuration

```python
from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool

# Initialize with custom configuration
data_profiler = DataProfilerTool(config={
    'default_profile_level': 'standard',
    'outlier_std_threshold': 3.0,
    'correlation_threshold': 0.7,
    'missing_threshold': 0.5,
    'enable_visualizations': True,
    'max_unique_values_categorical': 50
})
```

### Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

```bash
# Set environment defaults
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
```

```python
# Override for specific instance
data_profiler = DataProfilerTool(config={
    'default_profile_level': 'comprehensive',  # This overrides the environment variable
    'enable_visualizations': False  # This overrides the environment variable
})
```

## Configuration Priority

When the Data Profiler Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

1. **Programmatic config** - Values passed to the constructor
2. **Environment variables** - Values set via `DATA_PROFILER_*` variables
3. **Default values** - Built-in defaults as specified above

## Data Type Parsing

### String Values

Strings should be provided as plain text without quotes:

```bash
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
```

### Float Values

Floats should be provided as decimal numbers:

```bash
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
export DATA_PROFILER_CORRELATION_THRESHOLD=0.7
export DATA_PROFILER_MISSING_THRESHOLD=0.5
```

### Integer Values

Integers should be provided as numeric strings:

```bash
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
```

### Boolean Values

Booleans should be provided as lowercase strings:

```bash
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false
```

## Validation

### Automatic Type Validation

Pydantic automatically validates configuration values:

- `default_profile_level` must be a valid profile level string
- `outlier_std_threshold` must be a positive float
- `correlation_threshold` must be a float between 0 and 1
- `missing_threshold` must be a float between 0 and 1
- `enable_visualizations` must be a boolean
- `max_unique_values_categorical` must be a positive integer

### Runtime Validation

When profiling data, the tool validates:

1. **Profile level** - Level must be supported and appropriate for data size
2. **Threshold values** - Thresholds must be reasonable for the analysis
3. **Data compatibility** - Data must be compatible with profiling operations
4. **Memory requirements** - Profiling must not exceed memory limits
5. **Processing time** - Profiling must complete within reasonable time

## Profile Levels

The Data Profiler Tool supports various profiling levels:

### Basic Level
- Basic statistical summaries (mean, median, std, etc.)
- Simple quality checks (missing values, duplicates)
- Fast processing for large datasets
- Minimal resource usage

### Standard Level
- Standard statistical analysis
- Quality assessment with thresholds
- Pattern identification
- Balanced performance and detail

### Comprehensive Level
- Detailed statistical analysis
- Advanced quality checks
- Pattern and anomaly detection
- Correlation analysis
- Preprocessing recommendations

### Deep Level
- Advanced statistical methods
- Machine learning-based analysis
- Complex pattern recognition
- Detailed anomaly detection
- Comprehensive preprocessing recommendations

## Data Quality Checks

### Missing Values
- Count and percentage of missing values
- Missing value patterns
- Impact assessment
- Imputation recommendations

### Duplicates
- Duplicate row detection
- Duplicate column identification
- Deduplication strategies
- Impact analysis

### Outliers
- Statistical outlier detection
- Domain-specific outlier identification
- Outlier impact assessment
- Treatment recommendations

### Inconsistencies
- Data type inconsistencies
- Format inconsistencies
- Value inconsistencies
- Cross-field validation

### Data Types
- Automatic type inference
- Type validation
- Type conversion recommendations
- Type optimization

### Distributions
- Distribution analysis
- Normality testing
- Skewness and kurtosis
- Transformation recommendations

### Correlations
- Correlation matrix generation
- Strong relationship identification
- Multicollinearity detection
- Feature selection recommendations

## Operations Supported

The Data Profiler Tool supports comprehensive data profiling operations:

### Basic Profiling
- `profile_data` - Generate comprehensive data profile
- `profile_column` - Profile individual columns
- `profile_dataset` - Profile entire dataset
- `generate_summary` - Generate statistical summary
- `detect_quality_issues` - Detect data quality problems

### Advanced Profiling
- `analyze_distributions` - Analyze data distributions
- `detect_outliers` - Detect statistical outliers
- `analyze_correlations` - Analyze variable correlations
- `identify_patterns` - Identify data patterns
- `assess_data_quality` - Comprehensive quality assessment

### Quality Operations
- `validate_data_types` - Validate data type consistency
- `check_missing_values` - Check missing value patterns
- `detect_duplicates` - Detect duplicate records
- `analyze_inconsistencies` - Analyze data inconsistencies
- `generate_quality_report` - Generate quality assessment report

### Visualization Operations
- `create_histograms` - Create distribution histograms
- `create_correlation_matrix` - Create correlation heatmap
- `create_box_plots` - Create outlier box plots
- `create_missing_heatmap` - Create missing value heatmap
- `create_summary_plots` - Create summary visualizations

### Recommendation Operations
- `recommend_preprocessing` - Recommend preprocessing steps
- `suggest_transformations` - Suggest data transformations
- `recommend_cleaning` - Recommend data cleaning steps
- `suggest_feature_engineering` - Suggest feature engineering
- `generate_action_plan` - Generate data improvement plan

## Troubleshooting

### Issue: Profiling takes too long

**Error:** Profiling operation times out or is very slow

**Solutions:**
```bash
# Use basic profile level
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic

# Disable visualizations
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false

# Reduce categorical threshold
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20
```

### Issue: Memory usage exceeded

**Error:** Out of memory during profiling

**Solutions:**
```bash
# Use basic profile level
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic

# Disable visualizations
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false

# Process data in chunks
data_profiler.profile_data(data, chunk_size=10000)
```

### Issue: Too many outliers detected

**Error:** Excessive outlier detection

**Solutions:**
```bash
# Increase outlier threshold
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.5

# Or use domain-specific outlier detection
data_profiler.detect_outliers(data, method='domain_specific')
```

### Issue: Missing correlation detection

**Error:** No correlations detected

**Solutions:**
```bash
# Lower correlation threshold
export DATA_PROFILER_CORRELATION_THRESHOLD=0.5

# Check data types and distributions
data_profiler.analyze_distributions(data)
```

### Issue: Categorical analysis issues

**Error:** Categorical analysis problems

**Solutions:**
```bash
# Increase categorical threshold
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100

# Or specify categorical columns explicitly
data_profiler.profile_data(data, categorical_columns=['col1', 'col2'])
```

### Issue: Visualization generation fails

**Error:** Visualization creation errors

**Solutions:**
```bash
# Disable visualizations
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false

# Check visualization dependencies
pip install matplotlib seaborn plotly
```

### Issue: Quality assessment too strict

**Error:** Too many quality issues detected

**Solutions:**
```bash
# Increase missing threshold
export DATA_PROFILER_MISSING_THRESHOLD=0.7

# Adjust quality criteria
data_profiler.assess_data_quality(data, strict_mode=False)
```

## Best Practices

### Performance Optimization

1. **Profile Level Selection** - Choose appropriate profile level for your needs
2. **Visualization Control** - Disable visualizations for large datasets
3. **Categorical Threshold** - Set appropriate categorical threshold
4. **Chunk Processing** - Process large datasets in chunks
5. **Memory Management** - Monitor memory usage during profiling

### Error Handling

1. **Graceful Degradation** - Handle profiling failures gracefully
2. **Validation** - Validate data before profiling
3. **Fallback Strategies** - Provide fallback profiling methods
4. **Error Logging** - Log errors for debugging and monitoring
5. **User Feedback** - Provide clear error messages

### Security

1. **Data Privacy** - Ensure data privacy during profiling
2. **Access Control** - Control access to profiling results
3. **Audit Logging** - Log profiling activities
4. **Data Sanitization** - Sanitize sensitive data
5. **Compliance** - Ensure compliance with data regulations

### Resource Management

1. **Memory Monitoring** - Monitor memory usage during profiling
2. **Processing Time** - Set reasonable timeouts
3. **Storage Optimization** - Optimize result storage
4. **Cleanup** - Clean up temporary files
5. **Resource Limits** - Set appropriate resource limits

### Integration

1. **Tool Dependencies** - Ensure required tools are available
2. **API Compatibility** - Maintain API compatibility
3. **Error Propagation** - Properly propagate errors
4. **Logging Integration** - Integrate with logging systems
5. **Monitoring** - Monitor tool performance and usage

### Development vs Production

**Development:**
```bash
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
DATA_PROFILER_CORRELATION_THRESHOLD=0.7
DATA_PROFILER_MISSING_THRESHOLD=0.5
DATA_PROFILER_ENABLE_VISUALIZATIONS=false
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20
```

**Production:**
```bash
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
DATA_PROFILER_CORRELATION_THRESHOLD=0.8
DATA_PROFILER_MISSING_THRESHOLD=0.3
DATA_PROFILER_ENABLE_VISUALIZATIONS=true
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
```

### Error Handling

Always wrap profiling operations in try-except blocks:

```python
from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool, DataProfilerError, ProfilingError

data_profiler = DataProfilerTool()

try:
    profile = data_profiler.profile_data(
        data=df,
        profile_level='standard',
        enable_visualizations=True
    )
except ProfilingError as e:
    print(f"Profiling error: {e}")
except DataProfilerError as e:
    print(f"Data profiler error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Dependencies

### Core Dependencies

```bash
# Install core dependencies
pip install pydantic python-dotenv

# Install data processing dependencies
pip install pandas numpy scipy

# Install visualization dependencies
pip install matplotlib seaborn plotly
```

### Optional Dependencies

```bash
# For advanced statistical analysis
pip install scikit-learn statsmodels

# For enhanced visualization
pip install bokeh altair

# For data quality assessment
pip install great-expectations

# For advanced profiling
pip install pandas-profiling ydata-profiling
```

### Verification

```python
# Test dependency availability
try:
    import pandas
    import numpy
    import scipy
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test visualization availability
try:
    import matplotlib
    import seaborn
    print("Visualization available")
except ImportError:
    print("Visualization not available")

# Test advanced analysis availability
try:
    import sklearn
    import statsmodels
    print("Advanced analysis available")
except ImportError:
    print("Advanced analysis not available")

# Test profiling libraries availability
try:
    import ydata_profiling
    print("Advanced profiling available")
except ImportError:
    print("Advanced profiling not available")
```

## Related Documentation

- Tool implementation details in the source code
- Statistics tool documentation for statistical analysis
- Pandas tool documentation for data operations
- Main aiecs documentation for architecture overview

## Support

For issues or questions about Data Profiler Tool configuration:
- Check the tool source code for implementation details
- Review statistics tool documentation for statistical analysis
- Consult the main aiecs documentation for architecture overview
- Test with simple datasets first to isolate configuration vs. profiling issues
- Verify data compatibility and format requirements
- Check profile level and threshold settings
- Ensure proper visualization dependencies
- Validate data quality and structure requirements