Data Profiler Tool Configuration Guide

Overview

The Data Profiler Tool profiles datasets and assesses their quality. It generates statistical summaries and distributions, detects data quality issues, identifies patterns and anomalies, and recommends preprocessing steps, at both the column level and the dataset level. It relies on stats_tool and pandas_tool for core operations and supports four profiling levels (basic, standard, comprehensive, deep) and seven data quality checks (missing_values, duplicates, outliers, inconsistencies, data_types, distributions, correlations).

You can configure the tool through environment variables that use the DATA_PROFILER_ prefix, or programmatically when you initialize it.

Using .env Files in Your Project

When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Data Profiler Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.

Setting Up .env Files

1. Install python-dotenv:

pip install python-dotenv

2. Create a .env file in your project root:

# .env file in your project root
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
DATA_PROFILER_CORRELATION_THRESHOLD=0.7
DATA_PROFILER_MISSING_THRESHOLD=0.5
DATA_PROFILER_ENABLE_VISUALIZATIONS=true
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50

3. Load the .env file in your application:

# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv

# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()

# Now import and use aiecs tools
from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool

# The tool will automatically use the environment variables
data_profiler = DataProfilerTool()

Multiple Environment Files

You can use different .env files for different environments:

import os
from dotenv import load_dotenv

# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')

if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')

from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool
data_profiler = DataProfilerTool()

Example .env.production:

# Production settings - optimized for comprehensive analysis
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
DATA_PROFILER_CORRELATION_THRESHOLD=0.8
DATA_PROFILER_MISSING_THRESHOLD=0.3
DATA_PROFILER_ENABLE_VISUALIZATIONS=true
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100

Example .env.development:

# Development settings - optimized for testing and debugging
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
DATA_PROFILER_CORRELATION_THRESHOLD=0.7
DATA_PROFILER_MISSING_THRESHOLD=0.5
DATA_PROFILER_ENABLE_VISUALIZATIONS=false
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20

Best Practices for .env Files

Never commit .env files to version control - Add .env to your .gitignore:

# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging

Provide a template - Create .env.example with documented dummy values:

# .env.example
# Data Profiler Tool Configuration

# Default profiling depth level
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard

# Standard deviation threshold for outlier detection
DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0

# Correlation threshold for identifying strong relationships
DATA_PROFILER_CORRELATION_THRESHOLD=0.7

# Missing value threshold for quality assessment
DATA_PROFILER_MISSING_THRESHOLD=0.5

# Whether to enable visualization generation
DATA_PROFILER_ENABLE_VISUALIZATIONS=true

# Maximum unique values for categorical analysis
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50

Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format values correctly:
- Strings: plain text without quotes (standard, comprehensive, basic)
- Floats: decimal numbers (3.0, 0.7, 0.5)
- Integers: whole numbers without a decimal point (50, 100)
- Booleans: lowercase true or false

Configuration Options

1. Default Profile Level

Environment Variable: DATA_PROFILER_DEFAULT_PROFILE_LEVEL

Type: String

Default: "standard"

Description: Default profiling depth level when no specific level is specified. This determines the comprehensiveness of the data profiling analysis.

Supported Levels:

basic - Basic statistical summaries and simple quality checks
standard - Standard profiling with quality assessment (default)
comprehensive - Comprehensive analysis with detailed patterns
deep - Deep analysis with advanced statistical methods

Example:

export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive

Level Note: Higher levels provide more detail but take longer to process.

2. Outlier STD Threshold

Environment Variable: DATA_PROFILER_OUTLIER_STD_THRESHOLD

Type: Float

Default: 3.0

Description: Standard deviation threshold for outlier detection using the Z-score method. A value is considered an outlier when it lies more than this many standard deviations from the column mean.

Common Values:

2.0 - Strict outlier detection (more outliers detected)
2.5 - Moderate outlier detection
3.0 - Standard outlier detection (default)
3.5 - Lenient outlier detection (fewer outliers detected)

Example:

export DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5

Threshold Note: Lower values detect more outliers, higher values are more lenient.
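
To make the effect of the threshold concrete, here is a minimal sketch of Z-score outlier detection in plain Python. It is not the tool's implementation: the function name zscore_outliers is hypothetical, and the use of the population standard deviation is an assumption (the tool may use the sample standard deviation instead).

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 50]

# The value 50 has a Z-score of roughly 3.45 in this sample.
flagged_at_default = zscore_outliers(readings, threshold=3.0)
flagged_at_lenient = zscore_outliers(readings, threshold=3.5)
print(flagged_at_default, flagged_at_lenient)
```

In this sample the single extreme reading is flagged at the default threshold but not at 3.5, which is the trade-off the common values above describe.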

3. Correlation Threshold

Environment Variable: DATA_PROFILER_CORRELATION_THRESHOLD

Type: Float

Default: 0.7

Description: Correlation threshold for identifying strong relationships between variables. Variable pairs whose correlation exceeds this threshold are reported as strongly related.

Common Values:

0.5 - Moderate correlation threshold
0.7 - Strong correlation threshold (default)
0.8 - Very strong correlation threshold
0.9 - Extremely strong correlation threshold

Example:

export DATA_PROFILER_CORRELATION_THRESHOLD=0.8

Correlation Note: Higher thresholds identify only the strongest relationships.
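
The following sketch shows how a correlation threshold can filter column pairs. It is an illustration only: pearson and strong_pairs are hypothetical helpers, plain lists stand in for DataFrame columns, and comparing the absolute value of the coefficient (so strong negative relationships also count) is an assumption about how the tool behaves.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return cov / (sx * sy)


def strong_pairs(columns, threshold=0.7):
    """Return (col_a, col_b, r) for pairs with |r| at or above `threshold`."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


columns = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 80, 92],
    "noise": [3, 1, 4, 1, 5],
}

for a, b, r in strong_pairs(columns, threshold=0.7):
    print(f"{a} ~ {b}: r = {r}")
```

With the default of 0.7 only the height and weight pair is reported. Lowering the threshold to 0.3 would also surface the two weak pairs involving noise.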

4. Missing Threshold

Environment Variable: DATA_PROFILER_MISSING_THRESHOLD

Type: Float

Default: 0.5

Description: Missing value threshold for quality assessment, expressed as a fraction between 0 and 1. Columns whose fraction of missing values exceeds this threshold are flagged as having quality issues.

Common Values:

0.1 - Strict missing value threshold (10% missing)
0.3 - Moderate missing value threshold (30% missing)
0.5 - Standard missing value threshold (50% missing, default)
0.7 - Lenient missing value threshold (70% missing)

Example:

export DATA_PROFILER_MISSING_THRESHOLD=0.3

Missing Note: Lower thresholds are more strict about missing values.
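
As a rough illustration of this check, the sketch below computes the fraction of missing values per column and flags the columns above the threshold. The helper flag_sparse_columns is hypothetical, None stands in for a missing value, and the strict "greater than" comparison is an assumption.

```python
def flag_sparse_columns(columns, threshold=0.5):
    """Map each flagged column name to its fraction of missing values."""
    flagged = {}
    for name, values in columns.items():
        if not values:
            continue  # skip empty columns to avoid dividing by zero
        ratio = sum(v is None for v in values) / len(values)
        if ratio > threshold:
            flagged[name] = ratio
    return flagged


columns = {
    "id": [1, 2, 3, 4],
    "email": ["a@example.com", None, None, "d@example.com"],  # 50% missing
    "phone": [None, None, None, "555-0100"],  # 75% missing
}

print(flag_sparse_columns(columns, threshold=0.5))
print(flag_sparse_columns(columns, threshold=0.3))
```

At 0.5 only phone is flagged, because email sits exactly on the threshold. At 0.3 both columns are flagged.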

5. Enable Visualizations

Environment Variable: DATA_PROFILER_ENABLE_VISUALIZATIONS

Type: Boolean

Default: True

Description: Whether to enable visualization generation during profiling. Visualizations include histograms, correlation matrices, and distribution plots.

Values:

true - Enable visualizations (default)
false - Disable visualizations

Example:

export DATA_PROFILER_ENABLE_VISUALIZATIONS=true

Visualization Note: Visualizations improve analysis but may slow down profiling.

6. Max Unique Values Categorical

Environment Variable: DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL

Type: Integer

Default: 50

Description: Maximum number of unique values for categorical analysis. Columns with more unique values are treated as text rather than categorical.

Common Values:

20 - Small categorical threshold
50 - Standard categorical threshold (default)
100 - Large categorical threshold
200 - Very large categorical threshold

Example:

export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100

Categorical Note: Higher values allow more categories but may impact performance.
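
The rule described above can be sketched in a few lines. This is an illustration under assumptions, not the tool's code: classify_column is a hypothetical helper and missing values (None) are ignored when counting distinct values.

```python
def classify_column(values, max_unique=50):
    """Label a column 'categorical' or 'text' by its number of distinct values."""
    distinct = {v for v in values if v is not None}
    return "categorical" if len(distinct) <= max_unique else "text"


colors = ["red", "green", "blue", "red", "green"]
comments = ["late delivery", "great", "ok", "broken box", "fine"]

# A deliberately small limit makes the difference visible on toy data.
print(classify_column(colors, max_unique=3))
print(classify_column(comments, max_unique=3))
```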

Usage Examples

Example 1: Basic Environment Configuration

# Set basic data profiling parameters
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
export DATA_PROFILER_CORRELATION_THRESHOLD=0.7
export DATA_PROFILER_MISSING_THRESHOLD=0.5
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50

# Run your application
python app.py

Example 2: Comprehensive Analysis Configuration

# Optimized for comprehensive data analysis
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
export DATA_PROFILER_CORRELATION_THRESHOLD=0.8
export DATA_PROFILER_MISSING_THRESHOLD=0.3
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100

Example 3: Development Configuration

# Development-friendly settings
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
export DATA_PROFILER_CORRELATION_THRESHOLD=0.7
export DATA_PROFILER_MISSING_THRESHOLD=0.5
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20

Example 4: Programmatic Configuration

from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool

# Initialize with custom configuration
data_profiler = DataProfilerTool(config={
    'default_profile_level': 'standard',
    'outlier_std_threshold': 3.0,
    'correlation_threshold': 0.7,
    'missing_threshold': 0.5,
    'enable_visualizations': True,
    'max_unique_values_categorical': 50
})

Example 5: Mixed Configuration

Environment variables are used as defaults, but can be overridden programmatically:

# Shell: set environment defaults
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true

# Python: override for a specific instance
from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool

data_profiler = DataProfilerTool(config={
    'default_profile_level': 'comprehensive',  # overrides the environment variable
    'enable_visualizations': False  # overrides the environment variable
})

Configuration Priority

When the Data Profiler Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):

1. Programmatic config - Values passed to the constructor
2. Environment variables - Values set via DATA_PROFILER_* variables
3. Default values - Built-in defaults as specified above
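
The resolution order can be pictured as three layers merged from lowest to highest priority. The sketch below is only a model of that order: resolve_config is a hypothetical function, and environment values are left as raw strings here because type conversion is a separate step.

```python
import os

# Built-in defaults, as documented in Configuration Options.
DEFAULTS = {
    "default_profile_level": "standard",
    "outlier_std_threshold": 3.0,
    "correlation_threshold": 0.7,
    "missing_threshold": 0.5,
    "enable_visualizations": True,
    "max_unique_values_categorical": 50,
}
ENV_PREFIX = "DATA_PROFILER_"


def resolve_config(overrides=None, environ=None):
    """Merge defaults, then environment variables, then programmatic overrides."""
    environ = os.environ if environ is None else environ
    resolved = dict(DEFAULTS)  # lowest priority
    for key in DEFAULTS:
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw is not None:
            resolved[key] = raw  # middle priority (raw string in this sketch)
    resolved.update(overrides or {})  # highest priority
    return resolved


fake_env = {
    "DATA_PROFILER_DEFAULT_PROFILE_LEVEL": "comprehensive",
    "DATA_PROFILER_ENABLE_VISUALIZATIONS": "true",
}
config = resolve_config({"enable_visualizations": False}, environ=fake_env)
print(config)
```

In this run the profile level comes from the environment, enable_visualizations comes from the programmatic override, and every other key keeps its default.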

Data Type Parsing

String Values

Strings should be provided as plain text without quotes:

export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive

Float Values

Floats should be provided as decimal numbers:

export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
export DATA_PROFILER_CORRELATION_THRESHOLD=0.7
export DATA_PROFILER_MISSING_THRESHOLD=0.5

Integer Values

Integers should be provided as whole numbers without a decimal point:

export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100

Boolean Values

Booleans should be provided as lowercase strings:

export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false
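
Environment variables always arrive as strings, so each one has to be converted to its documented type. The sketch below shows one way such a conversion could look. parse_bool, PARSERS, and parse_env are hypothetical names, and the tool's own parsing may accept more spellings than this strict version does.

```python
def parse_bool(raw):
    """Convert 'true' or 'false' (any case) to a bool; reject anything else."""
    text = raw.strip().lower()
    if text in ("true", "false"):
        return text == "true"
    raise ValueError(f"expected 'true' or 'false', got {raw!r}")


# One converter per documented variable.
PARSERS = {
    "DATA_PROFILER_DEFAULT_PROFILE_LEVEL": str,
    "DATA_PROFILER_OUTLIER_STD_THRESHOLD": float,
    "DATA_PROFILER_CORRELATION_THRESHOLD": float,
    "DATA_PROFILER_MISSING_THRESHOLD": float,
    "DATA_PROFILER_ENABLE_VISUALIZATIONS": parse_bool,
    "DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL": int,
}


def parse_env(environ):
    """Return typed values for the known variables found in `environ`."""
    return {
        name: PARSERS[name](value)
        for name, value in environ.items()
        if name in PARSERS
    }


sample = {
    "DATA_PROFILER_OUTLIER_STD_THRESHOLD": "2.5",
    "DATA_PROFILER_ENABLE_VISUALIZATIONS": "false",
    "DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL": "100",
    "PATH": "/usr/bin",  # unrelated variables are ignored
}
print(parse_env(sample))
```

A dedicated boolean parser matters because Python's built-in bool("false") returns True for any non-empty string. Similarly, int("50.0") raises ValueError, which is why integer settings should be written without a decimal point.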

Validation

Automatic Type Validation

Pydantic automatically validates configuration values:

default_profile_level must be a valid profile level string
outlier_std_threshold must be a positive float
correlation_threshold must be a float between 0 and 1
missing_threshold must be a float between 0 and 1
enable_visualizations must be a boolean
max_unique_values_categorical must be a positive integer
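
The rules listed above can be mirrored in plain Python to show what each one rejects. This sketch uses a standard-library dataclass rather than Pydantic, so it is not the tool's actual settings model. The class name ProfilerSettings and the error messages are hypothetical.

```python
from dataclasses import dataclass

VALID_LEVELS = ("basic", "standard", "comprehensive", "deep")


@dataclass
class ProfilerSettings:
    default_profile_level: str = "standard"
    outlier_std_threshold: float = 3.0
    correlation_threshold: float = 0.7
    missing_threshold: float = 0.5
    enable_visualizations: bool = True
    max_unique_values_categorical: int = 50

    def __post_init__(self):
        # Each check corresponds to one rule in the list above.
        if self.default_profile_level not in VALID_LEVELS:
            raise ValueError(f"unknown profile level: {self.default_profile_level!r}")
        if not self.outlier_std_threshold > 0:
            raise ValueError("outlier_std_threshold must be positive")
        for name in ("correlation_threshold", "missing_threshold"):
            if not 0 <= getattr(self, name) <= 1:
                raise ValueError(f"{name} must be between 0 and 1")
        if not isinstance(self.enable_visualizations, bool):
            raise ValueError("enable_visualizations must be a boolean")
        limit = self.max_unique_values_categorical
        if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
            raise ValueError("max_unique_values_categorical must be a positive integer")


settings = ProfilerSettings(default_profile_level="deep", missing_threshold=0.3)

try:
    ProfilerSettings(correlation_threshold=1.5)
except ValueError as exc:
    print(f"Rejected: {exc}")
```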

Runtime Validation

When profiling data, the tool validates:

Profile level - Level must be supported and appropriate for data size
Threshold values - Thresholds must be reasonable for the analysis
Data compatibility - Data must be compatible with profiling operations
Memory requirements - Profiling must not exceed memory limits
Processing time - Profiling must complete within reasonable time

Profile Levels

The Data Profiler Tool supports various profiling levels:

Basic Level

Basic statistical summaries (mean, median, std, etc.)
Simple quality checks (missing values, duplicates)
Fast processing for large datasets
Minimal resource usage

Standard Level

Standard statistical analysis
Quality assessment with thresholds
Pattern identification
Balanced performance and detail

Comprehensive Level

Detailed statistical analysis
Advanced quality checks
Pattern and anomaly detection
Correlation analysis
Preprocessing recommendations

Deep Level

Advanced statistical methods
Machine learning-based analysis
Complex pattern recognition
Detailed anomaly detection
Comprehensive preprocessing recommendations

Data Quality Checks

Missing Values

Count and percentage of missing values
Missing value patterns
Impact assessment
Imputation recommendations

Duplicates

Duplicate row detection
Duplicate column identification
Deduplication strategies
Impact analysis

Outliers

Statistical outlier detection
Domain-specific outlier identification
Outlier impact assessment
Treatment recommendations

Inconsistencies

Data type inconsistencies
Format inconsistencies
Value inconsistencies
Cross-field validation

Data Types

Automatic type inference
Type validation
Type conversion recommendations
Type optimization

Distributions

Distribution analysis
Normality testing
Skewness and kurtosis
Transformation recommendations

Correlations

Correlation matrix generation
Strong relationship identification
Multicollinearity detection
Feature selection recommendations

Operations Supported

The Data Profiler Tool supports comprehensive data profiling operations:

Basic Profiling

profile_data - Generate comprehensive data profile
profile_column - Profile individual columns
profile_dataset - Profile entire dataset
generate_summary - Generate statistical summary
detect_quality_issues - Detect data quality problems

Advanced Profiling

analyze_distributions - Analyze data distributions
detect_outliers - Detect statistical outliers
analyze_correlations - Analyze variable correlations
identify_patterns - Identify data patterns
assess_data_quality - Comprehensive quality assessment

Quality Operations

validate_data_types - Validate data type consistency
check_missing_values - Check missing value patterns
detect_duplicates - Detect duplicate records
analyze_inconsistencies - Analyze data inconsistencies
generate_quality_report - Generate quality assessment report

Visualization Operations

create_histograms - Create distribution histograms
create_correlation_matrix - Create correlation heatmap
create_box_plots - Create outlier box plots
create_missing_heatmap - Create missing value heatmap
create_summary_plots - Create summary visualizations

Recommendation Operations

recommend_preprocessing - Recommend preprocessing steps
suggest_transformations - Suggest data transformations
recommend_cleaning - Recommend data cleaning steps
suggest_feature_engineering - Suggest feature engineering
generate_action_plan - Generate data improvement plan

Troubleshooting

Issue: Profiling takes too long

Error: Profiling operation times out or is very slow

Solutions:

# Use basic profile level
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic

# Disable visualizations
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false

# Reduce categorical threshold
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20

Issue: Memory usage exceeded

Error: Out of memory during profiling

Solutions:

# Use basic profile level
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic

# Disable visualizations
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false

# Process data in chunks (Python)
data_profiler.profile_data(data, chunk_size=10000)

Issue: Too many outliers detected

Error: Excessive outlier detection

Solutions:

# Increase outlier threshold
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.5

# Or use domain-specific outlier detection (Python)
data_profiler.detect_outliers(data, method='domain_specific')

Issue: Missing correlation detection

Error: No correlations detected

Solutions:

# Lower correlation threshold
export DATA_PROFILER_CORRELATION_THRESHOLD=0.5

# Check data types and distributions (Python)
data_profiler.analyze_distributions(data)

Issue: Categorical analysis issues

Error: Categorical analysis problems

Solutions:

# Increase categorical threshold
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100

# Or specify categorical columns explicitly (Python)
data_profiler.profile_data(data, categorical_columns=['col1', 'col2'])

Issue: Visualization generation fails

Error: Visualization creation errors

Solutions:

# Disable visualizations
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false

# Install any missing visualization dependencies
pip install matplotlib seaborn plotly

Issue: Quality assessment too strict

Error: Too many quality issues detected

Solutions:

# Increase missing threshold
export DATA_PROFILER_MISSING_THRESHOLD=0.7

# Adjust quality criteria (Python)
data_profiler.assess_data_quality(data, strict_mode=False)

Best Practices

Performance Optimization

Profile Level Selection - Choose appropriate profile level for your needs
Visualization Control - Disable visualizations for large datasets
Categorical Threshold - Set appropriate categorical threshold
Chunk Processing - Process large datasets in chunks
Memory Management - Monitor memory usage during profiling

Error Handling

Graceful Degradation - Handle profiling failures gracefully
Validation - Validate data before profiling
Fallback Strategies - Provide fallback profiling methods
Error Logging - Log errors for debugging and monitoring
User Feedback - Provide clear error messages

Security

Data Privacy - Ensure data privacy during profiling
Access Control - Control access to profiling results
Audit Logging - Log profiling activities
Data Sanitization - Sanitize sensitive data
Compliance - Ensure compliance with data regulations

Resource Management

Memory Monitoring - Monitor memory usage during profiling
Processing Time - Set reasonable timeouts
Storage Optimization - Optimize result storage
Cleanup - Clean up temporary files
Resource Limits - Set appropriate resource limits

Integration

Tool Dependencies - Ensure required tools are available
API Compatibility - Maintain API compatibility
Error Propagation - Properly propagate errors
Logging Integration - Integrate with logging systems
Monitoring - Monitor tool performance and usage

Development vs Production

Development:

DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
DATA_PROFILER_CORRELATION_THRESHOLD=0.7
DATA_PROFILER_MISSING_THRESHOLD=0.5
DATA_PROFILER_ENABLE_VISUALIZATIONS=false
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20

Production:

DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
DATA_PROFILER_CORRELATION_THRESHOLD=0.8
DATA_PROFILER_MISSING_THRESHOLD=0.3
DATA_PROFILER_ENABLE_VISUALIZATIONS=true
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100

Error Handling

Always wrap profiling operations in try-except blocks:

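import pandas as pd

from aiecs.tools.statistics.data_profiler_tool import (
    DataProfilerTool,
    DataProfilerError,
    ProfilingError,
)

# Small illustrative DataFrame; replace it with your own data
df = pd.DataFrame({'value': [1.0, 2.5, None, 4.0]})

data_profiler = DataProfilerTool()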

try:
    profile = data_profiler.profile_data(
        data=df,
        profile_level='standard',
        enable_visualizations=True
    )
except ProfilingError as e:
    print(f"Profiling error: {e}")
except DataProfilerError as e:
    print(f"Data profiler error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Dependencies

Core Dependencies

# Install core dependencies
pip install pydantic python-dotenv

# Install data processing dependencies
pip install pandas numpy scipy

# Install visualization dependencies
pip install matplotlib seaborn plotly

Optional Dependencies

# For advanced statistical analysis
pip install scikit-learn statsmodels

# For enhanced visualization
pip install bokeh altair

# For data quality assessment
pip install great-expectations

# For advanced profiling (ydata-profiling is the renamed successor of pandas-profiling)
pip install ydata-profiling

Verification

# Test dependency availability
try:
    import pandas
    import numpy
    import scipy
    print("Core dependencies available")
except ImportError as e:
    print(f"Missing dependency: {e}")

# Test visualization availability
try:
    import matplotlib
    import seaborn
    print("Visualization available")
except ImportError:
    print("Visualization not available")

# Test advanced analysis availability
try:
    import sklearn
    import statsmodels
    print("Advanced analysis available")
except ImportError:
    print("Advanced analysis not available")

# Test profiling libraries availability
try:
    import ydata_profiling
    print("Advanced profiling available")
except ImportError:
    print("Advanced profiling not available")

Support

For issues or questions about Data Profiler Tool configuration:

Check the tool source code for implementation details
Review statistics tool documentation for statistical analysis
Consult the main aiecs documentation for architecture overview
Test with simple datasets first to isolate configuration vs. profiling issues
Verify data compatibility and format requirements
Check profile level and threshold settings
Ensure proper visualization dependencies
Validate data quality and structure requirements
