Data Profiler Tool Configuration Guide
Overview
The Data Profiler Tool is a comprehensive data profiling and quality assessment tool that provides advanced data profiling capabilities with statistical summaries and distributions, data quality issue detection, pattern and anomaly identification, preprocessing recommendations, and column-level and dataset-level analysis. It can generate statistical summaries, detect data quality issues, identify patterns and anomalies, and recommend preprocessing steps. The tool integrates with stats_tool and pandas_tool for core operations and supports various profiling levels (basic, standard, comprehensive, deep) and data quality checks (missing_values, duplicates, outliers, inconsistencies, data_types, distributions, correlations). The tool can be configured via environment variables using the DATA_PROFILER_ prefix or through programmatic configuration when initializing the tool.
Using .env Files in Your Project
When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Data Profiler Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.
Setting Up .env Files
1. Install python-dotenv:
pip install python-dotenv
2. Create a .env file in your project root:
# .env file in your project root
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
DATA_PROFILER_CORRELATION_THRESHOLD=0.7
DATA_PROFILER_MISSING_THRESHOLD=0.5
DATA_PROFILER_ENABLE_VISUALIZATIONS=true
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50
3. Load the .env file in your application:
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv
# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()
# Now import and use aiecs tools
from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool
# The tool will automatically use the environment variables
data_profiler = DataProfilerTool()
Multiple Environment Files
You can use different .env files for different environments:
import os
from dotenv import load_dotenv
# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')
if env == 'production':
load_dotenv('.env.production')
elif env == 'staging':
load_dotenv('.env.staging')
else:
load_dotenv('.env.development')
from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool
data_profiler = DataProfilerTool()
Example .env.production:
# Production settings - optimized for comprehensive analysis
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
DATA_PROFILER_CORRELATION_THRESHOLD=0.8
DATA_PROFILER_MISSING_THRESHOLD=0.3
DATA_PROFILER_ENABLE_VISUALIZATIONS=true
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
Example .env.development:
# Development settings - optimized for testing and debugging
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
DATA_PROFILER_CORRELATION_THRESHOLD=0.7
DATA_PROFILER_MISSING_THRESHOLD=0.5
DATA_PROFILER_ENABLE_VISUALIZATIONS=false
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20
Best Practices for .env Files
Never commit .env files to version control - Add
.envto your.gitignore:# .gitignore .env .env.local .env.*.local .env.production .env.staging
Provide a template - Create
.env.examplewith documented dummy values:# .env.example # Data Profiler Tool Configuration # Default profiling depth level DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard # Standard deviation threshold for outlier detection DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0 # Correlation threshold for identifying strong relationships DATA_PROFILER_CORRELATION_THRESHOLD=0.7 # Missing value threshold for quality assessment DATA_PROFILER_MISSING_THRESHOLD=0.5 # Whether to enable visualization generation DATA_PROFILER_ENABLE_VISUALIZATIONS=true # Maximum unique values for categorical analysis DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50
Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format values correctly:
Strings: Plain text:
standard,comprehensive,basicFloats: Decimal numbers:
3.0,0.7,0.5Integers: Plain numbers:
50,100Booleans:
trueorfalse
Configuration Options
1. Default Profile Level
Environment Variable: DATA_PROFILER_DEFAULT_PROFILE_LEVEL
Type: String
Default: "standard"
Description: Default profiling depth level when no specific level is specified. This determines the comprehensiveness of the data profiling analysis.
Supported Levels:
basic- Basic statistical summaries and simple quality checksstandard- Standard profiling with quality assessment (default)comprehensive- Comprehensive analysis with detailed patternsdeep- Deep analysis with advanced statistical methods
Example:
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
Level Note: Higher levels provide more detail but take longer to process.
2. Outlier STD Threshold
Environment Variable: DATA_PROFILER_OUTLIER_STD_THRESHOLD
Type: Float
Default: 3.0
Description: Standard deviation threshold for outlier detection. Values beyond this threshold are considered outliers using the Z-score method.
Common Values:
2.0- Strict outlier detection (more outliers detected)2.5- Moderate outlier detection3.0- Standard outlier detection (default)3.5- Lenient outlier detection (fewer outliers detected)
Example:
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
Threshold Note: Lower values detect more outliers, higher values are more lenient.
3. Correlation Threshold
Environment Variable: DATA_PROFILER_CORRELATION_THRESHOLD
Type: Float
Default: 0.7
Description: Correlation threshold for identifying strong relationships between variables. Correlations above this threshold are considered significant.
Common Values:
0.5- Moderate correlation threshold0.7- Strong correlation threshold (default)0.8- Very strong correlation threshold0.9- Extremely strong correlation threshold
Example:
export DATA_PROFILER_CORRELATION_THRESHOLD=0.8
Correlation Note: Higher thresholds identify only the strongest relationships.
4. Missing Threshold
Environment Variable: DATA_PROFILER_MISSING_THRESHOLD
Type: Float
Default: 0.5
Description: Missing value threshold for quality assessment. Columns with missing values above this threshold are flagged as having quality issues.
Common Values:
0.1- Strict missing value threshold (10% missing)0.3- Moderate missing value threshold (30% missing)0.5- Standard missing value threshold (50% missing, default)0.7- Lenient missing value threshold (70% missing)
Example:
export DATA_PROFILER_MISSING_THRESHOLD=0.3
Missing Note: Lower thresholds are more strict about missing values.
5. Enable Visualizations
Environment Variable: DATA_PROFILER_ENABLE_VISUALIZATIONS
Type: Boolean
Default: True
Description: Whether to enable visualization generation during profiling. Visualizations include histograms, correlation matrices, and distribution plots.
Values:
true- Enable visualizations (default)false- Disable visualizations
Example:
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
Visualization Note: Visualizations improve analysis but may slow down profiling.
6. Max Unique Values Categorical
Environment Variable: DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL
Type: Integer
Default: 50
Description: Maximum number of unique values for categorical analysis. Columns with more unique values are treated as text rather than categorical.
Common Values:
20- Small categorical threshold50- Standard categorical threshold (default)100- Large categorical threshold200- Very large categorical threshold
Example:
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
Categorical Note: Higher values allow more categories but may impact performance.
Usage Examples
Example 1: Basic Environment Configuration
# Set basic data profiling parameters
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
export DATA_PROFILER_CORRELATION_THRESHOLD=0.7
export DATA_PROFILER_MISSING_THRESHOLD=0.5
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50
# Run your application
python app.py
Example 2: Comprehensive Analysis Configuration
# Optimized for comprehensive data analysis
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
export DATA_PROFILER_CORRELATION_THRESHOLD=0.8
export DATA_PROFILER_MISSING_THRESHOLD=0.3
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
Example 3: Development Configuration
# Development-friendly settings
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
export DATA_PROFILER_CORRELATION_THRESHOLD=0.7
export DATA_PROFILER_MISSING_THRESHOLD=0.5
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20
Example 4: Programmatic Configuration
from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool
# Initialize with custom configuration
data_profiler = DataProfilerTool(config={
'default_profile_level': 'standard',
'outlier_std_threshold': 3.0,
'correlation_threshold': 0.7,
'missing_threshold': 0.5,
'enable_visualizations': True,
'max_unique_values_categorical': 50
})
Example 5: Mixed Configuration
Environment variables are used as defaults, but can be overridden programmatically:
# Set environment defaults
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
# Override for specific instance
data_profiler = DataProfilerTool(config={
'default_profile_level': 'comprehensive', # This overrides the environment variable
'enable_visualizations': False # This overrides the environment variable
})
Configuration Priority
When the Data Profiler Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):
Programmatic config - Values passed to the constructor
Environment variables - Values set via
DATA_PROFILER_*variablesDefault values - Built-in defaults as specified above
Data Type Parsing
String Values
Strings should be provided as plain text without quotes:
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=standard
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
Float Values
Floats should be provided as decimal numbers:
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
export DATA_PROFILER_CORRELATION_THRESHOLD=0.7
export DATA_PROFILER_MISSING_THRESHOLD=0.5
Integer Values
Integers should be provided as numeric strings:
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=50
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
Boolean Values
Booleans should be provided as lowercase strings:
export DATA_PROFILER_ENABLE_VISUALIZATIONS=true
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false
Validation
Automatic Type Validation
Pydantic automatically validates configuration values:
default_profile_levelmust be a valid profile level stringoutlier_std_thresholdmust be a positive floatcorrelation_thresholdmust be a float between 0 and 1missing_thresholdmust be a float between 0 and 1enable_visualizationsmust be a booleanmax_unique_values_categoricalmust be a positive integer
Runtime Validation
When profiling data, the tool validates:
Profile level - Level must be supported and appropriate for data size
Threshold values - Thresholds must be reasonable for the analysis
Data compatibility - Data must be compatible with profiling operations
Memory requirements - Profiling must not exceed memory limits
Processing time - Profiling must complete within reasonable time
Profile Levels
The Data Profiler Tool supports various profiling levels:
Basic Level
Basic statistical summaries (mean, median, std, etc.)
Simple quality checks (missing values, duplicates)
Fast processing for large datasets
Minimal resource usage
Standard Level
Standard statistical analysis
Quality assessment with thresholds
Pattern identification
Balanced performance and detail
Comprehensive Level
Detailed statistical analysis
Advanced quality checks
Pattern and anomaly detection
Correlation analysis
Preprocessing recommendations
Deep Level
Advanced statistical methods
Machine learning-based analysis
Complex pattern recognition
Detailed anomaly detection
Comprehensive preprocessing recommendations
Data Quality Checks
Missing Values
Count and percentage of missing values
Missing value patterns
Impact assessment
Imputation recommendations
Duplicates
Duplicate row detection
Duplicate column identification
Deduplication strategies
Impact analysis
Outliers
Statistical outlier detection
Domain-specific outlier identification
Outlier impact assessment
Treatment recommendations
Inconsistencies
Data type inconsistencies
Format inconsistencies
Value inconsistencies
Cross-field validation
Data Types
Automatic type inference
Type validation
Type conversion recommendations
Type optimization
Distributions
Distribution analysis
Normality testing
Skewness and kurtosis
Transformation recommendations
Correlations
Correlation matrix generation
Strong relationship identification
Multicollinearity detection
Feature selection recommendations
Operations Supported
The Data Profiler Tool supports comprehensive data profiling operations:
Basic Profiling
profile_data- Generate comprehensive data profileprofile_column- Profile individual columnsprofile_dataset- Profile entire datasetgenerate_summary- Generate statistical summarydetect_quality_issues- Detect data quality problems
Advanced Profiling
analyze_distributions- Analyze data distributionsdetect_outliers- Detect statistical outliersanalyze_correlations- Analyze variable correlationsidentify_patterns- Identify data patternsassess_data_quality- Comprehensive quality assessment
Quality Operations
validate_data_types- Validate data type consistencycheck_missing_values- Check missing value patternsdetect_duplicates- Detect duplicate recordsanalyze_inconsistencies- Analyze data inconsistenciesgenerate_quality_report- Generate quality assessment report
Visualization Operations
create_histograms- Create distribution histogramscreate_correlation_matrix- Create correlation heatmapcreate_box_plots- Create outlier box plotscreate_missing_heatmap- Create missing value heatmapcreate_summary_plots- Create summary visualizations
Recommendation Operations
recommend_preprocessing- Recommend preprocessing stepssuggest_transformations- Suggest data transformationsrecommend_cleaning- Recommend data cleaning stepssuggest_feature_engineering- Suggest feature engineeringgenerate_action_plan- Generate data improvement plan
Troubleshooting
Issue: Profiling takes too long
Error: Profiling operation times out or is very slow
Solutions:
# Use basic profile level
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
# Disable visualizations
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false
# Reduce categorical threshold
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20
Issue: Memory usage exceeded
Error: Out of memory during profiling
Solutions:
# Use basic profile level
export DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
# Disable visualizations
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false
# Process data in chunks
data_profiler.profile_data(data, chunk_size=10000)
Issue: Too many outliers detected
Error: Excessive outlier detection
Solutions:
# Increase outlier threshold
export DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.5
# Or use domain-specific outlier detection
data_profiler.detect_outliers(data, method='domain_specific')
Issue: Missing correlation detection
Error: No correlations detected
Solutions:
# Lower correlation threshold
export DATA_PROFILER_CORRELATION_THRESHOLD=0.5
# Check data types and distributions
data_profiler.analyze_distributions(data)
Issue: Categorical analysis issues
Error: Categorical analysis problems
Solutions:
# Increase categorical threshold
export DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
# Or specify categorical columns explicitly
data_profiler.profile_data(data, categorical_columns=['col1', 'col2'])
Issue: Visualization generation fails
Error: Visualization creation errors
Solutions:
# Disable visualizations
export DATA_PROFILER_ENABLE_VISUALIZATIONS=false
# Check visualization dependencies
pip install matplotlib seaborn plotly
Issue: Quality assessment too strict
Error: Too many quality issues detected
Solutions:
# Increase missing threshold
export DATA_PROFILER_MISSING_THRESHOLD=0.7
# Adjust quality criteria
data_profiler.assess_data_quality(data, strict_mode=False)
Best Practices
Performance Optimization
Profile Level Selection - Choose appropriate profile level for your needs
Visualization Control - Disable visualizations for large datasets
Categorical Threshold - Set appropriate categorical threshold
Chunk Processing - Process large datasets in chunks
Memory Management - Monitor memory usage during profiling
Error Handling
Graceful Degradation - Handle profiling failures gracefully
Validation - Validate data before profiling
Fallback Strategies - Provide fallback profiling methods
Error Logging - Log errors for debugging and monitoring
User Feedback - Provide clear error messages
Security
Data Privacy - Ensure data privacy during profiling
Access Control - Control access to profiling results
Audit Logging - Log profiling activities
Data Sanitization - Sanitize sensitive data
Compliance - Ensure compliance with data regulations
Resource Management
Memory Monitoring - Monitor memory usage during profiling
Processing Time - Set reasonable timeouts
Storage Optimization - Optimize result storage
Cleanup - Clean up temporary files
Resource Limits - Set appropriate resource limits
Integration
Tool Dependencies - Ensure required tools are available
API Compatibility - Maintain API compatibility
Error Propagation - Properly propagate errors
Logging Integration - Integrate with logging systems
Monitoring - Monitor tool performance and usage
Development vs Production
Development:
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=basic
DATA_PROFILER_OUTLIER_STD_THRESHOLD=3.0
DATA_PROFILER_CORRELATION_THRESHOLD=0.7
DATA_PROFILER_MISSING_THRESHOLD=0.5
DATA_PROFILER_ENABLE_VISUALIZATIONS=false
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=20
Production:
DATA_PROFILER_DEFAULT_PROFILE_LEVEL=comprehensive
DATA_PROFILER_OUTLIER_STD_THRESHOLD=2.5
DATA_PROFILER_CORRELATION_THRESHOLD=0.8
DATA_PROFILER_MISSING_THRESHOLD=0.3
DATA_PROFILER_ENABLE_VISUALIZATIONS=true
DATA_PROFILER_MAX_UNIQUE_VALUES_CATEGORICAL=100
Error Handling
Always wrap profiling operations in try-except blocks:
from aiecs.tools.statistics.data_profiler_tool import DataProfilerTool, DataProfilerError, ProfilingError
data_profiler = DataProfilerTool()
try:
profile = data_profiler.profile_data(
data=df,
profile_level='standard',
enable_visualizations=True
)
except ProfilingError as e:
print(f"Profiling error: {e}")
except DataProfilerError as e:
print(f"Data profiler error: {e}")
except Exception as e:
print(f"Unexpected error: {e}")
Dependencies
Core Dependencies
# Install core dependencies
pip install pydantic python-dotenv
# Install data processing dependencies
pip install pandas numpy scipy
# Install visualization dependencies
pip install matplotlib seaborn plotly
Optional Dependencies
# For advanced statistical analysis
pip install scikit-learn statsmodels
# For enhanced visualization
pip install bokeh altair
# For data quality assessment
pip install great-expectations
# For advanced profiling
pip install pandas-profiling ydata-profiling
Verification
# Test dependency availability
try:
import pandas
import numpy
import scipy
print("Core dependencies available")
except ImportError as e:
print(f"Missing dependency: {e}")
# Test visualization availability
try:
import matplotlib
import seaborn
print("Visualization available")
except ImportError:
print("Visualization not available")
# Test advanced analysis availability
try:
import sklearn
import statsmodels
print("Advanced analysis available")
except ImportError:
print("Advanced analysis not available")
# Test profiling libraries availability
try:
import ydata_profiling
print("Advanced profiling available")
except ImportError:
print("Advanced profiling not available")
Support
For issues or questions about Data Profiler Tool configuration:
Check the tool source code for implementation details
Review statistics tool documentation for statistical analysis
Consult the main aiecs documentation for architecture overview
Test with simple datasets first to isolate configuration vs. profiling issues
Verify data compatibility and format requirements
Check profile level and threshold settings
Ensure proper visualization dependencies
Validate data quality and structure requirements