Scraper Tool Configuration Guide
Overview
The Scraper Tool provides comprehensive web scraping capabilities with multiple HTTP clients, JavaScript rendering, HTML parsing, and security features. It supports httpx, urllib, Playwright for JavaScript rendering, BeautifulSoup for HTML parsing, and Scrapy integration for advanced crawling. The tool can be configured via environment variables using the SCRAPER_TOOL_ prefix or through programmatic configuration when initializing the tool.
Using .env Files in Your Project
When using aiecs as a dependency in your project, you can store configuration in a .env file for convenience. The Scraper Tool reads from environment variables that are already loaded into the process, so you need to load the .env file in your application before importing aiecs tools.
Setting Up .env Files
1. Install python-dotenv:
pip install python-dotenv
2. Create a .env file in your project root:
# .env file in your project root
SCRAPER_TOOL_USER_AGENT=MyScraperBot/1.0
SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760
SCRAPER_TOOL_OUTPUT_DIR=/path/to/outputs
SCRAPER_TOOL_SCRAPY_COMMAND=scrapy
SCRAPER_TOOL_ALLOWED_DOMAINS=["example.com","api.example.com"]
SCRAPER_TOOL_BLOCKED_DOMAINS=["blocked.com","malicious.com"]
SCRAPER_TOOL_USE_STEALTH=false
3. Load the .env file in your application:
# main.py or app.py - at the top of your entry point
from dotenv import load_dotenv
# Load environment variables from .env file
# This must be done BEFORE importing aiecs tools
load_dotenv()
# Now import and use aiecs tools
from aiecs.tools.scraper_tool import ScraperTool
# The tool will automatically use the environment variables
scraper_tool = ScraperTool()
Multiple Environment Files
You can use different .env files for different environments:
import os
from dotenv import load_dotenv
# Load environment-specific configuration
env = os.getenv('APP_ENV', 'development')
if env == 'production':
    load_dotenv('.env.production')
elif env == 'staging':
    load_dotenv('.env.staging')
else:
    load_dotenv('.env.development')
from aiecs.tools.scraper_tool import ScraperTool
scraper_tool = ScraperTool()
Example .env.production:
# Production settings - optimized for security and performance
SCRAPER_TOOL_USER_AGENT=ProductionScraper/2.0
SCRAPER_TOOL_MAX_CONTENT_LENGTH=52428800
SCRAPER_TOOL_OUTPUT_DIR=/app/scraper_outputs
SCRAPER_TOOL_ALLOWED_DOMAINS=["trusted-site.com","api.trusted-site.com"]
SCRAPER_TOOL_BLOCKED_DOMAINS=["malicious.com","spam.com"]
SCRAPER_TOOL_USE_STEALTH=true
Example .env.development:
# Development settings - more permissive for testing
SCRAPER_TOOL_USER_AGENT=DevScraper/1.0
SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760
SCRAPER_TOOL_OUTPUT_DIR=./scraper_outputs
SCRAPER_TOOL_ALLOWED_DOMAINS=[]
SCRAPER_TOOL_BLOCKED_DOMAINS=[]
SCRAPER_TOOL_USE_STEALTH=false
Best Practices for .env Files
Never commit .env files to version control - Add .env to your .gitignore:
# .gitignore
.env
.env.local
.env.*.local
.env.production
.env.staging
Provide a template - Create .env.example with documented dummy values:
# .env.example
# Scraper Tool Configuration
# User agent for HTTP requests
SCRAPER_TOOL_USER_AGENT=MyScraperBot/1.0
# Maximum content length in bytes (10MB)
SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760
# Directory for output files
SCRAPER_TOOL_OUTPUT_DIR=./scraper_outputs
# Command to run Scrapy
SCRAPER_TOOL_SCRAPY_COMMAND=scrapy
# Allowed domains for scraping (JSON array)
SCRAPER_TOOL_ALLOWED_DOMAINS=["example.com","api.example.com"]
# Blocked domains for scraping (JSON array)
SCRAPER_TOOL_BLOCKED_DOMAINS=["blocked.com","malicious.com"]
# Enable stealth mode for Playwright (requires playwright-stealth)
SCRAPER_TOOL_USE_STEALTH=false
Document your variables - Add comments explaining each setting
Use load_dotenv() early - Call it at the very top of your entry point, before any aiecs imports
Format complex types correctly:
Strings: Plain text: MyScraperBot/1.0, scrapy
Integers: Plain numbers: 10485760, 52428800
Lists: JSON array format: ["example.com","api.example.com"]
Configuration Options
1. User Agent
Environment Variable: SCRAPER_TOOL_USER_AGENT
Type: String
Default: "PythonMiddlewareScraper/2.0"
Description: User agent string sent with HTTP requests. This identifies your scraper to web servers and should be descriptive and respectful.
Best Practices:
Use a descriptive name: MyCompanyBot/1.0
Include contact information: MyBot/1.0 (contact@example.com)
Follow robots.txt guidelines
Be honest about your bot’s purpose
Example:
export SCRAPER_TOOL_USER_AGENT="MyResearchBot/1.0 (research@university.edu)"
Legal Note: Always respect robots.txt and website terms of service.
2. Max Content Length
Environment Variable: SCRAPER_TOOL_MAX_CONTENT_LENGTH
Type: Integer
Default: 10 * 1024 * 1024 (10MB)
Description: Maximum content length in bytes for HTTP responses. This prevents memory issues with extremely large files and ensures reasonable processing times.
Common Values:
5 * 1024 * 1024 - 5MB (small files)
10 * 1024 * 1024 - 10MB (default)
50 * 1024 * 1024 - 50MB (large files)
100 * 1024 * 1024 - 100MB (very large files)
Example:
export SCRAPER_TOOL_MAX_CONTENT_LENGTH=52428800
Memory Note: Larger values use more memory but allow processing of bigger files. Adjust based on available system resources.
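The byte values used in this guide are megabyte counts multiplied out. A minimal sketch of that arithmetic follows; `mb_to_bytes` is a hypothetical helper for illustration and is not part of aiecs.

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert a size in MB (1 MB = 1024 * 1024 bytes here) to bytes."""
    return megabytes * 1024 * 1024


# Print the values that appear in the examples of this guide
for size_mb in (5, 10, 50, 100):
    print(f"{size_mb}MB = {mb_to_bytes(size_mb)} bytes")
```

The environment variable takes the plain integer, so compute the value first and export the result rather than the expression.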
3. Output Directory
Environment Variable: SCRAPER_TOOL_OUTPUT_DIR
Type: String
Default: os.path.join(tempfile.gettempdir(), 'scraper_outputs')
Description: Directory where scraped content and output files are saved. The directory will be created automatically if it doesn’t exist.
Example:
export SCRAPER_TOOL_OUTPUT_DIR="/app/scraper_outputs"
Security Note: Ensure the directory has appropriate permissions and is not accessible via web servers.
4. Scrapy Command
Environment Variable: SCRAPER_TOOL_SCRAPY_COMMAND
Type: String
Default: "scrapy"
Description: Command to run Scrapy spiders. This can be customized for different Scrapy installations or virtual environments.
Common Values:
scrapy - Standard Scrapy command
python -m scrapy - Python module execution
/path/to/venv/bin/scrapy - Virtual environment Scrapy
docker exec container scrapy - Docker container execution
Example:
export SCRAPER_TOOL_SCRAPY_COMMAND="python -m scrapy"
Note: Ensure Scrapy is installed and accessible via the specified command.
5. Allowed Domains
Environment Variable: SCRAPER_TOOL_ALLOWED_DOMAINS
Type: List[str]
Default: [] (empty list - no restrictions)
Description: List of allowed domains for scraping. This is a security feature that restricts scraping to specific domains. Empty list means no restrictions.
Format: JSON array string with double quotes
Security Configurations:
Restrictive: ["trusted-site.com","api.trusted-site.com"]
Permissive: [] (no restrictions)
API only: ["api.example.com"]
Example:
# Allow only specific domains
export SCRAPER_TOOL_ALLOWED_DOMAINS='["example.com","api.example.com"]'
# No restrictions (development only)
export SCRAPER_TOOL_ALLOWED_DOMAINS='[]'
Security Note: Use restrictive domain lists in production to prevent unauthorized scraping.
6. Blocked Domains
Environment Variable: SCRAPER_TOOL_BLOCKED_DOMAINS
Type: List[str]
Default: [] (empty list - no blocks)
Description: List of blocked domains for scraping. This prevents scraping of known malicious or problematic domains.
Format: JSON array string with double quotes
Common Blocked Domains:
Malicious sites
Sites with aggressive anti-bot measures
Sites that violate terms of service
Sites with known security issues
Example:
# Block known problematic domains
export SCRAPER_TOOL_BLOCKED_DOMAINS='["malicious.com","spam.com","blocked-site.com"]'
Security Note: Regularly update blocked domains list based on security advisories.
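To make the interaction between the two lists concrete, here is a rough sketch of a domain check. It is not the tool's actual implementation: `is_url_permitted` is a hypothetical function, and it assumes exact hostname matching and that the blocked list is checked before the allowed list. Check the tool source for how subdomains and precedence are really handled.

```python
from urllib.parse import urlparse


def is_url_permitted(url, allowed_domains, blocked_domains):
    """Return True if the URL's host passes both lists (exact host match)."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False  # No recognizable host, so nothing to allow
    if host in blocked_domains:
        return False  # Assumption: blocked entries win over allowed ones
    # An empty allowed list means "no restrictions"
    if allowed_domains and host not in allowed_domains:
        return False
    return True


allowed = ["example.com", "api.example.com"]
blocked = ["malicious.com"]
for candidate in ("https://example.com/page", "https://other.com/", "https://malicious.com/"):
    print(candidate, is_url_permitted(candidate, allowed, blocked))
```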
7. Use Stealth Mode
Environment Variable: SCRAPER_TOOL_USE_STEALTH
Type: Boolean
Default: False
Description: Whether to use stealth mode with Playwright to avoid bot detection. When enabled, the tool applies various techniques to make the browser appear more like a regular user browser, helping to bypass anti-bot measures.
Stealth Features:
Removes webdriver property
Masks automation indicators
Randomizes browser fingerprints
Mimics human-like behavior
Bypasses common bot detection methods
Requirements:
# Install playwright-stealth
pip install playwright-stealth
# Or install with scraper extras
pip install aiecs[scraper]
Example:
# Enable stealth mode globally
export SCRAPER_TOOL_USE_STEALTH=true
# Or in .env file
SCRAPER_TOOL_USE_STEALTH=true
Use Cases:
Scraping sites with anti-bot protection
Accessing content that blocks automated browsers
Bypassing Cloudflare and similar protections
Testing website behavior with realistic browser profiles
Note: Stealth mode only works with Playwright rendering. It has no effect on regular HTTP requests. If playwright-stealth is not installed, the tool will log a warning and continue without stealth mode.
8. Playwright Available (Read-Only)
Environment Variable: Not configurable via environment
Type: Boolean
Default: False (auto-detected)
Description: Whether Playwright is available for JavaScript rendering. This is automatically detected during initialization and cannot be set via environment variables.
Auto-Detection: The tool automatically checks if Playwright is installed and sets this field accordingly.
Installation:
pip install playwright
playwright install
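One common way to auto-detect an optional dependency is to look the package up without importing it. The sketch below shows that approach with the standard library. `module_available` is a hypothetical helper, and the tool's own detection logic may differ.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


playwright_available = module_available("playwright")
print(f"Playwright available: {playwright_available}")
```

This only confirms that the Python package is installed. It does not verify that the browser binaries from `playwright install` are present.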
Usage Examples
Example 1: Basic Environment Configuration
# Set custom scraping parameters
export SCRAPER_TOOL_USER_AGENT="MyBot/1.0"
export SCRAPER_TOOL_MAX_CONTENT_LENGTH=52428800
export SCRAPER_TOOL_OUTPUT_DIR="/app/scraper_outputs"
# Run your application
python app.py
Example 2: Security-Focused Configuration
# Strict security settings
export SCRAPER_TOOL_USER_AGENT="SecureBot/1.0 (contact@company.com)"
export SCRAPER_TOOL_ALLOWED_DOMAINS='["trusted-site.com","api.trusted-site.com"]'
export SCRAPER_TOOL_BLOCKED_DOMAINS='["malicious.com","spam.com"]'
export SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760
Example 3: Development Configuration
# Development-friendly settings
export SCRAPER_TOOL_USER_AGENT="DevBot/1.0"
export SCRAPER_TOOL_OUTPUT_DIR="./scraper_outputs"
export SCRAPER_TOOL_ALLOWED_DOMAINS='[]'
export SCRAPER_TOOL_BLOCKED_DOMAINS='[]'
Example 4: Programmatic Configuration
from aiecs.tools.scraper_tool import ScraperTool
# Initialize with custom configuration
scraper_tool = ScraperTool(config={
    'timeout': 30,
    'max_retries': 3,
    'impersonate': 'chrome120',
    'proxy': None,
    'requests_per_minute': 30,
    'enable_cache': True,
    'enable_js_render': False,
    'use_stealth': True  # Enable stealth mode
})
Example 5: Stealth Mode Configuration
Using stealth mode to bypass bot detection:
import asyncio
from aiecs.tools.scraper_tool import ScraperTool

async def main():
    # Method 1: Enable stealth mode via configuration
    scraper_with_stealth = ScraperTool(config={
        'use_stealth': True,
        'enable_js_render': True  # Required for rendering
    })
    # Fetch a page with stealth mode enabled
    result = await scraper_with_stealth.fetch(url="https://example.com")
    # Method 2: Override stealth mode per request
    scraper_default = ScraperTool()
    # Enable stealth for this specific request
    result = await scraper_default.render(
        url="https://example.com",
        wait_time=5,
        use_stealth=True  # Override config setting
    )
    # Disable stealth for this specific request
    result = await scraper_default.render(
        url="https://example.com",
        wait_time=5,
        use_stealth=False  # Override config setting
    )

# await is only valid inside a coroutine, so run the example through asyncio
asyncio.run(main())
Environment Variable:
# Enable stealth mode globally
export SCRAPER_TOOL_USE_STEALTH=true
Example 6: Mixed Configuration
Environment variables are used as defaults, but can be overridden programmatically:
# Set environment defaults
export SCRAPER_TOOL_USER_AGENT="DefaultBot/1.0"
export SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760
export SCRAPER_TOOL_USE_STEALTH=true
# Override for a specific instance (in Python)
from aiecs.tools.scraper_tool import ScraperTool
scraper_tool = ScraperTool(config={
    'user_agent': 'CustomBot/2.0',  # This overrides the environment variable
    'max_content_length': 52428800,  # This overrides the environment variable
    'use_stealth': False  # This overrides the environment variable
})
Configuration Priority
When the Scraper Tool is initialized, configuration values are resolved in the following order (highest to lowest priority):
Programmatic config - Values passed to the constructor
Environment variables - Values set via SCRAPER_TOOL_* variables
Default values - Built-in defaults as specified above
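The resolution order can be pictured as three layers merged from lowest to highest priority. The following is an illustrative sketch only: `resolve_config` and `DEFAULTS` are hypothetical names covering two of the settings, and the real tool relies on Pydantic rather than code like this.

```python
import os

# Two of the documented defaults, used here purely for illustration
DEFAULTS = {
    "user_agent": "PythonMiddlewareScraper/2.0",
    "max_content_length": 10 * 1024 * 1024,
}


def resolve_config(overrides=None, environ=None, prefix="SCRAPER_TOOL_"):
    """Merge defaults, then environment variables, then programmatic overrides."""
    environ = os.environ if environ is None else environ
    resolved = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        raw = environ.get(prefix + key.upper())
        if raw is not None:
            # Coerce to the type of the default (str or int in this sketch)
            resolved[key] = type(default)(raw)
    # Programmatic values are applied last, so they win
    resolved.update(overrides or {})
    return resolved


env = {
    "SCRAPER_TOOL_USER_AGENT": "DefaultBot/1.0",
    "SCRAPER_TOOL_MAX_CONTENT_LENGTH": "52428800",
}
print(resolve_config({"user_agent": "CustomBot/2.0"}, environ=env))
```

In the usage line, the user agent comes from the programmatic override and the content length from the environment, because no override was given for it.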
Data Type Parsing
String Values
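Strings are provided as plain text. Shell quotes are only needed when the value contains spaces or special characters, for example "MyBot/1.0 (contact@example.com)". Simple values need none: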
export SCRAPER_TOOL_USER_AGENT=MyBot/1.0
export SCRAPER_TOOL_SCRAPY_COMMAND=scrapy
Integer Values
Integers should be provided as numeric strings:
export SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760
List Values
Lists must be provided as JSON arrays with double quotes:
# Correct
export SCRAPER_TOOL_ALLOWED_DOMAINS='["example.com","api.example.com"]'
# Incorrect (will not parse)
export SCRAPER_TOOL_ALLOWED_DOMAINS="example.com,api.example.com"
Important: Use single quotes for the shell, double quotes for JSON:
export SCRAPER_TOOL_ALLOWED_DOMAINS='["example.com","api.example.com"]'
#                                   ^                                 ^
#                                   Single quotes for the shell
#                                     ^           ^ ^               ^
#                                     Double quotes for JSON
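The parsing rules above can be reproduced with the standard library, which is a quick way to test a value before exporting it. This is a sketch under assumptions: `parse_env_value` is a hypothetical helper, and the boolean spellings it accepts (true/false/1/0) are a guess at a reasonable subset, not a statement of what the tool accepts.

```python
import json


def parse_env_value(raw: str, kind: str):
    """Parse a raw environment string as 'str', 'int', 'bool', or 'list'."""
    if kind == "str":
        return raw
    if kind == "int":
        return int(raw)
    if kind == "bool":
        lowered = raw.strip().lower()
        if lowered in ("true", "1"):
            return True
        if lowered in ("false", "0"):
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    if kind == "list":
        value = json.loads(raw)  # Raises json.JSONDecodeError on invalid JSON
        if not isinstance(value, list):
            raise ValueError(f"not a JSON array: {raw!r}")
        return value
    raise ValueError(f"unknown kind: {kind!r}")


print(parse_env_value('["example.com","api.example.com"]', "list"))
try:
    parse_env_value("example.com,api.example.com", "list")
except ValueError as exc:
    # json.JSONDecodeError is a subclass of ValueError
    print(f"Rejected comma-separated value: {exc}")
```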
Validation
Automatic Type Validation
Pydantic automatically validates configuration values:
user_agent must be a non-empty string
max_content_length must be a positive integer
output_dir must be a non-empty string
scrapy_command must be a non-empty string
allowed_domains must be a list of strings
blocked_domains must be a list of strings
playwright_available must be a boolean
Runtime Validation
When scraping, the tool validates:
Domain restrictions - URLs must be in allowed domains (if specified)
Domain blocks - URLs must not be in blocked domains
Content length - Response content must not exceed max_content_length
Output directory - Output directory must be writable
External tools - Scrapy and Playwright availability is checked
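As an illustration of the content length check, the sketch below compares both the declared size and the actual body size against the limit. `exceeds_limit` is a hypothetical helper. Whether the tool inspects the Content-Length header, the body, or both is an assumption here.

```python
def exceeds_limit(body: bytes, content_length_header, max_content_length: int) -> bool:
    """Return True if the declared or actual response size is over the limit."""
    if content_length_header is not None:
        try:
            if int(content_length_header) > max_content_length:
                return True
        except ValueError:
            pass  # Ignore a malformed header and fall back to the real size
    return len(body) > max_content_length


limit = 10 * 1024 * 1024
print(exceeds_limit(b"small page", "10", limit))
print(exceeds_limit(b"", str(50 * 1024 * 1024), limit))
```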
Operations Supported
The Scraper Tool supports comprehensive web scraping operations:
HTTP Clients
Httpx Client
get_httpx - Modern async HTTP client with full feature support
Supports all HTTP methods (GET, POST, PUT, DELETE, etc.)
Built-in SSL verification and redirect handling
Cookie and authentication support
Urllib Client
get_urllib - Standard library HTTP client
Lightweight alternative to httpx
Good for simple requests without advanced features
Legacy Methods
get_requests - Legacy method (now uses httpx in sync mode)
get_aiohttp - Legacy method (now uses httpx in async mode)
JavaScript Rendering
Playwright Rendering
render - Render JavaScript-heavy pages
Supports waiting for specific elements
Screenshot capture capabilities
Scroll and interaction support
HTML Parsing
BeautifulSoup Parsing
parse_html - Parse HTML content with CSS selectors
XPath support via lxml
Attribute and text extraction
Flexible selector types
Scrapy Integration
Spider Execution
crawl_scrapy - Execute Scrapy spiders
Custom spider arguments support
Output file generation
Execution monitoring
Output Formats
Multiple Formats
Text - Plain text output
JSON - Structured JSON data
HTML - Raw HTML content
Markdown - Formatted markdown
CSV - Tabular data export
Troubleshooting
Issue: SSL certificate errors
Error: SSL: CERTIFICATE_VERIFY_FAILED
Solutions:
Update certificates: pip install --upgrade certifi
Disable SSL verification (not recommended): Set verify_ssl=False
Use custom CA bundle: Set verify_ssl="/path/to/ca-bundle.pem"
Issue: Playwright not available
Error: Playwright is not available
Solutions:
# Install Playwright
pip install playwright
# Install browser binaries
playwright install
# Verify installation
python -c "import playwright; print('Playwright installed')"
Issue: Scrapy command not found
Error: Scrapy crawl failed: command not found
Solutions:
# Install Scrapy
pip install scrapy
# Check command
export SCRAPER_TOOL_SCRAPY_COMMAND="python -m scrapy"
# Or use full path
export SCRAPER_TOOL_SCRAPY_COMMAND="/path/to/venv/bin/scrapy"
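Before changing the setting, it can help to confirm that the configured command resolves to an executable at all. A minimal sketch, where `command_is_runnable` is a hypothetical helper and not part of aiecs:

```python
import shlex
import shutil
import sys


def command_is_runnable(command: str) -> bool:
    """Return True if the first token of `command` resolves to an executable."""
    tokens = shlex.split(command)
    if not tokens:
        return False
    return shutil.which(tokens[0]) is not None


for candidate in ("scrapy", f"{shlex.quote(sys.executable)} -m scrapy"):
    print(candidate, command_is_runnable(candidate))
```

This only checks the executable in front. For a command such as `python -m scrapy` it does not confirm that the Scrapy package itself is installed in that interpreter.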
Issue: Content too large
Error: Response content too large
Solutions:
# Increase content length limit
export SCRAPER_TOOL_MAX_CONTENT_LENGTH=52428800
# Or process content in chunks
# Use streaming requests for large files
Issue: Domain not allowed
Error: Domain not in allowed list
Solutions:
# Add domain to allowed list
export SCRAPER_TOOL_ALLOWED_DOMAINS='["example.com","new-domain.com"]'
# Or remove restrictions (development only)
export SCRAPER_TOOL_ALLOWED_DOMAINS='[]'
Issue: Rate limiting
Error: Rate limit exceeded or 429 Too Many Requests
Solutions:
Implement delays between requests
Use rotating user agents
Respect robots.txt
Use proxy rotation
Implement exponential backoff
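Exponential backoff from the list above might look like the following synchronous sketch. The helper names and default values are illustrative assumptions. In real code, catch only the rate limit error rather than every exception, and use an async sleep when calling the tool's async methods.

```python
import random
import time


def backoff_delays(max_retries: int, base: float = 1.0, factor: float = 2.0, cap: float = 60.0):
    """Return the maximum wait in seconds before each retry, capped at `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


def call_with_backoff(func, max_retries: int = 3, base: float = 1.0, sleep=time.sleep):
    """Call func(); on failure wait with jittered exponential backoff, then retry."""
    delays = backoff_delays(max_retries, base=base)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:  # Sketch only: narrow this to the rate limit error
            if attempt == max_retries:
                raise
            # Full jitter: wait a random time up to the computed delay
            sleep(random.uniform(0, delays[attempt]))


print(backoff_delays(4))
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without actually waiting.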
Issue: Timeout errors
Error: Request timeout or Connection timeout
Solutions:
Increase timeout values
Check network connectivity
Use retry mechanisms
Implement circuit breakers
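A circuit breaker stops sending requests to a host after repeated failures and tries again after a cooldown. The class below is a simplified sketch of that idea. `CircuitBreaker` and its parameters are hypothetical and are not provided by the Scraper Tool.

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow requests again after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has passed, requests are allowed again;
        # a further failure re-opens the circuit immediately
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


breaker = CircuitBreaker(threshold=2, cooldown=10.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.allow_request())
```

Call `allow_request()` before each scrape, then `record_success()` or `record_failure()` depending on the outcome.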
Issue: List parsing error
Error: Configuration parsing fails for domain lists
Solution:
# Use proper JSON array syntax
export SCRAPER_TOOL_ALLOWED_DOMAINS='["example.com","api.example.com"]'
# NOT: [example.com,api.example.com] or example.com,api.example.com
Issue: Output directory not writable
Error: Permission denied when saving files
Solutions:
# Set writable output directory
export SCRAPER_TOOL_OUTPUT_DIR="/writable/path"
# Or create directory with proper permissions
mkdir -p /path/to/outputs
chmod 755 /path/to/outputs
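To check a candidate directory from Python before pointing the tool at it, something like the sketch below can be used. `ensure_writable_dir` is a hypothetical helper. The default path shown mirrors the documented default for SCRAPER_TOOL_OUTPUT_DIR.

```python
import os
import tempfile


def ensure_writable_dir(path: str) -> bool:
    """Create `path` if needed and report whether a file can be written there."""
    try:
        os.makedirs(path, exist_ok=True)
        # Actually creating a temporary file is more reliable than os.access
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False


default_dir = os.path.join(tempfile.gettempdir(), "scraper_outputs")
print(default_dir, ensure_writable_dir(default_dir))
```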
Issue: Stealth mode not working
Error: playwright-stealth is not installed warning in logs
Solutions:
# Install playwright-stealth
pip install playwright-stealth
# Or install with scraper extras
pip install aiecs[scraper]
# Verify installation
python -c "from playwright_stealth import stealth_async; print('OK')"
Issue: Bot detection still occurring with stealth mode
Symptoms: Website still detects automation despite stealth mode enabled
Solutions:
Verify stealth mode is enabled:
# Check logs for "Stealth mode enabled for Playwright" message
scraper = ScraperTool(config={'use_stealth': True})
result = await scraper.render(url, use_stealth=True)
Add additional delays:
# Wait longer for page to load
result = await scraper.render(
    url=url,
    wait_time=10,  # Increase wait time
    use_stealth=True
)
Use realistic user agent:
export SCRAPER_TOOL_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
Implement rate limiting:
Add delays between requests
Randomize request timing
Respect robots.txt
Note: Some advanced bot detection systems may still detect automation. Stealth mode improves success rate but is not foolproof.
Best Practices
Web Scraping Ethics
Respect robots.txt - Always check and follow robots.txt files
Rate limiting - Implement delays between requests
User agent identification - Use descriptive, honest user agents
Terms of service - Read and follow website terms of service
Legal compliance - Ensure compliance with local laws and regulations
Security
Domain filtering - Use allowed/blocked domain lists
Content validation - Validate scraped content for malicious code
SSL verification - Always verify SSL certificates in production
Input sanitization - Sanitize URLs and parameters
Output security - Secure output directories and files
Performance
Connection pooling - Reuse HTTP connections when possible
Async operations - Use async methods for better concurrency
Memory management - Monitor memory usage with large content
Caching - Implement caching for frequently accessed content
Resource limits - Set appropriate content length limits
Error Handling
Retry mechanisms - Implement exponential backoff for failed requests
Circuit breakers - Stop requests to failing services
Graceful degradation - Handle partial failures gracefully
Logging - Log errors and performance metrics
Monitoring - Monitor scraping success rates and performance
Development vs Production
Development:
SCRAPER_TOOL_USER_AGENT=DevBot/1.0
SCRAPER_TOOL_OUTPUT_DIR=./scraper_outputs
SCRAPER_TOOL_ALLOWED_DOMAINS='[]'
SCRAPER_TOOL_BLOCKED_DOMAINS='[]'
SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760
Production:
SCRAPER_TOOL_USER_AGENT="ProductionBot/2.0 (contact@company.com)"
SCRAPER_TOOL_OUTPUT_DIR=/app/scraper_outputs
SCRAPER_TOOL_ALLOWED_DOMAINS='["trusted-site.com","api.trusted-site.com"]'
SCRAPER_TOOL_BLOCKED_DOMAINS='["malicious.com","spam.com"]'
SCRAPER_TOOL_MAX_CONTENT_LENGTH=52428800
Error Handling
Always wrap scraping operations in try-except blocks:
import asyncio
from aiecs.tools.scraper_tool import ScraperTool, HttpError, RateLimitError

scraper_tool = ScraperTool()

async def main(url):
    try:
        result = await scraper_tool.get_httpx(url)
    except HttpError as e:
        print(f"HTTP error: {e}")
    except TimeoutError as e:
        print(f"Timeout error: {e}")
    except RateLimitError as e:
        print(f"Rate limit error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

# await is only valid inside a coroutine, so run the example through asyncio
asyncio.run(main("https://example.com"))
Installation Requirements
Core Dependencies
# Install core scraping dependencies
pip install httpx beautifulsoup4 lxml
# Install optional dependencies
pip install playwright scrapy
# Install stealth mode support
pip install playwright-stealth
# Or install all scraper extras at once
pip install aiecs[scraper]
Playwright Setup
# Install Playwright
pip install playwright
# Install browser binaries
playwright install
# Install specific browsers
playwright install chromium
playwright install firefox
playwright install webkit
Stealth Mode Setup
# Install playwright-stealth for anti-bot detection
pip install playwright-stealth
# Verify installation
python -c "from playwright_stealth import stealth_async; print('Stealth mode available')"
Scrapy Setup
# Install Scrapy
pip install scrapy
# Create a Scrapy project
scrapy startproject myproject
# Create a spider
cd myproject
scrapy genspider myspider example.com
Verification
# Test Playwright installation
try:
    import playwright
    print("Playwright installed successfully")
except ImportError:
    print("Playwright not installed")
# Test Scrapy installation
try:
    import scrapy
    print("Scrapy installed successfully")
except ImportError:
    print("Scrapy not installed")
Support
For issues or questions about Scraper Tool configuration:
Check the tool source code for implementation details
Review HTTP client documentation for specific features
Consult the main aiecs documentation for architecture overview
Test with simple URLs first to isolate configuration vs. scraping issues
Monitor network traffic and response times
Validate SSL certificates and domain restrictions
Check robots.txt and terms of service compliance