# Scraper Tool Configuration Guide ## Overview The Scraper Tool provides comprehensive web scraping capabilities with multiple HTTP clients, JavaScript rendering, HTML parsing, and security features. It supports httpx, urllib, Playwright for JavaScript rendering, BeautifulSoup for HTML parsing, and Scrapy integration for advanced crawling. The tool can be configured via environment variables using the `SCRAPER_TOOL_` prefix or through programmatic configuration when initializing the tool. ## Using .env Files in Your Project When using aiecs as a dependency in your project, you can store configuration in a `.env` file for convenience. The Scraper Tool reads from environment variables that are already loaded into the process, so you need to load the `.env` file in your application before importing aiecs tools. ### Setting Up .env Files **1. Install python-dotenv:** ```bash pip install python-dotenv ``` **2. Create a `.env` file in your project root:** ```bash # .env file in your project root SCRAPER_TOOL_USER_AGENT=MyScraperBot/1.0 SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760 SCRAPER_TOOL_OUTPUT_DIR=/path/to/outputs SCRAPER_TOOL_SCRAPY_COMMAND=scrapy SCRAPER_TOOL_ALLOWED_DOMAINS=["example.com","api.example.com"] SCRAPER_TOOL_BLOCKED_DOMAINS=["blocked.com","malicious.com"] SCRAPER_TOOL_USE_STEALTH=false ``` **3. Load the .env file in your application:** ```python # main.py or app.py - at the top of your entry point from dotenv import load_dotenv # Load environment variables from .env file # This must be done BEFORE importing aiecs tools load_dotenv() # Now import and use aiecs tools from aiecs.tools.scraper_tool import ScraperTool # The tool will automatically use the environment variables scraper_tool = ScraperTool() ``` ### Multiple Environment Files You can use different `.env` files for different environments: ```python import os from dotenv import load_dotenv # Load environment-specific configuration env = os.getenv('APP_ENV', 'development') if env == 'production': load_dotenv('.env.production') elif env == 'staging': load_dotenv('.env.staging') else: load_dotenv('.env.development') from aiecs.tools.scraper_tool import ScraperTool scraper_tool = ScraperTool() ``` **Example `.env.production`:** ```bash # Production settings - optimized for security and performance SCRAPER_TOOL_USER_AGENT=ProductionScraper/2.0 SCRAPER_TOOL_MAX_CONTENT_LENGTH=52428800 SCRAPER_TOOL_OUTPUT_DIR=/app/scraper_outputs SCRAPER_TOOL_ALLOWED_DOMAINS=["trusted-site.com","api.trusted-site.com"] SCRAPER_TOOL_BLOCKED_DOMAINS=["malicious.com","spam.com"] SCRAPER_TOOL_USE_STEALTH=true ``` **Example `.env.development`:** ```bash # Development settings - more permissive for testing SCRAPER_TOOL_USER_AGENT=DevScraper/1.0 SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760 SCRAPER_TOOL_OUTPUT_DIR=./scraper_outputs SCRAPER_TOOL_ALLOWED_DOMAINS=[] SCRAPER_TOOL_BLOCKED_DOMAINS=[] SCRAPER_TOOL_USE_STEALTH=false ``` ### Best Practices for .env Files 1. **Never commit .env files to version control** - Add `.env` to your `.gitignore`: ```gitignore # .gitignore .env .env.local .env.*.local .env.production .env.staging ``` 2. **Provide a template** - Create `.env.example` with documented dummy values: ```bash # .env.example # Scraper Tool Configuration # User agent for HTTP requests SCRAPER_TOOL_USER_AGENT=MyScraperBot/1.0 # Maximum content length in bytes (10MB) SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760 # Directory for output files SCRAPER_TOOL_OUTPUT_DIR=./scraper_outputs # Command to run Scrapy SCRAPER_TOOL_SCRAPY_COMMAND=scrapy # Allowed domains for scraping (JSON array) SCRAPER_TOOL_ALLOWED_DOMAINS=["example.com","api.example.com"] # Blocked domains for scraping (JSON array) SCRAPER_TOOL_BLOCKED_DOMAINS=["blocked.com","malicious.com"] # Enable stealth mode for Playwright (requires playwright-stealth) SCRAPER_TOOL_USE_STEALTH=false ``` 3. **Document your variables** - Add comments explaining each setting 4. **Use load_dotenv() early** - Call it at the very top of your entry point, before any aiecs imports 5. **Format complex types correctly**: - Strings: Plain text: `MyScraperBot/1.0`, `scrapy` - Integers: Plain numbers: `10485760`, `52428800` - Lists: JSON array format: `["example.com","api.example.com"]` ## Configuration Options ### 1. User Agent **Environment Variable:** `SCRAPER_TOOL_USER_AGENT` **Type:** String **Default:** `"PythonMiddlewareScraper/2.0"` **Description:** User agent string sent with HTTP requests. This identifies your scraper to web servers and should be descriptive and respectful. **Best Practices:** - Use a descriptive name: `MyCompanyBot/1.0` - Include contact information: `MyBot/1.0 (contact@example.com)` - Follow robots.txt guidelines - Be honest about your bot's purpose **Example:** ```bash export SCRAPER_TOOL_USER_AGENT="MyResearchBot/1.0 (research@university.edu)" ``` **Legal Note:** Always respect robots.txt and website terms of service. ### 2. Max Content Length **Environment Variable:** `SCRAPER_TOOL_MAX_CONTENT_LENGTH` **Type:** Integer **Default:** `10 * 1024 * 1024` (10MB) **Description:** Maximum content length in bytes for HTTP responses. This prevents memory issues with extremely large files and ensures reasonable processing times. **Common Values:** - `5 * 1024 * 1024` - 5MB (small files) - `10 * 1024 * 1024` - 10MB (default) - `50 * 1024 * 1024` - 50MB (large files) - `100 * 1024 * 1024` - 100MB (very large files) **Example:** ```bash export SCRAPER_TOOL_MAX_CONTENT_LENGTH=52428800 ``` **Memory Note:** Larger values use more memory but allow processing of bigger files. Adjust based on available system resources. ### 3. Output Directory **Environment Variable:** `SCRAPER_TOOL_OUTPUT_DIR` **Type:** String **Default:** `os.path.join(tempfile.gettempdir(), 'scraper_outputs')` **Description:** Directory where scraped content and output files are saved. The directory will be created automatically if it doesn't exist. **Example:** ```bash export SCRAPER_TOOL_OUTPUT_DIR="/app/scraper_outputs" ``` **Security Note:** Ensure the directory has appropriate permissions and is not accessible via web servers. ### 4. Scrapy Command **Environment Variable:** `SCRAPER_TOOL_SCRAPY_COMMAND` **Type:** String **Default:** `"scrapy"` **Description:** Command to run Scrapy spiders. This can be customized for different Scrapy installations or virtual environments. **Common Values:** - `scrapy` - Standard Scrapy command - `python -m scrapy` - Python module execution - `/path/to/venv/bin/scrapy` - Virtual environment Scrapy - `docker exec container scrapy` - Docker container execution **Example:** ```bash export SCRAPER_TOOL_SCRAPY_COMMAND="python -m scrapy" ``` **Note:** Ensure Scrapy is installed and accessible via the specified command. ### 5. Allowed Domains **Environment Variable:** `SCRAPER_TOOL_ALLOWED_DOMAINS` **Type:** List[str] **Default:** `[]` (empty list - no restrictions) **Description:** List of allowed domains for scraping. This is a security feature that restricts scraping to specific domains. Empty list means no restrictions. **Format:** JSON array string with double quotes **Security Configurations:** - **Restrictive:** `["trusted-site.com","api.trusted-site.com"]` - **Permissive:** `[]` (no restrictions) - **API only:** `["api.example.com"]` **Example:** ```bash # Allow only specific domains export SCRAPER_TOOL_ALLOWED_DOMAINS='["example.com","api.example.com"]' # No restrictions (development only) export SCRAPER_TOOL_ALLOWED_DOMAINS='[]' ``` **Security Note:** Use restrictive domain lists in production to prevent unauthorized scraping. ### 6. Blocked Domains **Environment Variable:** `SCRAPER_TOOL_BLOCKED_DOMAINS` **Type:** List[str] **Default:** `[]` (empty list - no blocks) **Description:** List of blocked domains for scraping. This prevents scraping of known malicious or problematic domains. **Format:** JSON array string with double quotes **Common Blocked Domains:** - Malicious sites - Sites with aggressive anti-bot measures - Sites that violate terms of service - Sites with known security issues **Example:** ```bash # Block known problematic domains export SCRAPER_TOOL_BLOCKED_DOMAINS='["malicious.com","spam.com","blocked-site.com"]' ``` **Security Note:** Regularly update blocked domains list based on security advisories. ### 7. Use Stealth Mode **Environment Variable:** `SCRAPER_TOOL_USE_STEALTH` **Type:** Boolean **Default:** `False` **Description:** Whether to use stealth mode with Playwright to avoid bot detection. When enabled, the tool applies various techniques to make the browser appear more like a regular user browser, helping to bypass anti-bot measures. **Stealth Features:** - Removes webdriver property - Masks automation indicators - Randomizes browser fingerprints - Mimics human-like behavior - Bypasses common bot detection methods **Requirements:** ```bash # Install playwright-stealth pip install playwright-stealth # Or install with scraper extras pip install aiecs[scraper] ``` **Example:** ```bash # Enable stealth mode globally export SCRAPER_TOOL_USE_STEALTH=true # Or in .env file SCRAPER_TOOL_USE_STEALTH=true ``` **Use Cases:** - Scraping sites with anti-bot protection - Accessing content that blocks automated browsers - Bypassing Cloudflare and similar protections - Testing website behavior with realistic browser profiles **Note:** Stealth mode only works with Playwright rendering. It has no effect on regular HTTP requests. If `playwright-stealth` is not installed, the tool will log a warning and continue without stealth mode. ### 8. Playwright Available (Read-Only) **Environment Variable:** Not configurable via environment **Type:** Boolean **Default:** `False` (auto-detected) **Description:** Whether Playwright is available for JavaScript rendering. This is automatically detected during initialization and cannot be set via environment variables. **Auto-Detection:** The tool automatically checks if Playwright is installed and sets this field accordingly. **Installation:** ```bash pip install playwright playwright install ``` ## Usage Examples ### Example 1: Basic Environment Configuration ```bash # Set custom scraping parameters export SCRAPER_TOOL_USER_AGENT="MyBot/1.0" export SCRAPER_TOOL_MAX_CONTENT_LENGTH=52428800 export SCRAPER_TOOL_OUTPUT_DIR="/app/scraper_outputs" # Run your application python app.py ``` ### Example 2: Security-Focused Configuration ```bash # Strict security settings export SCRAPER_TOOL_USER_AGENT="SecureBot/1.0 (contact@company.com)" export SCRAPER_TOOL_ALLOWED_DOMAINS='["trusted-site.com","api.trusted-site.com"]' export SCRAPER_TOOL_BLOCKED_DOMAINS='["malicious.com","spam.com"]' export SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760 ``` ### Example 3: Development Configuration ```bash # Development-friendly settings export SCRAPER_TOOL_USER_AGENT="DevBot/1.0" export SCRAPER_TOOL_OUTPUT_DIR="./scraper_outputs" export SCRAPER_TOOL_ALLOWED_DOMAINS='[]' export SCRAPER_TOOL_BLOCKED_DOMAINS='[]' ``` ### Example 4: Programmatic Configuration ```python from aiecs.tools.scraper_tool import ScraperTool # Initialize with custom configuration scraper_tool = ScraperTool(config={ 'timeout': 30, 'max_retries': 3, 'impersonate': 'chrome120', 'proxy': None, 'requests_per_minute': 30, 'enable_cache': True, 'enable_js_render': False, 'use_stealth': True # Enable stealth mode }) ``` ### Example 5: Stealth Mode Configuration Using stealth mode to bypass bot detection: ```python from aiecs.tools.scraper_tool import ScraperTool # Method 1: Enable stealth mode via configuration scraper_with_stealth = ScraperTool(config={ 'use_stealth': True, 'enable_js_render': True # Required for rendering }) # Fetch a page with stealth mode enabled result = await scraper_with_stealth.fetch(url="https://example.com") # Method 2: Override stealth mode per request scraper_default = ScraperTool() # Enable stealth for this specific request result = await scraper_default.render( url="https://example.com", wait_time=5, use_stealth=True # Override config setting ) # Disable stealth for this specific request result = await scraper_default.render( url="https://example.com", wait_time=5, use_stealth=False # Override config setting ) ``` **Environment Variable:** ```bash # Enable stealth mode globally export SCRAPER_TOOL_USE_STEALTH=true ``` ### Example 6: Mixed Configuration Environment variables are used as defaults, but can be overridden programmatically: ```bash # Set environment defaults export SCRAPER_TOOL_USER_AGENT="DefaultBot/1.0" export SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760 export SCRAPER_TOOL_USE_STEALTH=true ``` ```python # Override for specific instance scraper_tool = ScraperTool(config={ 'user_agent': 'CustomBot/2.0', # This overrides the environment variable 'max_content_length': 52428800, # This overrides the environment variable 'use_stealth': False # This overrides the environment variable }) ``` ## Configuration Priority When the Scraper Tool is initialized, configuration values are resolved in the following order (highest to lowest priority): 1. **Programmatic config** - Values passed to the constructor 2. **Environment variables** - Values set via `SCRAPER_TOOL_*` variables 3. **Default values** - Built-in defaults as specified above ## Data Type Parsing ### String Values Strings should be provided as plain text without quotes: ```bash export SCRAPER_TOOL_USER_AGENT=MyBot/1.0 export SCRAPER_TOOL_SCRAPY_COMMAND=scrapy ``` ### Integer Values Integers should be provided as numeric strings: ```bash export SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760 ``` ### List Values Lists must be provided as JSON arrays with double quotes: ```bash # Correct export SCRAPER_TOOL_ALLOWED_DOMAINS='["example.com","api.example.com"]' # Incorrect (will not parse) export SCRAPER_TOOL_ALLOWED_DOMAINS="example.com,api.example.com" ``` **Important:** Use single quotes for the shell, double quotes for JSON: ```bash export SCRAPER_TOOL_ALLOWED_DOMAINS='["example.com","api.example.com"]' # ^ ^ # Single quotes for shell # ^ ^ # Double quotes for JSON ``` ## Validation ### Automatic Type Validation Pydantic automatically validates configuration values: - `user_agent` must be a non-empty string - `max_content_length` must be a positive integer - `output_dir` must be a non-empty string - `scrapy_command` must be a non-empty string - `allowed_domains` must be a list of strings - `blocked_domains` must be a list of strings - `playwright_available` must be a boolean ### Runtime Validation When scraping, the tool validates: 1. **Domain restrictions** - URLs must be in allowed domains (if specified) 2. **Domain blocks** - URLs must not be in blocked domains 3. **Content length** - Response content must not exceed max_content_length 4. **Output directory** - Output directory must be writable 5. **External tools** - Scrapy and Playwright availability is checked ## Operations Supported The Scraper Tool supports comprehensive web scraping operations: ### HTTP Clients #### Httpx Client - `get_httpx` - Modern async HTTP client with full feature support - Supports all HTTP methods (GET, POST, PUT, DELETE, etc.) - Built-in SSL verification and redirect handling - Cookie and authentication support #### Urllib Client - `get_urllib` - Standard library HTTP client - Lightweight alternative to httpx - Good for simple requests without advanced features #### Legacy Methods - `get_requests` - Legacy method (now uses httpx in sync mode) - `get_aiohttp` - Legacy method (now uses httpx in async mode) ### JavaScript Rendering #### Playwright Rendering - `render` - Render JavaScript-heavy pages - Supports waiting for specific elements - Screenshot capture capabilities - Scroll and interaction support ### HTML Parsing #### BeautifulSoup Parsing - `parse_html` - Parse HTML content with CSS selectors - XPath support via lxml - Attribute and text extraction - Flexible selector types ### Scrapy Integration #### Spider Execution - `crawl_scrapy` - Execute Scrapy spiders - Custom spider arguments support - Output file generation - Execution monitoring ### Output Formats #### Multiple Formats - **Text** - Plain text output - **JSON** - Structured JSON data - **HTML** - Raw HTML content - **Markdown** - Formatted markdown - **CSV** - Tabular data export ## Troubleshooting ### Issue: SSL certificate errors **Error:** `SSL: CERTIFICATE_VERIFY_FAILED` **Solutions:** 1. Update certificates: `pip install --upgrade certifi` 2. Disable SSL verification (not recommended): Set `verify_ssl=False` 3. Use custom CA bundle: Set `verify_ssl="/path/to/ca-bundle.pem"` ### Issue: Playwright not available **Error:** `Playwright is not available` **Solutions:** ```bash # Install Playwright pip install playwright # Install browser binaries playwright install # Verify installation python -c "import playwright; print('Playwright installed')" ``` ### Issue: Scrapy command not found **Error:** `Scrapy crawl failed: command not found` **Solutions:** ```bash # Install Scrapy pip install scrapy # Check command export SCRAPER_TOOL_SCRAPY_COMMAND="python -m scrapy" # Or use full path export SCRAPER_TOOL_SCRAPY_COMMAND="/path/to/venv/bin/scrapy" ``` ### Issue: Content too large **Error:** `Response content too large` **Solutions:** ```bash # Increase content length limit export SCRAPER_TOOL_MAX_CONTENT_LENGTH=52428800 # Or process content in chunks # Use streaming requests for large files ``` ### Issue: Domain not allowed **Error:** `Domain not in allowed list` **Solutions:** ```bash # Add domain to allowed list export SCRAPER_TOOL_ALLOWED_DOMAINS='["example.com","new-domain.com"]' # Or remove restrictions (development only) export SCRAPER_TOOL_ALLOWED_DOMAINS='[]' ``` ### Issue: Rate limiting **Error:** `Rate limit exceeded` or `429 Too Many Requests` **Solutions:** 1. Implement delays between requests 2. Use rotating user agents 3. Respect robots.txt 4. Use proxy rotation 5. Implement exponential backoff ### Issue: Timeout errors **Error:** `Request timeout` or `Connection timeout` **Solutions:** 1. Increase timeout values 2. Check network connectivity 3. Use retry mechanisms 4. Implement circuit breakers ### Issue: List parsing error **Error:** Configuration parsing fails for domain lists **Solution:** ```bash # Use proper JSON array syntax export SCRAPER_TOOL_ALLOWED_DOMAINS='["example.com","api.example.com"]' # NOT: [example.com,api.example.com] or example.com,api.example.com ``` ### Issue: Output directory not writable **Error:** `Permission denied` when saving files **Solutions:** ```bash # Set writable output directory export SCRAPER_TOOL_OUTPUT_DIR="/writable/path" # Or create directory with proper permissions mkdir -p /path/to/outputs chmod 755 /path/to/outputs ``` ### Issue: Stealth mode not working **Error:** `playwright-stealth is not installed` warning in logs **Solutions:** ```bash # Install playwright-stealth pip install playwright-stealth # Or install with scraper extras pip install aiecs[scraper] # Verify installation python -c "from playwright_stealth import stealth_async; print('OK')" ``` ### Issue: Bot detection still occurring with stealth mode **Symptoms:** Website still detects automation despite stealth mode enabled **Solutions:** 1. **Verify stealth mode is enabled:** ```python # Check logs for "Stealth mode enabled for Playwright" message scraper = ScraperTool(config={'use_stealth': True}) result = await scraper.render(url, use_stealth=True) ``` 2. **Add additional delays:** ```python # Wait longer for page to load result = await scraper.render( url=url, wait_time=10, # Increase wait time use_stealth=True ) ``` 3. **Use realistic user agent:** ```bash export SCRAPER_TOOL_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" ``` 4. **Implement rate limiting:** - Add delays between requests - Randomize request timing - Respect robots.txt 5. **Note:** Some advanced bot detection systems may still detect automation. Stealth mode improves success rate but is not foolproof. ## Best Practices ### Web Scraping Ethics 1. **Respect robots.txt** - Always check and follow robots.txt files 2. **Rate limiting** - Implement delays between requests 3. **User agent identification** - Use descriptive, honest user agents 4. **Terms of service** - Read and follow website terms of service 5. **Legal compliance** - Ensure compliance with local laws and regulations ### Security 1. **Domain filtering** - Use allowed/blocked domain lists 2. **Content validation** - Validate scraped content for malicious code 3. **SSL verification** - Always verify SSL certificates in production 4. **Input sanitization** - Sanitize URLs and parameters 5. **Output security** - Secure output directories and files ### Performance 1. **Connection pooling** - Reuse HTTP connections when possible 2. **Async operations** - Use async methods for better concurrency 3. **Memory management** - Monitor memory usage with large content 4. **Caching** - Implement caching for frequently accessed content 5. **Resource limits** - Set appropriate content length limits ### Error Handling 1. **Retry mechanisms** - Implement exponential backoff for failed requests 2. **Circuit breakers** - Stop requests to failing services 3. **Graceful degradation** - Handle partial failures gracefully 4. **Logging** - Log errors and performance metrics 5. **Monitoring** - Monitor scraping success rates and performance ### Development vs Production **Development:** ```bash SCRAPER_TOOL_USER_AGENT=DevBot/1.0 SCRAPER_TOOL_OUTPUT_DIR=./scraper_outputs SCRAPER_TOOL_ALLOWED_DOMAINS='[]' SCRAPER_TOOL_BLOCKED_DOMAINS='[]' SCRAPER_TOOL_MAX_CONTENT_LENGTH=10485760 ``` **Production:** ```bash SCRAPER_TOOL_USER_AGENT=ProductionBot/2.0 (contact@company.com) SCRAPER_TOOL_OUTPUT_DIR=/app/scraper_outputs SCRAPER_TOOL_ALLOWED_DOMAINS='["trusted-site.com","api.trusted-site.com"]' SCRAPER_TOOL_BLOCKED_DOMAINS='["malicious.com","spam.com"]' SCRAPER_TOOL_MAX_CONTENT_LENGTH=52428800 ``` ### Error Handling Always wrap scraping operations in try-except blocks: ```python from aiecs.tools.scraper_tool import ScraperTool, HttpError, RateLimitError scraper_tool = ScraperTool() try: result = await scraper_tool.get_httpx(url) except HttpError as e: print(f"HTTP error: {e}") except TimeoutError as e: print(f"Timeout error: {e}") except RateLimitError as e: print(f"Rate limit error: {e}") except Exception as e: print(f"Unexpected error: {e}") ``` ## Installation Requirements ### Core Dependencies ```bash # Install core scraping dependencies pip install httpx beautifulsoup4 lxml # Install optional dependencies pip install playwright scrapy # Install stealth mode support pip install playwright-stealth # Or install all scraper extras at once pip install aiecs[scraper] ``` ### Playwright Setup ```bash # Install Playwright pip install playwright # Install browser binaries playwright install # Install specific browsers playwright install chromium playwright install firefox playwright install webkit ``` ### Stealth Mode Setup ```bash # Install playwright-stealth for anti-bot detection pip install playwright-stealth # Verify installation python -c "from playwright_stealth import stealth_async; print('Stealth mode available')" ``` ### Scrapy Setup ```bash # Install Scrapy pip install scrapy # Create a Scrapy project scrapy startproject myproject # Create a spider cd myproject scrapy genspider myspider example.com ``` ### Verification ```python # Test Playwright installation try: import playwright print("Playwright installed successfully") except ImportError: print("Playwright not installed") # Test Scrapy installation try: import scrapy print("Scrapy installed successfully") except ImportError: print("Scrapy not installed") ``` ## Related Documentation - Tool implementation details in the source code - Httpx documentation: https://www.python-httpx.org/ - BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/ - Playwright documentation: https://playwright.dev/python/ - Scrapy documentation: https://docs.scrapy.org/ - Main aiecs documentation for architecture overview ## Support For issues or questions about Scraper Tool configuration: - Check the tool source code for implementation details - Review HTTP client documentation for specific features - Consult the main aiecs documentation for architecture overview - Test with simple URLs first to isolate configuration vs. scraping issues - Monitor network traffic and response times - Validate SSL certificates and domain restrictions - Check robots.txt and terms of service compliance