Error Handling and Robust Scraping
Scraping scripts often run for extended periods or over numerous URLs, making them prone to intermittent issues such as connection timeouts and malformed HTML. A robust scraper anticipates these problems and handles them gracefully.
Key Topics
- Managing Connection Errors and Timeouts
- Dealing with Missing Data
- Non-Existent Elements
- Logging Errors for Debugging
- Implementing Retry Logic
- Handling Malformed HTML
Managing Connection Errors and Timeouts
Network instability or server hiccups can cause requests to fail. Use try-except blocks to catch exceptions from the requests library (they all derive from requests.exceptions.RequestException) and decide whether to retry or log the error.
import requests
from requests.exceptions import RequestException

url_list = ["https://example.com", "https://nonexistentdomain.tld"]

for url in url_list:
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()  # Check HTTP errors
        # Process response here...
    except RequestException as e:
        print(f"Error while fetching {url}: {e}")
Explanation: By catching RequestException, your script avoids crashing if a domain doesn’t exist or if a request times out. raise_for_status() raises an HTTPError for error status codes (4xx and 5xx, e.g., 404 or 500).
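If you want to treat timeouts differently from HTTP errors, requests also exposes more specific exception classes; a minimal sketch (the URL is a placeholder):

import requests
from requests.exceptions import HTTPError, ConnectionError, Timeout, RequestException

try:
    response = requests.get("https://example.com", timeout=5)
    response.raise_for_status()
except Timeout:
    print("The request timed out")
except ConnectionError:
    print("Could not reach the server")
except HTTPError as e:
    print(f"Server returned an error status: {e.response.status_code}")
except RequestException as e:
    print(f"Other request error: {e}")

Catching the specific classes first matters, since they all subclass RequestException.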
Dealing with Missing Data
Some pages won’t have the elements you expect. For instance, a user profile might omit fields the user never filled in. Check for None before attempting to access .text or attributes.
from bs4 import BeautifulSoup
example_html = """\
<div class="profile">
<p class="name">Alice</p>
<!-- Age info is missing here -->
</div>
"""
soup = BeautifulSoup(example_html, "html.parser")
name = soup.find('p', class_='name')
age = soup.find('p', class_='age') # Might not exist
name_text = name.text if name else "No name found"
age_text = age.text if age else "No age provided"
print("Name:", name_text)
print("Age:", age_text)
Explanation: This approach ensures the script doesn’t fail if age is missing. Instead, it falls back to a default or logs a placeholder message.
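The same caution applies to tag attributes: indexing a tag with tag['href'] raises a KeyError when the attribute is absent, while tag.get('href') returns None. A short sketch with a made-up snippet:

from bs4 import BeautifulSoup

snippet = '<div class="profile"><a class="website">Homepage</a></div>'
soup = BeautifulSoup(snippet, "html.parser")

link = soup.find('a', class_='website')
# link['href'] would raise KeyError here; .get() returns None instead
href = link.get('href') if link else None
print("Website:", href if href else "No website provided")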
Non-Existent Elements
Similar to missing data, some elements might not appear at all, or they may appear only under certain conditions (e.g., when a user is logged in). Always guard your scraping logic against the assumption that all pages look identical.
Tip: Testing your scraper on various sample pages, especially edge cases, can reveal discrepancies that might otherwise cause runtime errors.
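For example, a block that only appears for logged-in users can be handled with the same None check used above (the class names below are invented for illustration):

from bs4 import BeautifulSoup

page_html = """\
<div class="article">
  <h1 class="title">Scraping Basics</h1>
  <!-- A div with class "premium-content" appears only for subscribers -->
</div>
"""

soup = BeautifulSoup(page_html, "html.parser")

title = soup.find('h1', class_='title')
premium = soup.find('div', class_='premium-content')  # May not exist on this page

print("Title:", title.text if title else "No title found")
if premium is None:
    print("Premium section not present; skipping its fields")
else:
    print("Premium content:", premium.text)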
Logging Errors for Debugging
Logging errors and other significant events can help you debug issues and understand the scraper's behavior over time. Use Python's built-in logging module to record errors, warnings, and informational messages.
import logging
import requests
from requests.exceptions import RequestException

# Configure logging to write to a file with timestamps and severity levels
logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

try:
    response = requests.get('https://example.com', timeout=5)
    response.raise_for_status()
    # Process response here...
except RequestException as e:
    logging.error(f"Error while fetching URL: {e}")
Explanation: This example configures a logger to write messages to scraper.log. Errors are logged with a timestamp and severity level, aiding in post-mortem analysis.
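In a longer crawl you would typically mix severity levels, for instance logging each successful fetch at INFO and recoverable problems at WARNING. A small sketch continuing from the setup above (the URLs are placeholders):

urls = ["https://example.com", "https://example.org"]

for url in urls:
    logging.info(f"Fetching {url}")
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        logging.info(f"Fetched {url} ({len(response.content)} bytes)")
    except RequestException as e:
        logging.warning(f"Skipping {url}: {e}")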
Implementing Retry Logic
Sometimes, transient errors can be resolved by simply retrying the request. Implementing a retry mechanism with exponential backoff can help mitigate temporary issues without overwhelming the server.
import time
import logging
import requests
from requests.exceptions import RequestException

url = "https://example.com"
max_retries = 3
backoff_factor = 2

for attempt in range(max_retries):
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        # Process response here...
        break  # Exit the loop if the request succeeds
    except RequestException as e:
        wait_time = backoff_factor ** attempt  # 1s, 2s, 4s, ...
        logging.warning(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait_time} seconds...")
        time.sleep(wait_time)
else:
    logging.error(f"Failed to fetch {url} after {max_retries} attempts.")
Explanation: This example attempts the request up to three times, doubling the wait after each failure (exponential backoff). If every attempt fails, the for-else clause logs an error. Backing off between retries handles transient issues without hammering the server.
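If you prefer not to hand-roll the loop, a requests Session can delegate retries to urllib3's Retry class via an HTTPAdapter; a minimal sketch (the retry counts and status codes are illustrative choices, not requirements):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=2,
                status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries)
session.mount("https://", adapter)
session.mount("http://", adapter)

# Retries for connection errors and the listed status codes happen automatically
response = session.get("https://example.com", timeout=5)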
Handling Malformed HTML
Web pages with malformed HTML can cause parsing errors. BeautifulSoup is quite forgiving, but sometimes you need to clean or preprocess the HTML before parsing.
from bs4 import BeautifulSoup
malformed_html = """\
<html>
<head><title>Test</title></head>
<body>
<div>Hello World
<p>This is a paragraph
</body>
</html>
"""
# Use BeautifulSoup to parse the malformed HTML
soup = BeautifulSoup(malformed_html, "html.parser")
# Extract data
title = soup.title.text if soup.title else "No title found"
paragraph = soup.find('p').text if soup.find('p') else "No paragraph found"
print("Title:", title)
print("Paragraph:", paragraph)
Explanation: BeautifulSoup can handle many common HTML errors, but it’s good practice to check for the existence of elements before accessing their properties.
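If html.parser builds an odd tree from badly broken markup, one option is a more lenient parser such as html5lib, which repairs documents the way a browser would (it must be installed separately with pip install html5lib). A short sketch reusing the snippet above:

from bs4 import BeautifulSoup

# Requires: pip install html5lib
soup = BeautifulSoup(malformed_html, "html5lib")

# html5lib closes the unterminated tags, so the paragraph is still reachable
paragraph = soup.find('p')
print("Paragraph:", paragraph.text.strip() if paragraph else "No paragraph found")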
Key Takeaways
- Errors & Timeouts: Wrap requests in try-except blocks and handle network disruptions gracefully.
- Missing Elements: Check for None before accessing text or attributes.
- Logging: Use logging to record errors and significant events for debugging.
- Retry Logic: Implement retry mechanisms with exponential backoff to handle transient errors.
- Malformed HTML: Use BeautifulSoup’s forgiving parser and check for element existence.
- Robustness: A well-tested scraper can handle real-world inconsistencies without crashing.