Error Handling and Robust Scraping
Scraping scripts often run for extended periods or over numerous URLs, making them prone to intermittent issues such as connection timeouts and malformed HTML. A robust scraper anticipates these problems and handles them gracefully.
Key Topics
- Managing Connection Errors and Timeouts
- Dealing with Missing Data
- Non-Existent Elements
- Logging Errors for Debugging
- Implementing Retry Logic
- Handling Malformed HTML
Managing Connection Errors and Timeouts
Network instability or server hiccups can cause requests to fail. Use try-except blocks to catch exceptions from the requests library (they all derive from requests.exceptions.RequestException) and decide whether to retry or log the error.
import requests
from requests.exceptions import RequestException

url_list = ["https://example.com", "https://nonexistentdomain.tld"]

for url in url_list:
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()  # Check HTTP errors
        # Process response here...
    except RequestException as e:
        print(f"Error while fetching {url}: {e}")
Explanation: By catching RequestException, your script avoids crashing if a domain doesn’t exist or if a request times out. raise_for_status() raises an HTTPError for error status codes (4xx and 5xx, e.g., 404 or 500).
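If you want to treat timeouts differently from HTTP errors, requests also exposes more specific exception classes; a minimal sketch (the URL is a placeholder):

import requests
from requests.exceptions import HTTPError, ConnectionError, Timeout, RequestException

try:
    response = requests.get("https://example.com", timeout=5)
    response.raise_for_status()
except Timeout:
    print("The request timed out")
except ConnectionError:
    print("Could not reach the server")
except HTTPError as e:
    print(f"Server returned an error status: {e.response.status_code}")
except RequestException as e:
    print(f"Other request error: {e}")

Catching the specific classes first matters, since they all subclass RequestException.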
Dealing with Missing Data
Some pages won’t have the elements you expect. For instance, a user profile might omit fields the user never filled in. Check for None before attempting to access .text or attributes.
from bs4 import BeautifulSoup
example_html = """\
<div class="profile">
<p class="name">Alice</p>
<!-- Age info is missing here -->
</div>
"""
soup = BeautifulSoup(example_html, "html.parser")
name = soup.find('p', class_='name')
age = soup.find('p', class_='age') # Might not exist
name_text = name.text if name else "No name found"
age_text = age.text if age else "No age provided"
print("Name:", name_text)
print("Age:", age_text)
Explanation: This approach ensures the script doesn’t fail if age is missing. Instead, it falls back to a default or logs a placeholder message.
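The same caution applies to tag attributes: indexing a tag with tag['href'] raises a KeyError when the attribute is absent, while tag.get('href') returns None. A short sketch with a made-up snippet:

from bs4 import BeautifulSoup

snippet = '<div class="profile"><a class="website">Homepage</a></div>'
soup = BeautifulSoup(snippet, "html.parser")

link = soup.find('a', class_='website')
# link['href'] would raise KeyError here; .get() returns None instead
href = link.get('href') if link else None
print("Website:", href if href else "No website provided")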
Non-Existent Elements
Similar to missing data, some elements might not appear at all, or they may appear only under certain conditions (e.g., when a user is logged in). Always guard your scraping logic against the assumption that all pages look identical.
Tip: Testing your scraper on various sample pages, especially edge cases, can reveal discrepancies that might otherwise cause runtime errors.
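For example, a block that only appears for logged-in users can be handled with the same None check used above (the class names below are invented for illustration):

from bs4 import BeautifulSoup

page_html = """\
<div class="article">
  <h1 class="title">Scraping Basics</h1>
  <!-- A div with class "premium-content" appears only for subscribers -->
</div>
"""

soup = BeautifulSoup(page_html, "html.parser")

title = soup.find('h1', class_='title')
premium = soup.find('div', class_='premium-content')  # May not exist on this page

print("Title:", title.text if title else "No title found")
if premium is None:
    print("Premium section not present; skipping its fields")
else:
    print("Premium content:", premium.text)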
Logging Errors for Debugging
Logging errors and other significant events can help you debug issues and understand the scraper's behavior over time. Use Python's built-in logging module to record errors, warnings, and informational messages.
import logging
import requests
from requests.exceptions import RequestException

# Configure logging to write to a file with timestamps and severity levels
logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

try:
    response = requests.get('https://example.com', timeout=5)
    response.raise_for_status()
    # Process response here...
except RequestException as e:
    logging.error(f"Error while fetching URL: {e}")
Explanation: This example configures a logger to write messages to scraper.log. Errors are logged with a timestamp and severity level, aiding in post-mortem analysis.
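In a longer crawl you would typically mix severity levels, for instance logging each successful fetch at INFO and recoverable problems at WARNING. A small sketch continuing from the setup above (the URLs are placeholders):

urls = ["https://example.com", "https://example.org"]

for url in urls:
    logging.info(f"Fetching {url}")
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        logging.info(f"Fetched {url} ({len(response.content)} bytes)")
    except RequestException as e:
        logging.warning(f"Skipping {url}: {e}")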
Implementing Retry Logic
Sometimes, transient errors can be resolved by simply retrying the request. Implementing a retry mechanism with exponential backoff can help mitigate temporary issues without overwhelming the server.
import time
import logging
import requests
from requests.exceptions import RequestException

url = "https://example.com"
max_retries = 3
backoff_factor = 2

for attempt in range(max_retries):
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        # Process response here...
        break  # Exit the loop if the request succeeds
    except RequestException as e:
        wait_time = backoff_factor ** attempt  # 1s, 2s, 4s, ...
        logging.warning(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait_time} seconds...")
        time.sleep(wait_time)
else:
    logging.error(f"Failed to fetch {url} after {max_retries} attempts.")
Explanation: This example attempts the request up to three times, doubling the wait after each failure (exponential backoff). If every attempt fails, the for-else clause logs an error. Backing off between retries handles transient issues without hammering the server.
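If you prefer not to hand-roll the loop, a requests Session can delegate retries to urllib3's Retry class via an HTTPAdapter; a minimal sketch (the retry counts and status codes are illustrative choices, not requirements):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=2,
                status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries)
session.mount("https://", adapter)
session.mount("http://", adapter)

# Retries for connection errors and the listed status codes happen automatically
response = session.get("https://example.com", timeout=5)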
Handling Malformed HTML
Web pages with malformed HTML can cause parsing errors. BeautifulSoup is quite forgiving, but sometimes you need to clean or preprocess the HTML before parsing.
from bs4 import BeautifulSoup
malformed_html = """\
<html>
<head><title>Test</title></head>
<body>
<div>Hello World
<p>This is a paragraph
</body>
</html>
"""
# Use BeautifulSoup to parse the malformed HTML
soup = BeautifulSoup(malformed_html, "html.parser")
# Extract data
title = soup.title.text if soup.title else "No title found"
paragraph = soup.find('p').text if soup.find('p') else "No paragraph found"
print("Title:", title)
print("Paragraph:", paragraph)
Explanation: BeautifulSoup can handle many common HTML errors, but it’s good practice to check for the existence of elements before accessing their properties.
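If html.parser builds an odd tree from badly broken markup, one option is a more lenient parser such as html5lib, which repairs documents the way a browser would (it must be installed separately with pip install html5lib). A short sketch reusing the snippet above:

from bs4 import BeautifulSoup

# Requires: pip install html5lib
soup = BeautifulSoup(malformed_html, "html5lib")

# html5lib closes the unterminated tags, so the paragraph is still reachable
paragraph = soup.find('p')
print("Paragraph:", paragraph.text.strip() if paragraph else "No paragraph found")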
Key Takeaways
- Errors & Timeouts: Wrap requests in try-except blocks and handle network disruptions gracefully.
- Missing Elements: Check for None before accessing text or attributes.
- Logging: Use logging to record errors and significant events for debugging.
- Retry Logic: Implement retry mechanisms with exponential backoff to handle transient errors.
- Malformed HTML: Use BeautifulSoup’s forgiving parser and check for element existence.
- Robustness: A well-tested scraper can handle real-world inconsistencies without crashing.