Web Scraping with Proxy Servers and User Agents

Many websites monitor incoming traffic for patterns that indicate automated scraping. By default, most Python HTTP libraries send identifiable headers (such as python-requests) that make your scraper easy to flag. Additionally, if you make many requests from a single IP address, you risk being blocked. Using proxy servers and rotating user agents can help distribute requests and mimic real user traffic, reducing the likelihood of being banned.
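
To see what a site receives by default, you can inspect the headers that requests attaches to every outgoing request; the exact version string will vary with your installed requests version:

import requests

# requests.utils.default_headers() returns the headers sent with every request,
# including a User-Agent such as "python-requests/2.x.y" that servers can
# easily flag as automated traffic.
print(requests.utils.default_headers())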

Key Topics

Using Proxies to Avoid IP Blocking

A proxy server acts as an intermediary for requests, allowing you to mask your real IP. This can be useful if you need to scrape geographically restricted content or evade certain anti-scraping measures.

import requests

# Map the target URL scheme (http/https) to the proxy that should handle it.
proxy_dict = {
    "http": "http://123.45.67.89:8080",
    "https": "http://123.45.67.89:8080"  # most proxies are reached over plain HTTP, even for HTTPS targets
}

try:
    response = requests.get("https://example.com", proxies=proxy_dict, timeout=5)
    print("Status Code:", response.status_code)
    # Process the response...
except requests.exceptions.RequestException as e:
    print("Proxy error:", e)

Explanation: The proxies parameter tells requests which proxy to use for each URL scheme, routing traffic through the specified host and port. Free proxies can be unstable, so rotating several proxies or using a reputable paid service is often necessary; paid services usually require authentication, as sketched below.
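
If your proxy provider requires authentication, requests accepts credentials embedded directly in the proxy URL. The host, port, username, and password below are placeholders:

import requests

# Placeholder credentials -- replace with the values from your proxy provider.
proxy_dict = {
    "http": "http://user:password@123.45.67.89:8080",
    "https": "http://user:password@123.45.67.89:8080"
}

try:
    response = requests.get("https://example.com", proxies=proxy_dict, timeout=5)
    print("Status Code:", response.status_code)
except requests.exceptions.RequestException as e:
    print("Proxy error:", e)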

Example: Rotating Proxies

import requests
import random

proxies = [
    "http://123.45.67.89:8080",
    "http://98.76.54.32:8080",
    "http://11.22.33.44:8080"
]

url = "https://example.com"
proxy = {"http": random.choice(proxies), "https": random.choice(proxies)}

try:
    response = requests.get(url, proxies=proxy, timeout=5)
    print("Status Code:", response.status_code)
    # Process the response...
except requests.exceptions.RequestException as e:
    print("Proxy error:", e)

Explanation: This example rotates proxies by randomly selecting one from a list and using it for both HTTP and HTTPS traffic, further reducing the risk of being blocked. In a real scraper you would repeat the selection on every request, as sketched in the loop below.
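
A minimal loop-based sketch, reusing the placeholder proxy list from above and a couple of hypothetical page URLs, that picks a fresh proxy on every iteration and reports failures:

import random
import requests

proxies = [
    "http://123.45.67.89:8080",
    "http://98.76.54.32:8080",
    "http://11.22.33.44:8080"
]

urls = ["https://example.com/page1", "https://example.com/page2"]  # hypothetical pages

for url in urls:
    proxy_choice = random.choice(proxies)          # new proxy for every request
    proxy = {"http": proxy_choice, "https": proxy_choice}
    try:
        response = requests.get(url, proxies=proxy, timeout=5)
        print(url, "->", response.status_code)
    except requests.exceptions.RequestException as e:
        # On failure, you could drop the bad proxy from the list and retry.
        print("Proxy failed for", url, ":", e)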

Rotating User Agents to Mimic Human Activity

The user agent string identifies the browser or client making the request. By default, requests uses something like python-requests/2.x, which many sites can detect as non-human. Rotating a variety of browser-like user agents helps reduce suspicion.

import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/85.0"
]

headers = {"User-Agent": random.choice(user_agents)}

response = requests.get("https://example.com", headers=headers)
print("User-Agent used:", headers["User-Agent"])

Explanation: A list of user agents is maintained, and one is randomly chosen for each request. This technique helps simulate traffic from multiple browsers or operating systems, reducing your automated footprint.
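
Real browsers also send headers beyond User-Agent, such as Accept and Accept-Language. A small sketch of a helper that builds a more browser-like header set; the header values are illustrative rather than copied from any particular browser:

import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/85.0"
]

def random_headers():
    """Return a browser-like header set with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(user_agents),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",  # illustrative values
        "Accept-Language": "en-US,en;q=0.9",
    }

response = requests.get("https://example.com", headers=random_headers(), timeout=5)
print("Status Code:", response.status_code)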

Example: Combining Proxies and User Agents

import random
import requests

proxies = [
    "http://123.45.67.89:8080",
    "http://98.76.54.32:8080",
    "http://11.22.33.44:8080"
]

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/85.0"
]

url = "https://example.com"
proxy = {"http": random.choice(proxies), "https": random.choice(proxies)}
headers = {"User-Agent": random.choice(user_agents)}

try:
    response = requests.get(url, proxies=proxy, headers=headers, timeout=5)
    print("Status Code:", response.status_code)
    print("User-Agent used:", headers["User-Agent"])
    # Process the response...
except requests.exceptions.RequestException as e:
    print("Error:", e)

Explanation: This example combines proxy rotation and user agent rotation to further mimic human browsing behavior and reduce the risk of detection.
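
In a longer-running scraper you would usually wrap this logic in a helper that re-rolls the proxy and user agent whenever a request fails. A minimal sketch using placeholder proxies and user agents:

import random
import requests

proxies = ["http://123.45.67.89:8080", "http://98.76.54.32:8080"]  # placeholder proxies
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/85.0"
]

def fetch(url, proxies, user_agents, attempts=3, timeout=5):
    """Try a URL up to `attempts` times, picking a fresh proxy and user agent each time."""
    for _ in range(attempts):
        proxy_choice = random.choice(proxies)
        proxy = {"http": proxy_choice, "https": proxy_choice}
        headers = {"User-Agent": random.choice(user_agents)}
        try:
            return requests.get(url, proxies=proxy, headers=headers, timeout=timeout)
        except requests.exceptions.RequestException as e:
            print("Attempt failed via", proxy_choice, ":", e)
    return None

response = fetch("https://example.com", proxies, user_agents)
if response is not None:
    print("Status Code:", response.status_code)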

Using Headless Browsers for Scraping

Headless browsers such as headless Chrome (or the now-discontinued PhantomJS) load and render a webpage in an environment without a graphical interface. They are typically driven through Selenium and can be combined with additional strategies to appear more like a real user, such as setting language preferences, screen resolution, or timezone.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument('--headless')  # Run in headless mode
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1920,1080')
chrome_options.add_argument('--lang=en-US')

# Selenium 4.6+ downloads a matching ChromeDriver automatically (Selenium Manager);
# on older versions, install ChromeDriver manually and add it to your PATH.
browser = webdriver.Chrome(options=chrome_options)

browser.get("https://dynamicexample.com")
page_source = browser.page_source
soup = BeautifulSoup(page_source, "html.parser")

# Extract data using BeautifulSoup
data = soup.select("div.content")

for item in data:
    print(item.get_text())

browser.quit()

Explanation: Selenium launches a headless Chrome browser, navigates to the page, and loads all dynamic content. We then grab the final rendered HTML with browser.page_source and parse it using BeautifulSoup. Additional options like window size and language settings help mimic real user behavior.
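
If the content you need is injected by JavaScript after the initial load, browser.page_source may be captured too early. An explicit wait for a specific element helps; the div.content selector here is just the placeholder used above:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()
chrome_options.add_argument('--headless')

browser = webdriver.Chrome(options=chrome_options)
browser.get("https://dynamicexample.com")

# Wait up to 10 seconds for the target element to appear before reading the page source.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.content"))
)

page_source = browser.page_source
browser.quit()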

Example: Using Headless Browser with Proxy and User Agent

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import random

proxies = [
    "http://123.45.67.89:8080",
    "http://98.76.54.32:8080",
    "http://11.22.33.44:8080"
]

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/85.0"
]

proxy = random.choice(proxies)
user_agent = random.choice(user_agents)

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1920,1080')
chrome_options.add_argument('--lang=en-US')
chrome_options.add_argument(f'--proxy-server={proxy}')
chrome_options.add_argument(f'--user-agent={user_agent}')

browser = webdriver.Chrome(options=chrome_options)

browser.get("https://dynamicexample.com")
page_source = browser.page_source
browser.quit()

# Parse the final HTML with BeautifulSoup
soup = BeautifulSoup(page_source, "html.parser")
data = soup.select("div.content")

for item in data:
    print(item.get_text())

Explanation: This example combines the use of a headless browser with proxy and user agent rotation to further mimic real user behavior and reduce the risk of detection.
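
To confirm that the proxy and user agent are actually being applied, you can point the headless browser at an echo service such as httpbin.org (assuming it is reachable from your network) and inspect what the server sees:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--proxy-server=http://123.45.67.89:8080')  # placeholder proxy
chrome_options.add_argument('--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/85.0')

browser = webdriver.Chrome(options=chrome_options)

# httpbin.org echoes back the caller's apparent IP and request headers.
browser.get("https://httpbin.org/ip")
print(browser.find_element(By.TAG_NAME, "body").text)

browser.get("https://httpbin.org/user-agent")
print(browser.find_element(By.TAG_NAME, "body").text)

browser.quit()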

Key Takeaways

  • Proxy Servers: Mask your IP and rotate proxies to avoid blocks.
  • User Agents: Vary them to simulate multiple real-browser requests.
  • Headless Browsers: Selenium or Chrome Headless can render complex pages and further mimic real-user behavior.
  • Combining Techniques: Using proxies, rotating user agents, and headless browsers together can significantly reduce the risk of detection and blocking.