Web scraping with Python means programmatically extracting data from web pages. It’s commonly done with libraries like requests, BeautifulSoup, and Selenium (for dynamic, JavaScript-rendered content). Here’s a no-fluff overview to get you started:
🛠️ Basic Tools You’ll Need
- requests – to fetch the web page.
- BeautifulSoup – to parse HTML and extract data.
- pandas (optional) – to structure and store the data.
- Selenium – when you need to interact with JavaScript-heavy sites.
✅ Example: Scrape Quotes from http://quotes.toscrape.com
import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com"
response = requests.get(url)                          # fetch the page's HTML
soup = BeautifulSoup(response.text, 'html.parser')    # parse it into a searchable tree

quotes = soup.find_all('div', class_='quote')         # each quote sits in a <div class="quote">
for quote in quotes:
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    print(f"{text} - {author}")
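If you want the results as a table or a CSV, this is where pandas (listed above as optional) comes in. A minimal sketch, assuming the quotes list from the example above is still in scope; the quotes.csv filename is just an arbitrary choice:

import pandas as pd

rows = []
for quote in quotes:
    rows.append({
        'text': quote.find('span', class_='text').get_text(),
        'author': quote.find('small', class_='author').get_text(),
    })

df = pd.DataFrame(rows)                 # one row per quote
df.to_csv('quotes.csv', index=False)    # example output file name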
🧠 Key Concepts
- response.text gives the raw HTML.
- BeautifulSoup(html, 'html.parser') parses it.
- .find() / .find_all() locate HTML elements.
- .get_text() extracts readable content.
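To see those pieces in isolation, here’s a tiny self-contained sketch that parses a hard-coded HTML string instead of a live page (the markup is made up purely for illustration):

from bs4 import BeautifulSoup

html = """
<div class="quote"><span class="text">Hello</span><small class="author">Ada</small></div>
<div class="quote"><span class="text">World</span><small class="author">Alan</small></div>
"""

soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div', class_='quote')            # .find() -> first match (a single Tag)
all_quotes = soup.find_all('div', class_='quote')   # .find_all() -> list of every match

print(first.find('span', class_='text').get_text()) # "Hello" – .get_text() strips the tags
print(len(all_quotes))                               # 2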
⚠️ Tips & Ethics
- Always check the site’s robots.txt (e.g., example.com/robots.txt) to see what’s allowed.
- Don’t overload servers – be polite with delays (time.sleep()).
- Use headers to mimic a browser:
  headers = {'User-Agent': 'Mozilla/5.0'}
  requests.get(url, headers=headers)
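Putting those three tips together, here’s a minimal sketch of a “polite” crawl using Python’s built-in urllib.robotparser – the page range and the 1-second delay are arbitrary example choices, not requirements:

import time
import requests
from urllib.robotparser import RobotFileParser

BASE = "http://quotes.toscrape.com"
headers = {'User-Agent': 'Mozilla/5.0'}

# Check robots.txt before crawling
rp = RobotFileParser(BASE + "/robots.txt")
rp.read()

for page in range(1, 4):                          # a few pages, just as an example
    url = f"{BASE}/page/{page}/"
    if not rp.can_fetch(headers['User-Agent'], url):
        print(f"robots.txt disallows {url}, skipping")
        continue
    response = requests.get(url, headers=headers)
    print(url, response.status_code)
    time.sleep(1)                                 # be polite: pause between requests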
🧪 If You Need to Interact (JavaScript-Driven Sites)
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                  # recent Selenium versions locate ChromeDriver automatically
driver.implicitly_wait(5)                    # wait up to 5 s for elements – the quotes are rendered by JavaScript
driver.get("http://quotes.toscrape.com/js")

quotes = driver.find_elements(By.CLASS_NAME, "quote")
for quote in quotes:
    print(quote.text)

driver.quit()
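If an implicit wait isn’t enough (e.g., content only appears after a click or a slow request), Selenium’s explicit waits are the usual tool. A short sketch of the same page using WebDriverWait – the 10-second timeout is an arbitrary choice:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/js")

# Block until at least one element with class "quote" is present (or 10 s elapse)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "quote"))
)

for quote in driver.find_elements(By.CLASS_NAME, "quote"):
    print(quote.text)

driver.quit()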
Want to scrape a specific site or need help structuring the data? Drop the URL or your goal, and I’ll guide you.