Step 2
Static vs dynamic — BS4 + Playwright
25 min
The wrong tool can be 10x slower and 10x more likely to get blocked. Decide before you write any code.
1. Static (server-rendered)
curl returns the data you want, already in the HTML.
- 100–300 ms per request
- Light on CPU and memory
- Tools: requests or httpx for fetching, BeautifulSoup for parsing
2. Dynamic (JS-rendered)
The page source contains an empty <div id="app"></div> that JavaScript fills in after load.
- 2–10 s per page
- Hundreds of MB of memory per browser
- Tools: Playwright / Selenium
3. Decide fast
curl -s https://target.com/page | grep "the text you want"
No match? Open DevTools → Network → XHR. Often there's a JSON API you can call directly.
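The same check can be scripted when there are many candidate pages. The following is a minimal sketch using only the standard library; `fetch_html` and `is_server_rendered` are illustrative names, not part of any library, and a plain substring match is only a rough heuristic.

```python
import urllib.request


def fetch_html(url: str, user_agent: str = "MyBot/1.0") -> str:
    """Download the raw HTML exactly as the server sends it (no JS is run)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def is_server_rendered(html: str, needle: str) -> bool:
    """True if the text you want is already present in the raw HTML."""
    return needle.casefold() in html.casefold()
```

Usage: `is_server_rendered(fetch_html(url), "the text you want")`. A `False` result suggests the content is filled in by JavaScript or loaded from an API.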
4. Hidden APIs
Many "dynamic" sites actually call REST APIs. Calling them directly beats Playwright in speed and stability.
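As a rough illustration of calling such an API directly, here is a sketch using only the standard library. Everything about the endpoint is an assumption made for the example: the `page` and `per_page` query parameters, the `items` key, and the `title` and `price` fields. Copy the real request shape from the Network tab.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_api_url(base: str, page: int, per_page: int = 50) -> str:
    """Build a paginated URL (parameter names are hypothetical)."""
    return f"{base}?{urlencode({'page': page, 'per_page': per_page})}"


def extract_items(payload: dict) -> list[dict]:
    """Keep only the fields we need from an assumed {'items': [...]} payload."""
    return [
        {"title": item["title"], "price": item["price"]}
        for item in payload.get("items", [])
    ]


def fetch_items(base: str, page: int) -> list[dict]:
    """Fetch one page of JSON and reduce it to title/price dicts."""
    req = urllib.request.Request(
        build_api_url(base, page),
        headers={"User-Agent": "MyBot/1.0", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return extract_items(json.load(resp))
```

No HTML parsing and no browser are involved, which is where the speed and stability gain comes from.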
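5. httpx + BS4
import httpx
from bs4 import BeautifulSoup

async def scrape_items(url="https://example.com/page"):
    """Async generator: yield one dict per div.item on a static page."""
    async with httpx.AsyncClient(headers={"User-Agent": "MyBot/1.0"}) as client:
        resp = await client.get(url)
        resp.raise_for_status()  # fail loudly on 4xx/5xx
    soup = BeautifulSoup(resp.text, "html.parser")
    for item in soup.select("div.item"):
        title = item.select_one(".title")
        price = item.select_one(".price")
        if title is None or price is None:
            continue  # skip items missing either field
        yield {"title": title.text.strip(), "price": price.text.strip()}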
6. Playwright
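from playwright.async_api import async_playwright

async def scrape_titles(url):
    """Render a JS-driven page and return the text of every .item .title."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_selector(".item")  # wait until JS has rendered the items
            return await page.locator(".item .title").all_inner_texts()
        finally:
            await browser.close()  # close even if a wait times out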
7. Optimizations
await page.route("**/*.{png,jpg,gif,svg,woff,woff2,css}", lambda r: r.abort())
Blocking images, fonts, and stylesheets can make page loads 3–5x faster. Leave CSS unblocked if the page's scripts depend on layout.
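An alternative to matching file extensions is to filter on the request's resource type, which also catches assets served from URLs without an extension. This is a sketch, not the only way to do it; `BLOCKED_TYPES`, `should_block`, and `block_heavy_resources` are names chosen for the example.

```python
# Resource types (as reported by Playwright's request.resource_type)
# that rarely matter when you only want the text content.
BLOCKED_TYPES = {"image", "font", "stylesheet", "media"}


def should_block(resource_type: str) -> bool:
    """Decide whether a request of this type should be aborted."""
    return resource_type in BLOCKED_TYPES


async def block_heavy_resources(route) -> None:
    """Route handler: abort heavy assets, let everything else through."""
    if should_block(route.request.resource_type):
        await route.abort()
    else:
        await route.continue_()
```

Register it once per page with `await page.route("**/*", block_heavy_resources)` before calling `page.goto`.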
8. Reusable context
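# One context acts as one isolated browser profile (assumes `browser` from the previous step)
context = await browser.new_context(user_agent="MyBot/1.0")
page1 = await context.new_page()
page2 = await context.new_page()
Pages in the same context share cookies and storage, so a login on page1 carries over to page2.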
9. Hybrid
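Use Playwright once for the login or the JS-rendered listing, then fetch the static detail pages in parallel with httpx + BS4. (extract_urls_with_playwright and fetch_bs4 are placeholder helpers you write yourself.)
import asyncio
import httpx

async def scrape_hybrid(list_page):
    urls = await extract_urls_with_playwright(list_page)
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*[fetch_bs4(client, u) for u in urls])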
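A plain `asyncio.gather` starts every request at once, which can hammer a site when the URL list is long. One possible way to cap concurrency is a semaphore, sketched below; `gather_limited` and its `limit` parameter are hypothetical names for this example, and `fetch` stands for any coroutine function that takes a URL.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")


async def gather_limited(
    fetch: Callable[[str], Awaitable[T]],
    urls: Iterable[str],
    limit: int = 5,
) -> list[T]:
    """Run fetch(url) for every URL with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str) -> T:
        async with sem:  # wait here while `limit` fetches are already running
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))
```

With the placeholder helper from this section, the call would look like `details = await gather_limited(lambda u: fetch_bs4(client, u), urls, limit=5)`.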
10. Gotchas
- Using Playwright on static pages: wasted time and memory
- Using BS4 on SPAs: the HTML comes back empty
- Overlooking hidden APIs: check the Network tab first
- Keeping Playwright's default 30 s timeout: too short on slow sites
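For the timeout gotcha, a minimal sketch of raising the limit on a Playwright page. The helper name `open_slow_page` and the 60-second value are assumptions for illustration; tune the number to the site.

```python
async def open_slow_page(page, url: str, timeout_ms: int = 60_000) -> None:
    """Navigate with a longer timeout than Playwright's 30 s default."""
    # Applies to later waits on this page (wait_for_selector, click, ...).
    # Note: set_default_timeout is a plain method, not a coroutine.
    page.set_default_timeout(timeout_ms)
    # Per-call override for the navigation itself.
    await page.goto(url, timeout=timeout_ms)
```

Timeouts are given in milliseconds, so `60_000` means 60 seconds.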
Closing
"curl first, hidden API next, Playwright last" — preserves speed, stability, and politeness.
Next
- 03-rate-limit-backoff