Description
crawl4ai version
0.7.4
Expected Behavior
Critical Bug: Incomplete Redirect Chain Detection Breaks Internal URL Extraction
Summary
The redirected_url attribute in the CrawlResult object only captures the first redirect in a redirect chain, not the final destination URL after multiple redirects. This causes severe issues with internal URL extraction, relative path resolution, and all URL-based processing operations.
Environment
- crawl4ai version: 0.7.4
- Python version: 3.12
- Browser: Chromium (via Playwright)
Steps to Reproduce
- Use crawl4ai to fetch a URL that has multiple redirects
- Check the redirected_url attribute in the result
- Compare with the actual final URL that the browser reaches
Test Code
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_redirect():
    url = 'https://zhaopin.sgcc.com.cn'
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        if result:
            print(f"Original URL: {url}")
            print(f"redirected_url: {result.redirected_url}")
            print(f"status_code: {result.status_code}")
            # The actual final URL should be: https://zhaopin.sgcc.com.cn/sgcchr/static/home.html
            # But redirected_url only shows: https://zhaopin.sgcc.com.cn/

asyncio.run(test_redirect())
Expected Behavior
The redirected_url attribute should contain the final destination URL after all redirects have been followed, which in this case should be:
https://zhaopin.sgcc.com.cn/sgcchr/static/home.html
Actual Behavior
The redirected_url attribute only contains the first redirect:
https://zhaopin.sgcc.com.cn/
Redirect Chain Analysis
Using Playwright directly, the complete redirect chain appears to be:
- https://zhaopin.sgcc.com.cn → https://zhaopin.sgcc.com.cn/ (HTTP redirect)
- https://zhaopin.sgcc.com.cn/ → https://zhaopin.sgcc.com.cn/sgcchr/static/home.html (JavaScript redirect or additional HTTP redirect)
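The HTTP portion of a chain like this can be checked by walking Playwright's `Request.redirected_from` links backward from the final response. The sketch below is an illustration under assumptions, not crawl4ai code: `http_redirect_chain` and `inspect_with_playwright` are hypothetical helper names, and the walk only sees HTTP redirects, so a JavaScript hop would show up only as a difference between the last chain entry and `page.url`.

```python
from typing import Any, List


def http_redirect_chain(request: Any) -> List[str]:
    """Return URLs from the first request to the last, following HTTP redirects.

    `request` is expected to behave like a Playwright Request: it exposes a
    `url` attribute and a `redirected_from` attribute (previous request or None).
    """
    chain: List[str] = []
    current = request
    while current is not None:
        chain.append(current.url)
        current = current.redirected_from
    chain.reverse()  # oldest hop first
    return chain


def inspect_with_playwright(url: str) -> None:
    """Hypothetical driver; needs the playwright package and an installed browser."""
    from playwright.sync_api import sync_playwright  # imported lazily on purpose

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        response = page.goto(url)
        # Give client-side redirects a chance to run before reading page.url.
        page.wait_for_load_state("networkidle")
        if response is not None:
            print("HTTP chain:", http_redirect_chain(response.request))
        print("Final URL :", page.url)
        browser.close()
```

Calling `inspect_with_playwright('https://zhaopin.sgcc.com.cn')` would print both views side by side; if the two disagree, the remaining hop is likely client-side.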
Impact
This bug affects applications that need to:
- Track the complete redirect chain
- Get the actual final URL after all redirects
- Perform URL-based analysis or caching based on the final destination
Suggested Solution
Consider adding one or more of the following:
- A final_url attribute that contains the actual final destination
- A redirect_chain attribute that contains the complete list of redirects
- Update the existing redirected_url to contain the final destination instead of just the first redirect
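To make the proposal concrete, here is a minimal sketch of how the two suggested attributes could relate to each other. `RedirectInfo` is a hypothetical container invented for illustration; it is not part of crawl4ai's `CrawlResult`, and the real design may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RedirectInfo:
    """Hypothetical shape for the proposed attributes (not crawl4ai API)."""

    requested_url: str
    # Every URL reached after the initial request, in order (HTTP or JavaScript hops).
    redirect_chain: List[str] = field(default_factory=list)

    @property
    def final_url(self) -> str:
        # With no redirects, the final URL is simply the requested one.
        return self.redirect_chain[-1] if self.redirect_chain else self.requested_url


info = RedirectInfo(
    requested_url="https://zhaopin.sgcc.com.cn",
    redirect_chain=[
        "https://zhaopin.sgcc.com.cn/",
        "https://zhaopin.sgcc.com.cn/sgcchr/static/home.html",
    ],
)
print(info.final_url)
```

Deriving `final_url` from `redirect_chain` keeps the two from drifting apart, which is one reason to expose both rather than only one.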
Additional Context
This appears to be a common pattern with many websites that use multiple redirects (HTTP + JavaScript) to reach their final destination. The current implementation only captures the first HTTP redirect but misses subsequent redirects that may be handled by JavaScript or additional HTTP redirects.
Critical Impact on Internal URL Extraction
This bug has severe consequences for applications that extract and process internal links:
Problem Analysis
When redirected_url only contains the first redirect (https://zhaopin.sgcc.com.cn/) instead of the final URL (https://zhaopin.sgcc.com.cn/sgcchr/static/home.html), all subsequent URL processing becomes incorrect:
- Wrong Base URL for Link Extraction: Internal link extraction uses the incomplete redirected_url as the base URL
- Incorrect Relative Path Resolution: Document-relative links (for example href="page.html") are resolved against the wrong directory; root-relative links (href="/page") resolve the same either way
- Wrong Internal Link Classification: Links are classified as internal/external based on the wrong domain whenever a later redirect changes the host
- Broken URL Normalization: All URL processing operations treat the first redirect as if it were the final URL
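The effect on relative path resolution can be reproduced with the standard library alone. In the sketch below the two base URLs come from this report, while the `href` values and the `resolve` helper are made up for illustration.

```python
from urllib.parse import urljoin

FIRST_HOP = "https://zhaopin.sgcc.com.cn/"
FINAL_URL = "https://zhaopin.sgcc.com.cn/sgcchr/static/home.html"


def resolve(base: str, href: str) -> str:
    """Resolve an href against a base URL the way a browser would (RFC 3986)."""
    return urljoin(base, href)


# Hypothetical hrefs: one document-relative, one root-relative.
for href in ("jobs/list.html", "/jobs/list.html"):
    print(f"{href!r}")
    print("  against first hop:", resolve(FIRST_HOP, href))
    print("  against final URL:", resolve(FINAL_URL, href))
```

Only the document-relative link changes between the two bases, so the damage depends on how the target site writes its links.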
Code Example of the Problem
# Current behavior with crawl4ai
result = await crawler.arun('https://zhaopin.sgcc.com.cn')
final_url = result.redirected_url  # Only gets: https://zhaopin.sgcc.com.cn/
# But the actual final URL should be: https://zhaopin.sgcc.com.cn/sgcchr/static/home.html

# This causes incorrect internal link extraction:
internal_links = extract_internal_links(html_content, final_url)  # Uses wrong base URL!
Real-World Impact
- Web Scraping: Extracted internal links point to wrong URLs
- SEO Analysis: Incorrect internal link structure analysis
- Content Analysis: Wrong base URL for relative resource resolution
- Caching Systems: Cache keys based on incorrect final URLs
- Security Analysis: Wrong domain-based security checks
Workaround
Currently, developers need to manually track redirects or use additional tools like Playwright to get the complete redirect chain, which defeats the purpose of using crawl4ai for this functionality.
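One possible shape for that manual tracking is to record every main-frame navigation Playwright reports. This is a sketch under assumptions: `NavigationLog` and `trace` are hypothetical names, the code drives Playwright directly rather than going through crawl4ai, and `framenavigated` fires when a navigation commits, so intermediate HTTP hops that never commit will not appear in the log.

```python
from typing import Any, List, Optional


class NavigationLog:
    """Collects the URLs committed by the main frame, in order."""

    def __init__(self) -> None:
        self.urls: List[str] = []

    def on_frame_navigated(self, frame: Any) -> None:
        # Ignore iframes; only the top-level frame defines the page URL.
        if frame.parent_frame is not None:
            return
        # Skip consecutive duplicates (for example, repeated events for one URL).
        if not self.urls or self.urls[-1] != frame.url:
            self.urls.append(frame.url)

    @property
    def final_url(self) -> Optional[str]:
        return self.urls[-1] if self.urls else None


async def trace(url: str) -> NavigationLog:
    """Hypothetical driver; needs the playwright package and an installed browser."""
    from playwright.async_api import async_playwright  # imported lazily on purpose

    log = NavigationLog()
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        page.on("framenavigated", log.on_frame_navigated)
        await page.goto(url)
        # Wait for the network to settle so client-side redirects can complete.
        await page.wait_for_load_state("networkidle")
        await browser.close()
    return log
```

Running `asyncio.run(trace('https://zhaopin.sgcc.com.cn'))` and reading `final_url` would give a base URL to use in place of `redirected_url` until the library reports the final destination itself.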
Additional Test Case
import asyncio
from crawl4ai import AsyncWebCrawler
from urllib.parse import urljoin

async def test_internal_link_extraction():
    url = 'https://zhaopin.sgcc.com.cn'
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        if result:
            # This is WRONG - only gets first redirect
            base_url = result.redirected_url or url
            print(f"Base URL used for link extraction: {base_url}")
            # Example: if HTML contains <a href="some-page"> (document-relative, no leading slash)
            # It gets resolved to: https://zhaopin.sgcc.com.cn/some-page
            # But should be: https://zhaopin.sgcc.com.cn/sgcchr/static/some-page
            # This affects every document-relative URL in the application;
            # a root-relative link such as "/some-page" resolves the same with either base
            relative_link = "some-page"
            resolved_url = urljoin(base_url, relative_link)
            print(f"Resolved relative link: {resolved_url}")
            print("This URL is likely INCORRECT due to wrong base URL!")

asyncio.run(test_internal_link_extraction())
Priority
This should be considered a HIGH PRIORITY bug because:
- It affects core functionality (URL processing)
- It causes silent failures in many use cases
- It breaks fundamental web scraping operations
- The impact is not immediately obvious to developers
Current Behavior
The redirected_url attribute only contains the first redirect:
https://zhaopin.sgcc.com.cn/
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
OS
macOS
Python version
3.12
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response