[Bug]: Critical Bug: Incomplete Redirect Chain Detection Breaks Internal URL Extraction #1472

@mo-xiaoxi

crawl4ai version

0.7.4

Summary

The redirected_url attribute in the CrawlResult object only captures the first redirect in a redirect chain, not the final destination URL after multiple redirects. This causes severe issues with internal URL extraction, relative path resolution, and all URL-based processing operations.

Environment

  • crawl4ai version: 0.7.4
  • Python version: 3.12
  • Browser: Chromium (via Playwright)

Steps to Reproduce

  1. Use crawl4ai to fetch a URL that has multiple redirects
  2. Check the redirected_url attribute in the result
  3. Compare with the actual final URL that the browser reaches

Test Code

import asyncio
from crawl4ai import AsyncWebCrawler

async def test_redirect():
    url = 'https://zhaopin.sgcc.com.cn'
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        
        if result:
            print(f"Original URL: {url}")
            print(f"redirected_url: {result.redirected_url}")
            print(f"status_code: {result.status_code}")
            
            # The actual final URL should be: https://zhaopin.sgcc.com.cn/sgcchr/static/home.html
            # But redirected_url only shows: https://zhaopin.sgcc.com.cn/

asyncio.run(test_redirect())

Expected Behavior

The redirected_url attribute should contain the final destination URL after all redirects have been followed, which in this case should be:
https://zhaopin.sgcc.com.cn/sgcchr/static/home.html

Actual Behavior

The redirected_url attribute only contains the first redirect:
https://zhaopin.sgcc.com.cn/

Redirect Chain Analysis

Using Playwright directly, the complete redirect chain appears to be:

  1. https://zhaopin.sgcc.com.cn → https://zhaopin.sgcc.com.cn/ (HTTP redirect)
  2. https://zhaopin.sgcc.com.cn/ → https://zhaopin.sgcc.com.cn/sgcchr/static/home.html (JavaScript redirect or additional HTTP redirect)

Impact

This bug affects applications that need to:

  • Track the complete redirect chain
  • Get the actual final URL after all redirects
  • Perform URL-based analysis or caching based on the final destination

Suggested Solution

Consider one or more of the following:

  1. A final_url attribute that contains the actual final destination
  2. A redirect_chain attribute that contains the complete list of redirects
  3. Update the existing redirected_url to contain the final destination instead of just the first redirect
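
To make options 1 and 2 concrete, here is a minimal sketch of how a result could expose both values. RedirectTrace, record, final_url and redirect_chain are hypothetical names used only for illustration; none of them is an existing crawl4ai API, and how the crawler would collect the hops is left open.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RedirectTrace:
    """Hypothetical container for redirect data; not a crawl4ai class."""

    requested_url: str
    redirect_chain: List[str] = field(default_factory=list)

    def record(self, url: str) -> None:
        """Append a hop, skipping a URL identical to the previous one."""
        last = self.redirect_chain[-1] if self.redirect_chain else self.requested_url
        if url != last:
            self.redirect_chain.append(url)

    @property
    def final_url(self) -> str:
        """Last recorded hop, or the requested URL if nothing redirected."""
        return self.redirect_chain[-1] if self.redirect_chain else self.requested_url


# Hops taken from the redirect chain described in this issue.
trace = RedirectTrace("https://zhaopin.sgcc.com.cn")
trace.record("https://zhaopin.sgcc.com.cn/")
trace.record("https://zhaopin.sgcc.com.cn/sgcchr/static/home.html")
print(trace.final_url)
print(trace.redirect_chain)
```

Keeping the chain as the single source of truth and deriving final_url from it means the two values cannot disagree.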

Additional Context

This appears to be a common pattern with many websites that use multiple redirects (HTTP + JavaScript) to reach their final destination. The current implementation only captures the first HTTP redirect but misses subsequent redirects that may be handled by JavaScript or additional HTTP redirects.

Critical Impact on Internal URL Extraction

This bug has severe consequences for applications that extract and process internal links:

Problem Analysis

When redirected_url only contains the first redirect (https://zhaopin.sgcc.com.cn/) instead of the final URL (https://zhaopin.sgcc.com.cn/sgcchr/static/home.html), all subsequent URL processing becomes incorrect:

  1. Wrong Base URL for Link Extraction: Internal link extraction uses the incomplete redirected_url as the base URL
  2. Incorrect Relative Path Resolution: All relative links are resolved against the wrong base URL
  3. Wrong Internal Link Classification: Links can be misclassified as internal or external whenever a later redirect changes the host
  4. Broken URL Normalization: All URL processing operations treat the first redirect as if it were the final URL
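
The effect on relative path resolution can be checked with the standard library alone. The sketch below resolves a few links against both candidate base URLs; the href values are made up for illustration and are not taken from the actual page, and is_internal is a simple hypothetical host comparison rather than crawl4ai's own classifier.

```python
from urllib.parse import urljoin, urlparse

FIRST_REDIRECT = "https://zhaopin.sgcc.com.cn/"
FINAL_URL = "https://zhaopin.sgcc.com.cn/sgcchr/static/home.html"


def is_internal(base_url: str, link: str) -> bool:
    """Treat a link as internal when its host matches the base URL's host."""
    return urlparse(link).hostname == urlparse(base_url).hostname


# Hypothetical hrefs: two document-relative links and one root-relative link.
hrefs = ["news/list.html", "./jobs.html", "/about"]

for href in hrefs:
    with_first = urljoin(FIRST_REDIRECT, href)
    with_final = urljoin(FINAL_URL, href)
    status = "same" if with_first == with_final else "DIFFERENT"
    print(f"{href!r}: {with_first} vs {with_final} ({status})")
```

Document-relative links come out differently depending on the base, while root-relative links such as "/about" resolve identically, so the damage depends on how the target page writes its links.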

Code Example of the Problem

# Current behavior with crawl4ai (run inside an async function with an open crawler)
result = await crawler.arun('https://zhaopin.sgcc.com.cn')
final_url = result.redirected_url  # Only gets: https://zhaopin.sgcc.com.cn/
# But the actual final URL should be: https://zhaopin.sgcc.com.cn/sgcchr/static/home.html

# This causes incorrect internal link extraction.
# extract_internal_links stands for the application's own helper, not a crawl4ai function:
internal_links = extract_internal_links(result.html, final_url)  # Uses wrong base URL!

Real-World Impact

  • Web Scraping: Extracted internal links point to wrong URLs
  • SEO Analysis: Incorrect internal link structure analysis
  • Content Analysis: Wrong base URL for relative resource resolution
  • Caching Systems: Cache keys based on incorrect final URLs
  • Security Analysis: Wrong domain-based security checks

Workaround

Currently, developers need to manually track redirects or use additional tools like Playwright to get the complete redirect chain, which defeats the purpose of using crawl4ai for this functionality.
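
A rough sketch of that manual workaround with Playwright's async API follows. The helper names http_redirect_chain and resolve_final_url are my own, and waiting for "networkidle" is an assumption about when a JavaScript redirect has finished; slow or timer-based redirects may need an extra wait.

```python
def http_redirect_chain(response):
    """Return the URLs of the HTTP redirect chain behind `response`, oldest first.

    Walks Playwright's Request.redirected_from links back to the first request.
    """
    chain = []
    request = response.request
    while request is not None:
        chain.append(request.url)
        request = request.redirected_from
    chain.reverse()
    return chain


async def resolve_final_url(url):
    """Open `url` in Chromium and return (final_url, chain)."""
    # Imported here so the pure helper above works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        response = await page.goto(url, wait_until="networkidle")
        chain = http_redirect_chain(response) if response else [url]
        # page.url reflects client-side navigations that HTTP hops do not show.
        final_url = page.url
        await browser.close()

    if chain[-1] != final_url:
        chain.append(final_url)
    return final_url, chain
```

It can be run with asyncio.run(resolve_final_url('https://zhaopin.sgcc.com.cn')); the returned final URL can then be used as the base for link resolution in place of redirected_url.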

Additional Test Case

import asyncio
from crawl4ai import AsyncWebCrawler
from urllib.parse import urljoin

async def test_internal_link_extraction():
    url = 'https://zhaopin.sgcc.com.cn'
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        
        if result:
            # This is WRONG - only gets first redirect
            base_url = result.redirected_url or url
            print(f"Base URL used for link extraction: {base_url}")
            
            # Example: if HTML contains <a href="some-page"> (a document-relative link)
            # It gets resolved to: https://zhaopin.sgcc.com.cn/some-page
            # But should be: https://zhaopin.sgcc.com.cn/sgcchr/static/some-page
            # (root-relative links such as "/some-page" resolve the same with either base)
            
            # This affects ALL document-relative URL resolution in the application
            relative_link = "some-page"
            resolved_url = urljoin(base_url, relative_link)
            print(f"Resolved relative link: {resolved_url}")
            print("This URL is likely INCORRECT due to wrong base URL!")

asyncio.run(test_internal_link_extraction())

Priority

This should be considered a HIGH PRIORITY bug because:

  • It affects core functionality (URL processing)
  • It causes silent failures in many use cases
  • It breaks fundamental web scraping operations
  • The impact is not immediately obvious to developers

Is this reproducible?

Yes

OS

macOS

Python version

3.12

Browser

Chrome
