[Bug]: Crawl4AI does not resolve relative links to absolute when crawling, but does when generating markdown #1480

@Tayomide

Description

crawl4ai version

0.7.4

Expected Behavior

While crawling http://aifs-content.eastus.azurecontainer.io/, Crawl4AI should resolve relative links (in the header) to absolute links and then crawl those pages without raising errors.

Current Behavior

While crawling http://aifs-content.eastus.azurecontainer.io/ with this configuration:

browser_conf = BrowserConfig(headless=True)
run_conf = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=depth, include_external=False
    ),
    check_robots_txt=True,
)

Crawl4AI throws the error below for relative links

Image

Some digging showed that the error is raised when the relative URL fails validation in the function can_process_url.

Image
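To illustrate why a relative link would fail this kind of validation, here is a minimal sketch using only the standard library. It is not the actual can_process_url implementation; has_scheme_and_host is a hypothetical stand-in, and the paths "/about" and "about.html" are made-up examples rather than links taken from the site.

```python
from urllib.parse import urlparse


def has_scheme_and_host(url: str) -> bool:
    """Hypothetical validator: accept only http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# An absolute URL carries both a scheme and a host, so it passes.
print(has_scheme_and_host("http://aifs-content.eastus.azurecontainer.io/"))

# Relative hrefs (made-up examples) have neither, so a check like this rejects them.
print(has_scheme_and_host("/about"))
print(has_scheme_and_host("about.html"))
```

If the crawler validates hrefs before resolving them against the page URL, any check of this shape would reject every relative link, which would match the behavior described above.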

The generated raw_markdown, on the other hand, resolves the relative URLs to their absolute counterparts.

I have not been able to find out how it does this, but I was wondering whether the same resolution could be applied in the crawl logic.
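As a rough idea of what that resolution could look like, the sketch below resolves discovered hrefs against the page URL with the standard library's urljoin before they would be validated. This is an assumption about a possible fix, not Crawl4AI's own code; resolve_links and the sample hrefs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def resolve_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against base_url, drop fragments, de-duplicate in order."""
    seen = set()
    resolved = []
    for href in hrefs:
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


base = "http://aifs-content.eastus.azurecontainer.io/"
# Hypothetical hrefs, standing in for the header links on the page.
sample = ["/about", "docs/intro.html", "/about#team", "http://example.com/x"]
for link in resolve_links(base, sample):
    print(link)
```

Resolving first and validating second means the validator only ever sees absolute URLs, and the include_external=False filter can still drop links that point to other hosts.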

Is this reproducible?

Yes

Inputs Causing the Bug

- URL: http://aifs-content.eastus.azurecontainer.io/
- Settings Used:

browser_conf = BrowserConfig(headless=True)
run_conf = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=depth, include_external=False
    ),
    check_robots_txt=True,
)

Steps to Reproduce

Code snippets

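import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

url = "http://aifs-content.eastus.azurecontainer.io/"
depth = 3

# Configure browser and crawler settings
browser_conf = BrowserConfig(headless=True)
run_conf = CrawlerRunConfig(
  deep_crawl_strategy=BFSDeepCrawlStrategy(
    max_depth=depth, include_external=False
  ),
  check_robots_txt=True
)

async def main():
  # Perform crawling; the error appears for the relative header links
  async with AsyncWebCrawler(config=browser_conf) as crawler:
    results = await crawler.arun(url=url, config=run_conf)

asyncio.run(main())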

OS

Linux

Python version

3.11.9

Browser

GitHub Codespace

Browser version

No response

Error logs & Screenshots (if applicable)

Image

Metadata

Labels

🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)
