
Java Web Crawler

This Java application navigates the web, indexes pages, and extracts specific content. Built on the Jsoup HTML parsing library and managed with Maven, the crawler follows links to a depth of 2, retrieving page titles, links, and text, then saving them to a file.

Features

  • Web Crawling: Initiates from a seed URL and explores linked pages up to a depth of 2.
  • Content Extraction: Parses and extracts titles, hyperlinks, and textual content from web pages.
  • Data Storage: Saves the extracted information into a structured file for further analysis or processing.
  • Robots.txt Compliance: Respects web scraping policies by adhering to the robots.txt directives of each site (a minimal check is sketched below).
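
The compliance feature can be illustrated with a minimal sketch. This is not the repository's actual implementation; a production check would also handle Allow rules, per-agent sections, and wildcard patterns:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Illustrative robots.txt check: fetches /robots.txt and applies the
    // Disallow rules listed under the wildcard user agent.
    public final class RobotsCheck {
        public static boolean isAllowed(String host, String path) {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL("https://" + host + "/robots.txt").openStream()))) {
                boolean applies = false;
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
                        // Only rules under "User-agent: *" apply in this sketch.
                        applies = line.substring(11).trim().equals("*");
                    } else if (applies && line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                        String rule = line.substring(9).trim();
                        if (!rule.isEmpty() && path.startsWith(rule)) return false;
                    }
                }
            } catch (Exception e) {
                return true; // no robots.txt reachable: treat as allowed
            }
            return true;
        }
    }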

How It Works

  1. Initialization: Begins with a seed URL added to the frontier queue.
  2. Fetching: Retrieves the HTML content of the URL.
  3. Parsing: Uses Jsoup to parse the HTML and extract links and desired content.
  4. Duplication Check: Verifies if the URL or its content has been previously crawled to avoid redundancy.
  5. Compliance Verification: Checks the site's robots.txt file to ensure adherence to crawling policies.
  6. Iteration: Adds new, uncrawled, and compliant URLs to the frontier queue, repeating the process up to the specified depth (see the sketch after this list).
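
Taken together, these steps form a small breadth-first loop. The sketch below assumes the Jsoup API and uses illustrative names (seedUrl and maxDepth match the Configuration section; everything else is hypothetical); it omits the robots.txt check and file output for brevity:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    public class CrawlSketch {
        // Pairs a URL with the depth at which it was discovered.
        private static class Page {
            final String url;
            final int depth;
            Page(String url, int depth) { this.url = url; this.depth = depth; }
        }

        public static void main(String[] args) {
            String seedUrl = "https://example.com"; // hypothetical seed
            int maxDepth = 2;

            Queue<Page> frontier = new ArrayDeque<>(); // 1. initialization
            Set<String> visited = new HashSet<>();     // 4. duplication check
            frontier.add(new Page(seedUrl, 0));

            while (!frontier.isEmpty()) {
                Page page = frontier.poll();
                if (page.depth > maxDepth || !visited.add(page.url)) continue;
                try {
                    Document doc = Jsoup.connect(page.url).get(); // 2. fetching
                    System.out.println(doc.title() + " -> " + page.url);
                    for (Element link : doc.select("a[href]")) {  // 3. parsing
                        String next = link.absUrl("href");
                        if (!next.isEmpty() && !visited.contains(next)) {
                            frontier.add(new Page(next, page.depth + 1)); // 6. iteration
                        }
                    }
                } catch (Exception e) {
                    System.err.println("Skipping " + page.url + ": " + e.getMessage());
                }
            }
        }
    }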

Prerequisites

  • Java Development Kit (JDK): Ensure JDK 8 or higher is installed.
  • Maven: For project dependency management.
  • Jsoup Library: Included as a Maven dependency.

Setup Instructions

  1. Clone the Repository:

    git clone https://github.com/KELVI23/Java-Web-Crawler.git
  2. Navigate to the Project Directory:

    cd Java-Web-Crawler
  3. Build the Project with Maven:

    mvn clean install
  4. Run the Application:

    • Execute the Main class to start the web crawling process.
    • Monitor the console output for progress and results (an example run command is shown below).
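
If the exec-maven-plugin is available, one way to launch the crawler from the command line is shown below; the class name is an assumption, so adjust it to the project's actual package and entry point:

    mvn exec:java -Dexec.mainClass="Main"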

Configuration

  • Seed URL: Modify the seedUrl variable in the Main class to change the starting point of the crawl.
  • Crawling Depth: Adjust the maxDepth variable to set the desired depth of link traversal.
  • Output File: Specify the destination file for extracted data in the outputFilePath variable (all three settings are sketched below).
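
As a sketch, the three settings might look like this inside Main; the variable names come from the list above, while the types, values, and surrounding declarations are assumptions about the actual code:

    public class Main {
        // Starting point of the crawl (Configuration: Seed URL)
        static String seedUrl = "https://example.com";
        // Maximum link-traversal depth (Configuration: Crawling Depth)
        static int maxDepth = 2;
        // Destination file for extracted data (Configuration: Output File)
        static String outputFilePath = "output.txt";
        // ...
    }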

Notes

  • Ethical Crawling: Always ensure compliance with each website's robots.txt directives and terms of service.
  • Performance Considerations: Be mindful of the load imposed on servers; implement appropriate delays between requests if necessary (a simple throttle is sketched below).
  • Data Accuracy: The quality of extracted data depends on the structure of the target web pages and may require adjustments to parsing logic.
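
One simple way to throttle requests is a fixed pause between fetches; this is a hypothetical helper, not part of the repository:

    // Fixed politeness delay between requests; tune per target site.
    static void politeDelay(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
        }
    }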

License

This project is open-source. Feel free to modify and use it according to your needs.


For issues, contributions, or further information, please refer to the GitHub repository.