kinoz01/wget

This project aims to recreate some of wget's functionality using the Go programming language.

The program is divided into two parts: one that downloads a resource while showing a visual progress bar, and another that mirrors an entire website.

For direct downloads you can use these flags:

  1. -B ---> Downloads a file silently in the background; the output is redirected to a log file ("wget-log").
  • Usage example: go run . -B [URLS]...
  • The program prints: Output will be written to "wget-log"
  2. -O ---> Downloads a file and saves it under a different name. If a name and a URL aren't both provided, the program prints an error.
  • Usage example: go run . -O=test_20MB.zip https://assets.01-edu.org/wgetDataSamples/20MB.zip ---> The flag must be written in exactly this format!
  3. -P ---> Sets the path where the file is going to be saved.
  • Usage example: go run . -P=~/Downloads/ [URL]... ---> The path can be absolute or relative.
  4. --rate-limit ---> Limits the download speed. The default unit is bytes, but you can use other units through suffixes such as k and M.
  5. -i ---> Takes a file name; the file contains all the links to be downloaded. Downloads run asynchronously, so multiple files are downloaded at the same time.
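
As an illustration of how a --rate-limit value with a k or M suffix could be turned into a byte count, here is a minimal sketch. The function name parseRateLimit is hypothetical, and the multipliers (k = 1024, M = 1024 * 1024) are an assumption; the project may parse the flag differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRateLimit converts a value such as "500", "200k" or "2M" into bytes.
// Assumption: k means 1024 bytes and M means 1024*1024 bytes.
func parseRateLimit(value string) (int64, error) {
	s := strings.TrimSpace(value)
	if s == "" {
		return 0, fmt.Errorf("empty rate limit")
	}
	multiplier := int64(1)
	switch s[len(s)-1] {
	case 'k', 'K':
		multiplier = 1024
		s = s[:len(s)-1]
	case 'm', 'M':
		multiplier = 1024 * 1024
		s = s[:len(s)-1]
	}
	// What remains must be a positive whole number.
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("invalid rate limit %q", value)
	}
	return n * multiplier, nil
}

func main() {
	for _, v := range []string{"500", "200k", "2M", "fast"} {
		n, err := parseRateLimit(v)
		fmt.Println(v, n, err)
	}
}
```

Keeping the parsing in one small function makes it easy to reject malformed values before any download starts.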

Mirror

To mirror a website, use the --mirror flag, for example: go run . --mirror facebook.com

The following flags work with --mirror to fine-tune the mirroring:

  1. --reject or -R ---> Takes a list of file suffixes that the program will avoid downloading during the retrieval.
  • Usage example: go run . --mirror -R=jpg,gif https://example.com
  2. --exclude or -X ---> Takes a list of paths that the program will not follow or retrieve. So if the URL is https://example.com and the directories are /js, /css and /assets, you can skip any of them, for example with -X=/js,/assets. The local copy will then contain only /css.
  • Usage example: go run . --mirror -X=/assets,/css https://example.com
  • Paths may start with / or not.
  3. --convert-links ---> Converts the links in the downloaded files so that they can be viewed offline, changing them to point to the locally downloaded resources instead of the original URLs.
  • Usage example: go run . --mirror --convert-links https://example.com
  4. -B ---> This flag also works with mirror.

Note: to include JS while mirroring a website, set the environment variable JS to a non-empty string, for example: export JS="something"
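
The filtering done by --reject and --exclude could look roughly like the sketch below. The names shouldSkip and splitList are hypothetical, and the matching rules (case-insensitive suffix comparison, whole path segments for excluded directories) are assumptions rather than the project's confirmed behavior.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// splitList turns a comma-separated flag value into its non-empty items.
func splitList(list string) []string {
	var items []string
	for _, item := range strings.Split(list, ",") {
		if item = strings.TrimSpace(item); item != "" {
			items = append(items, item)
		}
	}
	return items
}

// shouldSkip reports whether rawURL is filtered out by a reject list of
// suffixes (e.g. "jpg,gif") or an exclude list of paths (e.g. "/js,assets").
func shouldSkip(rawURL, rejectList, excludeList string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return true // unparsable URLs are skipped in this sketch
	}
	// Reject: compare the file extension against each suffix.
	ext := strings.TrimPrefix(path.Ext(u.Path), ".")
	for _, suffix := range splitList(rejectList) {
		if ext != "" && strings.EqualFold(ext, strings.TrimPrefix(suffix, ".")) {
			return true
		}
	}
	// Exclude: match whole path segments; a leading "/" is optional.
	for _, dir := range splitList(excludeList) {
		dir = "/" + strings.Trim(dir, "/")
		if u.Path == dir || strings.HasPrefix(u.Path, dir+"/") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldSkip("https://example.com/img/a.jpg", "jpg,gif", ""))
	fmt.Println(shouldSkip("https://example.com/assets/x.png", "", "/js,assets"))
	fmt.Println(shouldSkip("https://example.com/css/site.css", "jpg,gif", "/js,assets"))
}
```

Matching on whole path segments means that excluding /assets does not also exclude a sibling directory such as /assets-old.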

Algorithm

The main URL is passed to the ProcessURL function, which is the core of the recursive logic that downloads and processes each resource. ProcessURL first makes sure the URL hasn't already been processed, then checks whether it should be skipped under the exclude and reject rules. It then fetches the resource, determines its type (e.g., HTML, CSS, or other), and processes it accordingly.
For HTML content, the function parses it into a DOM tree using html.Parse and calls ProcessNode to recursively traverse and modify the HTML structure. ProcessNode handles each HTML element and its attributes, such as src or href, resolving and downloading linked resources via ProcessURL. If the element contains a linked resource (e.g., an image or script), ProcessURL is called again, potentially creating nested calls for embedded resources. This recursion ensures that all linked resources, no matter how deeply nested, are downloaded and saved locally. Similarly, CSS files are scanned for url() references using regex, and each referenced resource is processed by calling ProcessURL.
The algorithm ensures that all resources are downloaded and saved in a directory structure that mirrors the original website, with their links updated to use relative paths.
Together, ProcessURL and the recursive traversal in ProcessNode mirror the website by handling both the hierarchy of pages and the dependency graph of resources.

Explanations

You can find details here:
