This repository is for practicing Scrapy, a free and open-source web-crawling framework. Thanks to the people who gave me this valuable opportunity.
- Python 3.8+
- Scrapy 2.11
- Create a Python virtual environment, which helps isolate the practice environment from the main environment and reduces the possibility of package conflicts.

  ```shell
  python -m venv my_scrapy_env
  ```

- Activate the virtual environment. The path below is for Windows; on macOS/Linux, run `source my_scrapy_env/bin/activate` instead.

  ```shell
  my_scrapy_env/Scripts/activate
  ```

- Install dependencies.

  ```shell
  pip install -r requirements.txt
  ```

  This will install the packages listed in requirements.txt:

  - scrapy
  - shub
  - scrapy-crawlera
  - google-cloud-storage
  - scrapy-sessions
Please note that this project is initialized with Scrapy 2.11. Running `scrapy startproject exercises` with Scrapy 2.4 conflicts with other packages.
Please note the log level is set to INFO.
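`LOG_LEVEL` is a standard Scrapy setting, configured in the project's settings module. A minimal sketch of the relevant line (the exact location in this project's settings file is an assumption):

```python
# Sketch of the relevant line in the project's Scrapy settings module.
# "INFO" suppresses DEBUG output such as per-request crawl logs while
# keeping spider open/close messages and item scrape counts.
LOG_LEVEL = "INFO"
```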
- Tackle World

  Inside the exercises folder, run:

  ```shell
  scrapy crawl tackleworldadelaide -O tackleworldadelaide.json
  ```

  This generates a JSON file containing product data from Tackle World.
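The `-O` flag overwrites the output file with a single JSON array of scraped items. A minimal sketch for inspecting such an export afterwards (the helper name `count_items` is hypothetical, not part of this project):

```python
import json

def count_items(path):
    """Load a Scrapy JSON export (written with -O) and return the item count."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # a JSON feed export is one array of item dicts
    return len(items)
```

For example, `count_items("tackleworldadelaide.json")` after a crawl reports how many products were scraped.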
- Surfboard Empire

  Inside the exercises folder, run:

  ```shell
  scrapy crawl surfboardempire -O surfboardempire.json
  ```

  This generates a JSON file containing product data from Surfboard Empire.
- Regular Expressions

  Inside the root folder, run:

  ```shell
  python regex.py
  ```

  This extracts the total number of products from an HTML element.
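A sketch of this kind of extraction (the HTML fragment and pattern below are assumptions for illustration, not the actual contents of regex.py):

```python
import re

# Hypothetical HTML fragment of the kind a product-count regex might parse.
html = '<span class="results-count">Showing 1-24 of 137 products</span>'

# Capture the number immediately before the word "products".
match = re.search(r"of\s+(\d+)\s+products", html)
total = int(match.group(1)) if match else None
print(total)  # → 137
```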