Too messy commit structure to merge, check other PR#14

SimonBerend wants to merge 6 commits into CorrelAidxNL:master from SimonBerend:master
Conversation
Updating scrapers before introducing keywords functionality
andrewsutjahjo left a comment
The functionality is there and good, it's just not implemented in the right place. It needs a small rework so that future developers can understand the code quickly.
For bonus points: use a `fixup!` commit to remove any errant code like `just_scrape.py` so that it doesn't clutter the repo
```python
url_term_list = []

for term in keywords:
    search_term = 'term=' + term.replace(' ', '+')
    random_seed = 'seed=' + str(random.randint(1, 65536))
    # added a keyword parameter to the url
    get_params = self.default_url_params + [f"category_id={category}", f"page={page}", random_seed, search_term]
    url = self.base_url + '&'.join(get_params)
    url_term_list.append(url)

return url_term_list
```
This is a good idea, but it's implemented in an illogical place: the function you've put it in is called get_url, which is singular. Anyone reading this who hasn't been part of this PR will (rightfully) assume that the function returns exactly one URL, and the docstring will confirm that belief — yet when it gets used, they will get a list of URLs.
I'm not sure whether we want to search every combination of category_id and search_term, but I haven't dived deep enough into the Kickstarter data to decide on that front.
I'd place this in another function like get_urls_by_keyword and extend the scrape script to do both a loop of the original crawl-by-category_id functionality and a loop of searching by keyword with (or without) category_id
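The split the reviewer asks for could look roughly like this. It's a minimal sketch, not the repo's actual class: `get_urls_by_keyword` is the reviewer's suggested name, and `base_url` / `default_url_params` values are placeholders assumed from the diff above.

```python
import random


class KickstarterScraper:
    """Minimal sketch; the real class presumably has more state and methods."""

    base_url = "https://www.kickstarter.com/discover/advanced?"  # assumed value
    default_url_params = ["sort=newest"]                         # assumed value

    def get_url(self, category, page):
        # Unchanged contract: build exactly ONE url per (category, page),
        # matching the singular name and docstring readers expect.
        random_seed = f"seed={random.randint(1, 65536)}"
        get_params = self.default_url_params + [f"category_id={category}", f"page={page}", random_seed]
        return self.base_url + '&'.join(get_params)

    def get_urls_by_keyword(self, keywords, category, page):
        # New, plural function: one url per search term, so the keyword
        # loop lives where its name says a list comes back.
        url_term_list = []
        for term in keywords:
            random_seed = f"seed={random.randint(1, 65536)}"
            search_term = 'term=' + term.replace(' ', '+')
            get_params = self.default_url_params + [f"category_id={category}", f"page={page}", random_seed, search_term]
            url_term_list.append(self.base_url + '&'.join(get_params))
        return url_term_list
```

The scrape script can then loop over categories with `get_url` and over keywords with `get_urls_by_keyword`, keeping each function's return type obvious from its name.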
```python
self.write_to_file(projects, str(category))
```
formatting: keep whitespace out of empty lines
```python
url_term_list = self.get_url(category, page)

print(url_term_list)
```
use the logger if you want to log what it's doing, otherwise remove the print statement
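Swapping the print for the standard-library logger could look like this. A minimal sketch, assuming the project uses Python's `logging` module (the logger name and helper function are hypothetical):

```python
import logging

logger = logging.getLogger("scraper")  # hypothetical logger name
logging.basicConfig(level=logging.DEBUG)


def log_urls(url_term_list):
    # Replaces print(url_term_list): debug-level, so it can be silenced
    # in production by raising the log level instead of editing code.
    logger.debug("generated %d search urls: %s", len(url_term_list), url_term_list)
    return url_term_list
```

The lazy `%`-style formatting also means the message is only built when debug logging is actually enabled.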
PS: please remember to request a review from someone (top of the right sidebar on GitHub) so that they know they should read your code and comment on it; otherwise no one knows it's ready to be reviewed
No description provided.