Conversation
Updating scrapers before introducing keywords funcionality
| surveys = self.client.get_survey_lists() | ||
| if 'data' not in surveys: | ||
| raise ValueError( | ||
| f"Surveys not found for search string {search_string}, data returned: {surveys}" |
There was a problem hiding this comment.
I'd change this to f"Surveys not found from client. data returned: {surveys}"
as this part of the code doesn't do anything with the search_string yet
|
|
||
| def get_survey_questions(self, survey_id: str) -> Dict[str, str]: | ||
| """Get the questions associated with the row number in a survey.""" | ||
| pass |
There was a problem hiding this comment.
Why is this passing, do you have the questions somewhere?
| """Get id of respondent """ | ||
| """Note: multiple projects per respondent""" |
There was a problem hiding this comment.
Could you make this into one note?
eg:
"""
Get id of respondent.
Note: multiple projects per respondent
"""
This'll make the Note a part of the docstring and visible for everyone who's mousing over the function
| def get_questions_dict(self, survey_details): | ||
| questions = {} |
There was a problem hiding this comment.
This could use a docstring:
"""Get questions and their answers.
Returns a Dict: {question_id : answer}
"""
| def get_respondent_data(self, responses, survey_id, respondent_id, club_data_dict, entity) -> str: | ||
| """get data entities per respondent""" | ||
| for page in responses[survey_id][respondent_id]['pages']: | ||
| if page['id'] == '146876559': |
There was a problem hiding this comment.
What's this specific number? could use a documentation at least, or turn it into a variable if it could change later down the line
| if len(question["headings"]) > 0: | ||
| project_data_ids[question['id']] = question['headings'][0].get('heading', "") |
There was a problem hiding this comment.
Is it always the first heading of question["headings"] that contains important data, and never the second?
| return club_data | ||
|
|
||
| def get_project_data_ids(self, survey_details: dict): | ||
| """Get ids of the answers to questions on projects.""" |
There was a problem hiding this comment.
I'm not too sure what this returns - The dict has the format
{
"question_id" : "information_from_heading"
}
the docstring says Get ids of the answers to questions on projects.
Is the question_id then also the id of the answer? What's the heading?
| for x in unique_values: | ||
| ids_per_value = [key for key, value in project_data_ids.items() if value == x] | ||
| ids_per_value.sort() | ||
| if len(ids_per_value) > 1: | ||
| split_project_dict[x] = ids_per_value | ||
| return split_project_dict |
There was a problem hiding this comment.
What this does is it flips the keys and values in project_data_ids and if there are duplicate values, it makes a new dict where the initial value is the new key, and the new value is a list of the keys that belong to that value.
However, if there's 1 or less values, this just returns an empty dict
| project_list.append({ | ||
| 'id' : 'respondent_id'+ str(i), | ||
| 'title' : self.get_project_data(responses, survey_id, respondent_id, split_project_data_dict, project_number = i, entity = 'Project Title'), | ||
| 'description' : self.get_project_data(responses, survey_id, respondent_id, split_project_data_dict, project_number = i, entity = 'Describe your project (at least 300 words)<br><br><em>- Context (What is the dilemma that the project is trying to tackle? Why is it important for this neighbourhood/group of people/the country?</em><br><em>- Activities (What did you do?)</em><br><em>- Results (What did you achieve? What did you create, produce, accomplish? Try to include numbers, if possible).</em><br><em>- Impact (What changed in the community? What did you learn yourself or as a team? Did you meet your own expectations)?</em>'), |
There was a problem hiding this comment.
I'm worried that this is a very long string for an entity
| except Exception as e: | ||
| self.logger.exception('Failed to get projects from current page') | ||
|
|
||
| self.write_to_file(projects, str(search_string + respondent_id)) |
There was a problem hiding this comment.
This'll write a separate file per respondent ; so sometimes 1 and sometimes 4 projects. I think we were using a one file per scrape methodology, so I'd say do something like create a projects list at the start of the scrape() method: projects = [] and then in the try: change it to
resp_projects = self.process_response(responses, survey_details, survey_id, respondent_id)
projects.extend(resp_projects)
| responses = self.client.get_all_pages_response(survey_id) | ||
| for response in responses: | ||
| if not response.get("data", []): | ||
| raise ValueError( |
There was a problem hiding this comment.
This will throw an error for the first item in the list that does not have data param. Is this the intended behaviour?
If not, you could do something like this:
for response in responses:
data = response.get('data', [])
for response_data in data:
respondent_id = response_data.get("id", "")
responses_dict[survey_id][respondent_id] = response_data
Some issues I didnt know how to fix: