Ethica
This file was copied from the SenseDoc page, to serve as a template for the Ethica version. I'll remove this note once it has been adapted to Ethica.
Ethica is a research-grade smartphone app, made by Ethica Data, and used for mobility (GPS) and physical activity (accelerometer) tracking. These data are collected in a 1-in-5 duty cycle (1 minute active, 4 minutes idle) and allow us to measure location-based physical activity and infer transportation mode.
The data fields collected by the Ethica app can be found on INTERACT's Data Dictionary
Data is uploaded by the individual phones multiple times per day and cached on Ethica Data's servers until the end of the study.
At the close of a study, someone from the INTERACT team (either the data manager or the project coordinator) needs to contact Ethica and ask them to send the data.
TODO: add details of how the data is delivered, where it is delivered, and what exactly is expected in the file
The cached data will then be exported by Ethica into zipped CSV files and downloaded to Compute Canada servers by our data manager. The zip files are then validated (command: unzip -t ZIPFILE) to ensure that the contents of the downloaded files still conform to the checksum computed when the files were initially created.
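The zip validation step can also be scripted. The sketch below uses Python's standard-library zipfile module to perform the equivalent of the unzip -t check (the actual workflow uses the unzip command shown above; this helper is illustrative):

```python
import zipfile

def validate_zip(path):
    """Return True if every member of the archive passes its CRC check.

    Equivalent in spirit to `unzip -t path`: testzip() re-reads each
    member and compares it against the CRC recorded when the archive
    was created, returning the name of the first bad member (or None).
    """
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False
```

A batch of downloaded files can then be checked in a loop, flagging any archive that fails for re-download before Ethica is notified.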
Important Note: Once we notify Ethica Data that the files have been received and verified, they purge all the associated data tables from their systems. For this reason, we take two additional steps before giving such notification:
- Wait at least 48 hours after receiving and validating the files, to ensure that the new files have been captured by Compute Canada's nightly backup system twice.
- Conduct a random secondary verification of the downloaded zip files by unzipping a few of them and viewing the contained CSV files to ensure that they are indeed CSV files and appear to have credible looking data. (In the case of Wave 2, the zip files correctly contained CSV files, but they were erroneously given .zip extensions. This suggests that their export process is ad-hoc and we should be certain we have real data before it gets purged from their systems.)
With SenseDoc data, the files are uploaded by project coordinators and often do not conform to the normalized naming conventions, so they are staged in the /def-dfuller/interact/incoming_data folder. Later they are copied to the /def-dfuller/interact/permanent_archive folder, normalized, and filtered by the data manager, as the opening step of the ingest cycle. But since Ethica data is migrated directly by the data manager, no such staging area is required. Ethica files are downloaded directly to the permanent_archive area.
After the zip files have been validated, they are each given a "provenance sidecar." This sidecar file then moves around the system along with the data file itself and can be used at any time to verify that the file has not been altered since it was first collected. Additionally, if a change is made to the data file, the sidecar can be updated with a new checksum, as well as an explanation of what changed. These sidecars are handled by our ProvLog system, and provide provenance tracking for the lifecycle of all the data files that are part of our pipeline.
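ProvLog's actual sidecar format is internal to INTERACT; the sketch below illustrates the general idea with a hypothetical JSON sidecar that records a SHA-256 checksum plus a running change history, matching the behavior described above:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256sum(path, chunk_size=65536):
    """Stream the file in chunks so large telemetry archives need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def _sidecar_path(data_path):
    # Hypothetical convention: sidecar lives next to the data file with a .prov suffix.
    return Path(str(data_path) + ".prov")

def write_sidecar(data_path, note="initial checksum"):
    """Create the sidecar, or append a new checksum entry explaining a change."""
    sidecar = _sidecar_path(data_path)
    history = []
    if sidecar.exists():
        history = json.loads(sidecar.read_text()).get("history", [])
    history.append({
        "sha256": sha256sum(data_path),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "note": note,
    })
    sidecar.write_text(json.dumps({"file": Path(data_path).name,
                                   "history": history}, indent=2))
    return sidecar

def verify_sidecar(data_path):
    """True if the file still matches the most recent checksum on record."""
    latest = json.loads(_sidecar_path(data_path).read_text())["history"][-1]
    return sha256sum(data_path) == latest["sha256"]
```

Verification fails as soon as the file diverges from its last recorded checksum; recording a new entry with an explanatory note brings the sidecar back in sync, which mirrors the "restore from backup or update the ProvLog record" choice described below.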
Note: Due to the way they are managed and distributed, SenseDoc devices capture data from our project coordinators as well as from the participant they are assigned to, so a wear date window was tracked to allow us to filter out the telemetry not contributed by the participants themselves. But since Ethica data comes directly from the user's phone, there is no need to filter out extraneous contributors, so there are no wear dates associated with Ethica participants and no pre-filtering done at ingest time.
Before data is actually ingested, a number of verification and normalization steps are conducted first to ensure that the ingest can proceed successfully.
- The raw data files are downloaded directly into the permanent archive area
- The folder path should be: /projects/def-dfuller/interact/permanent_archive/{CITYNAME}/Wave{WAVENUM}/Ethica/raw
- Some cities are organized with multiple studies (e.g. Montreal has one English study and one French study, conducted simultaneously), so the files within the raw folder are organized by study number and data table name
- Files are named {STUDYID}_{TABLENAME}.zip
- The normalized permanent_archive files are then added to our ProvLog system, which scans every night to ensure that all files always match the checksums they were uploaded with and have not been deleted or altered on disk during the course of working with them
- An entire directory can be added to the monitoring system with the command: provlog -T {ROOTDIR}
- If any changes are ever detected in any logged files on disk, a message is sent to the data manager, who investigates and either restores the data files from backup, or updates the ProvLog record to explain the change, thus ensuring a complete manifest of data changes is attached to each contributing file
- The participant metadata (which usually encompasses both SenseDoc and Ethica users) is also placed into the permanent_archive folder
- The file path should be: /projects/def-dfuller/interact/permanent_archive/{CITYNAME}/Wave{WAVENUM}/linkage.csv
- It's called this because it provides the linkage between the participants and the devices they were issued
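The layout above can be captured in a couple of small path helpers. This is a sketch of the naming scheme as described in this document, not the real ingest code:

```python
from pathlib import Path

ARCHIVE_ROOT = Path("/projects/def-dfuller/interact/permanent_archive")

def raw_zip_path(city_name, wave_num, study_id, table_name):
    """Path of one raw Ethica zip: {STUDYID}_{TABLENAME}.zip under .../Ethica/raw."""
    return (ARCHIVE_ROOT / city_name / f"Wave{wave_num}" / "Ethica" / "raw"
            / f"{study_id}_{table_name}.zip")

def linkage_path(city_name, wave_num):
    """Path of the participant metadata (linkage) file for a city/wave."""
    return ARCHIVE_ROOT / city_name / f"Wave{wave_num}" / "linkage.csv"
```

Centralizing the path logic like this is what lets the same notebook code run unchanged for any city and wave.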
Once the data files are all in the correct place, with the expected names, the data manager can launch the guided process to complete the ingest. The first step is to set up the Jupyter Notebook that will govern the process.
- Getting Jupyter Notebooks set up properly is outside the scope of this document, but we have explored two different configurations to date:
- Running Jupyter Lab directly on the ComputeCanada cluster
- Faster run-times
- Harder to set up
- Prone to frequent delays that can impede efficient workflow
- Running Jupyter Lab on a local machine, with remote SSH mounts to the CC file system and PostgreSQL instance
- Easier to set up
- More responsive workflow
- Slower execution of data-heavy operations
- Set up your environment variables
- $SQL_LOCAL_SERVER and $SQL_LOCAL_PORT will depend on which configuration you chose above for Jupyter Lab
- $SQL_USER should be set to your ComputeCanada userid
- $INGEST_CITY should be the integer code for the city you'll be ingesting (Victoria=1, Vancouver=2, Saskatoon=3, Montreal=4)
- $INGEST_WAVE should be the integer wave number that you'll be ingesting
- Once everything is configured, launch Jupyter Lab and open a copy of Ingest-SenseDoc-Wave2-Protocol.ipynb
The guided process is iterative: you work your way down the series of code blocks in the document, executing each one until it runs cleanly, and then moving on to the next block. Note that every block of executable code is followed by an "after running the block" section that explains what you should see in the output of the preceding block, and what to do if problems are reported.
The whole point of normalizing the filenames, paths, and data bundles was to allow the same code to be used each time, regardless of which wave or city is being ingested. This block is where those values are initialized from the environment variables, and a few other frequently used variables are set up.
As a rule, you will not have to change anything here, but there may be special cases. In particular, there may be cases where the file structures do not conform precisely to the standard laid out above, so you might have to tweak file paths here.
Important: Never, under any circumstances, code passwords or userids directly into this notebook. Remember that this document is hosted publicly on GitHub. Sharing security credentials in this way would be a breach of our privacy protocol.
Run the block and then read the note that follows. Confirm that everything ran as expected before moving on.
- Edit the parameter assignments in the first code block of the notebook to set the wave_id and city_id being ingested
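The parameter block at the top of the notebook might look something like the following sketch, which pulls its values from the environment variables described above (the function name and defaults here are illustrative; the actual notebook may differ):

```python
import os

# Integer city codes, as listed above.
CITY_CODES = {"Victoria": 1, "Vancouver": 2, "Saskatoon": 3, "Montreal": 4}

def load_ingest_params(env=os.environ):
    """Read ingest configuration from the environment variables described above."""
    params = {
        "sql_server": env.get("SQL_LOCAL_SERVER", "localhost"),
        "sql_port": int(env.get("SQL_LOCAL_PORT", "5432")),
        "sql_user": env["SQL_USER"],            # no default: must be set explicitly
        "city_id": int(env["INGEST_CITY"]),     # e.g. 4 for Montreal
        "wave_id": int(env["INGEST_WAVE"]),
    }
    if params["city_id"] not in CITY_CODES.values():
        raise ValueError(f"Unknown city code: {params['city_id']}")
    return params
```

Reading the values from the environment, rather than editing them into the notebook, keeps userids and site-specific details out of the publicly hosted document.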
The next few blocks of the notebook conduct some additional analyses to find gaps and conflicts in the data so they can be fixed prior to ingest. These checks confirm the following:
- All expected files are confirmed to be present and named correctly
- The incoming linkage data is confirmed to be well-formed
- Each expected participant has corresponding telemetry data in the permanent_archive folder
- All telemetry data found in the permanent_archive folder corresponds to a known participant
Any problems found by these tests are reported to the data manager, who then consults with the regional coordinator to fix the discrepancies. The most common problem is a user in the linkage table who produced no data in the telemetry folders. These are usually cases where a coordinator created a dummy account to use for testing. But since they did not actually wear a device, there is no telemetry to go with the account. In these cases, the user record must be marked by putting the word 'ignore' in the data_disposition field of the linkage table, which instructs the ingest system to skip that user record entirely.
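Conceptually, the last two checks are set comparisons between the participant ids listed in the linkage table and the ids that actually appear in the telemetry files. A minimal sketch of that comparison (the field names here are illustrative, not the real schema):

```python
def find_linkage_anomalies(linkage_rows, telemetry_user_ids):
    """Compare linkage records against the user ids seen in telemetry files.

    linkage_rows: iterable of dicts with (at least) 'user_id' and optional
    'data_disposition' keys; rows already marked 'ignore' are excluded first.
    Returns (missing_telemetry, unknown_contributors) as sorted lists.
    """
    expected = {
        row["user_id"]
        for row in linkage_rows
        if row.get("data_disposition") != "ignore"
    }
    observed = set(telemetry_user_ids)
    missing_telemetry = sorted(expected - observed)      # in linkage, but no data
    unknown_contributors = sorted(observed - expected)   # data, but not in linkage
    return missing_telemetry, unknown_contributors
```

Both result lists should be empty before proceeding; anything else goes back to the regional coordinator for resolution.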
Once the validation block of the Jupyter Notebook passes cleanly, reporting no unexpected conditions in the data, the actual ingest blocks can be run.
Before telemetry can be loaded, we need to know about the users who produced it. This metadata is primarily used to link the Ethica userid with the Interact_id used to identify participants within our research, which is why it is referred to as the 'linkage' table.
The linkage data is created by project coordinators, who enter much of the data by hand. So the first concern will be to ensure that the data is well formed and complete. The Jupyter Notebook explains these steps in closer detail, but the core idea is that we want to identify three basic edge cases:
- Dummy user records: These are records created by project coordinators or other team members and used to test various parts of the data collection system. We want to ignore all data associated with these records, since they are not actual contributors to the study.
- Missing telemetry: Some users drop out of the study before contributing any actual data. In other cases, a data file might be logged to an erroneous userid, or it might be missing completely. In all of these cases, we want to flag the situation and attempt to resolve it before proceeding to the ingest stage.
- Unknown contributor: In some cases, we receive telemetry data associated with a userid that does not appear in our linkage table. This might be an administrative oversight, a typo in the linkage table, or even a glitch in Ethica's data management. In any case, we want to flag these situations and attempt to resolve them before proceeding to the ingest stage.
The data manager runs the linkage validation blocks of the Jupyter Notebook and then works with the project coordinators to correct the anomalies raised. This usually involves making corrections to the linkage file and/or renaming telemetry files. To encode the disposition of problem user records, a column called 'data_disposition' is added to the linkage table. Users who are in fact team members, whose data should be ignored, are given a disposition value of 'ignore', while legitimate participants who produced incomplete or corrupted data are given a disposition of 'cull'.
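The data_disposition values then drive which records the ingest keeps. A sketch of that filtering using the standard csv module (column names follow this document; the real code may differ):

```python
import csv
import io

def partition_by_disposition(linkage_csv_text):
    """Split linkage rows by their data_disposition value.

    Returns (ingest, cull, ignore): rows to ingest normally, rows for
    legitimate participants whose incomplete/corrupted data is excluded,
    and dummy/test rows to be dropped entirely.
    """
    ingest, cull, ignore = [], [], []
    for row in csv.DictReader(io.StringIO(linkage_csv_text)):
        disposition = (row.get("data_disposition") or "").strip().lower()
        if disposition == "ignore":
            ignore.append(row)
        elif disposition == "cull":
            cull.append(row)
        else:
            ingest.append(row)
    return ingest, cull, ignore
```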
Once those corrections have been made, the verification step should be run again, and any new or still unresolved issues should be addressed.
Do not proceed until the verification block runs cleanly and reports no anomalies.
Once the linkage data has been validated and cleaned, the content of the linkage file can then be ingested into the linkage table of the DB. The linkage file includes information about all participants, not just those who produced usable data, so our first step is to examine the data at hand.
Ingest consists of two phases: ingesting the participant metadata (the linkage file) and then ingesting the telemetry data. First, the linkage data will be loaded into the DB table (portal_dev.sensedoc_assignments). This is a straightforward process of...
Then the notebook will proceed to loading the telemetry files.
Loading telemetry files is a bit more complicated...
- The last few sections perform the actual ingest
- In the first pass, the raw telemetry files are loaded into a temporary DB table
- In the next block, that temporary table is cross-linked with the proper IID, based on the mapping from the device id found in the linkage table
- Finally, the cross-linked telemetry data is added to the final telemetry tables (sd_gps, sd_accel, '''and others?''')
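The cross-linking step above is essentially a join between the staged telemetry and the linkage table. The sketch below demonstrates the pattern with an in-memory SQLite database; the real ingest targets PostgreSQL, and the table and column names here are illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Staged raw telemetry, keyed by the Ethica userid (hypothetical schema).
cur.execute("CREATE TABLE temp_telemetry (ethica_user_id TEXT, ts TEXT, lat REAL, lon REAL)")
# Linkage table mapping Ethica userids to INTERACT participant ids.
cur.execute("CREATE TABLE linkage (ethica_user_id TEXT, interact_id INTEGER)")

cur.executemany("INSERT INTO temp_telemetry VALUES (?,?,?,?)",
                [("u1", "2019-06-01T10:00:00", 48.43, -123.36),
                 ("u2", "2019-06-01T10:05:00", 45.50, -73.57)])
cur.executemany("INSERT INTO linkage VALUES (?,?)",
                [("u1", 101001), ("u2", 101002)])

# Cross-link: stamp every telemetry row with its participant's interact_id.
cur.execute("""
    CREATE TABLE gps AS
    SELECT l.interact_id, t.ts, t.lat, t.lon
    FROM temp_telemetry t
    JOIN linkage l ON l.ethica_user_id = t.ethica_user_id
""")

# Housekeeping: the temporary table is dropped once the final table exists.
cur.execute("DROP TABLE temp_telemetry")
rows = cur.execute("SELECT interact_id, lat FROM gps ORDER BY interact_id").fetchall()
```

Note that an inner join silently drops telemetry rows with no linkage match, which is why the earlier validation step must confirm that every telemetry userid is known before this stage runs.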
- Once the ingest has completed successfully, a few housekeeping tasks are required:
- Delete the temporary tables '''(called?)'''
- Export the Jupyter Notebook as a PDF, which provides a complete record of the ingest process as it happened.
- If any substantive code was changed in the notebook (aside from setting parameters), clear all the output blocks, save the notebook, and commit the changes to the git repo, describing what improvements or corrections were made to the code
- Congratulations, you have now completed an ingest cycle.
TODO: Describe the technical parameters of the telemetry and survey data collected by Ethica ($jeff, $zoe)
The Ethica app captured:
- GPS
- WiFi
- Accelerometer
- Activity Recognition
- Pedometer
- Battery
- Survey responses (EMA)