refactor data processing #419

Open
seanses wants to merge 5 commits into main from di/refactor_data_processing

seanses (Contributor) commented on Aug 29, 2024

This PR refactors data processing to:

  1. Change the clean API from an async iterator to a buffer-based API (the async next can be dropped as we drop async in the underlying crates). Usage:
let pft = PointerFileTranslatorV3::new(config).await;

/* ----------- Clean file 1 (can safely spawn into another thread) ----------- */
let cleaner = pft.start_clean(4096 /*buffer size*/, Some(path1)).await?;
while let Some(data) = read_file(&mut reader1) {
    cleaner.add_bytes(data).await?;
}
let cleaned_result = cleaner.result().await;

/* ----------- Clean file 2 (can safely spawn into another thread)  ----------- */
let cleaner = pft.start_clean(4096 /*buffer size*/, Some(path2)).await?;
while let Some(data) = read_file(&mut reader2) {
    cleaner.add_bytes(data).await?;
}
let cleaned_result = cleaner.result().await;

/* ----------- Finish ----------- */
pft.finalize_cleaning().await

For an example call site, see:

let cleaner = self.start_clean(4096, Some(path)).await?;

  2. Drop the XetConfig dependency. For now, some helper functions map XetConfig to the new configurations (see

    pub async fn translator_config_from(
    ). These helpers exist only to test the correctness of the new data processing logic with the existing test setup.

  3. Make the repo salt optional for dedup.
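
To illustrate the buffer-based shape of the clean API described in the first item, here is a minimal synchronous sketch. It is not the code in this PR: `BufferedCleaner`, its fields, and the block-flushing behavior are hypothetical names and assumptions chosen for illustration, and the real cleaner is async and does chunking and dedup rather than simply collecting blocks. The sketch only shows the calling pattern: create a cleaner with a buffer size, push bytes as they are read, then consume the cleaner to get the result.

```rust
// Hypothetical sketch: none of these types or fields come from the PR.
// It mimics the start_clean / add_bytes / result calling pattern with a
// plain in-memory buffer and no async.
struct BufferedCleaner {
    buffer_size: usize,
    pending: Vec<u8>,         // bytes received but not yet flushed
    flushed: Vec<Vec<u8>>,    // full blocks handed off for processing
}

impl BufferedCleaner {
    fn new(buffer_size: usize) -> Self {
        assert!(buffer_size > 0, "buffer_size must be non-zero");
        Self {
            buffer_size,
            pending: Vec::with_capacity(buffer_size),
            flushed: Vec::new(),
        }
    }

    /// Accept bytes of any length; flush each time the buffer fills up.
    fn add_bytes(&mut self, mut data: &[u8]) {
        while !data.is_empty() {
            let room = self.buffer_size - self.pending.len();
            let take = room.min(data.len());
            self.pending.extend_from_slice(&data[..take]);
            data = &data[take..];
            if self.pending.len() == self.buffer_size {
                self.flush();
            }
        }
    }

    /// Move the pending bytes (if any) into the list of flushed blocks.
    fn flush(&mut self) {
        if !self.pending.is_empty() {
            let block = std::mem::take(&mut self.pending);
            self.flushed.push(block);
        }
    }

    /// Consume the cleaner, flushing any remainder, and return all blocks.
    fn result(mut self) -> Vec<Vec<u8>> {
        self.flush();
        self.flushed
    }
}

fn main() {
    let mut cleaner = BufferedCleaner::new(4);
    for piece in [&b"abc"[..], &b"defgh"[..], &b"ij"[..]] {
        cleaner.add_bytes(piece);
    }
    for block in cleaner.result() {
        println!("{}", String::from_utf8_lossy(&block));
    }
}
```

Because `result` takes `self` by value, a cleaner cannot be reused after it has produced its result, which matches the usage above where each file gets its own cleaner.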

All integration tests pass.
Same clean speed as before.
