Skip to content

Training Dataset Check#13

Closed
MichaelTj02 wants to merge 16 commits intomainfrom
dataset_check
Closed

Training Dataset Check#13
MichaelTj02 wants to merge 16 commits intomainfrom
dataset_check

Conversation

@MichaelTj02
Copy link
Copy Markdown
Collaborator

  • Perform check on dataset prior to training (resolution, consistency, format, colour channels)
  • Add lazy loading to thumbnails section in preprocessing model improving performance on larger file imports

Copy link
Copy Markdown
Collaborator

@ucodia ucodia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found multiple bugs in this PR.

I would suggest you split things related to dataset checks and thumbnail lazy loading separately for easier review and fixing.

_, new_data_path = imgui_utils.input_text("##preprocessing_data_path", self.preprocessing_data_path, 1024, 0,
_, new_data_path = imgui_utils.input_text("##preprocessing_data_path", str(self.preprocessing_data_path), 1024, 0,
width=imgui.get_window_width() - self.menu.app.button_w - imgui.calc_text_size("Browse")[0])
if new_data_path != self.preprocessing_data_path:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: We are using != to compare a str with a Path, therefore this check will always be True. I advice you wrap Path objects in str in comparisons from input_text like you did on line 279.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to review all types to make sure we are not duplicating this issue elsewhere and are consistent in the types we use

try:
if Path(self.icon_path).exists():
help_img = cv2.imread(Path(self.icon_path).as_posix(), cv2.IMREAD_UNCHANGED)
help_img = cv2.imread(Path(self.icon_path), cv2.IMREAD_UNCHANGED)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential bug: Does imread really take a Path object? str seemed safer.

self.save_path = directory_path
else:
print("No save path selected")
self.save_path = self.save_path
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was that a mistake?

target_dataset_path = Path(self.data_path)
if target_dataset_path.is_dir():
image_files = [f for f in target_dataset_path.iterdir()
if f.is_file()]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: We used to have a .png filter meaning we would not try to open non image files before, if a .txt exist in the dataset folder, I think PIL is gonna crash. Any reason for removing that filter?

@@ -1,9 +1,11 @@
from pathlib import Path
import zipfile
import io
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Added but never used

for file_path in file_paths:
self.get_thumbnail(file_path)

self.file_index_map = {fp: idx for idx, fp in enumerate(file_paths)}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We create file_index_map in this function, then clear it in clear_thumbnails but never seem to use it anywhere. Either we use it or we remove it.

Comment thread training/dataset.py
raise ValueError(
"Invalid dataset:\n"
f"- Image '{filename}' is not square ({width}x{height}).\n"
"- StyleGAN3 training only accepts square images."
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just mention StyleGAN, version 3 is irrelevant to the error

Comment thread training/dataset.py
if height != expected_height or width != expected_width:
raise ValueError(
"Invalid dataset:\n"
f"- Image '{filename}' has resolution {width}x{height}, expected {expected_width}x{expected_height}.\n"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest the following error format instead for clarity:

Inconsistent dataset: 'foo.png' is 512x512 but the detected dataset resolution is 256x256. Ensure all images have been preprocessed to the same size.

@ucodia
Copy link
Copy Markdown
Collaborator

ucodia commented Apr 16, 2026

Closing in favor of #15 #16 and #17

@ucodia ucodia closed this Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants