Here's where to improve these results (use findings from attempt #6)
learn scikit-learn's API to see whether it adds value (probably for readability and CV runtime)
xgb cv model for HPO
attempt to learn PyCaret's API (for readability & dev speed)
ask why there's a clear trend of overvalued cheap houses and undervalued expensive houses...
but all my attempts to correct for it fail
attempt #7
think about pulling out specific data from MSSubClass & HouseStyle & dropping the columns:
isPUD
NumberStories
UnfinishedSQFT
isSplitFoyer
isDuplex
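A rough sketch of how a few of these flags could be pulled out of MSSubClass before dropping it. It assumes the codes follow the Ames housing data dictionary (120/150/160/180 are PUD classes, 85 is split foyer, 90 is duplex); the function name `subclass_flags` is made up for illustration, and NumberStories and UnfinishedSQFT are left out because they also need HouseStyle.

```python
# Assumption: MSSubClass codes follow the Ames housing data dictionary.
PUD_CODES = {120, 150, 160, 180}
SPLIT_FOYER_CODES = {85}  # 180 also covers split level/foyer PUDs; left out here as ambiguous
DUPLEX_CODES = {90}


def subclass_flags(ms_subclass):
    """Map one MSSubClass code to the boolean flags listed above (hypothetical helper)."""
    code = int(ms_subclass)
    return {
        "isPUD": code in PUD_CODES,
        "isSplitFoyer": code in SPLIT_FOYER_CODES,
        "isDuplex": code in DUPLEX_CODES,
    }


# Example: flags for a handful of codes
flags = [subclass_flags(code) for code in (20, 85, 90, 160)]
```

With pandas, the same mapping could be applied per row and the original column dropped afterwards.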
Compare LandContour 1-hot vs ordinal
attempt dropping more sparsely populated columns
make separate model for houses without LotFrontage, without garage, without basement
handle that 1 house that has Sewage for Utilities
handle that 1 house that doesn't have electrical
when applying the median/mode for a given column, consider using median from just the training set vs median from train+test sets
handle sale type & condition
drop rows BEFORE splitting the data
write a more column-efficient alternative to the 1-hot function that encodes each category as a binary code (less readable, but the readable 1-hot version can still be used where column count doesn't matter)
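A minimal sketch of what that binary encoding could look like, using only the standard library. The function name `binary_encode` and the sample labels are made up for illustration. Each category gets an integer code, and the code's bits become the columns, so n categories need about log2(n) columns instead of n.

```python
def binary_encode(values):
    """Encode categorical values as bit columns (hypothetical helper).

    Returns (rows, categories): rows[i] is the list of bits for values[i],
    most significant bit first; categories[j] is the label with integer code j.
    """
    categories = sorted(set(values))
    # Number of bits needed to represent the largest code (at least 1 column)
    width = max(1, (len(categories) - 1).bit_length())
    code = {label: index for index, label in enumerate(categories)}
    rows = [
        [(code[value] >> bit) & 1 for bit in reversed(range(width))]
        for value in values
    ]
    return rows, categories


# Five made-up labels fit in 3 columns instead of the 5 that 1-hot needs
rows, categories = binary_encode(["A", "B", "C", "D", "E", "A"])
```

One caveat worth checking before relying on this: unrelated categories end up sharing bit columns, which a linear model like Lasso may read as a shared effect, so it is probably safer with tree-based models.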
ML questions
why do my XGB predictions (from Python) differ so wildly from Erik's (from R)? (up to 15%-20% depending on training parameters)
why does my custom CV differ so wildly from Python's built-in CV?
clean it up and publish it asking for help.
why does Lasso alpha depend so heavily on number of CV splits &/or random_state?
the difference in alpha swings price disagreements up to 20%
when alpha is close to the lower bound it predicts NaN for price
How might I empirically answer these questions, since cross-validated error differs by ~20%?
i.e. how does removing GarageYrBlt perform relative to removing Age vs keeping both?
1 option would be to try every possible cross-validation slicing, but that would have factorial runtime and is surely infeasible
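A cheaper alternative to trying every slicing might be repeated K-fold: reshuffle and re-split a handful of times, score both candidates on the exact same splits, and look at the per-split difference instead of the raw error. A stdlib-only sketch follows; all names here (`paired_repeated_cv`, the two toy models, the synthetic data) are made up for illustration and stand in for the real feature-set variants.

```python
import random
import statistics


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def paired_repeated_cv(rows, targets, model_a, model_b, k=5, repeats=10, seed=0):
    """Score two candidates on identical splits; return per-split MAE differences (a - b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(repeats):
        for fold in kfold_indices(len(rows), k, rng):
            held_out = set(fold)
            train = [i for i in range(len(rows)) if i not in held_out]
            errors = []
            for model in (model_a, model_b):
                predict = model([rows[i] for i in train], [targets[i] for i in train])
                errors.append(statistics.mean(abs(predict(rows[i]) - targets[i]) for i in fold))
            diffs.append(errors[0] - errors[1])
    return diffs


def overall_mean_model(train_rows, train_targets):
    """Toy candidate A: ignore the feature, always predict the training mean."""
    mean = statistics.mean(train_targets)
    return lambda row: mean


def group_mean_model(train_rows, train_targets):
    """Toy candidate B: predict the training mean of the row's group."""
    groups = {}
    for row, target in zip(train_rows, train_targets):
        groups.setdefault(row, []).append(target)
    means = {group: statistics.mean(values) for group, values in groups.items()}
    fallback = statistics.mean(train_targets)
    return lambda row: means.get(row, fallback)


# Synthetic data: one binary feature that shifts the target by 50
rows = [i % 2 for i in range(40)]
targets = [100 + 50 * (i % 2) + (i % 5) for i in range(40)]
diffs = paired_repeated_cv(rows, targets, overall_mean_model, group_mean_model)
mean_diff = statistics.mean(diffs)
spread = statistics.stdev(diffs)
```

If the mean difference is large relative to its spread across splits, the comparison is probably trustworthy even when the absolute error swings a lot from split to split.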
When should high correlations between one-hot-encoded and ordinal or numeric data result in dropping columns?
When encoding categorized data numerically (i.e. Great: 2, average: 1, poor: 0)
Would it make a difference if I weighted the values (i.e. Great: 2.5, average: 0.5, poor: 0.1)?
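One way to probe part of this, assuming a tree-based model that splits a feature on thresholds (as XGBoost's trees do): a single threshold split only sees the order of the values, so any re-weighting that keeps the order should offer the same candidate partitions. The sketch below checks that on a made-up column; `threshold_partitions` is a hypothetical helper. For a linear model like Lasso the spacing does enter the fit, so the weights would matter there.

```python
def threshold_partitions(encoded):
    """Return every set of row indices that can land on the left of a 'value < t' split."""
    partitions = set()
    for threshold in sorted(set(encoded))[1:]:
        left = frozenset(i for i, value in enumerate(encoded) if value < threshold)
        partitions.add(left)
    return partitions


quality = ["poor", "average", "great", "average", "poor", "great"]
even = {"poor": 0, "average": 1, "great": 2}
weighted = {"poor": 0.1, "average": 0.5, "great": 2.5}

even_splits = threshold_partitions([even[q] for q in quality])
weighted_splits = threshold_partitions([weighted[q] for q in quality])
same_splits = even_splits == weighted_splits
```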
When given a data set that can be segmented in 2 (i.e. houses with garages vs houses without)
does it make sense to train 2 separate models?
Why does changing the SalePrice to float increase the error?
How do I choose the right number of estimators when different validation set splits yield wildly different numbers of rounds until stopping (+/- ~50%)?
Is there a good way to get a measure of confidence with each prediction from a model (i.e. SalePrice for house 123 could be off by 5%, SalePrice for house 456 could be off by 50%)?
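One possible approach, sketched under heavy assumptions: train many copies of the model on bootstrap resamples of the training data and use the spread of their predictions for a given house as a rough confidence measure. The example swaps in a one-feature least-squares line for the real model to stay stdlib-only; the data, `fit_line`, and `bootstrap_interval` are all made up for illustration. This only captures how unstable the fit is, not the noise in sale prices themselves; quantile regression is another common option.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_interval(xs, ys, x_new, n_models=200, seed=0, lower=0.05, upper=0.95):
    """Predict x_new with models fit on bootstrap resamples; return (low, high) percentiles."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        slope, intercept = fit_line([xs[i] for i in sample], [ys[i] for i in sample])
        predictions.append(slope * x_new + intercept)
    predictions.sort()
    return (
        predictions[int(lower * (n_models - 1))],
        predictions[int(upper * (n_models - 1))],
    )


# Synthetic data: price roughly 100 per square foot plus noise
noise = random.Random(1)
areas = [1000 + 50 * i for i in range(30)]
prices = [100 * area + noise.gauss(0, 20000) for area in areas]

near = bootstrap_interval(areas, prices, 1700)  # typical house, inside the training range
far = bootstrap_interval(areas, prices, 4000)   # unusual house, far outside the range
```

The interval for the unusual house should come out noticeably wider than for the typical one, which is the kind of per-prediction signal the question asks about.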