# TADPREP/method_beta_test.py (main branch, don-c-smith/TADPREP)

import numpy as np
import pandas as pd
import tadprep as tp
pd.set_option('display.max_columns', None)
# This is the file we will use for beta testing of public-facing methods before subsequent debugging

# Load test data
df = pd.read_csv(r'C:\Users\doncs\Documents\GitHub\TADPREP\data\river_data.csv')
# Print check
# print(df)

'''
Testing the method_list method:
Prints the names and brief descriptions of all callable methods in the TADPREP library.
'''
# tp.method_list()

'''
Questions:
- Is the name of this method appropriate?
Yes
- Does it do what a reasonable person would expect it to do?
Yes
- Are we missing any major capabilities?
No
- Are there extraneous capabilities present in the method?
No
- Are all parameters/modes necessary and/or appropriate?
N/A
- What problems or needed changes were identified?
None
'''


'''
Testing the df_info method:
This method prints comprehensive information about a DataFrame's structure, contents, and potential data quality issues.
Parameters: verbose (bool, default=True)
    - Controls whether detailed feature information and data quality checks are displayed

Returns: None
'''
# Test non-verbose mode first
# tp.df_info(df, verbose=False)

# Test verbose mode
# tp.df_info(df, verbose=True)

'''
Questions:
- Is the name of this method appropriate?
Rename to 'summarize'
- Does it do what a reasonable person would expect it to do?
Yes
- Are we missing any major capabilities? Is this all the 'info' we need?
Check for features which are all ints or floats but typed as strings
Check for any instances which are all-Null in all features
- Are there extraneous capabilities present in the method?
No
- Are all parameters/modes necessary and/or appropriate?
Remove verbose mode entirely. It's too simple/low-detail. Just have this be a zero-param method.
- What problems or needed changes were identified?
    1. Remove verbose mode entirely, both from driver function and public method
    2. Implement check for ints or floats typed as strings
    3. Implement check for all-Null instances

** Refactoring to be done by Don **
'''
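
'''
Sketch for needed changes 2 and 3 above (illustrative only, not TADPREP's implementation).
It uses plain lists instead of a DataFrame so that it makes no assumptions about the library's
internals. The helper names is_missing, numeric_as_string and all_null_rows are hypothetical.
Note that float() also accepts strings such as 'nan' and 'inf', so this check counts them as numeric.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def numeric_as_string(values):
    """True if every non-missing value is a string that parses as a number."""
    present = [v for v in values if not is_missing(v)]
    if not present or not all(isinstance(v, str) for v in present):
        return False
    for v in present:
        try:
            float(v)
        except ValueError:
            return False
    return True


def all_null_rows(columns):
    """Return the indices of rows that are missing in every column."""
    n_rows = len(next(iter(columns.values())))
    return [i for i in range(n_rows)
            if all(is_missing(col[i]) for col in columns.values())]


# Hypothetical sample data: row 2 is empty in every feature
data = {
    "num_as_string": ["100", "200", None, "400"],
    "normal_string": ["apple", "banana", None, "fig"],
    "normal_num": [10, 20, float("nan"), 40],
}
flagged = [name for name, col in data.items() if numeric_as_string(col)]
empty_rows = all_null_rows(data)
print("Numeric data stored as strings:", flagged)
print("All-null rows:", empty_rows)
```
'''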

'''
Testing refactored 'summary' method (was df_info):
'''
# Using normal dataset
# tp.summary(df)

# Stress-testing the data quality checks in the method
# Create a test dataframe with examples for each data quality check
# test_df = pd.DataFrame({
#     # Near-constant feature (>95% single value)
#     'near_constant': ['common_value'] * 19 + ['rare_value'],  # 95% same value
#     # Feature with infinite values
#     'has_inf': [1.0, 2.0, float('inf'), 4.0, 5.0] + [6.0] * 15,
#     # Feature with empty strings (distinct from NaN)
#     'empty_strings': ['value1', '', 'value2', '', 'value3'] + ['value4'] * 15,
#     # Numeric data stored as strings
#     'num_as_string': ['100', '200', '300', '400', '500'] + ['600'] * 15,
#     # Normal numeric feature (for comparison)
#     'normal_num': [10, 20, 30, 40, 50] + [60] * 15,
#     # Normal string feature (for comparison)
#     'normal_string': ['apple', 'banana', 'cherry', 'date', 'elderberry'] + ['fig'] * 15
# })

# Add duplicate rows
# dup_row = pd.DataFrame({
#     'near_constant': ['common_value'],
#     'has_inf': [6.0],
#     'empty_strings': ['value4'],
#     'num_as_string': ['600'],
#     'normal_num': [60],
#     'normal_string': ['fig']
# })
# test_df = pd.concat([test_df, dup_row, dup_row], ignore_index=True)  # Add 2 duplicate rows

# Add a completely empty row (all NaN)
# empty_row = pd.DataFrame([{col: np.nan for col in test_df.columns}])
# test_df = pd.concat([test_df, empty_row], ignore_index=True)

# Print check for test dataframe
# print(test_df)

# Test data quality checks
# tp.summary(test_df)


'''
Testing the subset method:
This method subsets the input DataFrame according to user specification.
Parameters: verbose (bool, default=True)
    - Controls whether detailed process information and methodological guidance is displayed

Returns: The modified DataFrame as subset by the user's specifications
'''
# Test non-verbose mode first
# df_subset = tp.subset(df, verbose=False)

# Test verbose mode
# df_subset = tp.subset(df, verbose=True)

# Print subsetted dataframe
# print(df_subset)

'''
Questions:
- Is the name of this method appropriate?
Yes
- Does it do what a reasonable person would expect it to do?
Yes
- Are we missing any major capabilities? Is this all the subset capacity we usefully need?
No
- Are there extraneous capabilities present in the method?
No
- Are all parameters/modes necessary and/or appropriate?
Yes
- What problems or needed changes were identified?
    1. BUG: Loop is stepping from proportion entry back to feature selection in stratified sampling when invalid input
        is entered for the proportion to subset. Likely a while loop scope problem.
    2. When subsetting by date, error messages for entering an invalid date need to be more clear/informative
        (e.g. if you enter 200 for the year, we need something more informative than "nanosecond error")
    3. Investigate datetime format handling - what's available, what's missing, is there a better/more complete way?
        Should the current verbose implementation persist in non-verbose mode?
    4. The "Randomly dropped 25.0% of instances. 45 instances remain."-type message should print in non-verbose mode.
        Consider expressing as pre-sampling/post-sampling information. Should be consistent for all sampling methods.
    5. BUG: Bizarre error message printing when attempting to stratify by feature with missing values: INVESTIGATE
    6. Asking the user whether they want an explanation in verbose mode is redundant; if verbose, just print the
        explanations

** Refactoring to be done by Gabor **
'''
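
'''
Sketch for needed change 2 above (illustrative only, not TADPREP's implementation).
Assumption: the "nanosecond error" is pandas' out-of-bounds error for datetime64[ns], which can only
represent dates from roughly 1677-09-21 to 2262-04-11. Checking the year before handing the value to
pandas lets us print a plain-language message instead. The function name parse_date_entry, the
YYYY-MM-DD entry format and the message wording are all hypothetical.

```python
from datetime import date

# Whole years that fit safely inside the datetime64[ns] range
MIN_YEAR, MAX_YEAR = 1678, 2261


def parse_date_entry(text):
    """Parse 'YYYY-MM-DD' and return (date, None) or (None, error_message)."""
    cleaned = text.strip()
    parts = cleaned.split("-")
    if len(parts) != 3 or not all(p.isdecimal() for p in parts):
        return None, "Enter the date as YYYY-MM-DD, e.g. 2021-06-30."
    year, month, day = (int(p) for p in parts)
    if not MIN_YEAR <= year <= MAX_YEAR:
        return None, f"Year {year} is out of range. Use a year from {MIN_YEAR} to {MAX_YEAR}."
    try:
        return date(year, month, day), None
    except ValueError as exc:
        # Covers impossible months and days, e.g. month 13 or February 30
        return None, f"{cleaned} is not a real calendar date ({exc})."


for entry in ["2021-06-30", "200-01-01", "2021-02-30", "June 1"]:
    parsed, message = parse_date_entry(entry)
    print(entry, "->", parsed if message is None else message)
```
'''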


'''
Testing the reshape method:
This method interactively reshapes the input DataFrame according to user specification.
Parameters: verbose (bool, default=True)
    - Controls whether detailed process information is displayed

Returns: None
'''
# Test non-verbose mode first
# tp.reshape(df, verbose=False)

# Test verbose mode
# tp.reshape(df, verbose=True)

'''
Questions:
- Is the name of this method appropriate?
Yes
- Does it do what a reasonable person would expect it to do?
Yes
- Are we missing any major capabilities? Is this all the reshaping capacity we usefully need?
Yes
- Are there extraneous capabilities present in the method?
No
- Are all parameters/modes necessary and/or appropriate?
Yes
- What problems or needed changes were identified?
    - Clarify ability to select multiple reshape methods at user prompt
    - Add ability to select individual features to drop if list not passed
    - Counts of missing values by feature should be made more legible
    - Enumerate features to drop rows by and have user pass indices, not feature names
    - BUG: Generalized degree of population, default decimal-percent throwing traceback (final_thresh feature)
    - Move explanation out of input dependence, have it run if verbose is true
'''
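
'''
Sketch for the "enumerate features and have user pass indices" change above (illustrative only,
not TADPREP's implementation). The function name parse_index_selection and the comma-separated,
1-based entry format are assumptions. The feature names are the ones in the river dataset.

```python
def parse_index_selection(raw, features):
    """Map a comma-separated string of 1-based indices to feature names."""
    chosen = []
    for token in raw.split(","):
        token = token.strip()
        if not token.isdecimal():
            raise ValueError(f"'{token}' is not a whole number.")
        idx = int(token)
        if not 1 <= idx <= len(features):
            raise ValueError(f"{idx} is outside the range 1-{len(features)}.")
        name = features[idx - 1]
        if name not in chosen:  # Ignore repeated indices
            chosen.append(name)
    return chosen


features = ["date", "season", "volume", "avg_flag", "clarity", "samples", "traffic"]

# Show the numbered menu the user would pick from
for number, name in enumerate(features, start=1):
    print(f"{number}. {name}")

selected = parse_index_selection("2, 5, 2", features)
print("Selected:", selected)
```
'''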


'''
Testing the find_corrs method:
This method finds correlations in numerical features of a DataFrame using a specified detection method.
Args:
    df (pd.DataFrame): The DataFrame to analyze for correlated features
    method (str, optional): Correlation method to use. Options:
        - 'pearson': Standard correlation coefficient (default)
        - 'spearman': Rank correlation, robust to outliers; captures monotonic (including non-linear) relationships
        - 'kendall': Another rank correlation, more robust for small samples
    threshold (float, optional): Correlation coefficient threshold (absolute value).
        Defaults to 0.8. Values should be between 0 and 1.
    verbose (bool, optional): Whether to print detailed information about correlations.
        Defaults to True.

Returns: A dictionary containing correlation information with summary statistics and detailed pair information.
'''
# df_corrs_errors = pd.DataFrame({
#     'num_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
#     'num_2': [2, 5, 8, 9, 11, 12, 14, 15, 17, 20],
#     'cat': ['dog', 'dog', 'dog', 'dog', 'dog', 'dog', 'dog', 'dog', 'dog', 'dog']
# })

# Make sure 'invalid method' catch fires
# tp.find_corrs(df_corrs_errors, method='magic_wizard')  # Error was caught

# Make sure 'custom threshold must be between 0 and 1' catch fires
# tp.find_corrs(df_corrs_errors, threshold=1.5)  # Too-high error was caught
# tp.find_corrs(df_corrs_errors, threshold=-1.5)  # Too-low error was caught

# Make sure 'at least two numerical features must be present' catch fires
# tp.find_corrs(df_corrs_errors)  # Error was caught

# # Build useful data
# df_corrs = pd.DataFrame({
#     'linear': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
#     'linear_double': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
#     'linear_neg': [-1, -2, -3, -4, -5, -6, -7, -8, -9, -10],
#     'near_linear': [1, 8, 13, 21, 36, 40, 53, 65, 77, 89],
#     'noise': [28, -70, 576, 2856, -7798, 44, -90, 49607, 1000000, -2568637],
#     'cat': ['dog', 'fish', 'bear', 'cat', 'kangaroo', 'whale', 'leopard', 'mongoose', 'badger', 'elephant'],
#     'missing_vals': [567, 265, 476, 244, 670, None, None, None, None, None]
# })
# # Test method's normal operation
# corr_dict = tp.find_corrs(df_corrs)

# Test non-verbose operation
# corr_dict = tp.find_corrs(df_corrs, verbose=False)

# Test alternate methods
# corr_dict = tp.find_corrs(df_corrs, method='spearman')
# corr_dict = tp.find_corrs(df_corrs, method='kendall')

# Test custom threshold
# corr_dict = tp.find_corrs(df_corrs, threshold=1)

# Print correlation dictionary
# print(corr_dict)

'''
Questions:
- Is the name of this method appropriate?
Yes
- Does it do what a reasonable person would expect it to do?
Yes
- Are we missing any major capabilities? Is this all the correlation detection capacity we usefully need?
I believe so
- Are there extraneous capabilities present in the method?
No
- Are all parameters/modes necessary and/or appropriate?
Yes
- What problems or needed changes were identified?
We need to handle missing values more gracefully. Right now the method simply fails out if the arrays aren't all
exactly the same length. We need to print something more explanatory to the user, and maybe suggest either imputing
missing values with the impute method or dropping features with missing values with the reshape method.
'''
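
'''
Sketch of one way to handle missing values more gracefully (illustrative only, not how find_corrs
works today). It computes Pearson's r over only the rows where both features are present and reports
how many rows were skipped. The function name pairwise_pearson and the min_pairs parameter are
hypothetical. Missing values are None here because the data is held in plain lists; in a DataFrame
they would be NaN.

```python
import math


def pairwise_pearson(x, y, min_pairs=3):
    """Pearson r over rows where both values are present.

    Returns (r, n_used). r is None when fewer than min_pairs complete
    pairs remain or when either feature is constant over those pairs.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < min_pairs:
        return None, n
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None, n
    return cov / math.sqrt(var_x * var_y), n


linear = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
linear_double = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
missing_vals = [567, 265, 476, 244, 670, None, None, None, None, None]

r_full, n_full = pairwise_pearson(linear, linear_double)
r_part, n_part = pairwise_pearson(linear, missing_vals)

skipped = len(linear) - n_part
if skipped:
    print(f"Note: {skipped} rows skipped because of missing values. "
          "Consider imputing them or dropping the feature first.")
```
'''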


'''
Testing the make_plots method:
This method interactively creates and displays plots for features in a DataFrame.
Parameters
----------
df : pandas.DataFrame
    The DataFrame containing features to plot.
features_to_plot : list[str] | None, default=None
    Optional list of specific features to consider for plotting. If None, the
    function will use all features in the DataFrame.

Returns
-------
None
    This function displays plots but does not return any values.
'''
# Test plotting
# Features are date, season, volume, avg_flag, clarity, samples, traffic

# Test first with no passed feature list
# tp.make_plots(df)

# Test with passed feature list
# tp.make_plots(df, features_to_plot=['season', 'volume', 'clarity'])

'''
Questions:
- Is the name of this method appropriate?
I'd rather call it plot_features
- Does it do what a reasonable person would expect it to do?
Yes
- Are we missing any major capabilities? Is this all the plotting capacity we usefully need?
I think at a top level, yes. More complex plots should be hand-coded. This is an EDA tool.
- Are there extraneous capabilities present in the method?
I don't think so.
- Are all parameters/modes necessary and/or appropriate?
Yes. The list-to-plot is useful if you have a lot of features or some prior knowledge of what you want to look at.
- What problems or needed changes were identified?
    - The "feature 'feature_name' contains datetime-like values and will be treated as datetime" message should only
        print if a list of features is passed and one of the passed features is datetime.
'''

'''
Testing the rename_and_tag method:
This method interactively renames features and allows the user to tag them as ordinal or target features, if desired.
Parameters
----------
df : pandas.DataFrame
    The DataFrame whose features need to be renamed and/or tagged
verbose : bool, default = True
    Controls whether detailed process information is displayed
tag_features : bool, default = False
    Controls whether the feature-tagging process is activated

Returns
-------
pandas.DataFrame
    The DataFrame with renamed/tagged features


'Bad' input to try when renaming features, to check that the 'catches' fire. New feature names which:
- Contain spaces
- Contain special characters (e.g. @, %, &)
- Contain double underscores
- Start with an integer
- Contain a Python keyword (e.g. class, True, for)

'Poor practice' input to try when renaming features, to check that the 'catches' fire. New feature names which:
- Are all uppercase
- Are quite short (<=2 characters)
- Are quite long (>=30 characters)
'''
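
'''
Sketch of the name checks listed above (illustrative only, not TADPREP's implementation). It is
meant as a reference for what each 'catch' should flag during testing. The function name
check_feature_name and the message wording are hypothetical. Assumption: the keyword check fires
only when the whole name is a Python keyword, which is one possible reading of the list above.

```python
import keyword
import re


def check_feature_name(name):
    """Return (errors, warnings) for a proposed feature name."""
    errors, warnings = [], []

    # 'Bad' input: should block the rename
    if " " in name:
        errors.append("contains spaces")
    if re.search(r"[^A-Za-z0-9_ ]", name):
        errors.append("contains special characters")
    if "__" in name:
        errors.append("contains double underscores")
    if name[:1].isdigit():
        errors.append("starts with a digit")
    if keyword.iskeyword(name):
        errors.append("is a Python keyword")

    # 'Poor practice' input: should only warn
    if name.isupper():
        warnings.append("is all uppercase")
    if len(name) <= 2:
        warnings.append("is very short (<=2 characters)")
    if len(name) >= 30:
        warnings.append("is very long (>=30 characters)")

    return errors, warnings


for candidate in ["river_volume", "flow rate", "flow%", "2nd__pass", "class", "ID"]:
    print(candidate, check_feature_name(candidate))
```
'''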

# Testing default settings first
# df_renamed = tp.rename_and_tag(df, verbose=True, tag_features=False)

# Testing non-verbose operation
# df_renamed = tp.rename_and_tag(df, verbose=False, tag_features=False)

# Testing verbose feature tagging
# df_renamed = tp.rename_and_tag(df, verbose=True, tag_features=True)

# Testing non-verbose feature tagging
# df_renamed = tp.rename_and_tag(df, verbose=False, tag_features=True)

# print(df_renamed)
'''
Questions:
- Is the name of this method appropriate?
Yes
- Does it do what a reasonable person would expect it to do?
Yes
- Are we missing any major capabilities? Is this all the renaming capacity we usefully need?
I think so
- Are there extraneous capabilities present in the method?
I'm still not sure the tagging feature is needed.
- Are all parameters/modes necessary and/or appropriate?
I believe so
- What problems or needed changes were identified?
    - We need to add a 'finished with tagging?' check for the ordinal and target tagging steps in case someone forgets
        to enter all of the features they want to tag
'''

# NOTE: I can use the same 'river' data to test feature_stats, encode, and scale
# However, I will need to import the sparse dataframe to test impute
'''
Testing the feature_stats method:
This method displays feature-level statistics for each feature in the DataFrame.

For each feature, displays missingness information and appropriate descriptive statistics
based on the feature's datatype (boolean, datetime, categorical, or numerical).
Features are automatically classified by type for appropriate statistical analysis.

Parameters
----------
df : pandas.DataFrame
    The DataFrame to analyze
verbose : bool, default=True
    Whether to print detailed statistical information and more extensive visual formatting

Returns
-------
None
    This is a void method that prints information to the console.
'''
# Testing verbose mode first
# tp.feature_stats(df, verbose=True)

# Testing non-verbose mode
# tp.feature_stats(df, verbose=False)

'''
Questions:
- Is the name of this method appropriate?
Yes
- Does it do what a reasonable person would expect it to do?
Yes
- Are we missing any major capabilities? Is this all the descriptive stats capacity we usefully need?
I think we're good.
- Are there extraneous capabilities present in the method?
I don't think so.
- Are all parameters/modes necessary and/or appropriate?
I should consider whether the information provided in non-verbose mode is TOO sparse/limited.
- What problems or needed changes were identified?
    - Add "of 1" to the end of high entropy text parenthetical
    - Move note about what IQR is to *after* the IQR value is printed
    - Line break needed in between kurtosis and coefficient of variation
    - Line break needed in between each feature, i.e. before the hashes appearing before "key values for..." text
'''

'''
Testing the encode method:
Interactively encodes categorical features in the DataFrame using specified encoding methods.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame containing features to encode.
    features_to_encode : list[str] | None, default=None
        Optional list of features to encode - if None, method will help identify categorical features.
    verbose : bool, default=True
        Controls whether detailed guidance and explanations are displayed.
    skip_warnings : bool, default=False
        Controls whether all best-practice-related warnings about encoding are skipped.
    preserve_features : bool, default=False
        Whether to keep original features in the DataFrame alongside encoded ones.
        When True, original categorical columns are retained after encoding.

    Returns
    -------
    pandas.DataFrame
        The DataFrame with encoded categorical features
'''
# NOTE: There are many different permutations of this method in terms of arguments that need to be tested

# Test default settings first
# df_encoded = tp.encode(df, features_to_encode=None, verbose=True, skip_warnings=False, preserve_features=False)

# Test non-verbose operation
# df_encoded = tp.encode(df, features_to_encode=None, verbose=False, skip_warnings=False, preserve_features=False)

# Test skipped warnings mode
# df_encoded = tp.encode(df, features_to_encode=None, verbose=True, skip_warnings=True, preserve_features=False)

# Test passed feature list mode
# df_encoded = tp.encode(df, features_to_encode=['clarity', 'season'], verbose=True, skip_warnings=False,
#                        preserve_features=False)

# Test feature preservation mode
# df_encoded = tp.encode(df, features_to_encode=['clarity', 'season'], verbose=True, skip_warnings=False,
#                        preserve_features=True)
#
# print(df_encoded)  # Print check

'''
Questions:
- Is the name of this method appropriate?
Yes
- Does it do what a reasonable person would expect it to do?
Yes
- Are we missing any major capabilities? Is this all the encoding capacity we usefully need?
I think we're good.
- Are there extraneous capabilities present in the method?

- Are all parameters/modes necessary and/or appropriate?

- What problems or needed changes were identified?
    - We should ask about a custom prefix only if the user decides to proceed with encoding a given feature
    - Make sure of default behavior re: NaN values in non-verbose operation - should be treated as separate category
    - Make sure 'drop rows with missing values' functionality is working
'''

# Load sparse data
df_sparse = pd.read_csv(r'C:\Users\doncs\Documents\GitHub\TADPREP\data\sample_data_sparse.csv')
# Print check
# print(df_sparse)

'''
Testing the impute method:
Interactively imputes missing values in the DataFrame using user-specified simple imputation methods.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame containing missing values to impute
    verbose : bool, default = True
        Controls whether detailed process information is displayed
    skip_warnings : bool, default = False
        Controls whether missingness threshold warnings are displayed

    Returns
    -------
    pandas.DataFrame
        A new DataFrame with imputed values
'''
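
'''
Sketch of what "simple imputation" could look like (illustrative only, not TADPREP's implementation).
The docstring above does not list the available methods, so mean, median and mode are assumptions,
as are the function names and the 0.5 warning threshold. It is handy for working out the expected
fill values by hand when checking the interactive output. Missing values are None here because the
data is held in plain lists.

```python
import statistics

WARN_THRESHOLD = 0.5  # Hypothetical missingness rate above which to warn


def missing_rate(values):
    """Fraction of values that are missing."""
    return sum(v is None for v in values) / len(values)


def impute_values(values, method="mean"):
    """Return a new list with None replaced by the mean, median or mode."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("Cannot impute a feature with no observed values.")
    if method == "mean":
        fill = statistics.mean(present)
    elif method == "median":
        fill = statistics.median(present)
    elif method == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"Unknown imputation method: {method}")
    return [fill if v is None else v for v in values]


volume = [10, None, 30, 20, None]
season = ["spring", "summer", None, "spring"]

rate = missing_rate(volume)
if rate > WARN_THRESHOLD:
    print(f"Warning: {rate:.0%} of values are missing.")

volume_filled = impute_values(volume, method="mean")
season_filled = impute_values(season, method="mode")
print(volume_filled, season_filled)
```
'''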
# Test default behavior first
df_imputed = tp.impute(df_sparse, verbose=True, skip_warnings=False)

# Test non-verbose operation
# df_imputed = tp.impute(df_sparse, verbose=False, skip_warnings=False)

# Test skipped warnings operation
# df_imputed = tp.impute(df_sparse, verbose=True, skip_warnings=True)

# Test minimal-weight operation
# df_imputed = tp.impute(df_sparse, verbose=False, skip_warnings=True)

# Print check
print(df_imputed)