
"""
DATA EXPLODER FOR PRINCIPAL COMPONENT ANALYSIS
v1.0 (09/2021)
Updates, discussions, etc. can be found here in the Data Exploder project: github.com/AleSacco
For further inquiries write to: Alessio Sacco (a.sacco@inrim.it)


This script consists of:
- This main "DataExploder.py" file, to be run;
- (OPTIONAL) a configuration file: "config.py".

The script looks for a "config.py" file in the directory from which it is run. If the file is not found,
the script uses the default parameters and writes them to a newly created "config.py".


This script takes as input 2 .csv files describing data as tables:
- a file containing the best estimate for each data point (default file name: data.csv);
- a file containing an uncertainty value for each data point (default file name: uncertainties.csv).

Each data point's complete information is stored at the same table coordinate in both files: the estimate file
contains the best estimate for that data point, while the uncertainty file contains its uncertainty value.
The script also accepts data measured as below the limit of detection (LOD), in which case the best estimates table
entry is to contain the best estimate for that point, i.e. LOD/2 as described in ExplodeData() (NOT the value 0),
while the corresponding uncertainties entry can contain a blank value, or any non-numerical string, to indicate that
the point is below the LOD; the number 0 can also be used for this purpose, but this is not recommended.

Both tables must have the exact same structure in terms of row/column positions, number of label columns, etc.
THE FIRST ROW must be the same for both tables, containing the unique names for each of the columns
(variable names or types of label) and will not be treated as data.
In the configuration file, "Number of label columns" is an integer indicating the number of label columns,
i.e. the number of leftmost COLUMNS THAT WILL BE IGNORED in the Monte Carlo data generation:
these entries in each row will be replicated verbatim for the corresponding generated samples. These can include the
sample names and/or categorical variables, intended for later analysis.


In this version of the Data Exploder, each single datum consists of two inputs: a best estimate, either a number or
any non-numeric string for absent data (such as "N/A" or "NA", or no data), and an uncertainty.
If non-numeric strings are found in the best estimates table, the corresponding variables (columns) will be IGNORED
FOR ALL DATA.
If a numeric, non-zero uncertainty input is present in the corresponding file, the script interprets it as half of the
confidence interval on the measurement with a Gaussian probability density function (pdf), i.e. expanded uncertainty;
if an uncertainty input is a string, NaN (not a number), or zero, this is interpreted as an indication that the datum
is to be read as BELOW THE LIMIT OF DETECTION: a uniform pdf is used for the data point instead, ranging from zero to
twice the value indicated in the best estimates table.

Using the appropriate pdf, the script then "explodes" each datum (generates Monte Carlo samples) accordingly, using for
the Gaussian pdfs a coverage factor, usually named "k", which changes according to the choice of confidence level.
By default, in this script k=1.96, corresponding to a confidence level of 95% for a Gaussian pdf, but this can be
changed in the config file.


Configuration file variables:
	"k" (decimal/float, default 1.96): coverage factor, used for computing Gaussian width parameters from
		 							   uncertainties;
	"Measurements file name" (string, default "data.csv"): name of the file containing the best estimates data;
	"Uncertainties file name" (string, default "uncertainties.csv"): name of the file containing the uncertainties data;
	"Destination file name" (string, default "Exploded data.csv"): name of the file generated by the script, containing
							 									   the Monte Carlo samples data;
	"Number of samples" (integer or decimal/float, default 1E3): number of Monte Carlo samples that will be generated
																 for each data point (rounded to integer if necessary);
	"Number of label columns" (integer, default 3): number of leftmost columns to be ignored, containing labels and
													categorical data, unique or not.

"""

import importlib
import os
import sys
from decimal import Decimal
import numpy as np
import pandas as pd
from tqdm import tqdm


# Default configuration
config_default = {
	'k': 1.96,
	'Measurements file name': 'data.csv',
	'Uncertainties file name': 'uncertainties.csv',
	'Destination file name': 'Exploded data.csv',
	'Number of samples': 1E3,
	'Number of label columns': 3
}


def LoadConfigFile(module):  # Loads config.py if it exists, or it creates one with values of config_default if not
	try:
		confmod = importlib.import_module(module)
	except ModuleNotFoundError:
		config = config_default
		CreateDefaultConfigFile()
		print(TextColors.WARNING + 'Configuration file "config.py" cannot be found. '
			'Default values were loaded and configuration file was created with these values.' + TextColors.ENDC)
	else:
		config = confmod.config
	return config

def CreateDefaultConfigFile():
	with open('config.py', 'w') as f:
		s = 'config = {\n'
		for key, value in config_default.items():
			key2 = '"' + key + '"'
			value2 = '"' + value + '"' if isinstance(value, str) else str(value)
			s += '    ' + key2 + ': ' + value2 + ',\n'
		s = s[:-2]
		s += '\n}'
		f.write(s)
	return

def LoadConfig(config):  # Closure to make LoadConfigVar() use config data
	def LoadConfigVar(varName):  # Function to load a variable from config file without risks
		try:
			var = config[varName]
		except KeyError:
			print(TextColors.FAIL + 'Configuration file "config.py" appears to be damaged or invalid. '
									'The present "config.py" will be renamed "config_old.py".' + TextColors.ENDC)
			if os.path.isfile('config_old.py'):
				print(TextColors.WARNING + '"config_old.py" already exists. '
						'Renaming it to "config_old_old.py", overwriting if necessary.'
					  + TextColors.ENDC)
				if os.path.isfile('config_old_old.py'):
					os.remove('config_old_old.py')
				os.rename('config_old.py', 'config_old_old.py')
			os.rename('config.py', 'config_old.py')
			print('Restarting script to load default values and to create default "config.py" configuration file...\n')
			os.execv(sys.executable, [sys.executable, '"' + sys.argv[0] + '"'] + sys.argv[1:])
		return var
	return LoadConfigVar

def LoadData(measurementsCsv, uncertaintiesCsv):
	# Read data
	try:
		data = pd.read_csv(measurementsCsv, dtype='str')
		uncertainties = pd.read_csv(uncertaintiesCsv, dtype='str')
	except FileNotFoundError as ex:
		print(TextColors.FAIL + 'Error! ' + str(ex) + TextColors.ENDC)
		sys.exit(-1)
	# Check if data and uncert matrices have the same shape
	if data.shape != uncertainties.shape:
		print(TextColors.FAIL
			  + 'Error! '
			  + measurementsCsv + ' (shaped ' + str(data.shape) + '), and ' + uncertaintiesCsv + ' (shaped ' +
			  str(uncertainties.shape) + ') have different shapes.' + TextColors.ENDC)
		sys.exit(-1)
	return data, uncertainties

def DataClean(data, uncertainties, numberOfLabelColumns):
	dataCropped = data.iloc[:, numberOfLabelColumns:]
	dataCroppedMask = dataCropped.apply(pd.to_numeric, errors='coerce')
	colsToDrop = dataCroppedMask.columns[dataCroppedMask.isna().any()].tolist()
	dataClean = data.drop(columns=colsToDrop)
	uncertaintiesClean = uncertainties.drop(columns=colsToDrop)
	return dataClean, uncertaintiesClean
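
# Illustrative sketch, not part of the original pipeline: the column-dropping rule applied by DataClean(), restated in
# plain Python without pandas. A data column is discarded for ALL rows as soon as one of its entries cannot be read
# as a number; label columns are never inspected. The helper name and its list-of-lists input are hypothetical.
def SketchColumnsToDrop(header, rows, numberOfLabelColumns):
	import math  # Local import keeps the sketch self-contained

	def isNumeric(text):
		try:
			return not math.isnan(float(text))  # "nan" parses as a float but still counts as missing
		except (TypeError, ValueError):
			return False  # Blank cells, None, "N/A", "NA", ...

	return [name for position, name in enumerate(header)
			if position >= numberOfLabelColumns
			and not all(isNumeric(row[position]) for row in rows)]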

def CutToSignificantDigits(x: str, roundTo=2) -> tuple:  # Works unexpectedly if string is not int or decimal
	"""
	Takes a string containing an integer or a decimal and rounds it to <roundTo> significant digits.
	This is useful to avoid a needlessly large exploded file, which would contain lots of insignificant digits that
	can double or triple the file size without carrying any useful information.
	The <roundTo> parameter should not be too small, or the rounded samples will show "binning" artifacts.
	In the script, this function is used to establish from the uncertainty value (if present) the correct number of
	rounding digits for the exploded data. For example, if a datum and its respective uncertainty are given as
	1.2345 and 0.1234 respectively, with roundTo=2 the script generates random data and rounds them as
	[1.23, 1.31, 1.15...] because they are rounded to the decimal corresponding to 2 significant digits of the
	uncertainty (e.g. 0.12).
	Args:
		x (str): The input string containing a number. Works unexpectedly if this string is not int or decimal.
		roundTo (int): The number of significant digits to which to round the datum x.

	Returns:
		decimals (int): the number of decimals to which the datum was rounded. Negative if rounded to tens, hundreds...
		roundNum (str): the rounded number, as a string with trailing zeros removed.
	"""
	magicString = '{0:.'+str(roundTo)+'g}'
	roundNum = '{0}'.format(float(magicString.format(Decimal(x))))
	roundNum = roundNum.rstrip('0').rstrip('.')
	split = roundNum.split('.')
	if len(split) == 1:
		decimals = -(len(split[0])-len(split[0].strip('0')))
	else:
		decimals = len(split[1])
	return decimals, roundNum
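
# Illustrative alternative, not the method used by CutToSignificantDigits() above: the decimal place of the
# <roundTo>-th significant digit can also be derived from the exponent of the leading digit, which the standard
# library exposes as Decimal.adjusted(). Unlike the string-based approach, this treats trailing zeros as significant,
# so "10" with roundTo=2 gives 0 rather than -1. The helper name is hypothetical.
def SketchDecimalsFromExponent(x: str, roundTo=2) -> int:
	from decimal import Decimal  # Local import keeps the sketch self-contained
	leadingExponent = Decimal(x).adjusted()  # e.g. -1 for "0.1234", 2 for "120"
	return roundTo - 1 - leadingExponent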

def ExplodeData(data, uncertainties, sampleNum=1E4, labelColumnsNum=1, k=1.96):
	"""
	Takes "data" and "uncertainties" dataframes and draws <sampleNum> samples from a distribution whose function and
	parameters depend on the datum and uncertainty:
	- if the uncertainty is a non-zero number, the employed distribution is a Gaussian with mean=<datum> and
	st.dev. = abs(<uncertainty>)/k, where k is the coverage factor indicated in "config.py";
	- if the uncertainty does not exist (i.e. is not numeric) or is zero for that datum, it defaults to a uniform
	distribution ranging from 0 to <datum>*2. The omission of the uncertainty value is to indicate that the datapoint
	is below the limit of detection (LOD): in this case, <datum> indicates LOD/2 for that point, i.e. the best estimate.

	The function cycles through rows and columns of data matrix and creates Monte Carlo samples for each as described,
	then creates the entire DataFrame.
	Args:
		data (dataframe): DataFrame containing data (best estimates for each measurement).
		uncertainties (dataframe): DataFrame containing uncertainties if available; a blank, a non-numeric string or 0
		indicates that the corresponding datapoint was measured below the limit of detection, triggering a uniform
		distribution for sample generation for the datum in question.
		sampleNum (int, float): Number of samples to generate for each datapoint. If float, it gets converted to int.
		labelColumnsNum (int): Number of non-data columns starting from the left (e.g. name columns, labels...).
		k (float): Coverage factor used to convert an expanded uncertainty into a standard deviation.

	Returns:
		explodedData: DataFrame containing <sampleNum> draws for each measurement and their corresponding label columns.
	"""
	sampleNum = int(sampleNum)
	frames = []
	columnsNames = data.columns
	for index, row in tqdm(data.iterrows(), total=data.shape[0], unit=' samples'):
		df = pd.DataFrame(index=range(sampleNum), columns=columnsNames)
		for col in columnsNames[:labelColumnsNum]:  # Write label columns
			df[col] = row[col]
		for col in columnsNames[labelColumnsNum:]:  # Generate random numbers
			value = pd.to_numeric(data.at[index, col])
			# Blank or non-numeric uncertainties become NaN; numeric strings (integers included) become numbers
			uncertainty = pd.to_numeric(uncertainties.at[index, col], errors='coerce')
			if pd.isnull(uncertainty) or uncertainty == 0:
				# Uniform distribution from 0 to 2*value if uncertainty is string or NaN or zero
				rounding = CutToSignificantDigits(data.at[index, col])[0]
				rands = np.around(np.random.uniform(0, 2 * value, sampleNum), decimals=rounding)
			else:
				# Gaussian distribution if uncertainty is a number different from 0; st.dev. = abs(uncertainty)/k
				rounding = CutToSignificantDigits(uncertainties.at[index, col])[0]
				rands = np.around(np.random.normal(value, abs(uncertainty) / k, sampleNum), decimals=rounding)
			df[col] = rands
		frames.append(df)
	print('Data generation complete. Exporting data...')
	explodedData = pd.concat(frames, ignore_index=True)
	return explodedData

def SaveExplodedData(data, filename):
	data.to_csv(filename, index=False)
	print(TextColors.GREEN +
		  'Data successfully exported to file: "' + filename + '".' +
		  TextColors.ENDC)
	return

class TextColors:
	ENDC = '\033[0m'
	BOLD = '\033[1m'
	UNDERLINE = '\033[4m'
	FAIL = '\033[91m'
	GREEN = '\033[92m'
	WARNING = '\033[93m'
	BLUE = '\033[94m'
	HEADER = '\033[95m'
	CYAN = '\033[96m'


### Script

def DataExploder():

	# Load configuration variables and data
	config = LoadConfigFile('config')
	LoadConfigVariable = LoadConfig(config)
	coverageFactor = LoadConfigVariable('k')
	measurementsFileName = LoadConfigVariable('Measurements file name')
	uncertaintiesFileName = LoadConfigVariable('Uncertainties file name')
	destinationFileName = LoadConfigVariable('Destination file name')
	numberOfSamples = LoadConfigVariable('Number of samples')
	numberOfLabelColumns = LoadConfigVariable('Number of label columns')
	data, uncertainties = LoadData(measurementsFileName, uncertaintiesFileName)

	# Clean data, generate samples and save them to file
	data, uncertainties = DataClean(data, uncertainties, numberOfLabelColumns)
	explodedData = ExplodeData(data, uncertainties, sampleNum=numberOfSamples, labelColumnsNum=numberOfLabelColumns,
							   k=coverageFactor)
	SaveExplodedData(explodedData, destinationFileName)

	return explodedData
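

# Illustrative sketch, standard library only and not called by the script: the two sampling rules documented in
# ExplodeData(), applied to a single data point. A usable uncertainty U is read as an expanded uncertainty, so the
# Gaussian standard deviation is abs(U)/k; a blank, non-numeric, NaN or zero uncertainty marks a below-LOD point,
# which is sampled uniformly between 0 and 2*value. The helper name and the "seed" argument are hypothetical
# additions made here for reproducibility; no rounding is applied in this sketch.
def SketchExplodeDatum(value, uncertainty, k=1.96, sampleNum=1000, seed=None):
	import random  # Local import keeps the sketch self-contained
	rng = random.Random(seed)
	value = float(value)
	try:
		u = float(uncertainty)
	except (TypeError, ValueError):
		u = 0.0  # Blank or non-numeric uncertainty: treat the point as below LOD
	if u == 0 or u != u:  # Zero, or NaN (the only value not equal to itself)
		return [rng.uniform(0, 2 * value) for _ in range(sampleNum)]
	sigma = abs(u) / k  # Expanded uncertainty -> standard deviation
	return [rng.gauss(value, sigma) for _ in range(sampleNum)]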


if __name__ == '__main__':
	explodedData = DataExploder()