f.extend(output_types);

// output_structure - sorted deduped discriminants
let structure_types = sorted_deduped(
It seems wrong that this is a vec. Can outputs be structured in more than one way?
I guess yes... but this isn't comparable across txs because the vec has no fixed schema. The position of each field shifts depending on the tx's data: e.g. a tx with two output types pushes two values before output_structure, while a tx with one output type pushes only one, so output_structure ends up at a different index in each vec.
One-hot encoding seems useful here; it's a machine learning technique. Instead of putting the discriminant into the vec, we ask binary questions, so each possible characteristic gets its own fixed position with a binary 0/1 value, which makes the vecs comparable.
makes sense?
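A minimal sketch of the fixed-position idea, assuming a hypothetical `OutputStructureType` enum with three variants (the names mirror the Python enum quoted later in this thread; the slot layout and the `encode` helper are illustrative assumptions, not the crate's API). Because several traits can be set at once, this is strictly a multi-hot vector:

```rust
/// Hypothetical traits; names follow the Python enum quoted in this thread.
/// The explicit discriminants double as fixed slot indices (an assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OutputStructureType {
    Double = 0,
    Multi = 1,
    Bip69 = 2,
}

/// Number of fixed slots in the encoded vector, one per variant.
pub const SLOTS: usize = 3;

/// Answers one binary question per slot: 1 if the trait was observed, 0 otherwise.
/// Duplicates and input order do not affect the result.
pub fn encode(observed: &[OutputStructureType]) -> [u8; SLOTS] {
    let mut slots = [0u8; SLOTS];
    for t in observed {
        slots[*t as usize] = 1;
    }
    slots
}
```

With this layout, slot 2 always means "BIP69 ordering observed", regardless of how many other traits a tx has, so two encoded vectors can be compared position by position.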
Yes to one-hot encoding.
Encoding aside, it's just odd to me that outputs can be organized in more than one way, e.g. first sort inputs by BIP69, then sort them again by age (?). We would only register two enum variants if the oldest coins happen to be in lexicographical order. Talking through it now, it seems wrong. For some context, this part of the fingerprinting code was ported from the Python library, so I might be missing something.
Checking the Python code, get_output_structure intentionally returns multiple variants because DOUBLE/MULTI describe output count and BIP69/CHANGE_LAST describe ordering; different dimensions can coexist.
def get_output_structure(tx):
    if len(vout) == 2:
        output_structure.append(OutputStructureType.DOUBLE)
    else:
        output_structure.append(OutputStructureType.MULTI)
    if change_index == len(tx["vout"]) - 1:
        output_structure.append(OutputStructureType.CHANGE_LAST)
    if sorted(amounts) == amounts:
        output_structure.append(OutputStructureType.BIP69)
The problem is the port to a flat fingerprint vec where position = meaning.
Also noticed CHANGE_LAST exists in Python but not in the Rust port.
> also noticed CHANGE_LAST exists in python but not in the rust port
That was deliberate. The plan was to build support for change identification outside of the fingerprinting crate.
> if len(vout) == 2:
>     output_structure.append(OutputStructureType.DOUBLE)
> else:
>     output_structure.append(OutputStructureType.MULTI)
Yeah. I think we should apply separation of concerns here: 1. an is_bip69 method, and 2. output_structure -> [no_change, with_change, batch_payments, consolidation, unknown] (coinjoin output decomposition would fall into the unknown category).
One method is concerned with how the outputs are sorted. The other is concerned with the perceived semantics of the outputs (is it a batch payment, a consolidation, or something else?).
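A rough sketch of that split, assuming a hypothetical `TxOut` struct rather than the crate's real transaction types. The `OutputStructure` variant names follow the list above. The ordering check uses the BIP69 output rule (amount ascending, ties broken by scriptPubKey bytes), which is stricter than the amount-only comparison in the Python excerpt. No classification heuristic is shown, since none has been agreed on in this thread:

```rust
/// Hypothetical minimal output type, for illustration only.
pub struct TxOut {
    pub value: u64,
    pub script_pubkey: Vec<u8>,
}

/// Perceived semantics of the outputs. Exactly one variant applies per tx,
/// so this can be a single value instead of a vec.
#[derive(Debug, PartialEq, Eq)]
pub enum OutputStructure {
    NoChange,
    WithChange,
    BatchPayments,
    Consolidation,
    Unknown,
}

/// Ordering only: true if outputs are sorted by amount ascending, with ties
/// broken by lexicographic comparison of the scriptPubKey bytes.
/// Zero or one output is trivially sorted.
pub fn is_bip69(outputs: &[TxOut]) -> bool {
    outputs.windows(2).all(|w| {
        (w[0].value, &w[0].script_pubkey) <= (w[1].value, &w[1].script_pubkey)
    })
}
```

Keeping the two apart means the ordering question becomes a single boolean slot in the fingerprint, while the semantic category stays a single-valued field.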
For storing vecs of generic data, such as a normalized vector of wallet fingerprints.
AST node for collecting a normalized vec of fingerprinting values. Currently the loose version is not supported. Have to mess around with getting actual txids supported.
BIP69 is concerned only with how outputs are sorted, while output structure attempts to infer meaning. Currently it's not inferring much.
I noticed this PR has some conflicts and is still marked as draft. Would you prefer it to be reviewed now, or should we wait until it's ready?
Please wait. I will get this shaped up soon.
The first commit creates a new value type (something that can be created by an AST node). It's a generic vec type. I imagine this may be useful in other analyses as well. The second commit normalizes the fingerprints of a tx set. The idea here is to get a sense of how sparse or dense the fingerprints are on chain. In the future we will also want to assign this normalized fingerprinting vector to clusters.
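One possible reading of "sparse or dense", sketched under assumptions: if every fingerprint is a fixed-width 0/1 vector, the fraction of txs that set each slot gives a per-slot density. The `slot_density` function below is hypothetical and is not taken from the commits described here:

```rust
/// Hypothetical helper: for each slot, the fraction of fingerprints in the set
/// that have a non-zero value there. Returns an empty vec for an empty set.
/// Panics if the fingerprints do not all share the same width.
pub fn slot_density(fingerprints: &[Vec<u8>]) -> Vec<f64> {
    let Some(first) = fingerprints.first() else {
        return Vec::new();
    };
    let width = first.len();
    let mut counts = vec![0usize; width];
    for fp in fingerprints {
        assert_eq!(fp.len(), width, "fixed schema requires equal-length vectors");
        for (slot, bit) in fp.iter().enumerate() {
            if *bit != 0 {
                counts[slot] += 1;
            }
        }
    }
    let n = fingerprints.len() as f64;
    counts.into_iter().map(|c| c as f64 / n).collect()
}
```

A slot with density near 0 or 1 carries little distinguishing information across the set, while mid-range slots split the txs more evenly.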
cc @Mshehu5 @bc1cindy
cherry picked from #16