int8*int8 -> float?

Hey,

I'm looking to perform `int8 * int8 -> fp32`, where at the output stage I dequantise the `int32_t` result into `float` (and then potentially add a bias). I was following the example from https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc#L305
But it seems that in order to unquantise to `float`, you compute the quantisation parameters from the fp32 result that you had already computed beforehand, which in practice I wouldn't know. I can compute it with a compensation factor, but that becomes incredibly complicated and expensive in both compute and memory. Are there any alternatives?
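
To spell out where the extra cost comes from, here is the expansion under the affine scheme used in the linked example, `real = scale * (quantized - zero_point)`. The symbols $s_a, z_a, s_b, z_b$ (scale and zero point of each input) and $K$ (the accumulation depth) are notation introduced here for illustration, not names from the codebase:

$$
\sum_{k=1}^{K} s_a\,(q^a_{ik} - z_a)\; s_b\,(q^b_{kj} - z_b)
= s_a s_b \left[ \sum_{k=1}^{K} q^a_{ik}\, q^b_{kj} \;-\; z_b \sum_{k=1}^{K} q^a_{ik} \;-\; z_a \sum_{k=1}^{K} q^b_{kj} \;+\; K\, z_a z_b \right]
$$

The first term in the brackets is the plain integer accumulator. The remaining three are the zero-point corrections (row sums, column sums, and a constant), which is presumably the compensation meant above. Note that the float result depends only on the input scales and zero points, and that all three correction terms vanish when $z_a = z_b = 0$.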

If I am able to assume quantisation into `int8` as opposed to `uint8` as in the example, I would be able to quantise without the `zero_point` parameter (assuming a zero-centred distribution), which would massively simplify dequantisation. Do you support this? Do you have any examples in the codebase where something like this is done?
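
For concreteness, here is a minimal sketch of the scheme being asked about, assuming symmetric per-tensor quantisation with the zero point fixed at 0. It does not use any gemmlowp API: `QuantizedTensor`, `QuantizeSymmetric` and `MatMulDequantize` are hypothetical names, and the naive triple loop only stands in for a real int8 GEMM kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric per-tensor quantisation: real ~= scale * q, with zero_point == 0.
// Assumption: the data is roughly zero-centred, so no offset is needed.
struct QuantizedTensor {
  std::vector<std::int8_t> data;
  float scale;
};

// Maps floats onto [-127, 127] using the largest absolute value as the range.
QuantizedTensor QuantizeSymmetric(const std::vector<float>& x) {
  float max_abs = 0.0f;
  for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
  const float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;

  QuantizedTensor q{std::vector<std::int8_t>(x.size()), scale};
  for (std::size_t i = 0; i < x.size(); ++i) {
    const float r = std::round(x[i] / scale);
    q.data[i] =
        static_cast<std::int8_t>(std::min(127.0f, std::max(-127.0f, r)));
  }
  return q;
}

// Computes C = A * B + bias in float, where A is m x k and B is k x n
// (both row-major int8) and bias has one entry per output column.
// The int32 accumulator is dequantised with the product of the input scales,
// so no quantisation parameters for the output are required.
std::vector<float> MatMulDequantize(const QuantizedTensor& a,
                                    const QuantizedTensor& b,
                                    const std::vector<float>& bias,
                                    std::size_t m, std::size_t k,
                                    std::size_t n) {
  assert(a.data.size() == m * k && b.data.size() == k * n &&
         bias.size() == n);
  const float out_scale = a.scale * b.scale;
  std::vector<float> c(m * n);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      std::int32_t acc = 0;
      for (std::size_t p = 0; p < k; ++p) {
        acc += static_cast<std::int32_t>(a.data[i * k + p]) *
               static_cast<std::int32_t>(b.data[p * n + j]);
      }
      c[i * n + j] = out_scale * static_cast<float>(acc) + bias[j];
    }
  }
  return c;
}
```

With this layout, the output stage is a single multiply by `a.scale * b.scale` followed by the bias add. The cost is some lost precision whenever the input distribution is not actually centred on zero.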

int8*int8 -> float? #203
