When I ran an exhaustive test on floating point, I noticed a performance issue with the fault-generated testbench. The test vector size is 0x500 * 0x500 = 0x190000, about 1.6 million data points. Here is the profiling result:

The entire run takes 22k seconds, about 6 hours. As a comparison, a complete exhaustive test using SystemVerilog + DPI takes only 10 hours to finish (4 billion test points). So the per-vector performance gap is orders of magnitude (~2000x).
Notice that the RTL simulation alone takes about 12k seconds. This is due to the sheer size of the generated testbench code, which is 408 MB.
I think there are several places where it can be improved:
1. hwtypes computation. The native C implementation is about 100x faster. I looked at the actual float implementation and it is quite inefficient:
```python
exp = 1 - bias
s = ['-0.' if sign[0] else '0.', mantissa.binary_string()]
s.append('e')
s.append(str(exp))
return cls(gmpy2.mpfr(''.join(s), cls.mantissa_size + 1, 2))
```
It converts to and from the native gmp object via strings. I believe a native conversion is available?
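To illustrate what a string-free conversion looks like, here is a minimal sketch of decoding the subnormal case (the `0.mantissa * 2^(1 - bias)` path quoted above) with pure arithmetic. This uses the stdlib's `math.ldexp` for clarity; gmpy2's mpfr could be built the same way by scaling an integer mantissa by a power of two instead of parsing a formatted string. `decode_subnormal` is a hypothetical helper, not the hwtypes API.

```python
import math

def decode_subnormal(sign: int, mantissa_field: int,
                     exp_bits: int = 8, mant_bits: int = 23) -> float:
    """Decode an IEEE-754 subnormal from its raw bit fields arithmetically,
    mirroring the quoted string-based path without any string round-trip."""
    bias = (1 << (exp_bits - 1)) - 1
    # 0.mantissa * 2^(1 - bias)  ==  mantissa_field * 2^(1 - bias - mant_bits)
    value = math.ldexp(mantissa_field, 1 - bias - mant_bits)
    return -value if sign else value

# Smallest positive float32 subnormal (mantissa field == 1) is 2^-149.
assert decode_subnormal(0, 1) == 2.0 ** -149
assert decode_subnormal(1, 1) == -(2.0 ** -149)
```

The same shape (integer mantissa plus a power-of-two scale) avoids both the string formatting and the base-2 parsing on every conversion.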
2. Getting frame information, in particular the filename and line number. I had the same performance issue before and was able to resolve it by implementing the same logic with the native Python C API. It is included in the kratos package. I can give you a pointer on how to use it if you're interested. On my benchmark it is about 100-500x faster than doing the same thing in pure Python.
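Even before switching to the C API, a cheap pure-Python mitigation is to fetch the frame directly with `sys._getframe` rather than `inspect.stack()`, which walks the whole stack and reads source context lines from disk on every call. A sketch (`caller_info` is a hypothetical helper, not fault's actual code):

```python
import sys

def caller_info(depth: int = 1):
    """Return (filename, lineno) of the caller using sys._getframe,
    which skips the full stack walk and source reads of inspect.stack()."""
    frame = sys._getframe(depth)
    return frame.f_code.co_filename, frame.f_lineno

def some_testbench_step():
    # Each recorded action can tag itself with its call site cheaply.
    return caller_info()

filename, lineno = some_testbench_step()
```

The kratos C-API helper presumably does the same frame lookup natively, removing even the Python attribute-access overhead.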
The real question is whether the inefficient testbench is an intrinsic problem in fault. My understanding is that all the expected values have to be known at compile time, which makes the testbench unscalable when there are lots of test vectors. One way to solve that is to allow the simulator to call Python code during simulation. I have a prototype working: https://github.com/Kuree/kratos-dpi/blob/master/tests/test_function.py#L18-L28
It works with both Verilator and commercial simulators.
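The runtime-callback idea can be sketched conceptually like this (this is not the kratos-dpi API, just an illustration of the structure): the generated testbench carries no precomputed constants, and instead the simulator calls back into a Python golden model for each vector, so testbench size stays constant in the number of vectors.

```python
def golden_model(a: int, b: int) -> int:
    # Hypothetical reference computation standing in for the DUT spec
    # (an 8-bit adder here, purely for illustration).
    return (a + b) & 0xFF

def expect(dut_value: int, a: int, b: int) -> None:
    # What a DPI-exported check could do when invoked from the simulator:
    # compute the expected value on the fly instead of embedding it.
    assert dut_value == golden_model(a, b), f"mismatch for ({a}, {b})"

# The testbench stays the same size no matter how many vectors we sweep.
for a in range(256):
    for b in range(256):
        expect((a + b) & 0xFF, a, b)
```

With the string-baked approach, those 65k checks would each be a literal in the generated file; with a callback, they are one loop in the simulator plus one Python function.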
Please let me know what you think.