Hello everyone, does this ring a bell for anybody?
I am running many instances (100K+) of a relatively complicated and computationally intensive program. A few instances fail, and it looks like concurrency bugs, which may be in my own code and not in CM3. It is also a complicated system in the sense that it links with third-party C++ software through pipes ... the C++ is linked with C to make a simple front-end, the M3 talks to the C++/C code via two-way pipes (send commands, retrieve results).
I attached to one of the instances stuck at 100% CPU and see the following in what seems to be the active thread:
#0 ThreadPThread__SignalThread (M3_DMxDjQ_act=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/thread/PTHREAD/ThreadPThread.m3:1078
#1 0x00000000004dbb57 in ThreadPThread__StartWorld () at ../src/thread/PTHREAD/ThreadPThread.m3:1234
#2 0x00000000004daa43 in RTThread__ResumeOthers () at ../src/thread/PTHREAD/ThreadPThread.m3:893
#3 0x00000000004c0f96 in RTCollector__CollectSomeInStateZero () at ../src/runtime/common/RTCollector.m3:911
#4 0x00000000004c060f in RTCollector__CollectSome () at ../src/runtime/common/RTCollector.m3:731
#5 0x00000000004c028a in RTHeapRep__CollectEnough () at ../src/runtime/common/RTCollector.m3:665
#6 0x00000000004bce2e in RTAllocator__AllocTraced (M3_Cwb5VA_dataSize=<error reading variable: inner value printing not implemented for language "Auto">, M3_Cwb5VA_dataAlignment=<error reading variable: inner value printing not implemented for language "Auto">, M3_A1A6Wn_thread=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTAllocator.m3:367
#7 0x00000000004bbf08 in RTAllocator__GetTracedObj (M3_Eic7CK_def=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTAllocator.m3:224
#8 0x00000000004bb8ab in RTHooks__AllocateTracedObj (M3_AJWxb1_defn=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTAllocator.m3:122
#9 0x0000000000497f8e in TextRd__New (M3_Bd56fi_t=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/rw/TextRd.m3:41
#10 0x00000000004a52e9 in Scan__ScanWord (M3_Bd56fi_txt=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/fmtlex/Scan.m3:41
#11 0x00000000004a5438 in Scan__Int (M3_Bd56fi_txt=<error reading variable: inner value printing not implemented for language "Auto">, M3_AwZxt9_defaultBase=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/fmtlex/Scan.m3:53
#12 0x000000000047b8d1 in TextReader__GetInt (M3_A0o49i_t=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/TextReader.m3:283
#13 0x000000000041b8f6 in Fsdb__Parse__GetTimesteps.1255 (M3_Cwb5VA_fsdbId=<error reading variable: inner value printing not implemented for language "Auto">, M3_DRfo9P_steps=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/Fsdb.m3:585
#14 0x000000000041a987 in Fsdb__Parse (M3_Bd56fi_wd=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_ofn=<error reading variable: inner value printing not implemented for language "Auto">, M3_DKdYgt_names=<error reading variable: inner value printing not implemented for language "Auto">, M3_Cwb5VA_maxFiles=<error reading variable: inner value printing not implemented for language "Auto">,
M3_B4NT1D_nFiles=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_timeScaleFactor=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_timeOffset=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_voltageScaleFactor=<error reading variable: inner value printing not implemented for language "Auto">,
M3_CtKayy_voltageOffset=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_dutName=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_fsdbPath=<error reading variable: inner value printing not implemented for language "Auto">, M3_AicXUJ_wait=<error reading variable: inner value printing not implemented for language "Auto">,
M3_CT76zs_restrictNodes=<error reading variable: inner value printing not implemented for language "Auto">, M3_DbnU2S_restrictRegEx=<error reading variable: inner value printing not implemented for language "Auto">, M3_Cwb5VA_maxNodes=<error reading variable: inner value printing not implemented for language "Auto">, M3_AicXUJ_translate=<error reading variable: inner value printing not implemented for language "Auto">,
M3_AicXUJ_noX=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_scopesep=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_cmdPath=<error reading variable: inner value printing not implemented for language "Auto">, M3_Ehfbg0_compress=<error reading variable: inner value printing not implemented for language "Auto">,
M3_Cwb5VA_threads=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_interpolate=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_maxTime=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/Fsdb.m3:766
#15 0x0000000000409275 in ConvertTrace__DoConversion (M3_DfYBRk_conv=...) at ../src/ConvertTrace.m3:348
#16 0x000000000040e112 in ConvertTrace_M3 (M3_AcxOUs_mode=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/ConvertTrace.m3:762
#17 0x00000000004cb75a in RTLinker__RunMainBody (M3_DjPxE3_m=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTLinker.m3:408
#18 0x00000000004ca965 in RTLinker__AddUnitI (M3_DjPxE3_m=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTLinker.m3:115
#19 0x00000000004caa11 in RTLinker__AddUnit (M3_DjPxE5_b=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTLinker.m3:124
#20 0x00000000004061f8 in main (argc=21, argv=0x7fffffffcbd8, envp=0x7fffffffcc88) at _m3main.c:31
Looking at what's happening, I see this:
(gdb) dir /p/cth/rtl/cad/x86-64_linux44/opensource/cm3/d5.11.0-20210425/pkg/m3core/src/
Source directories searched: /p/cth/rtl/cad/x86-64_linux44/opensource/cm3/d5.11.0-20210425/pkg/m3core/src:$cdir:$cwd
(gdb) list
1074 BEGIN
1075 <ASSERT SIG_SUSPEND # 0>
1076 LOOP
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
1078 IF z = 0 THEN EXIT END;
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
1080 (* try it again... *)
1081 END;
1082 END;
1083 END SignalThread;
(gdb) next
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
(gdb)
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
(gdb)
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
(gdb)
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
(gdb)
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
(gdb)
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
(gdb)
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
That is pthread_kill keeps getting EAGAIN repeatedly from ... attempting to start the other thread?
Here is the calling code:
(gdb) list
1229 newlySent := 0;
1230 act := me.next;
1231 WHILE act # me DO
1232 <ASSERT act.state # ActState.Stopping>
1233 IF act.state # ActState.Started THEN
1234 SignalThread(act); <======= HERE
1235 INC(newlySent);
1236 END;
1237 act := act.next;
1238 END;
(gdb)
and the caller of that
888 PROCEDURE ResumeOthers () =
889 (* LL=activeMu. Always preceded by SuspendOthers. *)
890 BEGIN
891 <ASSERT suspended>
892 suspended := FALSE;
893 StartWorld(); <============ HERE
894 WITH r = pthread_mutex_unlock(activeMu) DO <ASSERT r=0> END;
895 END ResumeOthers;
Any thoughts, hints?
Hello everyone, does this ring a bell for anybody?
I am running many instances (100K+) of a relatively complicated and computationally intensive program. A few instances fail, and it looks like concurrency bugs, which may be in my own code and not in CM3. It is also a complicated system in the sense that it links with third-party C++ software through pipes ... the C++ is linked with C to make a simple front-end, the M3 talks to the C++/C code via two-way pipes (send commands, retrieve results).
I attached to one of the instances stuck at 100% CPU and see the following in what seems to be the active thread:
#0 ThreadPThread__SignalThread (M3_DMxDjQ_act=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/thread/PTHREAD/ThreadPThread.m3:1078
#1 0x00000000004dbb57 in ThreadPThread__StartWorld () at ../src/thread/PTHREAD/ThreadPThread.m3:1234
#2 0x00000000004daa43 in RTThread__ResumeOthers () at ../src/thread/PTHREAD/ThreadPThread.m3:893
#3 0x00000000004c0f96 in RTCollector__CollectSomeInStateZero () at ../src/runtime/common/RTCollector.m3:911
#4 0x00000000004c060f in RTCollector__CollectSome () at ../src/runtime/common/RTCollector.m3:731
#5 0x00000000004c028a in RTHeapRep__CollectEnough () at ../src/runtime/common/RTCollector.m3:665
#6 0x00000000004bce2e in RTAllocator__AllocTraced (M3_Cwb5VA_dataSize=<error reading variable: inner value printing not implemented for language "Auto">, M3_Cwb5VA_dataAlignment=<error reading variable: inner value printing not implemented for language "Auto">, M3_A1A6Wn_thread=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTAllocator.m3:367
#7 0x00000000004bbf08 in RTAllocator__GetTracedObj (M3_Eic7CK_def=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTAllocator.m3:224
#8 0x00000000004bb8ab in RTHooks__AllocateTracedObj (M3_AJWxb1_defn=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTAllocator.m3:122
#9 0x0000000000497f8e in TextRd__New (M3_Bd56fi_t=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/rw/TextRd.m3:41
#10 0x00000000004a52e9 in Scan__ScanWord (M3_Bd56fi_txt=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/fmtlex/Scan.m3:41
#11 0x00000000004a5438 in Scan__Int (M3_Bd56fi_txt=<error reading variable: inner value printing not implemented for language "Auto">, M3_AwZxt9_defaultBase=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/fmtlex/Scan.m3:53
#12 0x000000000047b8d1 in TextReader__GetInt (M3_A0o49i_t=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/TextReader.m3:283
#13 0x000000000041b8f6 in Fsdb__Parse__GetTimesteps.1255 (M3_Cwb5VA_fsdbId=<error reading variable: inner value printing not implemented for language "Auto">, M3_DRfo9P_steps=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/Fsdb.m3:585
#14 0x000000000041a987 in Fsdb__Parse (M3_Bd56fi_wd=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_ofn=<error reading variable: inner value printing not implemented for language "Auto">, M3_DKdYgt_names=<error reading variable: inner value printing not implemented for language "Auto">, M3_Cwb5VA_maxFiles=<error reading variable: inner value printing not implemented for language "Auto">,
M3_B4NT1D_nFiles=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_timeScaleFactor=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_timeOffset=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_voltageScaleFactor=<error reading variable: inner value printing not implemented for language "Auto">,
M3_CtKayy_voltageOffset=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_dutName=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_fsdbPath=<error reading variable: inner value printing not implemented for language "Auto">, M3_AicXUJ_wait=<error reading variable: inner value printing not implemented for language "Auto">,
M3_CT76zs_restrictNodes=<error reading variable: inner value printing not implemented for language "Auto">, M3_DbnU2S_restrictRegEx=<error reading variable: inner value printing not implemented for language "Auto">, M3_Cwb5VA_maxNodes=<error reading variable: inner value printing not implemented for language "Auto">, M3_AicXUJ_translate=<error reading variable: inner value printing not implemented for language "Auto">,
M3_AicXUJ_noX=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_scopesep=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_cmdPath=<error reading variable: inner value printing not implemented for language "Auto">, M3_Ehfbg0_compress=<error reading variable: inner value printing not implemented for language "Auto">,
M3_Cwb5VA_threads=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_interpolate=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_maxTime=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/Fsdb.m3:766
#15 0x0000000000409275 in ConvertTrace__DoConversion (M3_DfYBRk_conv=...) at ../src/ConvertTrace.m3:348
#16 0x000000000040e112 in ConvertTrace_M3 (M3_AcxOUs_mode=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/ConvertTrace.m3:762
#17 0x00000000004cb75a in RTLinker__RunMainBody (M3_DjPxE3_m=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTLinker.m3:408
#18 0x00000000004ca965 in RTLinker__AddUnitI (M3_DjPxE3_m=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTLinker.m3:115
#19 0x00000000004caa11 in RTLinker__AddUnit (M3_DjPxE5_b=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTLinker.m3:124
#20 0x00000000004061f8 in main (argc=21, argv=0x7fffffffcbd8, envp=0x7fffffffcc88) at _m3main.c:31
Looking at what's happening, I see this:
(gdb) dir /p/cth/rtl/cad/x86-64_linux44/opensource/cm3/d5.11.0-20210425/pkg/m3core/src/
Source directories searched: /p/cth/rtl/cad/x86-64_linux44/opensource/cm3/d5.11.0-20210425/pkg/m3core/src:$cdir:$cwd
(gdb) list
1074 BEGIN
1075 <ASSERT SIG_SUSPEND # 0>
1076 LOOP
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
1078 IF z = 0 THEN EXIT END;
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
1080 (* try it again... *)
1081 END;
1082 END;
1083 END SignalThread;
(gdb) next
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
(gdb)
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
(gdb)
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
(gdb)
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
(gdb)
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
(gdb)
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
(gdb)
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
That is pthread_kill keeps getting EAGAIN repeatedly from ... attempting to start the other thread?
Here is the calling code:
(gdb) list
1229 newlySent := 0;
1230 act := me.next;
1231 WHILE act # me DO
1232 <ASSERT act.state # ActState.Stopping>
1233 IF act.state # ActState.Started THEN
1234 SignalThread(act); <======= HERE
1235 INC(newlySent);
1236 END;
1237 act := act.next;
1238 END;
(gdb)
and the caller of that
888 PROCEDURE ResumeOthers () =
889 (* LL=activeMu. Always preceded by SuspendOthers. *)
890 BEGIN
891 <ASSERT suspended>
892 suspended := FALSE;
893 StartWorld(); <============ HERE
894 WITH r = pthread_mutex_unlock(activeMu) DO <ASSERT r=0> END;
895 END ResumeOthers;
Any thoughts, hints?