Skip to content

Some sort of GC livelock issue... #1200

@mikanystrom-intel

Description

@mikanystrom-intel

Hello everyone, does this ring a bell for anybody?

I am running many instances (100K+) of a relatively complicated and computationally intensive program. A few instances fail, and it looks like concurrency bugs, which may be in my own code and not in CM3. It is also a complicated system in the sense that it links with third-party C++ software through pipes ... the C++ is linked with C to make a simple front-end, the M3 talks to the C++/C code via two-way pipes (send commands, retrieve results).

I attached to one of the instances stuck at 100% CPU and see the following in what seems to be the active thread:

#0 ThreadPThread__SignalThread (M3_DMxDjQ_act=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/thread/PTHREAD/ThreadPThread.m3:1078
#1 0x00000000004dbb57 in ThreadPThread__StartWorld () at ../src/thread/PTHREAD/ThreadPThread.m3:1234
#2 0x00000000004daa43 in RTThread__ResumeOthers () at ../src/thread/PTHREAD/ThreadPThread.m3:893
#3 0x00000000004c0f96 in RTCollector__CollectSomeInStateZero () at ../src/runtime/common/RTCollector.m3:911
#4 0x00000000004c060f in RTCollector__CollectSome () at ../src/runtime/common/RTCollector.m3:731
#5 0x00000000004c028a in RTHeapRep__CollectEnough () at ../src/runtime/common/RTCollector.m3:665
#6 0x00000000004bce2e in RTAllocator__AllocTraced (M3_Cwb5VA_dataSize=<error reading variable: inner value printing not implemented for language "Auto">, M3_Cwb5VA_dataAlignment=<error reading variable: inner value printing not implemented for language "Auto">, M3_A1A6Wn_thread=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTAllocator.m3:367
#7 0x00000000004bbf08 in RTAllocator__GetTracedObj (M3_Eic7CK_def=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTAllocator.m3:224
#8 0x00000000004bb8ab in RTHooks__AllocateTracedObj (M3_AJWxb1_defn=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTAllocator.m3:122
#9 0x0000000000497f8e in TextRd__New (M3_Bd56fi_t=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/rw/TextRd.m3:41
#10 0x00000000004a52e9 in Scan__ScanWord (M3_Bd56fi_txt=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/fmtlex/Scan.m3:41
#11 0x00000000004a5438 in Scan__Int (M3_Bd56fi_txt=<error reading variable: inner value printing not implemented for language "Auto">, M3_AwZxt9_defaultBase=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/fmtlex/Scan.m3:53
#12 0x000000000047b8d1 in TextReader__GetInt (M3_A0o49i_t=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/TextReader.m3:283
#13 0x000000000041b8f6 in Fsdb__Parse__GetTimesteps.1255 (M3_Cwb5VA_fsdbId=<error reading variable: inner value printing not implemented for language "Auto">, M3_DRfo9P_steps=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/Fsdb.m3:585
#14 0x000000000041a987 in Fsdb__Parse (M3_Bd56fi_wd=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_ofn=<error reading variable: inner value printing not implemented for language "Auto">, M3_DKdYgt_names=<error reading variable: inner value printing not implemented for language "Auto">, M3_Cwb5VA_maxFiles=<error reading variable: inner value printing not implemented for language "Auto">,
M3_B4NT1D_nFiles=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_timeScaleFactor=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_timeOffset=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_voltageScaleFactor=<error reading variable: inner value printing not implemented for language "Auto">,
M3_CtKayy_voltageOffset=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_dutName=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_fsdbPath=<error reading variable: inner value printing not implemented for language "Auto">, M3_AicXUJ_wait=<error reading variable: inner value printing not implemented for language "Auto">,
M3_CT76zs_restrictNodes=<error reading variable: inner value printing not implemented for language "Auto">, M3_DbnU2S_restrictRegEx=<error reading variable: inner value printing not implemented for language "Auto">, M3_Cwb5VA_maxNodes=<error reading variable: inner value printing not implemented for language "Auto">, M3_AicXUJ_translate=<error reading variable: inner value printing not implemented for language "Auto">,
M3_AicXUJ_noX=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_scopesep=<error reading variable: inner value printing not implemented for language "Auto">, M3_Bd56fi_cmdPath=<error reading variable: inner value printing not implemented for language "Auto">, M3_Ehfbg0_compress=<error reading variable: inner value printing not implemented for language "Auto">,
M3_Cwb5VA_threads=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_interpolate=<error reading variable: inner value printing not implemented for language "Auto">, M3_CtKayy_maxTime=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/Fsdb.m3:766
#15 0x0000000000409275 in ConvertTrace__DoConversion (M3_DfYBRk_conv=...) at ../src/ConvertTrace.m3:348
#16 0x000000000040e112 in ConvertTrace_M3 (M3_AcxOUs_mode=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/ConvertTrace.m3:762
#17 0x00000000004cb75a in RTLinker__RunMainBody (M3_DjPxE3_m=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTLinker.m3:408
#18 0x00000000004ca965 in RTLinker__AddUnitI (M3_DjPxE3_m=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTLinker.m3:115
#19 0x00000000004caa11 in RTLinker__AddUnit (M3_DjPxE5_b=<error reading variable: inner value printing not implemented for language "Auto">) at ../src/runtime/common/RTLinker.m3:124
#20 0x00000000004061f8 in main (argc=21, argv=0x7fffffffcbd8, envp=0x7fffffffcc88) at _m3main.c:31

Looking at what's happening, I see this:

(gdb) dir /p/cth/rtl/cad/x86-64_linux44/opensource/cm3/d5.11.0-20210425/pkg/m3core/src/
Source directories searched: /p/cth/rtl/cad/x86-64_linux44/opensource/cm3/d5.11.0-20210425/pkg/m3core/src:$cdir:$cwd
(gdb) list
1074 BEGIN
1075 <ASSERT SIG_SUSPEND # 0>
1076 LOOP
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
1078 IF z = 0 THEN EXIT END;
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
1080 (* try it again... *)
1081 END;
1082 END;
1083 END SignalThread;
(gdb) next
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
(gdb)
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
(gdb)
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
(gdb)
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
(gdb)
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;
(gdb)
1079 IF z # Uerror.EAGAIN THEN DieI(ThisLine(), z) END;
(gdb)
1077 WITH z = pthread_kill(act.handle, SIG_SUSPEND) DO
(gdb)
1078 IF z = 0 THEN EXIT END;

That is pthread_kill keeps getting EAGAIN repeatedly from ... attempting to start the other thread?

Here is the calling code:

(gdb) list
1229 newlySent := 0;
1230 act := me.next;
1231 WHILE act # me DO
1232 <ASSERT act.state # ActState.Stopping>
1233 IF act.state # ActState.Started THEN
1234 SignalThread(act); <======= HERE
1235 INC(newlySent);
1236 END;
1237 act := act.next;
1238 END;
(gdb)

and the caller of that

888 PROCEDURE ResumeOthers () =
889 (* LL=activeMu. Always preceded by SuspendOthers. *)
890 BEGIN
891 <ASSERT suspended>
892 suspended := FALSE;
893 StartWorld(); <============ HERE
894 WITH r = pthread_mutex_unlock(activeMu) DO <ASSERT r=0> END;
895 END ResumeOthers;

Any thoughts, hints?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions