feat: Add supervisor:get_tree/1 for full supervision tree introspection#10972
feat: Add supervisor:get_tree/1 for full supervision tree introspection#10972km-git007 wants to merge 2 commits intoerlang:masterfrom
Conversation
CT Test Results 2 files 100 suites 1h 8m 54s ⏱️ For more details on these failures, see this check. Results for commit 3388e9a. ♻️ This comment has been updated with latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally. Artifacts
// Erlang/OTP Github Action Bot |
| SupPid = get_sup_pid(Supervisor), | ||
| {SupPid, [build_tree_node(Id, Child, Type, Modules) || | ||
| {Id, Child, Type, Modules} <- Children]} |
There was a problem hiding this comment.
It seems get_sup_pid is a bit unrelated to this function. If you need the supervisor pid, then perhaps you could call a separate function and expand it yourself, and keep it outside of this new functionality?
Furthermore, there is a subtle race condition in this code. The Supervisor can actually change between which_children and get_sup_pid, so you would return the children of one supervisor and the Pid of another.
There was a problem hiding this comment.
Thanks for catching both issues. I've fixed the race condition by resolving the supervisor reference first via resolve_supervisor_reference/1 and then passing the stable PID to which_children/1. I've also extracted the helper into a standalone resolve_supervisor_reference/1 function with a spec and a comment explaining the race-avoidance rationale, keeping it separate from the tree-building logic.
| %% Add a worker child | ||
| {ok, Child1} = supervisor:start_child(SupPid, #{id => child1, | ||
| start => {supervisor_1, start_child, []}}), | ||
| {SupPid, [{child1, Child1, worker, [supervisor_1], []}]} = supervisor:get_tree(SupPid), |
There was a problem hiding this comment.
This is not testing many of the scenarios in the code above, such as different names, undefined children, etc.
There was a problem hiding this comment.
Good catch. I've added 5 new tests covering: local atom name reference, global name reference, undefined (terminated) children, nested supervisor hierarchies, and mixed running/undefined child states. These are now grouped under a get_tree test group in the suite.
| %% Helper function to build a tree node for a single child | ||
| build_tree_node(Id, Child, Type, Modules) -> | ||
| Subtree = case Type of | ||
| supervisor when is_pid(Child), Child =/= undefined -> |
There was a problem hiding this comment.
If Child is a pid, then it is certainly =/= from undefined:
| supervisor when is_pid(Child), Child =/= undefined -> | |
| supervisor when is_pid(Child) -> |
There was a problem hiding this comment.
Correct, is_pid/1 already excludes undefined. Applied the suggestion.
Operators debugging distributed Erlang applications often need to understand the complete supervision structure at runtime. Previously, this required manually walking the supervision tree by repeatedly calling supervisor:which_children/1 on each supervisor found. This commit introduces supervisor:get_tree/1, which returns a complete hierarchical representation of the supervision tree rooted at a given supervisor. The function recursively builds a nested structure containing: - Child ID, PID, type (worker/supervisor), and modules - Full subtrees for child supervisors - Empty subtrees for worker processes The feature enables: - Better debugging of supervision hierarchies - Improved application visualization tools - Runtime introspection for operational monitoring - Automation of cluster topology discovery The implementation: - Uses existing supervisor:which_children/1 as its foundation - Handles all supervisor reference types (pid, atom, global, via) - Gracefully handles errors and nonexistent supervisors - Returns structured data suitable for JSON serialization or display Added comprehensive tests for: - Basic tree building with multiple children - Supervisor and worker process types - Terminated and undefined child processes - Error handling for invalid references
- Fix race condition in get_tree/1: resolve supervisor reference once via resolve_supervisor_reference/1 and reuse the stable PID for which_children/1, preventing children of one supervisor being paired with the PID of another when a registered name rebinds between calls - Rename get_sup_pid/1 to resolve_supervisor_reference/1 with -spec and comment explaining the race-avoidance intent; separates reference resolution from tree-building logic - Simplify build_tree_node/4 guard: is_pid(Child) already excludes undefined, so the redundant Child =/= undefined clause is removed - Add 5 tests covering previously untested scenarios: local atom name reference, global name reference, undefined children, nested supervisor hierarchies, and mixed running/undefined child states
feff915 to
3388e9a
Compare
| -type supervision_tree() :: {Pid :: pid(), | ||
| [supervision_tree_node()]} | | ||
| {error, no_supervision_info}. |
There was a problem hiding this comment.
The error response does not belong in this type...
| -type supervision_tree() :: {Pid :: pid(), | |
| [supervision_tree_node()]} | | |
| {error, no_supervision_info}. | |
| -type supervision_tree() :: {Pid :: pid(), | |
| [supervision_tree_node()]}. |
Instead it should be...
| ``` | ||
| """. | ||
| -doc(#{since => <<"OTP 28.0">>}). | ||
| -spec get_tree(SupRef) -> supervision_tree() when |
There was a problem hiding this comment.
... here.
| -spec get_tree(SupRef) -> supervision_tree() when | |
| -spec get_tree(SupRef) -> supervision_tree() | {error, no_supervision_info} when |
There was a problem hiding this comment.
Yes, agree.
But {error, no_supervision_info} is in itself not all that great. It glosses over, like, everything. SupRef could not be a process identifier (of sorts). Even if, the process could not be alive (any more). Even if, it could be that the process is not a supervisor. Or something else could have gone wrong when building the tree. All that is conflated into this one error tuple, and even that I would object to: it should rather be a badarg exception, depending on what went wrong.
There was a problem hiding this comment.
@Maria-12648430 you're not putting too fine a point on it, but I agree.
There was a problem hiding this comment.
Yeah, sorry... 🙍♀️ Just when the jetlag from going to Singapore started wearing off, I had to return to Germany, and there it is back again... 😩 So, sorry @km-git007 for being blunt. Nevertheless, my point stands.
|
@km-git007 We think the idea is interesting. It will take some polishing, and it is not in time for OTP-29.0 so we will come back to you with OTP-team feedback after the the release, as we need to prioritize the release right now. @josevalim , @Maria-12648430 and@juhlig usually have good feedback, so that you can consider in the mean time. It could end up in later version of 29 or in 30 depending on a lot of things that we can not address right now, just saying hang in there. |
| @@ -2537,3 +2596,43 @@ dyn_init(Child,State) when ?is_temporary(Child) -> | |||
| State#state{dynamics={mapsets,maps:new()}}; | |||
| dyn_init(_Child,State) -> | |||
| State#state{dynamics={maps,maps:new()}}. | |||
|
|
|||
| %% Resolve a supervisor reference (pid | atom | {global,Name} | {via,Mod,Name}) | |||
There was a problem hiding this comment.
A sup_ref can also be {Name, Node}, which your implementation does not consider.
Aside from that, I think a function like this would be better placed in the gen module (with another name, of course), since it equally applies to all other gen_* behaviors.
There was a problem hiding this comment.
Yeah. The other thing is that this function makes many assumptions (for example, it will catch child failures, it will return the PID, etc), which I am not sure are general enough for inclusion either here or in gen. It feels this is app specific code or, if the goal is debugging, add a version the prints the tree to the c module, without exposing the implementation details seen here.
There was a problem hiding this comment.
@josevalim are you referring to the build_tree_node function below? As far as that goes, I agree 👍
But in this comment, I was referring to only resolve_supervisor_reference, which determines the pid for a given sup_ref.
There was a problem hiding this comment.
since it equally applies to all other
gen_*behaviors.
Oh, that would indeed be handy, I'd love that 🥰
Operators debugging distributed Erlang applications often need to understand the complete supervision structure at runtime. Previously, this required manually walking the supervision tree by repeatedly calling supervisor:which_children/1 on each supervisor found.
This commit introduces supervisor:get_tree/1, which returns a complete hierarchical representation of the supervision tree rooted at a given supervisor. The function recursively builds a nested structure containing:
The feature enables:
The implementation:
Added comprehensive tests for: