No meshnet gRPC connections made / timeout - ceos multi-host lab #633

@mgisch

I'm trying to set up a 2-node ceos lab using the topology from the examples (https://github.com/openconfig/kne/blob/main/examples/arista/ceos-150/ceos-150.pb.txt), scaled down to only 4 nodes for now and wired in a ring as in that example. The ceos version is 4.33.1F, in case that is relevant.

The control-plane node has pod placement enabled.
When I force all pods to place on the same node (either one) everything works.
When they're split, any pod whose neighbors are both placed on its own node will start, but every pod with a neighbor on the other node is stuck in the init state forever; i.e., all cross-node links fail to be created and the init-wait image never completes.
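To make the symptom concrete, here is a minimal sketch that lists which links cross hosts, and therefore which pods would be expected to hang, for a given placement. The pod names other than r3, the node names `node-a`/`node-b`, and the placement itself are hypothetical; substitute the real values (for example from `kubectl get pods -o wide`).

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def cross_node_links(links: List[Link], placement: Dict[str, str]) -> List[Link]:
    """Return the links whose two endpoints are scheduled on different nodes."""
    return [(a, b) for a, b in links if placement[a] != placement[b]]


def pods_expected_stuck(links: List[Link], placement: Dict[str, str]) -> List[str]:
    """Return pods that have at least one neighbor on another node."""
    stuck = set()
    for a, b in cross_node_links(links, placement):
        stuck.update((a, b))
    return sorted(stuck)


# Hypothetical 4-pod ring and placement (illustrative only, not taken from the lab).
ring = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r4", "r1")]
placement = {"r1": "node-a", "r2": "node-a", "r3": "node-a", "r4": "node-b"}

print(cross_node_links(ring, placement))
print(pods_expected_stuck(ring, placement))
```

With this assumed placement only r2 has both neighbors on its own node, so it is the only pod that would be expected to start.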

Event log contains many of these:
26s Warning FailedCreatePodSandBox pod/r3 (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_r3_ceos-4_a70c8de5-bf1e-446d-9d34-517bf09fa6f6_0(7e6f3ad44bc0105b7320f290222a9b837ce650585d2d5ab659849ba7f0248f1f): error adding pod ceos-4_r3 to CNI network "cbr0": plugin type="meshnet" name="meshnet" failed (add): rpc error: code = Unavailable desc = failed to receive server preface within timeout
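The event message nests two gRPC errors: the outer `Unknown` from the sandbox creation and the inner `Unavailable` from the meshnet plugin call. A small sketch for pulling those codes out of a batch of event messages follows; the helper names are my own, and the regular expression only assumes the `rpc error: code = ... desc = ...` wording seen in the event above.

```python
import re
from typing import List, Optional

# Matches the "rpc error: code = <Code> desc =" wording seen in the event text.
_RPC_CODE = re.compile(r"rpc error: code = (\w+) desc = ")


def grpc_codes(message: str) -> List[str]:
    """Return every gRPC status code named in the message, outermost first."""
    return _RPC_CODE.findall(message)


def innermost_code(message: str) -> Optional[str]:
    """Return the last (innermost) gRPC status code, or None if there is none."""
    codes = grpc_codes(message)
    return codes[-1] if codes else None


# Abbreviated copy of the event above; the elided part is marked with "...".
event = (
    "Failed to create pod sandbox: rpc error: code = Unknown desc = "
    "failed to create pod network sandbox ...: "
    'plugin type="meshnet" name="meshnet" failed (add): '
    "rpc error: code = Unavailable desc = "
    "failed to receive server preface within timeout"
)

print(grpc_codes(event))
print(innermost_code(event))
```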

In earlier testing that log additionally contained a reference to the other node's address:51111, but after a recent reboot it shows only this generic timeout error. I'm not sure what changed, but the symptom is the same either way.

Based on tcpdump tests, neither node ever sends a port 51111 packet on the wire.
Port 51111 is allowed in both host firewalls, and the symptom is the same with the firewall disabled. Using nc to test connectivity, I can connect to that port on each node from the other by both IP and hostname. A tcpdump on loopback shows many packets, but nothing appears on the outside interface.
Something is preventing it from even attempting to make these gRPC connections to the other node.
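For anyone wanting to repeat the nc reachability check from a script, a minimal sketch is below. It only tests that a TCP connection to the port can be opened; it does not speak gRPC, so it cannot tell whether the meshnet daemon behind the port is healthy. The function name is my own.

```python
import socket


def tcp_port_open(host: str, port: int = 51111, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling `tcp_port_open("<other-node>")` from each host, once by IP and once by hostname, mirrors the manual test described above.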

Kubernetes version is 1.28. OS is Oracle Linux 9. SELinux is disabled.
KNE was installed exactly as per these instructions: https://github.com/openconfig/kne/blob/main/docs/multinode.md but on a local bare-metal cluster. The meshnet daemonset image says it's version 0.3.2, while the meshnet binary says it's version 0.3.0. Flannel appears to work (I can ssh to pods on the remote node).

Any ideas on how to troubleshoot further to make inter-host gRPC links work?
Is vxlan mode still supported as an alternative? Re-applying the meshnet vxlan manifest appeared to change nothing; meshnet still starts in grpc mode.

Thank you.
