Create code_graph.py extraction boundary module #7

@verkligheten

Parent Epic

Part of #5 — Integrate Graphify for zero-cost code entity extraction

Task

Create agent_notes/services/code_graph.py — a boundary module that encapsulates all Graphify interaction. No Graphify types leak into the rest of the codebase; every function works with plain Python dicts and Path objects.

Location

agent_notes/services/code_graph.py (new file; follows the existing pattern of wiki_backend.py, memory_backend.py, credentials.py)

Functions

1. graphify_available() -> bool

def graphify_available() -> bool:
    """Return True if the graphifyy package is importable."""
    try:
        import graphify.extract  # noqa: F401
        return True
    except ImportError:
        return False

2. extract_code_graph(folder_path, *, extensions=None, skip_dirs=None) -> dict

Core extraction function. Runs tree-sitter parsing via Graphify's Python API.

Parameters:

  • folder_path: Path — directory to scan
  • extensions: set[str] | None — allowed code extensions (default: _CODE_EXTENSIONS)
  • skip_dirs: set[str] | None — directories to skip (reuse wiki_backend._SKIP_DIRS)

Returns:

{
    "nodes": [
        {"id": "auth_userservice", "label": "UserService", "source_file": "auth.py",
         "source_location": "L42", "type": "class"}
    ],
    "edges": [
        {"source": "auth_userservice", "target": "payments_gateway",
         "relation": "calls", "confidence": "EXTRACTED"}
    ],
    "communities": {0: ["auth_userservice", "auth_login"], 1: ["payments_gateway"]},
    "cohesion": {0: 0.85, 1: 0.72},
    "god_nodes": [{"label": "UserService", "degree": 12}],
    "stats": {"files_parsed": 5, "nodes": 23, "edges": 41, "communities": 3}
}
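The return shape above could be pinned down with TypedDict definitions so that callers and type checkers agree on the keys. This is a minimal sketch based only on the example dict; the class names (GraphNode, GraphEdge, GodNode, GraphStats, CodeGraph) are illustrative and not part of the task as written, and Graphify's god_nodes() output may carry more keys than the two shown here.

```python
from typing import TypedDict


class GraphNode(TypedDict):
    id: str
    label: str
    source_file: str
    source_location: str
    type: str


class GraphEdge(TypedDict):
    source: str
    target: str
    relation: str
    confidence: str


class GodNode(TypedDict):
    # Assumed from the example; the real entries may contain more fields.
    label: str
    degree: int


class GraphStats(TypedDict):
    files_parsed: int
    nodes: int
    edges: int
    communities: int


class CodeGraph(TypedDict):
    nodes: list[GraphNode]
    edges: list[GraphEdge]
    communities: dict[int, list[str]]  # community id -> member node ids
    cohesion: dict[int, float]         # community id -> cohesion score
    god_nodes: list[GodNode]
    stats: GraphStats


# A tiny instance that satisfies the schema.
example: CodeGraph = {
    "nodes": [{"id": "auth_userservice", "label": "UserService",
               "source_file": "auth.py", "source_location": "L42",
               "type": "class"}],
    "edges": [],
    "communities": {0: ["auth_userservice"]},
    "cohesion": {0: 1.0},
    "god_nodes": [{"label": "UserService", "degree": 0}],
    "stats": {"files_parsed": 1, "nodes": 1, "edges": 0, "communities": 1},
}
```

TypedDict is checked statically only; at runtime these are ordinary dicts, which keeps the "plain Python dicts" boundary intact.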

Implementation logic:

from pathlib import Path


def extract_code_graph(folder_path: Path, *, extensions=None, skip_dirs=None) -> dict:
    # Lazy imports keep `import agent_notes` working without graphifyy installed.
    from graphify.extract import collect_files, extract
    from graphify.build import build_from_json
    from graphify.cluster import cluster, score_all
    from graphify.analyze import god_nodes

    # Step 1: Collect code files
    code_files = collect_files(folder_path)

    # Step 2: Filter by extension (falls back to _CODE_EXTENSIONS, as documented above)
    if extensions is None:
        extensions = _CODE_EXTENSIONS
    code_files = [f for f in code_files if f.suffix in extensions]

    # Step 3: Filter by skip_dirs if specified
    if skip_dirs:
        code_files = [f for f in code_files
                      if not any(d in f.parts for d in skip_dirs)]

    if not code_files:
        return _empty_graph()

    # Step 4: Extract AST (zero API cost)
    extraction = extract(code_files)
    if not extraction.get("nodes"):
        return _empty_graph()

    # Step 5: Build graph
    G = build_from_json(extraction)

    # Step 6: Community detection
    communities = cluster(G)
    cohesion = score_all(G, communities)
    gods = god_nodes(G)

    # Step 7: Convert to plain dict
    nodes = [
        {
            "id": n,
            "label": G.nodes[n].get("label", n),
            "source_file": G.nodes[n].get("source_file", ""),
            "source_location": G.nodes[n].get("source_location", ""),
            "type": G.nodes[n].get("file_type", "code"),
        }
        for n in G.nodes
    ]
    edges = [
        {
            "source": u,
            "target": v,
            "relation": d.get("relation", "related"),
            "confidence": d.get("confidence", "EXTRACTED"),
        }
        for u, v, d in G.edges(data=True)
    ]

    return {
        "nodes": nodes,
        "edges": edges,
        "communities": {k: list(v) for k, v in communities.items()},
        "cohesion": dict(cohesion),
        "god_nodes": gods,
        "stats": {
            "files_parsed": len(code_files),
            "nodes": len(nodes),
            "edges": len(edges),
            "communities": len(communities),
        },
    }
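Steps 2 and 3 above could be factored into one pure helper that is testable without Graphify. The sketch below is an assumption, not part of the task: filter_code_files is a hypothetical name, and it presumes collect_files() yields path-like values. It matches skip_dirs against the path relative to the scanned root, so a repo that happens to live under a directory called build is not filtered out entirely.

```python
from pathlib import Path


def filter_code_files(files, root, extensions=None, skip_dirs=None):
    """Keep files with an allowed suffix and no skipped directory below root."""
    kept = []
    for f in files:
        f = Path(f)
        if extensions is not None and f.suffix.lower() not in extensions:
            continue
        try:
            # Only look at components below the scanned folder.
            rel_parts = f.relative_to(root).parts
        except ValueError:
            # File is not under root; fall back to the full path.
            rel_parts = f.parts
        # rel_parts[:-1] drops the file name so only directories are compared.
        if skip_dirs and any(part in skip_dirs for part in rel_parts[:-1]):
            continue
        kept.append(f)
    return kept


root = Path("/repo/build/proj")
files = [
    root / "src" / "a.py",
    root / "node_modules" / "x.js",
    root / "README.md",
]
result = filter_code_files(files, root, {".py", ".js"}, {"node_modules", "build"})
```

Here only src/a.py survives: the node_modules file is skipped, README.md has a disallowed suffix, and the build component above the root is ignored.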

3. graph_to_wiki_terms(graph_data) -> dict

Maps Graphify nodes and communities to wiki-compatible entity and concept names.

Mapping rules:

| Graphify node        | Condition    | Wiki type | Example                 |
|----------------------|--------------|-----------|-------------------------|
| class                | any degree   | entity    | "UserService"           |
| function (top-level) | degree >= 3  | entity    | "process_payment"       |
| function (method)    | -            | skip      | stays inside class page |
| module / file        | degree >= 2  | entity    | "auth"                  |
| Leiden community     | size >= 2    | concept   | "Authentication System" |
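The mapping rules above can be expressed as two small functions. This is a sketch under stated assumptions: the kind strings ("class", "function", "method", "module") are hypothetical labels for the node categories in the table, and the task does not specify which Graphify attribute distinguishes them.

```python
from typing import Optional


def wiki_type_for_node(kind: str, degree: int) -> Optional[str]:
    """Return "entity" if the node gets a wiki page, or None to skip it."""
    if kind == "class":
        return "entity"                            # any degree
    if kind == "function":
        return "entity" if degree >= 3 else None   # top-level functions only
    if kind == "method":
        return None                                # stays inside the class page
    if kind == "module":
        return "entity" if degree >= 2 else None
    return None                                    # unknown kinds are skipped


def wiki_type_for_community(members) -> Optional[str]:
    """A community becomes a concept only if it has at least two members."""
    return "concept" if len(members) >= 2 else None
```

Returning None for unknown kinds is a conservative choice for the sketch; the real module may prefer to log or raise instead.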

Community naming algorithm:

  1. Collect source_file values from all community member nodes
  2. Extract common path prefix (e.g., auth/, payments/)
  3. If prefix gives a meaningful directory name → use it title-cased
  4. Otherwise → use the highest-degree node's label + "Module" suffix
  5. Deduplicate against existing concept names
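The naming steps above could look roughly like this. Several details are assumptions because the task leaves them open: "meaningful" is approximated by excluding a small set of generic directory names, each member dict is assumed to carry a precomputed degree, source_file is assumed to use forward slashes, and duplicates are resolved with a numeric suffix.

```python
from pathlib import PurePosixPath

# Assumption: directory names too generic to serve as a concept name.
_GENERIC_DIRS = {"src", "lib", "app", "/", "."}


def _common_dir(paths):
    """Return the deepest directory name shared by all paths, or ""."""
    parts_list = [PurePosixPath(p).parent.parts for p in paths]
    common = []
    for group in zip(*parts_list):
        if len(set(group)) != 1:
            break
        common.append(group[0])
    return common[-1] if common else ""


def name_community(members, existing_names=()):
    """members: dicts with 'label', 'source_file' and 'degree' keys."""
    # Steps 1-2: collect source files and find their shared directory.
    files = [m["source_file"] for m in members if m.get("source_file")]
    directory = _common_dir(files)
    if directory and directory not in _GENERIC_DIRS:
        # Step 3: use the directory name, title-cased.
        name = directory.replace("_", " ").replace("-", " ").title()
    else:
        # Step 4: fall back to the highest-degree node's label.
        top = max(members, key=lambda m: m.get("degree", 0))
        name = f"{top['label']} Module"
    # Step 5: deduplicate against names that already exist.
    taken = set(existing_names)
    candidate, n = name, 2
    while candidate in taken:
        candidate = f"{name} {n}"
        n += 1
    return candidate


auth = [
    {"label": "UserService", "source_file": "auth/service.py", "degree": 12},
    {"label": "login", "source_file": "auth/views/login.py", "degree": 3},
]
flat = [
    {"label": "UserService", "source_file": "auth.py", "degree": 12},
    {"label": "helper", "source_file": "util.py", "degree": 1},
]
```

With these inputs, name_community(auth) yields "Auth", name_community(flat) falls back to "UserService Module", and name_community(auth, {"Auth"}) yields "Auth 2".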

Returns:

{
    "entities": ["UserService", "PaymentGateway", "process_payment"],
    "concepts": ["Authentication", "Payment Processing"],
    "edges_by_entity": {
        "UserService": [
            {"target": "PaymentGateway", "relation": "calls"},
            {"target": "login", "relation": "contains"}
        ]
    }
}
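One way to derive edges_by_entity from the extraction output is to translate node ids to labels and group edges by source. The helper name build_edges_by_entity is hypothetical, and the sketch assumes labels are unique enough to serve as keys, which the task does not guarantee.

```python
from collections import defaultdict


def build_edges_by_entity(graph_data, entities):
    """Group outgoing edges by source label, keeping only known entities."""
    label_by_id = {n["id"]: n["label"] for n in graph_data["nodes"]}
    wanted = set(entities)
    grouped = defaultdict(list)
    for edge in graph_data["edges"]:
        source = label_by_id.get(edge["source"])
        target = label_by_id.get(edge["target"])
        # Drop edges whose source is not an entity or whose target is unknown.
        if source in wanted and target is not None:
            grouped[source].append(
                {"target": target, "relation": edge["relation"]}
            )
    return dict(grouped)


graph = {
    "nodes": [
        {"id": "auth_userservice", "label": "UserService"},
        {"id": "payments_gateway", "label": "PaymentGateway"},
        {"id": "auth_login", "label": "login"},
    ],
    "edges": [
        {"source": "auth_userservice", "target": "payments_gateway",
         "relation": "calls"},
        {"source": "auth_userservice", "target": "auth_login",
         "relation": "contains"},
        {"source": "auth_login", "target": "payments_gateway",
         "relation": "calls"},
    ],
}
by_entity = build_edges_by_entity(graph, ["UserService", "PaymentGateway"])
```

Targets are kept even when they are not entities themselves (login here), which matches the example return value above.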

Implementation detail — filtering trivial nodes:

  • Skip nodes whose label starts with _ (private/internal)
  • Skip nodes whose label is __init__, __main__, setup
  • Skip "rationale" type nodes (Graphify extracts # NOTE: comments as rationale nodes)
  • Skip file-level module nodes that are just containers (only have "contains" edges out)
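The four filter rules above can be collapsed into a single predicate. This is a sketch only: it assumes a node's category is available under the "type" key with the hypothetical values "rationale" and "module", and that the caller passes the relation names of the node's outgoing edges.

```python
_SKIP_LABELS = {"__init__", "__main__", "setup"}


def is_trivial_node(node, out_relations=()):
    """Return True if the node should not become a wiki term."""
    label = node.get("label", "")
    # Private/internal names and well-known boilerplate modules.
    if label.startswith("_") or label in _SKIP_LABELS:
        return True
    # Comment-derived rationale nodes are not code entities.
    if node.get("type") == "rationale":
        return True
    # A module whose outgoing edges are all "contains" is just a container.
    relations = list(out_relations)
    if (node.get("type") == "module" and relations
            and all(r == "contains" for r in relations)):
        return True
    return False
```

A module with no outgoing edges at all is kept by this reading of the last rule; whether that is the intended behavior is left open by the task.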

4. save_graph_json(wiki_root, slug, graph_data) -> Path

import json
from pathlib import Path

def save_graph_json(wiki_root: Path, slug: str, graph_data: dict) -> Path:
    """Write the graph as JSON to raw/<slug>-graph.json. Returns the path."""
    raw_dir = wiki_root / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"{slug}-graph.json"
    # JSON object keys are strings, so int community ids are stored as "0", "1", ...
    path.write_text(json.dumps(graph_data, indent=2, default=str), encoding="utf-8")
    return path

Storage rationale: raw/ is the immutable source material directory. The graph is derived from source code — it belongs with source data. .obsidianignore already excludes raw/ from Obsidian indexing.
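One consequence of storing the graph as JSON: object keys are always strings, so the integer community ids in "communities" and "cohesion" come back as "0", "1", and so on. A reader could restore them as sketched below. The load_graph_json helper is hypothetical and not part of the task.

```python
import json
import tempfile
from pathlib import Path


def load_graph_json(path):
    """Read a saved graph and convert community ids back to int keys."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    for key in ("communities", "cohesion"):
        data[key] = {int(k): v for k, v in data.get(key, {}).items()}
    return data


# Round trip through a temporary file to show the key conversion.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo-graph.json"
    payload = {"communities": {0: ["auth_userservice"]}, "cohesion": {0: 0.85}}
    path.write_text(json.dumps(payload), encoding="utf-8")

    raw = json.loads(path.read_text(encoding="utf-8"))  # keys are strings here
    restored = load_graph_json(path)                    # keys are ints again
```

Code that compares a freshly extracted graph with a saved one would otherwise see mismatched keys (0 versus "0").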

5. Helper: _empty_graph() -> dict

def _empty_graph() -> dict:
    return {
        "nodes": [], "edges": [],
        "communities": {}, "cohesion": {},
        "god_nodes": [],
        "stats": {"files_parsed": 0, "nodes": 0, "edges": 0, "communities": 0},
    }

6. Constant: _CODE_EXTENSIONS

_CODE_EXTENSIONS = {
    ".py", ".ts", ".js", ".tsx", ".jsx",
    ".go", ".rs", ".java", ".cpp", ".c", ".h",
    ".rb", ".swift", ".kt", ".cs", ".scala",
    ".php", ".lua", ".groovy", ".jl",
    ".f90", ".pas",
}

This matches Graphify's supported tree-sitter languages.

Potential Issues

  1. Graphify's collect_files() vs our file walking: collect_files() has its own filtering logic. We may get different file sets than wiki_ingest_folder(). Solution: use our own file list from the walk loop where possible, or at minimum filter collect_files() output with our _SKIP_DIRS and extensions.

  2. NetworkX graph iteration order: G.nodes and G.edges(data=True) iteration order is insertion-order in Python 3.7+, but community assignment is non-deterministic (Leiden uses randomization). This is fine — we only need consistent node IDs, not consistent community assignment.

  3. Large repositories: extract() on a 1000+ file repo could take 10-30 seconds (tree-sitter is fast but not instant). This is acceptable for a one-time ingest operation, but document that large repos may take a moment.

  4. extract() with cache_root: The v7 API supports extract(code_files, cache_root=Path(".")) for caching parsed results. We should pass a cache path to avoid re-parsing on --update runs. Use wiki_root / "raw" as cache root.

  5. Import safety: All Graphify imports are lazy (inside function bodies), so import agent_notes never fails even when graphifyy isn't installed.
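The import-safety point in item 5 can be verified in a unit test without uninstalling anything. Python's import system raises ImportError when a module's entry in sys.modules is None, so the sketch below temporarily blocks graphify and checks that graphify_available() reports False. The function is copied from section 1 to keep the example self-contained; check_unavailable_when_import_blocked is a hypothetical test helper.

```python
import sys
from unittest import mock


def graphify_available() -> bool:
    """Return True if the graphifyy package is importable."""
    try:
        import graphify.extract  # noqa: F401
        return True
    except ImportError:
        return False


def check_unavailable_when_import_blocked() -> bool:
    """Simulate a missing install and return what graphify_available() reports."""
    # A None entry in sys.modules makes `import graphify...` raise ImportError.
    blocked = {"graphify": None, "graphify.extract": None}
    with mock.patch.dict(sys.modules, blocked):
        return graphify_available()


blocked_result = check_unavailable_when_import_blocked()
```

mock.patch.dict restores sys.modules on exit, so the block does not affect later imports in the same test run.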

Dependencies
