
Commit f6d899c

aspala and claude committed
chore: add test support, scripts, CI workflow, and update action config
Adds test support helpers, development/debug scripts, a blocks CI workflow, and updates action.yml and the README to reflect the new block analysis capabilities.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent c86c397 commit f6d899c

56 files changed

Lines changed: 4117 additions & 4 deletions


.github/workflows/blocks.yml

Lines changed: 18 additions & 0 deletions
```yaml
name: Extract Code Blocks

on:
  pull_request:
    branches: [main]

permissions:
  contents: read

jobs:
  blocks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./
        with:
          command: blocks
          build: source
```

README.md

Lines changed: 18 additions & 2 deletions
````diff
@@ -2,10 +2,11 @@
 
 A GitHub Action for running [codeqa](https://github.com/num42/n42-agentic-helpers/tree/main/code-quality-analyzer-ex) code quality analysis on your repository.
 
-Supports three commands:
+Supports four commands:
 - **health-report** — Graded health report with worst offenders
 - **compare** — Metric comparison between git refs (e.g. PR vs base)
 - **analyze** — Raw JSON metrics output
+- **blocks** — Extract natural code blocks as JSON
 
 ## Usage
 
@@ -56,11 +57,26 @@ The base ref is auto-detected from the PR context. Override with `base-ref` if n
   run: cat ${{ steps.analysis.outputs.report-file }}
 ```
 
+### Extract Code Blocks
+
+```yaml
+- uses: num42/codeqa-action@v1
+  id: blocks
+  with:
+    command: blocks
+    extra-args: --sub-blocks
+
+- name: Use blocks
+  run: cat ${{ steps.blocks.outputs.report-file }}
+```
+
+Produces a JSON array of block objects. Each object includes `file`, `line_from`, `line_to`, `token_count`, and `depth` by default. Pass `--fields file,line_from,line_to,tokens` to customise. Add `--stream` for NDJSON output. Use `--workers N` for parallel file processing.
+
 ## Inputs
 
 | Input | Required | Default | Description |
 |-------|----------|---------|-------------|
-| `command` | yes | — | `health-report`, `compare`, or `analyze` |
+| `command` | yes | — | `health-report`, `compare`, `analyze`, or `blocks` |
 | `path` | no | `.` | Directory to analyze |
 | `comment` | no | `false` | Post result as sticky PR comment |
 | `fail-grade` | no | — | Minimum grade for health-report (e.g. `C`). Fails if below |
````
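The JSON emitted by the `blocks` command can be consumed from any language. A minimal Python sketch, assuming only the field names the README documents (`file`, `line_from`, `line_to`, `token_count`, `depth`) — the sample records and their values are entirely made up for illustration:

```python
import json

# Hypothetical output from `command: blocks`; field names follow the README,
# values are invented for this example.
raw = """
[
  {"file": "lib/codeqa/git.ex", "line_from": 10, "line_to": 42,
   "token_count": 180, "depth": 0},
  {"file": "lib/codeqa/git.ex", "line_from": 15, "line_to": 20,
   "token_count": 35, "depth": 1}
]
"""

blocks = json.loads(raw)

# Keep top-level blocks only, then find the largest by token count.
top_level = [b for b in blocks if b["depth"] == 0]
largest = max(top_level, key=lambda b: b["token_count"])
print(largest["file"], largest["line_from"], largest["line_to"])
```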

action.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -8,7 +8,7 @@ branding:
 
 inputs:
   command:
-    description: "Command to run: health-report, compare, or analyze"
+    description: "Command to run: health-report, compare, analyze, or blocks"
     required: true
   path:
     description: "Directory to analyze"
```

scripts/domain_tagger_demo.exs

Lines changed: 134 additions & 0 deletions
```elixir
# Domain tagger demo v3
# - noun extraction (no lookup table)
# - block-unique signal: block domains minus sub-block domains
# - language stopwords: extracted via Stopwords.find_stopwords on a synthetic
#   Elixir keywords file — infra words that appear there are subtracted
#
# run with: mix run scripts/domain_tagger_demo.exs

alias CodeQA.Metrics.{BlockAnalyzer, BlockDetector, TokenNormalizer}
alias CodeQA.Stopwords

defmodule DomainTagger do
  @verbs MapSet.new(~w[
    get fetch find list search query count
    create build make add insert
    update patch put set save store upsert
    delete remove destroy drop clear purge flush
    send receive deliver emit broadcast publish
    upload download read write stream parse format render print log
    check validate verify assert ensure guard
    handle process compute calculate apply run exec call invoke dispatch
    init start stop reset open close connect disconnect reload
    is has can will do did should would may
    by with for from to into on at of in out new
    all some many each every any
    a an the ok error nil true false
  ])

  def split(content) do
    content
    |> String.replace(~r/([a-z])([A-Z])/, "\\1_\\2")
    |> String.replace(~r/([A-Z]+)([A-Z][a-z])/, "\\1_\\2")
    |> String.split(~r/[_!?]/, trim: true)
    |> Enum.map(&String.downcase/1)
    |> Enum.reject(&(String.length(&1) <= 1))
  end

  def nouns(content) do
    content |> split() |> Enum.reject(&MapSet.member?(@verbs, &1))
  end

  def tag(tokens) do
    bound = BlockAnalyzer.bound_variables(tokens)

    tokens
    |> Enum.filter(&(&1.value == "<ID>"))
    |> Enum.flat_map(&nouns(&1.content))
    |> Enum.reject(fn noun -> MapSet.member?(bound, noun) end)
    |> Enum.frequencies()
    |> Enum.sort_by(fn {_, c} -> -c end)
    |> Enum.map(fn {word, _} -> String.to_atom(word) end)
  end
end

# ---------------------------------------------------------------------------
# Step 1: derive language stopwords from the repository's own .ex files.
#
# Stopwords.find_stopwords counts how often each noun appears across files.
# Nouns present in ≥15% of repo files are treated as infrastructure words.
# ---------------------------------------------------------------------------

repo_files =
  Path.wildcard("lib/**/*.ex")
  |> Map.new(fn path -> {path, File.read!(path)} end)

noun_extractor = fn content ->
  content
  |> TokenNormalizer.normalize_structural()
  |> DomainTagger.tag()
  |> Enum.map(&Atom.to_string/1)
end

IO.puts("=== Deriving stopwords from #{map_size(repo_files)} repo files ===")

lang_stopwords =
  Stopwords.find_stopwords(repo_files, noun_extractor)

IO.puts("=== Language stopwords extracted (#{MapSet.size(lang_stopwords)} terms) ===")
IO.puts(inspect(lang_stopwords |> MapSet.to_list() |> Enum.sort()))
IO.puts("")

# ---------------------------------------------------------------------------
# Step 2: analyze a target .ex file
# ---------------------------------------------------------------------------

target = "lib/codeqa/git.ex"
sources = [{target, File.read!(target)}]

apply_stopwords = fn domains ->
  domains
  |> Enum.map(&Atom.to_string/1)
  |> Enum.reject(&MapSet.member?(lang_stopwords, &1))
  |> Enum.map(&String.to_atom/1)
end

fn_name_of = fn block ->
  block.tokens
  |> Enum.drop_while(&(&1.value != "<ID>" or &1.content not in ["def", "defp"]))
  |> Enum.drop(1)
  |> Enum.find(&(&1.value == "<ID>"))
  |> case do
    nil -> "(unknown)"
    t -> t.content
  end
end

Enum.each(sources, fn {source_label, content} ->
  IO.puts("=== #{source_label} ===\n")

  tokens = TokenNormalizer.normalize_structural(content)
  blocks = BlockDetector.detect_blocks(tokens, language: :unknown)

  Enum.each(blocks, fn block ->
    all_domains = DomainTagger.tag(block.tokens) |> apply_stopwords.()

    sub_domain_set =
      block.sub_blocks
      |> Enum.flat_map(&DomainTagger.tag(&1.tokens))
      |> Enum.map(&Atom.to_string/1)
      |> Enum.reject(&MapSet.member?(lang_stopwords, &1))
      |> MapSet.new()

    unique = Enum.reject(all_domains, &MapSet.member?(sub_domain_set, Atom.to_string(&1)))

    IO.puts("#{fn_name_of.(block)}/? (lines #{block.start_line}–#{block.end_line})")
    IO.puts("  all    : #{inspect(all_domains)}")
    IO.puts("  unique : #{inspect(unique)}")

    Enum.each(block.sub_blocks, fn sb ->
      ds = DomainTagger.tag(sb.tokens) |> apply_stopwords.()
      unless Enum.empty?(ds), do: IO.puts("  sub:#{sb.start_line} : #{inspect(ds)}")
    end)

    IO.puts("")
  end)
end)
```
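For readers outside Elixir, the identifier-splitting heuristic in `DomainTagger.split/1` and `nouns/1` can be mirrored in a few lines of Python — a sketch of the same regex logic, not the project's API; the verb set here is a small excerpt of the demo's full list:

```python
import re

# Small excerpt of the demo's verb/infrastructure stopword set.
VERBS = {"get", "fetch", "find", "create", "update", "delete", "is", "has", "by", "ok"}

def split(content: str) -> list[str]:
    # camelCase / PascalCase -> snake_case, then split on _ ! ?
    s = re.sub(r"([a-z])([A-Z])", r"\1_\2", content)
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)
    parts = [p.lower() for p in re.split(r"[_!?]", s) if p]
    # Drop one-character fragments, as the Elixir version does.
    return [p for p in parts if len(p) > 1]

def nouns(content: str) -> list[str]:
    # Domain signal = identifier words minus verbs/infrastructure words.
    return [w for w in split(content) if w not in VERBS]

print(split("fetchUserAccount"))  # ['fetch', 'user', 'account']
print(nouns("fetchUserAccount"))  # ['user', 'account']
```

The second regex keeps acronym runs intact, so `HTTPServer` splits into `http` and `server` rather than one letter per word.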

scripts/inspect_blocks.exs

Lines changed: 134 additions & 0 deletions
```elixir
alias CodeQA.Metrics.{TokenNormalizer, BlockDetector, Block}

# ─── sample source ───────────────────────────────────────────────────────────

source = ~S"""
defmodule Greeter do
  def hello(name) do
    "Hello, #{name}!"
  end

  def goodbye(name) do
    "Goodbye, #{name}!"
  end
end


defmodule Calculator do
  def add(a, b), do: a + b

  def subtract(a, b), do: a - b
end
"""

# ─── helpers ─────────────────────────────────────────────────────────────────

sep = fn char, n -> IO.puts(String.duplicate(char, n)) end
title = fn label -> sep.("─", 60); IO.puts(" #{label}"); sep.("─", 60) end

# Reconstruct source from token stream using col positions for spacing.
# <WS> tokens are skipped — indentation is recovered from content-token col.
reconstruct = fn tokens ->
  {text, _, _} =
    Enum.reduce(tokens, {"", 1, 0}, fn token, {acc, line, col} ->
      case token.value do
        "<NL>" ->
          {acc <> "\n", line + 1, 0}

        "<WS>" ->
          {acc, line, col}

        _ ->
          padding = String.duplicate(" ", max(0, token.col - col))
          end_col = token.col + String.length(token.content)
          {acc <> padding <> token.content, line, end_col}
      end
    end)

  text
end

# ─── 1. original source ──────────────────────────────────────────────────────

title.("1. ORIGINAL SOURCE")
IO.puts(source)

# ─── 2. token stream ─────────────────────────────────────────────────────────

title.("2. TOKEN STREAM (value | content | line:col)")

tokens = TokenNormalizer.normalize_structural(source)

tokens
|> Enum.each(fn t ->
  value = String.pad_trailing(t.value, 8)
  content = String.pad_trailing(inspect(t.content), 16)
  IO.puts("  #{value} #{content} #{t.line}:#{t.col}")
end)

# ─── 3. blocks + sub-blocks ──────────────────────────────────────────────────

title.("3. BLOCKS + SUB-BLOCKS")

blocks = BlockDetector.detect_blocks(tokens, language: :unknown)

blocks
|> Enum.with_index(1)
|> Enum.each(fn {block, i} ->
  IO.puts("")
  IO.puts("  Block #{i}  lines #{block.start_line}–#{block.end_line}" <>
          "  tokens=#{Block.token_count(block)}" <>
          "  sub_blocks=#{Block.sub_block_count(block)}")

  reconstructed_block = reconstruct.(block.tokens)
  IO.puts("  ┌─ reconstructed ────────────────────────────")
  reconstructed_block
  |> String.split("\n")
  |> Enum.each(fn line -> IO.puts("  │ #{line}") end)
  IO.puts("  └────────────────────────────────────────────")

  if block.sub_blocks != [] do
    block.sub_blocks
    |> Enum.with_index(1)
    |> Enum.each(fn {sub, j} ->
      IO.puts("")
      IO.puts("    Sub-block #{i}.#{j}  lines #{sub.start_line}–#{sub.end_line}" <>
              "  tokens=#{Block.token_count(sub)}")
      reconstructed_sub = reconstruct.(sub.tokens)
      IO.puts("    ┌─ reconstructed ──────────────────────")
      reconstructed_sub
      |> String.split("\n")
      |> Enum.each(fn line -> IO.puts("    │ #{line}") end)
      IO.puts("    └──────────────────────────────────────")
    end)
  end
end)

# ─── 4. full reconstruction + match check ────────────────────────────────────

title.("4. FULL RECONSTRUCTION")

reconstructed = reconstruct.(tokens)
IO.puts(reconstructed)

title.("5. RECONSTRUCTION vs ORIGINAL")

if reconstructed == source do
  IO.puts("  ✓ exact match")
else
  IO.puts("  ✗ differs (whitespace normalisation expected)")
  IO.puts("")

  source_lines = String.split(source, "\n")
  reconstructed_lines = String.split(reconstructed, "\n")

  source_lines
  |> Enum.with_index(1)
  |> Enum.each(fn {orig, n} ->
    recon = Enum.at(reconstructed_lines, n - 1, "")
    marker = if orig == recon, do: "  ", else: "!!"
    IO.puts("  #{marker} #{String.pad_leading(to_string(n), 2)} orig : #{inspect(orig)}")

    if orig != recon do
      IO.puts("        recon: #{inspect(recon)}")
    end
  end)
end
```
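The column-padding idea behind the script's `reconstruct` helper translates directly to other languages. A Python sketch, assuming tokens carry `value`, `content`, `line`, and a 0-based `col` as the padding math in the script implies (the `<KW>` value below is a hypothetical placeholder; the script itself only shows `<ID>`, `<NL>`, and `<WS>`):

```python
# Each token is (value, content, line, col). <NL> emits a newline and resets
# the cursor; <WS> is skipped; any other token pads with spaces up to its col.
def reconstruct(tokens):
    text, col = "", 0
    for value, content, _line, tok_col in tokens:
        if value == "<NL>":
            text += "\n"
            col = 0
        elif value == "<WS>":
            continue
        else:
            text += " " * max(0, tok_col - col) + content
            col = tok_col + len(content)
    return text

tokens = [
    ("<KW>", "def", 1, 0),
    ("<ID>", "hello", 1, 4),
    ("<NL>", "\n", 1, 9),
    ("<KW>", "end", 2, 0),
]
print(reconstruct(tokens))  # "def hello" on one line, "end" on the next
```

Because indentation is recovered purely from column positions, whitespace tokens can be discarded without losing layout — which is exactly what the script's match check in section 5 verifies.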

scripts/run.sh

Lines changed: 5 additions & 1 deletion
```diff
@@ -38,8 +38,9 @@ case "$INPUT_COMMAND" in
     fi
     ;;
   analyze) OUTPUT_FILE="${OUTPUT_FILE}.json" ;;
+  blocks) OUTPUT_FILE="${OUTPUT_FILE}.json" ;;
   *)
-    echo "::error::Unknown command: $INPUT_COMMAND. Must be health-report, compare, or analyze."
+    echo "::error::Unknown command: $INPUT_COMMAND. Must be health-report, compare, analyze, or blocks."
     exit 1
     ;;
 esac
@@ -82,6 +83,9 @@ case "$INPUT_COMMAND" in
   analyze)
     ARGS+=("--output" "$OUTPUT_FILE")
     ;;
+  blocks)
+    ARGS+=("--output" "$OUTPUT_FILE")
+    ;;
 esac

 # Parse ignore-paths YAML list into --ignore-paths flag
```

test/support/block_matcher.ex

Lines changed: 17 additions & 0 deletions
```elixir
defmodule Test.BlockMatcher do
  @moduledoc """
  Helpers for asserting on tokens within `CompoundBlock` structures.

  Returns tagged tuples that can be matched against token fields:

  - `exact(:content, "add")` — token whose `content` equals `"add"` exactly
  - `partial(:content, "@doc")` — token whose `content` contains `"@doc"` as a substring
  - `:value` targets the normalized token value instead of raw source content
  """

  @spec exact(:content | :value, String.t()) :: {:exact, :content | :value, String.t()}
  def exact(field, value) when field in [:content, :value], do: {:exact, field, value}

  @spec partial(:content | :value, String.t()) :: {:partial, :content | :value, String.t()}
  def partial(field, value) when field in [:content, :value], do: {:partial, field, value}
end
```
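A matcher built from these tagged tuples is straightforward to apply. A Python sketch of the same idea — not the project's test code; the dict stands in for the token struct, and `matches` is a hypothetical consumer the module's docs imply:

```python
# Tagged-tuple matchers in the style of Test.BlockMatcher:
# ("exact" | "partial", field, value), where field is "content" or "value".
def exact(field, value):
    assert field in ("content", "value")
    return ("exact", field, value)

def partial(field, value):
    assert field in ("content", "value")
    return ("partial", field, value)

def matches(token, matcher):
    kind, field, value = matcher
    if kind == "exact":
        return token[field] == value
    return value in token[field]  # "partial": substring match

token = {"content": "@doc false", "value": "<ID>"}
print(matches(token, partial("content", "@doc")))  # True
print(matches(token, exact("value", "<ID>")))      # True
```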
