Chalkline's full writeup lives at paper/README.md, covering the bottom-up career-mapping thesis, the stage-by-stage algorithmic defense, the stakeholder data gaps the pipeline had to route around, and the bibliography. Start there if you want the project's reasoning before its mechanics.
Requires Python 3.14+ and uv for dependency management.
```
git clone https://github.com/Jybbs/chalkline.git
cd chalkline
uv sync
```

Chalkline operates in two stages. First, `fit` encodes the posting corpus with a sentence transformer, clusters the embeddings into career families, assigns O*NET occupations via posting-level MaxSim late-interaction scoring, derives per-cluster wages from the joined labor table, and builds a stepwise career graph that attaches credentials to each pathway on demand. Results are cached to disk via Hamilton's content-addressed store, so subsequent runs with unchanged code and config serve instantly.
```
uv run chalkline fit      # fit the pipeline, print a summary
uv run chalkline fit -v   # same, with diagnostic logs
```

Then `launch` starts the Marimo reactive notebook, where you upload a resume and receive a personalized career report.
```
uv run chalkline launch   # open the career report in your browser
```

The posting corpus from AGC Maine is proprietary and not included in the repository. Place posting data in `data/postings/` before fitting.
The Green Buildings Career Map organized 55 jobs across 4 sectors with 300+ advancement routes, demonstrating that structured career maps change how workers navigate trades[^1]. Chalkline asks whether the same kind of structure can be constructed algorithmically from job postings, complementing expert-curated maps with a data-driven approach that can be re-fitted as the labor market shifts.
The premise is that postings encode implicit structure about how occupations relate to one another, which skills bridge adjacent roles, and what credentials separate one career level from the next. Occupational modeling at scale has confirmed this, showing that millions of unstructured postings yield taxonomies comparable to expert-curated frameworks[^2], and network models built from skill overlap reveal the same latent mobility structure[^3][^4]. Data-driven taxonomies extracted directly from online adverts have reached similar conclusions at smaller scale[^5], reinforcing that the signal is in the postings themselves.
Chalkline works with 2,154 postings scraped from AGC Maine's listings and covers 60 O*NET SOC codes across three sectors (Building Construction, Construction Managers, Heavy Highway Construction). A sentence transformer[^6] encodes each posting into a 768-dimensional embedding, Ward-linkage HAC[^7] clusters those embeddings into 20 career families, and a stepwise k-NN graph routes advancement and lateral moves enriched by 836 credentials (19 apprenticeships, 787 certifications, 30 educational programs) on a per-route basis. A joined labor table of BLS OEWS wages, growth projections, and O*NET Bright Outlook designations covers 53 of the SOC codes, driving the cluster-level wage expectations that appear on every career card. Upload a resume, and the system chunks it into sentences, encodes each chunk, and projects into the same space for personalized skill-gap analysis[^8].
A chalk line snaps a straight reference path between two points. Chalkline does the same for careers.
Chalkline is a single-track embedding pipeline orchestrated by Hamilton[^9], wherein each processing step is a DAG node whose parameter names declare its dependencies. Hamilton resolves execution order automatically, caches every node result to disk under a content-addressed key of `hash(code_version + input_data_versions)`, and serves from cache on subsequent calls with unchanged code and config. Editing a curation script, a lexicon JSON, or an individual node function therefore invalidates only that node and its downstream dependents, rather than requiring a blunt wipe of the cache directory. The pipeline draws on recent work in job ad segmentation via NLP and clustering[^10], and on end-to-end transformer pipelines for resume matching[^11].
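A minimal sketch of that node style, with toy node names standing in for Chalkline's actual `steps.py` functions: each function is a node, and its parameter names are the edges Hamilton wires for you.

```python
"""Illustrative Hamilton module: dependency-by-parameter-name in miniature."""
import numpy as np
from hamilton import driver
from hamilton.ad_hoc_utils import create_temporary_module

def corpus(postings: list[str]) -> list[str]:
    # Normalize raw posting text (toy stand-in for the real corpus node).
    return [p.strip().lower() for p in postings]

def embeddings(corpus: list[str]) -> np.ndarray:
    # Toy encoder: random vectors stand in for the sentence transformer.
    rng = np.random.default_rng(len(corpus))
    return rng.normal(size=(len(corpus), 8))

def centroid(embeddings: np.ndarray) -> np.ndarray:
    # Depends on the `embeddings` node purely through its parameter name.
    return embeddings.mean(axis=0)

module = create_temporary_module(corpus, embeddings, centroid)
dr = driver.Builder().with_modules(module).build()
result = dr.execute(["centroid"], inputs={"postings": [" Foo ", "Bar "]})
print(result["centroid"].shape)  # (8,)
```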
| Step | Node | Technique | Module |
|---|---|---|---|
| 1 | Corpus Loading | Deduplicate and filter JobSpy-collected postings, normalize companies and locations | `collection.collector` |
| 2 | Sentence Encoding | `Alibaba-NLP/gte-base-en-v1.5` via ONNX with CLS pooling | `pipeline.encoder` |
| 3 | Dimensionality Reduction | L2-normalize embeddings, then TruncatedSVD to 10 components | `pipeline.steps` |
| 4 | Clustering | Ward-linkage HAC cut at 20 career families | `pipeline.steps` |
| 5 | SOC Assignment | Posting-level ColBERTv2 MaxSim against Task embeddings of all 60 O*NET occupations | `pathways.selection` |
| 6 | Per-Cluster Wage | Top-K softmax expectation over labor wages weighted by SOC similarity | `pathways.clusters` |
| 7 | Career Graph | Stepwise k-NN backbone (lateral at same Job Zone, upward at next) with per-route destination-affinity credential pool and waste-aware Pareto-knee selection | `pathways.graph` |
| 8 | Resume Matching | Sentence chunking, per-task MaxSim, BM25-weighted gap ranking, SVD projection for centroid distance | `matching.matcher` |
The SentenceEncoder in `pipeline/encoder.py` downloads the ONNX model from HuggingFace on first use and runs inference via `onnxruntime` in fixed-size batches, with CLS pooling followed by L2 normalization. The ~430 MB model file is deliberately instantiated outside the DAG, so Hamilton's disk cache serializes only NumPy array outputs rather than the encoder weights themselves. Cold-start time for subsequent sessions drops from ~10.4s to ~0.35s because the encoder loads tokenizer files through `Tokenizer.from_file` and reuses `try_to_load_from_cache` rather than re-resolving the HuggingFace revision each time.
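The inference path, in sketch form. File paths and ONNX input names here are assumptions (they depend on the export); the real SentenceEncoder additionally resolves files from the HuggingFace cache and batches inputs.

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("gte-base-en-v1.5/tokenizer.json")  # assumed local path
tokenizer.enable_padding()
tokenizer.enable_truncation(max_length=512)
session = ort.InferenceSession("gte-base-en-v1.5/model.onnx")       # assumed local path

def encode(texts: list[str]) -> np.ndarray:
    encodings = tokenizer.encode_batch(texts)
    feeds = {  # input names vary by ONNX export; these are the common ones
        "input_ids": np.array([e.ids for e in encodings], dtype=np.int64),
        "attention_mask": np.array([e.attention_mask for e in encodings], dtype=np.int64),
    }
    last_hidden = session.run(None, feeds)[0]  # (batch, seq_len, 768)
    cls = last_hidden[:, 0, :]                 # CLS pooling: take the first token
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)  # L2 normalize
```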
The fitted pipeline assembles into a `Chalkline` dataclass that exposes four attributes (`clusters`, `config`, `graph`, `matcher`) and a single `match(pdf_bytes)` method. The method extracts resume text via pdfplumber, splits it into sentences, encodes each chunk with the same sentence transformer used for posting encoding, projects the mean chunk vector through the fitted SVD, assigns the nearest career family, computes per-task MaxSim gap analysis, and returns a `MatchResult` carrying reach exploration and credential metadata. Because the matcher reuses every fitted transformation rather than re-encoding the reference corpus, per-match latency stays under a second once the encoder is warm.
Fit-time timing is logged per Hamilton node via `run_after_node_execution` in `pipeline/progress.py`, so a diagnostic run surfaces which step dominates wall-clock time without external profiling. The `chalkline cache` CLI subcommand inspects Hamilton's SQLite metadata store to show which cached node output maps to which on-disk file when a code change does not invalidate the downstream subtree the way you expect.
Posting collection sits upstream of the Hamilton DAG, because raw scraping is a stateful process that should not re-run on every pipeline fit. The `collection/` subpackage wraps `python-jobspy` to issue searches against multiple aggregators for a curated list of construction search terms, concatenates the returned records into a single DataFrame, and passes them through `clean_text` normalization before the collector deduplicates on a composite key derived from the company and title slugs. Each posting receives a deterministic id via `python-slugify`, so the same listing encountered twice across different boards collapses to a single record and the corpus stays stable enough to diff between collection runs. The collector writes to `data/postings/` as a JSON array consumed by the Hamilton corpus node, and the pipeline treats everything under that directory as read-only input.
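The composite-key idea in miniature (field names are assumptions; the real collector works with Posting models and `clean_text` normalization):

```python
from slugify import slugify

def posting_id(company: str, title: str) -> str:
    # Deterministic id: the same listing seen on two boards gets the same key.
    return f"{slugify(company)}-{slugify(title)}"

def deduplicate(postings: list[dict]) -> list[dict]:
    seen: dict[str, dict] = {}
    for posting in postings:
        # First occurrence wins; later duplicates collapse into it.
        seen.setdefault(posting_id(posting["company"], posting["title"]), posting)
    return list(seen.values())
```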
Each posting description is fed through a sentence transformer (gte-base-en-v1.5)[^6] that converts text into a 768-dimensional vector capturing its semantic meaning. Every vector is scaled to unit length (L2-normalized), so that cosine similarity between any two postings reduces to a plain dot product and downstream distance comparisons are scale-free.
768 dimensions is more than the downstream steps need, and high-dimensional spaces introduce a well-documented problem wherein all pairwise distances converge toward the same value[^12], making it harder to tell similar postings apart from dissimilar ones. TruncatedSVD[^13] compresses the space by decomposing the posting embedding matrix into its most informative components:

$$X \approx U_k \Sigma_k V_k^\top$$

The pipeline retains $k = 10$ components[^14][^15], enough for the clustering and projection steps that follow while discarding the dimensions where distances concentrate.
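In scikit-learn terms, the reduction step might look like the following sketch (stand-in data; `n_components=10` matches the prose):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

embeddings = np.random.default_rng(0).normal(size=(2154, 768))  # stand-in corpus

unit = normalize(embeddings)                # L2-normalize each posting vector
svd = TruncatedSVD(n_components=10, random_state=0)
reduced = svd.fit_transform(unit)           # (2154, 10)
print(svd.explained_variance_ratio_.sum())  # share of variance retained
```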
The pipeline groups postings into career families using Ward-linkage hierarchical agglomerative clustering[^7]. Starting with each posting as its own cluster, the algorithm repeatedly merges the two clusters whose combination increases total within-cluster variance the least. The cost of merging clusters $A$ and $B$ is the resulting increase in within-cluster sum of squares,

$$\Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|}\,\lVert \boldsymbol{\mu}_A - \boldsymbol{\mu}_B \rVert^2$$

where $\boldsymbol{\mu}_A$ and $\boldsymbol{\mu}_B$ are the cluster centroids. This builds a full merge hierarchy that is then cut at $k = 20$, yielding the career families that the rest of the pipeline treats as its unit of analysis.
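The clustering step, sketched on stand-in data, together with the silhouette check described below:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

reduced = np.random.default_rng(0).normal(size=(2154, 10))  # post-SVD stand-in

hac = AgglomerativeClustering(n_clusters=20, linkage="ward")
labels = hac.fit_predict(reduced)                 # cut the hierarchy at k = 20
print(silhouette_score(reduced, labels))          # partition separation quality
```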
The methods tab surfaces two analytical primitives that together describe how usable the fitted partition actually is. Silhouette analysis[^16] validates the quality of the partition by measuring how well each posting fits its assigned family versus its nearest alternative family, with the per-posting silhouette coefficient defined as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the mean distance from posting $i$ to the other postings in its own family and $b(i)$ is the mean distance to the postings of the nearest other family. Coefficients near 1 indicate a posting sits firmly inside its family, values near 0 mark border cases, and negative values flag likely misassignments.
Brokerage centrality on the career graph[^17] complements the silhouette view by measuring how often each cluster appears on the shortest path between other pairs of clusters. For a cluster $v$, the betweenness centrality is

$$g(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of shortest paths between clusters $s$ and $t$, and $\sigma_{st}(v)$ is the number of those paths that pass through $v$. High-brokerage families act as stepping stones: many realistic career routes funnel through them.
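NetworkX's `betweenness_centrality` implements exactly this measure; a toy career graph makes the stepping-stone reading concrete (edges here are illustrative, not fitted output):

```python
import networkx as nx

graph = nx.DiGraph()
graph.add_edges_from([
    ("laborers", "operators"), ("operators", "supervisors"),
    ("laborers", "carpenters"), ("carpenters", "supervisors"),
    ("operators", "carpenters"),
])
brokerage = nx.betweenness_centrality(graph)
print(max(brokerage, key=brokerage.get))  # the most-traversed stepping stone
```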
Each cluster needs an occupational identity drawn from the 60-code O*NET reference set[^20], and the assignment scorer uses ColBERTv2-style late-interaction MaxSim[^18][^19], which preserves the multi-vector structure on both sides of the comparison rather than collapsing either into a pooled mean. For each cluster $c$ with posting embeddings $P_c$ and occupation $o$ with task embeddings $T_o$, the similarity is

$$\operatorname{sim}(c, o) = \frac{1}{|P_c|} \sum_{\mathbf{p} \in P_c} \max_{\mathbf{t} \in T_o} \mathbf{p} \cdot \mathbf{t}$$

where dot products equal cosines because all vectors are unit-normalized. Each posting casts its best-matching single task against each SOC, and the cluster-level score is the mean of those maxes. The `SOCScorer` dataclass in `pathways/selection.py` stacks every SOC's task matrix into one contiguous array at construction and resolves every cluster-occupation pair with a single BLAS matmul plus `np.maximum.reduceat` for per-occupation max-pooling, so the `soc_similarity` Hamilton node collapses to a three-line delegation.
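The stacking trick in isolation: one matmul over all SOC task vectors, then a segmented max via `np.maximum.reduceat` (shapes and task counts are illustrative; vectors are assumed unit-normalized so dot product equals cosine):

```python
import numpy as np

rng = np.random.default_rng(0)
postings = rng.normal(size=(50, 10))             # one cluster's posting vectors
task_counts = [12, 7, 20]                        # tasks per occupation
tasks = rng.normal(size=(sum(task_counts), 10))  # all SOC task vectors, stacked

sims = postings @ tasks.T                        # one BLAS matmul: (50, 39)
offsets = np.cumsum([0] + task_counts[:-1])      # start index of each SOC's block
per_soc_max = np.maximum.reduceat(sims, offsets, axis=1)  # (50, 3): max per block
cluster_scores = per_soc_max.mean(axis=0)        # mean of maxes: (3,)
```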
The full `(n_clusters, n_occupations)` similarity matrix feeds two downstream consumers. The argmax assigns each cluster's SOC title and sector, and a softmax over each row at temperature $\tau$ yields `soc_weights`, the per-cluster probability distribution over occupations that the wage expectation and Job Zone vote consume below.
Job Zone assignment (ranging from 1 for minimal preparation to 5 for extensive) uses a smoothed vote from the top $K$ SOC matches: each of the top-$K$ occupations votes for its Job Zone with its renormalized weight, and the cluster takes the rounded expectation

$$z_c = \operatorname{round}\!\Big(\sum_{o \in \text{top-}K} \tilde{w}_{c,o}\, z_o\Big)$$

where $\tilde{w}_{c,o}$ are the top-$K$ SOC weights renormalized to sum to one and $z_o$ is occupation $o$'s O*NET Job Zone. The smoothing keeps a single borderline SOC match from dragging a cluster's preparation level up or down.
The matcher splits an uploaded resume into sentences via NLTK's Punkt tokenizer, encodes each chunk with the same sentence transformer used for postings, and scores each O*NET task by its best-matching chunk, so a specific line of experience registers against the specific task it demonstrates[^21]. A resume line like "Installed commercial electrical systems for 8 years" scores 0.6–0.8 against its matching task, while unrelated tasks stay close to the neutral ~0.30 cosine floor that any coherent English text produces against construction content. Cluster assignment continues to use the mean chunk vector projected through the fitted SVD, so centroid distance stays comparable across sessions even as per-task scoring benefits from the chunk-level resolution.
Generic verbs like "prepare", "use", and "assist" appear in the task descriptions of almost every occupation, so raw per-task MaxSim would reward resumes that mention them regardless of whether the underlying work matches. The matcher re-weights each task's similarity by a BM25 term-weighting function[^8] over its stemmed content words, which suppresses high-document-frequency terms and amplifies domain-specific ones such as "conduit", "circuit", and "journeyman". The BM25 term-frequency component with length normalization is

$$\text{tf}_{\text{BM25}}(t, d) = \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\text{avgdl}}\right)}$$

where $f(t, d)$ is term $t$'s frequency in task $d$, $|d|$ is the task's length, avgdl is the mean task length, and $k_1$ and $b$ are the usual saturation and length-normalization parameters. A Zipf-frequency stop filter (backed by the wordfreq corpus) removes terms that carry little occupation-specific signal before the weighting runs. Demonstrated tasks rank by descending weighted similarity, gaps by ascending (largest deficits first).
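The term-frequency component from the formula above, as a worked function ($k_1$ and $b$ values here are common defaults, not Chalkline's tuned configuration):

```python
def bm25_tf(freq: float, doc_len: int, avg_doc_len: float,
            k1: float = 1.5, b: float = 0.75) -> float:
    # Saturating term frequency with document-length normalization.
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))

# The same term counts for more in a shorter-than-average task description:
print(bm25_tf(freq=2, doc_len=8, avg_doc_len=14))   # ~1.66
print(bm25_tf(freq=2, doc_len=30, avg_doc_len=14))  # ~1.04
```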
The career graph connects the 20 career families with directed, weighted edges representing plausible career moves[^3]. Graph-based representations of occupational transitions capture mobility patterns that flat taxonomies miss[^22][^23], and the stepwise constraint ensures edges only link clusters at the same Job Zone (lateral pivots) or one level apart (upward advancement), preventing unrealistic tier-skipping jumps[^24]. Each cluster gets edges to its $k$ nearest eligible neighbors by embedding similarity, so the backbone stays sparse enough to read while every family remains reachable.
Credentials attach per route rather than per edge, meaning that `CareerPathwayGraph.credentials_for(target_id)` applies a destination-affinity filter to the full credential set on demand, so every route the user explores receives a freshly computed, destination-specific credential pool. A credential $r$ enters the pool for destination cluster $d$ when its affinity clears a threshold,

$$\operatorname{affinity}(r, d) = \cos(\mathbf{e}_r, \boldsymbol{\mu}_d) \ge \theta$$

where $\mathbf{e}_r$ is the credential's text embedding, $\boldsymbol{\mu}_d$ is the destination cluster's centroid, and $\theta$ is the configured affinity cutoff.
Once a route's gap set and credential pool are known, the `CredentialSelector` picks up to five credentials that jointly cover as many gaps as possible while minimizing redundant reach. The selector sweeps a waste-penalty parameter $\lambda$, greedily scoring each candidate credential $r$ as

$$\operatorname{score}(r) = |\text{newly covered gaps}| - \lambda \cdot |\text{coverage outside the gap set}|$$

and keeps the sweep setting at the knee of the resulting coverage-versus-waste curve[^25]. Each pick records `positions: frozenset[int]` (newly covered gaps at pick time), which the UI uses to check off only the gaps that credential contributes rather than every gap it can cover in isolation.
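A greedy sketch of one sweep setting (credential names and gap indices are invented; the real selector additionally sweeps $\lambda$ and applies the knee detection):

```python
def select(credentials: dict[str, set[int]], gaps: set[int],
           waste_penalty: float, max_picks: int = 5) -> list[tuple[str, frozenset[int]]]:
    picks: list[tuple[str, frozenset[int]]] = []
    uncovered = set(gaps)
    while uncovered and len(picks) < max_picks:
        def score(name: str) -> float:
            covered = credentials[name] & uncovered
            waste = credentials[name] - gaps  # coverage beyond the route's gap set
            return len(covered) - waste_penalty * len(waste)
        best = max(credentials, key=score)
        newly = frozenset(credentials[best] & uncovered)
        if not newly:
            break
        picks.append((best, newly))  # record positions covered at pick time
        uncovered -= newly
    return picks

picks = select({"OSHA 30": {0, 1}, "NCCER Core": {1, 2, 3, 9}}, {0, 1, 2, 3}, 0.5)
# [("NCCER Core", frozenset({1, 2, 3})), ("OSHA 30", frozenset({0}))]
```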
Every career card shows a median annual wage derived from the BLS OEWS table joined to O*NET SOC codes. Because SOC assignment is probabilistic (the softmax row over `soc_similarity` gives each cluster a distribution over occupations), the wage is computed as a top-K expectation rather than a single-SOC lookup. For cluster $c$,

$$\hat{w}_c = \sum_{o \in \text{top-}K} \tilde{w}_{c,o}\, w_o$$

where $\tilde{w}_{c,o}$ are the cluster's SOC weights renormalized over the top-$K$ occupations with wage coverage and $w_o$ is occupation $o$'s median annual wage. The expectation is stored on each `Cluster` as a post-init attribute alongside `display_title` and `soc_weights`, so downstream consumers (map nodes, route verdicts, the wage-filter slider) read per-cluster values without re-running the computation.
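The expectation in code, on invented numbers (temperature and $K$ are illustrative, not the pipeline's tuned values):

```python
import numpy as np

def expected_wage(similarities: np.ndarray, wages: np.ndarray,
                  top_k: int = 5, temperature: float = 0.05) -> float:
    top = np.argsort(similarities)[-top_k:]  # the K most similar SOCs with wages
    logits = similarities[top] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # renormalize over the top-K
    return float(weights @ wages[top])

sims = np.array([0.62, 0.58, 0.41, 0.39, 0.35, 0.30])
wages = np.array([68_000, 61_500, 55_000, 52_000, 49_000, 47_500])
print(expected_wage(sims, wages))  # ~65,700: dominated by the closest occupations
```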
The data tab characterizes the matched career family through two complementary views that both rest on TF-IDF over clusters-as-documents. The distinctive vocabulary view ranks every word appearing in the matched cluster's postings by

$$\operatorname{tfidf}(w, c) = \operatorname{tf}(w, c) \cdot \log \frac{N}{\operatorname{df}(w)}$$

where $N = 20$ is the number of career families and $\operatorname{df}(w)$ counts the families whose postings contain $w$. Words then partition into three tiers: unique to this family (df = 1), rare across the corpus vocabulary (2 ≤ df ≤ 4), and notable vocabulary that still ranks high in the matched family despite appearing in five or more families. Each tier is sized independently, so a sparser tier never gets visually crowded by a denser one, and words below a minimum raw count threshold are filtered out first to suppress single-occurrence noise.
Sub-role discovery operates at a finer grain by running k-means on the matched cluster's posting embeddings and labeling each sub-cluster with its top-two TF-IDF words, where the "documents" are the sub-clusters themselves. For sub-cluster $s$,

$$\operatorname{tfidf}(w, s) = \operatorname{tf}(w, s) \cdot \log \frac{n_{\text{sub}}}{\operatorname{df}(w)}$$

where $n_{\text{sub}}$ is the number of sub-clusters and $\operatorname{df}(w)$ counts the sub-clusters whose postings contain $w$, so a word that concentrates in one sub-role labels it even when that word is common across the family as a whole.
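Both views share the same clusters-as-documents shape; a toy version (the real pipeline also applies stemming, stop filtering, and minimum-count thresholds):

```python
import math
from collections import Counter

def tfidf(clusters: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    n = len(clusters)
    # Document frequency: how many clusters contain each word at least once.
    df = Counter(word for words in clusters.values() for word in set(words))
    return {
        name: {w: tf * math.log(n / df[w]) for w, tf in Counter(words).items()}
        for name, words in clusters.items()
    }

scores = tfidf({
    "electricians": "conduit circuit conduit panel safety".split(),
    "laborers": "safety shovel grade safety".split(),
})
print(max(scores["electricians"], key=scores["electricians"].get))  # "conduit"
# "safety" appears in every cluster (df = n), so its weight zeroes out.
```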
Cluster labels need to be unique within the corpus so two cards never collide in the picker or the map. Because multiple clusters can legitimately share a SOC title (two Operating Engineers clusters that differ in specialty), the pipeline resolves collisions through a three-level cascade applied asymmetrically per collision group:

1. The bare SOC title.
2. The cluster's modal posting title.
3. A numbered "SOC title (#id)" fallback, unique by construction.

At each pass, the resolver groups clusters by their current label and, for any group with more than one cluster, promotes the smaller members to the next level, breaking ties by descending cluster size with cluster id as the secondary key. The largest Civil Engineers cluster keeps the bare title, while smaller colliding clusters advance to their modal posting title, or to the numbered fallback. The loop runs at most three iterations, because the level-2 fallback is guaranteed unique via the cluster id, and the (#id) form carries SOC context in the rare case where modal titles also collide.
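A compact sketch of the cascade on hypothetical cluster records (the field names `soc_title`, `modal_title`, `size`, and `id` are assumptions, not Chalkline's schema):

```python
def resolve_labels(clusters: list[dict]) -> dict[int, str]:
    levels = [
        lambda c: c["soc_title"],                    # level 0: bare SOC title
        lambda c: c["modal_title"],                  # level 1: modal posting title
        lambda c: f"{c['soc_title']} (#{c['id']})",  # level 2: unique by id
    ]
    labels = {c["id"]: levels[0](c) for c in clusters}
    for level in (1, 2):
        groups: dict[str, list[dict]] = {}
        for c in clusters:
            groups.setdefault(labels[c["id"]], []).append(c)
        for members in groups.values():
            if len(members) > 1:
                # Largest keeps the label; smaller members promote a level.
                members.sort(key=lambda c: (-c["size"], c["id"]))
                for c in members[1:]:
                    labels[c["id"]] = levels[level](c)
    return labels
```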
The Marimo notebook opens to a splash page showing the fitted landscape at a glance (corpus size, occupation count, sector distribution, credential totals) with a drag-and-drop upload zone. Drop a PDF resume and the system extracts text, chunks it into sentences, encodes each chunk, projects through the fitted SVD, and matches to the nearest career family. The splash then dismisses and the three-tab dashboard takes over.
The primary view is an interactive D3 force-directed career map rendered via AnyWidget so Python state (click selection, wage filter) flows reactively back through traitlets. Horizontal position encodes wage, node rendering tier distinguishes the immediate career neighborhood from distant options, and the matched career renders as an enriched hero card integrated into the SVG. Clicking any cluster swaps the route panel below the map to describe the transition from the matched career to the selected destination.
The route panel owns the substantive career-planning content:
- Verdict: fit percentage (calibrated from SVD centroid distance), wage comparison bars for the source and destination, bold narrative verdict, and open-positions count
- Evidence Drawer: the eight strongest demonstrated skills and eight largest gaps, each rendered as a skill card with cosine-weighted progress bars
- Recipe: stacked credential path cards with per-credential gap shelves, where each shelf lists the route's full gap set and checks off only the tasks this credential contributes. Multiple strategies surface side by side (bang-for-your-buck, work-based path, certification stack) so the user can compare approaches
- Postings: up to ten destination-cluster postings ranked by cosine against the resume, rendered as compact cards
- Resources Drawer: the full credential catalog for this route, fuzzy-matched AGC member companies with career-page URLs, and sector-filtered job boards
A wage-floor slider at the top of the map prunes tier-2 cards whose median wage falls below the chosen threshold, defaulting to the corpus floor so every cluster is visible on first render. Debounce mode defers the map's re-render to slider release rather than every tick.
The data tab surfaces corpus statistics that contextualize the match by describing both the ambient job market and the internal structure of the user's assigned career family. The top row aggregates posting counts, sector shares, wage percentiles, and location distribution, so that a reader opening the tab sees the scale of the evidence before diving into specifics. A posting timeline plots every matched posting along its collection date, resulting in a temporal strip where hover text surfaces the company name, so that a reader can reason about seasonality or recent hiring bursts without leaving the notebook.
The matched cluster's internal composition comes from two analytical pieces that reuse the cluster's stored embeddings rather than re-encoding anything at render time. A t-SNE projection of the posting embeddings maps them to two dimensions with PCA initialization for stability, and k-means sub-clustering on the same high-dimensional vectors colors each point by its sub-role assignment, resulting in color bands labeled through the in-cluster TF-IDF formula described above. The distinctive vocabulary treemap sits alongside the projection and partitions the cluster's words into the three tiers (unique to this family, rare across the corpus, notable vocabulary), giving the reader a textual counterpart to the geometric sub-role view. An employer roll-up identifies which companies are hiring most in the matched family, a credential catalog filtered to the destination cluster surfaces ranked certifications and programs, and the tab closes with a relevant job boards listing filtered by sector relevance.
The methods tab documents the pipeline's design choices for technical audiences by combining a visual walkthrough of the Hamilton DAG with the analytical primitives that justified each step's configuration. The tab opens with a process flow diagram rendering every node's parameter-level dependency graph, accompanied by per-node timing pulled from the fit log, so that the reader can see which step dominates wall-clock cost. Bar charts of SVD explained variance reveal how much of the original 768-dimensional signal survives in the 10-component reduction, whereas sector cluster sizes and per-cluster silhouette coefficients describe the partition's balance and separation quality. A scatter plot pairs silhouette against brokerage centrality on the career graph, resulting in a two-dimensional view that distinguishes families that are well-defined but peripheral from families that are both well-defined and well-connected, and a matching brokerage bar chart ranks every cluster by its stepping-stone role in the graph. SOC-similarity heatmaps show how each cluster ranks against every O*NET occupation, providing direct evidence for the MaxSim assignment decisions, and a node-to-file table mirrors chalkline cache output, so that a reader verifying an invalidation subtree can confirm which cached artifacts Hamilton will rebuild on the next fit.
Interactive glossary tooltips sit throughout both analytical tabs via pipeline-specific substitutions, meaning technical terms like silhouette, betweenness, MaxSim, and TruncatedSVD render as underlined popover triggers that reveal rich definitions sourced from display/tabs/shared/glossary.toml without requiring the reader to leave the notebook for external documentation.
Chalkline's CLI is built on Typer with Rich markup. Running chalkline with no arguments prints help.
```
uv run chalkline --help
```

Encode postings, cluster into career families, run SOC assignment, build the career graph, and cache the fitted pipeline. All directory flags default to sensible project-relative paths that work when running from the repository root.
```
uv run chalkline fit            # fit with default paths
uv run chalkline fit --verbose  # same, with debug-level logs
```

| Option | Short | Default | Description |
|---|---|---|---|
| `--lexicon-dir` | | `data/lexicons` | Path to lexicon JSONs (O*NET, credentials, labor) |
| `--postings-dir` | | `data/postings` | Path to the corpus directory |
| `--verbose` | `-v` | `False` | Show diagnostic logs |
Pre-fit the pipeline (hitting cache on unchanged code and config), then start `marimo run` on the career report notebook. Must be run from the project root, where `app/main.py` exists.
```
uv run chalkline launch
uv run chalkline launch --verbose
```

Inspect Hamilton's content-addressed disk cache, listing every cached node, the SHA it keys against, and the on-disk file size. Useful when a code change does not seem to have invalidated what you expected.

```
uv run chalkline cache                            # inspect default .cache/hamilton
uv run chalkline cache --cache-dir path/to/cache  # custom cache root
```

| Component | Technology | Role |
|---|---|---|
| Sentence Encoding | `onnxruntime` + `tokenizers` | ONNX inference for gte-base-en-v1.5 with HuggingFace fast tokenization |
| Machine Learning | `scikit-learn` | TruncatedSVD, Ward HAC, t-SNE, k-means, L2 normalization, cosine similarity, silhouette |
| Pipeline Orchestration | `sf-hamilton[diskcache]` | DAG resolution from function signatures with node-level content-addressed disk caching[^9] |
| Career Graph | `NetworkX` | Directed weighted graph for stepwise k-NN backbone, reach queries, and betweenness centrality[^17] |
| Corpus Collection | `python-jobspy` | Multi-board job aggregation from Indeed and other sources |
| PDF Extraction | `pdfplumber` | Resume text extraction with layout-aware parsing |
| UI | Marimo + AnyWidget | Reactive notebook with custom D3 career-map widget |
| HTML Composition | `htpy` + `MarkupSafe` | Typed HTML element trees for display-layer composition |
| Visualization | Plotly | Interactive charts for landscape, variance, heatmaps, treemaps |
| Vocabulary Filtering | `wordfreq` + `nltk` | Zipf-frequency stop filtering and Snowball stemming for BM25 weighting |
| CLI | Typer | `fit`, `launch`, and `cache` subcommands with Rich markup |
| Configuration | Pydantic | `PipelineConfig` with `extra="forbid"` and tuned defaults |
| Logging | Loguru | Structured pipeline progress and per-node timing |
| Utilities | `python-slugify` | Deterministic posting id construction |
```
chalkline/
├── app/
│   ├── chalkline.css            Dashboard theme (dark, Lora serif, sector palette)
│   └── main.py                  Marimo reactive notebook (career report)
│
├── data/
│   ├── certifications/          CareerOneStop certification curations (committed)
│   │   └── careeronestop.json   Scraped certification records for credential enrichment
│   ├── labor/                   BLS OEWS raw curations (committed)
│   │   ├── outlook.json         O*NET Bright Outlook flags for 53 SOCs
│   │   ├── projections.json     10-year employment projections for 51 SOCs
│   │   └── wages.json           Annual wage percentiles for 50 SOCs
│   ├── lexicons/                Pipeline inputs (committed)
│   │   ├── credentials.json     836 credentials (19 apprenticeships, 787 certs, 30 programs)
│   │   ├── labor.json           Joined wage + projection + outlook table for 53 SOCs
│   │   ├── onet.json            60 SOC codes with Tasks, DWAs, Technology Skills, KSAs
│   │   ├── osha.json            OSHA regulatory topic vocabulary list
│   │   └── supplement.json      Supplemental construction term vocabulary
│   ├── postings/                Scraped AGC corpus (2154 records)
│   └── stakeholder/             AGC Maine reference data (gitignored)
│       ├── additions/           Scope extensions (apprenticeship SOCs, program SOCs)
│       └── reference/           Members, apprenticeships, programs, job boards, etc.
│
├── scripts/                     Repeatable data curation (not part of the package)
│   ├── curate_credentials.py    Build credentials.json from stakeholder refs + enrichment
│   ├── curate_labor.py          Join wages + projections + outlook into labor.json
│   ├── curate_onet.py           Fetch O*NET Tasks, DWAs, Technology Skills, KSAs
│   ├── explore_embeddings.py    Diagnostic tool for SOC assignment investigations
│   ├── parse_agc_workbook.py    Extract stakeholder workbook sheets into reference JSONs
│   ├── parse_certifications.py  Transform CareerOneStop certification scrapes
│   └── parse_labor.py           Parse raw BLS OEWS sheets into the labor subdirectory
│
├── src/chalkline/
│   ├── cli/                     Typer CLI with fit, launch, and cache subcommands
│   │   ├── cache.py             Hamilton cache inspector
│   │   ├── fit.py               Pipeline fitting with cache-or-compute
│   │   └── launch.py            Marimo notebook launcher with pre-fit
│   │
│   ├── collection/              Corpus loading and posting schemas
│   │   ├── collector.py         Filter and key postings from storage
│   │   ├── schemas.py           Posting Pydantic models
│   │   └── storage.py           File-backed posting persistence
│   │
│   ├── display/                 Presentation layer
│   │   ├── charts.py            Plotly chart builders (variance, sector, silhouette, heatmap, scatter)
│   │   ├── forms.py             Marimo UI composers (wage-filter slider)
│   │   ├── loaders.py           ContentLoader + Layout composer for htpy assembly
│   │   ├── routes.py            Route card builders (verdict, evidence, recipe, postings, resources)
│   │   ├── schemas.py           RouteDetail, MapGeometry, CredentialPath, PathItem, MlMetrics, ...
│   │   ├── theme.py             Plotly templates, sector palette, CSS custom property forwarding
│   │   └── tabs/
│   │       ├── data/render.py         Data tab renderer
│   │       ├── map/render.py          Map tab renderer
│   │       ├── map/widget.py          PathwayMap AnyWidget (D3 force-directed)
│   │       ├── methods/render.py      Methods tab renderer
│   │       ├── shared/content.toml    Shared UI labels
│   │       ├── shared/glossary.toml   Glossary tooltip definitions
│   │       └── splash/render.py       Splash page renderer
│   │
│   ├── matching/                Resume-to-career matching
│   │   ├── matcher.py           Sentence chunking, per-task MaxSim, BM25 weighting, SVD projection
│   │   ├── reader.py            PDF text extraction via pdfplumber
│   │   └── schemas.py           MatchResult, BM25Config, ScoredTask models
│   │
│   ├── pathways/                Career graph construction and cluster domain
│   │   ├── clusters.py          Cluster and Clusters dataclasses (wage, display_title cascade)
│   │   ├── graph.py             NetworkX stepwise k-NN backbone with per-pair credentials_for
│   │   ├── loaders.py           LaborLoader and StakeholderReference
│   │   ├── schemas.py           Credential, EncodedOccupation, Occupation, SkillType
│   │   └── selection.py         SOCScorer (ColBERTv2 MaxSim) and CredentialSelector (waste-aware Pareto-knee)
│   │
│   └── pipeline/                Orchestration and shared types
│       ├── encoder.py           ONNX sentence transformer wrapper with CLS pooling
│       ├── orchestrator.py      Hamilton DAG driver → fitted Chalkline dataclass
│       ├── progress.py          Loguru + Rich progress with per-node timing
│       ├── schemas.py           PipelineConfig (Pydantic, extra="forbid")
│       └── steps.py             Hamilton node functions (the full DAG)
│
├── paper/                       Final report and figures (GitHub auto-renders paper/README.md)
│   ├── figures/                 PNG renders referenced from the report
│   └── README.md                Final DS5230 writeup with bibliography
│
├── tests/                       Pytest suite mirroring src/ structure
├── pyproject.toml               Build config, dependencies, CLI entry point
└── uv.lock                      Locked dependency versions
```
Each domain subpackage (`collection/`, `matching/`, `pathways/`, `pipeline/`, `display/`) owns its schemas and logic. The `pipeline/` subpackage orchestrates the others through Hamilton, where each function in `steps.py` is a DAG node whose parameter names declare its dependencies. The `display/` subpackage is organized tab-per-directory, so that each tab's `render.py` owns the Marimo cell composition for that tab, with shared primitives (Layout, Routes, Charts, Forms, Theme) sitting at the package root.
AGC Maine (Associated General Contractors of Maine) represents 222 member companies and has been the state's primary construction trade association since 1951. The association operates the Maine Construction Academy with tuition-free pre-apprenticeship programs expanding to five community colleges in 2026 and manages 19 registered apprenticeship pathways spanning trades from carpentry and welding to crane operation and solar installation.
AGC provided the posting corpus, the stakeholder reference data defining the project's SOC scope and three sectors, and the credential records (apprenticeships, certifications, educational programs) that enrich the career graph. The collaboration connects algorithmic career mapping to a real training pipeline[^26][^27], where outputs directly inform which programs AGC recommends to workers entering or advancing through the trades.
[^1]: Hamilton. 2012. "Career Pathway and Cluster Skill Development: Promising Models from the United States." OECD Local Economic and Employment Development (LEED) Papers 2012/14. https://doi.org/10.1787/5k94g1s6f7td-en
[^2]: Dixon, et al. 2023. "Occupational Models from 42 Million Unstructured Job Postings." Patterns 4 (7): 100757. https://doi.org/10.1016/j.patter.2023.100757
[^3]: del Rio-Chanona, et al. 2021. "Occupational Mobility and Automation: A Data-Driven Network Model." Journal of the Royal Society Interface 18 (174): 20200898. https://doi.org/10.1098/rsif.2020.0898
[^4]: Alabdulkareem, et al. 2018. "Unpacking the Polarization of Workplace Skills." Science Advances 4 (7): eaao6030. https://doi.org/10.1126/sciadv.aao6030
[^5]: Djumalieva & Sleeman. 2018. "An Open and Data-driven Taxonomy of Skills Extracted from Online Job Adverts." ESCoE Discussion Paper 2018-13. https://www.escoe.ac.uk/publications/an-open-and-data-driven-taxonomy-of-skills-extracted-from-online-job-adverts/
[^6]: Ortakci. 2024. "Revolutionary Text Clustering: Investigating Transfer Learning Capacity of SBERT Models through Pooling Techniques." Engineering Science and Technology, an International Journal 55: 101730. https://doi.org/10.1016/j.jestch.2024.101730
[^7]: Ward. 1963. "Hierarchical Grouping to Optimize an Objective Function." Journal of the American Statistical Association 58 (301): 236-244. https://doi.org/10.1080/01621459.1963.10500845
[^8]: de Groot, et al. 2021. "Job Posting-Enriched Knowledge Graph for Skills-based Matching." RecSys in HR '21 Workshop, CEUR Workshop Proceedings, Vol. 2967. https://arxiv.org/abs/2109.02554
[^9]: Krawczyk, et al. 2022. "Hamilton: Enabling Software Engineering Best Practices for Data Transformations via Generalized Dataflow Graphs." 1st International Workshop on Data Ecosystems (DEco@VLDB 2022), CEUR Workshop Proceedings, Vol. 3306: 41-50. https://ceur-ws.org/Vol-3306/paper5.pdf
[^10]: Lukauskas, et al. 2023. "Enhancing Skills Demand Understanding through Job Ad Segmentation Using NLP and Clustering Techniques." Applied Sciences 13 (10): 6119. https://doi.org/10.3390/app13106119
[^11]: Khelkhal & Lanasri. 2025. "Smart-Hiring: An Explainable End-to-End Pipeline for CV Information Extraction and Job Matching." arXiv preprint arXiv:2511.02537. https://doi.org/10.48550/arXiv.2511.02537
[^12]: Aggarwal, Hinneburg & Keim. 2001. "On the Surprising Behavior of Distance Metrics in High Dimensional Space." Database Theory (ICDT 2001), Lecture Notes in Computer Science 1973: 420-434. https://doi.org/10.1007/3-540-44503-X_27
[^13]: Halko, Martinsson & Tropp. 2011. "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions." SIAM Review 53 (2): 217-288. https://doi.org/10.1137/090771806
[^14]: Deerwester, Dumais, Furnas, Landauer & Harshman. 1990. "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science 41 (6): 391-407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[^15]: Zhang, Zhou & Bollegala. 2024. "Evaluating Unsupervised Dimensionality Reduction Methods for Pretrained Sentence Embeddings." Proceedings of LREC-COLING 2024: 6530-6543. https://aclanthology.org/2024.lrec-main.579/
[^16]: Rousseeuw. 1987. "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis." Journal of Computational and Applied Mathematics 20: 53-65. https://doi.org/10.1016/0377-0427(87)90125-7
[^17]: Freeman. 1977. "A Set of Measures of Centrality Based on Betweenness." Sociometry 40 (1): 35-41. https://doi.org/10.2307/3033543
[^18]: Khattab and Zaharia. 2020. "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval: 39-48. https://doi.org/10.1145/3397271.3401075
[^19]: Santhanam, et al. 2022. "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction." Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: 3715-3734. https://doi.org/10.18653/v1/2022.naacl-main.272
[^20]: Achananuparp, et al. 2025. "A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models." arXiv preprint, accepted at ICWSM 2026. https://doi.org/10.48550/arXiv.2503.12989
[^21]: Rosenberger, et al. 2025. "CareerBERT: Matching Resumes to ESCO Jobs in a Shared Embedding Space for Generic Job Recommendations." Expert Systems with Applications 275: 127043. https://doi.org/10.1016/j.eswa.2025.127043
[^22]: Avlonitis, et al. 2023. "Career Path Recommendations for Long-term Income Maximization: A Reinforcement Learning Approach." RecSys in HR '23 Workshop, CEUR Workshop Proceedings, Vol. 3490. https://ceur-ws.org/Vol-3490/RecSysHR2023-paper_2.pdf
[^23]: Boškoski, et al. 2024. "Career Path Discovery through Bipartite Graphs." Journal of Decision Systems 33 (sup1): 140-153. https://doi.org/10.1080/12460125.2024.2354585
[^24]: Senger, et al. 2025. "Toward More Realistic Career Path Prediction: Evaluation and Methods." Frontiers in Big Data 8: 1564521. https://doi.org/10.3389/fdata.2025.1564521
[^25]: Satopaa, et al. 2011. "Finding a 'Kneedle' in a Haystack: Detecting Knee Points in System Behavior." 31st International Conference on Distributed Computing Systems Workshops: 166-171. https://doi.org/10.1109/ICDCSW.2011.20
[^26]: Frej, et al. 2024. "Course Recommender Systems Need to Consider the Job Market." Proceedings of the 47th ACM SIGIR Conference. https://doi.org/10.1145/3626772.3657847
[^27]: Alonso, et al. 2025. "A Novel Approach for Job Matching and Skill Recommendation Using Transformers and the O*NET Database." Big Data Research 39: 100509. https://doi.org/10.1016/j.bdr.2025.100509
