# JobShield

AI-powered fraud detection for students. Paste any suspicious job offer, internship message, or recruitment email and get an instant risk score — backed by semantic embeddings, 11 fraud signals, a self-organizing scam cluster database, and a self-learning scoring engine.
---
- The Problem
- What JobShield Does
- Features at a Glance
- Tech Stack
- System Architecture
- The 11 Fraud Signals
- Scoring System
- Analysis Pipeline — Step by Step
- Clustering & Scam Intelligence Database
- Self-Learning Weight Engine
- Chrome Extension
- Admin Panel & The Training Loop
- Project Structure
- Database Collections
- API Reference
- Environment Variables
- Installation & Setup
- Atlas Vector Search Setup
- Seeding the Database
- Running the Project
- Creating an Admin Account
- Color Palette & UI Design
- Known Limitations
## The Problem

India reported over 11 lakh (1.1 million) cybercrime cases in 2023, with job fraud ranking among the top three categories. The victims are overwhelmingly students and recent graduates — first-time job seekers with no prior experience recognizing scam patterns.
The typical attack:
- Student finds a listing on Internshala, LinkedIn, or receives a WhatsApp message
- Recruiter asks for a "registration fee," "security deposit," or "onboarding contribution" of ₹500–₹2000
- Student pays via UPI or bank transfer
- Recruiter disappears
The gap: Existing tools like ScamAdviser or Google Safe Browsing check URLs and domain reputation — they don't read the actual content of the message. A scammer using a Gmail address and WhatsApp leaves no URL to check.
JobShield was built for exactly this scenario.
## What JobShield Does

JobShield analyzes the raw text of any job offer, internship message, or recruitment communication and returns:
- A 0–100 fraud risk score
- A human-readable explanation of every signal that fired
- A classification: Low Risk / Suspicious / High Risk / Likely Scam
There are two core flows:
| Flow | Endpoint | Saves to DB? | Purpose |
|---|---|---|---|
| Check | `POST /api/check` | No | Quick ephemeral scan — paste and check |
| Report | `POST /api/reports` | Yes | Submit a scam — adds to database, improves the system |
## Features at a Glance

| Feature | Description |
|---|---|
| Semantic similarity | 384-dim embeddings catch scam variants and paraphrases — not just exact matches |
| 11 fraud signals | Across 4 categories: semantic, linguistic, identity, and infrastructure |
| Self-organizing clusters | Every report automatically merges into, attaches to, or creates a scam cluster |
| Self-learning weights | Logistic regression trains on admin-verified reports every 6 hours |
| Chrome extension | Auto-extracts text from Gmail and Internshala, shows result inline |
| NER model | Detects brand impersonation and domain-org mismatches using BERT-NER |
| Admin pipeline | Full review queue, cluster verification, reputation scoring, manual retraining |
| Combination bonus | Multiple weak signals together trigger a compound score boost |
| Floor boosts | Confirmed scam matches always produce "Likely Scam" regardless of other signals |
| Graceful degradation | Every external API call fails gracefully — the pipeline always completes |
## Tech Stack

| Layer | Technology |
|---|---|
| Frontend | React 18, Vite |
| Backend | Node.js 20, Express, ES Modules |
| Database | MongoDB Atlas (M0 free tier) |
| Vector Search | Atlas Vector Search — ANN cosine similarity |
| Embedding Model | sentence-transformers/all-MiniLM-L6-v2 (384-dim) via HuggingFace |
| NER Model | dslim/bert-base-NER via HuggingFace |
| Inference API | HuggingFace Inference API (router.huggingface.co) |
| Domain Intelligence | WhoisXML API (with MongoDB cache) |
| Authentication | JWT (jsonwebtoken + bcryptjs) |
| Chrome Extension | Manifest V3, content scripts, service worker |
| ML Engine | Logistic regression implemented from scratch in Node.js — no ML libraries |
## The 11 Fraud Signals

Signals are divided into four categories. Each carries a default weight — once enough admin-verified data is collected, the logistic regression engine learns better weights from real data.
### Semantic signals

These two signals are mutually exclusive — only the higher-confidence one fires.
| # | Signal | Default Weight | Trigger |
|---|---|---|---|
| 1 | Confirmed scam match | 35 | Cosine similarity ≥ 0.95 against a verified cluster |
| 2 | High similarity match | 25 | Cosine similarity 0.85–0.95 against any cluster |
Signal 1 also applies a floor boost — any text triggering it scores a minimum of 72/100 (Likely Scam), regardless of other signals.
### Identity signals

| # | Signal | Default Weight | Trigger |
|---|---|---|---|
| 3 | Domain mismatch | 20 | NER detects an ORG name that doesn't match the sender's domain |
| 4 | Young domain | 18 | WHOIS shows domain registered < 90 days ago |
| 6 | Big brand impersonation | 15 | NER detects PayPal, Amazon, Google, HDFC, Flipkart, Paytm etc. |
### Linguistic signals

| # | Signal | Default Weight | Trigger |
|---|---|---|---|
| 5 | Payment language | 22 | Regex detects: registration fee, UPI payment, advance deposit, security deposit etc. |
| 9 | Telegram present | 14 | Regex detects t.me/ link or @handle (extremely common in Indian job scams) |
| 8 | Free email provider | 12 | Sender email is Gmail, Yahoo, Outlook, Rediffmail, Hotmail etc. |
| 11 | Urgency detected | 10 × urgencyScore | Keywords: "urgent", "act now", "expires today", "limited seats" etc. |
### Infrastructure signals

| # | Signal | Default Weight | Trigger |
|---|---|---|---|
| 7 | Suspicious TLD | 12 | Domain ends in .xyz, .top, .click, .tk, .loan, .win, .bid etc. |
| 10 | Previously reported | 16 | Same domain or cluster pattern reported 3+ times |
### Combination bonus

When 3 or more weak signals fire together, a compound bonus is applied:
| Weak signals firing | Bonus added |
|---|---|
| 3 | +15 |
| 4 | +20 |
| 5 | +25 |
Weak signals counted: freeEmailProvider, telegramPresent, urgencyDetected, suspiciousTLD, paymentLanguage.
This reflects that multiple weak signals together are significantly more suspicious than their individual weights suggest. A message from a Gmail address, asking for UPI payment, with a Telegram link, and urgency language is a textbook Indian job scam — even if each signal alone seems minor.
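As a sketch of this lookup (signal names are taken from the tables above; the actual logic in scoringService.js may differ in detail):

```javascript
// Weak signals counted toward the combination bonus
const WEAK_SIGNALS = [
  "freeEmailProvider", "telegramPresent", "urgencyDetected",
  "suspiciousTLD", "paymentLanguage",
];

// +15 / +20 / +25 for 3 / 4 / 5 weak signals firing together
function combinationBonus(signals) {
  const firing = WEAK_SIGNALS.filter((name) => signals[name]).length;
  if (firing >= 5) return 25;
  if (firing === 4) return 20;
  if (firing === 3) return 15;
  return 0;
}

// A Gmail sender asking for UPI payment, with a Telegram link and urgency:
combinationBonus({
  freeEmailProvider: true, telegramPresent: true,
  urgencyDetected: true, paymentLanguage: true,
}); // 4 weak signals → +20
```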
## Scoring System

```
Step 1 — Evaluate all 11 signals
         Sum triggered weights → rawScore

Step 2 — Add combination bonus (if 3+ weak signals fired)

Step 3 — Normalize against theoretical maximum
         maxPossible     = sum of ALL weights if every signal fired
         normalizedScore = (rawScore / maxPossible) × 100

Step 4 — Apply floor boosts
         confirmedScamMatch          → minimum score 72 (Likely Scam)
         highSimilarityMatch         → minimum score 55 (High Risk)
         paymentLanguage + telegram  → minimum score 60 (High Risk)
         paymentLanguage + freeEmail → minimum score 60 (High Risk)

Step 5 — Clamp to 0–100
```
| Score Range | Classification | Meaning |
|---|---|---|
| 0 – 25 | Low Risk | No significant signals — appears legitimate |
| 26 – 50 | Suspicious | Some signals fired — proceed with caution |
| 51 – 70 | High Risk | Multiple strong signals — likely fraudulent |
| 71 – 100 | Likely Scam | High confidence fraud — do not engage |
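Putting the five steps and the classification bands together, a minimal sketch (helper names are hypothetical; weights are the defaults from the signal tables, before any learning):

```javascript
// Default weights from the signal tables. Urgency is simplified to a boolean
// here; the real engine scales it as 10 × urgencyScore.
const WEIGHTS = {
  confirmedScamMatch: 35, highSimilarityMatch: 25, domainMismatch: 20,
  youngDomain: 18, paymentLanguage: 22, bigBrandMentioned: 15,
  suspiciousTLD: 12, freeEmailProvider: 12, telegramPresent: 14,
  previouslyReported: 16, urgencyDetected: 10,
};

function score(signals, bonus = 0) {
  // Steps 1–2: sum triggered weights, add combination bonus
  const raw = Object.keys(WEIGHTS)
    .filter((k) => signals[k])
    .reduce((sum, k) => sum + WEIGHTS[k], 0) + bonus;
  // Step 3: normalize against the theoretical maximum
  const maxPossible = Object.values(WEIGHTS).reduce((a, b) => a + b, 0);
  let s = Math.round((raw / maxPossible) * 100);
  // Step 4: floor boosts
  if (signals.confirmedScamMatch) s = Math.max(s, 72);
  else if (signals.highSimilarityMatch) s = Math.max(s, 55);
  if (signals.paymentLanguage && (signals.telegramPresent || signals.freeEmailProvider))
    s = Math.max(s, 60);
  // Step 5: clamp
  return Math.min(100, Math.max(0, s));
}

function classify(s) {
  if (s <= 25) return "Low Risk";
  if (s <= 50) return "Suspicious";
  if (s <= 70) return "High Risk";
  return "Likely Scam";
}

score({ confirmedScamMatch: true }); // floor boost lifts it to 72 → "Likely Scam"
```

Note how the floor boosts do most of the work for confirmed matches: a single +35 signal normalizes to well under 25, but the floor guarantees the "Likely Scam" band.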
## Analysis Pipeline — Step by Step

Every request to /api/check or /api/reports runs the same pipeline in analysisService.js.
Text is hashed with SHA-256. The hash is looked up in EmbeddingCache (MongoDB, 7-day TTL auto-expiry). On a cache miss, the text is sent to HuggingFace:
Model: sentence-transformers/all-MiniLM-L6-v2
Output: float[384] — a vector representing the semantic meaning of the text
This is the foundation of the entire similarity system. Two messages that mean the same thing — even with completely different words — will have numerically similar vectors. The embedding is reused in Step 3 for Atlas Vector Search.
Three tasks run simultaneously:
Regex Extraction (regexService.js) — Synchronous, no external calls, instant. Scans text with compiled regex patterns and extracts:
- Payment language (UPI, registration fee, advance deposit, bank transfer, etc.)
- Telegram links and @handles
- Email addresses and their domains (checks against free provider list)
- URLs and their TLDs (checks against suspicious TLD list)
- Urgency keywords (normalized to 0–1 score: triggered / 3, capped at 1)
- Primary domain (first URL domain found, or email domain)
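A sketch of the extraction logic with simplified, illustrative patterns (the real lists in regexService.js are longer):

```javascript
// Illustrative patterns only — not the full production lists
const TELEGRAM_RE = /(?:t\.me\/\w+|@\w{4,})/i;
const PAYMENT_RE  = /registration fee|security deposit|advance deposit|upi/i;
const URGENCY_WORDS = ["urgent", "act now", "expires today", "limited seats"];

// Urgency normalized to a 0–1 score: triggered keywords / 3, capped at 1
function urgencyScore(text) {
  const lower = text.toLowerCase();
  const triggered = URGENCY_WORDS.filter((w) => lower.includes(w)).length;
  return Math.min(triggered / 3, 1);
}

const msg = "Urgent! Limited seats. Pay the registration fee via UPI: t.me/hrdesk";
PAYMENT_RE.test(msg);  // true
TELEGRAM_RE.test(msg); // true
urgencyScore(msg);     // 2 keywords triggered / 3 ≈ 0.67
```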
NER Model (nerService.js) — Calls dslim/bert-base-NER via HuggingFace Inference API. Extracts ORG and PER entity groups. Checks ORGs against a hardcoded list of major brands. Fails gracefully — NER failure skips signals 3 and 6 without crashing the pipeline.
Domain Mismatch (regexService.js) — Uses NER ORG entities and the primary domain extracted by regex. If any detected ORG name does not appear in the sender's domain string, domainMismatch = true. A message claiming to be from "Google" but using a domain like jobs-portal.com triggers this signal.
WHOIS Lookup (domainService.js) — Checks the DomainIntelligence MongoDB collection first. On a cache miss, calls the WHOIS API, stores the result with domain age in days and registrar name. Fails gracefully — WHOIS failure skips signal 4.
Atlas Vector Search (similarityService.js) — Queries the ScamClusters collection using the embedding from Step 1. Returns the 5 nearest clusters by cosine similarity. The closest match is evaluated against the 0.95 and 0.85 thresholds for signals 1 and 2. Also counts total reports matching the same domain for signal 10.
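The Atlas query could look like the following aggregation sketch. The index and field names come from the setup section later in this README; `numCandidates` is an assumed tuning value, and the query vector would be the embedding from Step 1:

```javascript
// Placeholder 384-dim embedding — in the pipeline this comes from Step 1
const queryVector = new Array(384).fill(0);

const pipeline = [
  {
    $vectorSearch: {
      index: "cluster_vector_index",
      path: "clusterEmbedding",
      queryVector,
      numCandidates: 100, // ANN candidate pool (assumed value)
      limit: 5,           // 5 nearest clusters
    },
  },
  {
    $project: {
      representativeText: 1,
      verified: 1,
      score: { $meta: "vectorSearchScore" }, // similarity of each match
    },
  },
];
```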
weightService.js returns current weights from memory. No DB call, no API call. Under 1ms.
- < 20 labelled reports: Returns calibrated hardcoded defaults
- 20+ labelled reports: Returns logistic regression weights, retrained every 6 hours
- Blend ratio: 20% learned / 80% defaults at 20 samples → 80% learned / 20% defaults at 150+ samples
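A sketch of the blend, using the bucket boundaries described in the Self-Learning Weight Engine section (function names are hypothetical):

```javascript
// Fraction of the final weight taken from the learned model, by sample count
function blendRatio(sampleCount) {
  if (sampleCount < 20) return 0;    // too little data: defaults only
  if (sampleCount < 50) return 0.2;
  if (sampleCount < 150) return 0.5;
  return 0.8;                        // never 100% learned
}

function blendWeight(learned, fallback, sampleCount) {
  const r = blendRatio(sampleCount);
  return Math.round(r * learned + (1 - r) * fallback);
}

blendWeight(19, 14, 200); // 0.8 × 19 + 0.2 × 14 = 18 — mostly learned
```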
scoringService.js applies weights to triggered signals, computes the combination bonus, normalizes against the theoretical maximum, and applies floor boosts.
explanationBuilder.js maps triggered signal flags to an ordered array of plain-English sentences. This is what users see — a clear explanation of exactly why a message scored the way it did.
## Clustering & Scam Intelligence Database

Every report submitted via /api/reports is routed through clusterService.js, which compares its embedding against all existing scam clusters and makes one of three decisions.
```
New report embedding arrives
        │
        ▼
Atlas Vector Search → find closest existing cluster
        │
Compare similarityScore of best match
        │
   ┌────┴─────────────────────┬──────────────────────┐
   │                          │                      │
 ≥ 0.95                  0.85 – 0.95               < 0.85
   │                          │                      │
   ▼                          ▼                      ▼
 MERGE                     ATTACH                 CREATE
 Update centroid           reportCount++          New cluster born
 (running average)         lastReportedAt         from this embedding
 reportCount++             Centroid UNCHANGED     reportCount = 1
 averageRiskScore updated  Prevents drift         verified = false
 Signal 1 fires (+35)      Signal 2 fires (+25)   No signal fires
```
When a report merges (≥ 0.95 similarity), the cluster centroid shifts toward the new embedding:

```
newCentroid[i] = (oldCentroid[i] × (count − 1) + newEmbedding[i]) / count
```

Over time the centroid represents the true semantic center of all scams in that cluster, making future similarity searches more accurate. The centroid is intentionally not updated for ATTACH operations (0.85–0.95) — this prevents variant wordings from drifting a cluster away from its original identity.
The first time a new scam template appears, it creates a cluster with verified: false. The second time a similar message arrives, it merges in. By the third or fourth report, an admin can review and set verified: true — which activates Signal 1 (+35, floor boost to 72 minimum) for every future similar message.
The database is self-organizing and self-improving. The 500th report benefits from the intelligence of all 499 before it.
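The three-way decision can be sketched as follows (function name hypothetical; thresholds from the diagram above):

```javascript
// MERGE / ATTACH / CREATE decision for an incoming report embedding
function integrateReport(cluster, embedding, similarity) {
  if (!cluster || similarity < 0.85) {
    // CREATE — a brand-new cluster, unverified until admin review
    return { clusterEmbedding: [...embedding], reportCount: 1, verified: false };
  }
  if (similarity >= 0.95) {
    // MERGE — running-average centroid update
    const n = cluster.reportCount + 1;
    cluster.clusterEmbedding = cluster.clusterEmbedding.map(
      (c, i) => (c * (n - 1) + embedding[i]) / n
    );
    cluster.reportCount = n;
    return cluster;
  }
  // ATTACH — count the report, but leave the centroid untouched (no drift)
  cluster.reportCount += 1;
  return cluster;
}
```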
## Self-Learning Weight Engine

weightService.js implements logistic regression from scratch in Node.js — no Python, no ML libraries, no GPU, no external dependencies.
```
Admin marks reports as "verified-scam" or "rejected"
        │
        ▼
Every 6 hours (or POST /api/admin/retrain):

  1. Pull all verified-scam reports → label 1
     Pull all rejected reports     → label 0

  2. For each report, build 11-dimensional feature vector:
     [confirmedScamMatch, highSimilarityMatch, domainMismatch,
      youngDomain, paymentLanguage, bigBrandMentioned,
      suspiciousTLD, freeEmailProvider, telegramPresent,
      previouslyReported, urgencyScore]

  3. Run gradient descent (1000 epochs, lr=0.05):
     Minimise binary cross-entropy loss
     Find weights w[] that best separate scam (1) from clean (0)

  4. Scale coefficients → integer scoring weights summing to ~170

  5. Apply dynamic blend with calibrated defaults:
     < 50 samples    → 20% learned + 80% defaults
     50–150 samples  → 50% learned + 50% defaults
     150+ samples    → 80% learned + 20% defaults

  6. Store in memory → all subsequent requests use these weights
```
Signals that consistently appear in verified scam reports — but not in rejected (clean) reports — receive higher weights. Signals that appear in both get lower weights. The scoring system adapts to the actual distribution of scams in your dataset, not a theoretical assumption.
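A minimal from-scratch version of the same training step, shown on toy two-feature data instead of the 11-dimensional signal vectors (the hyperparameters match the flow above; everything else is a sketch):

```javascript
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// Stochastic gradient descent on binary cross-entropy loss
function train(X, y, epochs = 1000, lr = 0.05) {
  let w = new Array(X[0].length).fill(0);
  let b = 0;
  for (let e = 0; e < epochs; e++) {
    for (let i = 0; i < X.length; i++) {
      const p = sigmoid(X[i].reduce((s, x, j) => s + x * w[j], b));
      const err = p - y[i]; // gradient of BCE w.r.t. the logit
      w = w.map((wj, j) => wj - lr * err * X[i][j]);
      b -= lr * err;
    }
  }
  return { w, b };
}

// Feature 0 (telegramPresent-like) separates scams; feature 1 is noise
const X = [[1, 1], [1, 0], [0, 1], [0, 0]];
const y = [1, 1, 0, 0];
const model = train(X, y);
// model.w[0] ends up strongly positive; model.w[1] stays near zero
```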
Example output after training:

```
Learned weights active:
  confirmedScamMatch : 38  (default 35, ↑ +3)
  telegramPresent    : 19  (default 14, ↑ +5)
  freeEmailProvider  :  9  (default 12, ↓ -3)
  paymentLanguage    : 26  (default 22, ↑ +4)
```
The harder side of building training data is collecting non-scam examples — most users only submit messages they're suspicious of. On server startup, seedCleanExamples() automatically labels any pending reports that scored under 20 with no payment/telegram/brand signals as rejected, bootstrapping the non-scam training set without any manual work.
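The auto-labeling rule can be sketched as a predicate over the Report fields (the real seedCleanExamples() implementation may apply additional checks):

```javascript
// A pending report is a clean-training candidate when it scored low and
// fired none of the hard signals (payment / telegram / brand).
function isCleanCandidate(report) {
  const s = report.structuredSignals;
  return (
    report.status === "pending" &&
    report.riskScore < 20 &&
    !s.paymentLanguage && !s.telegramPresent && !s.bigBrandMentioned
  );
}
```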
## Chrome Extension

The Chrome extension brings JobShield to the point of attack — inside Gmail and Internshala — without asking the user to copy and paste anything.
```
popup.html / popup.js  ← Login screen + analysis screen
background.js          ← Service worker: handles all API calls
content.js             ← Injected into Gmail + Internshala pages
config.js              ← Single source of truth for API and app URLs
icons/                 ← icon16.png, icon48.png, icon128.png
```
- User opens a suspicious email in Gmail or a job listing on Internshala
- Clicks the JobShield icon in the Chrome toolbar
- Clicks "Auto-extract text from this page" — `content.js` reads the DOM
- Clicks "Analyse" — `background.js` calls `POST /api/check`
- Risk score and explanation appear in the popup
- A floating badge overlays the page itself, showing the result in context
| Site | Primary Selectors | Fallback |
|---|---|---|
| Gmail | `.a3s.aiL`, `.a3s`, `[data-message-id]` | `[role="textbox"]` |
| Internshala | `.internship_details`, `.job-detail-section`, `#internship_detail`, `.detail_view` | 10+ additional selectors, then full-page text scraper |
| Generic | `main`, `article`, `[role="main"]`, `.content` | `body` (first 5000 chars) |
- CSP compliance: No inline `onclick` handlers anywhere — all events wired via `addEventListener`. Required by Manifest V3's strict Content Security Policy.
- Double injection guard: `window.__jobshieldInjected` prevents the content script from re-registering listeners if injected multiple times.
- Graceful injection errors: `ensureContentScript()` wraps injection in try/catch. The popup always renders even if the page disallows injection.
- JWT persistence: Stored in `chrome.storage.local` — survives browser restarts.
- Timeout safety: All `chrome.tabs.sendMessage` calls are protected against missing content scripts with proper error handling.
- Open Chrome → `chrome://extensions`
- Enable Developer mode (top-right toggle)
- Click Load unpacked → select the `extension/` folder
- JobShield icon appears in the toolbar

After any code change, click the refresh icon on the extension card in `chrome://extensions`.
## Admin Panel & The Training Loop

The admin panel is not just a moderation dashboard — it is the mechanism that trains the entire scoring system.
| Action | Effect on system |
|---|---|
| View all reports | Filter by status, classification, date; sort by any field |
| Mark verified-scam | `user.verifiedReports++` · reputationScore +10 · `cluster.verified = true` |
| Mark rejected | `user.rejectedReports++` · reputationScore −5 |
| Verify a cluster | `cluster.verified = true` — activates Signal 1 (+35) for ALL future similar messages |
| Unverify a cluster | Deactivates Signal 1 for that cluster |
| Force retrain | `POST /api/admin/retrain` — immediately reruns logistic regression |
```
User submits suspicious message
        │
        ▼
Pipeline runs → report saved
High-risk reports auto-flagged (score > 70) → enter admin queue
        │
        ▼
Admin reviews → marks verified-scam or rejected
        │
        ├── Cluster verified
        │     → Signal 1 activates for all future similar messages
        │     → Protects every future user who sends a similar scam
        │
        └── 20+ verified + 20+ rejected reports reached
              → Logistic regression retrains (every 6h or on demand)
              → Weights shift to reflect real data
              → Scoring improves for all future analyses
              → More reports correctly classified
              → Admin efficiency improves
              → Loop continues
```
Every admin review action makes the entire system more accurate for every future user.
Users accumulate a `reputationScore` based on report quality:
- +10 for each report admin-confirms as a real scam
- -5 for each report admin-rejects as not a scam
High-reputation users are more trustworthy community contributors. This surfaces over time in the admin panel.
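A sketch of the reputation bookkeeping applied when an admin resolves a report (function name hypothetical; values from the list above):

```javascript
// Adjust a user's counters and reputation when their report is reviewed
function applyReview(user, status) {
  if (status === "verified-scam") {
    user.verifiedReports += 1;
    user.reputationScore += 10;
  } else if (status === "rejected") {
    user.rejectedReports += 1;
    user.reputationScore -= 5;
  }
  return user;
}
```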
## Database Collections

### User

```js
{
  email: String,            // unique, indexed
  password: String,         // bcrypt hashed, never returned in responses
  role: String,             // "user" | "admin"
  reputationScore: Number,  // starts 0, +10 verified / -5 rejected
  totalReports: Number,
  verifiedReports: Number,
  rejectedReports: Number,
  createdAt: Date
}
```

### Report

```js
{
  userId: ObjectId,         // ref: User, indexed
  rawText: String,
  textHash: String,         // SHA-256, indexed
  embedding: [Number],      // 384-dim (excluded from list endpoints)
  structuredSignals: {
    paymentLanguage: Boolean,
    domainMismatch: Boolean,
    bigBrandMentioned: Boolean,
    suspiciousTLD: Boolean,
    freeEmailProvider: Boolean,
    telegramPresent: Boolean,
    previouslyReported: Boolean,
    urgencyScore: Number,   // used as continuous feature in weight learning
  },
  domain: String,
  domainAgeDays: Number,
  registrar: String,
  similarityScore: Number,
  riskScore: Number,
  classification: String,   // enum: Low Risk | Suspicious | High Risk | Likely Scam
  explanation: [String],
  clusterId: ObjectId,      // ref: ScamCluster
  status: String,           // pending | auto-flagged | verified-scam | rejected
  location: String,
  paymentMethod: String,
  createdAt: Date
}
```

### ScamCluster

```js
{
  clusterEmbedding: [Number],  // 384-dim centroid — running average of merged embeddings
  representativeText: String,  // most recent merged text (truncated to 500 chars)
  reportCount: Number,         // total reports merged or attached
  averageRiskScore: Number,
  verified: Boolean,           // true activates Signal 1 (+35) for future matches
  dominantDomain: String,
  dominantBrand: String,
  firstReportedAt: Date,
  lastReportedAt: Date
}
```

### EmbeddingCache

```js
{
  textHash: String,    // SHA-256, unique
  embedding: [Number], // 384-dim
  createdAt: Date      // TTL index: auto-deleted after 7 days
}
```

### DomainIntelligence

```js
{
  domain: String,     // unique
  ageDays: Number,
  registrar: String,
  flagCount: Number,  // increments on each report referencing this domain
  createdAt: Date
}
```

## API Reference

### Auth

| Method | Endpoint | Auth | Body | Response |
|---|---|---|---|---|
| POST | `/api/auth/register` | None | `{ email, password }` | `{ token, user }` |
| POST | `/api/auth/login` | None | `{ email, password }` | `{ token, user }` |
| GET | `/api/auth/me` | JWT | — | `{ user }` |
### Analysis

| Method | Endpoint | Auth | Body | Response |
|---|---|---|---|---|
| POST | `/api/check` | JWT | `{ text }` | `{ riskScore, classification, explanation[], signals{} }` |
| POST | `/api/reports` | JWT | `{ text, location?, paymentMethod? }` | `{ reportId, riskScore, classification, explanation[], status }` |
### Reports

| Method | Endpoint | Auth | Query Params | Description |
|---|---|---|---|---|
| GET | `/api/reports` | JWT | `?page=1&limit=10` | Own report history (paginated) |
| GET | `/api/reports/:id` | JWT | — | Single report detail |
### Dashboard

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | `/api/dashboard/stats` | JWT | Personal stats, recent reports, classification breakdown |
### Admin

| Method | Endpoint | Auth | Body / Query | Description |
|---|---|---|---|---|
| GET | `/api/admin/reports` | JWT + Admin | `?status=&classification=&page=&sortBy=&order=` | All reports with filters |
| GET | `/api/admin/reports/:id` | JWT + Admin | — | Single report detail |
| PATCH | `/api/admin/reports/:id` | JWT + Admin | `{ status }` | Update status — triggers user reputation adjustment |
| GET | `/api/admin/clusters` | JWT + Admin | `?verified=true&page=` | All clusters |
| PATCH | `/api/admin/clusters/:id` | JWT + Admin | `{ verified }` | Verify/unverify cluster |
| POST | `/api/admin/retrain` | JWT + Admin | — | Force immediate logistic regression retraining |
Example response from `POST /api/check`:

```json
{
  "riskScore": 85,
  "classification": "Likely Scam",
  "explanation": [
    "Matches a verified scam template (99.4% similarity)",
    "Message requests payment, deposit, or fee upfront",
    "Sender domain doesn't match the organisation named in the message",
    "This pattern has been reported 7 times by other users",
    "Multiple scam signals detected together — elevated risk"
  ],
  "signals": {
    "confirmedScamMatch": true,
    "highSimilarityMatch": false,
    "similarityScore": 0.994,
    "paymentLanguage": false,
    "domainMismatch": true,
    "domainAgeDays": 2573,
    "bigBrandMentioned": false,
    "suspiciousTLD": false,
    "freeEmailProvider": false,
    "telegramPresent": false,
    "previouslyReported": true,
    "urgencyDetected": false,
    "urgencyScore": 0
  },
  "cached": true
}
```

## Environment Variables

```bash
# MongoDB Atlas
MONGODB_URI=mongodb+srv://<user>:<password>@<cluster>.mongodb.net/<dbname>

# JWT
JWT_SECRET=your_strong_secret_here
JWT_EXPIRES_IN=7d

# HuggingFace — free account at huggingface.co → Settings → Access Tokens
HUGGINGFACE_API_KEY=hf_xxxxxxxxxxxxxxxxxxxx
HF_NER_MODEL=dslim/bert-base-NER

# WHOIS — free account at whoisxmlapi.com (500 queries/month free)
WHOIS_API_KEY=your_whois_api_key

# Similarity thresholds (these are the defaults — omit to use defaults)
SIMILARITY_CONFIRMED_THRESHOLD=0.95
SIMILARITY_HIGH_THRESHOLD=0.85

# Server
PORT=5000
CLIENT_URL=http://localhost:5173
```

The Chrome extension reads its URLs from `config.js`:

```js
const JOBSHIELD_CONFIG = {
  API_BASE_URL: 'http://localhost:5000/api', // → your deployed API URL for production
  APP_URL: 'http://localhost:5173',          // → your deployed app URL for production
}
```

## Installation & Setup

Prerequisites:

- Node.js 18+
- A MongoDB Atlas account — free M0 tier is sufficient
- A HuggingFace account — free, for the Inference API key
- A WhoisXML API account — free tier gives 500 queries/month
```bash
# 1. Clone
git clone https://github.com/yourusername/jobshield.git
cd jobshield

# 2. Install dependencies
cd server && npm install
cd ../client && npm install

# 3. Configure environment
cd ../server
cp .env.example .env
# Edit .env and fill in all required values
```

See Atlas Vector Search Setup below — required before signals 1 and 2 work.

See Seeding the Database below — it populates verified scam clusters so similarity signals work from day one.
```bash
# Terminal 1 — backend
cd server && npm run dev

# Terminal 2 — frontend
cd client && npm run dev
```

## Atlas Vector Search Setup

JobShield requires one mandatory vector search index (and one optional). These must be created manually in the Atlas UI — they cannot be created programmatically on the M0 free tier.
- Go to cloud.mongodb.com → your cluster → Atlas Search tab
- Click Create Search Index → select Atlas Vector Search (not Atlas Full Text Search)
- Select your database → `scamclusters` collection
- Replace the default JSON with:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "clusterEmbedding",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}
```

- Name the index exactly `cluster_vector_index`
- Click Create Search Index → wait for status Active (1–2 minutes)

Repeat the same process on the `reports` collection: path `embedding`, index name `report_vector_index`.
Submit the same scam message twice. Your server terminal should show:
```
Vector search returned 5 results
Best match: 69a28b..., score: 0.9987, verified: false
Signal 2 fired (+20)
Merged into cluster 69a28b... (similarity: 0.999, count: 2)
```
## Seeding the Database

Without seed data, signals 1 and 2 will never fire because ScamClusters is empty. The seed script pre-populates it with verified scam clusters drawn from a Kaggle fake job postings dataset.
Step 1 — Download the dataset from Kaggle: https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction
Step 2 — Place the CSV at `server/scripts/fake_job_posting.csv`

Step 3 — Run the seed script:

```bash
cd server
npm run seed
```

The script embeds the 860 fake jobs via HuggingFace and inserts them as `verified: true` clusters. It takes 5–10 minutes due to HF API rate limiting (a 200 ms delay between requests). Progress is printed to the terminal.
## Running the Project

```bash
# Backend (http://localhost:5000)
cd server
npm run dev      # nodemon hot-reload
npm start        # no hot-reload

# Frontend (http://localhost:5173)
cd client
npm run dev      # Vite dev server
npm run build    # production build
npm run preview  # preview production build locally

# Database seeding
cd server
npm run seed     # import the Kaggle dataset into ScamClusters
```

## Creating an Admin Account

All accounts register as `role: "user"` by default. To promote an account to admin, use the
MongoDB Atlas UI (Collections → users) or mongosh:
```js
db.users.updateOne(
  { email: "your@email.com" },
  { $set: { role: "admin" } }
)
```

Log out and log back in for the role change to apply. Admin users unlock the /admin/reports and /admin/clusters pages in the frontend, and the admin API endpoints.
The most impactful admin action: verifying a cluster. Setting `verified: true` on a cluster activates Signal 1 (+35 points, minimum score 72) for every future message that semantically matches it. A few minutes of admin review in the early days can protect thousands of future users.
Built with Node.js · React · MongoDB Atlas · HuggingFace Inference API · Chrome Extensions Manifest V3