Price extraction and comparison documentation

Overview

This project is a location-aware shopping basket price comparison system for retail chains. It automatically scrapes official retail chain price data, stores it in a database, and lets users compare the total cost of their shopping cart across nearby branches.

Features

Automated Scraping: Uses Selenium with a headless Chrome browser to fetch product files from official retailer websites. Supports XML files compressed as .gz or .zip.
Data Parsing & Storage: Extracts Store, Chain, and Item details from the XML files and stores them in a structured database through the SQLAlchemy ORM.
Shopping Cart Price Comparison: Users provide a list of items. The system computes basket costs across chains and returns the five cheapest branches closest to the user’s location.
REST API:
GET /compare_list → Returns the price comparison for a predefined item list.
POST /compare_list → Accepts a JSON body with custom items.
The response includes store name, branch, distance, and basket cost.
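
As a rough illustration of the .gz and .zip handling mentioned under Automated Scraping, the sketch below decompresses a downloaded payload with Python's standard library and parses it as XML. The helper name `read_price_xml` and the sample `<Item>` layout are assumptions made for illustration; they are not the project's actual code or the retailers' real schema.

```python
import gzip
import io
import zipfile
import xml.etree.ElementTree as ET


def read_price_xml(payload: bytes) -> ET.Element:
    """Return the XML root of a .gz, .zip, or plain XML payload (hypothetical helper)."""
    if payload[:2] == b"\x1f\x8b":  # gzip magic number
        payload = gzip.decompress(payload)
    elif payload[:4] == b"PK\x03\x04":  # zip local-file header
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            # Assumption: the archive holds a single XML file; read the first entry.
            payload = archive.read(archive.namelist()[0])
    return ET.fromstring(payload)


# Sample document with an assumed layout, used only to exercise the helper.
sample = b"<Root><Item><ItemCode>12345</ItemCode><ItemPrice>4.5</ItemPrice></Item></Root>"
root = read_price_xml(gzip.compress(sample))
prices = {
    item.findtext("ItemCode"): float(item.findtext("ItemPrice"))
    for item in root.iter("Item")
}
```

Detecting the format from the leading bytes rather than the file extension keeps the helper usable when a download arrives without a reliable name.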

How to seed

cd utilities
pipenv shell
pipenv install
pipenv run dbinit      # run only when the migrations folder does not exist
pipenv run dbmigrate
pipenv run dbupgrade
pipenv run seed        # scrapes the price data
pipenv run start

Data Flow

Scraping: The scraper visits each retailer link (Main.aspx), extracts the downloadable XML/ZIP/GZIP product files, and parses them into structured JSON.
Storage: Saves the data into three tables: Store → chain name, Chain → branch info, Item → product details.
Comparison: The user’s shopping list is matched against items in the DB. The system calculates the basket cost per branch and returns a sorted list (cheapest first, filtered by distance).
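
The comparison step can be sketched roughly as follows: compute the basket cost per branch, drop branches outside a radius, sort by cost, and keep the first five. The `Branch` class, `compare_basket`, the `max_km` radius, and the rule of skipping branches that lack a basket item are all assumptions made for illustration; the source does not say how the project measures distance or handles missing items.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Branch:  # hypothetical in-memory stand-in for the DB rows
    store_name: str
    branch: str
    lat: float
    lon: float
    prices: dict  # item code -> unit price


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinates."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def compare_basket(basket, branches, user_lat, user_lon, max_km=10.0, limit=5):
    """basket maps item code -> quantity; returns the cheapest nearby branches."""
    results = []
    for b in branches:
        distance = haversine_km(user_lat, user_lon, b.lat, b.lon)
        # Assumption: skip distant branches and branches missing any basket item.
        if distance > max_km or any(code not in b.prices for code in basket):
            continue
        cost = sum(b.prices[code] * qty for code, qty in basket.items())
        results.append({
            "storeName": b.store_name,
            "branch": b.branch,
            "distance": round(distance, 2),
            "basketCost": round(cost, 2),
        })
    # Cheapest first; distance breaks ties between equal totals.
    return sorted(results, key=lambda r: (r["basketCost"], r["distance"]))[:limit]
```

Filtering by distance before sorting means a very cheap but far-away branch never displaces a reachable one.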

Example Requests

GET /compare_list returns prices for the predefined item list.

POST /compare_list accepts a custom item list in the JSON request body.

Response example:

  [
    {
      "storeName": "Kingstore",
      "branch": "ראשון לציון",
      "items": [
        {
          "code": "12345",
          "name": "מלפפון פרימיום",
          "price": 4.5,
          "unit": 1,
          "AllowDiscount": true
        }
      ]
    }
  ]
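
A client could total each store's basket from a response of this shape. The sketch below is a minimal illustration that assumes every listed item counts once toward the total; the source does not define how `unit` or `AllowDiscount` affect the price, so both are ignored here.

```python
import json

# The response example above, embedded as a string for a self-contained demo.
response_text = """
[
  {
    "storeName": "Kingstore",
    "branch": "ראשון לציון",
    "items": [
      {"code": "12345", "name": "מלפפון פרימיום", "price": 4.5, "unit": 1, "AllowDiscount": true}
    ]
  }
]
"""

stores = json.loads(response_text)

# Assumption: each listed item counts once; `unit` and `AllowDiscount` are ignored.
totals = {
    (store["storeName"], store["branch"]): sum(item["price"] for item in store["items"])
    for store in stores
}
```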

Prediction Model Documentation

Overview

The Purchase Prediction Model is designed to forecast household shopping needs based on historical transaction data stored in MongoDB. It leverages pandas for data manipulation and implements a scoring system that balances recency and frequency of purchases to determine the likelihood of an item being needed again soon.

Data Source

Database: MongoDB
Collection: purchases
Structure: Each purchase document typically contains fields such as:

  {
    "household_id": "684d3028dab5df14d3285146",
    "shoppingListId": "684d3029dab5df14d328514f",
    "itemName": "Milk",
    "quantity": 2,
    "purchasedAt": "2025-08-05",
    ...
  }

Data is loaded into a pandas DataFrame for preprocessing:

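  import pandas as pd
  # `collection` is the pymongo handle for `purchases`; the projection drops Mongo's _id
  purchases = list(collection.find({}, {"_id": 0}))
  df = pd.DataFrame(purchases)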

Model Logic

The prediction system assigns a likelihood score to each item a household has purchased in the past. This score is computed from two key signals:

Recency Factor

Measures how recently an item was purchased.
Recent purchases indicate a shorter expected time before the next purchase.
Formula:

  recency_days = (today - last_purchase_date).days
  recency_score = max(0, 1 - recency_days / 30)
  boost += 0.6 * recency_score

The score is scaled to the range 0 to 1: an item bought today scores 1, and an item last bought 30 or more days ago scores 0.

Recency contributes 60% of the total score.
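
The recency formula can be wrapped in a small function to check its behaviour at the edges. This is a sketch, not the project's code: the function signature and the `window_days` parameter are assumptions (the source hardcodes 30), and the dates are parsed from the ISO string format shown in the purchase document.

```python
from datetime import date


def recency_score(last_purchase: date, today: date, window_days: int = 30) -> float:
    """Linear decay from 1.0 (bought today) down to 0.0 (window_days or more ago)."""
    recency_days = (today - last_purchase).days
    return max(0.0, 1 - recency_days / window_days)


today = date(2025, 8, 20)  # fixed date so the results are reproducible
same_day = recency_score(date.fromisoformat("2025-08-20"), today)     # 1.0
half_window = recency_score(date.fromisoformat("2025-08-05"), today)  # 0.5 (15 days ago)
stale = recency_score(date.fromisoformat("2025-06-01"), today)        # 0.0 (80 days ago)
```

The `max(0.0, ...)` clamp is what keeps items older than the window from receiving a negative score.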

Frequency Factor

Measures how often an item is purchased historically.
Items purchased frequently are more likely to be repeat buys.
Formula:

  freq_score = purchase_counts.get(item, 0) / max_freq
  boost += 0.4 * freq_score

The score is normalized by dividing the item's purchase count by the count of the most frequently purchased item, which scales it to the range 0 to 1.

Frequency contributes 40% of the total score.
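
A minimal sketch of the normalization, using `collections.Counter` in place of whatever structure the project builds for `purchase_counts`. The sample purchase list is invented for illustration.

```python
from collections import Counter

# Hypothetical purchase history: one entry per purchase of an item.
purchased_items = ["Milk", "Bread", "Milk", "Eggs", "Milk", "Bread"]

purchase_counts = Counter(purchased_items)
max_freq = max(purchase_counts.values())  # count of the most purchased item

# An item that was never purchased (Sugar) falls back to 0 through .get().
freq_scores = {
    item: purchase_counts.get(item, 0) / max_freq
    for item in ["Milk", "Bread", "Eggs", "Sugar"]
}
```

Because the divisor is the household's own maximum, the most purchased item always scores 1. An empty history would make `max()` raise, so real code needs a guard for households with no purchases.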

Final Scoring

The overall likelihood score is a weighted combination:

  score = 0.6 * recency_score + 0.4 * freq_score

Items are then ranked by this score to generate predicted shopping lists.

Example

Suppose a household purchase history contains:

Item	Last Purchase	Purchase Count
Milk	1 day ago	50
Bread	7 days ago	30
Eggs	20 days ago	15
Sugar	60 days ago	5

Milk: Very recent + very frequent → High score

Bread: Moderate recency + high frequency → High-mid score

Eggs: Older purchase but moderately frequent → Mid score

Sugar: Very old + low frequency → Low score

The model would recommend Milk, Bread, and Eggs, but not Sugar.
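
Plugging the table's numbers into the two formulas gives concrete scores. The sketch below does that arithmetic; the 0.3 cutoff used to pick recommendations is an assumption, since the source does not state which threshold the model applies.

```python
WEIGHT_RECENCY, WEIGHT_FREQUENCY = 0.6, 0.4

# (days since last purchase, purchase count) taken from the table above
history = {"Milk": (1, 50), "Bread": (7, 30), "Eggs": (20, 15), "Sugar": (60, 5)}
max_freq = max(count for _, count in history.values())

scores = {}
for item, (recency_days, count) in history.items():
    recency_score = max(0, 1 - recency_days / 30)
    freq_score = count / max_freq
    scores[item] = round(WEIGHT_RECENCY * recency_score + WEIGHT_FREQUENCY * freq_score, 2)

ranked = sorted(scores, key=scores.get, reverse=True)

# Assumption: a 0.3 cutoff; the source does not state the threshold it uses.
recommended = [item for item in ranked if scores[item] >= 0.3]
```

Under these assumptions the scores come out to roughly 0.98 for Milk, 0.70 for Bread, 0.32 for Eggs, and 0.04 for Sugar, which matches the qualitative ranking above.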

Implementation Highlights

MongoDB Connection: Connects with credentials and verifies the connection with a ping.
DataFrame Creation: Converts the MongoDB results into a pandas DataFrame.
Purchase Analysis: Groups by item to calculate purchase counts and last purchase dates.
Scoring System: Applies the recency and frequency boosts.
Prediction Output: Returns the ranked items as predictions.
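
The analysis, scoring, and output steps could look roughly like this in pandas. It is a sketch under stated assumptions, not the project's implementation: the sample rows are invented (using the field names from the purchase document), the MongoDB connection is replaced by an in-memory DataFrame, and a purchase is counted per row rather than per `quantity`.

```python
import pandas as pd

# Hypothetical sample rows standing in for the MongoDB query result.
df = pd.DataFrame([
    {"household_id": "h1", "itemName": "Milk", "quantity": 2, "purchasedAt": "2025-08-05"},
    {"household_id": "h1", "itemName": "Milk", "quantity": 1, "purchasedAt": "2025-07-20"},
    {"household_id": "h1", "itemName": "Bread", "quantity": 1, "purchasedAt": "2025-07-01"},
])
df["purchasedAt"] = pd.to_datetime(df["purchasedAt"])
today = pd.Timestamp("2025-08-10")  # fixed date so the result is reproducible

# Purchase analysis: one row per item with its count and last purchase date.
stats = df.groupby("itemName").agg(
    purchase_count=("purchasedAt", "count"),
    last_purchase=("purchasedAt", "max"),
)

# Scoring: the same recency and frequency formulas, applied column-wise.
days_since = (today - stats["last_purchase"]).dt.days
stats["recency_score"] = (1 - days_since / 30).clip(lower=0)
stats["freq_score"] = stats["purchase_count"] / stats["purchase_count"].max()
stats["score"] = 0.6 * stats["recency_score"] + 0.4 * stats["freq_score"]

# Prediction output: item names ranked from most to least likely.
predictions = stats.sort_values("score", ascending=False).index.tolist()
```

For per-household predictions the same pipeline would run on a frame filtered by `household_id` first.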

Benefits

Lightweight: No heavy ML training needed; purely statistical scoring.
Dynamic: Works in real time with live MongoDB data.
Customizable: Weights (recency vs frequency) can be tuned per household.
