This system combines multiple approaches to classify companies into insurance categories:
- Sentence Transformers for semantic understanding
- Cosine similarity for flexible matching
- TF-IDF for traditional text analysis
Data Loading and Initial Processing
- Load company data from CSV
- Combine multiple text fields (description, business_tags, category, niche)
- Clean and normalize text (lowercase, remove numbers/punctuation)
- Handle missing values (see the sketch below)
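A minimal sketch of this stage, assuming an illustrative file path and the four text columns named above:

```python
import re

import pandas as pd

def clean_text(text: str) -> str:
    """Lowercase, strip numbers and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("companies.csv")  # illustrative path

text_fields = ["description", "business_tags", "category", "niche"]
df[text_fields] = df[text_fields].fillna("")  # handle missing values
df["combined_text"] = (
    df[text_fields].astype(str).agg(" ".join, axis=1).map(clean_text)
)
```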
Feature Extraction
- Generate sentence embeddings using all-MiniLM-L6-v2
- Compute TF-IDF features for n-grams (bigrams and trigrams)
- Select top-k features based on importance
- Normalize features (sample-wise or feature-wise); a sketch follows below
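A sketch of both feature extractors, reusing `df` from the loading step; `max_features` (which keeps the most frequent terms) stands in here for importance-based top-k selection:

```python
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

texts = df["combined_text"].tolist()

# Dense semantic embeddings (384 dimensions for all-MiniLM-L6-v2)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, normalize_embeddings=True)

# Sparse TF-IDF features over bigrams and trigrams, keeping the top-k terms
tfidf = TfidfVectorizer(ngram_range=(2, 3), max_features=5_000)
tfidf_features = tfidf.fit_transform(texts)

# Sample-wise L2 normalization (axis=1); use axis=0 for feature-wise
tfidf_features = normalize(tfidf_features, norm="l2", axis=1)
```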
Model Processing
- Load pre-trained sentence transformer
- Pre-compute taxonomy label embeddings
- Set similarity threshold (0.85); see the setup sketch below
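A sketch of the setup step; the label strings are illustrative stand-ins for the real insurance taxonomy:

```python
from sentence_transformers import SentenceTransformer

SIMILARITY_THRESHOLD = 0.85

model = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative labels; in practice these come from the insurance taxonomy
taxonomy_labels = [
    "commercial property insurance",
    "workers compensation insurance",
    "cyber liability insurance",
]

# Encoded once up front so classification never re-encodes the labels
label_embeddings = model.encode(taxonomy_labels, normalize_embeddings=True)
```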
Classification
- Compute cosine similarity between company and taxonomy embeddings
- Apply threshold to filter matches
- Use fallback mechanism (top-3) if no matches meet threshold
- Assign multiple labels when appropriate (see the sketch below)
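A minimal sketch of the matching logic, reusing `model`, `taxonomy_labels`, `label_embeddings`, and `SIMILARITY_THRESHOLD` from the setup above:

```python
import numpy as np

def classify(company_text: str, top_k: int = 3) -> list[str]:
    """All labels above the threshold, or the top-k closest as a fallback."""
    emb = model.encode([company_text], normalize_embeddings=True)
    # With L2-normalized vectors, the dot product equals cosine similarity
    sims = (emb @ label_embeddings.T).ravel()

    matches = [taxonomy_labels[i] for i in np.where(sims >= SIMILARITY_THRESHOLD)[0]]
    if matches:
        return matches  # multi-label assignment
    # Fallback: nothing cleared the threshold, return the top-k most similar
    return [taxonomy_labels[i] for i in np.argsort(sims)[::-1][:top_k]]
```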
Output and Evaluation
- Save classifications to CSV
- Track processing progress (sketched below)
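A sketch of the output step, assuming the `classify` function above; tqdm's pandas integration supplies the progress bar:

```python
from tqdm import tqdm

tqdm.pandas(desc="Classifying companies")

df["insurance_labels"] = df["combined_text"].progress_apply(
    lambda text: "; ".join(classify(text))
)
df.to_csv("classified_companies.csv", index=False)  # illustrative output path
```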
Flexible Classification
- Multi-label support for complex business categorization
- Threshold-based matching with fallback mechanism
- Handles ambiguous cases through top-k matching
Robust Text Processing
- Combines multiple text fields (description, tags, category, niche)
- Handles missing data gracefully
- Normalization and cleaning for consistent input
Current Limitations
- Relies heavily on pre-trained embeddings
- No active learning or feedback loop
- Fixed similarity threshold
- Processes companies sequentially, one at a time
- Memory usage scales with dataset size
- Pre-computed taxonomy embeddings reduce computation time, though each company is still encoded individually
Parallel Processing
- Implement a Map-Reduce-style pipeline
- Distributed computing with Dask/Spark (sketched below)
- Batch processing optimization
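One possible shape for this, sketched with Dask (block size and paths are illustrative, and it assumes the cleaned combined_text column has already been written to the CSV; Spark would follow the same split). Each partition is classified independently (map) and the results are concatenated (reduce):

```python
import dask.dataframe as dd

# Map: each CSV block becomes an independent, parallelizable task
ddf = dd.read_csv("companies.csv", blocksize="16MB")

def classify_partition(pdf):
    pdf = pdf.copy()
    pdf["insurance_labels"] = pdf["combined_text"].map(
        lambda text: "; ".join(classify(text))
    )
    return pdf

# Reduce: compute() runs the partitions in parallel and concatenates them
result = ddf.map_partitions(classify_partition).compute()
result.to_csv("classified_companies.csv", index=False)
```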
Model Enhancements
- Active learning for continuous improvement
- Dynamic threshold adjustment
- Confidence scoring for predictions (see the sketch after this list)
- Industry-specific fine-tuning
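A hedged sketch combining two of these ideas: the best cosine score doubles as a confidence value, and the threshold adapts to each company's score distribution rather than staying fixed at 0.85 (the 90th percentile and the 0.5 floor are illustrative, untuned choices). It reuses `model`, `taxonomy_labels`, and `label_embeddings` from the setup sketch:

```python
import numpy as np

def classify_with_confidence(company_text: str) -> tuple[list[str], float]:
    emb = model.encode([company_text], normalize_embeddings=True)
    sims = (emb @ label_embeddings.T).ravel()

    confidence = float(sims.max())  # best-match score as a confidence proxy
    # Dynamic threshold: adapt to this company's score distribution,
    # with a floor so uniformly weak matches are not over-assigned
    threshold = max(float(np.percentile(sims, 90)), 0.5)

    labels = [taxonomy_labels[i] for i in np.where(sims >= threshold)[0]]
    return labels, confidence
```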
Why Sentence Transformers?
- Considered alternatives:
  - Pure BERT: Too resource-intensive for large-scale processing
  - Traditional ML (SVM, Random Forest): Limited semantic understanding
  - CNNs: Higher parameter count, less efficient
- Chose Sentence Transformers because:
  - Strong semantic understanding from a compact model (all-MiniLM-L6-v2)
  - Label embeddings can be pre-computed once and compared cheaply
Why Cosine Similarity?
- Alternatives considered:
  - Euclidean distance: Less effective for text similarity, since it is sensitive to vector magnitude rather than direction
- Chose cosine similarity because:
  - Proven effectiveness for text similarity
  - Computationally efficient (see the example below)
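A tiny numpy illustration of the magnitude point: vectors pointing the same way get maximal cosine similarity regardless of their lengths, while their Euclidean distance can be arbitrarily large:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = 10 * u  # same direction, ten times the magnitude

cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
euclidean = np.linalg.norm(u - v)

print(cosine)     # 1.0   -> identical orientation, maximal similarity
print(euclidean)  # ~33.7 -> large distance despite identical direction
```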
Text Processing Strategy
- Considered alternatives:
  - More aggressive cleaning: Risk of losing important information
  - Less cleaning: More noise in embeddings
  - Custom cleaning rules: Hard to maintain
- Chose current approach because:
  - Preserves important business terminology
  - Removes common noise (numbers, punctuation)
  - Maintains readability for debugging
Memory Management
- Considered alternatives:
  - Process all data at once: Memory constraints
  - Very small chunks: Too much overhead
  - Complex caching: Implementation complexity
- Chose current approach because:
  - Pre-computing taxonomy embeddings reduces runtime memory
  - Sequential processing with progress tracking
  - Clear memory usage patterns (see the chunking sketch below)
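A minimal sketch of the resulting pattern, reusing `model`, `taxonomy_labels`, and `label_embeddings` from above; pandas' `chunksize` keeps only one block of raw text in memory, and each chunk is appended to the output as it finishes (chunk size is illustrative):

```python
import pandas as pd

CHUNK_SIZE = 1_000  # illustrative; tune to available memory

for i, chunk in enumerate(pd.read_csv("companies.csv", chunksize=CHUNK_SIZE)):
    texts = chunk["combined_text"].fillna("").tolist()
    emb = model.encode(texts, batch_size=64, normalize_embeddings=True)
    sims = emb @ label_embeddings.T
    chunk["insurance_labels"] = [
        taxonomy_labels[j] for j in sims.argmax(axis=1)  # top-1 for brevity
    ]
    # Append each chunk so the full result never sits in memory at once
    chunk.to_csv("classified_companies.csv", mode="a", header=(i == 0), index=False)
```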