Visual Improvements:
- 🎨 Color-Coded Badges: Source, year, citations, and open access status
  - Citations: Red (100+), Warning (50+), Info (10+), Gray (<10)
  - Sources: Blue (Semantic Scholar), Green (arXiv), Info (WoS), Warning (OpenAlex)
- ✨ Better Card Design: Shadow effects, improved spacing, icons
- 🔍 Improved Metadata Display: Icons for authors, DOI links, better typography
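The citation badge thresholds listed above amount to a simple lookup. A minimal sketch, assuming a hypothetical helper named `citation_badge_color` (the function name and the lowercase return labels are illustrative and not taken from the HARVEST source):

```python
def citation_badge_color(citations: int) -> str:
    """Map a citation count to the badge color named in the list above."""
    if citations >= 100:
        return "red"
    if citations >= 50:
        return "warning"
    if citations >= 10:
        return "info"
    return "gray"
```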
New Features:
- 📊 Sorting Controls:
  - Relevance (default - semantic similarity)
  - Citations (high to low)
  - Year (newest/oldest first)
  - Instant client-side sorting without re-running search
- 🎯 Source Filtering:
  - Filter results by specific sources
  - Instant client-side filtering
  - No need to re-run search
- 💡 Helpful Tooltips:
  - Source selection tooltip explains each database
  - Pipeline controls tooltip explains each step
  - Recommended settings clearly marked in UI
Search Quality Improvements:
- ✓ Enhanced Deduplication: Fuzzy title matching (85% similarity)
  - Catches formatting variations (e.g., "Machine Learning" vs "machine learning")
  - Removes punctuation differences
  - Case-insensitive matching with word-level comparison
- ✓ Query Expansion: Now disabled by default for better precision
  - Modern search engines (Semantic Scholar, OpenAlex) handle semantic similarity internally
  - Can be re-enabled if needed via Pipeline Controls
  - Reduces search noise and improves relevance
- 🔍 Multi-Source Search
  - Search Semantic Scholar, arXiv, and Web of Science
  - Select which sources to use for each search
  - Get results from multiple databases at once
- 📚 Build on Previous Searches
  - Enable cumulative mode to add results over time
  - Smart deduplication across all searches
  - Build comprehensive literature reviews iteratively
- ⚙️ Flexible Configuration
  - See which sources are available
  - Optional Web of Science integration
  - Works great with just free sources (Semantic Scholar + arXiv)
- Login via Admin tab
- Go to Literature Search
- Enter your query: "AI in drug discovery"
- Click "Search Papers"
- Done! ✓
Before searching, check/uncheck sources:
- ☑ Semantic Scholar (recommended)
- ☑ arXiv (good for recent research)
- ☐ Web of Science (optional, requires API key)
To create a comprehensive literature set:
- First search: "machine learning"
  - Get 10 results
- Check "Build on previous searches"
- Second search: "deep learning"
  - Get 10 new results
  - Combined with previous = up to 20 unique papers (duplicates are removed)
- Third search: "neural networks"
  - Get 10 more results
  - Combined = up to 30 unique papers total
- Click "Clear Session" when done
  - Start fresh for a new topic
Only if you want premium citation data:
- Get API key from https://developer.clarivate.com/
- Set environment variable:
export WOS_API_KEY="your-key-here"
- Restart HARVEST
- Web of Science checkbox will be available!
Note: Uses Web of Science Expanded API directly - no additional packages needed!
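Whether the Web of Science checkbox is offered depends on the `WOS_API_KEY` environment variable being set. A minimal sketch of such a check, assuming a hypothetical helper `wos_available` (the actual check in HARVEST may differ):

```python
import os


def wos_available(env=None) -> bool:
    """Return True when a non-empty WOS_API_KEY is present.

    `env` defaults to os.environ; passing a plain dict makes the check
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    return bool(env.get("WOS_API_KEY", "").strip())
```

Because the variable is read from the process environment, it has to be exported before HARVEST starts, which is why a restart is needed after setting it.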
When using Web of Science, you can use advanced query syntax for precise searches:
Simple queries (auto-converted):
- `"machine learning"` → searches topic field
Advanced queries (use as-is):
- `AB=(genomic* OR transcriptom*)` → searches abstracts
- `TI=(CRISPR) AND PY=(2020-2024)` → searches titles and year range
- `AU=(Smith J*) AND TS=(climate)` → searches authors and topic
Common field tags:
- `TS=` Topic (title + abstract + keywords)
- `TI=` Title
- `AB=` Abstract
- `AU=` Author
- `PY=` Year (use ranges like `PY=(2020-2024)`)
- `SO=` Journal name
- `DO=` DOI
Operators: AND, OR, NOT
Wildcards: * (many chars), ? (one char)
Example: AB=(longevity* OR reproduction*) AND PY=(2015-2024)
See WoS Advanced Search Guide for more details.
Every paper includes:
- Title
- Authors (up to 3)
- Year
- DOI (clickable link)
- Abstract snippet
- Citation count
- Source (which database it came from)
Plus:
- Smart ranking by semantic relevance
- Duplicate removal across sources
- Citation-weighted sorting
- Export to projects
- Use Semantic Scholar only (fastest)
- Single specific query
- Enable all available sources
- Use cumulative mode
- Try multiple related queries
- Export everything to one project
- Use arXiv only
- Focus on preprints
- Sort by Year (newest first) to check dates
- Include Semantic Scholar
- Enable Web of Science if available
- Note citation counts
"No sources selected"
- Check at least one source box
Web of Science not available
- It's optional! Use free sources instead
- Or set up API key (see above)
No results found
- Try broader terms
- Check more sources
- Verify spelling
Session not working
- Make sure you're logged in
- Check "Build on previous searches" box
- Try "Clear Session" and restart
See full documentation: docs/SEMANTIC_SEARCH.md
Includes:
- Detailed usage guide
- Technical details
- Advanced features
- API comparison
- Best practices
- Check docs/SEMANTIC_SEARCH.md
- Review main README.md
- Open a GitHub issue
Happy searching! 🎉
The HARVEST application includes an enhanced semantic search capability that allows users to discover relevant academic papers from multiple sources with advanced features like API selection, cumulative session building, and intelligent deduplication.
The semantic search system can query multiple academic databases simultaneously:
- Semantic Scholar - Open academic paper search from AI2
  - Largest open corpus (200M+ papers)
  - Includes citation counts and impact metrics
  - Free to use, no API key required
  - Advanced features:
    - Paper recommendations based on similarity
    - Bulk paper retrieval by IDs
    - Filtering by year, citations, open access
    - Venue and publication type filtering
- arXiv - Preprint repository
  - Physics, mathematics, computer science, etc.
  - Free and open access
  - Latest research before peer review
- Web of Science (Optional)
  - Comprehensive citation database
  - Requires API key from Clarivate
  - Premium content and citation analytics
Users can select which sources to search, allowing for:
- Targeted searches - Focus on specific databases
- Comprehensive searches - Query all available sources
- Cost management - Avoid paid APIs when not needed
The "Build on previous searches" feature enables:
- Iterative refinement - Add new results to existing ones
- Topic exploration - Expand search scope over time
- Comprehensive reviews - Build complete literature sets
- Smart deduplication - Automatic removal of duplicates across sessions
Results are reranked using semantic similarity:
- Embeds query and abstracts using sentence transformers
- Computes cosine similarity scores
- Returns most semantically relevant papers
- Preserves highly-cited papers in ranking
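The reranking steps above can be illustrated with plain cosine similarity. This is a sketch under stated assumptions: the embeddings are taken as already computed (HARVEST uses a sentence-transformers model for that step), the `embedding` key and the `rerank` helper are hypothetical, and the citation weighting mentioned above is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rerank(query_vec, papers, top_k=10):
    """Return the top_k papers ordered by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, p["embedding"]), p) for p in papers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [paper for _, paper in scored[:top_k]]
```

With toy two-dimensional vectors, a paper whose embedding points in the same direction as the query ranks first and an orthogonal one ranks last.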
- Navigate to Literature Search Tab
  - Login via Admin tab first (authentication required)
- Select Search Sources
  - Check desired sources: Semantic Scholar, arXiv, Web of Science
  - Default: Semantic Scholar + arXiv
- Enter Search Query
  - Use natural language: "AI in drug discovery"
  - Or specific terms: "CRISPR gene editing ethics"
- Click "Search Papers"
  - Results appear with execution pipeline details
  - Papers are ranked by semantic relevance
After getting your results, you can refine them without re-running the search:
- Sort Results
  - Click the "Sort By" dropdown
  - Choose from:
    - Relevance (default - based on semantic similarity to your query)
    - Citations (high to low - shows most influential papers first)
    - Year (newest first - shows latest research)
    - Year (oldest first - shows foundational work)
- Filter by Source
  - Click the "Filter by Source" dropdown
  - Select a specific source to view only those results
  - Choose "All Sources" to see everything again
- Visual Indicators
  - Source badges show where each paper came from (color-coded)
  - Citation badges show impact (color-coded by count)
  - Year badges show publication date
  - Open access badges show freely available papers (green)
Example Workflow:
- Search for "machine learning in healthcare"
- Get 50 results from multiple sources
- Sort by "Citations (high to low)" to see most influential papers
- Filter by "arXiv" to see only preprints
- Sort by "Year (newest first)" to find latest arXiv papers
This is much faster than running multiple searches!
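Sorting and filtering work on results already in memory, which is why no new search is needed. The logic can be sketched in Python as follows; the helper names and the `score` and `year` keys are assumptions for illustration, while `citations` and `source` match the fields shown in the programmatic example later in this document.

```python
def sort_papers(papers, mode="relevance"):
    """Return a new list sorted by one of the 'Sort By' options."""
    options = {
        "relevance": (lambda p: p.get("score", 0.0), True),
        "citations": (lambda p: p.get("citations") or 0, True),
        "year_newest": (lambda p: p.get("year") or 0, True),
        "year_oldest": (lambda p: p.get("year") or 0, False),
    }
    key, descending = options[mode]
    return sorted(papers, key=key, reverse=descending)


def filter_by_source(papers, source=None):
    """Keep papers from one source; None plays the role of 'All Sources'."""
    if source is None:
        return list(papers)
    return [p for p in papers if p.get("source") == source]
```

Chaining the two reproduces the workflow above, for example `sort_papers(filter_by_source(papers, "arxiv"), "year_newest")`.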
To build on previous searches:
- Perform Initial Search
  - Enter query and search normally
- Enable "Build on previous searches"
  - Check the cumulative search option
- Enter New Query
  - Related or refined search terms
- Search Again
  - New results are added to previous ones
  - Duplicates automatically removed
  - Results re-ranked together
- Clear Session
  - Click "Clear Session" to start fresh
  - Session automatically resets on logout
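Cumulative mode boils down to merging the new result list into the stored one and dropping duplicates. A simplified sketch, assuming a hypothetical `merge_sessions` helper that matches on exact DOI (falling back to the exact title); the real pipeline also applies fuzzy title matching, which is omitted here:

```python
def merge_sessions(previous, new):
    """Merge two paper lists, keeping the higher-cited copy of each duplicate."""
    merged = {}
    for paper in list(previous) + list(new):
        key = (paper.get("doi") or paper["title"]).strip().lower()
        kept = merged.get(key)
        if kept is None or (paper.get("citations") or 0) > (kept.get("citations") or 0):
            merged[key] = paper
    return list(merged.values())
```

The merged list would then be passed through reranking again so that old and new papers are ordered together.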
The search system allows you to control how many results are fetched from each source and displayed:
Click "Advanced: Results per Source" to configure individual source limits:
- Semantic Scholar: 1-100 results (default: 100)
  - API maximum: 100 per request
  - Increased from original 40 to fetch comprehensive results
- arXiv: 1-100 results (default: 50)
  - API supports more than 100 results per request
  - Increased from original 10 to capture more preprints
- Web of Science: 1-100 results (default: 100)
  - API maximum: 100 per request
  - Increased from original 20 to maximize coverage
- OpenAlex: 1-200 results (default: 200)
  - API maximum: 200 per page
  - Increased from original 20 to leverage full API capacity
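The ranges and defaults above can be captured in a small table with a clamping helper. This is an illustrative sketch: `SOURCE_LIMITS` and `resolve_limit` are hypothetical names, and the `web_of_science` and `openalex` keys are assumed by analogy with the `semantic_scholar` and `arxiv` source names used elsewhere in this document.

```python
# (minimum, maximum, default) per source, as listed above
SOURCE_LIMITS = {
    "semantic_scholar": (1, 100, 100),
    "arxiv": (1, 100, 50),
    "web_of_science": (1, 100, 100),
    "openalex": (1, 200, 200),
}


def resolve_limit(source, requested=None):
    """Return the default when nothing is requested, else clamp to the range."""
    low, high, default = SOURCE_LIMITS[source]
    if requested is None:
        return default
    return max(low, min(high, requested))
```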
Benefits of Higher Limits:
- More comprehensive results - Captures a wider range of relevant papers
- Better deduplication - More papers to compare across sources
- Improved ranking - More candidates for semantic reranking
- Fuller literature coverage - Especially important for systematic reviews
Trade-offs:
- Higher limits increase search time (typically 5-15 seconds vs 2-5 seconds)
- More results to review (use semantic reranking to focus on top papers)
After fetching, deduplication, and reranking, you can control how many papers to display:
- Range: 1-100 results
- Default: 20 results (increased from original 10)
- Purpose: Shows the most relevant papers after semantic reranking
Recommended Settings:
- Quick overview: 10-20 results
- Comprehensive review: 50-100 results
- Literature mapping: Use "Build on previous searches" with 20-30 results per query
Example Workflow:
- Set Semantic Scholar to 100, OpenAlex to 200
- Enable all pipeline features (expansion, deduplication, reranking)
- Set display count to 30
- Run search
- Review the top 30 semantically ranked results from up to 300 fetched
Different sources expect different query formats:
- Semantic Scholar: Natural language queries
  - Examples: "AI in drug discovery", "CRISPR gene editing"
- arXiv: Natural language queries
  - Examples: "quantum computing", "deep learning"
- OpenAlex: Natural language queries
  - Examples: "climate change modeling", "protein folding"
- Web of Science: Advanced syntax or natural language
  - Natural language is automatically converted to `TS=(query)` format
  - For advanced syntax, see Advanced Search Syntax section below

The source checkboxes indicate the expected query type in parentheses.
- Select Papers
  - Check boxes next to desired papers
  - Use "Select All" / "Deselect All" buttons
- Export Selected DOIs
  - Create new project with selections
  - Add to existing project
  - Copy to clipboard
The semantic search executes in three stages:
- Stage 1, query expansion (only when enabled; it is off by default):
  - Expands query with common synonyms
  - Creates up to 3 query variations
  - Broadens search scope automatically
- Stage 2, multi-source search:
  - Queries selected academic databases
  - Retrieves papers with metadata
  - Deduplicates across sources
  - Merges with session history if enabled
- Stage 3, semantic reranking:
  - Encodes query and abstracts semantically
  - Calculates similarity scores
  - Returns top-k most relevant papers
  - Preserves citation-weighted ranking
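The query expansion step can be sketched as a word-level synonym substitution. The helper below is hypothetical: `expand_query` is not a documented HARVEST function, and treating the cap of 3 as including the original query is an assumption.

```python
def expand_query(query, synonym_map, max_variations=3):
    """Return the original query plus synonym-substituted variants."""
    words = query.lower().split()
    variations = [query]
    for index, word in enumerate(words):
        for synonym in synonym_map.get(word, []):
            if len(variations) >= max_variations:
                return variations
            candidate = " ".join(words[:index] + [synonym] + words[index + 1:])
            if candidate not in variations:
                variations.append(candidate)
    return variations
```

Each variation would be sent to the selected sources, and the combined results deduplicated before reranking.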
The Semantic Scholar integration follows best practices from the S2 API documentation and webinar examples for optimal performance and reliability.
Improved Features:
- Selective field requests - Only requests needed fields to reduce API load
- Filtering options - Filter by year range, minimum citations, open access status
- Pagination support - Handles large result sets efficiently
- Retry logic with exponential backoff and jitter - Automatically retries failed requests with randomized delays
- Rate limit awareness - Respects 429 (Too Many Requests) responses
- Transient error handling - Automatically retries on 502, 503, 504 server errors
- Better error handling - Graceful degradation on API failures with specific error logging
- Metadata enrichment - Includes venue, publication type, PDF availability
Reliability Features (from S2 Webinar Best Practices):
- 6 automatic retries with exponential backoff (2.0s factor)
- Jitter randomization (0.5s) prevents thundering herd problems
- Retry-After header respect for server-directed backoff
- Graceful degradation returns empty results instead of crashing
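The retry behaviour described above can be expressed as two small pure functions. This is a sketch, not the HARVEST implementation: the helper names are hypothetical, and reading "2.0s factor" as a delay of `factor * 2 ** attempt` (with `attempt` starting at 0) is an assumption.

```python
import random

RETRYABLE_STATUS = {429, 502, 503, 504}  # Status codes listed above


def should_retry(status_code, attempt, max_retries=6):
    """Retry only listed status codes, and only while attempts remain."""
    return status_code in RETRYABLE_STATUS and attempt < max_retries


def backoff_delay(attempt, factor=2.0, jitter=0.5, retry_after=None,
                  rng=random.random):
    """Seconds to wait before the next attempt.

    A server-supplied Retry-After value takes precedence; otherwise the
    delay grows exponentially, plus up to `jitter` seconds of random noise.
    """
    if retry_after is not None:
        return float(retry_after)
    return factor * (2 ** attempt) + rng() * jitter
```

The random component spreads out clients that failed at the same moment, so they do not all retry in lockstep.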
Year Range Filtering:

    # Search for recent papers only
    result = search_semantic_scholar("machine learning", year_range="2023-2024")
    # Search for papers from a specific year
    result = search_semantic_scholar("CRISPR", year_range="2023")

Citation Filtering:

    # Get only highly-cited papers
    result = search_semantic_scholar("climate change", min_citations=100)

Get paper recommendations based on similarity to a known paper using Semantic Scholar's recommendation algorithm. This finds papers that:
- Share similar topics and methodology
- Are cited by or cite similar papers
- Have overlapping author networks
Usage:

    # Get recommendations based on a paper DOI
    recommendations = get_recommended_papers_s2(
        paper_id="10.1038/nature14539",
        limit=20,
        pool='recent'  # or 'all-cs' for all CS papers
    )

Recommendation Pools:
- `'recent'` - Papers from the last 2 years (default, good for current research)
- `'all-cs'` - All computer science papers (good for comprehensive reviews)
Efficiently retrieve multiple papers by their IDs (DOIs or S2 paper IDs):

    # Get specific papers by DOI
    paper_ids = [
        "10.1038/nature14539",
        "arXiv:1706.03762",
        "10.1126/science.aaa1234"
    ]
    papers = get_papers_by_ids_s2(paper_ids)

Use Cases:
- Building literature sets from citation lists
- Following up on references from a key paper
- Validating DOIs from external sources
The enhanced implementation detects open access papers and extracts PDF URLs when available:

    # Search results now include:
    {
        'title': 'Paper Title',
        'is_open_access': True,
        'pdf_url': 'https://arxiv.org/pdf/1234.5678.pdf'
    }

- Obtain API Key
  - Register at https://developer.clarivate.com/
  - Subscribe to Web of Science Expanded API
  - Get your API key
- Set Environment Variable
  - `export WOS_API_KEY="your-api-key-here"`
- Restart Application
  - Web of Science will appear in source options
  - Green checkmark indicates availability
Note: The integration uses the Web of Science Expanded API directly via REST calls. No additional Python packages are required beyond requests (already included).
Technical Note: The implementation includes the viewField='fullRecord' parameter in all API requests to ensure abstracts are returned. Without this parameter, the Web of Science API defaults to 'summary' view which excludes abstracts. This is a critical requirement for proper abstract retrieval.
- Endpoint: `https://wos-api.clarivate.com/api/wos`
- Database: WOS (Web of Science Core Collection)
- Implementation: Based on Clarivate's official examples
- Max results per query: 100
- Required Parameters: `viewField='fullRecord'` to retrieve abstracts
Web of Science supports powerful advanced search queries using field tags and boolean operators. When you use WoS as your search source, you can use either simple queries or advanced syntax.
Simple queries are automatically converted to Topic Search format:
- `"machine learning"` → `TS=(machine learning)`
- `"climate change"` → `TS=(climate change)`
Use field tags to search specific fields:
Common Field Tags:
- `TS=` - Topic (searches title, abstract, author keywords, and Keywords Plus®)
- `TI=` - Title
- `AB=` - Abstract
- `AU=` - Author name
- `PY=` - Publication year
- `SO=` - Publication title (journal/book)
- `DO=` - DOI
- `UT=` - Accession number (WoS ID)
- `PMID=` - PubMed ID
All Available Field Tags:
TS=Topic TI=Title AB=Abstract
AU=Author AI=Author ID AK=Author Keywords
GP=Group Author ED=Editor KP=Keywords Plus
SO=Publication DO=DOI PY=Year Published
CF=Conference AD=Address OG=Organization
OO=Organization SG=Suborganization SA=Street Address
CI=City PS=Province/State CU=Country
ZP=Zip/Postal Code FO=Funding Agency FG=Grant Number
FD=Funding Details FT=Funding Text SU=Research Area
WC=WoS Categories IS=ISSN/ISBN UT=Accession Number
PMID=PubMed ID DOP=Pub Date LD=Index Date
PUBL=Publisher ALL=All Fields FPY=Final Pub Year
EAY=Early Access SDG=SDG Goals TMAC=Citation Topic
Boolean Operators:
- `AND` - Both terms must be present
- `OR` - Either term can be present
- `NOT` - Exclude terms
Wildcards:
- `*` - Multiple characters (e.g., `genom*` matches genomic, genomics, genome)
- `?` - Single character (e.g., `wom?n` matches woman, women)
Basic searches:
AB=(genomic* OR transcriptom*)
TI=(machine learning)
AU=(Smith J*)
Complex searches with boolean operators:
AB=(genomic* OR transcriptom*) AND PY=(2020-2024)
TS=(CRISPR) AND AU=(Doudna) NOT TI=(review)
(TI=(climate change) OR AB=(global warming)) AND PY=(2015-2024)
Year ranges:
PY=(2020-2024) # Papers from 2020 to 2024
PY=(2023) # Papers from 2023 only
Combining multiple fields:
AB=(longevity* OR reproduction*) AND AU=(Tribolium) AND PY=(2015-2024)
TS=(artificial intelligence) AND SO=(Nature) AND PY=(2020-2024)
When using Web of Science ONLY:
- Advanced queries (with field tags) skip query expansion and semantic reranking
- Results are returned in WoS relevance order
- This preserves the precision of your advanced query
When using Web of Science with other sources:
- Simple queries undergo semantic processing across all sources
- Advanced queries are used as-is for WoS, expanded for other sources
- Results are deduplicated and semantically reranked
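The split between simple and advanced queries described above needs some way to recognise field tags. One possible sketch uses a regular expression; all three helper names are hypothetical, the pattern (2 to 4 uppercase letters followed by `=(`) is an assumption derived from the tag list above, and `web_of_science` is an assumed source key.

```python
import re

# Matches a field tag such as TS=( or PMID=( anywhere in the query
_FIELD_TAG = re.compile(r"\b[A-Z]{2,4}\s*=\s*\(")


def is_advanced_wos_query(query):
    """True when the query already contains a WoS field tag."""
    return bool(_FIELD_TAG.search(query))


def to_wos_query(query):
    """Wrap simple queries in TS=(...); pass advanced queries through."""
    query = query.strip()
    if is_advanced_wos_query(query):
        return query
    return f"TS=({query})"


def skips_semantic_processing(query, sources):
    """True for an advanced query sent to Web of Science only."""
    return is_advanced_wos_query(query) and list(sources) == ["web_of_science"]
```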
Best Practices:
- Use advanced syntax for precise, reproducible searches
- Use wildcards (`*`) for word variations
- Use field tags to narrow your search scope
- Test your query at Web of Science Advanced Search first
- Enclose phrases in quotes: `AB=("machine learning")`
- Web of Science searches may have rate limits
- Check your API plan for usage quotas
- Results include citation metrics
- Access to paywalled content metadata
- Advanced queries provide more control than natural language search
Resources:

    # Optional: Web of Science Expanded API key
    export WOS_API_KEY="your-key-here"
    # Required for PDF downloads (see main docs)
    export UNPAYWALL_EMAIL="your@email.com"

In literature_search.py:

    # Customize query expansion synonyms (not used for WoS advanced queries)
    synonym_map = {
        'ai': ['artificial intelligence', 'machine learning', 'deep learning'],
        # Add your domain-specific terms
    }
    # Adjust result limits (values match the defaults documented above)
    semantic_scholar_limit = 100  # Default
    arxiv_limit = 50              # Default
    wos_limit = 100               # Default
    top_k = 20                    # Results to display

Symptom: Source shows ✗ unavailable
Solutions:
- Check if package is installed:
pip install semanticscholar arxiv
- For Web of Science:
  - Verify API key is set
  - Confirm the `requests` package is installed (no WoS-specific client library is needed)
  - Test API key validity
Possible causes:
- Query too specific - try broader terms
- Selected sources don't contain matches
- Network connectivity issues
- API rate limits reached
Solutions:
- Expand search terms
- Enable more sources
- Try again after brief wait
- Check internet connection
If seeing duplicates:
- System automatically deduplicates by DOI and title
- Some papers may appear similar but are different
- Check DOI to verify uniqueness
If cumulative search not working:
- Ensure "Build on previous searches" is checked
- Verify you're logged in (session persists)
- Try "Clear Session" and restart
- Check browser console for errors
| Feature | Semantic Scholar | arXiv | Web of Science |
|---|---|---|---|
| Cost | Free | Free | Paid API key |
| Coverage | Broad (all fields) | Physics, Math, CS | Comprehensive |
| Citations | Yes | No | Yes |
| Abstracts | Yes | Yes | Yes |
| Full Text | Links only | Free PDFs | Metadata only |
| Updates | Real-time | Daily | Real-time |
| Rate Limits | Generous | Generous | Plan-dependent |
- Start with Semantic Scholar + arXiv
- Enable cumulative search
- Try multiple query variations
- Refine with specific terms
- Export complete set to project
- Use arXiv only
- Search for recent terms
- Sort by Year (newest first)
- Follow up on preprints
- Include Semantic Scholar
- Enable Web of Science if available
- Note citation counts
- Track influential papers
- Enable cumulative search
- Start broad, then narrow
- Build session over time
- Review all unique papers
Papers are deduplicated using an enhanced fuzzy matching approach:
- Primary key: DOI exact match
- Secondary key: Fuzzy title similarity (Jaccard similarity with 85% threshold)
  - Titles are normalized: lowercase, punctuation removed, common prefixes stripped
  - Catches near-duplicates that differ slightly in formatting
  - Example: "Machine Learning in Drug Discovery" matches "machine learning in drug discovery"
- Priority: Higher citation count preferred when duplicates found
- Scope: Across all sources and session history
Title Normalization Process:
- Convert to lowercase
- Remove common prefixes ("the", "a", "an")
- Remove punctuation
- Normalize whitespace
- Calculate word-level Jaccard similarity
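The normalization steps above can be sketched as follows. This is an illustration rather than the HARVEST source: `normalize_title` and `titles_are_similar` are stand-in names, and the exact order of operations in the real code may differ.

```python
import string

_PREFIXES = ("the ", "a ", "an ")
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_title(title):
    """Lowercase, drop one leading article, strip punctuation, tidy spaces."""
    text = title.lower().strip()
    for prefix in _PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    text = text.translate(_STRIP_PUNCTUATION)
    return " ".join(text.split())


def titles_are_similar(a, b, threshold=0.85):
    """Compare titles by word-level Jaccard similarity after normalization."""
    words_a = set(normalize_title(a).split())
    words_b = set(normalize_title(b).split())
    if not words_a or not words_b:
        return False
    jaccard = len(words_a & words_b) / len(words_a | words_b)
    return jaccard >= threshold
```

Jaccard similarity is the size of the shared word set divided by the size of the combined word set, so two five-word titles that differ in a single word score 4/6 (about 0.67) and stay below the 0.85 threshold.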
This improved deduplication catches:
- Case variations
- Punctuation differences
- Minor formatting changes
- Whitespace inconsistencies
Without fuzzy matching, papers like these would be treated as separate:
- "Machine Learning in Drug Discovery"
- "machine learning in drug discovery"
- "Machine learning in drug discovery."
Configuration:

    # Default similarity threshold: 85%
    # Can be adjusted in _titles_are_similar() function
    similarity_threshold = 0.85

Uses sentence-transformers library:
- Model: `all-MiniLM-L6-v2` (384 dimensions)
- Metric: Cosine similarity
- Encoding: Query + paper abstracts
- Ranking: Similarity score × citation weight
Session papers stored in browser:
- Storage type: Session storage (temporary)
- Persistence: Until logout or tab close
- Size limit: Browser-dependent (~5MB typical)
- Privacy: Client-side only, not uploaded
For batch processing or automation:

    import literature_search

    # Search with specific sources
    result = literature_search.search_papers(
        query="machine learning in genomics",
        top_k=20,
        sources=['semantic_scholar', 'arxiv'],
        previous_papers=None  # Or list of previous papers
    )

    # Access results
    if result['success']:
        for paper in result['papers']:
            print(f"Title: {paper['title']}")
            print(f"DOI: {paper['doi']}")
            print(f"Source: {paper['source']}")
            print(f"Citations: {paper['citations']}")
            print()

Extend query expansion for your domain:

    # In literature_search.py
    synonym_map = {
        'crispr': ['cas9', 'gene editing', 'genome editing'],
        'protein': ['peptide', 'polypeptide', 'amino acid sequence'],
        # Add your terms
    }

Planned features:
- PubMed integration
- Google Scholar support
- Citation network visualization
- Saved search queries
- Email alerts for new papers
- Export to BibTeX/RIS
- Advanced filtering (year, journal, etc.)
For issues or questions:
- Check this documentation
- Review main README.md
- Open GitHub issue
- Contact repository maintainers