final result

  • party
  • party relationship, with
    • entity as party (org, company, person)
    • topic as party
    • url and archive link as links
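One possible reading of the list above, sketched as plain data classes. The names `PartyKind`, `Party`, and `PartyRelationship`, and the choice to model a relationship as a pair of parties plus two links, are assumptions made for illustration; the notes do not fix a schema.

```python
from dataclasses import dataclass
from enum import Enum


class PartyKind(Enum):
    # Entity kinds named in the notes, plus topic as its own kind of party.
    ORG = "org"
    COMPANY = "company"
    PERSON = "person"
    TOPIC = "topic"


@dataclass(frozen=True)
class Party:
    name: str
    kind: PartyKind


@dataclass(frozen=True)
class PartyRelationship:
    # Assumption: a relationship joins two parties found in the same article.
    first: Party
    second: Party
    url: str           # original article url
    archive_link: str  # link back to the archived copy


acme = Party("Acme", PartyKind.COMPANY)
energy = Party("energy", PartyKind.TOPIC)
rel = PartyRelationship(acme, energy, "https://example.com/a", "https://example.com/archive/a")
```

Frozen data classes are hashable, so relationships built this way can be put in a set.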

rules

  • only one relation regardless of multiple mentions
  • keep Spark possible for future scaling
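A minimal sketch of the first rule, assuming a relation is an unordered pair of party names within one article (the notes do not say whether relations are directed). `relations_for_article` is a hypothetical helper name.

```python
from itertools import combinations


def relations_for_article(mentions):
    """Return one pair per distinct pair of parties, however often each is mentioned."""
    # Deduplicate mentions first, then sort so each pair has a stable order.
    parties = sorted(set(mentions))
    return list(combinations(parties, 2))


pairs = relations_for_article(["Acme", "Bob", "Acme", "Bob", "Acme"])
```

Here `pairs` holds the single pair `("Acme", "Bob")` even though both parties are mentioned more than once.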

strategy

  • get text from warc
  • preprocess text
  • get entities from text
  • get topics from text (yes, but expensive; maybe do on demand)

  • form party relationships
  • discard articles with no entities
  • save party relationships with article clean text and ref back to warc
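The last three steps could be sketched roughly as below. Everything named here is an assumption for illustration: the article dict keys (`text`, `url`, `warc_ref`), the `extract_entities` callable standing in for whatever entity extractor is chosen, and JSON Lines as the storage format. Reading the WARC and preprocessing the text are taken as already done.

```python
import json
from itertools import combinations


def build_records(articles, extract_entities):
    """Yield one record per article that has at least one entity."""
    for article in articles:
        entities = sorted(set(extract_entities(article["text"])))
        if not entities:
            continue  # discard articles with no entities
        yield {
            # One relation per distinct pair; empty if only one entity was found.
            "relationships": [list(pair) for pair in combinations(entities, 2)],
            "clean_text": article["text"],
            "url": article["url"],
            "warc_ref": article["warc_ref"],  # reference back to the warc
        }


def save_jsonl(records, path):
    """Write records to a file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def fake_extractor(text):
    # Stand-in extractor: treats capitalised words as entities.
    return [word for word in text.split() if word[:1].isupper()]


articles = [
    {"text": "Acme hired Bob and Acme grew", "url": "u1", "warc_ref": "w1"},
    {"text": "nothing to see here", "url": "u2", "warc_ref": "w2"},
]
records = list(build_records(articles, fake_extractor))
```

Because `build_records` is a generator over independent articles, the same per-article logic could later be mapped over a distributed collection without changing its shape.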