Skip to content

Latest commit

 

History

History
19 lines (13 loc) · 943 Bytes

File metadata and controls

19 lines (13 loc) · 943 Bytes

Requirements

How to use

Run sequentially:

  1. Fix errors in XML files with correct.sh. Note that XML files must be placed within the same directory with Bash script.
  2. Parse XML file(s) by running xml_parser.py with name(s) of file(s) as mandatory parameters. JSONL file(s) will be created as a result.
  3. Create *.parquet files by running jsonl_parser.py. Use --active and --address parameters in order to parse only active-state entities, and parse address strings into its components.

Restrictions

  1. You need a lot of space on your drive (approx. 40 Gb to parse both UO and FOP files as of January 2022).
  2. Parsing addresses into components somewhen inaccurate. Managers posts too.