[Feature] External Data Support: Multi-Format, Remote Filesystem, Data Lake, and Optimization #158

@BingqingLyu

Is your feature request related to a problem? Please describe.

NeuG currently has limited support for external data sources. Users cannot read or import data from diverse file formats (e.g., GraphAR, Iceberg), cannot access files on remote storage systems (S3, OSS, HTTP), and cannot export query results to external formats like Parquet. These gaps block basic data ingestion workflows before any graph computation can take place.

Describe the solution you'd like

This is a tracking issue for the full External Data Support roadmap. It covers the following sub-tasks:

Sub-Issues

  1. [Feature] Support more external data formats (Parquet, GraphAR)

    • Parquet: basic support is largely in place; remaining gaps to be addressed.
    • GraphAR: support reading GraphAR-formatted graph data as external graph tables via LOAD FROM.
  2. [Feature] Support remote filesystem access (S3, OSS, HTTP)

    • Enable LOAD FROM to access files on Amazon S3, Alibaba Cloud OSS, and plain HTTP endpoints.
    • Implement a pluggable filesystem abstraction layer to support multiple remote backends.
  3. [Feature] Support data lake format: Apache Iceberg

    • Allow reading Iceberg tables as external data sources in LOAD FROM queries.
    • Support schema inference and snapshot-level reads.
  4. [Feature] Query optimization for external data

    • Partition pruning on Iceberg: skip irrelevant partitions based on query predicates.
    • Predicate pushdown on GraphAR / Parquet: push filter conditions into the scan layer to reduce I/O and improve performance.
    • Other scan-level optimizations as formats are added.
  5. [Feature] Export query results to Parquet

    • Support COPY ... TO '...' (FORMAT PARQUET) or equivalent syntax for exporting query results as Parquet files.
    • Enable writing to both the local filesystem and remote storage (S3, OSS, HTTP).
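
To make sub-issue 4 more concrete, here is a minimal Python sketch of the two optimizations it names. It is only an illustration under stated assumptions: `DataFile`, `prune_partitions`, and `scan_with_pushdown` are hypothetical names, not NeuG, Iceberg, or Parquet APIs, and rows are held in memory instead of being read from files.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical stand-in for one data file of a partitioned table."""
    path: str
    partition: dict  # partition column -> value, e.g. {"year": 2024}
    rows: list       # row dicts; a real scan would read these lazily from storage


def prune_partitions(files, partition_filter):
    """Partition pruning: keep a file unless its partition metadata
    contradicts the filter. Only metadata is inspected; no rows are read."""
    return [
        f for f in files
        if all(col not in f.partition or f.partition[col] == val
               for col, val in partition_filter.items())
    ]


def scan_with_pushdown(files, predicate):
    """Predicate pushdown: apply the filter inside the scan so that
    non-matching rows never reach the rest of the query pipeline."""
    for f in files:
        for row in f.rows:
            if predicate(row):
                yield row


# Illustrative data only.
files = [
    DataFile("year=2023/part-0", {"year": 2023},
             [{"id": 1, "age": 30}, {"id": 2, "age": 17}]),
    DataFile("year=2024/part-0", {"year": 2024},
             [{"id": 3, "age": 45}, {"id": 4, "age": 12}]),
]

kept = prune_partitions(files, {"year": 2024})
result = list(scan_with_pushdown(kept, lambda r: r["age"] >= 18))
```

In this sketch the 2023 file is skipped from metadata alone, and the age filter runs inside the scan. A real Parquet reader can go further by using row-group min/max statistics to skip whole blocks without decoding them.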

Describe alternatives you've considered

  • Requiring users to manually import all external data into NeuG before querying. This is the current workaround, but it adds friction and storage overhead.
  • Using external ETL pipelines to pre-convert data before loading. This shifts the complexity of format conversion entirely onto the user.

Additional context

  • Parquet read support via extension is already partially implemented.
  • Remote filesystem abstraction work has started (see extension/s3/).
  • All sub-features above should be tracked as individual child issues linked to this parent issue.
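
As a starting point for discussion, the pluggable filesystem layer from sub-issue 2 could look roughly like the Python sketch below: backends register under a URI scheme, and the loader dispatches on that scheme. Every name here (`FileSystem`, `register`, `open_uri`, `InMemoryFileSystem`) is hypothetical and does not reflect the code in extension/s3/; the in-memory backend only stands in for a real S3 or OSS client.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class FileSystem(ABC):
    """Hypothetical backend interface; each storage system implements it."""

    @abstractmethod
    def read_bytes(self, path: str) -> bytes:
        ...


class LocalFileSystem(FileSystem):
    def read_bytes(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()


class InMemoryFileSystem(FileSystem):
    """Test double standing in for a remote backend such as S3 or OSS."""

    def __init__(self, objects):
        self._objects = dict(objects)

    def read_bytes(self, path: str) -> bytes:
        return self._objects[path]


_REGISTRY = {}


def register(scheme: str, fs: FileSystem) -> None:
    """Associate a URI scheme (e.g. "s3") with a backend instance."""
    _REGISTRY[scheme] = fs


def open_uri(uri: str) -> bytes:
    """Pick the backend from the URI scheme and read the whole object."""
    parsed = urlparse(uri)
    scheme = parsed.scheme or "file"  # bare paths are treated as local files
    if scheme not in _REGISTRY:
        raise ValueError(f"no filesystem registered for scheme {scheme!r}")
    # For remote schemes the bucket or host lives in netloc, so keep it in the key.
    key = parsed.path if scheme == "file" else f"{parsed.netloc}{parsed.path}"
    return _REGISTRY[scheme].read_bytes(key)


# Illustrative wiring only; the bucket and object names are made up.
register("file", LocalFileSystem())
register("s3", InMemoryFileSystem({"bucket/data/person.csv": b"id,name\n1,Ann\n"}))

data = open_uri("s3://bucket/data/person.csv")
```

With this shape, adding OSS or HTTP support would mean registering one more backend, without touching the code that resolves LOAD FROM paths.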
