[Feature] External Data Support: Multi-Format, Remote Filesystem, Data Lake, and Optimization

**Is your feature request related to a problem? Please describe.**

NeuG currently has limited support for external data sources. Users cannot read or import data from diverse file formats (e.g., GraphAR, Iceberg), cannot access files on remote storage systems (S3, OSS, HTTP), and cannot export query results to external formats like Parquet. These gaps block basic data ingestion workflows before any graph computation can take place.

**Describe the solution you'd like**

This is a tracking issue for the full External Data Support roadmap. It covers the following sub-tasks:

### Sub-Issues

1. **[Feature] Support more external data formats (Parquet, GraphAR)**
   - Parquet: basic support is largely in place; remaining gaps to be addressed.
   - GraphAR: support reading GraphAR-formatted graph data as an external graph source via `LOAD FROM`.

2. **[Feature] Support remote filesystem access (S3, OSS, HTTP)**
   - Enable `LOAD FROM` to access files on Amazon S3, Alibaba Cloud OSS, and plain HTTP endpoints.
   - Implement a pluggable filesystem abstraction layer to support multiple remote backends.

3. **[Feature] Support data lake format: Apache Iceberg**
   - Allow reading Iceberg tables as external data sources in `LOAD FROM` queries.
   - Support schema inference and snapshot-level reads.

4. **[Feature] Query optimization for external data**
   - **Partition pruning on Iceberg**: skip irrelevant partitions based on query predicates.
   - **Predicate pushdown on GraphAR / Parquet**: push filter conditions into the scan layer to reduce I/O and improve performance.
   - Other scan-level optimizations as formats are added.

5. **[Feature] Export query results to Parquet**
   - Support `COPY ... TO '...' (FORMAT PARQUET)` or equivalent syntax for exporting query results as Parquet files.
   - Enable writing to both local filesystem and remote storage (S3, OSS, HTTP).
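
To make the partition pruning item in sub-issue 4 concrete, here is a minimal sketch of the idea in Python. It is not NeuG code and does not use any Iceberg API: `DataFile`, its `min_value` / `max_value` fields, and `prune_files` are hypothetical names, and the sketch assumes the scan layer can see a per-file value range for the filtered column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata: the value range of one column."""
    path: str
    min_value: int
    max_value: int


def prune_files(files, lower, upper):
    """Keep only files whose [min_value, max_value] range overlaps [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


files = [
    DataFile("part-0.parquet", 0, 99),
    DataFile("part-1.parquet", 100, 199),
    DataFile("part-2.parquet", 200, 299),
]

# Predicate: 150 <= value <= 210, so part-0 can be skipped without reading it.
selected = prune_files(files, 150, 210)
print([f.path for f in selected])
```

Predicate pushdown on Parquet / GraphAR follows the same pattern at a finer granularity: the filter is evaluated against metadata first, and only the surviving files or chunks are read.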

**Describe alternatives you've considered**

- Requiring users to manually import all external data into NeuG before querying. This is the current workaround, but it adds friction and storage overhead.
- Using external ETL pipelines to pre-convert data before loading. This shifts the format conversion complexity entirely onto the user.

**Additional context**

- Parquet read support via extension is already partially implemented.
- Remote filesystem abstraction work has started (see `extension/s3/`).
- All sub-features above should be tracked as individual child issues linked to this parent issue.
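
As a starting point for discussing the pluggable filesystem abstraction in sub-issue 2, the following Python sketch shows one possible shape: backends register under a URI scheme, and a path is dispatched to the matching backend. This is only an illustration of the dispatch idea, not the design in `extension/s3/`. `FileSystemRegistry`, `register`, `resolve`, and the stand-in backends are all hypothetical names, and the backends here return a label instead of performing real I/O.

```python
from urllib.parse import urlparse


class FileSystemRegistry:
    """Hypothetical registry that maps a URI scheme to a backend callable."""

    def __init__(self):
        self._backends = {}

    def register(self, scheme, backend):
        # Schemes are stored lowercase so that "S3://..." and "s3://..." match.
        self._backends[scheme.lower()] = backend

    def resolve(self, uri):
        # Paths without a scheme are treated as local files.
        scheme = urlparse(uri).scheme.lower() or "file"
        if scheme not in self._backends:
            raise ValueError(f"no filesystem registered for scheme {scheme!r}")
        return self._backends[scheme]


# Stand-in backends: each returns a label rather than doing real I/O.
def local_backend(uri):
    return ("local", uri)


def s3_backend(uri):
    return ("s3", uri)


registry = FileSystemRegistry()
registry.register("file", local_backend)
registry.register("s3", s3_backend)

uri = "s3://bucket/data.parquet"
print(registry.resolve(uri)(uri))
```

With this shape, adding OSS or HTTP support would mean registering one more backend, without changing the code that handles `LOAD FROM` paths.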
