Merged
4 changes: 3 additions & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
@@ -43,7 +43,7 @@
'_build',
'Thumbs.db',
'.DS_Store',
'tutorials/dataset_basic_tutorial.md',
'**/*_tutorial.md', # ipynb files will be used instead.
]
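To illustrate what the broader exclude glob is meant to cover, here is a minimal sketch. It uses a hand-rolled, hypothetical `glob_to_regex` helper and assumes that `**` may span directory separators while `*` stays within one path component. It does not call Sphinx's own matcher, which may differ in edge cases.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Hypothetical helper: translate a simple glob into a regex.

    Assumption: `**` matches any characters including `/`,
    while `*` matches any characters except `/`.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


matcher = glob_to_regex("**/*_tutorial.md")

# The previously hard-coded path is still covered by the glob.
old_entry_matches = bool(matcher.match("tutorials/dataset_basic_tutorial.md"))
# Markdown tutorials in nested directories are covered as well.
nested_matches = bool(matcher.match("tutorials/data_sources/bagz_data_source_tutorial.md"))
# Notebook files are not affected by this pattern.
notebook_matches = bool(matcher.match("tutorials/performance_debugging.ipynb"))
```

Under these assumptions a single glob replaces the per-file entry and also covers tutorials in nested directories, while the `.ipynb` files stay in the build.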

# Suppress warning in exception basic_data_tutorial
@@ -116,6 +116,8 @@
'tutorials/data_sources/bagz_data_source_tutorial.ipynb',
'tutorials/data_sources/huggingface_dataset_tutorial.ipynb',
'tutorials/data_sources/pytorch_dataset_tutorial.ipynb',
'tutorials/performance_debugging.ipynb',
'dataset/performance_debugging.ipynb',
]


2 changes: 1 addition & 1 deletion docs/grain.data_loader.rst
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
``grain`` DataLoader
=================
====================

.. automodule:: grain._src.python.data_loader
.. currentmodule:: grain
1 change: 1 addition & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
@@ -44,6 +44,7 @@ not depend on TensorFlow.
:maxdepth: 1
:hidden:
:caption: Get started
overview
installation
api_choice
```
6 changes: 4 additions & 2 deletions docs/tutorials/data_loader_tutorial.md
Original file line number Diff line number Diff line change
@@ -11,6 +11,8 @@ kernelspec:
name: python3
---



+++ {"id": "qGiXX-sg4l9o"}

# `DataLoader` guide
@@ -96,7 +98,7 @@ index_sampler = grain.IndexSampler(
## Data source
A data source is responsible for reading individual records from the underlying files or storage system. We provide the following data sources:

* `ArrayRecordDataSource`: reads records from [ArrayRecord](go/array-record-design) files.
* `ArrayRecordDataSource`: reads records from [ArrayRecord](https://github.com/google/array_record) files.
* `tfds.data_source`: data source for [TFDS](https://www.tensorflow.org/datasets) datasets without a TensorFlow dependency.


@@ -106,7 +108,7 @@ Below, we show an example using a TFDS data source, but using other data sources

## TFDS Data source

```{code-cell}
``` {code-cell}
---
executionInfo:
elapsed: 38785
17 changes: 10 additions & 7 deletions grain/_src/python/dataset/dataset.py
Original file line number Diff line number Diff line change
@@ -223,9 +223,10 @@ def range(

Input arguments are interpreted the same way as in Python built-in
``range``:
- ``range(n)`` => start=0, stop=n, step=1
- ``range(m, n)`` => start=m, stop=n, step=1
- ``range(m, n, p)`` => start=m, stop=n, step=p

- ``range(n)`` => start=0, stop=n, step=1
- ``range(m, n)`` => start=m, stop=n, step=1
- ``range(m, n, p)`` => start=m, stop=n, step=p
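As a quick check of the argument mapping listed above, the following sketch uses only Python's built-in `range`, which the docstring says the arguments follow. It does not call Grain itself.

```python
# Built-in range exposes the interpreted arguments as attributes.
one_arg = range(5)          # range(n)
two_args = range(2, 5)      # range(m, n)
three_args = range(2, 10, 3)  # range(m, n, p)

# (start, stop, step) as interpreted for each call form.
interpreted = [
    (r.start, r.stop, r.step) for r in (one_arg, two_args, three_args)
]

# The values each form produces.
values = [list(r) for r in (one_arg, two_args, three_args)]
```

Here `interpreted` is `[(0, 5, 1), (2, 5, 1), (2, 10, 3)]` and `values` is `[[0, 1, 2, 3, 4], [2, 3, 4], [2, 5, 8]]`.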

The produced values are consistent with the built-in `range` function::

@@ -572,8 +573,9 @@ def seed(self, seed: int) -> MapDataset[T]:
When default seed generation is enabled by calling ``ds.seed``, every
downstream random transformation will be automatically seeded with a unique
seed by default. This simplifies seed management, making it easier to avoid:
- Having to provide a seed in multiple transformations.
- Accidentally reusing the same seed across transformations.

- Having to provide a seed in multiple transformations.
- Accidentally reusing the same seed across transformations.
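The general idea of deriving a distinct seed per transformation from one base seed can be sketched with the standard library alone. This is an illustration only: `derive_seeds` is a hypothetical helper, and nothing here reflects how Grain actually generates its default seeds.

```python
import random


def derive_seeds(base_seed: int, num_transformations: int) -> list[int]:
    """Hypothetical helper: one distinct seed per random transformation.

    Not Grain's mechanism; it only illustrates that a single base seed
    can deterministically yield unique downstream seeds.
    """
    rng = random.Random(base_seed)
    # sample() draws without replacement, so the seeds cannot collide.
    return rng.sample(range(2**31), num_transformations)


seeds_first_run = derive_seeds(42, 3)
seeds_second_run = derive_seeds(42, 3)
```

With the same base seed, both runs yield the same three distinct seeds, so the caller sets one seed and never has to pass or track the others.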

It is recommended to call this right after the source. ``ds.seed`` has to be
called before any random transformations (such as ``shuffle`` or
@@ -1079,8 +1081,9 @@ def seed(self, seed: int) -> IterDataset[T]:
When default seed generation is enabled by calling ``ds.seed``, every
downstream random transformation will be automatically seeded with a unique
seed by default. This simplifies seed management, making it easier to avoid:
- Having to provide a seed in multiple transformations.
- Accidentally reusing the same seed across transformations.

- Having to provide a seed in multiple transformations.
- Accidentally reusing the same seed across transformations.

It is recommended to call this right after the source. ``ds.seed`` has to be
called before any random transformations (such as ``random_map`` that rely
Original file line number Diff line number Diff line change
@@ -601,7 +601,7 @@ class ConcatThenSplitIterDataset(dataset.IterDataset):
packed element. Positions indicate the position within the unpacked sequence.

Features can be "meta features" in which case they are never split
and we do not create *_positions and *_segment_ids features for them.
and we do not create ``*_positions`` and ``*_segment_ids`` features for them.
"""

def __init__(
@@ -623,8 +623,8 @@ def __init__(
meta_features: Set of feature names that are considered meta features.
Meta features are never split and will be duplicated when other features
of the same element are split. Otherwise, meta features are packed
normally (they have their own sequence length). No *_positions and
*_segment_ids features are created for meta features.
normally (they have their own sequence length). No ``*_positions`` and
``*_segment_ids`` features are created for meta features.
split_full_length_features: Whether full-length features are split, or
they are considered packed and passed through in priority. Setting
split_full_length_features=False is an optimization when some sequences