Skip to content

1558 csv data reader#1640

Merged
SFJohnson24 merged 35 commits intomainfrom
1558-csv-data-reader
Apr 8, 2026
Merged

1558 csv data reader#1640
SFJohnson24 merged 35 commits intomainfrom
1558-csv-data-reader

Conversation

@alexfurmenkov
Copy link
Copy Markdown
Collaborator

@alexfurmenkov alexfurmenkov commented Feb 26, 2026

Added support for data in .csv format.
Implemented params reading from .env files - based on .env in data folder.

@alexfurmenkov alexfurmenkov marked this pull request as ready for review March 2, 2026 08:56
Copy link
Copy Markdown
Collaborator

@SFJohnson24 SFJohnson24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good so far-- can you see messages on slack with me and Gerry re: the .env and taking that into account as well with csv runs. https://github.com/cdisc-org/cdisc-rules-engine/issues/1559 My parsing script matches your reader logic--there is a description of the .env structure there.
It would want to get the standard/version/substandard/CT/define xml path from the .env

@alexfurmenkov alexfurmenkov linked an issue Mar 5, 2026 that may be closed by this pull request
raise NotImplementedError

def from_file(self, file_path):
with open(file_path, "r", encoding=self.encoding) as fp:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need encoding + general error handling wrapping similar to json reader here.

first_record = {}

with open(self.file_path, encoding=self.encoding) as f:
dataset_length = sum(1 for _ in f) - 1 # subtract header
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can result in dataset_length = -1 for empty csv files. For those cases I guess it should be zero. So we should clamp it to zero.

description = "JSON data is malformed."


class InvalidCSVFormat(EngineError):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better naming for this would be InvalidCSVFile instead of format because this error is for when csv is malformed or encoding failed which makes it invalid csv file. What do you think?

Comment thread core.py Outdated
type=click.Path(exists=True, readable=True, resolve_path=True),
help="Path to directory containing local rules.",
multiple=True,
envvar="LOCAL_RULES",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no LOCAL_RULES

Comment thread core.py Outdated
required=False,
help="Rule ID to get rule for.",
multiple=True,
envvar="RULEID",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no RULEID

Copy link
Copy Markdown
Collaborator

@SFJohnson24 SFJohnson24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noticed a lot of env params that dont exist:

PRODUCT=TIG
VERSION=1-0
SUBSTANDARD=SDTM
CT=sdtmct2020-12-18
DEFINE_XML=define_negative1.xml

above is a complete .env file with all possible params from the cdisc-open-rules repository. You can hop on the rules branch and see what it looks like if you want

Comment thread core.py
required=True,
default=None,
help="Standard version to validate against",
envvar="VERSION",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All else looks fine to me. Could you please adda test where we check the .env interoperability. You can cover all the flags in few tests.

Copy link
Copy Markdown
Collaborator

@RamilCDISC RamilCDISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the following command in the cli:
python core.py validate -s sdtmig -v 3.4 -dp tests/resources/CoreIssue1558/datasets/dm.csv -dep tests/resources/CoreIssue1558/test.env

The obtained repor tis attached. There are errors for rule CORE-000658 CORE-000711 CORE-000713 CORE-000714. The issue details sheet shows the error in the Value(s) column.

CORE-Report-2026-04-06T15-00-16.xlsx

@alexfurmenkov alexfurmenkov requested a review from RamilCDISC April 7, 2026 17:11
@RamilCDISC
Copy link
Copy Markdown
Collaborator

@SFJohnson24 Could you please confirm if you agree with the changes made on your request. I have done manual validation and as well as code review. All seems good to me. Once you give a heads up i will add final comment and approve.

Comment thread env.example
MAX_REPORT_ROWS = 10 # integer for maximum number of issues per excel sheet (plus headers) in result report. Defaults to 10000.
MAX_ERRORS_PER_RULE = (10, True) # Tuple for maximum number of errors to report per rule during a validation run. Also has a per dataset flag described as second bool value in readme. example value
DEFINE_XML
DATA_DIR No newline at end of file
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is DATA_DIR? Is this an artifact?

Copy link
Copy Markdown
Collaborator

@SFJohnson24 SFJohnson24 Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this example file also have the other variables the .env can handle?

Comment thread core.py
"CDISC TIG Use Case for scoping a TIG Validation."
"Any of INDH, PROD, NONCLIN, or ANALYSIS."
),
)
Copy link
Copy Markdown
Collaborator

@SFJohnson24 SFJohnson24 Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this on TIG and it we will need envvar="USE_CASE" for this standard. I updated the open rules to reflect USE_CASE as an .env variable

Copy link
Copy Markdown
Collaborator

@SFJohnson24 SFJohnson24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good--just 1 small tweak that came up in my TIG testing. Looks good for other standards though

@SFJohnson24
Copy link
Copy Markdown
Collaborator

i pushed the 2 lines up just to expedite this getting merged--looks good @alexfurmenkov

@SFJohnson24 SFJohnson24 self-requested a review April 8, 2026 17:53
@SFJohnson24 SFJohnson24 merged commit eea0443 into main Apr 8, 2026
12 checks passed
@SFJohnson24 SFJohnson24 deleted the 1558-csv-data-reader branch April 8, 2026 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support CSV datasets in engine

3 participants