API Reference
@df_in
Validates DataFrame parameters passed to a function.
@df_in(
    name: str | None = None,
    columns: list[str] | dict[str, str | dict] | None = None,
    strict: bool | None = None,
    lazy: bool | None = None,
    composite_unique: list[list[str]] | None = None,
    row_validator: type[BaseModel] | None = None,
    min_rows: int | None = None,
    max_rows: int | None = None,
    exact_rows: int | None = None,
    allow_empty: bool | None = None
)
Parameters:
| Parameter | Type | Description |
|---|---|---|
| name | str or None | Name of the parameter to validate. If not specified, validates the first DataFrame parameter. |
| columns | list or dict or None | Column specification. See Column Specifications. |
| strict | bool or None | If True, raises error for unexpected columns. Defaults to project config or False. |
| lazy | bool or None | If True, collects all validation errors before raising. Defaults to project config or False. |
| composite_unique | list[list[str]] or None | Column combinations that must be unique (for example [['first_name', 'last_name']]). |
| row_validator | type[BaseModel] or None | Pydantic model for row-level validation. |
| min_rows | int or None | Minimum number of rows required. |
| max_rows | int or None | Maximum number of rows allowed. |
| exact_rows | int or None | Exact number of rows required. |
| allow_empty | bool or None | Whether empty DataFrames are allowed. Defaults to project config or True. |
Examples:
# Column list
@df_in(["a", "b", "c"])
# With dtypes
@df_in({"a": "int64", "b": "object"})
# With constraints
@df_in({"price": {"dtype": "float64", "nullable": False, "checks": {"gt": 0}}})
# Multiple DataFrames — use name to target specific parameters
@df_in(name="orders", columns=["id", "total"])
@df_in(name="customers", columns=["id", "name"])
# Row validation
@df_in(row_validator=OrderModel)
# Shape/lazy controls
@df_in({"email": {"nullable": False}}, lazy=True, min_rows=1, allow_empty=False)
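The rule for name ("validates the first DataFrame parameter" when it is omitted) can be pictured with a short, library-independent sketch. Everything here is an assumption made for illustration: `Frame` is a stand-in for a DataFrame, and `pick_frame` is a hypothetical helper, not part of Daffy's API or its actual implementation.

```python
import inspect


class Frame:
    """Stand-in for a DataFrame so the sketch needs no third-party imports."""

    def __init__(self, columns):
        self.columns = list(columns)


def pick_frame(func, args, kwargs, name=None):
    """Return (parameter_name, value) for the argument that would be validated."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    if name is not None:
        # An explicit name targets exactly one parameter.
        if name not in bound.arguments:
            raise ValueError(f"no parameter named {name!r}")
        return name, bound.arguments[name]
    # Without a name, fall back to the first argument that is a Frame.
    for param, value in bound.arguments.items():
        if isinstance(value, Frame):
            return param, value
    raise ValueError("no DataFrame argument found")


def join(orders, customers, how="inner"):
    return orders


orders = Frame(["id", "total"])
customers = Frame(["id", "name"])
first = pick_frame(join, (orders, customers), {})
named = pick_frame(join, (orders,), {"customers": customers}, name="customers")
```

Here `first` resolves to the `orders` argument and `named` to `customers`, which mirrors why the multiple-DataFrame example stacks one decorator per name.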
@df_out
Validates the DataFrame returned by a function.
@df_out(
    columns: list[str] | dict[str, str | dict] | None = None,
    strict: bool | None = None,
    lazy: bool | None = None,
    composite_unique: list[list[str]] | None = None,
    row_validator: type[BaseModel] | None = None,
    min_rows: int | None = None,
    max_rows: int | None = None,
    exact_rows: int | None = None,
    allow_empty: bool | None = None
)
Parameters:
| Parameter | Type | Description |
|---|---|---|
| columns | list or dict or None | Column specification. See Column Specifications. |
| strict | bool or None | If True, raises error for unexpected columns. Defaults to project config or False. |
| lazy | bool or None | If True, collects all validation errors before raising. Defaults to project config or False. |
| composite_unique | list[list[str]] or None | Column combinations that must be unique (for example [['country', 'city']]). |
| row_validator | type[BaseModel] or None | Pydantic model for row-level validation. |
| min_rows | int or None | Minimum number of rows required. |
| max_rows | int or None | Maximum number of rows allowed. |
| exact_rows | int or None | Exact number of rows required. |
| allow_empty | bool or None | Whether empty DataFrames are allowed. Defaults to project config or True. |
Examples:
@df_out(["result", "score"])
@df_out({"score": "float64"}, strict=True)
@df_out(row_validator=ResultModel)
@df_out(min_rows=10, max_rows=100, lazy=True)
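To make the interaction between the row-count limits and lazy concrete, here is a minimal sketch in plain Python. It is not Daffy's implementation: `ShapeError`, `check_shape`, and `shape_error` are hypothetical names, the messages are invented, and the order in which the limits are checked is an assumption.

```python
class ShapeError(Exception):
    """Hypothetical error type used only by this sketch."""


def check_shape(n_rows, min_rows=None, max_rows=None, exact_rows=None,
                allow_empty=True, lazy=False):
    """Raise ShapeError if the row count violates any configured limit."""
    problems = []

    def report(message):
        # Eager mode stops at the first problem; lazy mode keeps collecting.
        if not lazy:
            raise ShapeError(message)
        problems.append(message)

    if n_rows == 0 and not allow_empty:
        report("DataFrame is empty")
    if min_rows is not None and n_rows < min_rows:
        report(f"expected at least {min_rows} rows, got {n_rows}")
    if max_rows is not None and n_rows > max_rows:
        report(f"expected at most {max_rows} rows, got {n_rows}")
    if exact_rows is not None and n_rows != exact_rows:
        report(f"expected exactly {exact_rows} rows, got {n_rows}")
    if problems:
        raise ShapeError("; ".join(problems))


def shape_error(**kwargs):
    """Return the message check_shape raises, or None if the shape passes."""
    try:
        check_shape(**kwargs)
    except ShapeError as exc:
        return str(exc)
    return None


eager = shape_error(n_rows=0, min_rows=10, allow_empty=False)
collected = shape_error(n_rows=0, min_rows=10, allow_empty=False, lazy=True)
```

In this sketch `eager` reports only the first violation, while `collected` reports both the empty frame and the unmet minimum in a single error.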
@df_log
Logs DataFrame structure when entering and exiting a function.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| level | int | Logging level for emitted messages. Default: logging.DEBUG. |
| include_dtypes | bool | If True, includes column dtypes in log output. Default: False. |
Examples:
@df_log(level=logging.INFO)
def process(df):
    return df
# Logs: Function process parameters contained a DataFrame: columns: ['a', 'b']
# Logs: Function process returned a DataFrame: columns: ['a', 'b']
@df_log(include_dtypes=True)
def process(df):
    return df
# Logs: ... columns: ['a', 'b'] with dtypes ['int64', 'object']
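For readers curious how an enter/exit logging decorator of this kind can be built, the following is a rough sketch using only the standard library. It is an illustration under stated assumptions rather than Daffy's code: `Frame`, `describe`, `log_frames`, and the logger name `df_log_sketch` are all hypothetical, and the message wording simply copies the log lines shown above.

```python
import functools
import logging

# Hypothetical logger name chosen for this sketch.
logger = logging.getLogger("df_log_sketch")


class Frame:
    """Stand-in for a DataFrame: just column names and dtype names."""

    def __init__(self, columns, dtypes):
        self.columns = list(columns)
        self.dtypes = list(dtypes)


def describe(frame, include_dtypes):
    """Format the structure of a frame for a log message."""
    text = f"columns: {frame.columns}"
    if include_dtypes:
        text += f" with dtypes {frame.dtypes}"
    return text


def log_frames(level=logging.DEBUG, include_dtypes=False):
    """Decorator factory that logs Frame arguments and a Frame return value."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for value in list(args) + list(kwargs.values()):
                if isinstance(value, Frame):
                    logger.log(level, "Function %s parameters contained a DataFrame: %s",
                               func.__name__, describe(value, include_dtypes))
            result = func(*args, **kwargs)
            if isinstance(result, Frame):
                logger.log(level, "Function %s returned a DataFrame: %s",
                           func.__name__, describe(result, include_dtypes))
            return result
        return wrapper
    return decorator


@log_frames(level=logging.INFO)
def process(df):
    return df
```

As with any use of the logging module, nothing is emitted until the logger (or the root logger) is configured at a level that admits the chosen `level`.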
Column Specifications
Columns can be specified in several formats:
List Format
Simple list of required column names:
columns=["a", "b", "c"]
Dict with Dtypes
Map column names to expected dtypes:
columns={"a": "int64", "b": "object"}
Rich Column Spec
Full control over column validation:
columns={
    "column_name": {
        "dtype": "float64",   # Expected dtype (optional)
        "nullable": False,    # Allow null values? Default: True
        "unique": True,       # Require unique values? Default: False
        "required": True,     # Is column required? Default: True
        "checks": {           # Value checks (optional)
            "gt": 0,
            "lt": 100
        }
    }
}
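The defaults in the spec above (nullable: True, unique: False, required: True) can be read as the following plain-Python sketch. It is only an illustration: `column_problems` is a hypothetical helper, a table is modelled as a dict of lists, `dtype` and `checks` are ignored, and treating nulls as exempt from the uniqueness test is an assumption.

```python
def column_problems(name, spec, table):
    """List the problems one rich column spec finds in `table` (a dict of lists)."""
    if name not in table:
        # "required" defaults to True, so a missing column is normally a problem.
        return [f"missing column {name!r}"] if spec.get("required", True) else []
    values = table[name]
    problems = []
    # "nullable" defaults to True, so nulls are only flagged when it is False.
    if not spec.get("nullable", True) and any(v is None for v in values):
        problems.append(f"column {name!r} contains nulls")
    # "unique" defaults to False; nulls are skipped here by assumption.
    if spec.get("unique", False):
        present = [v for v in values if v is not None]
        if len(set(present)) != len(present):
            problems.append(f"column {name!r} contains duplicates")
    return problems


table = {"price": [9.5, None, 9.5]}
spec = {"dtype": "float64", "nullable": False, "unique": True}
found = column_problems("price", spec, table)
optional = column_problems("discount", {"required": False}, table)
```

With this input, `found` lists both the null and the duplicate problem, and `optional` is empty because the missing column was marked as not required.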
Regex Patterns
Match multiple columns with regex:
columns=["id", "r/feature_\\d+/"] # Matches feature_1, feature_2, etc.
columns={"r/score_\\d+/": {"dtype": "float64", "checks": {"between": (0, 100)}}}
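As a rough picture of how an `r/.../` entry could be expanded into concrete column names, consider the sketch below. `expand_spec` is a hypothetical helper, and requiring the pattern to match the whole column name (`fullmatch`) is an assumption of this sketch; Daffy's actual matching rule may differ.

```python
import re


def expand_spec(spec, columns):
    """Return the names in `columns` selected by one column spec entry."""
    if spec.startswith("r/") and spec.endswith("/") and len(spec) > 3:
        # Strip the leading "r/" and trailing "/" to get the regex itself.
        pattern = re.compile(spec[2:-1])
        return [name for name in columns if pattern.fullmatch(name)]
    # Anything else is treated as a literal column name.
    return [name for name in columns if name == spec]


columns = ["id", "feature_1", "feature_2", "feature_x", "score_10"]
matched = expand_spec("r/feature_\\d+/", columns)
literal = expand_spec("id", columns)
```

Here `matched` contains `feature_1` and `feature_2` but not `feature_x`, since `\d+` requires at least one digit.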
Value Checks
Available built-in checks for the checks parameter:
| Check | Argument | Description |
|---|---|---|
| gt | number | Greater than |
| ge | number | Greater than or equal |
| lt | number | Less than |
| le | number | Less than or equal |
| eq | value | Equal to |
| ne | value | Not equal to |
| between | (lo, hi) | Value in range (inclusive) |
| isin | list | Value in set |
| notin | list | Value not in set |
| notnull | True | No null values |
| str_regex | pattern | String matches regex |
| str_startswith | str | String starts with prefix |
| str_endswith | str | String ends with suffix |
| str_contains | str | String contains substring |
| str_length | (lo, hi) | String length in range |
Custom checks are also supported by passing a callable as the check value.
Examples:
columns={
    "price": {"checks": {"gt": 0, "lt": 10000}},
    "score": {"checks": {"between": (0, 100)}},
    "status": {"checks": {"isin": ["active", "pending"]}},
    "email": {"checks": {"str_regex": r"^[^@]+@[^@]+\.[^@]+$"}}
}
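One way to think about a checks dict is as a set of named predicates applied to every value in a column. The sketch below covers a subset of the built-in checks in plain Python. It is not Daffy's implementation: `CHECKS` and `failing_values` are hypothetical, `str_regex` is modelled with `re.search`, and calling a custom callable once per value is an assumption (the library may pass it the whole column instead).

```python
import re

# Predicates for a subset of the built-in checks; each takes (value, argument).
CHECKS = {
    "gt": lambda v, arg: v > arg,
    "ge": lambda v, arg: v >= arg,
    "lt": lambda v, arg: v < arg,
    "le": lambda v, arg: v <= arg,
    "eq": lambda v, arg: v == arg,
    "ne": lambda v, arg: v != arg,
    "between": lambda v, arg: arg[0] <= v <= arg[1],  # inclusive on both ends
    "isin": lambda v, arg: v in arg,
    "notin": lambda v, arg: v not in arg,
    "str_regex": lambda v, arg: re.search(arg, v) is not None,
    "str_startswith": lambda v, arg: v.startswith(arg),
    "str_endswith": lambda v, arg: v.endswith(arg),
    "str_contains": lambda v, arg: arg in v,
    "str_length": lambda v, arg: arg[0] <= len(v) <= arg[1],
}


def failing_values(values, checks):
    """Map each failed check name to the values that violated it."""
    failures = {}
    for name, arg in checks.items():
        if callable(arg):
            # Custom check: in this sketch the callable receives one value.
            bad = [v for v in values if not arg(v)]
        else:
            bad = [v for v in values if not CHECKS[name](v, arg)]
        if bad:
            failures[name] = bad
    return failures


prices = failing_values([5, -1, 20000], {"gt": 0, "lt": 10000})
scores = failing_values([0, 100, 101], {"between": (0, 100)})
custom = failing_values([2, 3], {"even": lambda v: v % 2 == 0})
```

The `scores` result shows the inclusive range from the table: 0 and 100 pass, and only 101 is reported.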
Configuration
Configure Daffy in pyproject.toml:
[tool.daffy]
strict = false # Default strict mode (default: false)
row_validation_max_errors = 5 # Max errors shown in row validation (default: 5)
checks_max_samples = 5 # Max sample values in check errors (default: 5)
Configuration is read from the pyproject.toml in the current working directory or any parent directory.