Transform¶
Bidirectional data transformation engine for round-trip workflows (decode, patch, re-encode) on nested binary blobs. Domain-agnostic: the engine knows nothing about SAS, Viya, or any specific platform.
TL;DR¶
from kstlib.transform import transform
# Decode a base64+zlib+JSON+XML blob, patch a hostname inside the
# inner XML, and re-encode back to the exact same format
patched_blob = transform(
    blob_b64_string,
    chain_name="patch_report",
)
The chain definition lives in kstlib.conf.yml:
transforms:
  chains:
    sas_report:
      forward:
        - base64
        - zlib:
            skip_bytes: 3
        - json:
            extract: "transferableContent.content"
      backward:
        - json:
            wrap: "transferableContent.content"
        - zlib:
            prepend_bytes: "4d1504"
        - base64
    patch_report:
      preset: sas_report
      patch:
        scope: blob
        replace:
          "https://old-host/": "https://new-host/"
Concept: preset vs usage pattern¶
Transform chains follow a two-tier convention:
- Preset: a reusable decode/encode chain. Define it once, reference it from multiple usage chains. A preset has a forward: block (and optionally backward:) but no patch:.
- Usage: a chain that inherits the preset’s encode/decode pipeline via preset: and adds its own patch: block (or composed_patch: for surgical workflows).
chains:
  # PRESET: define once, reuse many times
  sas_report:
    forward: [base64, {zlib: {skip_bytes: 3}}, {json: {extract: "..."}}]
    backward: [{json: {wrap: "..."}}, {zlib: {prepend_bytes: "4d1504"}}, base64]
  # USAGE: inherit the preset, add a specific patch
  patch_dev:
    preset: sas_report
    patch:
      replace: {"prod-host": "dev-host"}
  patch_prod:
    preset: sas_report
    patch:
      replace: {"dev-host": "prod-host"}
This separation lets you maintain one decode/encode pipeline definition across many environment-specific patches without duplication.
Note
Chained presets are not supported: a preset cannot itself reference
another preset via preset:. The validation enforces this at
config-load time.
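The preset convention can be pictured with plain dicts. The sketch below is not the kstlib implementation: resolve_chain is a hypothetical helper, and the merge rule (copy forward/backward from the preset, then overlay the usage chain's own keys) is an assumption made for illustration.

```python
def resolve_chain(chains: dict, name: str) -> dict:
    """Return a chain with its preset's forward/backward merged in (hypothetical helper)."""
    chain = chains[name]
    preset_name = chain.get("preset")
    if preset_name is None:
        return dict(chain)
    preset = chains[preset_name]
    # Mirror the rule from the note above: a preset may not reference another preset.
    if "preset" in preset:
        raise ValueError(f"chained presets are not supported: {preset_name!r}")
    resolved = {key: preset[key] for key in ("forward", "backward") if key in preset}
    resolved.update({key: value for key, value in chain.items() if key != "preset"})
    return resolved


chains = {
    "sas_report": {
        "forward": ["base64", "zlib", "json"],
        "backward": ["json", "zlib", "base64"],
    },
    "patch_dev": {"preset": "sas_report", "patch": {"replace": {"prod-host": "dev-host"}}},
    "nested": {"preset": "patch_dev"},  # invalid: its preset is itself a usage chain
}

resolved = resolve_chain(chains, "patch_dev")
```

Resolving "nested" raises ValueError in this sketch, which corresponds to the config-load validation described above.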
The 5 primitives¶
Each primitive is bidirectional. The forward direction decodes /
parses, the backward direction re-encodes / serializes.
base64¶
forward:
- base64 # str (b64) -> bytes
backward:
- base64 # bytes -> str (b64)
Pure RFC 4648 base64 encoding. Three options support proprietary wire formats like SAS Viya report blobs:
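| Option | Default | Purpose |
|---|---|---|
| strict | | When false, characters outside the base64 alphabet (such as the ### separator) are tolerated instead of rejected. |
| strip_prefix | | Literal string removed from the start of the input before decoding. No-op if the input does not start with it (allows mixed blobs). Max 32 chars. |
| prefix | | Literal string prepended to the base64 result on encode. Mirrors strip_prefix. |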
# SAS Viya report blob: "TRUE###" prefix where ### is a non-alphabet
# separator that strict mode would reject.
forward:
  - base64:
      strict: false             # tolerate ### separator
      strip_prefix: "TRUE###"   # remove SAS marker before decode
backward:
  - base64:
      prefix: "TRUE###"         # re-add SAS marker after encode
The prefix "TRUE" is a clever trick: those 4 base64 chars decode to
exactly M\x15\x04 (the SAS proprietary 3-byte header). The ###
that follows is a separator that lenient base64 decoders skip. By
stripping "TRUE###" first and decoding the rest, you get the raw
zlib stream directly without needing skip_bytes on the next stage.
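The header claim is easy to check with the standard library alone. The sketch below uses two hypothetical helpers, b64_forward and b64_backward, that only imitate the documented strip_prefix / prefix behavior; they are not the kstlib primitives.

```python
import base64


def b64_forward(text: str, strip_prefix: str = "", strict: bool = True) -> bytes:
    """Decode base64 after removing an optional literal prefix (no-op if absent)."""
    if strip_prefix and text.startswith(strip_prefix):
        text = text[len(strip_prefix):]
    return base64.b64decode(text, validate=strict)


def b64_backward(data: bytes, prefix: str = "") -> str:
    """Encode to base64 and prepend an optional literal prefix."""
    return prefix + base64.b64encode(data).decode("ascii")


# "TRUE" is valid base64 and decodes to the 3-byte header M\x15\x04 (hex 4d1504).
header = base64.b64decode("TRUE", validate=True)

wire = b64_backward(b"payload", prefix="TRUE###")
round_trip = b64_forward(wire, strip_prefix="TRUE###")
unprefixed = b64_forward(base64.b64encode(b"payload").decode("ascii"), strip_prefix="TRUE###")
```

The hex form of header is the same value used for prepend_bytes in the legacy zlib configuration.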
bytes¶
forward:
- bytes # bytes -> str (utf-8)
backward:
- bytes # str -> bytes (utf-8)
Used to bridge between binary and string-typed primitives. Accepts
an encoding option (default utf-8).
zlib¶
forward:
- zlib # compressed -> bytes
backward:
- zlib # bytes -> compressed
| Option | Default | Purpose |
|---|---|---|
| skip_bytes | | Strip N leading bytes before decompression. Useful for legacy formats that prepend a header to the zlib stream. |
| prepend_bytes | | Hex string prepended after compression. Mirror of skip_bytes. |
| level | | Compression level. |
# Maximum compression for the SAS Viya backward chain
backward:
  - zlib:
      level: 9   # smallest output, slowest
skip_bytes cannot be auto-reversed (the engine cannot guess what bytes
to re-prepend), so any chain that uses it must declare an explicit
backward: block with prepend_bytes.
Tip
Modern SAS Viya workflows should prefer the prefix-based approach
(see the base64 primitive’s strip_prefix / prefix options) over
the legacy skip_bytes / prepend_bytes pair. The prefix approach is
simpler, configurable from YAML alone, and avoids byte-counting math.
The decompressor enforces a hard ratio limit (max 100x expansion) and an absolute size limit (200 MB) to prevent zlib bombs.
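One common way to enforce such limits without inflating the whole stream first is to decompress in bounded chunks. The following is a minimal sketch of that technique using only the standard library; bounded_decompress and DecompressLimitError are hypothetical names, and the real primitive may work differently.

```python
import zlib

MAX_SIZE = 200 * 1024 * 1024  # absolute limit quoted above
MAX_RATIO = 100               # ratio limit quoted above


class DecompressLimitError(ValueError):
    """Hypothetical stand-in for the error raised by the real primitive."""


def bounded_decompress(data: bytes, max_size: int = MAX_SIZE,
                       max_ratio: int = MAX_RATIO, chunk_size: int = 64 * 1024) -> bytes:
    """Inflate a zlib stream chunk by chunk, aborting once a limit is exceeded."""
    limit = min(max_size, len(data) * max_ratio)
    decompressor = zlib.decompressobj()
    output = bytearray()
    pending = data
    while pending:
        # max_length caps how much output a single call may produce.
        output += decompressor.decompress(pending, chunk_size)
        if len(output) > limit:
            raise DecompressLimitError(f"output exceeds {limit} bytes")
        pending = decompressor.unconsumed_tail
    output += decompressor.flush()
    if len(output) > limit:
        raise DecompressLimitError(f"output exceeds {limit} bytes")
    return bytes(output)


original = b"hello world " * 20
restored = bounded_decompress(zlib.compress(original))
bomb = zlib.compress(b"\0" * 10_000_000)  # roughly 1000x expansion
```

Calling bounded_decompress(bomb) raises after a few chunks, long before the full 10 MB has been produced.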
json¶
forward:
- json # str -> dict
backward:
- json # dict -> str (UTF-8)
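Standard JSON parsing and serialization. Four options, one for the parse side and three for the serialize side:
| Option | Default | Purpose |
|---|---|---|
| extract | | Forward-only. Drill into a nested envelope (dot-notation path). |
| wrap | | Backward-only. Restore the value into the envelope captured during forward. |
| minify | | When true, serialize compactly, without extra whitespace. |
| ensure_ascii | false | When false, non-ASCII characters are written as-is instead of as \uXXXX escapes. |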
Warning
The kstlib ensure_ascii default is false, NOT true like Python
stdlib. This is an intentional divergence: SAS Viya report blobs
(and many other real-world payloads) contain accented characters
(café, données) that would otherwise be escaped (café becomes
caf\u00e9). Each escaped character takes 6 bytes instead of 2 in
UTF-8, which inflates the JSON before compression.
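The effect of ensure_ascii can be observed with the stdlib json module on its own. This small sketch compares both settings on a synthetic payload; it does not call kstlib.

```python
import json

payload = {"label": "café données"}

# Python stdlib default: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(payload)
# The kstlib default described above corresponds to ensure_ascii=False.
raw = json.dumps(payload, ensure_ascii=False)

escaped_size = len(escaped.encode("utf-8"))
raw_size = len(raw.encode("utf-8"))
```

Both strings parse back to the same dict; only the serialized size differs.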
extract / wrap are useful for envelope-style payloads:
forward:
  - json:
      extract: "transferableContent.content"   # Drill into the envelope
backward:
  - json:
      wrap: "transferableContent.content"      # Restore the envelope
The forward path stores the original envelope internally
(in _ChainContext.json_envelopes) so the backward path can rebuild
the exact same structure even when only the inner payload was patched.
This makes the round-trip lossless.
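The envelope round-trip can be sketched with two small functions. extract_path and wrap_path are hypothetical names used for illustration; the real engine keeps the envelope in its chain context rather than passing it around explicitly.

```python
import copy


def extract_path(document: dict, dotted_path: str):
    """Follow a dot-notation path and return the nested value."""
    node = document
    for key in dotted_path.split("."):
        node = node[key]
    return node


def wrap_path(envelope: dict, dotted_path: str, value) -> dict:
    """Return a copy of the envelope with the value at dotted_path replaced."""
    rebuilt = copy.deepcopy(envelope)
    *parents, leaf = dotted_path.split(".")
    node = rebuilt
    for key in parents:
        node = node[key]
    node[leaf] = value
    return rebuilt


envelope = {"version": 1, "transferableContent": {"content": "<old/>", "kind": "report"}}
inner = extract_path(envelope, "transferableContent.content")
restored = wrap_path(envelope, "transferableContent.content", inner.replace("old", "new"))
```

Everything outside the patched path (version, kind) survives untouched, which is what makes the round-trip lossless.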
xml¶
forward:
- xml # str -> ElementTree.Element
backward:
- xml # Element -> str
Uses defusedxml if available (recommended for security). DOCTYPE
declarations are rejected by default to prevent XXE attacks and
billion-laughs expansion.
The patch stage operates directly on the XML string before re-parsing, which is faster and more flexible than walking the Element tree.
SAS Viya blob formats¶
SAS Viya transfer packages use two distinct blob formats depending on the type of object being serialized. The transform engine ships two ready-to-use presets, one for each format.
Format A: report blobs (compressed binary)¶
Used for objects with a large textual payload (BIRD XML reports, data sources, …). The wire format is:
"TRUE###" + base64(zlib(JSON envelope))
- "TRUE" is 4 base64 characters that decode to M\x15\x04, the 3-byte SAS proprietary header
- "###" is a SAS proprietary separator (NOT in the base64 alphabet, must be stripped before strict decoding)
- The remainder is base64-encoded zlib-compressed JSON envelope containing the inner payload (e.g. BIRD XML)
The clever bit: stripping "TRUE###" and then base64-decoding the
rest yields a raw zlib stream directly, with no additional skip_bytes
needed. This is Approach A and matches the actual SAS wire format
bit-for-bit.
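The equivalence between the prefix approach and the legacy skip_bytes approach can be checked on a synthetic blob with the standard library. The envelope below is made up for the demonstration and only follows the wire layout described above; it is not a real SAS Viya export.

```python
import base64
import json
import zlib

PREFIX = "TRUE###"
envelope = {"transferableContent": {"content": "<report host='https://old-host/'/>"}}

# Build a synthetic Format A blob: prefix + base64(zlib(JSON)).
stream = zlib.compress(json.dumps(envelope, ensure_ascii=False).encode("utf-8"), 9)
blob = PREFIX + base64.b64encode(stream).decode("ascii")

# Prefix approach: strip the marker, strict-decode the rest, inflate directly.
body = blob[len(PREFIX):]
decoded_prefix = json.loads(zlib.decompress(base64.b64decode(body, validate=True)))

# Legacy approach: lenient decode of the whole string (### is discarded),
# then skip the 3 header bytes that "TRUE" decodes to.
raw = base64.b64decode(blob, validate=False)
decoded_legacy = json.loads(zlib.decompress(raw[3:]))
```

Both paths recover the same envelope, and the three skipped bytes are exactly M\x15\x04.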
Format B: metadata blobs (plain JSON)¶
Used for small metadata objects (folders, files, ACLs, …). The wire format is much simpler:
base64(JSON document)
No prefix, no compression, no proprietary header. Just plain base64-encoded JSON.
The two presets¶
transforms:
  chains:
    # PRESET: sas_report (Format A - compressed binary)
    sas_report:
      forward:
        - base64:
            strict: false             # tolerate ### separator
            strip_prefix: "TRUE###"   # strip SAS marker
        - zlib                        # no skip_bytes - Approach A
        - json:
            extract: "transferableContent.content"
      backward:
        - json:
            wrap: "transferableContent.content"
            minify: true              # compact before compression
            ensure_ascii: false       # preserve French/Unicode
        - zlib:
            level: 9                  # max compression
        - base64:
            prefix: "TRUE###"         # re-add SAS marker
    # PRESET: sas_metadata (Format B - plain JSON)
    sas_metadata:
      forward:
        - base64:
            strict: true              # pure base64, no SAS noise
        - json
      backward:
        - json:
            minify: true              # smaller output
            ensure_ascii: false       # preserve Unicode
        - base64
Cross-format dispatch in user code¶
The transform engine itself does not auto-detect which format applies
to a given blob. The caller (a pipeline step or a Python script that
iterates the package) is responsible for picking the right preset
based on transferObject.summary.type:
import json

from kstlib.transform import TransformChain, load_transform_config

config = load_transform_config()
report_chain = TransformChain.from_config("sas_report", config)
metadata_chain = TransformChain.from_config("sas_metadata", config)

# Read as UTF-8 explicitly: the package may contain non-ASCII text
with open("MyPackage.json", encoding="utf-8") as f:
    pkg = json.load(f)

for detail in pkg["transferDetails"]:
    to = detail["transferObject"]
    obj_type = to.get("summary", {}).get("type")
    # Reports use Format A, everything else is treated as Format B
    chain = report_chain if obj_type == "report" else metadata_chain
    to["content"] = chain.transform(to["content"])

# ensure_ascii=False writes raw Unicode, so the file must be opened as UTF-8
with open("MyPackage_patched.json", "w", encoding="utf-8") as f:
    json.dump(pkg, f, separators=(",", ":"), ensure_ascii=False)
Note
The "TRUE###" prefix is fully configurable. If SAS changes their
proprietary marker tomorrow (e.g. to "TRUE|||" or anything else),
you only need to update the YAML strings - no code change required.
This is the whole point of having strip_prefix and prefix as
data-driven YAML options instead of hardcoded constants.
Forward / patch / backward¶
Every transform chain follows the same three-stage pipeline:
INPUT -> [forward primitives] -> decoded data -> [patch] -> patched data -> [backward primitives] -> OUTPUT
A chain that does not declare a patch simply round-trips the data
through forward and backward (useful for verifying the integrity of
the encode/decode pipeline itself). A chain that does not declare a
backward block uses auto-reverse (see below).
Auto-reverse rules¶
When a chain declares only forward: (no backward:), the engine
generates the backward chain by reversing the forward primitives in
order and swapping primitive options where needed:
| Forward primitive | Auto-reverse |
|---|---|
| zlib with skip_bytes | error (cannot guess prepend_bytes) |
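The reversal rule can be sketched on the YAML-shaped step list. This is an illustration only: auto_reverse is a hypothetical function, and the single option swap it performs (json extract becomes wrap) is an assumption derived from the json options described earlier. The real engine may swap other options as well.

```python
def auto_reverse(forward: list) -> list:
    """Build a backward step list from forward steps (hypothetical sketch)."""
    backward = []
    for step in reversed(forward):
        if isinstance(step, str):
            name, options = step, {}
        else:
            # Dict form: a single {primitive_name: options} pair.
            (name, options), = step.items()
            options = dict(options or {})
        if name == "zlib" and "skip_bytes" in options:
            raise ValueError(
                "zlib skip_bytes cannot be auto-reversed: declare backward: with prepend_bytes"
            )
        if name == "json" and "extract" in options:
            # Assumed swap: the forward-only option maps to its backward-only mirror.
            options["wrap"] = options.pop("extract")
        backward.append({name: options} if options else name)
    return backward


reversed_steps = auto_reverse(["base64", "zlib", {"json": {"extract": "a.b"}}])
```

A forward list containing {"zlib": {"skip_bytes": 3}} makes this sketch raise, matching the error case in the table.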
The zlib skip_bytes exception¶
# This chain WILL FAIL at config-load time
chains:
  bad_chain:
    forward:
      - base64
      - zlib:
          skip_bytes: 3   # Cannot auto-reverse: needs explicit prepend_bytes
Fix: declare an explicit backward: block with prepend_bytes:
chains:
  good_chain:
    forward:
      - base64
      - zlib:
          skip_bytes: 3
    backward:
      - zlib:
          prepend_bytes: "4d1504"
      - base64
Patches: replace vs callable¶
A patch: block applies between forward and backward. It supports two
mutually exclusive modes:
Mode 1: replace (string substitution)¶
Simple key/value substitution applied to the decoded data (after the forward chain). Works on strings and on serialized XML.
patch:
  scope: blob   # default - patch decoded data
  replace:
    "https://old-host/": "https://new-host/"
    'library="CASUSER"': 'library="PUBLIC"'
The replace map is applied in dict iteration order. There is no regex
support (use callable for regex needs). The engine enforces a
maximum of 100 entries per replace map, and 4096 chars per key/value.
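Because entries are applied one after another, the order of the map can change the result. The sketch below imitates the documented behavior with plain str.replace; apply_replace is a hypothetical helper, not the kstlib function.

```python
MAX_ENTRIES = 100   # limits quoted above
MAX_LENGTH = 4096


def apply_replace(text: str, replace: dict) -> str:
    """Apply literal substitutions in dict iteration order."""
    if len(replace) > MAX_ENTRIES:
        raise ValueError("too many replace entries")
    for old, new in replace.items():
        if len(old) > MAX_LENGTH or len(new) > MAX_LENGTH:
            raise ValueError("replace key/value too long")
        text = text.replace(old, new)
    return text


xml = '<ds url="https://old-host/cas" library="CASUSER"/>'
patched = apply_replace(xml, {
    "https://old-host/": "https://new-host/",
    'library="CASUSER"': 'library="PUBLIC"',
})

# Order matters when one entry's output matches a later entry's key.
chained = apply_replace("a", {"a": "b", "b": "c"})
unchained = apply_replace("a", {"b": "c", "a": "b"})
```

Here chained ends up as "c" while unchained stays at "b", so overlapping keys deserve a deliberate order.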
Note
Deprecated alias: the field name mapping: is still accepted as an
alias for replace:. Setting mapping: emits a DeprecationWarning and
its value is copied to replace:. Migrate existing configs to
replace:.
Mode 2: callable¶
For complex patches (regex, lookup tables, conditional logic, external state), use a Python callable:
patch:
  scope: blob
  callable: myproject.transforms:patch_function
  args:
    target_host: "{{target_host}}"   # Resolved from chain context
    cas_mapping: "{{cas_mapping}}"   # Resolved from chain context
The callable target follows the module.path:function_name convention.
The function is called as func(decoded_data, **resolved_args) and
must return the patched data.
{{variable}} references in args are resolved against the chain’s
context dict at execution time, allowing dynamic values to be
injected from a pipeline step or any caller.
Important
External callables must be whitelisted in
transforms.security.allowed_callable_modules. A callable whose
module is not in the whitelist raises TransformConfigError at
config-load time, before any transformation runs.
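A callable matching the configuration above could look like the following. This is a sketch under assumptions: the decoded data is taken to be an XML string, cas_mapping is taken to be a dict of old to new library names, and the regular expressions are illustrative rather than taken from a real project.

```python
import re


def patch_function(decoded_data: str, target_host: str, cas_mapping: dict) -> str:
    """Rewrite every https host and remap library attributes via a lookup table."""
    # A function replacement avoids backslash surprises in the substituted text.
    patched = re.sub(
        r"https://[^/\"']+/",
        lambda match: f"https://{target_host}/",
        decoded_data,
    )

    def remap(match: re.Match) -> str:
        name = match.group(1)
        return f'library="{cas_mapping.get(name, name)}"'

    return re.sub(r'library="([^"]+)"', remap, patched)


sample = '<ds url="https://old-host/cas" library="CASUSER"/>'
result = patch_function(sample, target_host="new-host", cas_mapping={"CASUSER": "PUBLIC"})
```

The function receives the decoded data as its first argument and the resolved args as keywords, matching the func(decoded_data, **resolved_args) convention.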
Patch scope: blob | outer | all¶
The scope: field controls where a replace: patch applies.
| Target | scope: blob | scope: outer | scope: all |
|---|---|---|---|
| Decoded data (e.g. BIRD XML) | applied | not applied | applied |
| Outer wrapper (metadata['outer']) | not applied | applied | applied |
| connectors[*].hints.xpath | n/a | PROTECTED, never patched | PROTECTED, never patched |
scope: blob is the default and preserves the historical behavior:
the replace map is applied to the data decoded by the forward chain.
scope: outer and scope: all mutate a JSON wrapper passed by the
caller via the existing metadata= kwarg on chain.transform() /
chain.patch(). The wrapper lives outside the encoded blob:
think of connectors[].uri and connectors[].hints.orig-uri in a
SAS Viya transferObject document. Use these scopes when you need
to patch fields that the forward chain never touches.
chains:
  patch_report_full:
    preset: sas_report
    patch:
      scope: all   # patch BIRD XML AND outer wrapper
      replace:
        'library="CASUSER"': 'library="PUBLIC"'   # BIRD XML form
        'library=CASUSER': 'library=PUBLIC'       # connector URI form
        "https://old-host": "https://new-host"
import json
from kstlib.transform import TransformChain, load_transform_config
config = load_transform_config()
chain = TransformChain.from_config("patch_report_full", config)
wrapper = json.loads(transfer_object_json)
blob = wrapper["content"]
new_blob = chain.transform(blob, metadata={"outer": wrapper})
wrapper["content"] = new_blob
# wrapper is mutated in place by replace_outer_uris
If scope: is outer or all and metadata['outer'] is missing,
the patch raises PatchError at execution time.
Protected outer paths (xpath safety)¶
replace_outer_uris is the helper that powers scope: outer and
scope: all. It walks the wrapper recursively and applies the
replace map to every string value, except strings whose path
matches a PROTECTED_OUTER_PATHS entry.
The default blacklist contains one path:
PROTECTED_OUTER_PATHS = frozenset({
    "connectors[*].hints.xpath",
})
| Path syntax | Meaning |
|---|---|
| key | Match a dict key literally. |
| [*] | Match any list index (wildcard). |
| connectors[*].hints.xpath | Mix of literal keys and list wildcards. |
The connectors[*].hints.xpath blacklist exists because SAS Viya
stores BIRD XPath pointers there. Patching these strings would break
the wrapper-to-content coherence and silently corrupt the report.
Warning
connectors[*].hints.xpath is ALWAYS protected. Even with
scope: all, even with custom replace maps, it is never modified.
You can extend or replace the blacklist by passing a custom
protected_paths to replace_outer_uris directly:
from kstlib.transform import replace_outer_uris
n = replace_outer_uris(
    wrapper,
    {"old": "new"},
    protected_paths=frozenset({"my.field", "items[*].immutable"}),
)
The function returns the number of strings that were modified.
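The walk-and-skip behavior can be sketched as a small recursive function. replace_in_wrapper below is a hypothetical reimplementation for illustration, not the code behind replace_outer_uris; it only shows how a path with [*] wildcards can be built during the walk and checked against a protected set.

```python
def replace_in_wrapper(node, replace: dict, protected: frozenset, path: str = "") -> int:
    """Apply the replace map to every unprotected string; return how many changed."""
    if isinstance(node, dict):
        children = [(key, f"{path}.{key}" if path else key) for key in node]
    elif isinstance(node, list):
        # Every list index maps to the same wildcard segment.
        children = [(index, f"{path}[*]") for index in range(len(node))]
    else:
        return 0
    changed = 0
    for key, child_path in children:
        value = node[key]
        if isinstance(value, str):
            if child_path in protected:
                continue
            patched = value
            for old, new in replace.items():
                patched = patched.replace(old, new)
            if patched != value:
                node[key] = patched
                changed += 1
        else:
            changed += replace_in_wrapper(value, replace, protected, child_path)
    return changed


wrapper = {"connectors": [{
    "uri": "https://old-host/a",
    "hints": {"xpath": "//ref[@host='old-host']", "orig-uri": "https://old-host/b"},
}]}
count = replace_in_wrapper(
    wrapper, {"old-host": "new-host"}, frozenset({"connectors[*].hints.xpath"})
)
```

In this run two strings are rewritten while the xpath hint keeps its original value.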
Composed patches: surgical multi-object workflows¶
Plain patch: applies to every object in a workflow. When a
package contains many objects (e.g. 200 reports) and each needs a
different replace map, use composed_patch: instead.
A composed patch references other chains by name and applies their
patch block conditionally. Two layers exist:
- global_patches: applied to every object regardless of metadata
- targeted_patches: applied only when the object metadata matches a filter
chains:
  # "Patch-only" reusable building blocks (no forward/backward)
  remap_host:
    patch:
      replace:
        "https://source.res.private/": "https://target.res.private/"
  remap_caslib_global:
    patch:
      replace:
        'library="CASUSER"': 'library="PROD_GLOBAL_LIB"'
  remap_caslib_r220:
    patch:
      replace:
        'library="CASUSER"': 'library="R220_DEDICATED_LIB"'
  # The orchestrator: inherits forward/backward from sas_report,
  # composes the building blocks via global_patches + targeted_patches
  patch_report_composed:
    preset: sas_report
    global_patches:
      - remap_host            # Applied to EVERY object
    targeted_patches:
      - filter:
          content_type: report
          name: "R220_*"
        patches:
          - remap_caslib_r220
      - filter:
          content_type: report
          name: "*"           # Fallback for other reports
        patches:
          - remap_caslib_global
Cascade: last applied wins (inverse of CSS)¶
Warning
Cascade is by declaration order, NOT by filter specificity. Unlike CSS, where a more specific selector beats a later, less specific one, specificity plays no role here.
Order your targeted_patches from most general to most specific.
The last applied patch overwrites earlier ones on conflict.
Concrete example with the config above:
from kstlib.transform import transform

# Object 1: matches both R220_* and the "*" fallback
result_r220 = transform(
    blob_b64,
    "patch_report_composed",
    metadata={"content_type": "report", "name": "R220_ASTRO"},
)
# Apply order: remap_host -> remap_caslib_r220 -> remap_caslib_global
# Final caslib: PROD_GLOBAL_LIB (the wildcard fallback wins because
# it is declared LAST in targeted_patches)

# Object 2: matches only the "*" fallback
result_orion = transform(
    blob_b64,
    "patch_report_composed",
    metadata={"content_type": "report", "name": "ORION_FOO"},
)
# Apply order: remap_host -> remap_caslib_global
# Final caslib: PROD_GLOBAL_LIB
If you want R220 reports to keep their dedicated caslib, declare the specific filter LAST so it wins:
targeted_patches:
  - filter: {name: "*"}        # General first (will be overridden)
    patches: [remap_caslib_global]
  - filter: {name: "R220_*"}   # Specific last (final winner for R220)
    patches: [remap_caslib_r220]
This ordering convention is intentional: it mirrors the kstlib config
cascade philosophy (kwargs > user config > preset > defaults) where
the most explicit override always wins.
Filter syntax¶
A filter combines two fields, ANDed together:
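| Field | Type | Default | Notes |
|---|---|---|---|
| content_type | string | | Exact match against the content_type entry of the metadata dict. |
| name | string (glob) | | Glob match against the name entry of the metadata dict. |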
- filter:
    content_type: report   # Exact: must equal "report"
    name: "R220_*"         # Glob: prefix match
  patches: [remap_caslib_r220]
The metadata dict is provided by the caller via the metadata=
keyword argument on transform() or chain.transform(). The transform
engine never reads metadata from the data itself: the caller is
responsible for extracting content_type and name from whatever
schema they use (e.g. transferDetails[].transferObject.summary in
the SAS Viya Transfer API).
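The selection logic can be sketched with the stdlib fnmatch module. matches and select_patches are hypothetical helpers; treating a missing filter field as a "*" wildcard is an assumption. The sketch only computes which patch chains apply and in what order, not how their replace maps are combined.

```python
from fnmatch import fnmatchcase


def matches(filter_: dict, metadata: dict) -> bool:
    """AND of an exact content_type match and a glob match on name."""
    content_type = filter_.get("content_type", "*")  # assumed default
    name_pattern = filter_.get("name", "*")          # assumed default
    type_ok = content_type == "*" or metadata.get("content_type") == content_type
    name_ok = fnmatchcase(str(metadata.get("name", "")), name_pattern)
    return type_ok and name_ok


def select_patches(chain: dict, metadata: dict) -> list:
    """Global patches first, then matching targeted patches in declaration order."""
    selected = list(chain.get("global_patches", []))
    for entry in chain.get("targeted_patches", []):
        if matches(entry.get("filter", {}), metadata):
            selected.extend(entry["patches"])
    return selected


composed = {
    "global_patches": ["remap_host"],
    "targeted_patches": [
        {"filter": {"content_type": "report", "name": "R220_*"},
         "patches": ["remap_caslib_r220"]},
        {"filter": {"content_type": "report", "name": "*"},
         "patches": ["remap_caslib_global"]},
    ],
}
order_r220 = select_patches(composed, {"content_type": "report", "name": "R220_ASTRO"})
order_orion = select_patches(composed, {"content_type": "report", "name": "ORION_FOO"})
```

The two resulting lists reproduce the apply orders given in the cascade example above.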
Note
A targeted_patches entry with no filter (or with all wildcards)
matches every object and behaves like an additional global_patches
entry. The advantage of using targeted_patches with a "*" filter
over global_patches is ordering control: targeted_patches always
run AFTER global_patches.
Patch-only chains¶
A chain that has only patch (no forward, no preset) is a
“patch-only” chain. It exists solely to be referenced from another
chain’s composed_patch.global_patches or
composed_patch.targeted_patches[*].patches.
chains:
  remap_host:
    # No forward, no preset, no backward. Just a patch building block.
    patch:
      replace:
        "https://source/": "https://target/"
Calling TransformChain.transform() directly on a patch-only chain
works as identity (forward() and backward() are no-ops because
the primitive list is empty), but it is not the intended use case.
Security¶
Callable whitelist¶
transforms:
  security:
    allowed_callable_modules:
      - myproject.transforms
      - myproject.viya.patches
Only callables whose module path matches an entry in this list (or is a sub-module of one) are allowed. The default is an empty whitelist, which means no external callables can be invoked.
A callable target whose module is not whitelisted raises
TransformConfigError at config-load time, before any transformation
runs. Replace-based patches are not affected by the whitelist.
Hard limits¶
| Parameter | Default | Hard Max |
|---|---|---|
| Input data size | 100 MB | 100 MB |
| JSON payload size | 50 MB | 50 MB |
| XML payload size | 50 MB | 50 MB |
| Decompressed size | 200 MB | 200 MB |
| Decompression ratio | 100x | 100x |
| Replace entries per patch | 100 | 100 |
| Replace key/value length | 4096 chars | 4096 chars |
| Named chains | 50 | 50 |
| Forward / backward chain length | 20 primitives | 20 primitives |
| Global patches per composition | 10 | 10 |
| Targeted patches per composition | 50 | 50 |
| Patches per targeted entry | 10 | 10 |
| Glob pattern length | 256 chars | 256 chars |
Zlib bomb protection¶
The zlib_decompress primitive enforces both an absolute decompressed
size limit (200 MB) and a maximum decompression ratio (100x). A zlib
stream that expands beyond either threshold raises DecompressError
immediately, before allocating the full output buffer.
XML security¶
The xml_parse primitive uses defusedxml if available (recommended).
DOCTYPE declarations are rejected by default to prevent XXE attacks
and billion-laughs entity expansion.
If defusedxml is not installed, the engine falls back to the stdlib
xml.etree.ElementTree parser with explicit DOCTYPE rejection layered
on top.
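The fallback strategy can be sketched as follows. parse_xml is a hypothetical wrapper, and the plain-text DOCTYPE check is a deliberately conservative stand-in for whatever rejection logic the engine layers on top of the parser.

```python
import re
import xml.etree.ElementTree as StdET

try:
    # Preferred parser when the optional dependency is installed.
    from defusedxml import ElementTree as SafeET
except ImportError:
    SafeET = None

_DOCTYPE = re.compile(r"<!DOCTYPE", re.IGNORECASE)


def parse_xml(text: str):
    """Parse XML, refusing any document that carries a DOCTYPE declaration."""
    # Conservative check: also rejects a DOCTYPE that appears inside a comment or CDATA.
    if _DOCTYPE.search(text):
        raise ValueError("DOCTYPE declarations are rejected")
    parser = SafeET if SafeET is not None else StdET
    return parser.fromstring(text)


root = parse_xml("<report><section/></report>")
laughs = '<!DOCTYPE lolz [<!ENTITY lol "lol">]><lolz>&lol;</lolz>'
```

Passing laughs to parse_xml raises ValueError before any entity is expanded, with or without defusedxml installed.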
YAML config reference¶
Complete schema with comments. All fields are optional unless marked required:
transforms:
  # Security: callable whitelist (default: empty = no callables allowed)
  security:
    allowed_callable_modules:
      - myproject.transforms        # Module path prefix
  # Named chain definitions
  chains:
    # Example 1: full chain with explicit forward + backward + patch
    my_chain:                       # **required** - chain name
      forward:                      # **required** unless preset is set
        - base64                    # Primitive name (string form)
        - zlib:                     # Primitive with options (dict form)
            skip_bytes: 3
        - json:
            extract: "path.to.field"
      backward:                     # Optional (auto-reversed if absent)
        - json:
            wrap: "path.to.field"
        - zlib:
            prepend_bytes: "4d1504"
        - base64
      patch:                        # Optional (no patching if absent)
        scope: blob                 # blob | outer | all (default: blob)
        replace:                    # Mutually exclusive with callable
          "old": "new"
        callable: mod.path:fn       # Mutually exclusive with replace
        args:
          key: "{{var}}"            # Resolved from chain context
    # Example 2: chain that inherits a preset
    my_usage:
      preset: my_chain              # Mutually exclusive with forward
      patch:                        # Override the preset's patch
        scope: blob
        replace:
          "foo": "bar"
    # Example 3: chain with composed patch (mutually exclusive with patch)
    my_composed:
      preset: my_chain
      global_patches:               # List of chain names
        - other_chain_a
      targeted_patches:             # Conditional patches
        - filter:
            content_type: report
            name: "R220_*"
          patches:
            - other_chain_b
Python API¶
Convenience function¶
from kstlib.transform import transform
# Loads config from kstlib.conf.yml automatically
result = transform(blob_b64, "patch_report")
# With metadata for composed_patch filter matching
result = transform(
    blob_b64,
    "patch_report_composed",
    metadata={"content_type": "report", "name": "R220_ASTRO"},
)
Client instance¶
from kstlib.transform import TransformChain, load_transform_config
config = load_transform_config()
chain = TransformChain.from_config("patch_report", config)
# Forward only
decoded = chain.forward(blob_b64)
# Patch only (operates on the already-decoded data)
patched = chain.patch(decoded)
# Backward only (must be called after forward to restore envelopes)
re_encoded = chain.backward(patched)
# Full round-trip
result = chain.transform(blob_b64)
Programmatic construction¶
from kstlib.transform import (
    TransformChain,
    TransformChainConfig,
    PrimitiveConfig,
    PatchConfig,
)

chain = TransformChain(
    TransformChainConfig(
        name="my_chain",
        forward=(
            PrimitiveConfig(name="base64"),
            PrimitiveConfig(name="zlib"),
            PrimitiveConfig(name="json"),
        ),
        patch=PatchConfig(replace={"foo": "bar"}),
    )
)
result = chain.transform(blob_b64_string)
Integration with kstlib.pipeline¶
The transform engine integrates cleanly with kstlib.pipeline via the
CallableStep. A pipeline step can invoke kstlib.transform.transform
directly:
pipelines:
  patch-and-upload:
    steps:
      - name: load
        type: shell
        command: "kstlib rapi download --out blob.json"
      - name: patch
        type: callable
        callable: kstlib.transform:transform
        args:
          - "{{blob_b64}}"     # Loaded from previous step
          - "patch_report"     # Chain name
      - name: upload
        type: shell
        command: "kstlib rapi upload --body @result.json"
For composed patches, pass metadata= as a kwarg via the callable
args (the syntax depends on your pipeline step setup).
Examples¶
See examples/transform/ for runnable demos:
- 01_round_trip.py: programmatic chain construction with synthetic data, full forward + patch + backward, integrity verification
- 02_config_driven.py: load a chain from kstlib.conf.yml and apply it via TransformChain.from_config()
- 03_composed_patch.py: 3 synthetic objects (R220_foo, ORION_bar, OTHER_baz) demonstrating the global + targeted cascade with explicit before/after output, and scope: all mutating the outer wrapper for the R220 case
- 04_outer_patch.py: the three scope: values (blob, outer, all) exercised in isolation on a synthetic SAS Viya transferObject, plus the replace_outer_uris standalone helper and a custom protected_paths blacklist
Tip
For complete API documentation including all classes, functions, and exceptions, see Bidirectional Data Transformation.