DMC on Snowflake/Snowpark (Beta)
Known Limitations
- DMC uses a separate file-based datastore (on a block storage volume) as the primary backend.
This datastore holds both DMC metadata and the primary copy of the user data.
- User data changes are synchronized with a small delay (~10 seconds by default) into the Snowflake schema, but there is no synchronization from Snowflake back into the file-based datastore of DMC.
- Data Quality Rules & Commands are executed against the Snowflake schema once synchronization is completed.
INFO
This means they do not have access to the DMC metadata!
- Filtering records by Data Quality results is currently not possible if the Data Quality Rule references non-DMC tables. While Data Quality Rules are executed on the Snowflake side, querying records filtered by Data Quality results happens through the file-based datastore, which only has access to the DMC tables.
Setup Preparations
To perform the setup preparations, both the Snowflake CLI and either the docker binary or Google's crane (part of the "go-containerregistry" toolset) are required.
Account Settings
Snowflake image registries and block storage volumes do not support Tri-Secret Secure and periodic re-keying; an account-level opt-out from these features is required in order to use image repositories and block storage volumes.
alter account set ENABLE_TRI_SECRET_AND_REKEY_OPT_OUT_FOR_IMAGE_REPOSITORY=TRUE;
alter account set ENABLE_TRI_SECRET_AND_REKEY_OPT_OUT_FOR_SPCS_BLOCK_STORAGE=TRUE;
Snowflake Role
The "Snowpark Container Services" instance should use a dedicated role.
This role needs USAGE on a warehouse as well as the account-level BIND SERVICE ENDPOINT privilege.
-- Create Role DMC_SNOWPARK_ROLE (!! SECURITYADMIN RIGHTS !!)
use role SECURITYADMIN;
create role if not exists DMC_SNOWPARK_ROLE;
grant role DMC_SNOWPARK_ROLE to role SYSADMIN;
grant USAGE on warehouse WH_XSMALL to role DMC_SNOWPARK_ROLE;
grant BIND SERVICE ENDPOINT on ACCOUNT to role DMC_SNOWPARK_ROLE;
Snowpark Compute Pool
To actually run the container, a "compute pool" is required.
It has to be created with appropriate privileges, but its ownership can and should be transferred to the DMC_SNOWPARK_ROLE role.
-- Compute Pool DMC_SNOWPARK_POOL (!! ACCOUNTADMIN RIGHTS !!)
use role ACCOUNTADMIN;
create compute pool if not exists DMC_SNOWPARK_POOL
min_nodes = 1
max_nodes = 1
instance_family = CPU_X64_XS
auto_resume = true
initially_suspended = true
auto_suspend_secs = 600
;
grant OWNERSHIP on compute pool DMC_SNOWPARK_POOL to role DMC_SNOWPARK_ROLE;
The auto_suspend_secs setting should be set to a value on the order of minutes to ensure that, when the service is suspended in order to restart it, the compute pool is not suspended as well before the service is resumed.
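If the pool was created with a different timeout, auto_suspend_secs can be adjusted afterwards. A minimal sketch, using the pool name from above:

```sql
-- Set the compute pool's auto-suspend timeout to 10 minutes
alter compute pool DMC_SNOWPARK_POOL set auto_suspend_secs = 600;
```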
DMC Database / Schema
All DMC content is hosted in a dedicated database & schema, which will also be used as a target for mirroring the data from DMC into Snowflake.
-- CREATE DMC DATABASE
create database if not exists DMC_SNOWPARK_DB;
grant OWNERSHIP on database DMC_SNOWPARK_DB to role DMC_SNOWPARK_ROLE;
use role DMC_SNOWPARK_ROLE;
use database DMC_SNOWPARK_DB;
-- CREATE DMC SCHEMA
create schema if not exists DMC_SNOWPARK_DB.DMC;
use schema DMC_SNOWPARK_DB.DMC;
Stage for Volume
A Snowflake stage can be used to make logs accessible externally and to extract the internal database files.
create stage if not exists DMC_SNOWPARK_DB.DMC.DMC_STAGE
directory = ( enable = true )
encryption = ( type = 'SNOWFLAKE_SSE' )
;
Note that the stage must not be used to actually store the internal database files due to limitations imposed by Snowflake!
Snowpark SSO
Snowpark Container Services offers built-in SSO authentication capabilities. This enables DMC to recognize the currently logged-in Snowflake user.
To enable it, the instance must be configured with DMC_AUTH=snowpark.
However, this will only enable authentication (recognizing the user).
Authorization
It is also possible to enable authorization (assignment to DMC user groups/roles) based on Snowflake roles.
To enable this, a dedicated SNOWPARK_USER_ROLES view is required that gives DMC access to relevant
Snowflake roles, based on data from the SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_USERS system view.
Due to Snowflake's security model, this view needs to be owned by a role that is allowed to actually access
the role assignments in the SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_USERS view (usually ACCOUNTADMIN).
It is recommended to limit the SNOWPARK_USER_ROLES view to the roles which are actually relevant to DMC,
e.g. by using a LIKE filter based on role naming conventions.
More complex scenarios using a "root role" and recursive queries are also possible.
-- SNOWPARK_USER_ROLES View (!! ACCOUNTADMIN RIGHTS !!)
use role ACCOUNTADMIN;
create or replace view DMC_SNOWPARK_DB.DMC.SNOWPARK_USER_ROLES
copy grants
as
select GRANTEE_NAME as USER,
ROLE as ROLE
from SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_USERS
where GRANTED_TO = 'USER'
and ROLE like 'DMC%'
;
grant SELECT
on view DMC_SNOWPARK_DB.DMC.SNOWPARK_USER_ROLES
to role DMC_SNOWPARK_ROLE
;
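The "root role" scenario mentioned above could be sketched roughly as follows. This is a sketch only: the role name DMC_ROOT_ROLE is an assumption, and the column semantics of SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES should be verified before use.

```sql
-- Sketch: expand a "root role" hierarchy via a recursive CTE
-- (DMC_ROOT_ROLE is an illustrative name, not part of the standard setup)
use role ACCOUNTADMIN;
create or replace view DMC_SNOWPARK_DB.DMC.SNOWPARK_USER_ROLES
copy grants
as
with recursive ROLE_TREE (ROLE) as (
    -- start at the root role...
    select 'DMC_ROOT_ROLE'
    union all
    -- ...and descend to all roles granted to a role already in the tree
    select g.NAME
    from SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES g
    join ROLE_TREE t on g.GRANTEE_NAME = t.ROLE
    where g.GRANTED_ON = 'ROLE'
      and g.PRIVILEGE = 'USAGE'
      and g.DELETED_ON is null
)
select u.GRANTEE_NAME as USER,
       u.ROLE as ROLE
from SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_USERS u
where u.GRANTED_TO = 'USER'
  and u.ROLE in (select ROLE from ROLE_TREE)
;
```

Note that a recursive variant like this can traverse cycles in the role graph; keep the hierarchy beneath the root role acyclic, or add a depth limit.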
For DMC to use this view to determine the authorizations/roles
of users logged in via Snowpark SSO, set the following configuration
parameter (in addition to DMC_AUTH=snowpark):
DMC_AUTH_SNOWPARK_SSO_USER_ROLES_VIEW=SNOWPARK_USER_ROLES
Please note that inside DMC, all roles retrieved from Snowflake will have the sf: prefix and need
to be mapped to DMC-internal roles, including the admin role.
| Snowflake Role | DMC ext. Role | DMC int. Role |
|---|---|---|
| DMC_ROLE_USERS | sf:DMC_ROLE_USERS | users |
| DMC_ROLE_ADMIN | sf:DMC_ROLE_ADMIN | admin |
This ensures there are no collisions between groups if multiple DMC authentication modes are used concurrently.
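In the service specification, such a mapping is configured via the DMC_ROLE_MAPPING setting. A sketch using the role names from the table above (the Snowflake role names are examples, not defaults):

```yaml
# Map Snowflake-derived (sf:) external roles to DMC-internal roles
DMC_ROLE_MAPPING: '{"sf:DMC_ROLE_USERS":["users"],"sf:DMC_ROLE_ADMIN":["admin"]}'
```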
Snowpark Image Registry & Container Image
Container Images used by Snowpark Container Services must be pushed into a Snowflake Image Repository:
create image repository DMC_SNOWPARK_DB.DMC.IMAGE_REPOSITORY;
-- Get the repository URL from the image registry
show image repositories;
The recommended way to transfer the images to the Snowflake Image Repository is to use Google's crane binary.
# get Snowflake CLI from https://docs.snowflake.com/en/developer-guide/snowflake-cli/installation/installation
# get Crane from https://github.com/google/go-containerregistry/releases/tag/v0.20.2
# login to INFOMOTION Image Registry
crane auth login -u {user} -p {password} infomotiondmc.azurecr.io
# login to Snowflake Image Registry
snow spcs image-registry token --connection MY_CONNECTION_WITH_SSO --format=JSON | crane auth login xxx.registry.snowflakecomputing.com --username 0sessiontoken --password-stdin
# copy image
crane copy infomotiondmc.azurecr.io/dmc:snowflake-beta xxx.registry.snowflakecomputing.com/dmc_snowpark_db/dmc/image_repository/dmc:snowflake-beta
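Alternatively, the docker binary can be used instead of crane. A sketch with the same registry URLs as above; verify the exact login subcommand for your Snowflake CLI version:

```shell
# Alternative using the docker binary instead of crane
docker login infomotiondmc.azurecr.io -u {user}
docker pull infomotiondmc.azurecr.io/dmc:snowflake-beta
docker tag infomotiondmc.azurecr.io/dmc:snowflake-beta \
    xxx.registry.snowflakecomputing.com/dmc_snowpark_db/dmc/image_repository/dmc:snowflake-beta
snow spcs image-registry login --connection MY_CONNECTION_WITH_SSO
docker push xxx.registry.snowflakecomputing.com/dmc_snowpark_db/dmc/image_repository/dmc:snowflake-beta
```

Note that docker pulls the image locally before pushing, whereas crane copies directly between registries.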
Service Setup
Now we can create the actual container service for DMC.
Important to note:
- The previously created compute pool is used.
- The container uses a block storage volume to store its internal (file-based) database, which holds a copy of the user data and the DMC metadata. A stage-based volume can be used, but can lead to database corruption in certain scenarios due to the limitations on stage-based volumes.
- The block storage size is set to 1 gigabyte. Please adjust accordingly, as resizing is currently not supported! It is, however, possible to extract the data into a file on a stage-based volume and transfer it to a new instance using the DMC database console:
  script SIMPLE to '/dmc/logs/dmc.sql' schema DMC;
- Initially, the built-in admin user (DMC_ADMIN_PASSWORD) must be used to configure the role mapping. It can (and should) be disabled afterwards, since it is not recommended for production instances.
- "Auto Resume" is disabled, as it can cause interference with the file-based metadata store.
- Multiple instances are currently not supported.
create service DMC
    in compute pool DMC_SNOWPARK_POOL
    from specification $$
spec:
  containers:
  - name: dmc
    image: /DMC_SNOWPARK_DB/DMC/IMAGE_REPOSITORY/dmc@sha256:xxxxxxxxx...xxxxxxxxxx
    env:
      # Choose warehouse to use
      SNOWFLAKE_WAREHOUSE: WH_XSMALL
      # License
      DMC_LICENSE: "..."
      # Map the (Snowflake) "DMC_DEMO_ADMIN" role to the (DMC) "admin" role
      DMC_ROLE_MAPPING: '{"sf:DMC_DEMO_ADMIN":["admin"]}'
      # Choose a non-empty password for additional protection of the admin database console
      DMC_DB_PASSWORD: ""
      # Disable the administrator's DB console, if desired (not recommended)
      DMC_ENABLE_DB_CONSOLE: true
      # Enable demo content, if desired
      DMC_IMPORT_DEMO_CONTENT: false
      # Reduce (or increase) the maximum replication delay in milliseconds, if desired
      DMC_SNOWFLAKE_MAX_DELAY: 10000
      #######################
      # Static settings
      #######################
      SERVER_PORT: 8080
      SNOWFLAKE_QUERY_TAG: DMC
      DMC_SCHEMA_UPDATE: true
      DMC_SCHEMA_CONTEXTS: "schema,baseContent"
      DMC_DB_JDBC_URL: jdbc:h2:file:/dmc/data/dmc.db;INIT=CREATE SCHEMA IF NOT EXISTS dmc
      DMC_DB_DRIVER: org.h2.Driver
      DMC_DB_SCHEMA: dmc
      DMC_DB_TYPE: h2
      DMC_DB_USER: dmc
      DMC_SNOWFLAKE_REPLICATION: true
      DMC_AUTH: snowpark
      DMC_AUTH_SNOWPARK_SSO_USER_ROLES_VIEW: SNOWPARK_USER_ROLES
      DMC_LOG_BUFFER_SIZE: 10000
    volumeMounts:
    - name: data
      mountPath: /dmc/data
    - name: stage
      mountPath: /dmc/stage
  endpoints:
  - name: dmc
    port: 8080
    public: true
    protocol: HTTP
  volumes:
  - name: data
    source: block
    size: 1Gi
  - name: stage
    source: "@dmc_stage"
$$
AUTO_RESUME=FALSE
MIN_INSTANCES=1
MAX_INSTANCES=1
;
As outlined above, there are some limitations in relation to stage-based volumes.
Due to this, we do not recommend using stage-based volumes, either for logging or for the DMC data.
Since access to files stored on a block storage volume is currently not possible, using block storage for logging is of limited use as well.
Instead, we recommend using console/STDOUT logging and accessing the logs either via Snowflake or via the actuator/log
endpoint. This, however, also has some limitations (e.g. the number of log entries available via the Actuator endpoint
is limited by the DMC_LOG_BUFFER_SIZE setting).
Granting Access
Users can be granted access to the Snowpark-hosted DMC service with the following Snowflake privileges:
GRANT USAGE ON DATABASE DMC_SNOWPARK_DB TO ROLE DMC_USERS;
GRANT USAGE ON SCHEMA DMC_SNOWPARK_DB.DMC TO ROLE DMC_USERS;
-- Also grant the service's ALL_ENDPOINTS_USAGE role
GRANT SERVICE ROLE DMC!ALL_ENDPOINTS_USAGE TO ROLE DMC_USERS;
Monitoring & Operations
To monitor the compute pool and the service, the following statements can be used.
The endpoint URL in particular is needed to access DMC.
-- Describe Compute Pool
describe compute pool DMC_SNOWPARK_POOL;
-- Describe Service
describe service DMC;
-- Access service logs
CALL SYSTEM$GET_SERVICE_LOGS('dmc', '0', 'dmc', '500');
-- Show service endpoint URL for web access
show endpoints in service DMC;
The service can be suspended which will also allow the compute pool to auto-suspend, eliminating the ongoing costs otherwise generated by continuously running the compute pool.
-- Suspend/Resume service
alter service DMC suspend;
alter service DMC resume;
The service can be updated/changed at any time. This includes switching to a different image (either by tag or by sha256 hash).
alter service DMC
    from specification $$
spec:
  containers:
  - name: dmc
    image: /DMC_SNOWPARK_DB/DMC/IMAGE_REPOSITORY/dmc@sha256:xxxxxxxxx...xxxxxxxxxx
    # ...
$$
;
Automatic Suspend/Resume
Scheduled tasks can be used to automatically suspend/resume the DMC instance at appropriate times to avoid unnecessary costs:
create task TSK_DMC_SUSPEND
warehouse = WH_XSMALL
schedule = 'USING CRON 0 18 * * * Europe/Berlin'
as
alter service DMC suspend
;
alter task TSK_DMC_SUSPEND resume;
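The suspend task above only covers one direction; a matching resume task can be sketched analogously (the schedule is illustrative):

```sql
-- Resume the service each workday morning
create task TSK_DMC_RESUME
    warehouse = WH_XSMALL
    schedule = 'USING CRON 0 7 * * 1-5 Europe/Berlin'
as
    alter service DMC resume
;
alter task TSK_DMC_RESUME resume;
```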
Backup of internal DMC data
When using Snowflake's Block Storage Volumes, there are two ways of creating a backup:
- Using Snowflake's CREATE SNAPSHOT command. The snapshot can be restored within the same service instance and can also be used to initialize a volume on a different service instance.
- Dumping the (internal) database contents to a SQL file on the stage-based volume:
  - Access the database console of your DMC instance.
  - Connect to the embedded database with the credentials specified in the service configuration (if not otherwise configured, the database username and password will both be empty).
  - Execute the following command:
    SCRIPT TO '/dmc/stage/dmc.sql'
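The snapshot approach can be sketched as follows. The snapshot name is illustrative; the volume name matches the service specification above:

```sql
-- Create a snapshot of the service's block storage volume "data"
create snapshot DMC_DATA_SNAPSHOT
    from service DMC
    volume "data"
    instance 0;
-- List existing snapshots
show snapshots;
```

A new service can then reference such a snapshot as the initial contents of its block volume in the service specification.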