Data Config

Configure your training data, relationships, field parameters, privacy options, and additional constraints.

Select Your Source Table(s):

Here you will add tables to your project, one at a time.
Use the dropdowns to select a database, schema, and table. Remember: you must have run the correct grants before you can add tables or other objects to a project. If you do not see your source data here, please see Object Privileges.
Click the '+' icon to the left of the dropdowns to add the table to your project.
Once you have added your tables, it's time to configure each table. Click 'Config' to do this.
You can also click the 'x' button to the left of a table you've added to remove it.

Field Details:

First you will configure each field in your table on the 'Field Details' tab.
We recommend starting with the "Type" column. Set all of your ID fields to 'ID'.
Check that all your text fields are set to 'Categorical'.
Remove any fields that you do not want to include in the model by clicking the 'X' icon to the left of the field.
Select one primary key for each table, if one is available, using the P-Key checkbox. The primary key's type must be set to 'ID'. This is especially important when working with multiple tables.
Click the 'Anonymize' checkbox for any fields that contain PII. Anonymized fields are excluded from training, and their values are generated from scratch.
Adjust the 'Format' field where needed. It is available only for ID and datetime fields, and it accepts regex and strftime formats.
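
To make the two kinds of 'Format' value concrete, here is a minimal Python sketch. It assumes that a regex describes the shape of an ID field and that a strftime pattern describes how a datetime field is rendered. The specific patterns and helper names below are hypothetical examples, not values taken from the app.

```python
import re
from datetime import datetime

# Hypothetical format strings, as they might be entered in the 'Format' column.
ID_FORMAT = r"CUST-\d{6}"           # regex for an ID field (assumed example)
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"   # strftime pattern for a datetime field (assumed example)


def id_matches(value: str, pattern: str = ID_FORMAT) -> bool:
    """Return True if the whole ID string matches the regex."""
    return re.fullmatch(pattern, value) is not None


def render_datetime(value: datetime, fmt: str = DATE_FORMAT) -> str:
    """Render a datetime using the strftime pattern."""
    return value.strftime(fmt)


def parse_datetime(text: str, fmt: str = DATE_FORMAT) -> datetime:
    """Parse a string that was written with the same pattern."""
    return datetime.strptime(text, fmt)
```

Trying a few sample values from your source table against the pattern in this way is a quick sanity check before you enter it in the app.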

Anonymize:

Note: The anonymize checkbox is only available for categorical fields since continuous fields (dates, numeric fields) are not generated with discrete values from the source data.

When you select Anonymize for a field, you must also select the type of output that will replace the values in that column. For example, selecting 'name' as the type (under the 'person' category) randomly generates full names to populate the field.

Set the anonymization options for each field that contains sensitive data.
First, select the Category and Type of anonymization.
Random values of that type are then generated using the Faker library. For example, if a field contains full names, select 'person' as the Category and 'name' as the Type.
Some types support localization and extra parameters. When this is the case, you will see a dropdown for Locales and a text box for the extra parameters. See tooltips for more details.

Tip: Model training takes longer as the number of categorical fields grows and as the number of distinct values in each categorical field grows. Be mindful of how many categorical fields each table in the model contains. We recommend starting with a small number of categorical fields before moving on to different models and more categorical fields or records. This restriction will be lifted in future versions of the app.
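
One way to act on this tip is to count the distinct values in each categorical field before training. The sketch below does that over rows held as plain dictionaries. The function name and the sample rows are hypothetical, and the app does not prescribe any particular threshold.

```python
from collections import defaultdict


def distinct_counts(rows, categorical_fields):
    """Count the distinct values seen in each categorical field."""
    seen = defaultdict(set)
    for row in rows:
        for field in categorical_fields:
            seen[field].add(row[field])
    return {field: len(seen[field]) for field in categorical_fields}


# Hypothetical sample rows from a source table.
sample_rows = [
    {"country": "US", "plan": "free"},
    {"country": "DE", "plan": "free"},
    {"country": "US", "plan": "pro"},
]
counts = distinct_counts(sample_rows, ["country", "plan"])
```

Fields with very high counts are natural candidates to remove, or to leave for a later run, when you are starting small.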

Constraints:

Constraints are rules that the synthetic data must follow. For example, you can set a constraint that the 'age' field must be greater than 18. You would do this by selecting 'Scalar Inequality', then selecting the age field from the dropdown, and finally entering 18 into the input box.
Don't forget to click the '+' button to add the constraint to the table.
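
The 'age greater than 18' example can be read as a simple rule applied to every row. The sketch below shows one way to check such a rule against a set of rows. It illustrates the idea only and is not the app's implementation. The function name and operator table are assumptions.

```python
import operator

# Comparison operators a scalar inequality might use (assumed set).
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def rows_violating(rows, field, op, scalar):
    """Return the rows where `row[field] <op> scalar` does not hold."""
    compare = OPERATORS[op]
    return [row for row in rows if not compare(row[field], scalar)]


# Hypothetical rows, checked against the constraint age > 18.
people = [{"age": 25}, {"age": 18}, {"age": 40}]
violations = rows_violating(people, "age", ">", 18)
```

Here the row with age 18 is reported, because 'greater than 18' excludes 18 itself.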

Relationships:

When working with multiple tables, you must configure the relationships between them after you have configured and saved your tables and Field Details (IDs).
Select the parent and child tables and their respective primary key (PK) and foreign key (FK) fields.
The parent table is the one whose primary key is referenced by the child table's foreign key. In other words, the parent table corresponds to the dimension table, and the child table corresponds to the fact table.
Don't forget to click the '+' button to add the relationship to the project.
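
Before adding a relationship, it can help to confirm that every foreign key value in the child table actually appears as a primary key in the parent table. The sketch below performs that check on rows held as dictionaries. The table and column names are hypothetical.

```python
def orphaned_rows(parent_rows, child_rows, pk, fk):
    """Return child rows whose foreign key has no matching parent primary key."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


# Hypothetical parent (dimension) and child (fact) tables.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # refers to a customer that does not exist
]
orphans = orphaned_rows(customers, orders, pk="customer_id", fk="customer_id")
```

An empty result means the chosen PK and FK fields line up. Any rows returned point to child records that have no parent.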

