
Data Management

Idea

The Data Management APIs handle the entire workflow of data registration and preparation. Data Contextualization provides a simple way to prepare data for establishing semantic correlations and for data query processing. Using Data Contextualization involves several stages, from data registration and custom data type definition to data ingest and schema search.

For more information about these stages, refer to Basics.

Access

To access this service, you need the respective roles listed in Data Contextualization roles and scopes.

Application users can access the REST APIs using a REST client. Depending on the API, users need different roles to access the Data Query Service.

Note

Access to the Data Query Service APIs is protected by Insights Hub authentication methods, using OAuth credentials.

Basics

Data Registration

A data scientist or analyst decides the data source and categorization of the data, the data tag name, the data upload strategy (replace/append) and the file type (JSON, CSV, Parquet and/or XML). Once these decisions are made, the Data Registration APIs can be used to create the registry.
Data Registration APIs are used to organize the incoming data. When configuring a data registry, you can update your data based on a replace or append strategy. During each data ingest operation, the replace strategy replaces the existing schema and data, whereas the append strategy updates the existing schema and data. For example, if the schema changes and the incoming data files are completely different every time, you can use the replace strategy.

By default, the append strategy updates the schema when the input file causes a schema change. Users can control this behavior by using the frozen schema flag. Refer to Frozen Schema under the Data Registration feature to understand this mode.

Custom Data Types

By default, Data Contextualization identifies basic data types for each property, such as String, Integer, Float and Date. Once the data source and data type are identified, the user can provide custom data types with a regex pattern so that Data Contextualization can apply that type during schema creation. The developer can use the Custom Data Types APIs to manage custom types, and users can use this set of APIs to create their own data types. Data Contextualization also provides an API that suggests data types based on user-provided sample test values. A custom data type contains a data type name and one or more regular expression patterns that must match the incoming data. The suggestion API helps decide the regular expression for a data type: it returns a list of possible regex matchers for the given tests and sample values. Users can pick the regex pattern that best matches the sample values and register those patterns as custom data types. Data Contextualization also supports deleting an unused custom data type; a custom data type that is used by any schema cannot be deleted.
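
As an illustration, here is a minimal sketch of asking for pattern suggestions with the Python requests library. The base URL, the token handling and the request/response field names (sampleValues, testValues, the shape of the returned suggestions) are assumptions for illustration only; check the Data Contextualization API specification for the actual contract.

    import requests

    SDI_BASE = "https://gateway.<region>.<domain>/api/sdi/v4"   # assumed base path
    HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

    # Ask Data Contextualization to suggest regex patterns for sample values.
    # The field names below are illustrative assumptions.
    payload = {
        "sampleValues": ["DE-1234", "US-9876"],
        "testValues": ["FR-5555"],
    }
    resp = requests.post(f"{SDI_BASE}/suggestPatterns", json=payload, headers=HEADERS)
    resp.raise_for_status()
    print(resp.json())   # inspect the suggested patterns and pick the one that matches best

The chosen pattern can then be registered as a custom data type with the POST /dataTypes endpoint described under Features.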

Data Ingest

The developer can use the Data Ingest APIs to bring data into Data Contextualization so that a schema can be created for query processing and semantic model creation. Integrated Data Lake (IDL) users should follow the Data Lake APIs to have their data processed by Data Contextualization. Data Ingest is the starting point for creating schemas and managing data for schemas. Once valid registries are created for a data source, the user can upload files from various systems and start the data ingest process to create schemas. Currently, Data Contextualization supports the JSON, XML, Parquet and CSV file formats for enterprise data, and the Parquet format for time series data. It supports two ways of data ingestion: uploading files through IDL, or uploading files directly against a valid data registry.

For Integrated Data Lake (IDL) customers

If you are a new IDL customer using Data Contextualization with IDL for Enterprise and IoT data, follow the steps below:

  1. Purchase the Data Contextualization and IDL base plan.
  2. By default, Data Contextualization enables cross-account access to provisioned tenants under the sdi folder.
  3. Data Contextualization uses the IDL POST /objectEventSubscriptions endpoint to subscribe to the Data Contextualization topic when Data Contextualization and IDL are provisioned to a tenant. IDL sends a notification whenever this folder changes.
  4. Retrieve <storageAccount> using the IDL API GET /objects. Use the storageAccount from the response to register the IDL data lake with Data Contextualization.
  5. Register the IDL data lake with Data Contextualization by calling the Data Contextualization POST /dataLakes endpoint with the payload {"type": "MindSphere", "name": "idl", "basePath": "<storageAccount>/data/ten=<tenantId>"} (see the sketch after this list).
  6. Enterprise data uploaded into the sdi folder, or Insights Hub IoT data imported into the TSI/sdi folder, is processed based on the notification received from IDL.
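
A minimal Python sketch of steps 4 and 5 is shown below. The base URLs, the token handling and the storageAccount response field are assumptions that follow the description above; only the POST /dataLakes payload is taken verbatim from step 5.

    import requests

    IDL_BASE = "https://gateway.<region>.<domain>/api/datalake/v3"   # assumed base path
    SDI_BASE = "https://gateway.<region>.<domain>/api/sdi/v4"        # assumed base path
    HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}
    TENANT_ID = "<tenantId>"

    # Step 4: retrieve the storage account from IDL (GET /objects).
    objects = requests.get(f"{IDL_BASE}/objects", headers=HEADERS)
    objects.raise_for_status()
    storage_account = objects.json()["storageAccount"]   # field name as described above

    # Step 5: register the IDL data lake with Data Contextualization (POST /dataLakes).
    payload = {
        "type": "MindSphere",
        "name": "idl",
        "basePath": f"{storage_account}/data/ten={TENANT_ID}",
    }
    requests.post(f"{SDI_BASE}/dataLakes", json=payload, headers=HEADERS).raise_for_status()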
Enterprise Data Flow
  1. Create a data registry as explained under the Data Registry APIs. Once the data registry is created, Data Contextualization returns a registryId.
  2. Store the registryId retrieved from this data registry.
  3. Identify the files that need to be uploaded for a given registryId and create metadata for the files using the IDL POST /objectMetadata/{objectpath} API (see the sketch after this list):

      {"tags": ["registryid_<registryId>"]}
    

    If the input file contains XML and no default rootTag is identified for XML files, or you want to provide a different rootTag for a file, add this tag in the metadata creation above:

      {"tags": ["registryid_<registryId>", "rootTag": "<root tag>"]}  # if not using the defaultRootTag or to use a different rootTag
    
  4. Upload the file; Data Contextualization retrieves a message from IDL for each upload and creates a schema from the uploaded files. If you are using Postman to upload the file, make sure to choose the binary option before uploading the file via the IDL-generated URL.

  5. Use searchSchema to retrieve all the schemas for the uploaded files and create a query using the Data Contextualization APIs.
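
The following sketch illustrates steps 3 and 4 with the Python requests library. The base path, object path and signed upload URL are placeholders and assumptions; only the tag format registryid_<registryId> is taken from step 3 above.

    import requests

    IDL_BASE = "https://gateway.<region>.<domain>/api/datalake/v3"   # assumed base path
    HEADERS = {"Authorization": "Bearer <token>"}
    REGISTRY_ID = "<registryId>"
    OBJECT_PATH = "data/ten=<tenantId>/sdi/machines.csv"             # illustrative object path

    # Step 3: tag the object with the registry id so Data Contextualization can associate it.
    meta = {"tags": [f"registryid_{REGISTRY_ID}"]}
    requests.post(
        f"{IDL_BASE}/objectMetadata/{OBJECT_PATH}",
        json=meta,
        headers={**HEADERS, "Content-Type": "application/json"},
    ).raise_for_status()

    # Step 4: upload the file as binary content to the IDL-generated upload URL.
    upload_url = "<signed upload URL obtained from IDL>"             # placeholder
    with open("machines.csv", "rb") as f:
        requests.put(upload_url, data=f).raise_for_status()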
IoT Data Flow
  1. Identify the asset and aspect for which you want Data Contextualization to process the data. Create an IoT data registry for the given asset and aspect by using the POST /iotDataRegistries endpoint.
  2. Once the IoT data registry is created, use the IDL API to perform the import by using the POST /timeSeriesImportJobs endpoint (steps 1 and 2 are sketched after this list).
  3. Once the timeSeriesImportJobs request is performed and the data is stored in the path that Data Contextualization is subscribed to, IDL sends a message to Data Contextualization.
  4. Time series data imported into the folder that Data Contextualization is subscribed to is consumed by Data Contextualization and is ready to be used in queries.
  5. Use the search schema to retrieve all the schemas for the imported data and start writing queries using the Data Contextualization APIs.
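
A minimal sketch of steps 1 and 2, assuming the base paths shown and illustrative request bodies; the payload field names for both endpoints are assumptions and should be checked against the API specifications.

    import requests

    SDI_BASE = "https://gateway.<region>.<domain>/api/sdi/v4"        # assumed base path
    IDL_BASE = "https://gateway.<region>.<domain>/api/datalake/v3"   # assumed base path
    HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

    # Step 1: create an IoT data registry for the asset/aspect to be processed.
    iot_registry = {"assetId": "<assetId>", "aspectName": "<aspectName>"}   # illustrative fields
    requests.post(f"{SDI_BASE}/iotDataRegistries", json=iot_registry, headers=HEADERS).raise_for_status()

    # Step 2: trigger the time series import in IDL (POST /timeSeriesImportJobs).
    import_job = {
        "assetId": "<assetId>",
        "aspectName": "<aspectName>",
        "from": "2020-01-01T00:00:00Z",   # illustrative time range
        "to": "2020-01-31T23:59:59Z",
    }
    requests.post(f"{IDL_BASE}/timeSeriesImportJobs", json=import_job, headers=HEADERS).raise_for_status()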

Search Schema

The schema is available once the ingest job is successful. The schema registry allows a user to retrieve schemas based on the following:

  • The source name, data tag, or schema name for the Enterprise category.
  • The assetId, aspectName, or schema name for the IoT category.

Features

Data registration

This is the first step before any data is ingested or connected to Data Contextualization for schema extraction or query execution. Data analysts or admins need to register the data sources from which data will be used for analysis and semantic modeling.

The registration consists of data source names, data tags (sub-sources or tables within a source), a file pattern and a file upload strategy. Data Contextualization currently supports CSV, JSON or XML, so the file pattern must end with the csv, xml or json file type. Example file patterns for the different types are: [a-z]+.json (JSON file type), [a-z]+.csv (CSV file type), [a-z]+.xml (XML file type). Multiple extensions can be accepted in this format: [a-z]+.(json|csv|xml). Adjust the [a-z]+ part to filter file names as desired.

For file-based batch data ingestion, various file upload data management policies are provided. Currently, Append and Replace are the two policies that can be set for each data tag within a data source.

  • Append: This policy joins files ingested for a source and data tag. It returns a success response if the schema matches; otherwise, it appends to the existing schema. This policy can be used for batch-based data ingestion via the data upload API.
  • Replace: This data management policy replaces the entire data set for the corresponding source and data tag. This policy is useful for updating metadata-like information.
  • Frozen Schema: This allows the user to reduce the time spent on schema extraction for high-frequency ingestion when the Enterprise Data Registry schema does not change across subsequent ingestions. Follow this step-by-step guide to skip or reset the schema generation process for an Enterprise Data Registry (see the sketch after this list):
    • Use the default value or set the request parameter schemaFrozen to false during data registry creation using the endpoint /dataRegistries POST.
    • After ingesting a file for the first time, update the registry and set the parameter schemaFrozen to true using the data registry update endpoint /dataRegistries/{id} PATCH.
    • Once the flag is set to true, Data Contextualization does not extract the schema for subsequent ingestions, and the schema must not change for this registry on subsequent ingestions. This helps improve ingestion performance.
    • If at any point the user thinks the schema may change for the given data registry, the schemaFrozen flag can be reset to false using the same /dataRegistries/{id} PATCH endpoint.
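
A minimal sketch of this create-then-freeze flow, assuming the base path shown and illustrative payload field names (sourceName, dataTag, filePattern, fileUploadStrategy, schemaFrozen, registryId); verify the exact names against the Data Registry API specification.

    import requests

    SDI_BASE = "https://gateway.<region>.<domain>/api/sdi/v4"   # assumed base path
    HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

    # Create the registry with schemaFrozen left at its default (false).
    registry = {
        "sourceName": "plantdata",
        "dataTag": "machines",
        "filePattern": "[a-z]+.csv",
        "fileUploadStrategy": "append",   # illustrative value
        "schemaFrozen": False,
    }
    created = requests.post(f"{SDI_BASE}/dataRegistries", json=registry, headers=HEADERS)
    created.raise_for_status()
    registry_id = created.json()["registryId"]   # assumed response field

    # After the first successful ingestion, freeze the schema to skip re-extraction.
    requests.patch(
        f"{SDI_BASE}/dataRegistries/{registry_id}",
        json={"schemaFrozen": True},
        headers=HEADERS,
    ).raise_for_status()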

The Data Registry service is primarily used for two purposes:

  1. Maintaining the registry for a tenant: Using this service, you can create your domain-specific registry entries. The registry is the starting point for any analytics and file upload. The data registry allows you to restrict the kind of file that can be uploaded, based on the file pattern and the area it will be uploaded to. Data Contextualization will either replace or append to existing data based on the file upload strategy. The following endpoints can be used to create and retrieve the data registries created by the customer:

    • /dataRegistries POST
    • /dataRegistries/{id} PATCH
    • /dataRegistries/{id} GET
    • /dataRegistries GET
  2. Creating custom data types for a given tenant: These endpoints allow you to create regular expression patterns that can be used during schema extraction. They help generate regular expression patterns based on the available sample values, register one or more of the system-generated patterns, and retrieve the registered patterns. By default, Data Contextualization uses a set of regular expressions when extracting the schema from the uploaded file. If a tenant has provided custom data types with custom or generated regular expressions through this service, these are also used to infer the data types of the uploaded file. The following endpoints can be used (a registration sketch follows this list):

    • /suggestPatterns POST - Generates the regular expression patterns for a given set of sample values.
    • /dataTypes/{name} GET - Retrieves datatypes for a tenant and data type name.
    • /dataTypes/{name} DELETE - Deletes datatypes for a tenant and data type name only if it is not used by any schema.
    • /dataTypes GET - Retrieves datatypes for a tenant.
    • /dataTypes POST - Registers datatypes for a tenant based on generated patterns or customer-created data types.
    • /dataTypes/{name}/addPatterns POST - Updates registered data types by adding patterns.
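
A minimal registration sketch using these endpoints; the request body field names (name, patterns) are assumptions for illustration and should be verified against the API specification.

    import requests

    SDI_BASE = "https://gateway.<region>.<domain>/api/sdi/v4"   # assumed base path
    HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

    # Register a custom data type with a chosen (or suggested) regular expression pattern.
    data_type = {"name": "PLANT_CODE", "patterns": ["[A-Z]{2}-[0-9]{4}"]}   # illustrative fields
    requests.post(f"{SDI_BASE}/dataTypes", json=data_type, headers=HEADERS).raise_for_status()

    # Later, add further patterns to the registered type.
    more = {"patterns": ["[A-Z]{3}-[0-9]{4}"]}
    requests.post(f"{SDI_BASE}/dataTypes/PLANT_CODE/addPatterns", json=more, headers=HEADERS).raise_for_status()

    # Retrieve the registered type to confirm the patterns.
    print(requests.get(f"{SDI_BASE}/dataTypes/PLANT_CODE", headers=HEADERS).json())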

Data Ingest

Once registrations are done, raw data can be ingested either from the Integrated Data Lake or from customer data lakes. For more information, refer to the Basics section. This service is the starting point for data ingestion into the Data Contextualization application. Currently, CSV, Parquet, JSON and XML formatted domain-specific files are supported. Two scenarios can be used to upload a file:

  1. Upload the file using IDL: This is the preferred mode; Data Contextualization starts processing the file once it is uploaded to IDL with the correct configuration. For more information on configuring IDL, refer to the Integrated Data Lake Service section.
  2. Upload the file with a valid data source registry: This approach is used by Data Contextualization-only customers. It allows more validation against the data registry and creates multiple schemas based on the different domains created under the data registry. Using this mode, you can create a combination of schemas from different domains, query them, or use them for analytical modeling (a sketch of this upload follows).
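
A minimal sketch of the second scenario, uploading a file directly against a registry. The endpoint name (/dataUpload) and the multipart form field names used here are hypothetical placeholders, not confirmed by this documentation; consult the Data Ingest API specification for the actual contract.

    import requests

    SDI_BASE = "https://gateway.<region>.<domain>/api/sdi/v4"   # assumed base path
    HEADERS = {"Authorization": "Bearer <token>"}

    # Upload a CSV file against an existing data registry (hypothetical endpoint and fields).
    with open("machines.csv", "rb") as f:
        resp = requests.post(
            f"{SDI_BASE}/dataUpload",                            # hypothetical endpoint name
            headers=HEADERS,
            files={"file": ("machines.csv", f, "text/csv")},
            data={"registryId": "<registryId>"},                 # hypothetical form field
        )
    resp.raise_for_status()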

In Data Management, Data Contextualization schedules an automatic Extract, Load and Transform (ELT) job to extract and infer the schema from this data. The schema is then stored per data source and data tag.

Bulk Data Ingest

Data Contextualization also supports bulk data ingestion from the Integrated Data Lake (IDL). This feature is useful for ingesting existing files from the data lake into Data Contextualization without the need to copy the files into the 'sdi' folder of the data lake.

Bulk data ingestion is supported only for registries where the schema is frozen, so the user must perform the following steps before using bulk ingestion:

  1. Create a data registry with required file extension.
  2. Perform data ingestion to create the initial schema for the data registry.
  3. Set the schemaFrozen flag on the registry to 'True'.

In the bulk ingestion request, the user can specify the fileType and the locations of the folders in IDL where the files are stored. Users can specify up to 10 folder paths in IDL. The folders used as input to Data Contextualization should be outside the "sdi" folder in the data lake.

When the user invokes the bulk ingest API, Data Contextualization reads the data files from the folder locations to complete the data ingestion. Users can check the status of the bulk ingestion job by using the API, as sketched below.
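
A minimal sketch of invoking bulk ingestion, under the assumption of a job-style endpoint. The endpoint name (/bulkIngestJobs) and the request/response fields shown are hypothetical placeholders, since this documentation does not name them; the folder paths and fileType follow the description above.

    import requests

    SDI_BASE = "https://gateway.<region>.<domain>/api/sdi/v4"   # assumed base path
    HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

    # Start a bulk ingest job for a schema-frozen registry (hypothetical endpoint and fields).
    job = {
        "registryId": "<registryId>",
        "fileType": "CSV",
        "folders": [
            "archive/2023/",   # up to 10 folder paths outside the sdi folder
            "archive/2024/",
        ],
    }
    created = requests.post(f"{SDI_BASE}/bulkIngestJobs", json=job, headers=HEADERS)
    created.raise_for_status()
    job_id = created.json().get("id")

    # Check the status of the bulk ingestion job (hypothetical status endpoint).
    print(requests.get(f"{SDI_BASE}/bulkIngestJobs/{job_id}", headers=HEADERS).json())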

The bulk data ingestion has the following limits:

  1. Supports only CSV and Parquet formats.
  2. Users can specify up to 10 folder locations from the data lake as input.
  3. Each input folder must have at least 1 file matching the extension in the data registry.
  4. The total number of files (matching the extension in the data registry) across all input folders should be no more than 1000.
  5. The total size of all input folders (considering only the files matching the extension in the data registry) should be no more than 2 GB.
  6. Each input file (matching the extension in the data registry) should be no more than 500 MB in size.

Schema Evolution

Schema changes are applicable to the append strategy only. If the schema changes from one data ingest to another, Data Contextualization takes care of consolidating the schema when new properties are found. If a property contains incompatible data types across different ingestion processes, Data Contextualization updates the type to an encompassing type. The encompassing type is accommodated for up to 500 records of existing data; after 500 records, the schema is considered stable for the existing data with respect to incompatible data type changes.

Users can search the schemas of this ingested or linked data to:

  • Build queries based on the physical schema
  • Develop the semantic model by mapping business properties to physical schema attributes
  • Get an initial inferred semantic model for selected schemas
  • Build queries based on a created semantic model

The Schema Registry service is primarily used for maintaining the schemas for a tenant. Schemas are stored with the default name format datasource_dataTag. If files are ingested on the fly (Quick Data Contextualization processing), the schema name is the name of the file.

The Data Contextualization system extracts and stores the schema once the file is uploaded. The user can search for a schema created by the Data Contextualization system using the data tag, schema name and source name. Multiple schemas can be searched by providing either an empty list or a list filtered on the above parameters.

During the schema extraction process, Data Contextualization adds a column sdiLastModifiedDate that stores the time when the file was ingested. In the corresponding job status, the startedDate in the response will be the same as the value of sdiLastModifiedDate. You can use the sdiLastModifiedDate field in queries to filter data by this timestamp.

Data Contextualization currently recognizes UTC format dates (for example, 2020-02-15T04:46:13Z) and W3C format dates (for example, 2020-10-15T04:46:13+00:00), if present in the raw data files. They will be identified as timestamp data type in the resulting schema.

Search Schema

This allows the user to search for schemas based on the data tag, schema name or source name. The elements of the search schema array must use identical search criteria.

POST Method: /searchSchemas
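
A minimal search sketch; the shape of the request body (a schemas array with sourceName/dataTag/schemaName fields) and of the response are assumptions based on the description above and should be checked against the API specification.

    import requests

    SDI_BASE = "https://gateway.<region>.<domain>/api/sdi/v4"   # assumed base path
    HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

    # Search for schemas by source name and data tag (an empty list would return multiple schemas).
    search = {"schemas": [{"sourceName": "plantdata", "dataTag": "machines"}]}
    resp = requests.post(f"{SDI_BASE}/searchSchemas", json=search, headers=HEADERS)
    resp.raise_for_status()
    for schema in resp.json().get("schemas", []):
        print(schema.get("schemaName"), schema.get("properties"))   # assumed response fields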

Data Contextualization works on a schema-on-read philosophy. Users do not need to know the schema before data is ingested into Insights Hub Data Contextualization. It can infer and extract a schema consisting of attributes and data types from the ingested data and store it per tenant and data tag. The Data Ingest Service API supports XML, Parquet, JSON and CSV as input data file formats. For files in XML format, the customer can provide the root element to be processed.

Limitations

  1. The number of data sources that can be registered depends on the offering plan subscribed to by the user tenant.
  2. The Data Ingest POST method supports files up to 100 MB in size.
  3. Data Contextualization allows a maximum ingest rate of 70 MBPS.
  4. Once the data is ingested successfully, the schema is available for search after the job is finished.
  5. Users can create a maximum of 200 custom data types per tenant, and each data type cannot contain more than 10 regular expression patterns.
  6. The search schema and infer ontology requests are limited to 20 entries per search.
  7. Data Contextualization supports a maximum of 250 properties per schema.
  8. Delimiters such as semicolon or pipe are not accepted in CSVs. Input CSV files must use a comma (,) as the delimiter.
  9. For input files in JSON format, JSON key names should not contain special characters such as dot, space, comma, semicolon, curly braces, brackets, new line, tab, etc.
  10. Schema evolution supports incompatible data type changes for up to 500 records.
  11. The source name cannot start with MDSPINT, as it is a Data Contextualization reserved keyword.
  12. While creating the SDI registry, do not use the "-" (dash) character in the sourceName or the dataTag. This character has a special meaning in SQL, and its use will result in errors in the SQL statements created for such a registry.

Last update: January 9, 2024

Except where otherwise noted, content on this site is licensed under the Development License Agreement.