Integrated Data Lake Service

Note

Integrated Data Lake Service API version 4.. is available only for Virtual Private Cloud.

Idea

The Integrated Data Lake (IDL) is a repository that allows you to store structured and unstructured data in its native format for as long as it is required. It handles large data pools for which the schema and data requirements are not defined until the data is queried. This offers more agility and flexibility than traditional data management systems.

The Integrated Data Lake Service allows you to store your data as it is, analyze it using dashboards and visualizations or use it for big data processing, real-time analytics, and machine learning.

Access

To access the Integrated Data Lake Service, you need the respective roles listed in Data lake Services roles and scopes.

A user can only interact with objects within their environment and subtenants.

For accessing the Secure Data Sharing (SDS) protected APIs, you need to have the appropriate Policy Definitions in place. For the list of supported APIs and the required actions, refer here.

Basics

Data Upload and Download

The Integrated Data Lake Service enables data upload and download using AWS, Azure or Aliyun. The signed URLs for AWS and Aliyun and Shared Access Signatures (SAS) for Azure have an expiration date and time and can only be used by authorized environment users or services.

Data upload and download  | AWS             | Azure                         | Aliyun
Using service             | Pre-signed URLs | Shared Access Signature (SAS) | Pre-signed URLs
Maximum object size limit | 5 GB            | 256 MB                        | 5 GB
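
The limits in the table above, together with the expiry durations listed under Limitations, can be checked on the client side before an upload is attempted. The following is a minimal sketch, not part of the service API: the names `LIMITS`, `fits_size_limit` and `expires_at` are hypothetical, 1 GB is assumed to mean 1024**3 bytes, the two-hour signed URL lifetime is assumed to apply to both AWS and Aliyun, and the lifetime is assumed to start when the URL or SAS is issued.

```python
from datetime import datetime, timedelta, timezone

# Assumption: binary units (1 GB = 1024**3 bytes, 1 MB = 1024**2 bytes).
GB = 1024 ** 3
MB = 1024 ** 2

# Values copied from this page; the dictionary layout itself is illustrative.
LIMITS = {
    "aws": {"max_bytes": 5 * GB, "lifetime": timedelta(hours=2)},
    "azure": {"max_bytes": 256 * MB, "lifetime": timedelta(hours=12)},
    "aliyun": {"max_bytes": 5 * GB, "lifetime": timedelta(hours=2)},
}


def fits_size_limit(provider: str, size_bytes: int) -> bool:
    """Return True if an object of this size is within the provider's limit."""
    return 0 <= size_bytes <= LIMITS[provider]["max_bytes"]


def expires_at(provider: str, issued_at: datetime) -> datetime:
    """Return the assumed expiry time of a signed URL or SAS issued at issued_at."""
    return issued_at + LIMITS[provider]["lifetime"]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(fits_size_limit("azure", 300 * MB))  # larger than 256 MB
    print(expires_at("aws", now) - now)
```

Checking the size locally avoids requesting a signed URL or SAS for an object that the provider would reject anyway.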

Time Series Import

The Integrated Data Lake Service allows authorized environment users or services to import time series data into the data lake. This enables on-demand time series upload for analytics and machine learning tools.

Metadata Management

Metadata management is a crucial aspect of data management within any Data Lake application. It involves systematic organization, storage, retrieval, and maintenance of metadata, which is essentially data about data. This includes information about the characteristics, origin, usage, and relationships of the actual data stored in the Data Lake. The Integrated Data Lake Service assigns each object a unique identifier. Additionally, the object can be assigned a set of extended metadata values.

Metadata Collections

A metadata collection is a systematic grouping of metadata keys that manages all the relevant metadata information associated with a specific dataset or system.

Metadata Keys

Metadata keys are unique identifiers that are used to represent specific attributes or characteristics associated with the data in a key-value pair structure.

Metadata Rules

Metadata rules are a set of pre-defined conditions that determine which metadata, tied to a custom collection, is applicable to Integrated Data Lake resources.

Data Access

Using these services, you can enable (and disable) read and write access to your data for a specific environment. For example, you can enable analytics tools to directly access your data for analysis without having to download it. This saves storage space and eliminates the need for regular data synchronization.

Alternatively, the Integrated Data Lake Service can generate temporary read and write access to your data. For more information, refer to the table below:

Deviation            | AWS                          | Access Permission | Azure                | Access Permission | Aliyun                       | Access Permission
Data access services | Cross account access         | READ              | Service Principal    | READ and WRITE    | Cross account access         | READ
                     | Security Token Service (STS) | READ, WRITE       | Pre-signed URL       | READ, WRITE       | Security Token Service (STS) | READ, WRITE
Access limits        | 5 cross account accesses     |                   | 5 Service Principals |                   | 5 cross account accesses     |

The Security Token Service or Service Principal can only be used by an authorized environment user or service.
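
The access limits in the table above can be checked before enabling a further cross account access or Service Principal. This is a rough sketch only: `MAX_ENABLED` and `can_enable_another` are hypothetical names, and the count of currently enabled accesses is assumed to be known to the caller.

```python
# Maximum number of enabled accesses per provider, as stated on this page:
# cross account accesses for AWS and Aliyun, Service Principals for Azure.
MAX_ENABLED = {"aws": 5, "azure": 5, "aliyun": 5}


def can_enable_another(provider: str, enabled_count: int) -> bool:
    """Return True if one more access can be enabled without exceeding the limit."""
    if enabled_count < 0:
        raise ValueError("enabled_count must not be negative")
    return enabled_count < MAX_ENABLED[provider]


if __name__ == "__main__":
    print(can_enable_another("aws", 4))
    print(can_enable_another("azure", 5))
```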

Notification

The Integrated Data Lake Service provides a notification functionality, which notifies subscribers whenever objects are ingested, updated or deleted using the service. Only authorized environment users or services can subscribe to notifications. Currently, a maximum of 15 notification subscriptions is supported.

Info

If the permission to send notifications to the Simple Notification Service (SNS) topic is removed, the tenantAdmin is notified by email to check and respond. The email is sent for one week; after that, the subscription is deactivated.

Event Subscription

Event Subscription allows you to subscribe to events to get notifications. You can register a Simple Notification Service (SNS) destination path on AWS or a Service Bus destination path on Azure for the notifications published by the Integrated Data Lake Service. These notifications include object events such as create, update or delete in the environment prefix. You can add, view, edit and delete event subscriptions.

If the permission to send notifications to an SNS/Service Bus topic is removed, or if there is any misconfiguration, the TenantAdmin is notified by email. The status of the subscription changes to "Inactive" in the user interface. After resolving the configuration issue, you can change the subscription status back to "Active" in the user interface.

Info

Subscription name is not mandatory. If the name is not provided, then (subscription_<<1234>>) will be considered as the default name for the subscription.

Features

The Integrated Data Lake Service exposes its API for the following tasks:

  • Import time series data
  • Generate signed URLs or Shared Access Signatures (SAS) to upload, update or download objects
  • Delete objects
  • Add, update and delete metadata values for objects
  • Receive notifications
  • Cross account access for AWS and Aliyun/Service Principal (for Azure)
  • Subtenancy support
  • Bulk batch upload of objects

The Integrated Data Lake UI provides the following functionalities:

  • Cross account access (on AWS and Aliyun)/Service Principal (on Azure) - For enabling a native account to read the data from IDL.
  • Time Series Import functionality - To import time series data into the Data Lake.
  • Data Explorer - For exploring the files/objects.
  • Event Subscription - For creating subscriptions for events on data.
  • OData - For creating contracts for objects on data.
  • Metadata Management - For systematic organization, storage, retrieval, and maintenance of metadata, which is essentially data about data.

Limitations

  • All requests pass through Industrial IoT Gateway and must adhere to the Industrial IoT Gateway Restrictions.
  • Maximum supported object size for object upload and download using signed URL is 5 GB.
  • Maximum supported object size for object upload and download using Shared Access Signatures (SAS) is 256 MB.
  • Signed URLs expire after two hours.
  • Shared Access Signatures (SAS) expire after twelve hours.
  • Objects are not version controlled.
  • During the revocation process, the cross account accesses are emptied and stop working, as per the bucket policy.
  • The token will remain active until its expiry before deprovisioning.
  • The S3 Signed URL will remain active until its expiry before deprovisioning.
  • All Bulk Import limitations also apply to the time series import functionality in IDL.
  • A maximum of 10 cross account accesses can be created in disabled state for AWS.
  • A maximum of 5 cross account accesses can be enabled for any given time for AWS and for Aliyun.
  • A maximum of 5 service principals can be enabled for any given time for Azure.
  • A user can have a maximum of 15 subscriptions.
  • The data in UTS might take 48 hours to be reflected.
  • "Write" access to the Time Series Import folder cannot be granted to a Service Principal.
  • Upload pre-signed URLs (AWS) and upload Shared Access Signatures (SAS) (Azure) cannot be created for the Time Series Import folder.
  • The user path must be prefixed with Time Series Import for downloading the time series data files.
  • Characters used in the file name must match the regex pattern character set '[a-zA-Z0-9.!*'() _-/=]'. Spaces are not allowed at the beginning or at the end. Also, consecutive spaces are not allowed within the name.
  • Objects uploaded by using native URLs can be deleted only by using native URLs. IDL Service URLs do not support the deletion of files which were uploaded using native URLs.
  • A maximum of 2 secrets can be generated at any given time for each Service Principal.
  • A secret is active for a maximum of 90 days; thereafter, it expires automatically.
  • For event notifications, the user must provide a topic from the EU1 region only. Integrated Data Lake cannot send notifications to topics in other regions.
  • After uploading to Integrated Data Lake, it takes approximately 5-10 minutes for the data to become available in the search.
  • OData is supported for EU1 (AWS) region only.

To get the current list of limitations, refer to the release notes.
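
The file name rules listed above (allowed character set, no leading or trailing spaces, no consecutive spaces) can be checked locally before an upload. The following is a minimal sketch under stated assumptions: `is_valid_object_name` is a hypothetical helper, not part of the service API. The hyphen in the character set is treated as a literal character and escaped, because `_-/` would otherwise be read as an invalid range by Python's `re` module. Rejecting empty names is also an assumption.

```python
import re

# Character set from the limitation above, with the hyphen escaped so that
# it is matched literally (assumption about the intended meaning).
_ALLOWED_CHARS = re.compile(r"[a-zA-Z0-9.!*'() _\-/=]+")


def is_valid_object_name(name: str) -> bool:
    """Check a file name against the rules stated in the limitations list."""
    # Every character must belong to the allowed set (empty names are rejected).
    if not _ALLOWED_CHARS.fullmatch(name):
        return False
    # Spaces are not allowed at the beginning or at the end.
    if name.startswith(" ") or name.endswith(" "):
        return False
    # Consecutive spaces are not allowed within the name.
    return "  " not in name


if __name__ == "__main__":
    for candidate in ["flights/2019 data_v1.csv", " leading.csv", "bad#name.csv"]:
        print(repr(candidate), is_valid_object_name(candidate))
```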

Example Scenario

The quality assurance representative of an airline company wants to upload flight data (years 2009-2019) to Industrial IoT so that they can run analytics tools and make the data accessible for querying.

They can use the Integrated Data Lake Service to upload Excel sheets and enable data access from other accounts. This allows the airline company to integrate analytics tools like AWS Glue or Power BI on Azure and quickly perform queries. For example, they can query for "the most popular airport in the last 10 years" or "the airport with the most cancelled flights in the past year".
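
To make the two example queries concrete, the sketch below runs them over a handful of in-memory records using only the Python standard library. It does not use AWS Glue, Power BI or the Integrated Data Lake API. The field names `year`, `origin` and `cancelled`, the sample rows and both function names are invented purely for illustration.

```python
from collections import Counter

# Made-up sample rows; real flight data would be read from the uploaded files.
FLIGHTS = [
    {"year": 2017, "origin": "MUC", "cancelled": False},
    {"year": 2018, "origin": "FRA", "cancelled": True},
    {"year": 2019, "origin": "FRA", "cancelled": False},
    {"year": 2019, "origin": "MUC", "cancelled": True},
    {"year": 2019, "origin": "MUC", "cancelled": True},
    {"year": 2019, "origin": "FRA", "cancelled": False},
]


def most_popular_airport(rows, first_year, last_year):
    """Return the origin airport with the most flights in the given year range."""
    counts = Counter(
        row["origin"] for row in rows if first_year <= row["year"] <= last_year
    )
    return counts.most_common(1)[0][0] if counts else None


def most_cancelled_airport(rows, year):
    """Return the origin airport with the most cancelled flights in one year."""
    counts = Counter(
        row["origin"] for row in rows if row["year"] == year and row["cancelled"]
    )
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    print(most_popular_airport(FLIGHTS, 2009, 2019))
    print(most_cancelled_airport(FLIGHTS, 2019))
```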


Last update: September 9, 2024

Except where otherwise noted, content on this site is licensed under the Development License Agreement.