Integrated Data Lake Service¶
The Integrated Data Lake (IDL) is a repository that allows you to store structured and unstructured data in its native format until it is needed. It handles large data pools for which the schema and data requirements are not defined until the data is queried. This offers more agility and flexibility than traditional data management systems.
The Integrated Data Lake Service allows you to store your data as is, analyze it using dashboards and visualizations or use it for big data processing, real-time analytics, and machine learning.
To access the Integrated Data Lake Service, you need the respective roles listed in Data Lake Services roles and scopes.
A user can only interact with objects within their tenant and subtenants.
To access the Secure Data Sharing (SDS) protected APIs, you need appropriate Policy Definitions in place. Please refer here for the list of supported APIs and Required Actions.
Data Upload and Download¶
The Integrated Data Lake Service enables data upload and download using AWS or Azure. The signed URLs for AWS and Shared Access Signatures (SAS) for Azure have an expiration date and time and can only be used by authorized tenant users or services.
| Data upload and download | AWS | Azure |
|---|---|---|
| Using service | Pre-signed URLs | Shared Access Signature (SAS) |
| Maximum object size limit | 5 GB | 256 MB |
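The size limits above can be checked on the client side before an upload is attempted. The following is a minimal sketch, not the service's own client: the function names are hypothetical, binary units (1 GB = 1024**3 bytes) are an assumption, and the validity periods (two hours for signed URLs, twelve hours for SAS) are assumed to count from the moment the URL is issued. The request is only prepared, not sent. The `x-ms-blob-type` header is what Azure Blob Storage expects on a Put Blob call.

```python
from datetime import datetime, timedelta, timezone
from urllib.request import Request

# Limits from the table; binary units are an assumption.
MAX_OBJECT_BYTES = {
    "aws": 5 * 1024**3,      # pre-signed URL
    "azure": 256 * 1024**2,  # Shared Access Signature (SAS)
}

# Validity periods from the documented limitations; counting from
# issuance time is an assumption.
VALIDITY = {
    "aws": timedelta(hours=2),
    "azure": timedelta(hours=12),
}


def check_size(size_bytes: int, provider: str) -> None:
    """Raise ValueError if the object is too large for the provider."""
    if size_bytes > MAX_OBJECT_BYTES[provider]:
        raise ValueError(
            f"{size_bytes} bytes exceeds the {provider} limit "
            f"of {MAX_OBJECT_BYTES[provider]} bytes"
        )


def is_expired(issued_at: datetime, provider: str, now: datetime) -> bool:
    """Return True once the signed URL or SAS has passed its validity period."""
    return now >= issued_at + VALIDITY[provider]


def build_upload_request(url: str, payload: bytes, provider: str) -> Request:
    """Prepare (but do not send) an HTTP PUT to a signed URL or SAS URL."""
    check_size(len(payload), provider)
    headers = {}
    if provider == "azure":
        # Required by Azure Blob Storage for Put Blob requests.
        headers["x-ms-blob-type"] = "BlockBlob"
    return Request(url, data=payload, headers=headers, method="PUT")
```

The prepared request could then be sent with `urllib.request.urlopen(request)`. For objects close to the size limit, a streaming upload would be preferable to holding the whole payload in memory.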
Time Series Import¶
The Integrated Data Lake Service allows authorized tenant users or services to import time series data into the data lake. This enables on-demand time series upload for analytics and machine learning tools.
The Integrated Data Lake Service assigns each object a unique identifier. Additionally, the object can be assigned a set of extended metadata tags.
Using these services, you can enable (and disable) read and write access to your data for a specific tenant. For example, you can enable analytics tools to directly access your data for analyses without having to download it. This saves storage space and eliminates the need for regular data synchronization.
Alternatively, the Integrated Data Lake Service can generate temporary read and write access to your data, as summarized in the following table:
| Deviation | AWS | Access Permission | Azure | Access Permission |
|---|---|---|---|---|
| Data access services | Cross account access | READ | Service Principal | READ and WRITE |
| | Security Token Service (STS) | READ, WRITE | Pre-signed URL | READ, WRITE |
| Access limits | 5 cross account accesses | | 5 Service Principals | |
The Security Token Service or Service Principal can only be used by an authorized tenant user or service.
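The access limit in the table (5 enabled cross account accesses on AWS, 5 Service Principals on Azure) can be mirrored in client-side bookkeeping so that a request that would exceed it is caught early. This is only an illustrative sketch. The `AccessRegistry` class and its methods are hypothetical and are not part of the IDL API.

```python
from dataclasses import dataclass, field

# "Access limits" row of the table: 5 enabled entries per tenant.
MAX_ENABLED = 5


@dataclass
class AccessRegistry:
    """Hypothetical local record of enabled cross accounts or Service Principals."""

    enabled: set = field(default_factory=set)

    def enable(self, principal_id: str) -> None:
        if principal_id in self.enabled:
            return  # already enabled, nothing to do
        if len(self.enabled) >= MAX_ENABLED:
            raise RuntimeError("access limit reached; disable an entry first")
        self.enabled.add(principal_id)

    def disable(self, principal_id: str) -> None:
        # discard() is a no-op if the entry was never enabled
        self.enabled.discard(principal_id)
```

Disabling an entry frees a slot, so a sixth account can be enabled only after one of the first five has been disabled.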
For AWS only: Suppose an IDL user has a third-party application, such as Tableau Server, running on their AWS account and wants to give it access to the data that resides in IDL. This can be done by enabling the AWS account through cross account access. It is also possible to provide read and delete access to the enabled cross account through the API or through the IDL manager.
The Integrated Data Lake Service provides a notification functionality, which reports when objects are ingested, updated or deleted using the service. Authorized tenant users or services can subscribe to notifications. Currently, a tenant user or service can subscribe to a maximum of 15 notifications.
If the permission to send notifications to the SNS topic is removed, the TenantAdmin is notified by email to check and respond. The email is sent for one week, after which the subscription is deactivated.
Event Subscription allows you to subscribe to events and receive notifications. You can register a Simple Notification Service (SNS) destination path on AWS or a Service Bus destination path on Azure, to which the Integrated Data Lake Service publishes notifications. These notifications cover object events such as create, update or delete within the tenant prefix. You can add, view, edit and delete event subscriptions.
If the permission to send notifications to an SNS or Service Bus topic is removed, or if there is any misconfiguration, the TenantAdmin is notified by email and the status of the subscription changes to "Inactive" in the user interface. After resolving the configuration issue, you can change the subscription status back to "Active" in the user interface.
The subscription name is optional. If no name is provided, a default name of the form (subscription_<<1234>>) is used.
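The subscription rules described here (optional name with a default, "Active"/"Inactive" status, and a cap of 15) can be illustrated with a small model. This is a sketch under assumptions: the `Subscription` class and helper functions are hypothetical, and the default name is assumed to be `subscription_` followed by a number supplied by the caller, since the text only shows the placeholder form.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_SUBSCRIPTIONS = 15  # limit stated in the text


@dataclass
class Subscription:
    destination: str        # SNS topic or Service Bus destination path
    name: str
    status: str = "Active"  # "Active" or "Inactive", as shown in the UI


def add_subscription(existing: List[Subscription], destination: str,
                     name: Optional[str] = None, number: int = 0) -> Subscription:
    """Append a subscription, applying the default name and the 15-item cap."""
    if len(existing) >= MAX_SUBSCRIPTIONS:
        raise RuntimeError("subscription limit reached")
    if not name:
        # Assumption: the default name is "subscription_" plus a number.
        name = f"subscription_{number}"
    sub = Subscription(destination=destination, name=name)
    existing.append(sub)
    return sub


def mark_delivery_failed(sub: Subscription) -> None:
    """Reflect a removed permission or misconfiguration."""
    sub.status = "Inactive"


def reactivate(sub: Subscription) -> None:
    """Set the status back once the configuration has been fixed."""
    sub.status = "Active"
```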
The Integrated Data Lake Service exposes its API for the following tasks:
- Import time series data
- Generate signed URLs to upload, update or download objects
- Delete objects
- Add, update and delete tags for objects
- Receive notifications
- Cross account access (for AWS) / Service Principal (for Azure)
- Subtenancy support
- Bulk batch upload of objects
The Integrated Data Lake Service exposes a UI for the following functionalities:
- Cross account access (on AWS) / Service Principal (on Azure) - Enables a native account to read the data from IDL.
- Time Series Import - Imports time series data into the data lake.
- Data Explorer - Lets you explore the files and objects.
- Event Subscription - Creates subscriptions for events on data.
Limitations¶
- All requests pass through MindSphere Gateway and must adhere to the MindSphere Gateway Restrictions.
- Maximum supported object size for object upload and download using signed URL is 5 GB.
- Maximum supported object size for object upload and download using Shared Access Signatures (SAS) is 256 MB.
- Signed URLs expire after two hours.
- Shared Access Signatures (SAS) expire after twelve hours.
- Objects are not version controlled.
- Under the revocation process, cross account accesses are emptied and stop working, as per the bucket policy.
- A token issued before deprovisioning remains active until it expires.
- An S3 signed URL issued before deprovisioning remains active until it expires.
- All Bulk Import limitations also apply to the time series import functionality in IDL.
- A maximum of 10 cross account accesses can be created in disabled state for AWS.
- A maximum of 5 cross account accesses can be enabled at any given time for AWS.
- A maximum of 5 service principals can be enabled at any given time for Azure.
- A user can have a maximum of 15 subscriptions.
- The data in UTS might take 48 hours to be reflected.
- "Write" access to the Time Series Import folder cannot be granted to a Service Principal.
- Upload pre-signed URLs (AWS) and upload Shared Access Signatures (SAS) (Azure) cannot be created for the Time Series Import folder.
- To download time series data files, the user path must be prefixed with Time Series Import.
- File names may only contain characters from the set '[a-zA-Z0-9.!*'() _-/=]'. Spaces are not allowed at the beginning or at the end of the name, and consecutive spaces are not allowed within it.
- Objects uploaded using native URLs can only be deleted using native URLs. IDL Service URLs do not support the deletion of files that were uploaded using native URLs.
- A maximum of 2 secrets can be generated at any given time for each Service Principal.
- A secret is active for a maximum of 90 days, after which it expires automatically.
- For event notifications, the user must provide a topic from the EU1 region only. Integrated Data Lake cannot send notifications to topics in other regions.
- After uploading to Integrated Data Lake, it takes approximately 5-10 minutes for the data to become available in the search.
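The file name rule in the list above can be checked before an upload is attempted. The following is a minimal sketch with a hypothetical function name. It assumes the rule applies to the whole object name, including any path separators. Note that the hyphen has to be escaped inside the regular expression, because an unescaped `_-/` would be read as an (invalid) character range.

```python
import re

# Character set from the limitation; the hyphen is escaped so that it is
# treated as a literal character and not as a range operator.
_ALLOWED = re.compile(r"[a-zA-Z0-9.!*'() _\-/=]+")


def is_valid_object_name(name: str) -> bool:
    """Check a file name against the documented character and spacing rules."""
    if not _ALLOWED.fullmatch(name):
        return False  # empty, or contains a character outside the set
    if name != name.strip(" "):
        return False  # leading or trailing space
    return "  " not in name  # no consecutive spaces


print(is_valid_object_name("flights/2019 report.csv"))  # True
print(is_valid_object_name("bad#name.csv"))             # False
```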
To get the current list of limitations, go to the release notes and choose the latest date. From there, go to "MindAccess Developer Plan Subscribers and MindAccess Operator Plan Subscribers" and pick the IoT service you are interested in.
The quality assurance representative of an airline company wants to upload flight data (years 2009-2019) to MindSphere so that they can run analytics tools and make the data accessible for querying.
They can use the MindSphere Integrated Data Lake Service to upload Excel sheets and enable data access from other accounts. This allows the airline company to integrate analytics tools like AWS Glue or Power BI on Azure and quickly perform queries. For example, they can query for "the most popular airport in last 10 years" or "the airport with most cancelled flights in the past year".
Except where otherwise noted, content on this site is licensed under the MindSphere Development License Agreement.