Data lakehouse is a new data architecture paradigm that combines the benefits of a data warehouse (transaction support, schema enforcement, BI support) and a data lake (storage decoupled from compute, diverse data types, diverse workloads, streaming), and provides additional benefits (direct queries on raw data without ETL/ELT). Here are some good articles about the data lakehouse.

Azure and AWS both talk about the data lakehouse now. Azure has Azure Data Lake, Azure Databricks and the former Azure SQL DW, but Microsoft is trying to provide a more integrated experience with Azure Synapse Analytics.

In this post, I am going to try out the data lakehouse with Azure Synapse Analytics…


Databricks is a very popular data platform. HashiCorp Terraform is a popular cloud infrastructure provisioning tool. I like to try out new things in a quick and easy way. Instead of manual provisioning, which is tedious and error-prone, it is better to have a one-click setup that provisions all necessary resources. That’s what the integration between Databricks and Terraform promises.

https://databricks.com/blog/2020/09/11/announcing-databricks-labs-terraform-integration-on-aws-and-azure.html

Currently we need to take 2 steps to provision Azure Databricks:

  1. Provision the Azure Databricks workspace
  2. Provision Databricks resources (e.g. cluster, job, notebook, etc.)

Create Azure Databricks workspace

Use Azure Cloud Shell, which already has Terraform installed. Use the following Terraform template:

az login
terraform init
terraform plan
terraform apply…


In the Python world, the multiprocessing library has been widely used to parallelize computation on a single machine. However, distributed multi-node, multi-process applications are still hard to write. This article talks about how you can easily turn a multiprocessing Python application into a distributed, containerized application with Ray and Kubernetes. This article is inspired by the following articles.

If you are familiar with Python’s native multiprocessing.Pool for parallel processing, you will find migrating to ray.util.multiprocessing.Pool quite easy. If you are not familiar with it, here is a quick pattern.

from multiprocessing import Pool
import time
# tasks to be parallelized
work = (["A", 5], ["B", 2], ["C"…
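As a rough, self-contained illustration of the Pool pattern (the task list and the function names `run_task` and `run_all` are hypothetical, not taken from the article), a minimal sketch could look like this. It assumes that migrating to Ray amounts to swapping the import, as the commented line suggests; that swap needs Ray installed and is not exercised here.

```python
import time
from multiprocessing import Pool
# Assumption: with Ray installed, the migration would be this one-line swap:
# from ray.util.multiprocessing import Pool

# Hypothetical task list: (label, seconds of simulated work)
TASKS = [("A", 0.2), ("B", 0.1), ("C", 0.1)]


def run_task(task):
    """Simulate one unit of work and return its label."""
    label, seconds = task
    time.sleep(seconds)
    return label


def run_all(tasks, processes=2):
    """Distribute the tasks across a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # map() blocks until every task is done and preserves input order
        return pool.map(run_task, tasks)


if __name__ == "__main__":
    print(run_all(TASKS))
```

The worker function is defined at module level so that it can be pickled and sent to the worker processes.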

Amazon SageMaker Feature Store was announced at re:Invent 2020. But what is a feature store? Below are articles about feature stores.

The main code I use is below (thanks to Chris Fregly and Antje Barth and their excellent community).

The usual workflow of using feature store is:

  1. Feature engineering: Generate features -> Define feature store group -> ingest features into feature store group
  2. Training: Discover usable features -> Load features -> Train model with features

One important feature of a feature store is reusability (features can be shared with other people, and you can time-travel to previous feature values for model reproducibility)…
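To make the workflow and the time-travel idea concrete, here is a toy in-memory sketch. `ToyFeatureGroup` and its methods are hypothetical and are not the SageMaker Feature Store API; the sketch only assumes that each ingested record carries an event time and that older versions are kept rather than overwritten.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureGroup:
    """Hypothetical in-memory stand-in for a feature group (not the SageMaker API)."""
    name: str
    # record_id -> list of (event_time, features), kept sorted by event_time
    _rows: dict = field(default_factory=dict)

    def ingest(self, record_id, event_time, features):
        """Append a new version of a record instead of overwriting it."""
        versions = self._rows.setdefault(record_id, [])
        versions.append((event_time, dict(features)))
        versions.sort(key=lambda version: version[0])

    def get(self, record_id, as_of=None):
        """Return the latest features, or the latest at or before `as_of`."""
        versions = self._rows.get(record_id, [])
        if as_of is not None:
            versions = [v for v in versions if v[0] <= as_of]
        return versions[-1][1] if versions else None


# 1. Feature engineering: generate features and ingest them into the group
group = ToyFeatureGroup("customer_features")
group.ingest("c1", event_time=1, features={"orders_30d": 2})
group.ingest("c1", event_time=5, features={"orders_30d": 7})

# 2. Training: load the current features, or time-travel for reproducibility
latest = group.get("c1")
as_trained = group.get("c1", as_of=3)
```

Keeping every version is what allows a model trained at an earlier point in time to be reproduced with the feature values it originally saw.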


Overview

EMR Studio provides fully managed Jupyter notebooks that are integrated with EMR. Previously there were EMR Notebooks. However, here is the difference:

Integration with AWS SSO (Single Sign-On) is interesting. Users do not need access to the AWS Management Console to use EMR Studio; EMR Studio appears like an enterprise application that users can access from a user portal.

Provision

Enable AWS SSO

You must enable AWS SSO. Otherwise you cannot create EMR Studio.

https://console.aws.amazon.com/singlesignon

You need to configure AWS SSO (e.g. users, groups).

Create EMR Studio

AWS provides a sample script to easily provision EMR Studio.

Make sure you have the latest AWS CLI (version 1 or 2).

git clone https://github.com/aws-samples/emr-studio-samples.git
cd…

Graph fundamentals

Major advantages of using graph database

Graph databases and graph modeling really shine when your query involves connections that are not well defined or an unknown number of traversal steps. For example, in the article below, if you want to find all investors (firms or individuals) who directly or indirectly invested in a given company within 3 hops, it would be hard to write a concise relational query, and increasing the number of hops would make the already-ugly query uglier.
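The multi-hop investor question can be sketched as a breadth-first traversal. This is plain Python rather than any graph database's query language, and the investment data and the function name `investors_within` are hypothetical.

```python
from collections import deque

# Hypothetical edges: investor -> entities it invested in
INVESTMENTS = {
    "Alice": ["FundA"],
    "FundA": ["HoldCo"],
    "HoldCo": ["TargetCo"],
    "Bob": ["TargetCo"],
    "Carol": ["Alice"],
}


def investors_within(target, investments, max_hops=3):
    """Return {investor: hops} for everyone reaching `target` in <= max_hops."""
    # Reverse the edges so we can walk from the company back to its investors
    invested_by = {}
    for investor, companies in investments.items():
        for company in companies:
            invested_by.setdefault(company, []).append(investor)

    found = {}
    queue = deque([(target, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for investor in invested_by.get(node, []):
            if investor not in found and investor != target:
                found[investor] = hops + 1
                queue.append((investor, hops + 1))
    return found
```

Changing the hop limit is a single argument here, whereas a relational query would typically need one more self-join (or a recursive query) per additional hop.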

Another example is entity resolution: for instance, you may want to check whether two users are the same user (each user may have a name, phone number, email address, address, age, or more fields)…


Overview

Why: Encryption (hiding data sent on the wire) and Identification (ensuring the computer you are speaking to is what it claims to be).

Introduction

In TLS (an updated replacement for SSL), a server is required to present a certificate as part of the initial connection setup. A client connecting to that server will perform the certification path validation algorithm:

  1. The subject of the certificate matches the hostname (i.e. domain name) to which the client is trying to connect;
  2. The certificate is signed by a trusted certificate authority.
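From the client side, the two checks above can be sketched with Python's standard `ssl` module. The function names `client_context` and `open_verified_connection` are hypothetical; the sketch assumes the server's certificate chains to a CA in the system trust store.

```python
import socket
import ssl


def client_context():
    """Build a client-side context that enforces both checks."""
    # create_default_context() loads the system's trusted CAs and sets
    # verify_mode=CERT_REQUIRED (check 2) and check_hostname=True (check 1)
    return ssl.create_default_context()


def open_verified_connection(hostname, port=443, timeout=5.0):
    """Connect over TLS; raises ssl.SSLError if validation fails."""
    context = client_context()
    sock = socket.create_connection((hostname, port), timeout=timeout)
    try:
        # server_hostname is the name the certificate is matched against
        return context.wrap_socket(sock, server_hostname=hostname)
    except ssl.SSLError:
        sock.close()
        raise
```

With this default context, a handshake with a server whose certificate does not validate (for example, an untrusted self-signed certificate) is expected to fail with `ssl.SSLCertVerificationError` rather than proceed.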

A TLS server may be configured with a self-signed certificate. When…


I recently encountered a very interesting problem with Azure file shares and would like to document the findings here.

Conclusion

Don’t migrate to a different technology without thorough testing. For example, if your application is NFS-based, don’t change it to CIFS/SMB-based without thorough testing, because features and performance may be very different.

Background

We are migrating a legacy client application into Azure. The application is very old and is Linux- and NFS-based (the OS, Python dependencies and code are all stored on an NFS server). There is no clear-cut dependency for each component. Therefore, we end up with 60 GB of binaries in unzipped form. Also, the application uses lots of…


Using a GPU on a managed machine learning platform (e.g. Azure Machine Learning, Amazon SageMaker, Google Cloud AI Platform) is easy, as lots of details are abstracted away. For example, when you use their machine learning SDK, your machine learning code is usually packaged into a Docker container with the supporting machine learning framework and executed on the platform, and you usually don’t need to build the container yourself, since the platform does all of this for you. …


Every day a new technology comes up, and we don’t have much time to digest it. When you only have time to scan through the description, you are often led to believe the technology can do something that it does not do. When you design solutions, you may then choose the wrong technology, which is frustrating. Big cloud vendors in particular have a lot of technologies whose functionality overlaps on the surface. How can we describe new technologies accurately in a short time? This article provides a quick recap of some new technologies. Currently the focus is on cloud technologies.

Let’s take an example, Azure…

Xin Cheng

Multi-cloud, Kubernetes, cloud-native, big data, machine learning, IoT developer
