Hello world to Amazon SageMaker Feature Store

Xin Cheng
May 16, 2021


Amazon SageMaker Feature Store was announced at re:Invent 2020. But what is a feature store? Below are some articles about feature stores.

The main code I use is below (thanks to Chris Fregly and Antje Barth and their excellent community).

The usual workflow of using feature store is:

  1. Feature engineering: Generate features -> Define feature store group -> ingest features into feature store group
  2. Training: Discover usable features -> Load features -> Train model with features
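
To make the "define feature group" step a bit more concrete: a feature group is described by a list of feature definitions, each with a name and one of three types (String, Integral, Fractional). The sketch below is only an illustration under my own assumptions: the helper names and the sample record (review_id, star_rating, sentiment_score, event_time) are hypothetical and not taken from the notebook, and it simply guesses each type from one sample record.

```python
def infer_feature_type(value):
    """Map a Python value to a Feature Store feature type name."""
    if isinstance(value, bool):
        # This sketch's own choice: keep booleans out of the numeric types.
        return "String"
    if isinstance(value, int):
        return "Integral"
    if isinstance(value, float):
        return "Fractional"
    return "String"


def feature_definitions(sample_record):
    """Build one name/type entry per column of a sample record."""
    return [
        {"FeatureName": name, "FeatureType": infer_feature_type(value)}
        for name, value in sample_record.items()
    ]


# Hypothetical record: an identifier, two features and an event time
# (Unix seconds), which every feature group needs alongside a record id.
sample = {
    "review_id": "R1",
    "star_rating": 5,
    "sentiment_score": 0.93,
    "event_time": 1621104762.0,
}
definitions = feature_definitions(sample)
for definition in definitions:
    print(definition)
```

In the notebook this bookkeeping is done for you from a pandas DataFrame; the point here is only to show what a definition looks like.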

One important feature of a feature store is reusability: features can be shared with other people, and you can time-travel to earlier feature values for model reproducibility.
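
Time travel works because each record carries an event time and older versions are kept rather than overwritten, so you can ask "what did this feature look like at time T?". Here is a minimal sketch of that idea in plain Python, not Feature Store's actual implementation. The function name, the record_id / event_time / score keys and the sample history are all hypothetical.

```python
def latest_as_of(records, cutoff, id_key="record_id", time_key="event_time"):
    """Return, per record id, the newest version with event time <= cutoff."""
    latest = {}
    for rec in records:
        if rec[time_key] > cutoff:
            continue  # written after the point in time we want to reproduce
        current = latest.get(rec[id_key])
        if current is None or rec[time_key] >= current[time_key]:
            latest[rec[id_key]] = rec
    return list(latest.values())


# Hypothetical history: record "a" was updated once, record "b" never.
history = [
    {"record_id": "a", "event_time": 100.0, "score": 0.1},
    {"record_id": "a", "event_time": 200.0, "score": 0.7},
    {"record_id": "b", "event_time": 150.0, "score": 0.4},
]

# Features as a model trained at time 160 would have seen them.
snapshot = latest_as_of(history, cutoff=160.0)
for rec in snapshot:
    print(rec)
```

Re-running with a later cutoff picks up the updated version of "a", which is what lets you rebuild the exact training set of an older model.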

Currently, SageMaker Feature Store works by creating an associated Athena table and AWS Glue catalog entry, while the underlying feature data is stored in S3. You load features back through Athena (yes, you still need to know the details of the underlying store, which I hope will be abstracted away in the future: one interface for writing and reading).

Below are the layouts in Athena and Glue:

The above notebook shows how to create a feature group, create feature definitions, and ingest features into the feature group. Now suppose another user wants to use the features. They need to read them from Athena (a serverless query service for data in S3). Below we use the feature_group variable created in the notebook to find information about the underlying Athena table that stores our features:

feature_query = feature_group.athena_query()
feature_query

feature_query is AthenaQuery(catalog='AwsDataCatalog', database='sagemaker_featurestore', table_name='reviews-feature-group-15-18-52-42-1621104762', sagemaker_session=<sagemaker.session.Session object at 0x7ff922069ad0>, _current_query_execution_id=None, _result_bucket=None, _result_file_prefix=None)

The type is sagemaker.feature_store.feature_group.AthenaQuery

While you still have the feature_group variable, you can load the ingested features as follows. You can also review feature groups and the Athena query from SageMaker Studio.

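# `bucket` is assumed to be defined already, e.g. sagemaker.Session().default_bucket()
query_string = 'SELECT * FROM "sagemaker_featurestore"."reviews-feature-group-15-16-28-13-1621096093" LIMIT 1000'
# Run the query in Athena and write the results to S3
feature_query.run(query_string=query_string, output_location='s3://' + bucket + '/query_results/')
feature_query.wait()  # block until the Athena query finishes
dataset = feature_query.as_dataframe()  # load the results into a pandas DataFrame
dataset.head()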
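
The table name is generated per feature group, so hard-coding it is brittle. Since the AthenaQuery object shown earlier already carries database and table_name, one option is to build the query string from those values. This is just a sketch: build_select is a hypothetical helper of mine, not part of the SageMaker SDK.

```python
def build_select(database, table_name, limit=1000):
    """Build a SELECT * query with double-quoted Athena identifiers."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    for identifier in (database, table_name):
        if '"' in identifier:
            # Refuse names that would break out of the quoted identifier.
            raise ValueError("identifier must not contain double quotes")
    return f'SELECT * FROM "{database}"."{table_name}" LIMIT {limit}'


# Values copied from the example in this post.
query_string = build_select(
    "sagemaker_featurestore",
    "reviews-feature-group-15-16-28-13-1621096093",
)
print(query_string)
```

In the notebook you would pass feature_query.database and feature_query.table_name instead of the literal strings.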

Output

You can also query with PyAthena:

import boto3
import pandas as pd
import sagemaker
from pyathena import connect
# Default S3 bucket and region of the current SageMaker session
sess = sagemaker.Session()
bucket = sess.default_bucket()
region = boto3.Session().region_name
# Athena writes query results to this S3 staging location
s3_staging_dir = "s3://{}/athena/query-cache".format(bucket)
conn = connect(region_name=region, s3_staging_dir=s3_staging_dir)
query_string = 'SELECT * FROM "sagemaker_featurestore"."reviews-feature-group-15-16-28-13-1621096093" LIMIT 1000'
# Read the query results straight into a pandas DataFrame
df = pd.read_sql(query_string, conn)
df.head()

Feature store is still nascent (as is the whole production machine learning industry). Can it become the centralized feature management center?
