SageMaker Introduction for Non-Developers

Widyanto H Nugroho
8 min read · Jul 10, 2022

What is AWS SageMaker?

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models.

It can build models trained on data stored in S3 buckets or on streaming data sources such as Kinesis shards. Once models are trained, SageMaker lets us deploy them into production with minimal effort.

  • The build step connects to other AWS services like S3 and transforms data in Amazon SageMaker notebooks.
  • The train step uses AWS SageMaker’s algorithms and frameworks, or our own, for distributed training.
  • Once training is complete, models can be deployed to Amazon SageMaker endpoints for real-time or batch predictions.

Some Benefits of Using AWS SageMaker

SageMaker comes with many built-in, optimized ML algorithms that are widely used for training. To build a model, we need data. We can either collect and prepare training data ourselves or pull it from Amazon S3 buckets, AWS's object storage service (somewhat like the hard drives in your system).

  • Highly Scalable
  • Fast Training
  • Maintains Uptime — Process keeps on running without any stoppage.
  • High Data Security

Let's see how we can use this service to build an end-to-end ML project.

Let's start, and I'll explain things as we go!

Step 1: Open AWS Management Console and Search for Amazon SageMaker.

Step 2: Open Studio and click on Launch SageMaker Studio.

Step 3: Click “+” to create a new SageMaker Studio

Step 4: Fill in the required information

You should then see the new Studio.

Step 5: Launch Studio from the Launch app dropdown.

Let's use SageMaker

Step 1: Launch a new notebook

Click “+” in the upper-left corner and choose a Python 3 notebook.

Step 2: Initiate the boto3 and SageMaker sessions

We need to instantiate the SageMaker session and client objects using the current boto3 session. This step is required before we can use any of SageMaker's features from the notebook.

import time
import boto3
import numpy as np # For matrix operations and numerical processing
import pandas as pd # For munging tabular data
import sagemaker
# Helper function for classification reports (see util folder):
from util.classification_report import generate_classification_report
# Setting up SageMaker parameters
boto_session = boto3.Session()
region = boto_session.region_name
bucket_name = sagemaker.Session().default_bucket()
bucket_prefix = "sm101/xgboost-dm" # Location in the bucket to store our files
sgmk_session = sagemaker.Session()
sgmk_client = boto_session.client("sagemaker")
sgmk_role = sagemaker.get_execution_role()

Step 3: Download and Explore the Data

In this example we will download the sample data used in the SageMaker tutorials, which is hosted in S3.

!wget -P data/ -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip

import zipfile

with zipfile.ZipFile("data/bank-additional.zip", "r") as zip_ref:
    print("Unzipping...")
    zip_ref.extractall("data")
print("Done")
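To take a first look at the data, we can load it with pandas. This is a minimal sketch, assuming the standard layout of the bank-additional dataset (a semicolon-separated bank-additional-full.csv inside the extracted folder):

# Load the extracted CSV (path and separator assume the standard
# bank-additional dataset layout):
df = pd.read_csv("data/bank-additional/bank-additional-full.csv", sep=";")
print(df.shape)
df.head()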

Step 4: Transform and do some Feature Engineering

After exploring the data, you can apply transformations and other feature engineering methods. I won't cover this step in depth since you already know it, but a hedged sketch follows below; it also prepares the S3 training inputs used in the training step.
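This is a minimal sketch, not a definitive preprocessing pipeline: it assumes the df loaded above, uses illustrative column handling for the direct-marketing dataset, and defines the s3_input_train and s3_input_validation objects that estimator.fit() expects later. Note that the built-in XGBoost algorithm's CSV format requires the target in the first column and no header row.

# Encode the target as 0/1 and one-hot encode the categorical features:
df["y"] = (df["y"] == "yes").astype(int)
model_data = pd.get_dummies(df.drop("y", axis=1))
model_data.insert(0, "y", df["y"])  # target must be the first column for XGBoost CSV input

# Split into ~70% train / 20% validation / 10% test:
train_data, validation_data, test_data = np.split(
    model_data.sample(frac=1, random_state=42),
    [int(0.7 * len(model_data)), int(0.9 * len(model_data))],
)

# Save as headerless CSVs and upload to S3:
train_data.to_csv("data/train.csv", index=False, header=False)
validation_data.to_csv("data/validation.csv", index=False, header=False)
train_uri = sgmk_session.upload_data("data/train.csv", bucket=bucket_name, key_prefix=bucket_prefix)
val_uri = sgmk_session.upload_data("data/validation.csv", bucket=bucket_name, key_prefix=bucket_prefix)

# Wrap the S3 locations as training inputs for estimator.fit():
s3_input_train = sagemaker.inputs.TrainingInput(train_uri, content_type="csv")
s3_input_validation = sagemaker.inputs.TrainingInput(val_uri, content_type="csv")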

Step 5: Load the Built-in SageMaker ML model

We'll be using SageMaker's built-in XGBoost algorithm, which benefits from performance-optimized, pre-implemented functionality such as multi-instance parallelization and support for multiple input formats.

In general to use the pre-built algorithms, we’ll need to:

  • Refer to the Common Parameters docs to see the high-level configuration and what features each algorithm has
  • Refer to the algorithm docs to understand the detail of the data formats and (hyper)-parameters it supports

From these docs, we’ll understand what data format we need to upload to S3 (next), and find the URI for the container image implementing the algorithm… which is listed on the Common Parameters page but can also be extracted through the SageMaker SDK:

# specify container
training_image = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.2-1")
print(training_image)

Step 6: Train the Model

Training a model on SageMaker follows the usual steps familiar from other ML libraries (e.g. scikit-learn):

  1. Initiate a session (we did this up top).
  2. Instantiate an estimator object for our algorithm (XGBoost).
  3. Define its hyperparameters.
  4. Start the training job.

SageMaker’s XGBoost includes 38 parameters. You can find more information about them here. For simplicity, we choose to experiment only with a few of them.

We can then train the model by instantiating an Estimator object from the SageMaker library, where we also define the training instance type and the hyperparameters.

# Instantiate an XGBoost estimator object
estimator = sagemaker.estimator.Estimator(
    image_uri=training_image,      # XGBoost algorithm container
    instance_type="ml.m5.xlarge",  # type of training instance
    instance_count=1,              # number of instances to be used
    role=sgmk_role,                # IAM role to be used
    max_run=20 * 60,               # maximum allowed active runtime (seconds)
    use_spot_instances=True,       # use spot instances to reduce cost
    max_wait=30 * 60,              # maximum clock time, including spot delays (seconds)
)

# Define its hyperparameters
estimator.set_hyperparameters(
    num_round=150,  # int: [1, 300]
    max_depth=5,    # int: [1, 10]
    alpha=2.5,      # float: [0, 5]
    eta=0.5,        # float: [0, 1]
    objective="binary:logistic",
)

# Start a training (fitting) job
estimator.fit({"train": s3_input_train, "validation": s3_input_validation})

Step 7: Deploy and Evaluate the Model

Now that we've trained the XGBoost algorithm on our data, deploying the model (hosting it behind a real-time endpoint) is just one function call!

This deployment might take up to 5–10 minutes, and by default the code will wait for the deployment to complete.

If you like, you can instead:

  • Un-comment the wait=False parameter (or if you already ran the cell, press the ⏹ "stop" button in the toolbar above)
  • Use the Endpoints page of the SageMaker Console to check the status of the deployment
  • Skip over the Evaluation section below (which won’t run until the deployment is complete), and start the Hyperparameter Optimization job — which will take a while to run too, so can be started in parallel.
# Real-time endpoint:
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    # wait=False,  # Remember, predictor.predict() won't work until deployment finishes!
    # We will also turn on data capture here, in case you want to experiment with monitoring later:
    data_capture_config=sagemaker.model_monitor.DataCaptureConfig(
        enable_capture=True,
        sampling_percentage=100,
        destination_s3_uri=f"s3://{bucket_name}/{bucket_prefix}/data-capture",
    ),
)

Step 8: Do the Inference

After we deploy the model, the endpoint becomes available and we can call it using the predictor.predict() method.

# The XGBoost endpoint consumes CSV, so attach CSV (de)serializers first:
predictor.serializer = sagemaker.serializers.CSVSerializer()
predictor.deserializer = sagemaker.deserializers.CSVDeserializer()
X_test_numpy = test_data.drop(["y"], axis=1).values
predictions = np.array(predictor.predict(X_test_numpy), dtype=float).squeeze()
predictions
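The predictions above can be evaluated against the held-out labels. Here is a quick sketch using scikit-learn (the generate_classification_report helper imported at the top could be used instead, assuming it is present in your util folder):

from sklearn.metrics import classification_report, roc_auc_score

y_true = test_data["y"].values
print("Test AUC:", roc_auc_score(y_true, predictions))
print(classification_report(y_true, (predictions > 0.5).astype(int)))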

Step 9 (Optional): Hyperparameter Optimization

We will use SageMaker HyperParameter Optimization (HPO) to automate the search effectively. Specifically, we specify a range (or, for categorical hyperparameters, a list of possible values) for each of the hyperparameters we plan to tune.

SageMaker hyperparameter tuning will automatically launch multiple training jobs with different hyperparameter settings, evaluate results of those training jobs based on a predefined “objective metric”, and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will specify the maximum number of HPO tries (max_jobs) and how many of these can happen in parallel (max_parallel_jobs).

Tip: max_parallel_jobs creates a trade-off between performance and speed (better hyperparameter values vs how long it takes to find these values). If max_parallel_jobs is large, then HPO is faster, but the discovered values may not be optimal. Smaller max_parallel_jobs will increase the chance of finding optimal values, but HPO will take more time to finish.

Next we’ll specify the objective metric that we’d like to tune and its definition, which includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using built-in XGBoost algorithm here, it emits two predefined metrics: validation:auc and train:auc.

Area Under the ROC Curve (AUC) measures the ability of a binary ML model to predict a higher score for positive examples as compared to negative examples. See Machine Learning Key Concepts

We elected to monitor validation:auc as you can see below. In this case (because it's pre-built for us), we only need to specify the metric name.

For more information, please refer to the SageMaker HPO documentation here.

# Import required HPO objects
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Set up hyperparameter ranges
ranges = {
    "num_round": IntegerParameter(1, 300),
    "max_depth": IntegerParameter(1, 10),
    "alpha": ContinuousParameter(0, 5),
    "eta": ContinuousParameter(0, 1),
}

# Set up the objective metric
objective = "validation:auc"

# Instantiate an HPO object
tuner = HyperparameterTuner(
    estimator=estimator,              # the SageMaker estimator object
    hyperparameter_ranges=ranges,     # the ranges of hyperparameters
    max_jobs=10,                      # total number of HPO jobs
    max_parallel_jobs=2,              # how many HPO jobs can run in parallel
    strategy="Bayesian",              # the internal optimization strategy of HPO
    objective_metric_name=objective,  # the objective metric to be used for HPO
    objective_type="Maximize",        # maximize or minimize the objective metric
)

Launch HPO

Now we can launch a hyperparameter tuning job by calling the fit() function. After the tuning job is created, we can track its progress on the Training > Hyperparameter tuning jobs page of the SageMaker console.

# start HPO
tuner.fit({"train": s3_input_train, "validation": s3_input_validation})

HPO jobs often take quite a long time to finish, so you may sometimes want to free up the notebook and resume waiting later.

Just like with the Estimator, we won't be able to deploy() the model until the HPO tuning job is complete; the status is visible through both the AWS Console and the SageMaker API. We could, for example, write a polling script like the one below:
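This is a minimal polling sketch, assuming the tuning job was started with tuner.fit() above (tuner.wait() is the simpler built-in alternative if you just want to block):

import time

tuning_job_name = tuner.latest_tuning_job.name
while True:
    desc = sgmk_client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name
    )
    status = desc["HyperParameterTuningJobStatus"]
    print("Tuning job status:", status)
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)  # poll once a minute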

Deploy HPO Model

# Deploy the best model from HPO
hpo_predictor = tuner.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=sagemaker.serializers.CSVSerializer(),
    deserializer=sagemaker.deserializers.CSVDeserializer(),
)
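Because this predictor was created with CSV (de)serializers, it can be queried the same way as in Step 8, reusing the X_test_numpy array from there:

# Predictions from the tuned model:
hpo_predictions = np.array(hpo_predictor.predict(X_test_numpy), dtype=float).squeeze()
hpo_predictions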

Step 10: Releasing Cloud Resources

It's generally good practice to delete all endpoints that are not in use.

Run the following lines to delete the two endpoints that were created earlier.

predictor.delete_endpoint(delete_endpoint_config=True)
hpo_predictor.delete_endpoint(delete_endpoint_config=True)

Best Practice

Using SageMaker for ML projects is not cheap, so you need to follow some best practices to keep costs low and avoid unnecessary instance usage: delete endpoints you're no longer using (as above), train with spot instances where possible (as we did with use_spot_instances=True), and shut down idle notebooks and Studio apps.

Finale

SageMaker's built-in algorithms are great for getting a first model fast, and the service as a whole can make your MLOps easier.

As mentioned here, the best way to succeed with a built-in algorithm is to read the algorithm's doc pages carefully, to understand what data format and parameters it needs!
