Thursday, July 19, 2018

Using Google Cloud ML Engine to train a regression model

Submit a model training job from Google Cloud Datalab


Photo by Gabriel Santiago on Unsplash
Google Cloud Platform is a useful tool for running machine learning code and processes. The benefits are many: easy setup, storage in the cloud, an ML toolbox, cloud VM resources, and the other well-known advantages of the cloud. This article is specifically about submitting a job (task) for a training model created in a previous Jupyter notebook. The model is built using TensorFlow and the Google Cloud Datalab Machine Learning Toolbox, which contains out-of-the-box models. This sample uses a regression model, implemented as a deep neural network.
The code and some of the explanation excerpts are from the US census regression model in the Google Cloud Platform Datalab sample docs. Based on several inputs, the model is trained to predict wages. The notebook uses Google Cloud Machine Learning Engine to submit the jobs that train the model; a soon-to-be-posted article will deploy the resulting model for predictions.
When a job is submitted to ML Engine, this is what happens:
  • The code for the job is staged in Google Cloud Storage, and a job definition is submitted to the service.
  • The service queues the job, and thereafter the job can be monitored in the console (status and logs), as well as using TensorBoard.
  • The service also:
    • provisions computation resources based on the choice of scale tier
    • installs your code package and its dependencies
    • starts the training process
  • Thereafter, the service monitors the job for completion and retries it if necessary; the requester can also follow progress in TensorBoard.

Before you begin: ensure that you have Google Cloud Platform activated; the setup is easy, fast, and initially free. There are several good articles that detail the setup; here's one (thanks to Amulya Aankul): https://towardsdatascience.com/running-jupyter-notebook-in-google-cloud-platform-in-15-min-61e16da34d52. To summarize:
  • Set up a Virtual Machine (VM) and start it
  • Use Cloud Shell or SSH to connect
  • Enable the Cloud Machine Learning Engine API (a command-line option is shown below)
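If you prefer the command line, the ML Engine API can also be enabled from Cloud Shell with gcloud (the console's API Library page works just as well):
gcloud services enable ml.googleapis.com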
Open a notebook using SSH or through the Cloud Shell.

Setup

Set up the workspace and the Google Cloud Storage location that will hold inputs and outputs:
import google.datalab as datalab                 #work in datalab
import google.datalab.ml as ml                   #ml engine
import mltoolbox.regression.dnn as regression    #regression model
import os                                        #OS path-file join
import time                                      #time stamp file
#Setup workspace and Google Cloud Storage to hold inputs and outputs
storage_bucket = 'gs://' + datalab.Context.default().project_id + '-datalab-workspace'
storage_region = 'us-central1'
workspace_path = os.path.join(storage_bucket, 'census')
#Set the training, evaluation (testing) and schema path+files.
train_data_path = os.path.join(workspace_path, 'data/train.csv')
eval_data_path = os.path.join(workspace_path, 'data/eval.csv')
schema_path = os.path.join(workspace_path, 'data/schema.json')
#ml datasets
train_data = ml.CsvDataSet(file_pattern=train_data_path, schema_file=schema_path)
eval_data = ml.CsvDataSet(file_pattern=eval_data_path, schema_file=schema_path)
analysis_path = os.path.join(workspace_path, 'analysis')
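The code above assumes the bucket already exists and that train.csv, eval.csv, and schema.json are already under the data folder of the workspace. If you are starting from scratch, a sketch like the following can stage them; the local source path is only a placeholder for wherever your copies of the census files live:
#create the workspace bucket in the chosen region (skip if it already exists)
!gsutil mb -l {storage_region} {storage_bucket}
#copy the census data files into the workspace (source path is a placeholder)
!gsutil cp /path/to/census/train.csv /path/to/census/eval.csv /path/to/census/schema.json {workspace_path}/data/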

Training

The training data was previously analyzed to produce statistics and vocabularies; these will be used during training (the 'analysis' path in the last line of the code above).
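For reference, that analysis step was run in the earlier notebook with the same toolbox, roughly like this (a sketch; rerun it only if the statistics and vocabularies are not already under the analysis folder):
#analyze the training data to produce statistics and vocabularies
regression.analyze(dataset=train_data, output_dir=analysis_path)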
#configure job to Cloud Machine Learning Engine for submission
# - unique job name within project
# - select a region, usually same region as training data 
# - select a scale tier, BASIC - simple single node cluster
config = ml.CloudTrainingConfig(region=storage_region, scale_tier='BASIC')
training_job_name = 'census_regression_' + str(int(time.time()))
training_path = os.path.join(workspace_path, 'training')
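BASIC runs the job on a single worker, which is plenty for this sample. If you need more capacity, the same config accepts other Cloud ML Engine scale tiers, for example a small distributed cluster:
#example: a small distributed cluster instead of a single node
#config = ml.CloudTrainingConfig(region=storage_region, scale_tier='STANDARD_1')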
Additionally, there is a special key column: any column in the data that can be used to uniquely identify instances. The value of this column is ignored during training, but it is quite useful when using the resulting model for prediction. In this case, it is the serial number (SERIALNO).
The code below defines how the data is transformed to make it easier for the model to train on. WAGP (wages) is the target. The rest of the fields are inputs, transformed through embedding or one-hot encoding. A detailed explanation of these methods is beyond the scope of this article, but in short they map each column's values into numeric representations that make the ML process possible and more efficient; a tiny illustration follows the feature definition below.
features = {
  "WAGP": {"transform": "target"},
  "SERIALNO": {"transform": "key"},
  "AGEP": {"transform": "embedding", "embedding_dim": 2},  # Age
  "COW": {"transform": "one_hot"},                         # Class of worker
  "ESP": {"transform": "embedding", "embedding_dim": 2},   # Employment status of parents
  "ESR": {"transform": "one_hot"},                         # Employment status
  "FOD1P": {"transform": "embedding", "embedding_dim": 3}, # Field of degree
  "HINS4": {"transform": "one_hot"},                       # Medicaid
  "INDP": {"transform": "embedding", "embedding_dim": 5},  # Industry
  "JWMNP": {"transform": "embedding", "embedding_dim": 2}, # Travel time to work
  "JWTR": {"transform": "one_hot"},                        # Transportation
  "MAR": {"transform": "one_hot"},                         # Marital status
  "POWPUMA": {"transform": "one_hot"},                     # Place of work
  "PUMA": {"transform": "one_hot"},                        # Area code
  "RAC1P": {"transform": "one_hot"},                       # Race
  "SCHL": {"transform": "one_hot"},                        # School
  "SCIENGRLP": {"transform": "one_hot"},                   # Science
  "SEX": {"transform": "one_hot"},
  "WKW": {"transform": "one_hot"}                          # Weeks worked
}
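To make the two transforms concrete, here is a tiny standalone illustration; it is not part of the sample, and the toolbox performs this mapping internally:
#illustration only: one-hot maps each category to a sparse 0/1 vector,
#while an embedding maps it to a small dense vector learned during training
categories = ['employee', 'self_employed', 'government']
one_hot = {c: [1 if i == j else 0 for j in range(len(categories))]
           for i, c in enumerate(categories)}
print(one_hot['self_employed'])   # [0, 1, 0]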

Submit the Training Job

#submit the job - may take several minutes
job = regression.train_async(train_dataset=train_data, eval_dataset=eval_data,
             features=features,             #set defined above
             analysis_dir=analysis_path,    #analysis folder
             output_dir=training_path,      #output folder
             max_steps=2000,                #max number of training steps
             layer_sizes=[5, 5, 5],         #three hidden layers of 5 nodes each
             job_name=training_job_name,    #job name
             cloud=config)                  #config-region, tier
Once you run the command, it should show something like this:
Building package and uploading to gs://your-project-name-datalab-workspace/census/training/staging/trainer.tar.gz
Job request send. View status of job at
https://console.developers.google.com/ml/jobs?project=your-project-name
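You can also check on the job from a notebook cell (or Cloud Shell) with gcloud; the job name is the one generated above:
#describe the job's current state and stream its logs (optional)
!gcloud ml-engine jobs describe {training_job_name}
!gcloud ml-engine jobs stream-logs {training_job_name}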
You can run TensorBoard to see the job status in graphical format:
tensorboard_pid = ml.TensorBoard.start(training_path)
Output:
TensorBoard was started successfully with pid 4081. Click here to access it.

#stop TensorBoard when you no longer need it
ml.TensorBoard.stop(tensorboard_pid)
#block until the training job finishes
job.wait()
Output:
Job census_regression_1530833694 completed

The Trained Model

Once training is completed, the resulting trained model is saved and placed into Cloud Storage.
!gsutil ls -r {training_path}/model  #list the folders, contents
Output:
gs://cloud-ml-users-datalab-workspace/census/training/model/:
gs://cloud-ml-users-datalab-workspace/census/training/model/
gs://cloud-ml-users-datalab-workspace/census/training/model/saved_model.pb

gs://cloud-ml-users-datalab-workspace/census/training/model/assets.extra/:
gs://cloud-ml-users-datalab-workspace/census/training/model/assets.extra/
gs://cloud-ml-users-datalab-workspace/census/training/model/assets.extra/features.json
gs://cloud-ml-users-datalab-workspace/census/training/model/assets.extra/schema.json

gs://cloud-ml-users-datalab-workspace/census/training/model/variables/:
gs://cloud-ml-users-datalab-workspace/census/training/model/variables/
gs://cloud-ml-users-datalab-workspace/census/training/model/variables/variables.data-00000-of-00001
gs://cloud-ml-users-datalab-workspace/census/training/model/variables/variables.index
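If you want a peek at the exported model's input and output signatures before the deployment article, TensorFlow's saved_model_cli can show them once the model is copied locally (a sketch; the /tmp destination is just an example):
#copy the exported SavedModel locally and inspect its signatures
!gsutil -m cp -r {training_path}/model /tmp/
!saved_model_cli show --dir /tmp/model --all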
There! You have successfully submitted a model training job on Google Cloud Platform.
This article was posted on Medium.com as well.