ML context

  • ML context is the starting point for all operations executed with ML.NET, such as loading data, creating and evaluating models, and getting detailed information about what is happening in the pipeline (errors and other events).
  • ML context can also be given a seed, which makes operations such as data splitting repeatable and helps others who run your code achieve similar results.
using Microsoft.ML;
var context = new MLContext();
Console.WriteLine("Hello, World!");
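
The seeding mentioned above can be sketched as follows. The seed value 42 and the variable names are arbitrary choices for illustration, not something the project prescribes.

```csharp
using System;
using Microsoft.ML;

// Passing a fixed seed makes randomized operations (for example,
// data splits) repeatable across runs. The value 42 is arbitrary.
var seededContext = new MLContext(seed: 42);

// Without a seed, ML.NET chooses its own random state, so results
// may differ slightly from run to run.
var unseededContext = new MLContext();

Console.WriteLine("Contexts created.");
```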

Types of operations

  • ML context provides the operations that the ML.NET API can perform. They fall into four categories: data operations, model operations, data transformations, and algorithms used for training (trainers).
    1. The data operations category includes methods that allow ML.NET to load data from different sources.
    2. The model operations category includes methods that can be executed on the model itself, either on an existing model or a completely new one.
    3. The data transformations category includes methods used to process the data to get it into specific formats that machine learning algorithms might require.
    4. The trainers category is a set of algorithms built into ML.NET that can be used for different machine-learning scenarios.

Data operations

  • The data loading and handling methods are available through the Data property of ML context (context.Data). These are:
    1. LoadFromBinary: Loads data from a binary file.
    2. LoadFromTextFile: Loads data from text files, including CSVs.
    3. LoadFromEnumerable: Loads data from in-memory collections such as arrays or lists.
    4. CreateDatabaseLoader: Creates a loader that retrieves data from a relational database such as SQL Server.
    5. Filter methods (for example, FilterRowsByColumn): Keep only the rows that match a condition.
    6. TrainTestSplit: Divides source data into a set for training a model and a set for testing and evaluating the trained model.
    7. ShuffleRows: Randomizes the order in which rows are processed during training, which helps when the source data is sorted or grouped (for example, by label).
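
The data operations above can be sketched together as follows. The Message class and its rows are hypothetical stand-ins for a real dataset, and the test split is only approximately 20%, so no exact row counts are promised.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

var context = new MLContext(seed: 1);

// Hypothetical in-memory rows standing in for a real dataset.
var rows = Enumerable.Range(0, 10)
    .Select(i => new Message { Text = $"message {i}", Length = i })
    .ToList();

// LoadFromEnumerable turns the collection into an IDataView.
IDataView data = context.Data.LoadFromEnumerable(rows);

// Keep only rows whose Length is at least 2 (the lower bound is inclusive).
IDataView filtered = context.Data.FilterRowsByColumn(data, "Length", lowerBound: 2);

// Randomize the row order, then hold out roughly 20% of rows for testing.
IDataView shuffled = context.Data.ShuffleRows(filtered);
var split = context.Data.TrainTestSplit(shuffled, testFraction: 0.2);

int trainCount = context.Data
    .CreateEnumerable<Message>(split.TrainSet, reuseRowObject: false).Count();
int testCount = context.Data
    .CreateEnumerable<Message>(split.TestSet, reuseRowObject: false).Count();
Console.WriteLine($"Train rows: {trainCount}, test rows: {testCount}");

public class Message
{
    public string Text { get; set; } = "";
    public float Length { get; set; }
}
```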

Model operations

  • Model operations are executed on the model itself, for example to start making predictions or to save a model once it is good enough.
    1. Load: As its name suggests, it is used for loading an existing model.
    2. Save: As its name suggests, it is used for saving a model.
    3. CreatePredictionEngine: This method will let you make a prediction given a specific model input.
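
A minimal sketch of the three model operations is shown below. To keep it short, the "model" is just a fitted Concatenate transform rather than a trained algorithm. The Measurement and MeasurementFeatures classes and the file name example-model.zip are hypothetical.

```csharp
using System;
using Microsoft.ML;

var context = new MLContext();

// A deliberately trivial "model": a fitted transform that packs two numeric
// columns into one Features vector. A real model would end in a trainer.
IDataView trainingData = context.Data.LoadFromEnumerable(new[]
{
    new Measurement { Width = 1f, Height = 2f },
    new Measurement { Width = 3f, Height = 4f },
});
ITransformer model = context.Transforms
    .Concatenate("Features", nameof(Measurement.Width), nameof(Measurement.Height))
    .Fit(trainingData);

// Save: persists the model together with the schema of its input data.
const string modelPath = "example-model.zip"; // hypothetical file name
context.Model.Save(model, trainingData.Schema, modelPath);

// Load: restores the model and reports the input schema it expects.
ITransformer loaded = context.Model.Load(modelPath, out DataViewSchema inputSchema);

// CreatePredictionEngine: runs the model on a single input object.
var engine = context.Model.CreatePredictionEngine<Measurement, MeasurementFeatures>(loaded);
MeasurementFeatures output = engine.Predict(new Measurement { Width = 5f, Height = 6f });
Console.WriteLine(string.Join(", ", output.Features));

public class Measurement
{
    public float Width { get; set; }
    public float Height { get; set; }
}

public class MeasurementFeatures
{
    public float[] Features { get; set; } = Array.Empty<float>();
}
```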

Data transformations

  • Your data may need to be changed during processing, which is where data transformations come into play.
    1. Categorical: Applicable when working with data that can be categorized.
    2. Conversion: Applicable when converting from one data type to another.
    3. Text: Transforms the string values of text columns into numerical values that algorithms can work with.
    4. ReplaceMissingValues: Replaces missing values with other values, such as a default value or the column mean.
    5. DropColumns: Removes specific columns that might not be required.
    6. Concatenate: Concatenates multiple columns into one column.
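
Several of these transformations can be chained into one pipeline, as in the sketch below. The Person class, its rows, and the choice of mean replacement are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

var context = new MLContext();

// Hypothetical rows; float.NaN marks a missing Age.
IDataView data = context.Data.LoadFromEnumerable(new[]
{
    new Person { Name = "a", City = "Oslo",  Age = 20f,       Income = 1f },
    new Person { Name = "b", City = "Paris", Age = float.NaN, Income = 2f },
    new Person { Name = "c", City = "Oslo",  Age = 40f,       Income = 3f },
});

var pipeline = context.Transforms
    // ReplaceMissingValues: fill missing Age values using the column mean.
    .ReplaceMissingValues("Age",
        replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean)
    // Categorical: one-hot encode the City strings.
    .Append(context.Transforms.Categorical.OneHotEncoding("CityEncoded", "City"))
    // Concatenate: merge the numeric columns into one Features vector.
    .Append(context.Transforms.Concatenate("Features", "Age", "Income", "CityEncoded"))
    // DropColumns: remove a column the model does not need.
    .Append(context.Transforms.DropColumns("Name"));

IDataView transformed = pipeline.Fit(data).Transform(data);

// List the visible columns and the Age values after replacement.
var visibleColumns = transformed.Schema.Where(c => !c.IsHidden).Select(c => c.Name);
Console.WriteLine("Columns: " + string.Join(", ", visibleColumns));
Console.WriteLine("Ages: " + string.Join(", ", transformed.GetColumn<float>("Age")));

public class Person
{
    public string Name { get; set; } = "";
    public string City { get; set; } = "";
    public float Age { get; set; }
    public float Income { get; set; }
}
```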

Trainers

  • ML.NET comes with a set of built-in algorithms that are used for training your model based on input data for different scenarios—these algorithms are often referred to as trainers.
  • ML.NET provides algorithms for scenarios such as clustering, regression, anomaly detection, ranking, multiclass or binary classification, etc.
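
As a rough illustration of using a trainer directly, the sketch below fits a binary classifier on a handful of made-up messages. The Sample and Prediction classes, the example texts, and the choice of SdcaLogisticRegression are assumptions for this sketch; it is not the trainer Model Builder selects later in this document, and a dataset this small will not give meaningful predictions.

```csharp
using System;
using Microsoft.ML;

var context = new MLContext(seed: 0);

// Tiny hypothetical training set; real training needs far more rows.
IDataView data = context.Data.LoadFromEnumerable(new[]
{
    new Sample { Text = "win a free prize now", IsSpam = true },
    new Sample { Text = "free money click here", IsSpam = true },
    new Sample { Text = "claim your free prize today", IsSpam = true },
    new Sample { Text = "meeting moved to monday", IsSpam = false },
    new Sample { Text = "please review the attached report", IsSpam = false },
    new Sample { Text = "lunch at noon tomorrow", IsSpam = false },
});

// Turn the text into a numeric Features vector, then train on it.
var pipeline = context.Transforms.Text
    .FeaturizeText("Features", nameof(Sample.Text))
    .Append(context.BinaryClassification.Trainers.SdcaLogisticRegression(
        labelColumnName: nameof(Sample.IsSpam),
        featureColumnName: "Features"));

ITransformer model = pipeline.Fit(data);

// Predict on one new message.
var engine = context.Model.CreatePredictionEngine<Sample, Prediction>(model);
Prediction prediction = engine.Predict(new Sample { Text = "free prize inside" });
Console.WriteLine($"Spam: {prediction.PredictedLabel}, probability: {prediction.Probability}");

public class Sample
{
    public string Text { get; set; } = "";
    public bool IsSpam { get; set; }
}

public class Prediction
{
    public bool PredictedLabel { get; set; }
    public float Probability { get; set; }
    public float Score { get; set; }
}
```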

Adding a Model

  • In the POCML project, right-click and choose Add -> "Machine Learning Model", select "Machine Learning Model (ML.NET)", name it "DataClassificationModel.mbconfig", and click Add.
  • The DataClassificationModel.mbconfig file is added, and the Model Builder UI is displayed.

Scenarios and tasks

  • Before building the model, select the scenario.
  • A scenario is how Model Builder describes the type of prediction that you want to make with your data, which correlates to an ML task. The task is the type of prediction based on the question being asked.
  • Think of the scenario as a wrapper around a task; the task specifies the prediction type and uses various trainers (algorithms) to train the model.
  • Use the dataset to predict whether a text is a spam message (or not) using binary classification.
  • Binary classification doesn’t appear as an option within the Model Builder UI, so it helps to understand how the scenarios listed there relate to ML tasks. The binary classification task corresponds to the data classification scenario.
  • Figure: the relationships between ML tasks and Model Builder scenarios.
  • Binary classification is used to understand if something is positive or negative, if an email is a spam message or not, or in general, whether a particular item has a specific trait or property (or not).

Training with Model Builder

  • Choose "Data classification" as the scenario. On the "Select training environment" screen, select "Local (CPU)".
  • Click Next to add data to the model. Import the "spam_assassin_tiny.csv" file and set Column to predict (Label) to target, the column that indicates whether the text is spam (1) or not (0).
    1. ML.NET will predict the value of this column based on what it reads from the text column of the dataset.
    2. Click Advanced data options.
      1. Under Column settings, Model Builder has identified that the dataset contains two columns, text and target.
      2. The text column trains the model; it contains the actual email messages.
      3. In contrast, the target column contains the value to predict (1 or 0).
  • Click "Next Step", then click "Start Training" to train the model with the dataset.
    1. The time required to train the model is, in most cases, directly proportional to the size of the dataset.
    2. The larger the dataset, the more computing resources and time are required.
    3. Typically, time is available; computing resources, however, are mostly limited to the specs of the environment used.
    4. Instead of using the complete dataset, which includes 5,329 rows, I created a tiny subset with only the first 100 rows (99, given that the first row is a header).
    5. While training takes place, the different trainers (algorithms) available for the ML task are used.
    6. Advanced training options
      1. These list the trainers that are available and used. By default, all the trainers are selected.
      2. It is also possible to use fewer trainers by unchecking one or more.
      3. ML.NET has good documentation that dives deeper into what these algorithms do and how to choose one; it is reachable through the "When should I use each algorithm" option. The point at this stage is that the predefined algorithms can be enabled or disabled in case the evaluation results are not optimal.

Evaluating with Model Builder

  • Click Next Step to reach the Evaluate option.
    1. From the spam_assassin_tiny.csv file, copy one of the rows (I copied the 6th row, without the target column value) and paste it into the text field above the Predict button.
    2. After clicking Predict, see how the model predicts based on the text input.
      1. The prediction is correct, given that in this case the text input is not spam but a legitimate message.
      2. Thus, the result is 0 (not spam) with 67% certainty. The percentage is not a true mathematical probability; it is sometimes called a pseudo-probability.
    3. Feel free to try other text input from the complete dataset.
      1. Remember that the evaluation process is an opportunity to tweak and improve the model if the results are not as expected.
      2. Because this model was trained with a tiny dataset rather than the entire dataset, the percentages (confidence) of the results will not be as high (when correct) or as low (when incorrect) as they would be if the model had been trained using the complete dataset.
      3. So, evaluate as many times as needed and feel free to retrain the model with a slightly larger dataset to improve the accuracy of the results.
    4. The main reason I chose a small dataset was to save time. Try training the model with the large dataset.

Consuming the model

  • Click "Next Step" to consume (use) the model created by Model Builder within our application.
    1. Model Builder makes this very easy. There are two options: copy the code snippet, or use one of the available project templates, which can be added to the VS solution.
    2. I used the first option: copy the code snippet and paste it into the Main method of Program.cs.
// See https://aka.ms/new-console-template for more information
using Microsoft.ML;
using POC_ML;

var context = new MLContext();
Console.WriteLine("Hello, World!");

//Load sample data
var sampleData = new DataClassificationModel.ModelInput()
{
    Text = @"From gort44@excite.com Mon Jun 24 17:54:21 2002 Return-Path: gort44@excite.com Delivery-Date: ...tit4unow.com/pos******************",
};

//Load model and predict output
var result = DataClassificationModel.Predict(sampleData);
Console.WriteLine($"Predicted: {result.PredictedLabel}");
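
The generated ModelOutput also exposes a Score array alongside PredictedLabel. The sketch below shows one way to read a confidence value out of such an array, assuming it holds one score per class and that the largest entry belongs to the predicted class. TopScore is a hypothetical helper, not part of the generated code, and the scores are made up. Which label each index corresponds to depends on the model's key mapping, which is not shown here.

```csharp
using System;

// Returns the index of the highest score together with the score itself.
static (int Index, float Confidence) TopScore(float[] scores)
{
    if (scores is null || scores.Length == 0)
        throw new ArgumentException("At least one score is required.", nameof(scores));

    int best = 0;
    for (int i = 1; i < scores.Length; i++)
    {
        if (scores[i] > scores[best]) best = i;
    }
    return (best, scores[best]);
}

// Hypothetical scores for a two-class model, for illustration only.
var (index, confidence) = TopScore(new[] { 0.67f, 0.33f });
Console.WriteLine($"Class index {index} with confidence {confidence}");
```

In the project itself, the array passed in would be result.Score from the Predict call above.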

Model Builder generated 4 files: DataClassificationModel.consumption.cs, *.evaluate.cs, *.training.cs and *.mlnet

  • Behind the scenes a model was created.
  • DataClassificationModel.consumption.cs
    1. The generated code responsible for creating the prediction engine and invoking it, allowing the consumption of the model.
    2. Microsoft.ML - includes ML.NET core methods, such as the trainers (algorithms).
    3. Microsoft.ML.Data - contains ML.NET methods that interact with the dataset used by the model.
    4. DataClassificationModel class - a partial class, declared in both the DataClassificationModel.consumption.cs and DataClassificationModel.training.cs files.
    5. ModelInput class - used as the model’s input.
    6. ModelOutput class - used as the model’s output.
      1. It contains Text, Target, Features, PredictedLabel (the predicted value) and Score (the confidence obtained for the results) properties.
      2. In ModelOutput the Target property is an unsigned integer (uint), while in ModelInput it is a string. This is because the training pipeline's MapValueToKey transform converts the target values into keys, which ML.NET stores as unsigned integers.
    7. MLNetModelPath variable - the path to the model file (DataClassificationModel.mlnet, which appears to be a zip archive). The file includes the model’s schema, training information, and transformer chain metadata.
    8. PredictEngine variable - holds a lazily created reference to the prediction engine (PredictionEngine) used to make predictions with the trained model.
      1. The first parameter is a lambda function that creates the engine, () => CreatePredictEngine(), and the second parameter (true) makes the lazy initialization thread-safe, so the engine is created only once even if several threads request it.
    9. Predict method - makes predictions based on the model.
    10. CreatePredictEngine method - creates the prediction engine instance. The out var _ argument discards the model's input schema (modelInputSchema) returned by Load.
// This file was auto-generated by ML.NET Model Builder.
using Microsoft.ML;
using Microsoft.ML.Data;
namespace POC_ML
{
    public partial class DataClassificationModel
    {
        /// <summary>
        /// model input class for DataClassificationModel.
        /// </summary>
        #region model input class
        public class ModelInput
        {
            [LoadColumn(0)]
            [ColumnName(@"text")]
            public string Text { get; set; }

            [LoadColumn(1)]
            [ColumnName(@"target")]
            public string Target { get; set; }

        }

        #endregion

        /// <summary>
        /// model output class for DataClassificationModel.
        /// </summary>
        #region model output class
        public class ModelOutput
        {
            [ColumnName(@"text")]
            public float[] Text { get; set; }

            [ColumnName(@"target")]
            public uint Target { get; set; }

            [ColumnName(@"Features")]
            public float[] Features { get; set; }

            [ColumnName(@"PredictedLabel")]
            public string PredictedLabel { get; set; }

            [ColumnName(@"Score")]
            public float[] Score { get; set; }

        }

        #endregion

        private static string MLNetModelPath = Path.GetFullPath("DataClassificationModel.mlnet");

        public static readonly Lazy<PredictionEngine<ModelInput, ModelOutput>> PredictEngine = new Lazy<PredictionEngine<ModelInput, ModelOutput>>(() => CreatePredictEngine(), true);


        private static PredictionEngine<ModelInput, ModelOutput> CreatePredictEngine()
        {
            var mlContext = new MLContext();
            ITransformer mlModel = mlContext.Model.Load(MLNetModelPath, out var _);
            return mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);
        }

        ...
        ...
        ...

        /// <summary>
        /// Use this method to predict on <see cref="ModelInput"/>.
        /// </summary>
        /// <param name="input">model input.</param>
        /// <returns><seealso cref=" ModelOutput"/></returns>
        public static ModelOutput Predict(ModelInput input)
        {
            var predEngine = PredictEngine.Value;
            return predEngine.Predict(input);
        }

    }
}
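
The Lazy<T> pattern used for PredictEngine can be seen in isolation in the sketch below. A plain string stands in for the prediction engine, and the counter exists only to show how often the factory runs; both are illustrative assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

int factoryCalls = 0;

// Stand-in for an expensive object such as a prediction engine.
var lazyEngine = new Lazy<string>(() =>
{
    Interlocked.Increment(ref factoryCalls);
    return "engine";
}, true); // true: initialization is thread-safe

// Nothing is created until Value is first read.
Console.WriteLine($"Created before first use: {lazyEngine.IsValueCreated}");

// Several threads read Value at once; the factory still runs only once.
Parallel.For(0, 8, _ => { var unused = lazyEngine.Value; });

Console.WriteLine($"Factory calls: {factoryCalls}");
```

Note that the flag only protects the one-time creation. The ML.NET documentation describes PredictionEngine itself as not thread-safe, so sharing a single engine across threads may still need extra care.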
  • Generated model (DataClassificationModel.training.cs)
    1. It describes how the ML pipeline for the model works and behaves by specifying the transformers and algorithms used and their sequence.
    2. Microsoft.ML.Trainers.FastTree library - contains the algorithm implementation used by the model.
    3. RetrainModel method - retrains the model on new data using the generated pipeline.
      1. It returns an ITransformer, which is responsible for transforming data within an ML.NET model pipeline.
      2. It calls the BuildPipeline method, which builds the model’s pipeline and specifies the transforms and algorithm(s) that will be used.
    4. BuildPipeline method - builds the pipeline through a series of transformations added using mlContext.Transforms.
      1. Text.FeaturizeText(inputColumnName:@"text",outputColumnName:@"text")
        1. The method transforms the input column strings (text) into numerical feature vectors (floats) that keep normalized counts of words and character n-grams.
      2. Then, a series of Append methods are chained to FeaturizeText.
      3. Each Append call receives a transform operation or trainer as a parameter, building up the ML.NET pipeline.
      4. The first Append, .Append(mlContext.Transforms.Concatenate(@"Features", new []{@"text"})), concatenates the input columns into a new output column (Features).
      5. The next Append, .Append(mlContext.Transforms.Conversion.MapValueToKey( outputColumnName:@"target",inputColumnName:@"target")), converts the label column.
        1. MapValueToKey method - maps the values of the input column (inputColumnName) to keys in the output column (outputColumnName), converting categorical values into keys.
      6. The next Append is where the magic happens: .Append(mlContext.MulticlassClassification.Trainers.OneVersusAll( binaryEstimator:mlContext.BinaryClassification.Trainers.FastTree(new FastTreeBinaryTrainer.Options() { NumberOfLeaves=4,MinimumExampleCountPerLeaf=20,NumberOfTrees=4, MaximumBinCountPerFeature=254,FeatureFraction=1,LearningRate=0.1, LabelColumnName=@"target",FeatureColumnName=@"Features" }),labelColumnName: @"target"))
        1. OneVersusAll method - receives a binary estimator (binaryEstimator) algorithm instance as a parameter.
        2. The one-versus-all technique is a general approach that adapts a binary classification algorithm to handle a multiclass classification problem.
        3. The binary estimator instance here is the FastTree trainer, taken from ML.NET's binary classification trainers (mlContext.BinaryClassification.Trainers).
        4. The options (NumberOfLeaves and so on) are passed to FastTree, which predicts a target using decision trees for binary classification.
        5. The final part (labelColumnName: @"target") indicates that the target column holds the labels the trainer learns to predict.
      7. The last Append, .Append(mlContext.Transforms.Conversion.MapKeyToValue(outputColumnName:@"PredictedLabel",inputColumnName:@"PredictedLabel")), converts the predicted key back into its original value.
// This file was auto-generated by ML.NET Model Builder.
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.FastTree;
using Microsoft.ML.Trainers;
using Microsoft.ML;

namespace POC_ML
{
    public partial class DataClassificationModel
    {
       ...
       ...
       ...


        /// <summary>
        /// Retrains model using the pipeline generated as part of the training process.
        /// </summary>
        /// <param name="mlContext"></param>
        /// <param name="trainData"></param>
        /// <returns></returns>
        public static ITransformer RetrainModel(MLContext mlContext, IDataView trainData)
        {
            var pipeline = BuildPipeline(mlContext);
            var model = pipeline.Fit(trainData);

            return model;
        }


        /// <summary>
        /// build the pipeline that is used from model builder. Use this function to retrain model.
        /// </summary>
        /// <param name="mlContext"></param>
        /// <returns></returns>
        public static IEstimator<ITransformer> BuildPipeline(MLContext mlContext)
        {
            // Data process configuration with pipeline data transformations
            var pipeline = mlContext.Transforms.Text.FeaturizeText(inputColumnName: @"text", outputColumnName: @"text")
                .Append(mlContext.Transforms.Concatenate(@"Features", new[] { @"text" }))
                .Append(mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: @"target", inputColumnName: @"target", addKeyValueAnnotationsAsText: false))
                .Append(mlContext.MulticlassClassification.Trainers.OneVersusAll(
                    binaryEstimator: mlContext.BinaryClassification.Trainers.FastTree(new FastTreeBinaryTrainer.Options()
                    {
                        NumberOfLeaves = 4,
                        MinimumExampleCountPerLeaf = 20,
                        NumberOfTrees = 4,
                        MaximumBinCountPerFeature = 254,
                        FeatureFraction = 1,
                        LearningRate = 0.1,
                        LabelColumnName = @"target",
                        FeatureColumnName = @"Features"
                    }),
                    labelColumnName: @"target"))
                .Append(mlContext.Transforms.Conversion.MapKeyToValue(outputColumnName: @"PredictedLabel", inputColumnName: @"PredictedLabel"));

            return pipeline;
        }
    }
}