My Friends,
A little off topic today. Sorry, but, for some, this may still be useful.
Artificial Intelligence, in particular NLP, Natural Language Processing, has a subcategory called Named Entity Recognition. This is a very useful tool, and it has many implementations, on many different platforms.
ML.NET 3.0 has implemented a trainer for NER, but the code is incomplete, and many have had a lot of trouble implementing it. I had a bit of a play with this and got it working. There is a good GitHub Issue Thread Here, that gives a bit of an idea on how to progress.
To make this work, you need to install the following packages:
<?xml version="1.0" encoding="utf-8"?>
<packages>
<package id="libtorch-cpu-win-x64" version="1.13.0.1" targetFramework="net461" />
<package id="Microsoft.ML" version="3.0.0-preview.23511.1" targetFramework="net461" />
<package id="TorchSharp" version="0.99.5" targetFramework="net461" />
...
</packages>
We need some helper classes to do some work on the data.
private class InputTrainingData
{
public string Sentence;
public string[] Label;
}
We need a Label class:
public class Label
{
// The Key: Person, Org...
public string Key { get; set; }
}
We need two classes to infer a sentence:
private class Input
{
public string Sentence;
public string[] Label;
}
private class Output
{
public string[] Predictions;
}
Here is the working class itself:
#region Using Statements:
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.TorchSharp;
#endregion
public class Program
{
// Main method
public static void Main(string[] args)
{
try
{
var context = new MLContext()
{
FallbackToCpu = true,
GpuDeviceId = 0
};
var labels = context.Data.LoadFromEnumerable(
new[] {
// SpaCy Supported Types:
// See: https://www.kaggle.com/code/curiousprogrammer/entity-extraction-and-classification-using-spacy/notebook
new Label { Key = "PERSON" }, // People, including fictional.
new Label { Key = "NORP" }, // Nationalities or religious or political groups.
new Label { Key = "FAC" }, // Buildings, airports, highways, bridges, etc.
new Label { Key = "ORG" }, // Companies, agencies, institutions, etc.
new Label { Key = "GPE" }, // Countries, cities, states.
new Label { Key = "LOC" }, // Non-GPE locations, mountain ranges, bodies of water.
new Label { Key = "PRODUCT" }, // Objects, vehicles, foods, etc. (Not services.)
new Label { Key = "EVENT" }, // Named hurricanes, battles, wars, sports events, etc.
new Label { Key = "WORK_OF_ART" }, // Titles of books, songs, etc.
new Label { Key = "LAW" }, // Named documents made into laws.
new Label { Key = "LANGUAGE" }, // Any named language.
new Label { Key = "DATE" }, // Absolute or relative dates or periods.
new Label { Key = "TIME" }, // Times smaller than a day.
new Label { Key = "PERCENT" }, // Percentage, including "%".
new Label { Key = "MONEY" }, // Monetary values, including unit.
new Label { Key = "QUANTITY" }, // Measurements, as of weight or distance.
new Label { Key = "ORDINAL" }, // "first", "second", etc.
new Label { Key = "CARDINAL" }, // Numerals that do not fall under another type.
// Added Types by Me:
new Label { Key = "OBJECT" }, // An Object, Entity might be a Spoon, or a Soccer Ball. Needs Sub Categories.
});
var dataView = context.Data.LoadFromEnumerable(
new List<InputTrainingData>(new InputTrainingData[] {
new InputTrainingData()
{
// Testing longer than 512 words.
Sentence = "Alice and Bob live in the USA",
Label = new string[]{"PERSON", "0", "PERSON", "0", "0", "0", "COUNTRY"}
},
new InputTrainingData()
{
Sentence = "Alice and Bob live in the USA",
Label = new string[]{"PERSON", "0", "PERSON", "0", "0", "0", "COUNTRY"}
},
}));
var chain = new EstimatorChain<ITransformer>();
var estimator = chain.Append(context.Transforms.Conversion.MapValueToKey("Label", keyData: labels))
.Append(context.MulticlassClassification.Trainers.NameEntityRecognition(outputColumnName: "Predictions"))
.Append(context.Transforms.Conversion.MapKeyToValue("Predictions"));
var transformer = estimator.Fit(dataView);
var transformerSchema = transformer.GetOutputSchema(dataView.Schema);
string sentence = "Alice and Bob live in the USA";
var Encoded = Tokenizer.Tokenize(sentence);
// var trainedModel = context.Model.Load(GetOutputFilePath(), out DataViewSchema _);
var engine = context.Model.CreatePredictionEngine<Input, Output>(transformer);
Output predictions = engine.Predict(new Input { Sentence = sentence });
transformer.Dispose();
Console.WriteLine("Success!");
Console.ReadLine();
}
catch (Exception ex)
{
Console.WriteLine($"Error: {ex.Message}");
Console.ReadLine();
}
}
}
We need to instantiate the Tokenizer class:
#region Using Statements:
using Microsoft.ML.Tokenizers;
#endregion
public class Tokenizer
{
private static Microsoft.ML.Tokenizers.Tokenizer _instance;
private static EnglishRoberta Roberta = new EnglishRoberta("Data/encoder.json", "Data/vocab.bpe", "Data/dict.txt");
/// <summary>
/// .
/// </summary>
public static TokenizerResult Tokenize(string input)
{
Roberta.AddMaskSymbol();
_instance = new Microsoft.ML.Tokenizers.Tokenizer(Roberta, new RobertaPreTokenizer());
return _instance.Encode(input);
}
}
The files: "encoder.json", "vocab.bpe", "dict.txt", you can download via the links provided, and save them in a Data folder. Don't forget to copy to output directory.
The prediction is fairly accurate, with only two training examples, here is the prediction I got:
We should be getting:
new InputTrainingData()
{
Sentence = "Alice and Bob live in the USA",
Label = new string[]{"PERSON", "0", "PERSON", "0", "0", "0", "COUNTRY"}
},
At position [6] we should be getting: "COUNTRY". With some more training examples, this will improve drastically!
The EnglishRoberta class, encodes, or tokenizes words like so:
NER is a very useful tool, used in many areas in IT and Data Aquisition! It is useful for automatically extracting information from large texts!
Best Wishes,
Chris