Earlier this year (May 2018) Microsoft announced ML.NET, an open source and cross-platform machine learning framework built for .NET developers. Being able to integrate custom machine learning into .NET/C# applications is exciting news. Although ML.NET is still a preview release (version 0.5.0 at the time of writing), you can already test drive it to explore the framework's potential.
There are already a number of tutorials for ML.NET available from Microsoft and third parties. However, the example data sources are mostly flat files in the format of TSV (Tab Separated Values). This post is written for the plethora of datasets available in JSON format, unstructured datasets from web events, or perhaps datasets that are already stored in MongoDB.
This post focuses on building a binary classification sentiment analysis model with ML.NET using data stored in MongoDB. It is based on Microsoft's tutorial "Use ML.NET in a sentiment analysis binary classification", with these notable differences:
- The training dataset is in JSON format.
- It reads from MongoDB as its data source instead of a file.
- It uses .NET Core (Ubuntu/Linux).
The full code example and data can be found on github.com/sindbach/mlnet_mongodb. I recommend reviewing Microsoft's tutorial for more information.
The Data
A good machine learning journey always starts with a good dataset. The dataset used here comes from the Yelp Dataset Challenge. The data is provided by Yelp as part of their dataset challenge, which ends on 31 December 2018. The data is ~2.9GB in size and, most importantly, in JSON format.
The part of the dataset that is of interest is the yelp_academic_dataset_review.json file. The sentiment analysis model will be trained on the Yelp reviews to predict whether a review has a positive or negative sentiment.
The following is an example JSON structure from the file:
{ "business_id": "iCQpiavjjPzJ5_3gPD5Ebg", "cool": 0, "date": "2011-02-25", "funny": 0, "review_id": "x7mDIiDB3jEiPGPHOmDzyw", "stars": 2, "text": "The pizza was okay. Not the best I've had. I prefer Biaggio's on Flamingo / Fort Apache. The chef there can make a MUCH better NY style pizza. The pizzeria @ Cosmo was over priced for the quality and lack of personality in the food. Biaggio's is a much better pick if youre going for italian - family owned, home made recipes, people that actually CARE if you like their food. You dont get that at a pizzeria in a casino. I dont care what you say...", "useful": 0, "user_id": "msQe1u7Z_XuqjGoqhB0J5g"}
There are two important fields in the structure: text and stars. The text field contains a user's review comment, and the stars field indicates whether the review is positive or not.
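As a rough illustration of which fields matter, the sketch below parses one review record and keeps only text and stars. It assumes the file holds one JSON document per line. The helper name pickReviewFields and the shortened sample record are illustrative only and are not taken from the original repository.

```javascript
// Hypothetical helper: parse one line of the review file and keep
// only the two fields that are used for training.
function pickReviewFields(line) {
  const review = JSON.parse(line);
  return { text: review.text, stars: review.stars };
}

// Shortened version of the sample record shown above.
const sampleLine =
  '{"review_id": "x7mDIiDB3jEiPGPHOmDzyw", "stars": 2, ' +
  '"text": "The pizza was okay. Not the best I\'ve had.", "useful": 0}';

console.log(pickReviewFields(sampleLine));
```

Every other field in the record (business_id, cool, funny, and so on) is simply dropped by this helper.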
The Database
Time to load the review data into a database. For this post, the data will be loaded into MongoDB Atlas, a cloud-hosted database-as-a-service for MongoDB. You can follow MongoDB's tutorial to create a free tier Atlas cluster if you would like to test the data loading as well.
The data can be loaded to MongoDB Atlas using mongoimport. For example, the following command will import a file called yelp_academic_dataset_review.json into the review collection in the yelp database:
mongoimport --uri "mongodb+srv://user:pwd@dataset-demo.mongodb.net/yelp" --collection review ./yelp_academic_dataset_review.json
Once the import has completed, use either the mongo shell or MongoDB Compass to check the data.
[Figure: MongoDB Compass Document View]
There's one more preparation step before jumping into the code. Since we're building a binary classifier, we need a binary value that says whether a review is positive (1) or negative (0). Fortunately, every document contains a star rating in the range 1 to 5, where 1 indicates a negative review and 5 a positive review.
The MongoDB Aggregation Pipeline can be used to add a new field called sentiment to the dataset, with a value derived from the stars rating. The logic is as follows: any review with a stars value greater than 3 is positive, and any review with a value less than or equal to 3 is negative.
For example, use the $addFields stage to add the new field and $out stage to store the output into a separate collection:
db.review.aggregate([
  {"$addFields": {"sentiment": {"$cond": {"if": {"$gt": ["$stars", 3]}, "then": 1, "else": 0}}}},
  {"$out": "review_train"}
]);
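The labelling rule can also be checked outside the database. The following is a minimal sketch in plain JavaScript that mirrors the $cond logic. The function name sentimentFromStars is hypothetical and is not part of MongoDB or the original repository.

```javascript
// Hypothetical helper mirroring the $cond rule: stars > 3 is positive (1),
// anything else is negative (0).
function sentimentFromStars(stars) {
  return stars > 3 ? 1 : 0;
}

// Label each possible star rating from 1 to 5.
const labelsByStars = [1, 2, 3, 4, 5].map((stars) => ({
  stars,
  sentiment: sentimentFromStars(stars),
}));

console.log(labelsByStars);
```

Note that a 3-star review falls into the negative class under this rule.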
[Figure: MongoDB Compass Aggregation Pipeline Builder]
Note: You can also find a small portion of the JSON data on github.com/sindbach/mlnet_mongodb: data. The training data consists of 5000 positive reviews and 5000 negative reviews.
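The post does not say how the 5000/5000 sample was drawn. One possible approach, shown here purely as an assumption, is to keep the same number of documents from each class. The function balancedSample, its perClass parameter, and the toy documents are all hypothetical.

```javascript
// Hypothetical helper: keep at most `perClass` documents for each
// sentiment value (1 = positive, 0 = negative), preserving input order.
function balancedSample(docs, perClass) {
  const counts = { 0: 0, 1: 0 };
  return docs.filter((doc) => {
    if (!(doc.sentiment in counts)) return false; // skip unlabeled documents
    if (counts[doc.sentiment] >= perClass) return false; // class is already full
    counts[doc.sentiment] += 1;
    return true;
  });
}

// Tiny made-up input to show the behaviour; a real run would use perClass = 5000.
const toyDocs = [
  { text: "great", sentiment: 1 },
  { text: "awful", sentiment: 0 },
  { text: "lovely", sentiment: 1 },
  { text: "superb", sentiment: 1 },
  { text: "bad", sentiment: 0 },
];

console.log(balancedSample(toyDocs, 2).map((doc) => doc.text));
```

A balanced training set like this keeps the classifier from simply favouring whichever class is more common in the raw reviews.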
The Code
This post uses .NET Core, a free and open-source managed framework for Windows, macOS and Linux. The project has only two dependencies:
- MongoDB .NET/C# driver version 2.7.0
- ML.NET version 0.5.0
The SentimentData class is modified as follows to serialize and/or deserialize the review document structure from MongoDB:
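using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

// Skip document fields that have no matching member in this class.
[BsonIgnoreExtraElements]
public class SentimentData
{
    // Map the document's _id (an ObjectId) to a string property.
    [BsonId]
    [BsonRepresentation(BsonType.ObjectId)]
    public string Id { get; set; }

    // The "sentiment" field added by the aggregation pipeline is the training label.
    [BsonElement("sentiment")]
    public float Label { get; set; }

    // The review text; the property name matches the document field name.
    public string text { get; set; }
}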
BsonIgnoreExtraElements tells the driver to ignore any document field that has no matching class member, so only _id, sentiment (mapped to Label), and text are deserialized. These are the fields we will use for training. Next, we instantiate a MongoClient object to connect to MongoDB using a connection string URI:
static string mongoURI = "mongodb+srv://usr:pwd@dataset-demo.mongodb.net";
static readonly MongoClient client = new MongoClient(mongoURI);
Using the MongoClient object, we can access the data in the yelp database and review_train collection:
var db = client.GetDatabase("yelp");
var collection = db.GetCollection<SentimentData>("review_train");
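The ML.NET LearningPipeline requires an enumerable object, which we can easily get by invoking Find() on the collection:
// An empty BsonDocument filter matches every document in the collection.
var documents = collection.Find<SentimentData>(new BsonDocument()).ToEnumerable();
// pipeline is the LearningPipeline instance; the documents replace the file-based loader.
pipeline.Add(CollectionDataSource.Create(documents));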
To test the sentiment analysis model, we'll fetch four current reviews displayed on Yelp for restaurants in Sydney, Australia:
- "Very bad service and low quality of coffee too. Waiting for so long even tried to rush them already."
- "This place is amazing!! I had the classic cheese burger with fries. Hands down the best burger I have ever had."
- "If I could give zero stars I would. Terribly overpriced. Dried over cooked barramundi with no seasoning or flavor at all."
- "Small menu but the food is quite good. It's fast and easy, one of the better options around the area. We had the seafood laksa and seafood Pad Kee Mao."
The prediction results are:
Sentiment Predictions
---------------------
Sentiment: Very bad service and low quality of coffee too. Waiting for so long even tried to rush them already. | Prediction: Negative
Sentiment: This place is amazing!! I had the classic cheese burger with fries. Hands down the best burger I have ever had | Prediction: Positive
Sentiment: If I could give zero stars I would. Terribly overpriced. Dried over cooked barramundi with no seasoning or flavor at all | Prediction: Negative
Sentiment: Small menu but the food is quite good. It's fast and easy, one of the better options around the area. We had the seafood laksa and seafood Pad Kee Mao | Prediction: Positive
Note: You can find the full code example on github.com/sindbach/mlnet_mongodb: sentiment.
Loading and reading data from MongoDB as an ML.NET data source is straightforward. The potential of using ML.NET to integrate machine learning with datasets stored in MongoDB is exciting, and I'm looking forward to future releases of ML.NET.
ML.NET Sentiment Analysis with MongoDB was originally published in Hacker Noon on Medium.