Some Notes About Machine Learning Projects
Machine learning is a phrase that is becoming common in lots of areas of life. But it’s a complex and often misunderstood subject. So, here are some of the basics.
Data scientists study and research for years to qualify. They are the ones who know what they are talking about. But the rest of us may still need to get up to speed in some way. And for project managers, the terminology alone is enough to cause a headache.
I’ll cover some of the fundamental things that trip up a data science project.
What is data?
Data means different things to different people. As developers, we often think of it as the stuff you store in your database or import from CSV/JSON files.
And that does qualify as data in the general sense.
For machine learning projects, data has a more precise definition. The first thing to keep in mind is:
The input data has to be in a format your computer can understand
That might be stating the obvious but let me extend that a little.
It is helpful to think of data in a table with rows and columns. Each row of data represents a data point. Then, there are properties that make up that data point.
For example, let’s say you were storing data about people. Their properties might be height, weight, age and so on.
You will often hear each data point described as an entity. And each entity has a collection of variables. In this example, those are height, weight, age and so on.
To cover one more piece of terminology: those variables are sometimes called features.
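To make the rows-and-columns idea concrete, here is a minimal Python sketch. The field names (`height_cm`, `weight_kg`, `age`), the values and the `to_rows` helper are all made up for illustration; they are not from any real dataset or library.

```python
# Each dict is one entity (a row); each key is a variable, or feature (a column).
# The values are invented purely for illustration.
people = [
    {"height_cm": 170.0, "weight_kg": 65.0, "age": 34},
    {"height_cm": 182.5, "weight_kg": 80.2, "age": 51},
    {"height_cm": 158.0, "weight_kg": 54.3, "age": 27},
]

# The order of the columns is fixed once so every row lines up the same way.
feature_names = ["height_cm", "weight_kg", "age"]


def to_rows(entities, features):
    """Turn a list of entity dicts into a list of numeric rows, one per entity."""
    return [[float(entity[name]) for name in features] for entity in entities]


rows = to_rows(people, feature_names)
```

Each inner list in `rows` is one data point in a format the computer can work with: plain numbers, in a consistent column order.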
Why input data is critical
The input data collected for your machine learning model has to be complete. It has to contain the information your prediction depends on.
Take the people entity example above. Then, imagine we were only collecting surname as a variable. There’s no algorithm that would be able to predict gender. The data isn’t there, pure and simple.
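One rough way to catch this early is to list the variables you think you need and count how many entities are missing each one. The sketch below assumes entities are plain dicts; the `missing_report` helper and the list of needed variables are hypothetical and only for illustration.

```python
def missing_report(entities, needed):
    """Count, per needed variable, how many entities have no value for it."""
    report = {}
    for name in needed:
        report[name] = sum(1 for entity in entities if entity.get(name) is None)
    return report


# Only surname was collected, as in the example above.
people = [
    {"surname": "Smith"},
    {"surname": "Jones"},
]

# Hypothetical list of variables the project thinks it needs.
needed = ["surname", "height", "weight", "age"]

report = missing_report(people, needed)
```

Any variable whose count equals the number of entities was never collected at all. A check like this only tells you what is absent. It cannot tell you whether the variables you do have are the right ones for your question.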
That leads to some guiding questions when thinking about a machine learning project…
Stuff to think about
Here are some good things to keep in mind when planning a project.
1. Is the data right?
Be clear on the question(s) you are going to answer. Because that is the only way you can be sure about your data. I mean sure in the sense that the data can answer the question(s).
2. How do I phrase my question(s)?
It’s one thing to think of a question. It’s another to think of it in a machine learning context. Are you asking the question in a way that the data can answer?
3. Do I have enough data?
I fell foul of this one. Jumping in to build the app with only a small sample of migration data was a big fail. You have to have enough data in the system to represent the problem you are trying to solve.
4. Have I got the right variables?
Remember those variables (features) for each entity? Did you collect and extract the ones you need to enable the right predictions?
5. What does success look like?
Simple this one. How will you know it’s working?
Imagine you were getting data from movement sensors around the home of an elderly person. You want an algorithm that triggers an alert when unusual movement (or a lack of movement) occurs.
When the sensor detects movement in the kitchen (for example) the algorithm learns from it.
When there is movement in the kitchen at 3am that breaks the normal pattern it raises questions. Why was this entity (person) in their kitchen at that time?
In the case of an elderly person, does it mean they are waking up dehydrated? Or is it that they can’t sleep?
Either way, the algorithm detects something out of the ordinary and triggers an alert.
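As a very rough sketch of the idea, the "learning" could be as simple as counting how often movement has been seen in each room at each hour, then flagging movement in a slot that history has rarely seen. Everything here is an assumption for illustration: the `(room, hour)` event format, the `min_count` threshold and the sample history. A real system would use a proper model, and this sketch does not cover the lack-of-movement case.

```python
from collections import Counter


def learn_pattern(events):
    """Count how often movement was seen in each (room, hour) slot."""
    return Counter((room, hour) for room, hour in events)


def is_unusual(pattern, room, hour, min_count=3):
    """Flag movement in a (room, hour) slot that history has rarely seen."""
    # Counter returns 0 for slots that never appeared in the history.
    return pattern[(room, hour)] < min_count


# Invented history: breakfast and dinner in the kitchen, bedtime in the bedroom.
history = [("kitchen", 8)] * 30 + [("kitchen", 18)] * 28 + [("bedroom", 23)] * 30
pattern = learn_pattern(history)

breakfast_alert = is_unusual(pattern, "kitchen", 8)
night_alert = is_unusual(pattern, "kitchen", 3)
```

With this history, movement in the kitchen at 8am matches the normal pattern, while movement in the kitchen at 3am does not and would trigger the alert.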
One final thing for now…
I’ll deal with supervised and unsupervised machine learning in upcoming posts. But since data is at the heart of any machine learning project, the information here is a small start.
Data is king, in the same way content is often said to be king on the web.