Data scientist: sounds cool, right? People seem to be fascinated by my role, and I constantly get asked about my day-to-day. Truth be told, no two data scientist roles are alike. Want to know more about a typical day of a data scientist at Genetec?
A good place to start is to ask: “What is a data scientist?”
A data scientist is a person who has expertise in the following areas:
- Mathematics, probability, and statistics
- Computer science
- Business/industry knowledge
Now the question is, how do these all relate to the creation of a “data scientist” role? Data scientists need expertise in computer science to extract, transform, and load (ETL) data. They also need to be experts in mathematics, probability, and statistics so that, when they look at data, they can recognize which probability distribution or algorithm is likely to fit best and can assess the quality of that fit. Lastly, nothing can replace knowledge of one’s domain. Being an expert in security systems, for example, is the only way to build data science-based systems for security.
What does a data scientist do at Genetec?
In my role, I mainly deal with two types of problems: structured and unstructured.
Structured problems are problems where we know the correct output for the data (the machine learning literature usually calls this supervised learning). For example, automatic license plate recognition (ALPR) is a structured problem because when we train our algorithms we have a data set of raw ALPR images and the classified output (this is an image of license plate XYZ123, and the plate is at this location in the image). These are the most common problems in machine learning.
Unstructured problems (usually called unsupervised learning) are often related to “automatic” learning problems. When someone says, “I want the algorithm to teach itself,” that’s usually a clue that you’re looking at an unstructured problem, because we don’t know the ground truth. A big focus in unstructured problems is anomaly detection: identifying when something is outside the norm for the data, without knowing what the norm is in advance.
Structured problem example
Imagine our access control team is working on integrating a card reader. They’ve been noticing that this device occasionally spits out null (all 0’s) or corrupted credential IDs and they’d like to know why. So, what I would ask them for is a sample data set of normal activity with the following parameters:
- Firmware version
- Card type
- Time of day, etc.
Additionally, I’d ask them for another data set of the same parameters for when the reader gives “bad” data. I would then say that the first data set has an output of 1 [success] and the bad data set has an output of 0 [fail] and then combine the data sets together. A big part of data science is extracting and transforming the data then loading it into a new system.
Now I have a new data set with an extra parameter, called the response, which is 1 or 0. The parameters of the event which we collected are called covariates; this is the data we will use to see what causes a “bad” read. Each row of the combined data set holds the covariates for one read event plus its response.
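As a rough sketch of that labelling-and-combining step in Python (the field names and values below are invented for illustration, not real reader data):

```python
# Hypothetical read events; field names and values are made up for illustration.
good_reads = [
    {"firmware": "1.0", "card_type": "prox", "hour": 9},
    {"firmware": "2.0", "card_type": "smart", "hour": 14},
]
bad_reads = [
    {"firmware": "1.0", "card_type": "prox", "hour": 23},
]


def label(rows, response):
    """Return copies of the rows with a 'response' column added."""
    return [{**row, "response": response} for row in rows]


# 1 = successful read, 0 = null or corrupted read
combined = label(good_reads, 1) + label(bad_reads, 0)

for row in combined:
    print(row)
```

Everything except the `response` key is a covariate. The original lists are left untouched, so the same raw data can be relabelled later if the definition of a “bad” read changes.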
Let's dig in a little deeper...
Since I’m dealing with a binary response variable, I’ll likely first try to model the problem with a Binomial distribution. The Binomial distribution is a probability distribution for binary [0/1] data. It says that a success occurs with probability p and a failure occurs with probability 1-p. Then I’ll model p with a generalized linear model (GLM), which uses a link function to relate p to a linear combination of the covariates, each weighted by a coefficient β.
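For a binary response, a common choice is the logit link. That is my assumption for this example; other links, such as the probit, exist. Here x_1, ..., x_k stand for the covariates and β_0 is an intercept:

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)\bigr)}
```

The left-hand form says the log-odds of a successful read are linear in the covariates; the right-hand form is the same statement solved for p.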
The unknowns here are the β coefficients, which we will learn from the sample data sets we received from the access control team. The model is fit by a method called iteratively reweighted least squares, which, as a really neat side effect, tells us which β coefficients are important predictors for getting an output of 1 versus 0. Essentially, it tells us which of the many factors we have asked for are most powerful for guessing if a bad event will occur.
Also, for the ones that are important, we can quantify how much they matter, for example how much running firmware version 1 or 2 affects the rate of bad events. We can then show mathematically that firmware versions 1 and 2 make for, say, a 10% rise in the rate of corrupt or null credential reads. So, we can report this to the manufacturer and work together to make a better product!
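To make the fitting step concrete, here is a minimal pure-Python sketch, assuming the logit link and a single invented covariate (whether the reader runs old firmware). It is written in Newton-step form, which for the logit link is algebraically the same update as iteratively reweighted least squares. The data and the names `fit_logistic_irls` and `solve` are hypothetical; in practice I would reach for a statistics library instead.

```python
import math


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * u for v, u in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logistic_irls(X, y, max_iter=25, tol=1e-10):
    """Return (beta, standard_errors) for a logit-link Binomial GLM."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))
             for row in X]
        w = [pi * (1.0 - pi) for pi in p]  # the "reweighting" in IRLS
        # Weighted normal equations: (X^T W X) delta = X^T (y - p)
        xtwx = [[sum(wi * row[i] * row[j] for wi, row in zip(w, X))
                 for j in range(k)] for i in range(k)]
        score = [sum(row[i] * (yi - pi) for row, yi, pi in zip(X, y, p))
                 for i in range(k)]
        delta = solve(xtwx, score)
        beta = [b + d for b, d in zip(beta, delta)]
        if max(abs(d) for d in delta) < tol:
            break
    # Standard errors: square roots of the diagonal of (X^T W X)^-1,
    # using the weights from the last iteration (the final step is negligible).
    se = []
    for i in range(k):
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        se.append(math.sqrt(solve(xtwx, unit)[i]))
    return beta, se


# Invented data: columns are [intercept, old_firmware]; response 1 = good read.
# Old firmware: 6 good reads out of 10. New firmware: 9 good reads out of 10.
X = [[1.0, 1.0]] * 10 + [[1.0, 0.0]] * 10
y = [1] * 6 + [0] * 4 + [1] * 9 + [0] * 1

beta, se = fit_logistic_irls(X, y)
for name, b, s in zip(["intercept", "old_firmware"], beta, se):
    print(f"{name}: beta={b:.3f}  se={s:.3f}  z={b / s:.2f}")
print(f"odds ratio for old firmware: {math.exp(beta[1]):.3f}")
```

For this tiny two-group table the maximum-likelihood answer is also known in closed form (β_0 = log 9 and β_1 = log(1/6)), which makes it easy to check that the iteration has converged. The ratio of each β to its standard error is what tells us whether a covariate looks like an important predictor; with only 20 invented events, the evidence here is weak.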
Unstructured problem example
Still reading? Good! As I mentioned earlier, unstructured machine learning is the realm of machine learning which deals with data where we don’t know the correct output. For example, if you’re monitoring a stream of network data, you often don’t know what it normally should look like. Should there be that many camera connections? Does that make sense? Maybe it’s normal and I just want to look when things get weird.
One way data scientists tackle problems like this is with something called anomaly detection. To do this, I first need to pick a mathematical model which I think represents the data best. For now, let’s assume the data we are dealing with is distributed according to a Gaussian (normal) distribution. This is one of the most common models applied to data because it has some very nice mathematical properties and can solve many types of problems.
Next, we need to “fit” the parameters of the model to the data. For a normal distribution, this means computing the mean and variance of the data. Once we have estimated the mean and variance, we can pass each data point through a standard score test (also known as the Z-score or Z-value) or optionally compute something like the p-value. With either of these metrics, we can get a general understanding of whether each point in the collection of data appears to be drawn from a normal distribution with the parameters we estimated. Lastly, we set a threshold on “acceptable” values for either the standard score or p-value, and when we get a value outside our normal range we call that point an anomaly!
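Here is a small sketch of that recipe in Python, using only the standard library. The connection counts and the threshold of 3 are assumptions made up for the example; I fit the mean and standard deviation on a baseline window and then score new points against it.

```python
import math
import statistics

# Hypothetical camera-connection counts per minute (invented values).
baseline = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51, 50]
new_points = [51, 46, 95, 50]

# "Fit" the normal model: estimate its mean and (sample) standard deviation.
mean = statistics.mean(baseline)
std = statistics.stdev(baseline)

threshold = 3.0  # assumed cutoff on |z|; tune this for your own data

anomalies = []
for x in new_points:
    z = (x - mean) / std
    # Two-sided p-value under the fitted normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    if abs(z) > threshold:
        anomalies.append(x)
    print(f"x={x}  z={z:.2f}  p={p_value:.3g}")

print("anomalies:", anomalies)
```

Fitting on a separate baseline is a design choice: if the mean and variance are estimated from data that already contains a large outlier, the outlier inflates the variance and can hide itself.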
How do we work on a new problem?
These problems always start with another party describing what they want the system to do for them. We then work together to (1) manage expectations about what is possible and (2) discuss alternate use cases which they may not even be aware of. After that, the next step is to receive a data set to work with. Even though I might think a problem is feasible, a data scientist never knows this until they see and work with the data. Data can have a multitude of problems, ranging from missing attributes/properties to variation that is too large. When this happens, we follow the “data science process”.
The data science process
1. Plot the data
2. Does the plot seem to demonstrate anything?
   - If no, return to step 1 and re-plot the data a different way or with different attributes
   - If yes, continue to step 3
3. Take a guess at which distribution or mathematical model you want to try applying
4. Fit the parameters of the model to the data
   - If the model doesn’t fit, try another
   - If the model seems to fit well, continue to step 5
5. Convert the model (if necessary or possible) to a sequential method for online estimation/detection
6. Use the model estimate along with new data to identify data points outside the “normal” range
7. Spend an additional 5000% time creating the support systems for a product (security, UX, services, APIs, etc.)
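To illustrate what converting a model to a sequential method and then using it on new data can look like, here is a sketch for the Gaussian case. It uses Welford's algorithm to update the mean and variance one point at a time. The stream values, the warm-up length, the threshold, and the choice not to learn from flagged points are all assumptions for this example, not a description of any Genetec product.

```python
import math


class RunningGaussian:
    """Running mean and variance via Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; undefined until two points are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else float("nan")


# Hypothetical stream of per-minute connection counts (invented values).
stream = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51, 50, 95, 51, 46]

warmup = 10      # assumed: learn from the first 10 points before flagging
threshold = 3.0  # assumed cutoff on |z|

model = RunningGaussian()
flagged = []
for x in stream:
    ready = model.n >= warmup and model.std() > 0
    if ready and abs(x - model.mean) / model.std() > threshold:
        flagged.append(x)  # anomaly: report it and do not learn from it
    else:
        model.update(x)    # normal point: fold it into the estimate

print("flagged:", flagged)
print(f"mean={model.mean:.2f}  std={model.std():.2f}  n={model.n}")
```

Nothing here needs the full history in memory, which is the point of an online method. Skipping flagged points keeps one spike from widening the “normal” range, at the cost of adapting slowly if the norm really does shift.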
Data science products
Some products that I worked on at Genetec include:
This is a crime prediction and resource deployment software-as-a-service (SaaS) product which helps cities and law enforcement deploy their physical resources more efficiently based on predicted trends in crime. It utilizes a sequential algorithm to learn crime trends and then predicts forward given recent crime in a city. It was developed using the Chicago open-data portal.
Citigraf Command utilizes a high-tech correlation engine to identify, in high dimensions, data points which may be related to one another (including events). It works in real time to associate data together to help law enforcement and public safety build a complete picture of an event.
This is a feature which is being developed to be included in our System Availability Monitor (SAM) SaaS product. It will help integrators and IT professionals assess the future health of their systems and notify an operator how long they have until the system heads towards a critical failure.
Want to become a data scientist at Genetec?
If you’ve read this far, you might have a real interest in a job in the field of data science! Check out our careers page to view our available positions.
About the author: Sean Lawlor