
A conversation with Imad: quantiles, histograms, standard deviation, and outliers



The following is a conversation with a financial quant I know about finding outliers in data, and about why relying on the mean and standard deviation is misleading when the data is not normally distributed.

[6/24/2013 2:04:33 PM] Matt Brown: hey imad
[6/24/2013 2:04:35 PM] Matt Brown: hope all is well
[6/24/2013 2:04:53 PM] Matt Brown: i have a question about when to calculate the standard deviation of a sample versus a population
[6/24/2013 5:01:30 PM] Matt Brown: I think i got it :)
[6/24/2013 5:01:45 PM] Matt Brown: now... how do I understand when a standard deviation is "relevant"
[6/24/2013 5:01:57 PM] Matt Brown: as in, it's really only relevant in normal and uniform distributions? no?
[3:11:07 AM] Imad Hammad: Hi matt
[3:11:40 AM] Imad Hammad: I haven't seen your msgs
[3:14:46 AM] Imad Hammad: no the std dev is defined in many more cases
[3:14:58 AM] Imad Hammad: I would say even in most cases
[8:32:50 AM] Matt Brown: cool imad
[8:32:55 AM] Matt Brown: I found a method to find outliers in data
[8:33:03 AM] Matt Brown: on wikihow actually :D
[8:33:14 AM] Matt Brown: it uses mean and stddev
[8:34:49 AM] Imad Hammad: I guess that it's the method that everybody uses
[8:34:54 AM] Matt Brown: http://www.wikihow.com/Reject-Outliers-in-Data
[8:34:57 AM] Imad Hammad: compute the mean and the stdev
[8:35:26 AM] Imad Hammad: and if the distance between the point and the mean is more than k*stdev
[8:35:34 AM] Imad Hammad: so that's an outlier
[8:35:37 AM] Imad Hammad: no
[8:35:37 AM] Imad Hammad: ?
[8:36:23 AM] Imad Hammad: your problem is how to find a good k
[8:46:06 AM] Matt Brown: sorry
[8:46:08 AM] Matt Brown: :)
[8:46:11 AM] Matt Brown: yes, kelvin
[8:46:20 AM] Matt Brown: scoville units
[8:46:45 AM] Matt Brown: that sounds good
[8:47:05 AM] Matt Brown: i think most people assume +/- 1 is acceptable and then anything outside of that is an outlier, from what i read briefly
[8:47:29 AM] Matt Brown: how would i find k?
[8:48:02 AM] Matt Brown: for instance, I would assume that there are a bunch of transactions between, say, my outlook and smarsh's servers of the same size in bytes
[8:48:17 AM] Matt Brown: i have this info in general, there are about 104 different transaction sizes
[8:48:39 AM] Matt Brown: so how can i calculate the k in this instance of "transaction bytes"
[8:48:50 AM] Matt Brown: to find if a "transaction bytes" is an outlier?
[8:51:51 AM] Imad Hammad: what do you mean by they assume +/- 1
[8:52:05 AM] Imad Hammad: are you talking about k
[8:52:06 AM] Imad Hammad: ?
[8:52:10 AM] Imad Hammad: k=1 or -1
[8:52:11 AM] Matt Brown: just stddev
[8:52:38 AM] Imad Hammad: if you take a normal distribution
[8:52:52 AM] Imad Hammad: the probability of being further than 1 stdev
[8:53:00 AM] Matt Brown: but i think that's wrong, at least in my set, as i have stddev's of over 200 some times
[8:53:04 AM] Imad Hammad: is ~ 30%
[8:53:22 AM] Imad Hammad: do you think that 30% of your data are outliers
[8:53:22 AM] Imad Hammad: ?
[8:53:28 AM] Matt Brown: no sir
[8:53:35 AM] Matt Brown: i know that 0% of that data is an outlier
[8:53:41 AM] Imad Hammad: lol
[8:53:43 AM] Matt Brown: :D
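
To make the exchange above concrete, here is a minimal sketch of the "mean plus or minus k stdev" rule, together with the share of a normal distribution that falls outside k standard deviations. The function names and the sample values are made up for illustration; they are not from the conversation.

```python
import math
import statistics

def normal_tail_probability(k):
    """P(|X - mean| > k * stdev) when X is normally distributed."""
    return math.erfc(k / math.sqrt(2))

def k_sigma_outliers(values, k):
    """Return the values farther than k sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [v for v in values if abs(v - mean) > k * stdev]

# Share of normally distributed data that the rule would flag for a few k.
for k in (1, 2, 3):
    print(f"k={k}: {normal_tail_probability(k):.1%} flagged")

# Made-up sample: one extreme value inflates both the mean and the stdev.
print(k_sigma_outliers([10, 11, 9, 10, 12, 10, 11, 500], k=2))
```

With k=1 roughly a third of perfectly ordinary normal data gets flagged, which is the point Imad is making. Note also that the extreme value itself inflates the stdev, so the choice of k matters a lot on small samples.
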
[8:54:05 AM] Matt Brown: i mean, it's just the run of the mill stuff, I suppose the only way to do this sort of thing is to look at historical activity and compare?
[8:54:45 AM] Imad Hammad: that will be a good way to do it
[8:54:55 AM] Imad Hammad: just take quantiles on your historical data
[8:55:04 AM] Matt Brown: what's a quantile? :D
[8:56:09 AM] Imad Hammad: a stupid thing
[8:56:17 AM] Imad Hammad: like
[8:56:32 AM] Imad Hammad: I know 100% of the data is < max of the data
[8:57:52 AM] Imad Hammad: so the max is the quantile at 100%
[8:58:08 AM] Imad Hammad: so the quantile at x%
[8:58:21 AM] Imad Hammad: is the number y for which I'm sure
[8:58:30 AM] Imad Hammad: that x% of the data is < y
[8:58:47 AM] Imad Hammad: and the other (1-x)% of the data is >=y
[8:59:01 AM] Imad Hammad: am I clear
[8:59:04 AM] Imad Hammad: ?
[8:59:26 AM] Matt Brown: yes, relatively, i don't quite understand the notation (1-x)%, just meaning the inverse?
[8:59:38 AM] Imad Hammad: yes if x=99%
[8:59:55 AM] Imad Hammad: (1-x)% is 1%
[9:00:06 AM] Matt Brown: got it :)
[9:00:20 AM] Matt Brown: yes, so how is this valuable?
[9:00:30 AM] Matt Brown: I can put a weight on a transaction?
[9:00:36 AM] Imad Hammad: ok you take for example your historical data
[9:00:40 AM] Matt Brown: and consider this is the bigger picture and other data sources?
[9:01:08 AM] Imad Hammad: it is valuable in the sense of
[9:01:22 AM] Imad Hammad: if you take your historical data
[9:01:34 AM] Imad Hammad: and compute the quantiles
[9:02:11 AM] Imad Hammad: let's say at 0.1 % and the quantile at 99.9 %
[9:02:25 AM] Imad Hammad: you will have two numbers
[9:02:36 AM] Imad Hammad: y1 and y2
[9:03:05 AM] Imad Hammad: you know by construction that 0.1% of the data is < than y1
[9:03:44 AM] Imad Hammad: and 99.9% of the data is < than y2 which means that 0.1% of the data is > y2
[9:04:06 AM] Imad Hammad: so if you observe a new number
[9:04:15 AM] Imad Hammad: let's say z
[9:04:35 AM] Imad Hammad: if z<y1 or z>y2
[9:05:03 AM] Imad Hammad: you know that the probability of getting a number like z is < 0.2%
[9:05:16 AM] Imad Hammad: which is a small probability
[9:05:26 AM] Imad Hammad: which means that z might be an outlier
[9:05:49 AM] Matt Brown: I was about to ask the value in this, but this is clearly the best way... to determine quantiles
[9:06:29 AM] Matt Brown: because, it doesn't consider directly the stddev value, as it simply considers the data itself, sound right?
[9:06:48 AM] Matt Brown: because my data isn't in a normal distribution... this is likely the best bet (?)
[9:07:09 AM] Imad Hammad: perfect
[9:07:11 AM] Matt Brown: :D
[9:07:13 AM] Matt Brown: cool imad!
[9:07:22 AM] Matt Brown: thanks much!
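
The quantile test described above can be sketched roughly as follows. This is only an illustration: the helper names, the linear interpolation between neighbouring points, and the sample history are assumptions, not something specified in the conversation.

```python
import math

def quantile(sorted_values, fraction):
    """Linearly interpolated quantile of pre-sorted data; fraction is in [0, 1]."""
    position = fraction * (len(sorted_values) - 1)
    low = math.floor(position)
    high = min(low + 1, len(sorted_values) - 1)
    weight = position - low
    return sorted_values[low] * (1 - weight) + sorted_values[high] * weight

def is_outlier(z, history, tail=0.001):
    """Flag z if it falls below the 0.1% quantile (y1) or above the 99.9% quantile (y2)."""
    data = sorted(history)
    y1 = quantile(data, tail)
    y2 = quantile(data, 1 - tail)
    return z < y1 or z > y2

history = list(range(1, 1001))    # made-up historical observations
print(is_outlier(500, history))   # well inside the bulk of the data
print(is_outlier(5000, history))  # far above anything seen before
```

Nothing here depends on the mean or the standard deviation, so the same check applies whatever shape the data has.
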
[9:07:30 AM] Imad Hammad: another thing
[9:07:46 AM] Matt Brown: determining quantiles, and creating a histogram, are the same thing?
[9:08:32 AM] Matt Brown: i mean, a histogram is a representation of "percentile distribution" right?
[9:08:35 AM] Imad Hammad: a histogram is used to approximate the real distribution of the data
[9:09:10 AM] Imad Hammad: the quantiles should be computed with the real distribution of the data
[9:09:32 AM] Imad Hammad: but as we don't have access to it (it's theoretical)
[9:09:50 AM] Imad Hammad: we assume that the data that we have IS the true distribution
[9:09:58 AM] Imad Hammad: and we compute the quantiles using it
[9:10:17 AM] Imad Hammad: so in a way
[9:10:32 AM] Imad Hammad: a histogram gives more information than the quantiles
[9:10:44 AM] Imad Hammad: but if you have the quantiles for every x
[9:10:52 AM] Imad Hammad: you can reconstruct the histogram
[9:11:41 AM] Matt Brown: very cool
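
One way to read the remark about rebuilding a histogram from quantiles: if you can say what fraction of the data lies below any value y, then the count in a bin is just the difference of that fraction at the bin's two edges. A rough sketch, with hypothetical function names and made-up data:

```python
from bisect import bisect_left

def fraction_below(sorted_values, y):
    """Fraction of the data strictly below y (the inverse view of a quantile)."""
    return bisect_left(sorted_values, y) / len(sorted_values)

def histogram_from_quantiles(sorted_values, edges):
    """Count per bin [lo, hi), derived only from fraction_below at the bin edges."""
    n = len(sorted_values)
    return [
        round(n * (fraction_below(sorted_values, hi) - fraction_below(sorted_values, lo)))
        for lo, hi in zip(edges, edges[1:])
    ]

data = sorted([1, 2, 2, 3, 7, 8, 9, 15])  # made-up sample
print(histogram_from_quantiles(data, [0, 5, 10, 20]))
```
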
[9:12:16 AM] Matt Brown: as far as standard deviation goes, at this point, it no longer seems relevant?
[9:12:19 AM] Matt Brown: to find outliers?
[9:12:28 AM] Matt Brown: i'm trying to understand when it is relevant
[9:13:11 AM] Matt Brown: for instance, it is an aggregation, right, so it can be used to compare to other aggregated data, in my instance, over the same time period?
[9:14:37 AM] Imad Hammad: people like to use the standard deviation
[9:14:42 AM] Imad Hammad: because in most cases
[9:14:58 AM] Imad Hammad: the distributions that they work on
[9:15:07 AM] Imad Hammad: look like a normal distribution
[9:15:24 AM] Imad Hammad: so all they need
[9:15:37 AM] Imad Hammad: is to compute the mean and the stdev
[9:15:48 AM] Imad Hammad: and with these two values
[9:15:58 AM] Imad Hammad: you can compute the quantiles
[9:16:09 AM] Imad Hammad: for whatever x% you want
[9:16:15 AM] Imad Hammad: by a simple formula
[9:16:19 AM] Matt Brown: ahh hah... this is the method described in that wikihow article
[9:16:47 AM] Imad Hammad: so for example
[9:16:50 AM] Imad Hammad: there is a rule
[9:16:59 AM] Imad Hammad: that says that 68% of the data
[9:17:18 AM] Imad Hammad: is between mean +- 1 stdev
[9:17:45 AM] Imad Hammad: http://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule
[9:17:48 AM] Matt Brown: which means it's a normal distribution, right?
[9:18:55 AM] Imad Hammad: I guess
[9:19:15 AM] Matt Brown: well, maybe, according to this definition
[9:19:34 AM] Imad Hammad: but for example if your data
[9:19:45 AM] Imad Hammad: doesn't look like a normal distribution
[9:19:48 AM] Imad Hammad: you can't say
[9:20:06 AM] Imad Hammad: ok there is a prob of 68%
[9:20:13 AM] Matt Brown: considering the standard deviation is greater than 3?
[9:20:32 AM] Imad Hammad: of being in mean +- 1stdev
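
The "simple formula" for normal data can be tried out with Python's standard `statistics.NormalDist`. This is a sketch under the assumption that the data really is normal; the wrapper names and the example mean and stdev are made up.

```python
from statistics import NormalDist

def normal_quantile(mean, stdev, fraction):
    """Value below which `fraction` of a normal(mean, stdev) distribution lies."""
    return NormalDist(mu=mean, sigma=stdev).inv_cdf(fraction)

def normal_mass_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    standard = NormalDist()
    return standard.cdf(k) - standard.cdf(-k)

# The 68-95-99.7 rule, recovered from the normal CDF.
for k in (1, 2, 3):
    print(f"within {k} stdev: {normal_mass_within(k):.1%}")

# Made-up example: the 99.9% quantile of a normal with mean 100 and stdev 15.
print(normal_quantile(100, 15, 0.999))
```

These numbers only mean something if the histogram looks roughly bell shaped; for skewed data they can be far off, which is why the conversation turns to plotting the data first.
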
[9:21:13 AM] Imad Hammad: do you have a way to plot the histogram of your data
[9:21:16 AM] Imad Hammad: ?
[9:21:27 AM] Matt Brown: i was about to look into using python
[9:22:01 AM] Imad Hammad: you should start by plotting the histogram
[9:22:06 AM] Imad Hammad: to see what the data looks like
[9:22:12 AM] Matt Brown: okay, sounds good
[9:22:13 AM] Matt Brown: :)
[9:22:24 AM] Imad Hammad: let me know when you get it
[9:22:30 AM] Matt Brown: thanks much imad, i'll start with this, then I'll consider quantiles
[9:22:31 AM] Imad Hammad: and from there we can continue
[9:23:06 AM] Matt Brown: let me ask... in a bigger picture sense
[9:23:30 AM] Matt Brown: if I have this data, and I am to consider a time-wise basis to determine outliers
[9:23:40 AM] Matt Brown: what size of time span should i consider?
[9:24:31 AM] Matt Brown: at what point is it "too much" to say... consider 1-second "bars"?
[9:25:31 AM] Imad Hammad: I don't really understand
[9:25:47 AM] Imad Hammad: because I think that that's what we're trying to do
[9:25:52 AM] Imad Hammad: when you'll be done
[9:26:01 AM] Imad Hammad: you'll have a number
[9:26:10 AM] Matt Brown: yes, that's correct :)
[9:26:13 AM] Imad Hammad: that if you get z > number
[9:26:16 AM] Matt Brown: i guess i'll worry about it later
[9:26:18 AM] Imad Hammad: you'll say
[9:26:28 AM] Imad Hammad: hey this z is really big
[9:27:24 AM] Matt Brown: yes, this is what i want :)
[9:27:30 AM] Matt Brown: okay, let me hang out with Karl Pearson a bit and I'll get back to you
[9:27:33 AM] Matt Brown: thanks very much

I then wrote a Python script to pull the data I wanted and plot it as a histogram.
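
The original script is not reproduced here. As a stand-in, a minimal text histogram using only the standard library might look like the following; the function names, bin width, and sample byte counts are all assumptions for illustration, and the values are assumed to be non-negative integers.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count values per bin; each key is the inclusive lower edge of its bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def text_histogram(values, bin_width, max_bar=50):
    """Render one line per non-empty bin, with bars scaled to the tallest bin."""
    counts = bin_counts(values, bin_width)
    tallest = max(counts.values())
    lines = []
    for start, count in counts.items():
        bar = "#" * max(1, round(max_bar * count / tallest))
        lines.append(f"{start:>9} .. {start + bin_width - 1:<9} {count:>5} {bar}")
    return lines

sample = [0, 60, 60, 120, 120, 120, 1500, 14200, 14800, 20500]  # made-up byte counts
print("\n".join(text_histogram(sample, bin_width=1000)))
```

Empty bins are skipped in this sketch, so gaps in the data show up as jumps in the bin labels rather than as blank rows.
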

[2:09:49 PM] Matt Brown: hey imad
[2:10:07 PM] Matt Brown: I've made a histogram, but I am not sure how to judge the distribution per se
[2:11:03 PM] Imad Hammad: can I see it
[2:11:04 PM] Imad Hammad: ?
[2:11:07 PM] Matt Brown: yessir
[2:11:14 PM] Matt Brown: one moment I'm also saving the source
[2:11:33 PM] Matt Brown: only 1611 points
[2:11:56 PM] Imad Hammad: if there isn't lot of noise
[2:12:02 PM] Imad Hammad: it could be enough
[2:12:10 PM] Matt Brown: s:\matt\argus_data
[2:12:20 PM] Matt Brown: the Y-axis is quite huge
[2:12:39 PM] Matt Brown: i can easily make it smaller, but then I'm missing data
[2:12:40 PM] Matt Brown: :D
[2:13:30 PM] Imad Hammad: ok so your distribution is not gaussian (normal)
[2:13:46 PM] Imad Hammad: it looks like an exponential density
[2:13:57 PM] Imad Hammad: but do you understand why you have
[2:14:03 PM] Imad Hammad: some jumps
[2:14:05 PM] Imad Hammad: ?
[2:14:15 PM] Imad Hammad: like at
[2:14:16 PM] Matt Brown: the data is sorted ascending
[2:14:20 PM] Imad Hammad: 20000
[2:14:27 PM] Imad Hammad: and a bit
[2:14:42 PM] Matt Brown: i'm sorry i'm not following
[2:14:51 PM] Imad Hammad: before
[2:15:04 PM] Imad Hammad: if I have to guess
[2:15:22 PM] Imad Hammad: at 14000 on the x axis
[2:15:31 PM] Imad Hammad: you have a small jump
[2:15:48 PM] Imad Hammad: (you have more data)
[2:15:58 PM] Imad Hammad: do you see what I'm talking about
[2:15:59 PM] Imad Hammad: ?
[2:16:28 PM] Matt Brown: let's use S:\matt\argus_data\histo_y_20000.png
[2:16:50 PM] Imad Hammad: ah much better
[2:16:53 PM] Matt Brown: :D
[2:17:09 PM] Imad Hammad: just before 15000
[2:17:17 PM] Matt Brown: i'm not 100% sure what i'm doing, so i'm not sure if it was relevant to plot _everything_
[2:17:18 PM] Imad Hammad: between 14000 and 15000
[2:17:21 PM] Matt Brown: yes i see this
[2:17:29 PM] Matt Brown: and it is clear in the ascending dbytes_source_data.txt
[2:17:44 PM] Imad Hammad: do you understand why
[2:17:52 PM] Imad Hammad: you have something like that
[2:18:09 PM] Imad Hammad: if you go back to your original data
[2:18:20 PM] Imad Hammad: can you find the data
[2:18:26 PM] Matt Brown: as in, do I understand why I would have "20 out of 1611 given dbyte values within the 14000-15000 range?"
[2:18:45 PM] Imad Hammad: yes
[2:18:50 PM] Matt Brown: yes I see it in the source
[2:19:07 PM] Imad Hammad: ok so for example that's an outlier
[2:19:11 PM] Imad Hammad: do you agree
[2:19:12 PM] Imad Hammad: ?
[2:19:18 PM] Matt Brown: this is the difficult thing, this metric/data point is not predictable, as in, it will not always be a given value
[2:19:33 PM] Matt Brown: it might be 0, it might be, in this case, 1992207
[2:19:43 PM] Matt Brown: not exactly, it isn't an outlier
[2:19:57 PM] Matt Brown: I mean, but I guess it is given this histogram and the distribution
[2:20:05 PM] Matt Brown: this is what makes this soo difficult
[2:20:27 PM] Imad Hammad: anyway
[2:20:36 PM] Imad Hammad: you have two choices
[2:21:06 PM] Imad Hammad: 1- fit an exponential distribution to your data and use the quantiles of the exponential distribution
[2:21:15 PM] Imad Hammad: in this case
[2:21:43 PM] Imad Hammad: if you are looking for the number y for which probability of (z >=y) is less than x
[2:22:56 PM] Imad Hammad: I think that number must be
[2:23:12 PM] Imad Hammad: y = - 1/lambda * log (x)
[2:23:30 PM] Imad Hammad: where lambda is a parameter that you should find
[2:24:06 PM] Imad Hammad: maybe if you just compute the inverse of the average of your data
[2:24:17 PM] Imad Hammad: you can get a good approximation of lambda
[2:24:24 PM] Imad Hammad: or choice 2
[2:24:48 PM] Imad Hammad: 2- find the quantile at x% just using the data
[2:25:11 PM] Imad Hammad: python must have a function that does that
[2:25:16 PM] Imad Hammad: am I clear
[2:25:18 PM] Imad Hammad: ?
[2:25:39 PM] Matt Brown: I'm not familiar with the notation you're using, but I can research
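
For readers who also find the notation unfamiliar: for an exponential distribution with rate lambda, P(z >= y) = exp(-lambda * y), so setting that equal to x and solving gives y = -log(x) / lambda, where log is the natural logarithm. Choice 1 can be sketched as below, assuming the data really is close to exponential; the function name and the sample values are made up.

```python
import math
import statistics

def exponential_threshold(history, x):
    """Return y such that P(z >= y) = x under an exponential fit to history."""
    lam = 1 / statistics.fmean(history)  # lambda estimated as 1 / average
    return -math.log(x) / lam            # natural log, as in y = -1/lambda * log(x)

history = [100, 200, 300]  # made-up values with an average of 200
print(exponential_threshold(history, 0.01))  # exceeded about 1% of the time under the fit
```

Choice 2 needs no formula at all: it reads the threshold straight off the sorted data.
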
[2:26:08 PM] Matt Brown: can I mention one more concept to bring more color to the data I have
[2:26:55 PM] Matt Brown: this is a single given host->host upload instance, I could also consider comparing them to another host->host upload instance
[2:27:52 PM] Matt Brown: this moves away from this data point itself and determining outliers from this, and instead would consider a given timespan's value of say, a stddev or mean, and then try to detect off of this
[2:29:06 PM] Matt Brown: which is the best way: 1) use an "exponential distribution" or 2) find the quantile at x%
Imad Hammad: both can work
[2:32:50 PM] Imad Hammad: they are very easy to implement
[2:33:01 PM] Imad Hammad: and then you can choose the one that you like
[2:33:38 PM] Imad Hammad: but your idea is good
[2:34:05 PM] Imad Hammad: is that if you have host internal > host outside
[2:34:34 PM] Imad Hammad: you can for each host internal compute the average
[2:34:57 PM] Imad Hammad: and then just look at the biggest  one
[2:35:14 PM] Imad Hammad: it really depends on what you want to detect
[2:35:43 PM] Imad Hammad: can you tell me in simple words
[2:35:48 PM] Imad Hammad: what is an outlier
[2:35:52 PM] Imad Hammad: ?
[2:36:02 PM] Imad Hammad: in your context
[2:36:04 PM] Matt Brown: yes I have several questions I think are worth answering, and I want to derive a weight from them
[2:36:16 PM] Matt Brown: and then see if they are worthy of sending a real alert for action
[2:36:48 PM] Matt Brown: in simple words, I can't quite do that, truly, because I am not familiar enough with the data
[2:37:13 PM] Matt Brown: I am attempting to build a profile off of the following combinations (where saddr = "host internal" and daddr = "internet host")
[2:37:20 PM] Matt Brown: Considering a daddr + saddr pair:
- has this daddr+saddr been seen before?
- what is the nature of the previous traffic:
 - what protocol?
 - What time of the day?  How many times?  Can you create a standard deviation?
 - how many dbytes and sbytes?  Is this an outlier?
 - what is the appbyte ratio? (consumer versus producer)  is this an outlier?
 - what are the flow durations (-s mean and -s stddev).  is this an outlier?
 - what are the packet sizes?  is this an outlier?
- what is the country code of daddr/saddr?  is this an outlier?
[2:37:35 PM] Matt Brown: I can answer some of those simply, others I can not
[2:38:26 PM] Matt Brown: but you see, I am not using this instance of an dbytes(saddr->daddr)to make a decision
[2:39:06 PM] Matt Brown: but if it's an outlier, then I can add a weight, like "100" to the event and then calculate further, and if the weight rises over say... 500, I can throw an alert
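
The weighting idea might be sketched like this. The weight of 100 and the alert level of 500 come from the message above; the check names, the equal weighting of every check, and the function names are assumptions for illustration only.

```python
OUTLIER_WEIGHT = 100   # weight added per check that flags an outlier
ALERT_THRESHOLD = 500  # alert only when the total rises over this

def event_score(check_results, weight=OUTLIER_WEIGHT):
    """Sum a fixed weight for every check that flagged the event."""
    return weight * sum(1 for flagged in check_results.values() if flagged)

def should_alert(check_results, threshold=ALERT_THRESHOLD):
    """True when the combined score is strictly over the threshold."""
    return event_score(check_results) > threshold

# Hypothetical outcome of the per-question checks for one saddr/daddr event.
checks = {
    "new_pair": True,
    "protocol": False,
    "time_of_day": True,
    "dbytes": True,
    "appbyte_ratio": True,
    "flow_duration": True,
    "packet_size": True,
    "country_code": False,
}
print(event_score(checks), should_alert(checks))
```

No single check triggers an alert on its own here, which is the stated goal of keeping false positives down.
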
[2:39:33 PM] Imad Hammad: nice
[2:39:42 PM] Imad Hammad: I like what you're trying to do
[2:39:50 PM] Matt Brown: it seems to be the best way, or else, I will have a lot of false positives
[2:40:00 PM] Matt Brown: :D anomaly detection in network flows
[2:40:05 PM] Matt Brown: there are several papers written on it
[2:40:28 PM] Imad Hammad: I think most of your data
[2:40:39 PM] Imad Hammad: is gonna be from the exponential family
[2:41:06 PM] Imad Hammad: but I think you should start
[2:41:14 PM] Imad Hammad: by the quantiles
[2:41:44 PM] Imad Hammad: for each question, you compute the quantile at x%
[2:42:11 PM] Imad Hammad: then when you have a running framework that can put all these things together
[2:42:20 PM] Imad Hammad: and that can answer each question
[2:42:33 PM] Imad Hammad: you can start by improving what you get
[2:43:15 PM] Imad Hammad: you have just to keep in mind that you might have to go back to the method that you're using
[2:43:19 PM] Imad Hammad: and to change it
[2:43:47 PM] Imad Hammad: start with something simple
[2:43:55 PM] Imad Hammad: put a good framework in place
[2:44:01 PM] Matt Brown: understood, I am not well versed in my options, and this is what I fear... but I guess it will be apparent if the results I'm seeing are bad (false positives, etc)
[2:44:19 PM] Imad Hammad: yes it's always the same approach
[2:44:26 PM] Imad Hammad: start with simple things
[2:44:30 PM] Imad Hammad: and when you're done
[2:44:43 PM] Imad Hammad: improve them if they are not good enough
[2:45:17 PM] Matt Brown: i'm going to focus on this one example, because it should be fairly predictable... it is unique (me sending email with outlook, it speaks a specific protocol, and should be predictable, relatively, day-to-day)
[2:45:32 PM] Matt Brown: thanks imad!
[2:45:47 PM] Matt Brown: I'll work on the quantiles and see what i can get
[2:45:51 PM] Matt Brown: I really appreciate the help
[2:47:27 PM] Imad Hammad: my pleasure

After some time, I hacked together a script to find the quantile of a given value within a set of data/values.
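
That script is not shown in this post. A minimal sketch of the idea, using the definition from the conversation (the quantile of a value is the percentage of the data that is strictly below it), could look like this; the function name and sample data are made up.

```python
from bisect import bisect_left

def quantile_of(value, history):
    """Percentage of the historical values that are strictly below `value`."""
    data = sorted(history)
    return 100 * bisect_left(data, value) / len(data)

history = [1, 2, 3, 4]           # made-up data
print(quantile_of(3, history))   # two of the four values are below 3
print(quantile_of(10, history))  # everything is below 10
```
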

  1. April 19, 2014 at 7:04 am

    I shouldn’t have missed this post, this is very interesting read even though a long shot.

    Thanks for sharing.

    • April 20, 2014 at 6:44 pm

      Thanks CS. I’ve slowed down my investigations to creatively mine argus data since I started a new job about six weeks ago. I’ll likely implement argus stuff, but in a very long time (more than six months).

      I think using outlier as a data point to consider anomalous data might be of some value. I posted more stuff about the topic here.

      Carter has repeatedly mentioned the use of ARMA and moving averages while trying to determine outliers, and other stuff.

      Would love to see some stuff about it on your site!

      Also, wanted to mention BayesDB, could be quite interesting to find if an event falls within a tolerance range of expected action.
