Part 2: Locating outliers using an empirical method in Python with scipy’s mquantiles()


Setup environment:

yum -y install make gcc gcc-c++ gcc-gfortran cmake python-devel python-pip pcre pcre-devel freetype-devel libpng-devel atlas-devel libgfortran
wget http://www.python.org/ftp/python/2.7.4/Python-2.7.4.tgz
tar zxvf Python-*
cd Python-*
./configure && make && make install
cd
wget http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg
sh setuptools-0.6c11-py2.7.egg
curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
pip-2.7 install MySQL-python numpy matplotlib scipy
# matplotlib's install depends on the numpy egg, plus freetype-devel and libpng-devel as installed previously with yum
# scipy's install depends on the numpy egg, plus atlas-devel, libgfortran and gcc-gfortran

Creating a graphical histogram to review the distribution of data:
Plot a histogram of the data with Python to understand its distribution. Is it uniform, normal, or exponential?

In this instance, it is exponential.
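
As a rough sketch of that histogram step, the snippet below draws a synthetic exponential sample and writes a histogram to a PNG file. The sample, the function names `make_sample` and `plot_histogram`, the bin count and the output path are all assumptions for illustration; in practice the values would come from your own data source.

```python
import random


def make_sample(n, rate=1.0, seed=None):
    """Return n synthetic, exponentially distributed values (stand-in data)."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]


def plot_histogram(values, bins=50, path="histogram.png"):
    """Write a histogram of values to path as an image file."""
    # Select the non-interactive Agg backend before pyplot is loaded,
    # so this also works on a server without a display.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    ax.set_xlabel("value")
    ax.set_ylabel("count")
    ax.set_title("Distribution of the data")
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_histogram(make_sample(10000, seed=1))` should produce a plot with a tall bar near zero and a long tail to the right, which is the shape to look for when deciding that data is exponential.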

Locating which (per mille) quantile a given X falls into:
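If you are unsure whether mquantiles() is accurate for an exponential data set, see: http://stackoverflow.com/questions/17330252/when-dealing-with-an-exponential-data-set-is-using-mquantiles-accurate

In short, the method works fine here: it is empirical, so it reads quantiles straight from the sorted data and does not assume any particular distribution.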
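
One way to locate which per mille quantile a given X falls into is to count the share of observations at or below X. This is a minimal sketch using only the standard library; the function name `per_mille_rank` and the sample values are made up for illustration and are not taken from the original script.

```python
from bisect import bisect_right


def per_mille_rank(sorted_values, x):
    """Return the share of observations <= x, in per mille (0 to 1000).

    sorted_values must already be sorted in ascending order.
    """
    if not sorted_values:
        raise ValueError("need at least one observation")
    # bisect_right gives the number of elements that are <= x.
    count = bisect_right(sorted_values, x)
    return 1000.0 * count / len(sorted_values)


data = sorted([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
rank = per_mille_rank(data, 5)  # 8 of the 10 values are <= 5
```

Here `rank` is 800.0, meaning the value 5 sits at the 800th per mille of this small sample.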

Using an empirical model to find the quantile at x%:
1) get the data as a tuple/list
2) sort the list ascending
3) calculate: index=trunc(x%*(N-1)), where N is the number of items in the list
4) get sorted_list[index]
5) this is the value of the quantile at x%
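
The steps above can be sketched in a few lines of plain Python. The function name `empirical_quantile` and the sample numbers are assumptions for illustration, not the contents of the gist.

```python
import math


def empirical_quantile(values, fraction):
    """Return the value at the given quantile (fraction between 0.0 and 1.0)."""
    if not values:
        raise ValueError("need at least one observation")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    ordered = sorted(values)                              # step 2
    index = int(math.trunc(fraction * (len(ordered) - 1)))  # step 3
    return ordered[index]                                 # steps 4 and 5


# Made-up sample: mostly small values with one obvious outlier.
latencies = [12, 15, 11, 230, 14, 13, 16, 12, 18, 17]
cutoff = empirical_quantile(latencies, 0.90)
outliers = [v for v in latencies if v > cutoff]
```

With this sample, `cutoff` is 18 and `outliers` is `[230]`. Note that scipy's mquantiles() interpolates between neighbouring values by default, so its result can differ slightly from this truncating version.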

See this gist for a python script.
