Part 2: Locating outliers using an empirical method in Python with scipy’s mquantiles()
yum -y install make gcc gcc-c++ gcc-gfortran cmake python-devel python-pip pcre pcre-devel freetype-devel libpng-devel atlas-devel libgfortran
wget http://www.python.org/ftp/python/2.7.4/Python-2.7.4.tgz
tar zxvf Python-2.7.4.tgz
cd Python-2.7.4
./configure && make && make install
cd
wget http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg
sh setuptools-0.6c11-py2.7.egg
curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
pip-2.7 install MySQL-python numpy matplotlib scipy
# matplotlib's install depends on the numpy egg, freetype-devel and libpng-devel, as installed previously with yum
# scipy's install depends on the numpy egg, atlas-devel, libgfortran and gcc-gfortran
Creating a graphical histogram to review the distribution of data:
Plot a histogram of the data with Python to understand its distribution: is it uniform, normal, or exponential?
In this instance, it is exponential.
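A minimal sketch of such a histogram with matplotlib. The real data set is not shown here, so the example assumes synthetic exponential data as a stand-in; the variable names, bin count and output file name are illustrative only.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np

# Assumption: synthetic exponential data standing in for the real measurements.
rng = np.random.RandomState(0)
data = rng.exponential(scale=1.0, size=10000)

# Draw the histogram; hist() also returns the per-bin counts and the bin edges.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(data, bins=50)
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.set_title("Histogram of data")

# Save the figure to a temporary file (hypothetical file name).
out_path = os.path.join(tempfile.gettempdir(), "histogram.png")
fig.savefig(out_path)
print("histogram written to %s" % out_path)
```

An exponential data set shows up as a tall first bar followed by a steadily decaying tail, whereas uniform data gives roughly equal bars and normal data a symmetric bell shape.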
Locating which (per mille) quantile a given X falls into:
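For a discussion of whether mquantiles is accurate on an exponential data set, see: http://stackoverflow.com/questions/17330252/when-dealing-with-an-exponential-data-set-is-using-mquantiles-accurate
In short, the method works fine here: empirical quantiles are computed from the sorted data itself, so they do not assume any particular distribution.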
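One possible way to locate the per mille quantile of a given X is to compute the 999 per mille cut points with scipy.stats.mstats.mquantiles and count how many of them lie below X. This is a sketch, not necessarily the exact approach used on the real data: the helper name permille_bin and the synthetic sample are assumptions, and mquantiles is called with its default alphap and betap settings.

```python
import numpy as np
from scipy.stats.mstats import mquantiles


def permille_bin(data, x):
    """Return how many of the 999 per mille cut points lie strictly below x.

    The result ranges from 0 (x is below the 1st per mille cut point)
    to 999 (x is above the 999th per mille cut point).
    """
    probs = np.arange(1, 1000) / 1000.0          # 0.001, 0.002, ..., 0.999
    cuts = np.asarray(mquantiles(data, prob=probs))
    # The cut points are ascending, so a binary search counts those below x.
    return int(np.searchsorted(cuts, x, side="left"))


# Assumption: synthetic exponential data standing in for the real measurements.
rng = np.random.RandomState(0)
sample = rng.exponential(scale=1.0, size=10000)

print(permille_bin(sample, 5.0))
```

A value that lands in one of the highest bins (for example 990 or above) is a candidate outlier; where to draw that line depends on the data and is left to the reader.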
Using an empirical model to find the quantile at x%:
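1) get the data as a tuple/list
2) sort the list in ascending order
3) calculate: index = trunc(x% * (N-1)), where N is the number of items in the list
4) get the item at sorted_list[index]
5) this item is the value of the quantile at x%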
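The steps above can be sketched in plain Python as follows. The function name empirical_quantile and the sample values are made up for illustration, and x% is passed as a fraction between 0 and 1.

```python
import math


def empirical_quantile(values, fraction):
    """Return the empirical quantile of values at the given fraction (0..1)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)                                  # step 2: sort ascending
    index = int(math.trunc(fraction * (len(ordered) - 1)))    # step 3: trunc(x% * (N-1))
    return ordered[index]                                     # steps 4 and 5


# Hypothetical sample data.
data = [40, 15, 50, 20, 35]
print(empirical_quantile(data, 0.5))   # 35
print(empirical_quantile(data, 0.4))   # 20
```

Because the index is truncated rather than interpolated, the result is always one of the original data points, unlike mquantiles, which interpolates between neighbouring points.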