Univariate descriptive statistics

In [1]:
%matplotlib nbagg

import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Histogram

In Python, there are two hist functions for drawing histograms:

  • the hist function of pandas, which applies to DataFrames

  • the hist function of matplotlib, which applies to arrays or sequences of arrays

The two functions differ slightly in their options.

Let us look at a few examples:

In [2]:
data = pd.DataFrame(np.random.randn(1000,1),columns=['X'])
data.head()
Out[2]:
X
0 -1.412894
1 -0.467703
2 0.604958
3 1.902368
4 -0.612601
In [3]:
data.hist(normed=True)
Out[3]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x5240b50>]], dtype=object)

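The option normed=True is required to obtain a histogram as defined in the course, that is, one whose bars have a total area of 1. Check what happens if you remove this option. (Matplotlib 2.1 and later call this option density; normed was removed in Matplotlib 3.1.)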
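
To see numerically what the normalization changes, the following sketch uses numpy.histogram (not part of the original notebook) with its density option, which plays the same role as normed: each bar height becomes the count divided by (n × bin width), so the bar areas sum to 1. The seed and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed, for reproducibility only
x = rng.randn(1000)

# Raw counts: the heights sum to the sample size
counts, edges = np.histogram(x, bins=10)
# Normalized heights: counts / (n * bin width)
heights, _ = np.histogram(x, bins=10, density=True)

widths = np.diff(edges)
total_area = np.sum(heights * widths)   # area under the normalized histogram
print(counts.sum(), total_area)
```

Without normalization the bar heights depend on the sample size and on the bin width, which makes histograms of different samples hard to compare.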

The hist function of pandas has many options. Take a look:

In [4]:
help(data.hist)
Help on method hist_frame in module pandas.tools.plotting:

hist_frame(data, column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, **kwds) method of pandas.core.frame.DataFrame instance
    Draw histogram of the DataFrame's series using matplotlib / pylab.
    
    Parameters
    ----------
    data : DataFrame
    column : string or sequence
        If passed, will be used to limit data to a subset of columns
    by : object, optional
        If passed, then used to form histograms for separate groups
    grid : boolean, default True
        Whether to show axis grid lines
    xlabelsize : int, default None
        If specified changes the x-axis label size
    xrot : float, default None
        rotation of x axis labels
    ylabelsize : int, default None
        If specified changes the y-axis label size
    yrot : float, default None
        rotation of y axis labels
    ax : matplotlib axes object, default None
    sharex : boolean, default True if ax is None else False
        In case subplots=True, share x axis and set some x axis labels to
        invisible; defaults to True if ax is None otherwise False if an ax
        is passed in; Be aware, that passing in both an ax and sharex=True
        will alter all x axis labels for all subplots in a figure!
    sharey : boolean, default False
        In case subplots=True, share y axis and set some y axis labels to
        invisible
    figsize : tuple
        The size of the figure to create in inches by default
    layout: (optional) a tuple (rows, columns) for the layout of the histograms
    bins: integer, default 10
        Number of histogram bins to be used
    kwds : other plotting keyword arguments
        To be passed to hist function

The most useful options for us are:

  • normed=True to obtain a true histogram

  • bins, the number of subintervals

  • grid (a boolean) to show or hide the grid

In the example above, try several values for bins. What is the best value of bins?
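
There is no single correct answer: too few bins hide the shape of the distribution, too many make the histogram noisy. As a possible starting point, NumPy implements a few classical rules of thumb. The sketch below assumes NumPy 1.15 or later (for numpy.histogram_bin_edges) and compares the number of bins these rules suggest for a sample like X; it is an illustration, not the answer expected by the course.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(1000)

# Number of bins suggested by three rules of thumb
suggested = {}
for rule in ("sturges", "scott", "fd"):
    edges = np.histogram_bin_edges(x, bins=rule)
    suggested[rule] = len(edges) - 1
    print(rule, suggested[rule])
```

The suggested value can then be passed as bins to either hist function and adjusted by eye.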

For DataFrames with several columns:

  • column, the name(s) of the columns for which to draw a histogram

  • sharex, sharey (booleans) to indicate whether all the histograms should be drawn on the same scale

  • by, a column used to split the data into groups, with one histogram per group (see the example below)

In [5]:
data['Y'] = np.random.binomial(3,.5,1000)
data.head(10)
Out[5]:
X Y
0 -1.412894 2
1 -0.467703 3
2 0.604958 1
3 1.902368 2
4 -0.612601 2
5 -1.436431 2
6 0.816971 2
7 -0.168146 2
8 0.228638 2
9 0.057354 2
In [7]:
data.hist(normed=True,sharex=True,sharey=True)
Out[7]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x5921430>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x599f2d0>]], dtype=object)
In [8]:
data.hist(column='X',by='Y',normed=True,sharex=True)
Out[8]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x5cbe470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x5e49530>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x5e86bd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x5eb8dd0>]], dtype=object)
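
The by option can be read as a group-by: the rows are split according to the values of Y, and one histogram of X is drawn per group. A rough sketch of the same split done by hand with groupby follows; the seed, the name df and the printed summary are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({"X": rng.randn(1000),
                   "Y": rng.binomial(3, 0.5, 1000)})

# One sub-sample of X per distinct value of Y
group_sizes = {}
for y_value, x_group in df.groupby("Y")["X"]:
    heights, edges = np.histogram(x_group, bins=10, density=True)
    group_sizes[y_value] = len(x_group)
    print(y_value, len(x_group), heights.max())
```

The groups have different sizes, which is one more reason to normalize the histograms before comparing them.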

And now, for comparison, let us look at the hist function of matplotlib:

In [9]:
data2 = pd.DataFrame(np.random.randn(500,1),columns=['Normal'])
data2['Uniform'] = np.random.rand(500)
data2['Exponentiel'] = np.random.exponential(2,500)
data2['Poisson'] = np.random.poisson(2,500)
data2.head()
Out[9]:
Normal Uniform Exponentiel Poisson
0 -0.472721 0.141535 1.176710 2
1 -0.904730 0.707581 3.751751 3
2 1.321365 0.854124 3.880027 2
3 -0.968062 0.796058 2.172694 1
4 -0.214211 0.678800 4.683781 2

The parameters of plt.hist are:

  • normed=True to obtain a true histogram

  • bins, the number of subintervals of the histogram

  • range, the interval of x values

  • and for other options, see the help:

In [10]:
help(plt.hist)
Help on function hist in module matplotlib.pyplot:

hist(x, bins=10, range=None, normed=False, weights=None, cumulative=False, bottom=None, histtype=u'bar', align=u'mid', orientation=u'vertical', rwidth=None, log=False, color=None, label=None, stacked=False, hold=None, data=None, **kwargs)
    Plot a histogram.
    
    Compute and draw the histogram of *x*. The return value is a
    tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
    [*patches0*, *patches1*,...]) if the input contains multiple
    data.
    
    Multiple data can be provided via *x* as a list of datasets
    of potentially different length ([*x0*, *x1*, ...]), or as
    a 2-D ndarray in which each column is a dataset.  Note that
    the ndarray form is transposed relative to the list form.
    
    Masked arrays are not supported at present.
    
    Parameters
    ----------
    x : (n,) array or sequence of (n,) arrays
        Input values, this takes either a single array or a sequency of
        arrays which are not required to be of the same length
    
    bins : integer or array_like, optional
        If an integer is given, `bins + 1` bin edges are returned,
        consistently with :func:`numpy.histogram` for numpy version >=
        1.3.
    
        Unequally spaced bins are supported if `bins` is a sequence.
    
        default is 10
    
    range : tuple or None, optional
        The lower and upper range of the bins. Lower and upper outliers
        are ignored. If not provided, `range` is (x.min(), x.max()). Range
        has no effect if `bins` is a sequence.
    
        If `bins` is a sequence or `range` is specified, autoscaling
        is based on the specified bin range instead of the
        range of x.
    
        Default is ``None``
    
    normed : boolean, optional
        If `True`, the first element of the return tuple will
        be the counts normalized to form a probability density, i.e.,
        ``n/(len(x)`dbin)``, i.e., the integral of the histogram will sum
        to 1. If *stacked* is also *True*, the sum of the histograms is
        normalized to 1.
    
        Default is ``False``
    
    weights : (n, ) array_like or None, optional
        An array of weights, of the same shape as `x`.  Each value in `x`
        only contributes its associated weight towards the bin count
        (instead of 1).  If `normed` is True, the weights are normalized,
        so that the integral of the density over the range remains 1.
    
        Default is ``None``
    
    cumulative : boolean, optional
        If `True`, then a histogram is computed where each bin gives the
        counts in that bin plus all bins for smaller values. The last bin
        gives the total number of datapoints.  If `normed` is also `True`
        then the histogram is normalized such that the last bin equals 1.
        If `cumulative` evaluates to less than 0 (e.g., -1), the direction
        of accumulation is reversed.  In this case, if `normed` is also
        `True`, then the histogram is normalized such that the first bin
        equals 1.
    
        Default is ``False``
    
    bottom : array_like, scalar, or None
        Location of the bottom baseline of each bin.  If a scalar,
        the base line for each bin is shifted by the same amount.
        If an array, each bin is shifted independently and the length
        of bottom must match the number of bins.  If None, defaults to 0.
    
        Default is ``None``
    
    histtype : {'bar', 'barstacked', 'step',  'stepfilled'}, optional
        The type of histogram to draw.
    
        - 'bar' is a traditional bar-type histogram.  If multiple data
          are given the bars are aranged side by side.
    
        - 'barstacked' is a bar-type histogram where multiple
          data are stacked on top of each other.
    
        - 'step' generates a lineplot that is by default
          unfilled.
    
        - 'stepfilled' generates a lineplot that is by default
          filled.
    
        Default is 'bar'
    
    align : {'left', 'mid', 'right'}, optional
        Controls how the histogram is plotted.
    
            - 'left': bars are centered on the left bin edges.
    
            - 'mid': bars are centered between the bin edges.
    
            - 'right': bars are centered on the right bin edges.
    
        Default is 'mid'
    
    orientation : {'horizontal', 'vertical'}, optional
        If 'horizontal', `~matplotlib.pyplot.barh` will be used for
        bar-type histograms and the *bottom* kwarg will be the left edges.
    
    rwidth : scalar or None, optional
        The relative width of the bars as a fraction of the bin width.  If
        `None`, automatically compute the width.
    
        Ignored if `histtype` is 'step' or 'stepfilled'.
    
        Default is ``None``
    
    log : boolean, optional
        If `True`, the histogram axis will be set to a log scale. If `log`
        is `True` and `x` is a 1D array, empty bins will be filtered out
        and only the non-empty (`n`, `bins`, `patches`) will be returned.
    
        Default is ``False``
    
    color : color or array_like of colors or None, optional
        Color spec or sequence of color specs, one per dataset.  Default
        (`None`) uses the standard line color sequence.
    
        Default is ``None``
    
    label : string or None, optional
        String, or sequence of strings to match multiple datasets.  Bar
        charts yield multiple patches per dataset, but only the first gets
        the label, so that the legend command will work as expected.
    
        default is ``None``
    
    stacked : boolean, optional
        If `True`, multiple data are stacked on top of each other If
        `False` multiple data are aranged side by side if histtype is
        'bar' or on top of each other if histtype is 'step'
    
        Default is ``False``
    
    Returns
    -------
    n : array or list of arrays
        The values of the histogram bins. See **normed** and **weights**
        for a description of the possible semantics. If input **x** is an
        array, then this is an array of length **nbins**. If input is a
        sequence arrays ``[data1, data2,..]``, then this is a list of
        arrays with the values of the histograms for each of the arrays
        in the same order.
    
    bins : array
        The edges of the bins. Length nbins + 1 (nbins left edges and right
        edge of last bin).  Always a single array even when multiple data
        sets are passed in.
    
    patches : list or list of lists
        Silent list of individual patches used to create the histogram
        or list of such list if multiple input datasets.
    
    Other Parameters
    ----------------
    kwargs : `~matplotlib.patches.Patch` properties
    
    See also
    --------
    hist2d : 2D histograms
    
    Notes
    -----
    Until numpy release 1.5, the underlying numpy histogram function was
    incorrect with `normed`=`True` if bin sizes were unequal.  MPL
    inherited that error.  It is now corrected within MPL when using
    earlier numpy versions.
    
    Examples
    --------
    .. plot:: mpl_examples/statistics/histogram_demo_features.py
    
    Notes
    -----
    
    In addition to the above described arguments, this function can take a
    **data** keyword argument. If such a **data** argument is given, the
    following arguments are replaced by **data[<arg>]**:
    
    * All arguments with the following names: 'x', 'weights'.
    
    
    
    
    Additional kwargs: hold = [True|False] overrides default hold state

In [11]:
plt.figure()
plt.subplot(221)
plt.hist(data2['Normal'], normed=True,bins=25)
plt.subplot(222)
plt.hist(data2['Uniform'], normed=True,bins=25)
plt.subplot(223)
plt.hist(data2['Exponentiel'], normed=True,bins=25)
plt.subplot(224)
plt.hist(data2['Poisson'], normed=True,bins=25)
Out[11]:
(array([ 0.5    ,  0.     ,  0.     ,  0.83125,  0.     ,  0.     ,
         0.88125,  0.     ,  0.     ,  0.475  ,  0.     ,  0.     ,
         0.28125,  0.     ,  0.     ,  0.1125 ,  0.     ,  0.     ,
         0.0375 ,  0.     ,  0.     ,  0.     ,  0.     ,  0.     ,  0.00625]),
 array([ 0.  ,  0.32,  0.64,  0.96,  1.28,  1.6 ,  1.92,  2.24,  2.56,
         2.88,  3.2 ,  3.52,  3.84,  4.16,  4.48,  4.8 ,  5.12,  5.44,
         5.76,  6.08,  6.4 ,  6.72,  7.04,  7.36,  7.68,  8.  ]),
 <a list of 25 Patch objects>)
In [13]:
plt.figure()
plt.hist((data2['Normal'],data2['Exponentiel'],data2['Uniform'],data2['Poisson']), normed=True,bins=15)
plt.legend(['Normal','Exponentiel','Uniform','Poisson'])
Out[13]:
<matplotlib.legend.Legend at 0x7548d10>
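
The values returned by hist, visible in Out[11] above, can be reused. The following sketch illustrates the range option and the returned tuple (n, bins, patches). It is written for a recent Matplotlib, so it uses density, the name that replaced normed; the seed and the interval [0, 5] are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
sample = rng.exponential(2, 500)

fig, ax = plt.subplots()
# Restrict the bins to [0, 5]; values outside this interval are ignored
n, bins, patches = ax.hist(sample, bins=10, range=(0, 5), density=True)

print(len(n), len(bins), len(patches))   # 10 heights, 11 edges, 10 rectangles
area = np.sum(n * np.diff(bins))         # area of the normalized histogram
print(area)
```

Note that with range the normalization is computed on the retained values only, so the area is still 1 even though part of the sample is left out.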

Boxplot

To draw boxplots (box-and-whisker plots), use either the boxplot function of pandas or the boxplot function of matplotlib.

In [15]:
plt.figure()
bp = data2.boxplot()
/Library/Python/2.7/site-packages/ipykernel/__main__.py:2: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  from ipykernel import kernelapp as app

The options of boxplot in pandas are:

  • column, the list of columns for which to draw a boxplot

  • by, the column used to form groups

  • figsize, the size of the figure

  • grid (a boolean) to show or hide the grid

  • etc.
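
The quantities a boxplot displays can also be computed directly. The sketch below assumes Matplotlib's default convention: the box spans the first to third quartile, the whiskers extend to the most extreme observations within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn individually. The sample and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
s = pd.Series(rng.exponential(2, 500))

# Box: first quartile, median, third quartile
q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers: most extreme observations within 1.5 * IQR of the box
lower_whisker = s[s >= q1 - 1.5 * iqr].min()
upper_whisker = s[s <= q3 + 1.5 * iqr].max()
# Points beyond the whiskers are plotted one by one
outliers = s[(s < lower_whisker) | (s > upper_whisker)]

print(q1, median, q3, lower_whisker, upper_whisker, len(outliers))
```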

In [16]:
data.boxplot(column='X',by='Y')
plt.xlabel(' ')
plt.title(' ')
Out[16]:
<matplotlib.text.Text at 0x7d6f4b0>

QQ-plot

To draw QQ-plots, we will use the following function:

In [17]:
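def qqplot(x, y, alpha=np.arange(0.05,0.95,.01)):
    # Empirical quantiles of both samples at the same probability levels
    qx = x.quantile(alpha)
    qy = y.quantile(alpha)
    plt.scatter(qx, qy, marker='o', s=60, facecolors='none', linewidths=1)
    # Reference line y = x, drawn over the range of x
    plt.plot([x.min(), x.max()], [x.min(), x.max()], '--')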
In [18]:
plt.figure()
qqplot(data2['Normal'],data2['Uniform'])
In [19]:
plt.figure()
qqplot(data2.loc[0:249,'Normal'],data2.loc[250:499,'Normal'])

How should the shape of the QQ-plots above be interpreted?
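
As a numerical companion to the plots, one can measure how far the quantile pairs lie from the diagonal. The sketch below is only an illustration under stated assumptions: the helper name max_quantile_gap and the seed are hypothetical, and the samples are simulated again rather than taken from data2.

```python
import numpy as np
import pandas as pd

def max_quantile_gap(x, y, alpha=np.arange(0.05, 0.95, 0.01)):
    """Largest vertical distance between the QQ-plot points and the line y = x."""
    qx = x.quantile(alpha)
    qy = y.quantile(alpha)
    return float(np.max(np.abs(qx.values - qy.values)))

rng = np.random.RandomState(0)
normal = pd.Series(rng.randn(500))
uniform = pd.Series(rng.rand(500))

# Two halves of the same sample versus two different distributions
gap_same = max_quantile_gap(normal.iloc[:250], normal.iloc[250:])
gap_diff = max_quantile_gap(normal, uniform)
print(gap_same, gap_diff)
```

A small gap means the points stay close to the dashed line; a large gap means the two samples have clearly different quantiles.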

Exercise to hand in

For this exercise, you are asked to create a new notebook containing all the code needed to produce the figures, together with the answers to the questions below.

Submit the exercise to your drop box on Sakai, or by email to tabea.rebafka@upmc.fr if you have problems with Sakai.

You may hand in this exercise in pairs. In that case, each of you must upload a copy to your own Sakai drop box AND both names must appear in the notebook.

The exercise uses the datasets from the previous notebooks.

All figures must be easy to understand (remember to add legends, label the axes, add titles, etc.).

  1. Choose three chicks from each diet and plot their weight curves in a single graph. Use different colors for the different diets.

  2. Draw the four boxplots of chick weight on day 0, one per diet group. Comment on the graph.

  3. Draw the four boxplots of chick weight on day 21, one per diet group. Comment on the graph.

  4. Draw the three histograms of the neurons in a single figure. Comment on the figure.

  5. Draw the QQ-plots of the neurons for every possible pair of datasets. Interpret the figures.