pandas 1.4.2

describe(self: 'NDFrameT', percentiles=None, include=None, exclude=None, datetime_is_numeric=False) -> 'NDFrameT'

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Notes

For numeric data, the result's index will include count, mean, std, min, max as well as lower, 50th and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50th percentile is the same as the median.
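As a minimal sketch of the numeric case, the snippet below lists the index labels of the summary and checks that the 50% row agrees with Series.median(). The sample values and the variable names (s, summary, labels, median_matches) are illustrative assumptions, not part of the original documentation.

```python
import pandas as pd

# Illustrative data; any numeric Series would do.
s = pd.Series([1, 2, 3, 4, 100])
summary = s.describe()

# Index labels of the summary, in the order pandas returns them.
labels = list(summary.index)

# The "50%" row is expected to equal the median of the Series.
median_matches = summary["50%"] == s.median()
```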

For object data (e.g. strings or timestamps), the result's index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value's frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the top result will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
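The default column selection described above can be sketched as follows. The frames df_obj and df_mixed and their column names are made up for illustration; the snippet only shows which columns and rows end up in the summary.

```python
import pandas as pd

# A frame with no numeric columns: one string column, one categorical column.
df_obj = pd.DataFrame({
    "label": ["x", "y", "x"],
    "grade": pd.Categorical(["a", "b", "a"]),
})
summary = df_obj.describe()
columns = list(summary.columns)  # both non-numeric columns are analyzed
rows = list(summary.index)       # count, unique, top, freq

# Adding a numeric column switches the default back to numeric-only output.
df_mixed = df_obj.assign(score=[1.0, 2.0, 3.0])
mixed_columns = list(df_mixed.describe().columns)
```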

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
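A small sketch of the Series behaviour: passing include to Series.describe() should not change the result. The data and variable names here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "a", "b"])

default_summary = s.describe()
# include is accepted but has no effect on a Series.
with_include = s.describe(include=[np.number])

same = default_summary.equals(with_include)
```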

Parameters

percentiles : list-like of numbers, optional: The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
include : 'all', list-like of dtypes or None (default), optional: A white list of data types to include in the result. Ignored for Series. 'all' includes every column of the input; a list-like of dtypes limits the result to those types (for example np.number, object, or 'category'); None (default) includes all numeric columns.
exclude : list-like of dtypes or None (default), optional: A black list of data types to omit from the result. Ignored for Series. A list-like of dtypes excludes those types (for example np.number or object); None (default) excludes nothing.
datetime_is_numeric : bool, default False: Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.

versionadded
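To illustrate the percentiles parameter, the sketch below requests the 10th and 90th percentiles instead of the defaults. The Series and the variable names are assumptions chosen for the example; the row labels follow the percent formatting seen elsewhere on this page.

```python
import pandas as pd

# Integers 1..10, so the requested quantiles are easy to verify by hand.
s = pd.Series(range(1, 11))

summary = s.describe(percentiles=[0.1, 0.9])
labels = list(summary.index)

# Percentile rows are labelled as percentages, e.g. "10%" and "90%".
p10 = summary["10%"]
p90 = summary["90%"]
```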

Returns

Series or DataFrame: Summary statistics of the Series or DataFrame provided.

Generate descriptive statistics.

See Also

DataFrame.count: Count number of non-NA/null observations.

DataFrame.max: Maximum of the values in the object.

DataFrame.mean: Mean of the values.

DataFrame.min: Minimum of the values in the object.

DataFrame.select_dtypes: Subset of a DataFrame including/excluding columns based on their dtype.

DataFrame.std: Standard deviation of the observations.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  # doctest: +SKIP
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
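For comparison, the same numeric-only summary can be obtained by selecting columns first with DataFrame.select_dtypes (listed under See Also) and then calling describe(). This is a sketch under the assumption that both routes select the same columns; it rebuilds the example frame so it can run on its own.

```python
import numpy as np
import pandas as pd

# Same frame as in the examples above.
df = pd.DataFrame({
    "categorical": pd.Categorical(["d", "e", "f"]),
    "numeric": [1, 2, 3],
    "object": ["a", "b", "c"],
})

# Route 1: let describe() filter by dtype.
via_include = df.describe(include=[np.number])

# Route 2: filter by dtype first, then describe what is left.
via_select = df.select_dtypes(include=[np.number]).describe()

equivalent = via_include.equals(via_select)
```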

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  # doctest: +SKIP
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  # doctest: +SKIP
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  # doctest: +SKIP
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0

File: /pandas/core/generic.py#9976
type: <class 'function'>
Commit: