describe(self, **kwargs)
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
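As a minimal sketch of the NaN handling described above, the snippet below summarizes a small series containing one missing value. The data is made up for illustration; the point is that the missing entry contributes to neither the count nor the mean.

```python
import numpy as np
import pandas as pd

# Illustrative data: four entries, one of them missing.
s = pd.Series([1.0, 2.0, np.nan, 4.0])
summary = s.describe()

# count reports only the non-NaN observations.
print(summary["count"])
# mean is computed over 1.0, 2.0 and 4.0 only.
print(summary["mean"])
```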
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
For numeric data, the result's index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
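The following sketch, using arbitrary sample values, lists the index labels produced for numeric data with default arguments and checks that the 50% entry agrees with the median.

```python
import pandas as pd

# Arbitrary numeric sample, chosen only for illustration.
values = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])
summary = values.describe()

# Index labels of the numeric summary, in the order they are returned.
print(list(summary.index))

# The 50% row and Series.median() report the same quantity.
print(summary["50%"] == values.median())
```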
For object data (e.g. strings or timestamps), the result's index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value's frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
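To illustrate the tie case, here is a small sketch with two equally frequent values (the data is invented for the example). Because the choice among tied values is arbitrary, the code only checks that top is one of the tied candidates rather than relying on a specific one.

```python
import pandas as pd

# 'a' and 'b' both occur twice, so they tie for the most common value.
tied = pd.Series(["a", "a", "b", "b", "c"])
summary = tied.describe()

# top may be either tied value; do not depend on which one is reported.
print(summary["top"] in {"a", "b"})
# freq is the frequency shared by the tied values.
print(summary["freq"])
```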
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
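The three cases above can be sketched with a small mixed frame. The column names and values here are illustrative assumptions, not part of the API.

```python
import pandas as pd

# Hypothetical frame with one categorical, one numeric and one string column.
df = pd.DataFrame({
    "categorical": pd.Categorical(["d", "e", "f"]),
    "numeric": [1, 2, 3],
    "object": ["a", "b", "c"],
})

# Default: only the numeric column is analyzed.
default_cols = list(df.describe().columns)

# No numeric columns present: the non-numeric columns are analyzed instead.
non_numeric_cols = list(df[["categorical", "object"]].describe().columns)

# include='all': rows for both numeric and non-numeric statistics appear.
all_rows = set(df.describe(include="all").index)

print(default_cols)
print(non_numeric_cols)
print(sorted(all_rows))
```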
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
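As a hedged illustration of the last point, the sketch below passes include to a Series and compares the result with a plain call. The example data is arbitrary; the expectation, based on the statement above, is that both calls return the same summary.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

plain = s.describe()
# include is accepted but has no effect for a Series, per the text above.
with_include = s.describe(include=[object])

print(with_include.equals(plain))
```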
The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
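A minimal sketch of requesting custom percentiles follows. The values 0.1, 0.5 and 0.9 are chosen only for illustration; the median is passed explicitly so the example does not depend on whether it is added automatically.

```python
import pandas as pd

# Integers 1 through 10, used only as sample data.
s = pd.Series(range(1, 11))

# Request the 10th, 50th and 90th percentiles instead of the defaults.
summary = s.describe(percentiles=[0.1, 0.5, 0.9])

# The percentile rows are labeled as percentages, e.g. '10%' and '90%'.
print(list(summary.index))
```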
A white list of data types to include in the result. Ignored for Series. Here are the options:
A black list of data types to omit from the result. Ignored for Series. Here are the options:
Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.
Summary statistics of the Series or DataFrame provided.
Generate descriptive statistics.
DataFrame.count
Count number of non-NA/null observations.
DataFrame.max
Maximum of the values in the object.
DataFrame.mean
Mean of the values.
DataFrame.min
Minimum of the values in the object.
DataFrame.select_dtypes
Subset of a DataFrame including/excluding columns based on their dtype.
DataFrame.std
Standard deviation of the observations.
Describing a numeric Series.
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
Describing a categorical Series.
>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object
Describing a timestamp Series.
>>> s = pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object
Describing a DataFrame. By default only numeric fields are returned.
>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                    })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Describing all columns of a DataFrame regardless of data type.
>>> df.describe(include='all')  # doctest: +SKIP
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN
Describing a column from a DataFrame by accessing it as an attribute.
>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64
Including only numeric columns in a DataFrame description.
>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Including only string columns in a DataFrame description.
>>> df.describe(include=[object])  # doctest: +SKIP
       object
count       3
unique      3
top         a
freq        1
Including only categorical columns from a DataFrame description.
>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1
Excluding numeric columns from a DataFrame description.
>>> df.describe(exclude=[np.number])  # doctest: +SKIP
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1
Excluding object columns from a DataFrame description.
>>> df.describe(exclude=[object])  # doctest: +SKIP
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0