How-to Guide: Relationships
The class cc_tk.relationship.RelationshipSummary
can be used to quickly evaluate the relationships between features and a target.
The relationships are evaluated through statistical tests. For now, the tests are:
numeric feature, numeric target: Pearson correlation test
numeric feature, categorical target: ANOVA if its assumptions hold, Kruskal-Wallis test otherwise
categorical feature, numeric target: same as the previous case (ANOVA or Kruskal-Wallis)
categorical feature, categorical target: chi-squared test
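The test selection listed above can be sketched with scipy.stats. This is an illustration only, not the cc_tk implementation: the function names (choose_and_run_test, numeric_vs_categorical), the 0.05 threshold, and the use of Shapiro-Wilk and Levene tests to check the ANOVA assumptions are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy import stats


def numeric_vs_categorical(values, groups, alpha=0.05):
    """ANOVA if its assumptions look plausible, Kruskal-Wallis otherwise."""
    samples = [group.to_numpy() for _, group in values.groupby(groups)]
    # Assumed checks (not taken from cc_tk): normality within each group
    # and equal variances across groups.
    normal = all(stats.shapiro(sample)[1] > alpha for sample in samples)
    equal_variance = stats.levene(*samples)[1] > alpha
    if normal and equal_variance:
        statistic, pvalue = stats.f_oneway(*samples)
        return "anova", float(statistic), float(pvalue)
    statistic, pvalue = stats.kruskal(*samples)
    return "kruskal-wallis", float(statistic), float(pvalue)


def choose_and_run_test(feature, target, alpha=0.05):
    """Pick a test from the dtypes of the feature and the target."""
    feature_is_numeric = pd.api.types.is_numeric_dtype(feature)
    target_is_numeric = pd.api.types.is_numeric_dtype(target)
    if feature_is_numeric and target_is_numeric:
        statistic, pvalue = stats.pearsonr(feature, target)
        return "pearson", float(statistic), float(pvalue)
    if feature_is_numeric:
        return numeric_vs_categorical(feature, target, alpha)
    if target_is_numeric:
        return numeric_vs_categorical(target, feature, alpha)
    # Both categorical: chi-squared test on the contingency table.
    chi2, pvalue, _, _ = stats.chi2_contingency(pd.crosstab(feature, target))
    return "chi2", float(chi2), float(pvalue)


# Synthetic data: three groups whose means differ by one standard deviation.
rng = np.random.default_rng(0)
target = pd.Series(np.repeat(["a", "b", "c"], 30))
numeric = pd.Series(rng.normal(size=90)) + target.map({"a": 0.0, "b": 1.0, "c": 2.0})
categorical = pd.Series(np.where(numeric > 1.0, "high", "low"))

results = {
    "numeric/numeric": choose_and_run_test(numeric, numeric * 2 + rng.normal(size=90)),
    "numeric/categorical": choose_and_run_test(numeric, target),
    "categorical/categorical": choose_and_run_test(categorical, target),
}
for case, (test_name, statistic, pvalue) in results.items():
    print(f"{case}: {test_name}, statistic={statistic:.3f}, pvalue={pvalue:.3g}")
```

Note that the target must have a non-numeric dtype to be treated as categorical in this sketch, which mirrors the y.astype(object) conversion used later in this guide.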
The cc_tk.relationship
submodule provides functions that allow you to study the relationship between variables.
It is particularly useful for feature selection when you have a lot of variables and you want to understand which ones are the most statistically significant to discriminate the target variable.
The target variable can be either numeric or categorical.
Overall Summary
from cc_tk.relationship import RelationshipSummary
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True, as_frame=True)
X = X.assign(
sepal_type=(X["sepal length (cm)"] < 6).map({False: "big", True: "small"})
)
With RelationshipSummary, you can build the summary with its build_summary method and/or save it to an Excel file with its to_excel method.
relationship_summary = RelationshipSummary(X, y.astype(object))
# relationship_summary.to_excel("../../data/output/test_relationship.xlsx")
relationship_summary.build_summary();
You can access the overall distributions of the numeric and categorical variables with the numeric_distribution and categorical_distribution attributes of summary_output.
relationship_summary.summary_output.numeric_distribution
 | Variable | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|
0 | sepal length (cm) | 150.0 | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 |
1 | sepal width (cm) | 150.0 | 3.057333 | 0.435866 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 |
2 | petal length (cm) | 150.0 | 3.758000 | 1.765298 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
3 | petal width (cm) | 150.0 | 1.199333 | 0.762238 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 |
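The columns of this table match those produced by the pandas DataFrame.describe() method. As a rough equivalent for a quick check outside of cc_tk, the sketch below builds a table with the same layout. The small DataFrame is made up for illustration, and this is not necessarily how cc_tk computes its output.

```python
import pandas as pd

# Made-up numeric features, for illustration only.
features = pd.DataFrame(
    {
        "length": [4.3, 5.1, 5.8, 6.4, 7.9],
        "width": [2.0, 2.8, 3.0, 3.3, 4.4],
    }
)

# describe() gives one column per variable; transpose it to get
# one row per variable and one column per statistic.
numeric_distribution = (
    features.describe().T.rename_axis("Variable").reset_index()
)
print(numeric_distribution)
```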
You can access the relationship summaries with the numeric_significance and categorical_significance attributes of summary_output.
The following output shows that petal length values are significantly lower in group 0 and significantly higher in group 2, which is confirmed by the distribution by group (the min and max, for example).
relationship_summary.summary_output.numeric_significance.drop(columns=["pvalue", "statistic", "message"])
Variable | Target | influence | significance | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|---|---|
petal length (cm) | 0 | -- | strong | 50.0 | 1.462 | 0.173664 | 1.0 | 1.400 | 1.50 | 1.575 | 1.9 |
petal length (cm) | 1 | | strong | 50.0 | 4.260 | 0.469911 | 3.0 | 4.000 | 4.35 | 4.600 | 5.1 |
petal length (cm) | 2 | ++ | strong | 50.0 | 5.552 | 0.551895 | 4.5 | 5.100 | 5.55 | 5.875 | 6.9 |
petal width (cm) | 0 | -- | strong | 50.0 | 0.246 | 0.105386 | 0.1 | 0.200 | 0.20 | 0.300 | 0.6 |
petal width (cm) | 1 | | strong | 50.0 | 1.326 | 0.197753 | 1.0 | 1.200 | 1.30 | 1.500 | 1.8 |
petal width (cm) | 2 | ++ | strong | 50.0 | 2.026 | 0.274650 | 1.4 | 1.800 | 2.00 | 2.300 | 2.5 |
sepal length (cm) | 0 | -- | strong | 50.0 | 5.006 | 0.352490 | 4.3 | 4.800 | 5.00 | 5.200 | 5.8 |
sepal length (cm) | 1 | | strong | 50.0 | 5.936 | 0.516171 | 4.9 | 5.600 | 5.90 | 6.300 | 7.0 |
sepal length (cm) | 2 | ++ | strong | 50.0 | 6.588 | 0.635880 | 4.9 | 6.225 | 6.50 | 6.900 | 7.9 |
sepal width (cm) | 0 | ++ | strong | 50.0 | 3.428 | 0.379064 | 2.3 | 3.200 | 3.40 | 3.675 | 4.4 |
sepal width (cm) | 1 | -- | strong | 50.0 | 2.770 | 0.313798 | 2.0 | 2.525 | 2.80 | 3.000 | 3.4 |
sepal width (cm) | 2 | | strong | 50.0 | 2.974 | 0.322497 | 2.2 | 2.800 | 3.00 | 3.175 | 3.8 |
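The per-group statistics in this table have the same shape as a grouped describe() in pandas. A minimal sketch on made-up data follows; the column names and values are hypothetical, and the influence and significance columns are specific to cc_tk and are not reproduced here.

```python
import pandas as pd

# Made-up measurements for three target groups, for illustration only.
data = pd.DataFrame(
    {
        "petal_length": [1.4, 1.5, 1.3, 4.5, 4.7, 4.1, 5.9, 5.6, 6.1],
        "target": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    }
)

# One row per target group, one column per statistic.
by_group = data.groupby("target")["petal_length"].describe()
print(by_group[["count", "mean", "min", "max"]])
```

Comparing the min and max of each group, as in the interpretation above, is a quick way to see whether the groups overlap at all.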
The following output shows that big sepals are over-represented in group 2 and under-represented in group 0.
relationship_summary.summary_output.categorical_significance.drop(columns=["pvalue", "statistic", "message"])
Variable | Target | Value | influence | significance | count | proportion |
---|---|---|---|---|---|---|
sepal_type | 0 | big | -- | strong | NaN | NaN |
sepal_type | 0 | small | + | strong | 50.0 | 1.00 |
sepal_type | 1 | big | | strong | 24.0 | 0.48 |
sepal_type | 1 | small | | strong | 26.0 | 0.52 |
sepal_type | 2 | big | ++ | strong | 43.0 | 0.86 |
sepal_type | 2 | small | - | strong | 7.0 | 0.14 |
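Over- and under-representation of this kind can be checked by hand with a contingency table. The sketch below uses made-up counts and computes Pearson residuals, (observed - expected) / sqrt(expected), where a positive value means a category is more frequent in a group than independence would predict. Using residuals is an assumption made for this example; how cc_tk derives its influence column is not documented here.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 10 observations per target group, for illustration only.
data = pd.DataFrame(
    {
        "target": [0] * 10 + [1] * 10 + [2] * 10,
        "sepal_type": ["small"] * 9 + ["big"] * 1
        + ["small"] * 5 + ["big"] * 5
        + ["small"] * 2 + ["big"] * 8,
    }
)

# Counts and within-group proportions of each category.
observed = pd.crosstab(data["target"], data["sepal_type"])
proportions = pd.crosstab(data["target"], data["sepal_type"], normalize="index")

# Chi-squared test of independence, plus the expected counts under independence.
chi2, pvalue, dof, expected = chi2_contingency(observed)

# Pearson residuals: sign gives the direction, magnitude the strength.
residuals = (observed - expected) / np.sqrt(expected)

print(proportions)
print(residuals.round(2))
print(f"chi2={chi2:.2f}, dof={dof}, pvalue={pvalue:.4f}")
```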
Single variable relationship
You may also want to use the underlying functions directly to study the relationship between a single variable and the target variable.
Warning
When you use these functions, make sure you know the type (numeric or categorical) of both the feature variable and the target variable, because each function expects a specific combination.
from cc_tk.relationship import (
significance_numeric_categorical, significance_categorical_categorical
)
# Create a dataframe
X, y = load_iris(return_X_y=True, as_frame=True)
# Artificially create a categorical variable
X["sepal_length_cat"] = (X["sepal length (cm)"] > 5.5).astype(str)
# Study the relationship between specific features and y
significance_sepal_length_num = significance_numeric_categorical(X["sepal length (cm)"], y.astype(object))
significance_sepal_length_cat = significance_categorical_categorical(X["sepal_length_cat"], y.astype(object))
significance_sepal_length_num
ImportError: cannot import name 'significance_numeric_categorical' from 'cc_tk.relationship' (/home/docs/checkouts/readthedocs.org/user_builds/clementcome-toolkit/checkouts/latest/cc_tk/relationship/__init__.py)
Future work
I am planning to add more features to the cc_tk.relationship
submodule.
Already planned features are:
a scikit-learn transformer that will allow you to select the most significant features based on the relationship with the target variable
a parametrization of significance tests to allow the user to choose the most appropriate test for their data