How-to Guide: Relationships

The class cc_tk.relationship.RelationshipSummary can be used to quickly evaluate the relationships between features and a target.

The relationships are evaluated through statistical tests. For now, the tests are (a rough scipy.stats sketch follows the list):

  • numeric feature, numeric target: Pearson correlation test

  • numeric feature, categorical target: ANOVA test if its assumptions are met, Kruskal-Wallis test otherwise

  • the same applies to a categorical feature with a numeric target

  • categorical feature, categorical target: chi-squared test
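
To make the correspondence concrete, here is a minimal sketch of that dispatch written directly with scipy.stats. It only illustrates which tests are involved; it is not cc_tk's internal implementation (the function name significance_pvalue is made up for this example).

import pandas as pd
from scipy import stats

def significance_pvalue(feature: pd.Series, target: pd.Series) -> float:
    """Return a p-value for the relationship between ``feature`` and ``target``."""
    feature_numeric = pd.api.types.is_numeric_dtype(feature)
    target_numeric = pd.api.types.is_numeric_dtype(target)

    if feature_numeric and target_numeric:
        # Numeric feature, numeric target: Pearson correlation test
        _, pvalue = stats.pearsonr(feature, target)
        return pvalue

    if feature_numeric != target_numeric:
        # Numeric vs categorical (either way): compare the numeric values
        # across categories. ANOVA applies when its assumptions hold;
        # Kruskal-Wallis is shown here because it makes no normality assumption.
        numeric, categorical = (feature, target) if feature_numeric else (target, feature)
        groups = [values for _, values in numeric.groupby(categorical)]
        _, pvalue = stats.kruskal(*groups)
        return pvalue

    # Categorical feature, categorical target: chi-squared test of independence
    contingency_table = pd.crosstab(feature, target)
    _, pvalue, _, _ = stats.chi2_contingency(contingency_table)
    return pvalue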

The cc_tk.relationship submodule provides functions for studying the relationships between variables. It is particularly useful for feature selection when you have many variables and want to identify which ones are the most statistically significant for discriminating the target variable. The target variable can be either numeric or categorical.

Overall Summary

from cc_tk.relationship import RelationshipSummary
from sklearn.datasets import load_iris

# Load the iris dataset as pandas objects
X, y = load_iris(return_X_y=True, as_frame=True)
# Add a categorical feature derived from sepal length
X = X.assign(
    sepal_type=(X["sepal length (cm)"] < 6).map({False: "big", True: "small"})
)

When using RelationshipSummary, you can build the summary with its build_summary method and/or save it to an Excel file with the to_excel method.

relationship_summary = RelationshipSummary(X, y.astype(object))
# relationship_summary.to_excel("../../data/output/test_relationship.xlsx")
relationship_summary.build_summary();

You can access the overall distributions of the numeric and categorical variables through the numeric_distribution and categorical_distribution attributes of the summary_output.

relationship_summary.summary_output.numeric_distribution
            Variable  count      mean       std  min  25%   50%  75%  max
0  sepal length (cm)  150.0  5.843333  0.828066  4.3  5.1  5.80  6.4  7.9
1   sepal width (cm)  150.0  3.057333  0.435866  2.0  2.8  3.00  3.3  4.4
2  petal length (cm)  150.0  3.758000  1.765298  1.0  1.6  4.35  5.1  6.9
3   petal width (cm)  150.0  1.199333  0.762238  0.1  0.3  1.30  1.8  2.5
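
The categorical counterpart is available in the same way (only the numeric distributions are shown above):

relationship_summary.summary_output.categorical_distribution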

You can access the relationships summary with the numeric_significance and categorical_significance attributes of the summary_output.

In the following output, we can see that petal length values are significantly lower in group 0 and significantly higher in group 2; this is confirmed by the per-group distributions (look at the min and max, for example).

relationship_summary.summary_output.numeric_significance.drop(columns=["pvalue", "statistic", "message"])
                         influence significance  count   mean       std  min    25%   50%    75%  max
Variable          Target
petal length (cm) 0             --       strong   50.0  1.462  0.173664  1.0  1.400  1.50  1.575  1.9
                  1                      strong   50.0  4.260  0.469911  3.0  4.000  4.35  4.600  5.1
                  2             ++       strong   50.0  5.552  0.551895  4.5  5.100  5.55  5.875  6.9
petal width (cm)  0             --       strong   50.0  0.246  0.105386  0.1  0.200  0.20  0.300  0.6
                  1                      strong   50.0  1.326  0.197753  1.0  1.200  1.30  1.500  1.8
                  2             ++       strong   50.0  2.026  0.274650  1.4  1.800  2.00  2.300  2.5
sepal length (cm) 0             --       strong   50.0  5.006  0.352490  4.3  4.800  5.00  5.200  5.8
                  1                      strong   50.0  5.936  0.516171  4.9  5.600  5.90  6.300  7.0
                  2             ++       strong   50.0  6.588  0.635880  4.9  6.225  6.50  6.900  7.9
sepal width (cm)  0             ++       strong   50.0  3.428  0.379064  2.3  3.200  3.40  3.675  4.4
                  1             --       strong   50.0  2.770  0.313798  2.0  2.525  2.80  3.000  3.4
                  2                      strong   50.0  2.974  0.322497  2.2  2.800  3.00  3.175  3.8
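
Since the significance information is a regular pandas DataFrame (the pvalue, statistic and message columns were only dropped for display), it can also be used programmatically, for example to shortlist features. A small sketch, assuming the first index level is named "Variable" as the output above suggests:

significance = relationship_summary.summary_output.numeric_significance
# Keep the variables with at least one significant relationship to a target group
selected_variables = (
    significance.loc[significance["pvalue"] < 0.05]
    .index.get_level_values("Variable")
    .unique()
)
list(selected_variables)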

In the following output, we see that big sepals are over-represented in group 2 and under-represented in group 0.

relationship_summary.summary_output.categorical_significance.drop(columns=["pvalue", "statistic", "message"])
                        influence significance  count  proportion
Variable   Target Value
sepal_type 0      big          --       strong    NaN         NaN
                  small         +       strong   50.0        1.00
           1      big                   strong   24.0        0.48
                  small                 strong   26.0        0.52
           2      big          ++       strong   43.0        0.86
                  small         -       strong    7.0        0.14
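
The count and proportion columns correspond to a per-target-group contingency table, which you can reproduce directly with pandas to double-check a finding:

import pandas as pd

# Proportion of big/small sepals within each target group
pd.crosstab(X["sepal_type"], y, normalize="columns")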

Single variable relationship

You may also want to use the underlying functions directly to study the relationship between a single variable and the target variable.

Warning

When you use these functions, be aware of the types (numeric or categorical) of both the feature variable and the target variable, so that you call the function that matches them.

from cc_tk.relationship import (
    significance_numeric_categorical, significance_categorical_categorical
)

# Create a dataframe
X, y = load_iris(return_X_y=True, as_frame=True)

# Artificially create a categorical variable
X["sepal_length_cat"] = (X["sepal length (cm)"] > 5.5).astype(str)

# Study the relationship between specific features and y
significance_sepal_length_num = significance_numeric_categorical(X["sepal length (cm)"], y.astype(object))
significance_sepal_length_cat = significance_categorical_categorical(X["sepal_length_cat"], y.astype(object))

significance_sepal_length_num

Future work

I am planning to add more features to the cc_tk.relationship submodule. Already planned features are:

  • a scikit-learn transformer that will allow you to select the most significant features based on their relationship with the target variable (a rough sketch of the idea follows this list)

  • a parametrization of significance tests to allow the user to choose the most appropriate test for their data
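
To make the first item concrete, here is a rough, hypothetical sketch of what such a selector could look like, written with scipy.stats and scikit-learn base classes. The class name, parameters, and behaviour are purely illustrative and are not part of cc_tk today; it only handles numeric features against a categorical target via ANOVA.

import pandas as pd
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin

class SignificanceSelector(BaseEstimator, TransformerMixin):
    """Hypothetical selector: keep features whose ANOVA p-value is below alpha."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha

    def fit(self, X: pd.DataFrame, y: pd.Series):
        pvalues = {}
        for column in X.columns:
            # Compare the values of this feature across the target groups
            groups = [values for _, values in X[column].groupby(y)]
            pvalues[column] = stats.f_oneway(*groups).pvalue
        self.pvalues_ = pd.Series(pvalues)
        self.selected_features_ = list(self.pvalues_[self.pvalues_ < self.alpha].index)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return X[self.selected_features_]

selector = SignificanceSelector(alpha=0.05).fit(X.select_dtypes("number"), y)
X_selected = selector.transform(X.select_dtypes("number"))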