How-to Guide: Relationships

The class cc_tk.relationship.RelationshipSummary can be used to quickly evaluate the relationships between features and a target.

The relationships are evaluated through statistical tests. For now, the tests are (a rough scipy.stats sketch follows the list):

  • numeric feature, numeric target: Pearson correlation test

  • numeric feature, categorical target: ANOVA test if its assumptions are met, Kruskal-Wallis test otherwise

  • the same applies to a categorical feature with a numeric target

  • categorical feature, categorical target: chi-squared test
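
To make the correspondence concrete, here is a minimal sketch of that dispatch written directly with scipy.stats. It only illustrates which tests are involved; it is not cc_tk's internal implementation (the function name significance_pvalue is made up for this example).

import pandas as pd
from scipy import stats

def significance_pvalue(feature: pd.Series, target: pd.Series) -> float:
    """Return a p-value for the relationship between ``feature`` and ``target``."""
    feature_numeric = pd.api.types.is_numeric_dtype(feature)
    target_numeric = pd.api.types.is_numeric_dtype(target)

    if feature_numeric and target_numeric:
        # Numeric feature, numeric target: Pearson correlation test
        _, pvalue = stats.pearsonr(feature, target)
        return pvalue

    if feature_numeric != target_numeric:
        # Numeric vs categorical (either way): compare the numeric values
        # across categories. ANOVA applies when its assumptions hold;
        # Kruskal-Wallis is shown here because it makes no normality assumption.
        numeric, categorical = (feature, target) if feature_numeric else (target, feature)
        groups = [values for _, values in numeric.groupby(categorical)]
        _, pvalue = stats.kruskal(*groups)
        return pvalue

    # Categorical feature, categorical target: chi-squared test of independence
    contingency_table = pd.crosstab(feature, target)
    _, pvalue, _, _ = stats.chi2_contingency(contingency_table)
    return pvalue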

The cc_tk.relationship submodule provides functions for studying the relationships between variables. It is particularly useful for feature selection when you have many variables and want to identify which ones are the most statistically significant for discriminating the target variable. The target variable can be either numeric or categorical.

Overall Summary

from cc_tk.relationship import RelationshipSummary
from sklearn.datasets import load_iris

# Load the iris dataset as pandas objects
X, y = load_iris(return_X_y=True, as_frame=True)
# Add a categorical feature derived from sepal length
X = X.assign(
    sepal_type=(X["sepal length (cm)"] < 6).map({False: "big", True: "small"})
)

When using RelationshipSummary, you can build the summary with its build_summary method and/or save it to an Excel file with the to_excel method.

relationship_summary = RelationshipSummary(X, y.astype(object))
# relationship_summary.to_excel("../../data/output/test_relationship.xlsx")
relationship_summary.build_summary();

You can access the overall distributions of the numeric and categorical variables through the numeric_distribution and categorical_distribution attributes of the summary_output.

relationship_summary.summary_output.numeric_distribution
            Variable  count      mean       std  min  25%   50%  75%  max
0  sepal length (cm)  150.0  5.843333  0.828066  4.3  5.1  5.80  6.4  7.9
1   sepal width (cm)  150.0  3.057333  0.435866  2.0  2.8  3.00  3.3  4.4
2  petal length (cm)  150.0  3.758000  1.765298  1.0  1.6  4.35  5.1  6.9
3   petal width (cm)  150.0  1.199333  0.762238  0.1  0.3  1.30  1.8  2.5
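
The categorical counterpart is available in the same way (only the numeric distributions are shown above):

relationship_summary.summary_output.categorical_distribution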

You can access the relationships summary with the numeric_significance and categorical_significance attributes of the summary_output.

In the following output, we can see that petal length values are significantly lower in group 0 and significantly higher in group 2; this is confirmed by the per-group distributions (look at the min and max, for example).

relationship_summary.summary_output.numeric_significance.drop(columns=["pvalue", "statistic", "message"])
                         influence significance  count   mean       std  min    25%   50%    75%  max
Variable          Target
petal length (cm) 0             --       strong   50.0  1.462  0.173664  1.0  1.400  1.50  1.575  1.9
                  1                      strong   50.0  4.260  0.469911  3.0  4.000  4.35  4.600  5.1
                  2             ++       strong   50.0  5.552  0.551895  4.5  5.100  5.55  5.875  6.9
petal width (cm)  0             --       strong   50.0  0.246  0.105386  0.1  0.200  0.20  0.300  0.6
                  1                      strong   50.0  1.326  0.197753  1.0  1.200  1.30  1.500  1.8
                  2             ++       strong   50.0  2.026  0.274650  1.4  1.800  2.00  2.300  2.5
sepal length (cm) 0             --       strong   50.0  5.006  0.352490  4.3  4.800  5.00  5.200  5.8
                  1                      strong   50.0  5.936  0.516171  4.9  5.600  5.90  6.300  7.0
                  2             ++       strong   50.0  6.588  0.635880  4.9  6.225  6.50  6.900  7.9
sepal width (cm)  0             ++       strong   50.0  3.428  0.379064  2.3  3.200  3.40  3.675  4.4
                  1             --       strong   50.0  2.770  0.313798  2.0  2.525  2.80  3.000  3.4
                  2                      strong   50.0  2.974  0.322497  2.2  2.800  3.00  3.175  3.8
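
Since the significance information is a regular pandas DataFrame (the pvalue, statistic and message columns were only dropped for display), it can also be used programmatically, for example to shortlist features. A small sketch, assuming the first index level is named "Variable" as the output above suggests:

significance = relationship_summary.summary_output.numeric_significance
# Keep the variables with at least one significant relationship to a target group
selected_variables = (
    significance.loc[significance["pvalue"] < 0.05]
    .index.get_level_values("Variable")
    .unique()
)
list(selected_variables)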

In the following output, we see that big sepals are over-represented in group 2 and under-represented in group 0.

relationship_summary.summary_output.categorical_significance.drop(columns=["pvalue", "statistic", "message"])
                        influence significance  count  proportion
Variable   Target Value
sepal_type 0      big          --       strong    NaN         NaN
                  small         +       strong   50.0        1.00
           1      big                   strong   24.0        0.48
                  small                 strong   26.0        0.52
           2      big          ++       strong   43.0        0.86
                  small         -       strong    7.0        0.14
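
The count and proportion columns correspond to a per-target-group contingency table, which you can reproduce directly with pandas to double-check a finding:

import pandas as pd

# Proportion of big/small sepals within each target group
pd.crosstab(X["sepal_type"], y, normalize="columns")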

Single variable relationship

You may also want to use the underlying functions directly to study the relationship between a single variable and the target variable.

Warning

When you use these functions, be aware of the types (numeric or categorical) of both the feature variable and the target variable, so that you call the function that matches them.

from cc_tk.relationship import (
    significance_numeric_categorical, significance_categorical_categorical
)

# Create a dataframe
X, y = load_iris(return_X_y=True, as_frame=True)

# Artificially create a categorical variable
X["sepal_length_cat"] = (X["sepal length (cm)"] > 5.5).astype(str)

# Study the relationship between specific features and y
significance_sepal_length_num = significance_numeric_categorical(X["sepal length (cm)"], y.astype(object))
significance_sepal_length_cat = significance_categorical_categorical(X["sepal_length_cat"], y.astype(object))

significance_sepal_length_num

Future work

I am planning to add more features to the cc_tk.relationship submodule. Already planned features are:

  • a scikit-learn transformer that will allow you to select the most significant features based on their relationship with the target variable (a rough sketch of the idea follows this list)

  • a parametrization of significance tests to allow the user to choose the most appropriate test for their data
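
To make the first item concrete, here is a rough, hypothetical sketch of what such a selector could look like, written with scipy.stats and scikit-learn base classes. The class name, parameters, and behaviour are purely illustrative and are not part of cc_tk today; it only handles numeric features against a categorical target via ANOVA.

import pandas as pd
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin

class SignificanceSelector(BaseEstimator, TransformerMixin):
    """Hypothetical selector: keep features whose ANOVA p-value is below alpha."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha

    def fit(self, X: pd.DataFrame, y: pd.Series):
        pvalues = {}
        for column in X.columns:
            # Compare the values of this feature across the target groups
            groups = [values for _, values in X[column].groupby(y)]
            pvalues[column] = stats.f_oneway(*groups).pvalue
        self.pvalues_ = pd.Series(pvalues)
        self.selected_features_ = list(self.pvalues_[self.pvalues_ < self.alpha].index)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return X[self.selected_features_]

selector = SignificanceSelector(alpha=0.05).fit(X.select_dtypes("number"), y)
X_selected = selector.transform(X.select_dtypes("number"))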