There has been a considerable effort aimed at the regulation of AI systems. In the European Union, DG Connect has spearheaded the AI Act, but its implementation is yet to be developed within delegated acts, implementing acts, standards, guidelines, and the like. In the US, there are the Executive Order and the management memo, both very high-level, accompanied by a variety of documents issued by NIST. In both cases, non-trivial challenges face any implementation.
Broadly speaking, the challenges stem from our lack of understanding of artificial intelligence in general, and of those subfields of recent interest in particular, such as the deep neural networks underlying general-purpose AI systems. Consider, for example, the issues of accuracy and robustness of such systems, which are featured in Articles (74)-(75) of the AI Act. Even measuring the robustness of a general-purpose AI system is an open research question. The development of guarantees of accuracy and robustness when training such systems is a wide-open research field. Indeed, a popular witticism suggests that artificial intelligence comprises precisely those algorithms we do not yet understand.
Consequently, there is a gap between real-world AI systems and their legal treatment, including in the AI Act. Important questions, such as which attributes are considered sensitive, how intersectional (subgroup) fairness is handled, or how fairness and non-discrimination are to be handled in general, are left to the providers and deployers of AI systems. In particular, it is widely recognized in the literature at the intersection of law and algorithms that intersectional discrimination is rather poorly handled in the case law of the EU Court (Algorithmic discrimination in Europe; Bias Preservation in Machine Learning: The Legality of Fairness Metrics Under EU Non-Discrimination Law; Multi-dimensional Discrimination in Law and Machine Learning – A Comparative Overview). Further, there is considerable ambiguity in EU Court cases concerning proxy discrimination (Algorithmic discrimination in Europe; Bias Preservation in Machine Learning: The Legality of Fairness Metrics Under EU Non-Discrimination Law). Another challenge is how to appropriately map the social context of specific AI application use cases into specific instantiations of specific fairness definitions (Bias Preservation in Machine Learning: The Legality of Fairness Metrics Under EU Non-Discrimination Law; Legal perspective on possible fairness measures – A legal discussion using the example of hiring decisions). This challenge is directly related to the dilemma, or trade-off, between equal opportunity and equal outcome, or, in other words, between selecting either bias-preserving or bias-transforming definitions (Bias Preservation in Machine Learning: The Legality of Fairness Metrics Under EU Non-Discrimination Law) for bias auditing and correction.
Here, we survey several challenges in the regulation of AI, some of which may be less than obvious at first. We also suggest some provisional answers developed within the project Human-Compatible AI with Guarantees, a project funded by DG Connect within Human-01-01. The opinions expressed are those of the author(s) only and should not be considered as representative of the European Commission’s official position.
How to deal with the diversity of applications of AI?
A key feature of AI is the diversity of its applications, compared with the limited number of problems considered within traditionally regulated areas, such as consumer finance. Despite this diversity, most research on the regulation of AI (and on AI fairness) had, until late 2023, focussed on classification problems. Now, there is also the priority of considering fairness in generative AI.
Indeed, it is realistic to assume that classification and generative AI will account for only a small proportion of the high-risk AI systems available in Europe in the near future. Notably:
● fairness in rankings: many HR applications are closely related to ranking, e.g., short-listing candidates for a job, or presenting a selection of the most suitable jobs to a candidate. Research on fairness in rankings (e.g., https://arxiv.org/abs/2403.19419) is underdeveloped; a toy sketch of one possible exposure-based measure follows this list.
● fairness in recommendations: many social-media recommenders are closely related to ranking. Again, research on fairness in multi-stakeholder recommender systems and related approaches (e.g., https://doi.org/10.1007/s00778-021-00697-y) is underdeveloped.
● spatial fairness: bias in financial services has long been associated with discrimination based on spatial (i.e. location) considerations (e.g. red-lining in mortgage markets in the USA). Testing bias in such spatial problems is again underdeveloped (cf. https://arxiv.org/abs/2302.12333).
● fairness in forecasting: bias in forecasting problems has sometimes been incorrectly quantified using measures of classification bias. For example, in the case of the famous COMPAS benchmark, where one may wish to estimate the time to re-offending, many researchers consider the binary property of re-offending within the next two years, thereby recasting the problem as one of binary classification. However, we have shown (https://www.jair.org/index.php/jair/article/view/14050) that when one minimizes appropriate measures of bias in forecasting, one can obtain predictive performance that improves on those approaches.
● fairness in the sharing economy: in a number of papers, we have shown that the sharing economy with its repeated interactions with an AI system by multiple populations (e.g. drivers and ride-seeking customers), requires different definitions of fairness. See “How to reason about repeated uses of an AI system?” below.
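To make the ranking case concrete, the sketch below computes one possible exposure-based disparity measure for a short-list. The position discount (1/log2(position+1)) and the two-group comparison are our illustrative assumptions, not the measure from the cited work.

```python
# Sketch: one possible exposure-based bias measure for a ranking.
# Assumptions (ours, for illustration): exposure decays as 1/log2(position+1),
# and bias is the gap in mean exposure between two groups of candidates.
import math

def exposure(position: int) -> float:
    """Position-discounted exposure; position is 1-based."""
    return 1.0 / math.log2(position + 1)

def group_exposure_gap(ranking: list[str], groups: dict[str, str]) -> float:
    """Difference in average exposure between the two groups in `groups`."""
    totals, counts = {}, {}
    for pos, candidate in enumerate(ranking, start=1):
        g = groups[candidate]
        totals[g] = totals.get(g, 0.0) + exposure(pos)
        counts[g] = counts.get(g, 0) + 1
    means = {g: totals[g] / counts[g] for g in totals}
    a, b = sorted(means)
    return means[a] - means[b]

# Toy example: a short-list of candidates and their (binary) group labels.
ranking = ["ann", "bob", "cara", "dan", "eve", "fred"]
groups = {"ann": "A", "bob": "B", "cara": "A", "dan": "B", "eve": "A", "fred": "B"}
print(f"group exposure gap: {group_exposure_gap(ranking, groups):+.3f}")
```

Analogous sketches could be written for the other settings above, but the appropriate discounting, grouping and aggregation choices are exactly what remains underdeveloped.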
As Jensen Huang has recently pointed out (GTC 2024), there is a great number of applications of AI waiting to be explored (“Optimization AI”), from both the “how to do this” and “how to measure the fairness” points of view.
How to set up the compliance assessment?
In computational finance, practitioners have long relied on model validation as a key component of model risk management, considering the underlying assumptions and the suitability for the intended use case. This involves, among other things, checks of conceptual soundness and outcome analysis. For the relatively simple and (by regulatory fiat) explainable models used in risk assessment, it may be feasible to check conceptual soundness in terms of data quality and suitability, variable selection and model interpretability, as well as benchmarking against alternative methodologies. Likewise, outcome analysis, including tests of robustness to data shift, noise, or corrupted data, would be desirable, but seems essentially impossible for today’s complex AI systems.
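For a simple, fully specified model, such an outcome-analysis check is easy to sketch, which only underlines the gap: nothing comparable is currently available for large general-purpose systems. A minimal illustration, with a hypothetical model, synthetic data and arbitrary noise levels:

```python
# Sketch: a toy "outcome analysis" check of robustness to input noise. For a
# small, fully specified model this is straightforward; the difficulty noted
# in the text is doing anything comparable for large general-purpose systems.
# Model, data and noise levels are all illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
X = rng.normal(size=(5_000, 5))
y = (X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(scale=0.5, size=5_000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
clean_acc = model.score(X_test, y_test)

for noise in (0.1, 0.5, 1.0):  # increasing input corruption
    noisy_acc = model.score(X_test + rng.normal(scale=noise, size=X_test.shape), y_test)
    print(f"noise sd {noise:.1f}: accuracy drops from {clean_acc:.3f} to {noisy_acc:.3f}")
```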
Currently proposed standards, such as IEEE 7000™-2021 – Model Process for Addressing Ethical Concerns During System Design, provide high-level guidance rather than assurance.
How to pick the number of protected attributes?
Within many approaches to fairness, such as those addressed in the California Consumer Privacy Act (CCPA), there are dozens of protected attributes (PAs) listed, and it may seem desirable to add ever more PAs to the list. However, there are significant challenges associated with larger numbers of PAs. Three related challenges stand out:
● sample complexity: the more PAs there are, the more samples one needs to collect to estimate the bias within a given error with a given probability (the setting known as probably approximately correct, or PAC, learning). For every – even binary – PA that one includes in the model, one splits the available training data into ever smaller groups for learning the respective PA-conditional models (see the sketch after this list). This dilution problem is a fundamental consequence of the curse of dimensionality, and engenders under-representation biases in learning and fairness-repair algorithms. The right trade-off between model expressivity (by increasing the number of PAs) and sufficient learning (by capping the number of PAs) needs to be addressed.
● nonstationarity: the problem of dilution is particularly acute when training sets are nonstationary. Effectively, the pseudo-stationary window lengths available for tracking changes in the model state (here represented by the PAs) limit the amount of dynamic training that can occur. In such cases – which arise whenever dynamical learning is taking place with longitudinal and time-series data, for instance – it is again imperative to choose the size of the state set (here, the number of PAs) as a function of these pseudo-stationary lengths.
● pairwise distances: key results in high-dimensional probability (https://bib.dbvis.de/uploadedFiles/155.pdf) show that, as the dimension grows, a pair of points sampled uniformly at random from a fixed finite region, such as a unit ball, will have their Euclidean distance ever more concentrated around its mean, undermining distance-based notions of similarity between individuals.
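To make the dilution concrete, here is a minimal simulation with illustrative numbers (10,000 records and independent binary PAs) of how quickly jointly-conditioned subgroups shrink as PAs are added:

```python
# Sketch: how conditioning on ever more (here, binary and independent) PAs
# dilutes the data available per subgroup. Purely illustrative assumptions:
# 10,000 records, each binary PA splits the data roughly in half.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_records, max_pas = 10_000, 8
pas = rng.integers(0, 2, size=(n_records, max_pas))  # synthetic binary PAs

for k in range(1, max_pas + 1):
    # Count records in each of the 2**k jointly-conditioned subgroups.
    subgroup_sizes = [
        np.sum(np.all(pas[:, :k] == combo, axis=1))
        for combo in itertools.product([0, 1], repeat=k)
    ]
    print(f"{k} PAs -> {2**k:4d} subgroups, "
          f"smallest subgroup has {min(subgroup_sizes)} records")
```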
These concerns can be alleviated if it is made clear that some PAs should be considered only marginally. So, for example, gender-based groups and religion-based groups might be modelled and repaired separately, but the jointly-conditioned gender-religion sub-groups – whose number grows exponentially with the number of PAs – might not be. Also, more attention to non-standard distances is required, in relation to the dimension of the data space.
Of course, data attributes can be protected (PAs, such as those mentioned earlier in this document) or not (unprotected attributes, UAs, such as a socio-economic label), with the latter being openly measured, recorded and processed into decisions by AI engines, while the former may not be (this is discussed further in Section 6). If unfairness is not defined conditionally with respect to the UAs, it can remain undetected and unrepaired in a context where such engines legally discriminate on the basis of those attributes that are not protected. In our work (https://arxiv.org/abs/2403.13864), we have used this notion of conditional independence to distinguish between what we call AI unfairness and societal unfairness, and we have designed conditional repair schemes based on detection of the former. A deeper appreciation of the consequences of conditional (un)fairness is now required, not least in terms of the dilution problems discussed earlier in this section.
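The distinction can be illustrated with a minimal sketch of a UA-conditional check of statistical parity. This is not the repair scheme of the cited paper; the data, the hypothetical socio-economic strata and the decision rule are synthetic assumptions, chosen so that the marginal gap across the PA stems entirely from the PA/UA correlation.

```python
# Sketch: marginal vs. UA-conditional statistical parity on synthetic data.
# Decisions depend on the UA only, so any marginal PA gap is inherited from
# the PA/UA correlation rather than from the decision rule itself.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000
ua = rng.integers(0, 3, n)                          # unprotected attribute (3 strata)
pa = (rng.random(n) < 0.3 + 0.2 * ua).astype(int)   # PA correlated with the UA
decision = (rng.random(n) < 0.2 + 0.2 * ua).astype(int)
df = pd.DataFrame({"pa": pa, "ua": ua, "decision": decision})

marginal_gap = df.groupby("pa")["decision"].mean().diff().iloc[-1]
conditional_gaps = (
    df.groupby(["ua", "pa"])["decision"].mean().unstack("pa").diff(axis=1).iloc[:, -1]
)
print(f"marginal acceptance-rate gap: {marginal_gap:+.3f}")
print("UA-conditional gaps (close to zero in this construction):")
print(conditional_gaps.round(3))
```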
A question related to the number of protected attributes is the encoding of the chosen protected attributes (such as one-hot encoding, which uses dummy variables to represent a categorical variable). Indeed, the more attributes one considers, the more “gaps” one may have (such as all-zero blocks in the one-hot encoding, or “null values” due to users opting for “I prefer not to say”). While it is known that cosine distances may alleviate some of the issues in dealing with sparsity, when there are many gaps, even those will be based only on a small subset of the most diligent users (i.e., those not opting for “I prefer not to say” even once).
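A minimal illustration of the encoding issue, under our own assumption that a non-response is encoded as an all-zero block:

```python
# Sketch: cosine similarity with "I prefer not to say" gaps. Assumption (ours,
# for illustration): each attribute is one-hot encoded and a non-response is an
# all-zero block, so users with many gaps contribute little or no signal.
import numpy as np

def one_hot(value: int | None, n_categories: int) -> np.ndarray:
    """One-hot encode a categorical answer; None ('prefer not to say') -> zeros."""
    v = np.zeros(n_categories)
    if value is not None:
        v[value] = 1.0
    return v

def encode(user: list[int | None], sizes: list[int]) -> np.ndarray:
    return np.concatenate([one_hot(a, s) for a, s in zip(user, sizes)])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else float("nan")

sizes = [3, 4, 2]                               # hypothetical attribute cardinalities
diligent  = encode([0, 2, 1], sizes)            # answered everything
reluctant = encode([0, None, None], sizes)      # mostly "prefer not to say"
silent    = encode([None, None, None], sizes)   # no answers at all

print("diligent vs reluctant:", cosine(diligent, reluctant))  # based on one attribute only
print("diligent vs silent:   ", cosine(diligent, silent))     # undefined (nan)
```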
How to choose the measure of data quality and how to assess the uncertainty in fairness proxies?
In estimating the quality of training data and the bias in an AI system, one needs to consider the distance or divergence between the distribution given by the training data and the distribution in the general population, or some other applicable population. There are many options: dozens of divergences, the total variation distance, the Wasserstein-1 and Wasserstein-2 distances, etc. In order to avoid “regulatory arbitrage”, it would be desirable to focus on one measure, but its choice is non-trivial because:
● measures have widely varying sample complexity, i.e., the number of samples required to estimate the bias within a given error with a given probability (again, the setting known as probably approximately correct, or PAC, learning).
● some of the measures are inapproximable (e.g., Wasserstein-1, cf. Theorem 5 in https://arxiv.org/abs/2301.06148), while for others, their approximability status is unclear.
● the consistent distance or divergence is often dictated by the model or the optimization task. For instance, if the data are known to be normally distributed, the Euclidean norm is induced in the data space. If data are to be repaired via an unregularized optimal transport (OT) plan or map, the Wasserstein distance is induced, but it is the Kullback-Leibler divergence that is induced if the plan is entropy-regularized.
If a regulation requires the computation of a measure that is known not to be computable within an epsilon error, then this is potentially a major concern.
It appears that variants of the L0-norm require the least amount of data for computation to a target accuracy. Previous research of the consortium members (https://arxiv.org/abs/1802.03649) has shown that the L0-norm allows for efficient sub-sampling with probably approximately correct (i.e. PAC) bounds. Plausibly, such results could be extended here to procedures for measuring data quality.
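The dependence on both the choice of measure and the sample size can be illustrated on a toy one-dimensional example. The histogram-based total-variation estimator and its bin count below are our own assumptions, not a recommended procedure.

```python
# Sketch: how the estimate of a data-quality measure depends on the choice of
# distance and on the number of samples. Toy 1-D example; the histogram-based
# TV estimator and its bin count are illustrative assumptions.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
population = rng.normal(0.0, 1.0, 200_000)       # "applicable population"
training   = rng.normal(0.3, 1.0, 200_000)       # slightly shifted training data

def tv_hist(x, y, bins=50, lo=-5, hi=5):
    """Crude total-variation estimate from common-grid histograms."""
    px, _ = np.histogram(x, bins=bins, range=(lo, hi))
    py, _ = np.histogram(y, bins=bins, range=(lo, hi))
    return 0.5 * np.abs(px / len(x) - py / len(y)).sum()

for n in (100, 1_000, 10_000, 100_000):
    xs, ys = rng.choice(population, n), rng.choice(training, n)
    print(f"n={n:>7}: W1 ~ {wasserstein_distance(xs, ys):.3f}, "
          f"TV(hist) ~ {tv_hist(xs, ys):.3f}")
```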
A related but separate issue concerns the fairness proxies – such as disparate impact, statistical parity, etc. – that are widely adopted as summaries of bias in training data. In reality, these act merely as certainty equivalents, and do not capture the uncertainty in the data models themselves. Progress is therefore needed in eliciting the distribution of such proxies. Such distributions would provide users with uncertainty quantification for these proxies. Furthermore, they can contribute to ensuring that fairness-repair algorithms are robust to model uncertainties.
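As a minimal illustration of the kind of uncertainty quantification we have in mind, one can bootstrap the disparate-impact ratio rather than report it as a single number. The synthetic data and the plain nonparametric bootstrap are illustrative assumptions.

```python
# Sketch: a bootstrap distribution for the disparate-impact ratio, instead of
# reporting it as a single certainty-equivalent number. Synthetic data and the
# plain nonparametric bootstrap are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
pa = rng.integers(0, 2, n)                                   # protected attribute
favourable = (rng.random(n) < np.where(pa == 1, 0.55, 0.65)).astype(int)

def disparate_impact(pa, y):
    return y[pa == 1].mean() / y[pa == 0].mean()

point_estimate = disparate_impact(pa, favourable)
boot = np.empty(2_000)
for b in range(boot.size):
    idx = rng.integers(0, n, n)                              # resample with replacement
    boot[b] = disparate_impact(pa[idx], favourable[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"disparate impact: {point_estimate:.3f}, "
      f"95% bootstrap interval [{lo:.3f}, {hi:.3f}]")
```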
Another complication in the estimation of data quality and bias in AI systems is that, in order to obtain absolute values, one strictly needs the protected attributes. If the regulation were to suggest an absolute threshold, such as 42, one would need to set up a regulatory sandbox, collect the protected attributes, and evaluate the data quality and the bias in the AI system. While we have developed methods (https://arxiv.org/abs/2310.11407, https://arxiv.org/abs/2403.13864) that can improve data quality when one has access to the protected attributes, we have yet to develop methods that could estimate the bias without recording the protected attributes. The associated problems in robust statistics are non-trivial.
How to improve data quality without recording the protected attributes (PAs)?
An obvious complication in the training of many AI systems is the quality of the training data. In many cases, the training data that have been collected are biased (e.g., towards the more active users: many businesses see the number of transactions grow with household wealth; many health insurers see more cases from longer-living, often female, patients; etc.). As already noted in Section 3, the training data often do not include protected attributes (PAs), considering that – in many countries – it has been illegal to measure and record PAs such as race, religion and sexual orientation. While the AI Act’s provisions for regulatory sandboxes make it possible to collect such data in the future, it seems obvious that businesses would like to keep the recording of protected attributes to a minimum. For ranking (https://arxiv.org/abs/2403.19419), and for classification and regression (https://arxiv.org/abs/2310.11407), we have developed methods that can repair data for fairness without having access to the PAs. Further refinements of these methods are warranted, and they need to be applied in other fairness-aware scenarios.
Is there “robustness and accuracy” beyond a best-effort basis?
Many recent proposals as to the regulation of AI suggest that developers and deployers of AI systems should assure their robustness. Inherently, however, many AI techniques are not robust.
Specifically, for deep neural networks, which underlie much of recent work on AI including all of generative AI, there are:
● no guarantees on the training of the system. It may well be that there are weights of the neural network that are arbitrarily better than those found by the currently used methods.
● no widely agreed-upon measures of robustness. While some researchers consider, e.g., the Lipschitz modulus of continuity, it is often not directly applicable (e.g., to samples that are not from a normed space, such as graphs or tree-structured data), nor is it computable for large-enough networks (a simple empirical probe is sketched below).
● no guarantees on the robustness, other than empirical tests (https://github.com/Trusted-AI/adversarial-robustness-toolbox) against a collection of known attacks.
Measures of robustness that would be truly useful, that is, applicable to a wide variety of data and easy to compute or to bound from above, are missing, but are being developed by the consortium.
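By way of illustration, the best one can typically do today is an empirical probe, such as a random-perturbation lower bound on the local Lipschitz modulus around a given input. The two-layer stand-in model and the perturbation budget below are assumptions, and such a probe can only ever under-estimate the true sensitivity.

```python
# Sketch: an empirical lower bound on a model's local Lipschitz modulus around
# one input, via random perturbations. This is a best-effort probe, not a
# guarantee. The model here is a stand-in two-layer network with random weights.
import numpy as np

rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(32, 10)), rng.normal(size=32)
W2, b2 = rng.normal(size=(1, 32)), rng.normal(size=1)

def model(x: np.ndarray) -> np.ndarray:
    """Placeholder network: ReLU layer followed by a linear read-out."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def empirical_lipschitz(x: np.ndarray, eps: float = 0.1, trials: int = 10_000) -> float:
    """Max over sampled perturbations of |f(x+d) - f(x)| / |d| with |d| <= eps."""
    fx, best = model(x), 0.0
    for _ in range(trials):
        d = rng.normal(size=x.shape)
        d *= eps * rng.random() / np.linalg.norm(d)
        ratio = np.linalg.norm(model(x + d) - fx) / np.linalg.norm(d)
        best = max(best, ratio)
    return best

x0 = rng.normal(size=10)
print(f"empirical local Lipschitz lower bound at x0: {empirical_lipschitz(x0):.2f}")
```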
How to test individual interactions with AI systems?
Even when one resolves the dilemma of selecting the most appropriate fairness measure for a use case, challenges remain. Indeed, we believe that counterfactual fairness is a promising definition, as also proposed by the recent work of other teams (Legal perspective on possible fairness measures – A legal discussion using the example of hiring decisions), since it allows one to properly incorporate social context, summary statistics and domain knowledge, and to effectively separate the biased and non-biased contributions of the examined attributes to the model’s decision, thus handling proxy discrimination. Notice, however, what the implementation of such a fairness measure entails.
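A crude attribute-flip check, sketched below, hints at why implementation is non-trivial: flipping the PA while holding all other features fixed is only a first approximation of counterfactual fairness, because a genuine counterfactual would also have to propagate the change through proxies of the PA via a causal model. The logistic model, features and data here are hypothetical.

```python
# Sketch: a crude attribute-flip check. Genuine counterfactual fairness needs a
# causal model of how the PA influences the other features; flipping the PA
# while holding everything else fixed, as below, is only a first approximation
# (and a poor one when proxies of the PA remain among the features).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 10_000
pa = rng.integers(0, 2, n)
skill = rng.normal(size=n)
proxy = 0.8 * pa + rng.normal(scale=0.5, size=n)   # a feature correlated with the PA
X = np.column_stack([pa, skill, proxy])
y = (skill + 0.5 * pa + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

X_flipped = X.copy()
X_flipped[:, 0] = 1 - X_flipped[:, 0]              # flip the PA, leave the proxy untouched
flip_rate = np.mean(clf.predict(X) != clf.predict(X_flipped))
print(f"fraction of decisions that change when only the PA is flipped: {flip_rate:.3f}")
```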
AI systems are increasingly complex. In traditional approaches to testing, the higher the dimension of the inputs and the more parameters trained on those inputs, the more test cases are required in order to achieve complete coverage. Indeed, without further assumptions, the number of tests grows exponentially with the number of inputs. At the same time, extensive evaluation of the effects of AI systems may be unacceptable in some high-risk contexts. This is complicated further by the fact that the evaluation of a decision may often be undertaken only with some delay.
Consider, for example, the use cases in social care. In the infamous case of the toeslagenaffaire in the Netherlands, poor decision making by human social-programme administrators aided by traditional statistics led to the removal of children from the care of their parents on the grounds of benefit-fraud accusations. Imagine the decision making were to be made by an AI system. Clearly, the system would have to be tested, but it seems hard to imagine that outcomes such as unnecessary removal from parental care would be allowed in the testing of an AI system in order to establish its false positive rate. Notice that in this particular case, the AI system would also not be documented, due to the exception for fraud detection from the requirement to document AI systems, and the undesirable nature of some of the decisions would not be apparent immediately, but would be made clear only in court cases spanning multiple years. At the same time, while it may seem tempting to prohibit the use of AI in such cases, this may be a step too far.
Within the consortium, researchers including Shie Mannor at Technion have developed methods for studying fairness within sequential decision making (so-called bandit problems) with known covariance structures in some of the observations (cf. https://arxiv.org/abs/1106.2436). Sequential decision making was originally developed half a century ago with clinical trials of new compounds in mind. In such clinical trials, some variants of the compound may prove deadly, creating the need to stop the trial early. It seems that at least some applications of AI may come with similarly high risks.
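In that spirit, the following is a minimal sketch of a sequential trial with an early-stopping rule. It is not the covariance-structured method of the cited paper; the harm rates, the round-robin allocation and the crude confidence bounds are illustrative assumptions.

```python
# Sketch: sequential testing with an early-stopping rule, in the spirit of
# clinical-trial bandits. Two "treatments" with unknown harm rates; we stop
# allocating to an arm once a crude confidence bound says it is markedly worse.
import numpy as np

rng = np.random.default_rng(6)
true_harm = np.array([0.05, 0.20])          # hypothetical harm probabilities per arm
pulls, harms = np.zeros(2), np.zeros(2)
active = [0, 1]

for t in range(1, 2_001):
    arm = active[t % len(active)]           # round-robin over arms still in the trial
    pulls[arm] += 1
    harms[arm] += rng.random() < true_harm[arm]
    if len(active) > 1 and min(pulls) >= 30:
        rate = harms / pulls
        half_width = 2.0 * np.sqrt(rate * (1 - rate) / pulls + 1e-9)
        worst, best = np.argmax(rate), np.argmin(rate)
        if rate[worst] - half_width[worst] > rate[best] + half_width[best]:
            active = [int(best)]            # stop exposing subjects to the worse arm
            print(f"stopped arm {worst} after {int(pulls.sum())} allocations")

print("allocations per arm:", pulls.astype(int),
      "observed harm rates:", (harms / np.maximum(pulls, 1)).round(3))
```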
Notice also that it seems realistic to assume that classification and generative AI will be only a small proportion of the high-risk AI systems available in Europe, as highlighted above. It will hence be necessary to consider other (e.g. spatial) aspects of fairness, where the definitions of bias are yet to be developed.
How to reason about repeated uses of an AI system?
AI systems are also increasingly personalized. Owing to the personalization of – for example – recommenders within Netflix or YouTube, the behaviour of such systems adapts to the behaviour of the user. This, however, introduces an extra complication into the testing, as one wants to test with respect to the trajectory of interactions between the AI system and the user. This, once again, introduces a combinatorial explosion, even for trajectories of interactions of finite length. It renders the testing even more complicated, particularly since one wants to avoid imposing a finite limit on the duration of the user’s interactions with the AI system.
In this context, we have developed a framework (https://arxiv.org/abs/2209.01410) that makes it possible to establish certain basic predictability and fairness properties. In particular, we wish to establish that the user-system behaviour is predictable. Considering a closed-loop model of the AI system and its interactions with the users (i.e., the personalization dynamics), together with a probabilistic model of the users’ responses, one can define predictability as the existence of a unique invariant measure. In this way, many notions of fairness can be redefined via long-run properties whose meaning is well defined thanks to the existence of the unique invariant measure.
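A toy closed-loop simulation illustrates the idea. The recommender policy and the users’ response probabilities below are illustrative assumptions, not the framework of the cited paper.

```python
# Sketch: a toy closed-loop model of personalization. The user's "state" is a
# preference category; the recommender adapts to the last observed state and
# the user responds stochastically. Predictability in the sense above would
# correspond to this chain having a unique invariant distribution, which we
# estimate empirically from two different initial states.
import numpy as np

rng = np.random.default_rng(7)
n_states = 3
# response[r][s] = probability the user moves to state s when recommended r
response = np.array([[0.7, 0.2, 0.1],
                     [0.2, 0.6, 0.2],
                     [0.1, 0.3, 0.6]])

def recommend(state: int) -> int:
    """A simple adaptive policy: mostly recommend the user's current category."""
    return state if rng.random() < 0.8 else int(rng.integers(n_states))

def long_run_occupancy(initial_state: int, steps: int = 100_000) -> np.ndarray:
    state, counts = initial_state, np.zeros(n_states)
    for _ in range(steps):
        state = rng.choice(n_states, p=response[recommend(state)])
        counts[state] += 1
    return counts / steps

print("occupancy starting from state 0:", long_run_occupancy(0).round(3))
print("occupancy starting from state 2:", long_run_occupancy(2).round(3))
# Near-identical occupancies are evidence (not proof) of a unique invariant measure.
```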
What does “red-teaming” mean within the context of AI?
Across both the Executive Order and the AI Act, the term AI red-teaming is used to refer to the finding of flaws and vulnerabilities in an AI system. The term is inspired by the notion of “red teams” in cybersecurity, where adversarial methods are used to identify vulnerabilities. Having said that, in AI, the notion of what is harmful or discriminatory is – as yet – unclear (see the questions above), as are the definitions of other undesirable system behaviours. Red-teaming is therefore neither well scoped nor well structured. Indeed, some researchers argue (https://arxiv.org/abs/2401.15897) that AI red-teaming currently amounts merely to “security theatre”.
In our view, post hoc, black-box testing of AI systems is a major challenge, and needs to start with a regulatory sandbox; it seems impossible to provide any guarantees for frugal methods of post hoc, black-box testing without involving one (see the questions above). Red-teaming is, in some ways, an attempt to institute a best-effort form of certification without any notion of criminal liability for the certification. If there is a single expert who pronounces a system fit for a particular purpose and who can be held individually responsible for the action or any omissions, this may be preferable to having a large and expensive team that is not subject to criminal liability. We thus argue for the criminal liability of all individuals involved in red-teaming.
How to communicate and explain the output of AI systems?
It is clear that the output of an AI system has to be properly communicated and explained to the user. The importance of having powerful and intuitive tools for explaining the outcomes of AI and machine-learning systems, especially when many such systems are so complicated that they essentially operate as “black boxes,” has been the subject of much recent discussion. For example, we know that there are a variety of biases in commonly used training data, which may result in the error of an AI system being spread very unevenly across individuals and subgroups. Explainability methods, and in particular counterfactual explanations, which have been shown to be both simple enough to understand easily and powerful enough to express sophisticated and nuanced explanations of outcomes, can be used to understand the fairness implications of different AI and machine-learning techniques. Simply stated, a counterfactual explanation of a given AI classification outcome encodes how much a given object needs to change for its label to flip. It thus provides a mechanism for comparing objects and evaluating the fairness of the AI system’s outcomes.
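For a linear classifier, such an explanation even has a closed form: the smallest change (in Euclidean norm) that crosses the decision boundary is a projection onto that boundary. A minimal sketch, with hypothetical data:

```python
# Sketch: a minimal-change counterfactual for a linear classifier. For a
# logistic model with weights w and bias b, the closest point (in L2) on the
# decision boundary is a simple projection; a small overshoot flips the label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
X = rng.normal(size=(2_000, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.3 + rng.normal(scale=0.5, size=2_000) > 0).astype(int)
clf = LogisticRegression().fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

def counterfactual(x: np.ndarray, overshoot: float = 1e-3) -> np.ndarray:
    """Smallest L2 change to x that moves it just across the decision boundary."""
    margin = w @ x + b
    return x - (1 + overshoot) * margin / (w @ w) * w

x = X[0]
x_cf = counterfactual(x)
print("original prediction:      ", clf.predict(x.reshape(1, -1))[0])
print("counterfactual prediction:", clf.predict(x_cf.reshape(1, -1))[0])
print("required change (L2 norm):", np.linalg.norm(x_cf - x).round(3))
```

The required change, compared across individuals or subgroups, is what makes such explanations useful for fairness assessment; for non-linear models, of course, no such closed form exists and the counterfactual must be searched for.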
Within the field of “Explainable AI”, many papers have been published on counterfactual explanations, yet non-trivial challenges abound. For example, the fundamental problem of quantifying the trade-off between the statistical performance of the algorithm and the different notions of bias remains. Additionally, recent work has focused on “local counterfactual explanations”, which attempt to explain individual AI system outcomes. To further the relevance of the approach, future work should focus on more general, global explanations. Initial approaches include techniques that provide explanations while considering fairness and the numerous subgroups induced by multiple protected attributes (cf. https://arxiv.org/abs/2306.14978), techniques that explicitly consider the plausibility of the explanations (https://arxiv.org/abs/2401.14086), and algorithms that derive general, global counterfactual explanations (http://arxiv.org/abs/2404.13476) while considering several optimisation factors, including the feasibility, simplicity and compactness of the explanation.
Efficient algorithms are required that can explain unfairness on arbitrary subgroups, so as to handle “intersectional bias” while remaining robust to phenomena such as gerrymandering and data sparseness. While our work (listed above) moves in this direction, these are only first, tentative steps.