THEORETICAL GUARANTEES FOR GENERALIZATION

This topic is the most challenging one, and considerable effort is being devoted to it worldwide.
Part of our effort will be dedicated to prospective research; the other part will focus on the state of the art and on benchmarking new solutions arising from the large scientific community actively working on the subject.

Faculty members: Sébastien Gerchinovitz, François Malgouyres, Jean-Michel Loubes, Mathieu Serrurier, Edouard Pauwels

Data scientists: TBD

Students: 5 PhD students

SCOPE

Neural networks have demonstrated an impressive ability to generalize despite their large number of parameters. This capacity for generalization is still poorly understood from a theoretical point of view, so it is currently not possible to guarantee it in any way.
The goal of this challenge is to define new methods, conditions and tools that provide stronger guarantees on the behavior of neural networks.

Under given assumptions, these new methods will give better confidence in the robustness of deep learning algorithms and will come with several indicators. These indicators could be used to convince the competent authorities during the certification process.

The typical neural networks used in applications have a number of parameters that is larger than, or comparable to, the number of training samples. This property leads to a loss function whose landscape is simple enough for descent algorithms to reach a low-cost solution, and it also guarantees that the expressivity of the network is sufficient to fit the data. Classical statistical learning theory tells us that such a configuration should lead to overfitting and a large generalization error, but this does not occur in practice.
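
As an informal illustration of this tension, a classical uniform-convergence bound (stated here only up to constants and logarithmic factors, in the spirit of [1] and [5]) controls the generalization gap through a capacity measure such as the VC dimension:

    R(\hat f) \;\le\; \widehat{R}_n(\hat f) \;+\; C\,\sqrt{\frac{\mathrm{VCdim}(\mathcal{F})\,\log n + \log(1/\delta)}{n}} \qquad \text{with probability at least } 1-\delta.

For piecewise-linear networks the VC dimension grows roughly like the number of weights times the depth, up to logarithmic factors [5], so when the parameter count is comparable to or larger than the number of training samples n this bound becomes vacuous; the fact that trained networks nonetheless generalize is precisely the gap this challenge addresses.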

Understanding the phenomena behind the success of optimization, both supervised and unsupervised (e.g. GANs), the expressivity of the networks, the stability of the network's features and the size of the generalization error is key to designing network topologies and regularization strategies that lead to robust neural networks.

These guarantees will be very helpful to certify neural networks embedded in critical systems. Both neural networks and other machine learning methods may be investigated in this challenge.

Examples of industrial use cases

All applications requiring trust, or even certification, are concerned. Benchmarks will be defined along the project, based on the available and sharable topics.

State of the art and limitations

Approximation properties are currently not entirely well understood: a density result was proved in the 1980s for one-hidden-layer neural networks, but the power of depth and the role of the network architecture are still only partially understood. Many unsupervised machine learning techniques, such as Generative Adversarial Networks (GANs) [10], VAEs [11] and flow-based models [12], have been developed to extract meaningful features from high-dimensional data. The same search for meaningful features is also expected from transfer learning techniques on supervised tasks. However, none of these techniques comes with guarantees on generalization, feature expressivity, convergence (for GANs) or even robustness.

Though GANs are explicitly defined as a game between two networks, most algorithms studied so far (mostly empirically) do not use this fact. It is in fact frequent to see two stochastic gradient descent algorithms run independently for the generator and the discriminator, with no proved guarantee that this leads to a good joint solution of the min-max problem. There have been recent attempts to address GAN training as a min-max problem per se, through game-theoretic/online-learning approaches (e.g., [17] [18] [19]). We would like to explore these directions further, with careful attention to computational issues. Optimization guarantees for deep networks are also in the making (e.g., open problems about the landscape of the objective function). Finally, and quite importantly, generalization bounds, especially for regularization methods on deep networks, are still scarce.

One piece of evidence of cracks in the generalization of neural networks is that they are known to be vulnerable to small perturbations of the input data [25]. This is an important issue for critical applications. First attempts to circumvent this problem are based on the adaptation of robust optimization techniques to deep network architectures. These approaches are mostly empirical, and it has been shown that many such adversarial training strategies can be circumvented [20]. An alternative to empirical approaches is to certify robustness against adversarial attacks. We will mostly be interested in certification approaches based on convex relaxations, such as LP relaxations [26] and SDP relaxations [24]. A tutorial on the state of the art in this field is given in [21].
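
As a concrete illustration of this vulnerability, the sketch below builds a small adversarial perturbation with the one-step fast gradient sign method, a standard attack not among the cited works; model (a trained PyTorch classifier), x (an input batch), y (the labels) and eps (the perturbation budget) are placeholders for the reader's own objects.

    import torch
    import torch.nn.functional as F

    def fgsm_perturbation(model, x, y, eps):
        """Return x + eps * sign(grad_x loss), a one-step adversarial perturbation."""
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)  # loss of the current prediction
        loss.backward()                          # gradient with respect to the input
        # Move every coordinate by eps in the direction that increases the loss.
        return (x_adv + eps * x_adv.grad.sign()).detach()

Even for values of eps small enough that the perturbed input is visually indistinguishable from the original, the predicted class often changes, which is exactly what the certification approaches of [26] and [24] aim to rule out formally.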

Scientific approach to solve the challenge

To tackle these problems, four main strategies will be studied.

A first direction of the research program (Risk Analysis) will be to prove risk bounds on the error (in expectation or with high probability) of such algorithms, for various deep learning tasks such as regression or classification. Proving that the generalization error is small, together with the optimization error and the empirical risk, would enable us to show that the risk of the learning algorithm is small, which would be a strong robustness guarantee. An intern is currently working on the landscape of the objective function in deep learning and will most probably start a PhD in September.

A second direction of the research program (Features Stability) will be to design neural networks with stable features, which is a necessary condition for obtaining interpretable features [13]. In a first step, it will be shown that specific neural network architectures based on dropout and ReLU have this stability property; this should translate into conditions on the number of samples and on the dropout parameter that ensure stability. The second step will be to show, under suitable hypotheses, that stability leads to generalization. A third step will study, from both a numerical and a theoretical point of view, how stability can be enforced by imposing parsimony in the intermediate layers. For that we will adopt:
– an empirical point of view, by implementing networks whose intermediate layers must be sparse and by testing these networks on concrete problems;
– a theoretical point of view, by adapting co-sparsity conditions (see, for example, [16]) to the case of deep networks.

As mentioned in the Villani report, game theory might carry the seeds of the next AI revolution. A third direction of the research program (Game theory and artificial intelligence) is to investigate possible applications of game-theoretic concepts and methods to AI, in particular new game-theoretic algorithms for GAN training. We would like to investigate how algorithms originally designed for games could prove useful for GAN training. These algorithms may come from non-cooperative game theory (beyond fictitious play) or from adversarial online learning (no-regret algorithms), and will need to be largely adapted to the GAN context. Convergence bounds may be combined with approximation and generalization guarantees for neural networks. In addition, the payoffs in GANs may be modified; for example, they could be defined through new probability metrics designed for partially observable Markov decision processes and dynamic games with incomplete information. More generally, it is worthwhile to go beyond the two-player zero-sum setup and to design new games to learn and mimic distributions of high-dimensional data, with improved empirical and theoretical properties. Multi-player games could prove useful and might become an important topic.
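
To fix ideas on the min-max formulation discussed above, the following sketch shows the standard alternating stochastic-gradient training of a GAN, i.e. the heuristic with no joint-convergence guarantee that the game-theoretic direction aims to improve; the toy architectures, dimensions and learning rates are arbitrary placeholders.

    import torch
    import torch.nn as nn

    # Toy generator (latent code -> 2-dimensional sample) and discriminator (sample -> logit).
    G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
    D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
    opt_G = torch.optim.SGD(G.parameters(), lr=1e-3)
    opt_D = torch.optim.SGD(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    def training_step(real_batch):
        """One alternating update of min_G max_D E[log D(real)] + E[log(1 - D(G(z)))]."""
        n = real_batch.size(0)
        z = torch.randn(n, 16)

        # Ascent step for the discriminator (maximize the value function).
        opt_D.zero_grad()
        d_loss = bce(D(real_batch), torch.ones(n, 1)) + bce(D(G(z).detach()), torch.zeros(n, 1))
        d_loss.backward()
        opt_D.step()

        # Descent step for the generator (here with the usual non-saturating loss).
        opt_G.zero_grad()
        g_loss = bce(D(G(z)), torch.ones(n, 1))
        g_loss.backward()
        opt_G.step()

Game-theoretic variants such as [17] [18] [19] replace these two independent gradient steps with updates that average over past strategies or enforce no-regret properties, which is where convergence guarantees for the joint min-max problem may become available.
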
The fourth direction of the research program (Certifying robustness of deep networks) focuses on the robustness of neural networks through Lasserre's hierarchy, which can be used to solve global polynomial optimization problems over semi-algebraic sets [22] and has many applications, among which providing numerical certificates for formal proofs and certification [23]. In a first step we would like to investigate how this approach could be used to extend the approaches of [26] [24] and provide a hierarchy of adversarial robustness bounds for deep networks. In a second step we will investigate how the specific structure of neural network robustness certification problems could be exploited to devise efficient numerical algorithms that scale the approach up to real-world networks. A second axis of this direction (Christoffel-Darboux kernels with applications in deep learning explainability) is to reformulate the initial problem as an optimization problem over probability measures. Recent research [27] investigated the ability of Christoffel-Darboux kernels to capture information about the support of an unknown probability measure. A distinguishing feature of this approach is that it allows one to infer support characteristics from the knowledge of finitely many moments of the underlying measure.
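
As a minimal illustration of this last axis, the sketch below evaluates the empirical Christoffel function of a point cloud in the spirit of [27]: it forms the empirical moment matrix of a monomial basis up to a fixed degree and scores each query point through the inverse of the associated quadratic form, so that points far from the support of the data receive values close to zero. The degree, the ridge regularization and the toy data are arbitrary choices made for the example.

    import numpy as np
    from itertools import combinations_with_replacement

    def monomial_features(X, degree):
        """Evaluate all monomials of total degree <= degree at the rows of X."""
        n, d = X.shape
        cols = [np.ones(n)]
        for k in range(1, degree + 1):
            for idx in combinations_with_replacement(range(d), k):
                cols.append(np.prod(X[:, idx], axis=1))
        return np.stack(cols, axis=1)

    def empirical_christoffel(X, queries, degree=3, reg=1e-8):
        """Lambda(x) = 1 / (v(x)^T M^{-1} v(x)), M being the empirical moment matrix."""
        V = monomial_features(X, degree)
        M = V.T @ V / X.shape[0]
        M_inv = np.linalg.inv(M + reg * np.eye(M.shape[0]))  # small ridge for numerical stability
        Vq = monomial_features(queries, degree)
        return 1.0 / np.einsum("ij,jk,ik->i", Vq, M_inv, Vq)

    # Toy usage: points near the unit circle score much higher than an outlier.
    rng = np.random.default_rng(0)
    theta = rng.uniform(0.0, 2.0 * np.pi, 500)
    X = np.stack([np.cos(theta), np.sin(theta)], axis=1) + 0.05 * rng.normal(size=(500, 2))
    print(empirical_christoffel(X, np.array([[1.0, 0.0], [3.0, 3.0]])))

One possible use, consistent with the support-inference viewpoint above, is to apply the same computation to the internal representations of a trained network in order to flag inputs that fall outside the support of the training distribution.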

Expected scientific outcomes

  1. Bounds on the generalization error under the weakest possible assumptions
  2. Architecture and dataset conditions for the design of stable neural networks
  3. New tractable algorithms for GANs, with proofs of convergence and finite-time error bounds

Data required to implement the challenge

To be defined along the project.

Success criteria for the challenge

The challenge will be successful if we have developed:

  1. A methodology and conditions (architecture and dataset) to design stable neural networks with theoretical bounds
  2. Benchmarks demonstrating the performance of these algorithms on various datasets
  3. More stable training techniques for GANs
  4. A tool providing indicators to evaluate robustness within a certification perspective

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations, Cambridge University Press, 2009

[2] D. Yarotsky. Error bounds for approximations with deep ReLU networks, Neural Networks, vol. 94, pp. 103-114, 2017

[3] D. Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks, Proceedings of COLT’18, pp. 639-649, 2018

[4] R. Eldan and O. Shamir. The Power of Depth for Feedforward Neural Networks, Proceedings of COLT’16, pp. 907-940, 2016

[5] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks, arXiv:1703.02930, 2017

[6] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems (NIPS), 2016

[7] P. Bartlett, D. J. Foster, M. Telgarsky. Spectrally-normalized margin bounds for neural networks, Advances in Neural Information Processing Systems 30 (Proceedings of NIPS 2017), 2017

[8] Noah Golowich, Alexander Rakhlin, Ohad Shamir. Size-Independent Sample Complexity of Neural Networks, Proceedings of COLT’18, pp. 297-299, 2018

[9] J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function, arXiv:1708.06633, 2017

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio. Generative Adversarial Nets, Advances in Neural Information Processing Systems (NIPS), 2014

[11] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, 2014

[12] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative Flow with Invertible 1×1 Convolutions, arXiv:1807.03039, 2018

[13] François Malgouyres and Joseph Landsberg. Multilinear compressive sensing and an application to convolutional linear networks, https://www.math.univ-toulouse.fr/~fmalgouy/tmp/identifiable_long.pdf

[14] Peter Buehlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer, 2011

[15] Florentina Bunea. Honest variable selection in linear and logistic regression models via L1 and L1+L2 penalization, Electron. J. Statist, vol. 2, pp. 1153-1194, 2008

[16] Maryia Kabanava, Holger Rauhut. Cosparsity in compressed sensing, Compressed Sensing and Its Applications, pp. 315–339, 2015

[17] H. Ge, Y. Xia, X. Chen, R. Berry and Y. Wu. Fictitious GAN: Training GANs with Historical Models, Proceedings of ECCV'18, pp. 122-137, 2018

[18] P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann and A. Krause. An Online Learning Approach to Generative Adversarial Networks, In Proceedings of International Conference on Learning Representations (ICLR’18), 2018

[19] Y.-P. Hsieh, C. Liu and V. Cevher. Finding Mixed Nash Equilibria of Generative Adversarial Networks, Proceedings of International Conference on Learning Representations (ICLR’19), 2019

[20] A. Athalye, N. Carlini and D. Wagner (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. ICML

[21] Z. Kolter and A. Madry (2018). Adversarial Robustness: Theory and Practice. NeurIPS tutorial, adversarial-ml-tutorial.org

[22] J.B. Lasserre (2001). Global optimization with polynomials and the problem of moments. SIAM Journal on optimization, 11(3), 796-817

[23] V. Magron, X. Allamigeon, S. Gaubert and B. Werner (2015). Formal Proofs for Nonlinear Optimization. Journal of Formalized Reasoning, 8(1), 1-24

[24] A. Raghunathan, J. Steinhardt and P. S. Liang (2018). Semidefinite relaxations for certifying robustness to adversarial examples. NeurIPS

[25] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow and R. Fergus (2014). Intriguing properties of neural networks. In International Conference on Learning Representations

[26] E. Wong and J. Z. Kolter (2018). Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML

[27] E. Pauwels and J.-B. Lasserre (2019). The empirical Christoffel function with applications in data analysis. To appear in Advances in Computational Mathematics