{"title": "Graph-based Discriminators: Sample Complexity and Expressiveness", "book": "Advances in Neural Information Processing Systems", "page_first": 6699, "page_last": 6708, "abstract": "A basic question in learning theory is to identify if two\ndistributions are identical when we have access only to examples sampled from the distributions.\nThis basic task is considered, for example, in the context of\nGenerative Adversarial Networks (GANs), where a discriminator is trained to distinguish between a real-life distribution and a synthetic distribution.\nClassically, we use a hypothesis class $H$ and claim that the two\ndistributions are distinct if for some $h\\in H$ the expected value\non the two distributions is (significantly) different.\n\nOur starting point is the following fundamental problem: \"is having\nthe hypothesis dependent on more than a single random example\nbeneficial\". To address this challenge we define $k$-ary based\ndiscriminators, which have a family of Boolean $k$-ary functions\n$\\G$. Each function $g\\in \\G$ naturally defines a hyper-graph,\nindicating whether a given hyper-edge exists. A function $g\\in \\G$\ndistinguishes between two distributions, if the expected value of\n$g$, on a $k$-tuple of i.i.d examples, on the two distributions is\n(significantly) different.\n\nWe study the expressiveness of families of $k$-ary functions,\ncompared to the classical hypothesis class $H$, which is $k=1$. We\nshow a separation in expressiveness of $k+1$-ary versus $k$-ary\nfunctions. This demonstrate the great benefit of having $k\\geq 2$ as\ndistinguishers.\n\nFor $k\\geq 2$ we introduce a notion similar to the VC-dimension, and\nshow that it controls the sample complexity. We proceed and provide upper and\nlower bounds as a function of our extended notion of VC-dimension.", "full_text": "Graph-based Discriminators: Sample Complexity\n\nand Expressiveness\n\nRoi Livni\n\nTel Aviv University\n\nrlivni@tauex.tau.ac.il\n\nYishay Mansour\n\nTel Aviv University and Google\nmansour.yishay@gmail.com\n\nAbstract\n\nA basic question in learning theory is to identify if two distributions are identical\nwhen we have access only to examples sampled from the distributions. This\nbasic task is considered, for example, in the context of Generative Adversarial\nNetworks (GANs), where a discriminator is trained to distinguish between a real-\nlife distribution and a synthetic distribution. Classically, we use a hypothesis class\nH and claim that the two distributions are distinct if for some h \u2208 H the expected\nvalue on the two distributions is (signi\ufb01cantly) different.\nOur starting point is the following fundamental problem: \"is having the hypothesis\ndependent on more than a single random example bene\ufb01cial\". To address this\nchallenge we de\ufb01ne k-ary based discriminators, which have a family of Boolean\nk-ary functions G. Each function g \u2208 G naturally de\ufb01nes a hyper-graph, indicating\nwhether a given hyper-edge exists. A function g \u2208 G distinguishes between two\ndistributions, if the expected value of g, on a k-tuple of i.i.d examples, on the two\ndistributions is (signi\ufb01cantly) different.\nWe study the expressiveness of families of k-ary functions, compared to the classi-\ncal hypothesis class H, which is k = 1. We show a separation in expressiveness\nof k + 1-ary versus k-ary functions. This demonstrate the great bene\ufb01t of having\nk \u2265 2 as distinguishers.\nFor k \u2265 2 we introduce a notion similar to the VC-dimension, and show that it\ncontrols the sample complexity. We proceed and provide upper and lower bounds\nas a function of our extended notion of VC-dimension.\n\n1\n\nIntroduction\n\nThe task of discrimination consists of a discriminator that receives \ufb01nite samples from two distribu-\ntions, say p1 and p2, and needs to certify whether the two distributions are distinct. Discrimination has\na central role within the framework of Generative Adversarial Networks [12], where a discriminator\ntrains a neural net to distinguish between samples from a real-life distribution and samples generated\nsynthetically by another neural network, called a generator.\nA possible formal setup for discrimination identi\ufb01es the discriminator with some distinguishing class\nD = {f : X \u2192 R} of distinguishing functions. In turn, the discriminator wishes to \ufb01nd the best\nd \u2208 D that distinguishes between the two distributions. Formally, she wishes to \ufb01nd d \u2208 D such that1\n\n(cid:12)(cid:12)(cid:12)(cid:12) \u2212 \u0001.\n\n[d\u2217(x)] \u2212 E\nx\u223cp2\n\n[d\u2217(x)]\n\n(1)\n\n(cid:12)(cid:12)(cid:12)(cid:12) E\n\nx\u223cp1\n\n(cid:12)(cid:12)(cid:12)(cid:12) > sup\n\nd\u2217\u2208D\n\n(cid:12)(cid:12)(cid:12)(cid:12) E\n\nx\u223cp1\n\n[d(x)] \u2212 E\nx\u223cp2\n\n[d(x)]\n\n1Note that with such d at hand, with an order of O(1/\u00012) examples one can verify if any discriminator in the\n\nclass certi\ufb01es that the two distributions are distinct.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFor examples, in GANs, the class of distinguishing functions we will consider could be the class of\nneural networks trained by the discriminator.\nThe \ufb01rst term in the RHS of eq. (1) is often referred to as the Integral Probability Metric (IPM distance)\nw.r.t a class D [16], denoted IPMD. As such, we can think of the discriminator as computing the\nIPMD distance.\nWhether two, given, distributions can be distinguished by the discriminator becomes, in the IPM\nsetup, a property of the distinguishing class. Also, the number of examples needed to be observed\nwill depend on the class in question. Thus, if we take a large expressive class of distinguishers, the\ndiscriminator can potentially distinguish between any two distributions that are far in total variation.\nIn that extreme, though, the class of distinguishers would need to be very large and in turn, the\nnumber of samples needed to be observed scales accordingly. One could also choose a \u201csmall\" class,\nbut at a cost of smaller distinguishing power that yields smaller IPM distance.\nFor example, consider two distributions over [n] to be distinguished. We could choose as a distin-\nguishing class the class of all possible subsets over n. This distinguishing class give rise to the total\nvariation distance, but the sample complexity turns out to be O(n). Alternatively we can consider the\nclass of singletones: This class will induce a simple IPM distance, with graceful sample complexity,\nhowever in worst case the IPM distance can be as small as O(1/n) even though the total variation\ndistance is large.\nThus, IPM framework initiates a study of generalization complexity where we wish to understand\nwhat is the expressive power of each class and what is its sample complexity.\nFor this special case that D consists of Boolean functions, the problem turns out to be closely related\nto the classical statistical learning setting and prediction [20]. The sample complexity (i.e., number of\nsamples needed to be observed by the discriminator) is governed by a combinatorial measure termed\nVC dimension. Speci\ufb01cally, for the discriminator to be able to \ufb01nd a d as in eq. (1), she needs to\nobserve order of \u0398( \u03c1\nIn this work we consider a natural extension of this framework to more sophisticated discriminators:\nFor example, consider a discriminator that observes pairs of points from the distribution and checks\nfor collisions \u2013 such a distinguisher cannot apriori be modeled as a test of Boolean functions, as the\ntester measures a relation between two points and not a property of a single point. The collision test\nhas indeed been used, in the context of synthetic data generation, to evaluate the diversity of the\nsynthetic distribution [2].\nMore generally, suppose we have a class of 2-ary Boolean functions: G = {g : g(x1, x2) \u2192 {0, 1}}\nand the discriminator wishes to (approximately) compute\n\n\u00012 ) examples, where \u03c1 is the VC dimension of the class D [5, 20].\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) .\n\n(2)\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)\n\nsup\ng\u2208G\n\nE\n\n(x1,x2)\u223cp2\n\n1\n\n[g(x1, x2)] \u2212\n\nE\n\n(x1,x2)\u223cp2\n\n2\n\n[g(x1, x2)]\n\nHere p2 denotes the product distribution over p. More generally, we may consider k-ary mappings,\nbut for the sake of clarity, we will restrict our attention in this introduction to k = 2.\nSuch 2-ary Boolean mapping can be considered as graphs where g(x1, x2) = 1 symbolizes that there\nexists an edge between x1 and x2 and similarly g(x1, x2) = 0 denotes that there is no such edge.\nThe collision test, for example, is modelled by a graph that contains only self\u2013loops. We thus call\nsuch multi-ary statistical tests graph-based distinguishers.\nTwo natural question then arise\n\n1. Do graph\u2013based discriminators have any added distinguishing power over classical discrimi-\n\nnators?\n\n2. What is the sample complexity of graph\u2013based discriminators?\n\nWith respect to the \ufb01rst question we give an af\ufb01rmative answer and we show a separation between the\ndistinguishing power of graph\u2013based discriminators and classical discriminators. As to the second\nquestion, we introduce a new combinatorial measure (termed graph VC dimension) that governs the\nsample complexity of graph\u2013based discriminators \u2013 analogously to the VC characterization of the\nsample complexity of classical discriminators. We next elaborate on each of these two results.\n\n2\n\n\f\u221a\n\n\u221a\n\nAs to the distinguishing power of graph\u2013based discriminators, we give an af\ufb01rmative answer in the\nfollowing sense: We show that there exists a single graph g such that, for any distinguishing class D\nwith bounded VC dimension, and \u0001, there are two distributions p1 and p2 that are D\u2013indistinguishable\nbut g certi\ufb01es that p1 and p2 are distinct. Namely, the quantity in eq. (2) is at least 1/4 for G = {g}.\nThis result may be surprising. It is indeed known that for any two distributions that are \u0001\u2013far in total\nvariation, there exists a boolean mapping d that distinguishes between the two distributions. In that\nsense, distinguishing classes are known to be universal. Thus, asymptotically, with enough samples\nany two distribution can be ultimately distinguished via a standard distinguishing function.\nNevertheless, our result shows that, given \ufb01nite data, the restriction to classes with \ufb01nite capacity\nis limiting, and there could be graph-based distinguishing functions whose distinguishing power is\nnot comparable to any class with \ufb01nite capacity. We stress that the same graph competes with all\n\ufb01nite\u2013capacity classes, irrespective of their VC dimension.\nWith respect to the second question, we introduce a new VC-like notion termed graph VC dimension\nthat extends naturally to graphs (and hypergraphs). On a high level, we show that for a class of graph-\nbased distinguishers with graph VC dimension \u03c1, O(\u03c1) examples are suf\ufb01cient for discrimination\nand that \u2126(\n\u03c1 which we leave as an open\nquestion.\nThe notion we introduce is strictly weaker than the standard VC\u2013dimension of families of multi-ary\nfunctions, and the proofs we provide do not follow directly from classical results on learnability of\n\ufb01nite VC classes [20, 5]. In more details, a graph-based distinguishing class G is a family of Boolean\nfunctions over the product space of vertices V: G \u2286 {0, 1}V 2. As such it is equipped with a VC\ndimension, the largest set of pairs of vertices that is shattered by G.\nIt is not hard to show that \ufb01nite VC is suf\ufb01cient to achieve \ufb01nite sample complexity bounds over 2-ary\nfunctions [9]. It turns out, though, that it is not a necessary condition: For example, one can show\nthat the class of k-regular graphs has \ufb01nite graph VC dimension but in\ufb01nite VC dimension. Thus,\neven though they are not learnable in the standard PAC setting, they have \ufb01nite sample complexity\nwithin the framework of discrimination.\nThe reason for this gap, between learnability and discriminability, is that learning requires uniform\nconvergence with respect to any possible distribution over pairs, while discrimination requires\nuniform convergence only with respect to product distributions \u2013 formally then, it is a weaker task,\nand, potentially, can be performed even for classes with in\ufb01nite VC dimension.\n\n\u03c1) examples are necessary. This leaves a gap of factor\n\n1.1 Related Work\n\nThe task of discrimination has been considered as early as the work of Vapnik and Chervonenkis in\n[20]. In fact, even though Vapnik and Chervonenkis original work is often referred in the context of\nprediction, the original work considered the question of when the empirical frequency of Boolean\nfunctions converges uniformly to the true probability over a class of functions. In that sense, this\nwork can be considered as a natural extension to k-ary functions and generalization of the notion of\nVC dimension.\nThe work of [9, 8] studies also a generalization of VC theory to multi-ary functions in the context\nof ranking tasks and U-statistics. They study the standard notion of VC dimension. Speci\ufb01cally\nthey consider the function class as Boolean functions over multi-tuples and the VC dimension is\nde\ufb01ned by the largest set of multi-tuples that can be shattered. Their work provides several interesting\nfast-rate convergence guarantees. As discussed in the introduction, our notion of capacity is weaker,\nand in general the results are incomparable.\n\nGANs A more recent interest in discrimination tasks is motivated by the framework of GANs,\nwhere a neural network is trained to distinguish between two sets of data \u2013 one is real and the other is\ngenerated by another neural network called generator. Multi-ary tests have been proposed to assess\nthe quality of GANs networks. [2] suggests birthday paradox to evaluate diversity in GANs. [18]\nuses Binning to assess the solution proposed by GANs.\nCloser to this work [15] suggests the use of a discriminator that observes samples from the m-th\nproduct distribution. Motivated by the problem of mode collapse they suggest a theoretical framework\nin which they study the algorithmic bene\ufb01ts of such discriminators and observe that they can\n\n3\n\n\fsigni\ufb01cantly reduce mode collapse. In contrast, our work is less concerned with the problem of mode\ncollapse directly and we ask in general if we can boost the distinguishing power of discriminators via\nmulti-ary discrimination. Moreover, we provide several novel sample complexity bounds.\n\nProperty Testing A related problem to ours is that of testing closeness of distributions [3, 11].\nTraditionally, testing closeness of distribution is concerned with evaluating if two discrete distributions\nare close vs. far/identical in total variation. [11], motivated by graph expansion test, propose a\ncollision test to verify if a certain distribution is close to uniform. Interestingly, a collision test is a\ngraph-based discriminator which turns out to be optimal for the setting[17]. Our sample\u2013complexity\nlower bounds are derived from these results. Speci\ufb01cally we reduce discrimination to testing\nuniformity [17]. Other lower bounds in the literature can be similarly used to achieve alternative\n(yet incomparable bounds) (e.g. [7] provides a \u2126(n2/3/\u00013/4) lower bounds for testing whether two\ndistributions are far or close).\nIn contrast with the aforementioned setup, here we do not measure distance between distributions\nin terms of total variation but in terms of an IPM distance induced by a class of distinguishers. The\nadvantage of the IPM distance is that it sometimes can be estimated with limited amount of samples,\nwhile the total variation distance scales with the size of the support, which is often too large to allow\nestimation.\nSeveral works do study the question of distinguishing between two distributions w.r.t a \ufb01nite capacity\nclass of tests, Speci\ufb01cally the work of [14] studies refutation algorithms that distinguish between\nnoisy labels and labels that correlate with a bounded hypothesis class. [19] studies a closely related\nquestion in the context of realizable PAC learning. A graph-based discriminator can be directly turned\nto a refutation algorithm, and both works of [14, 19] show reductions from refutation to learning.\nIn turn, the agnostic bounds of [14] can be harnessed to achieve lower bounds for graph-based\ndiscrimination. Unfortunately this approach leads to suboptimal lower bounds. It would be interesting\nto see if one can improve the guarantees for such reductions, and in turn exploit it for our setting.\n\n2 Problem Setup\n\n2.1 Basic Notations \u2013 Graphs and HyperGraphs\nRecall that a k-hypergraph g consists of a a set Vg of vertices and a collection of non empty k\u2013\ntuples over V: Eg \u2286 V k, which are referred to as hyperedges. If k = 2 then g is called a graph.\n1\u2013hypergraphs are simply identi\ufb01ed as subsets over V. We will normally use d to denote such\n1-hypergraphs and will refer to them as distinguishers. A distinguisher d can be identi\ufb01ed with a\nBoolean function according to the rule: d(x) = 1 iff x \u2208 Ed.\nSimilarly we can identify a k-hypergraph with a function g : V k \u2192 {0, 1}. Namely, for any graph g\nwe identify it with the Boolean function\n\n(cid:26)1\n\n0\n\ng(v1, . . . , vk) =\n\n(v1, . . . , vk) \u2208 Eg\nelse\n\nWe will further simplify and assume that g is undirected, this means that for any permutation\n\u03c0 : [k] \u2192 [k], we have that\n\ng(v\u03c0(1), v\u03c0(2), . . . , v\u03c0(k)) = g(v1, . . . , vk).\n\nWe will call undirected k-hypergraphs, k-distinguishers. A collection of k-distinguishers over a\ncommon set of vertices V will be referred to as a k-distinguishing class. If k = 1 we will simply call\nsuch a collection a distinguishing class. For k > 1 we will normally denote such a collection with G\nand for k = 1 we will often use the letter D.\nNext, given a distribution P over vertices and a k\u2013hypergraph g let us denote as follows the frequency\nof an edge w.r.t P :\n\nEP (g) = E\n\nv1:k\u223cP k\n\n[g(v1:k)] = P k [{(v1, . . . , vk) : (v1, . . . , vk) \u2208 Eg)}] ,\n\n4\n\n\fwhere we use the notation v1:t in shorthand for the sequence (v1, . . . , vt) \u2208 V t, and P k denotes the\nproduct distribution of P k times.\nSimilarly, given a sample S = {vi}m\n\n(cid:88)\n\nu1:k\u2208Sk\n\nES(g) =\n\n1\nmk\n\ni=1 we denote the empirical frequency of an edge:\n|{(u1, . . . , uk) \u2208 Eg : \u2200i, ui \u2208 S}|\n\ng(u1:k) =\n\nmk\n\nAs a \ufb01nal set of notations: Given a k-hypergraph g a sequence v1:n where n < k, we de\ufb01ne a\nk \u2212 n\u2013distinguisher gv1:n as follows:\n\ngv1:n (u1:k\u2212n) = g(v1, . . . vn, u1, . . . uk\u2212n).\n\nIn turn, we de\ufb01ne the following distinguishing classes: For every sequence v1:n, n < k, the\ndistinguishing class Gv1:n is de\ufb01ned as follows:\n\n(3)\nFinally, we point out that we will mainly be concerned with the case that |V| \u2264 \u221e or V = N.\nHowever, all the results here can be easily extended to other domains as long as certain (natural)\nmeasurability assumptions are given to ensure that VC theory holds (see [20, 4]).\n\nGv1:n = {gv1:n : g \u2208 G}\n\nIPM distance\n\n2.2\nGiven a class of distinguishers D the induced IPM distance [16], denoted by IPMD, is a (pseudo)\u2013\nmetric between distributions over V de\ufb01ned as follows\n\nIPMD(p1, p2) = sup\nd\u2208D\n\n|Ep1 (d) \u2212 Ep2 (d)| = sup\nd\u2208D\n\n[d(v)] \u2212 E\nv\u223cp2\n\n[d(v)]\n\nThe de\ufb01nition can naturally be extended to a general family of graphs, and we de\ufb01ne:\n\nIPMG(p1, p2) = sup\ng\u2208G\n\n|Ep1(g) \u2212 Ep2(g)]| = sup\ng\u2208G\n\n[g(v1:k)] \u2212 E\n\nv1:k\u223cpk\n\n2\n\n[g(v1:k)]]\n\nAnother metric we would care about is the total variation metric. Given two distributions p1 and p2\nthe total variation distance is de\ufb01ned as:\n\n|p1(E) \u2212 p2(E)|\n\nTV(p1, p2) = sup\nE\nwhere E \u2286 V{0,1} goes over all measurable events.\nIn contrast with an IPM distance, the total variation metric is indeed a metric and any two distributions\np1 (cid:54)= p2 we have that TV(p1, p2) > 0. In fact, for every distinguishing class D, IPMD (cid:22) TV.2\nFor \ufb01nite classes of vertices V, it is known that the total variation metric is given by\n\nTV(p1, p2) =\n\n1\n2\n\n|p1(v) \u2212 p2(v)|.\n\n(cid:88)\n\nv\u2208V\n\nFurther, if we let D = P (V) the power set of V we obtain\n\nIPMP (V)(p1, p2) = TV(p1, p2).\n\n2.3 Discriminating Algorithms\nDe\ufb01nition 1. Given a distinguishing class G a G-discriminating algorithm A with sample complexity\nm(\u0001, \u03b4) is an algorithm that receives as input two \ufb01nite samples S = (S1, S2) of vertices and outputs\na hyper-graph gA\n\nS \u2208 G such that:\n\n2we use the notation f1 (cid:22) f2 to denote that for every x, y we have f1(x, y) \u2264 f2(x, y).\n\n5\n\n(cid:12)(cid:12)(cid:12)(cid:12) E\n\nv\u223cp1\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E\n\nv1:k\u223cpk\n\n1\n\n(cid:12)(cid:12)(cid:12)(cid:12) .\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)\n\n\fIf S1, S2 are drawn IID from some unknown distributions p1, p2 respectively and |S1|,|S2| > m(\u0001, \u03b4)\nthen w.p. (1 \u2212 \u03b4) the algorithm\u2019s output satis\ufb01es:\n\n|Ep1(gA\n\nS ) \u2212 Ep2 (gA\n\nS )| > IPMG(p1, p2) \u2212 \u0001.\n\nThe sample complexity of a class G is then given by the smallest possible sample complexity of a\nG-discriminating algorithm A.\nA class G is said to be discriminable if it has \ufb01nite sample complexity. Namely there exists a\ndiscriminating algorithm for G with sample complexity m(\u0001, \u03b4) < \u221e.\n\nVC classes are discriminable For the case k = 1, discrimination is closely related to PAC learning.\nIt is easy to see that a proper learning algorithm for a class D can be turned into a discriminator:\nIndeed, given access to samples from two distributions p1 and p2 we can provide a learner with\nlabelled examples from a distribution p de\ufb01ned as follows: p(y = 1) = p(y = \u22121) = 1\n2 and\np(\u00b7|y = 1) = p1, and p(\u00b7|y = \u22121) = p2. Given access to samples from p1 and p2 we can clearly\ngenerate IID samples from the distribution p. If, in turn, we provide a learner with samples from p\nand it outputs a hypothesis d \u2208 D we have that (w.h.p):\n\n|Ep1 (d) \u2212 Ep2 (d)| = 2| 1\n2\n\n1\n2\n\nE\n\n(x,y)\u223cp2\u00d7{\u22121}\n\n[yd(x)]|\n\n[yd(x)] +\n\nE\n(x,y)\u223cp1\u00d7{1}\n[yd(x)]|\n\n(x,y)\u223cp\n\n= 2| E\n= 2(1 \u2212 2p(d(x) (cid:54)= y))\n\u2265 2(1 \u2212 2(min\nd\u2208D p(d(x) (cid:54)= y) + \u0001))\nyd(x)| \u2212 4\u0001)\nd\u2208D (2(| E\n(x,y)\u223cp\nd\u2208D |Ep1(d) \u2212 Ep2 (d)| \u2212 4\u0001\n\n= max\n= IPMD(p1, p2) \u2212 4\u0001\n\n= max\n\nOne can also see that a converse relation holds, if we restrict our attention to learning balanced labels\n(i.e., p(y = 1) = p(y = \u22121)). Namely, given labelled examples from some balanced distribution,\nthe output of a discriminator is a predictor that competes with the class of predictors induced by D.\nOverall, the above calculation, together with Vapnik and Chervonenkis\u2019s classical result [20] shows\n\u00012 ).3 The\nthat classes with \ufb01nite VC dimension \u03c1 are discriminable with sample complexity O( \u03c1\nnecessity of \ufb01nite VC dimension for agnostic PAC-learning was shown in [1]. Basically the same\nargument shows that given a class D, \u02dc\u2126( \u03c1\n\u00012 ) examples are necessary for discrimination. We next\nintroduce a natural extension of VC dimension to hypergraphs, which will play a similar role.\n\n2.4 VC Dimension of hypergraphs\n\nWe next de\ufb01ne the notion of graph VC dimension for hypergraphs, as we will later see this notion\nindeed characterizes the sample complexity of discriminating classes, and in that sense it is a natural\nextension of the notion of VC dimension for hypotheses classes:\nDe\ufb01nition 2. Given a family of k-hypergraphs, G: The graph VC dimension of the class G, denoted\ngVC(G), is de\ufb01ned inductively as follows: For k = 1 gVC(G) is the standard notion of VC dimension,\ni.e., gVC(G) = VC(G). For k > 1:\n\ngVC(G) = max\n\nv\u2208V {gVC(Gv)}\n\nRoughly, the graph VC dimension of a hypergraph is given by the VC dimension of the induced\nclasses of distinguishers via projections. Namely, we can think of the VC dimension of hypergraphs\nas the projected VC dimension when we \ufb01x all coordinates in an edge except for one.\n\n3Recall that the VC dimension of a class D is the largest set that can be shattered by D where a set S \u2286 V is\n\nsaid to be shattered if D restricted to S consists of 2|S| possible Boolean functions.\n\n6\n\n\f3 Main Results\n\nWe next describe the main results of this work. The results are divided into two sections: For the \ufb01rst\npart we characterize the sample complexity of graph\u2013based distinguishing class. The second part is\nconcerned with the expressive/distinguishing power of graph\u2013based discriminators. All proofs are\nprovided in the full version.\n\n3.1 The sample complexity of graph-based distinguishing class\n\nWe begin by providing upper bounds to the sample complexity for discrimination\nTheorem 1 (Sample Complexity \u2013 Upper Bound). Let G be a k\u2013distinguishing class with gVC(G) =\n\u03c1 then G has sample complexity O( \u03c1k2\n\n\u00012 log 1/\u03b4).\n\nTheorem 1 is a corollary of the following uniform convergence upper bound for graph-based distin-\nguishing classes.\nTheorem 2 (uniform convergence). Let G be a k\u2013distinguishing class with gVC(G) = \u03c1. Let\nS = {vi}m\ni=1 be an IID sample of vertices drawn from some unknown distribution P . If m =\n\u2126( \u03c1k2\n\n\u00012 log 1/\u03b4) then with probability at least (1 \u2212 \u03b4) (over the randomness of S):\n\n|ES(g) \u2212 EP (g)| \u2264 \u0001.\n\nsup\ng\u2208G\n\nWe next provide a lower bound for the sample complexity of discriminating algorithms in terms of\nthe graph VC dimension of the class\nTheorem 3 (Sample Complexity \u2013 Lower Bound). Let G be a k\u2013distinguishing class with gVC(G) =\n\u03c1. Any G-discriminating algorithm with accuracy \u0001 > 0 that succeeds with probability 1 \u2212 2\u2212k log k\n,\nmust observe at least \u2126\n\n(cid:16) \u221a\n\nsamples.\n\n(cid:17)\n\n3\n\n\u03c1\n\n27k3 \u00012\n\n\u221a\n\nOur upper bounds and lower bounds leave a gap of order O(\ncase k = 1 we can provide a tight \u03b8( \u03c1\nappropriate lower bounds[1].\n\n\u03c1): As dicussed in section 2.3, for the\n\u00012 ) bound through a reduction to agnostic PAC learning and the\n\n3.2 The expressive power of graph-based distinguishing class\n\nSo far we have characterized the discriminability of graph-based distinguishing classes. It is natural\nthough to ask if graph\u2013based distinguishing classes add any advantage over standard 1-distinguishing\nclasses. In this section we provide several results that show that indeed graph provide extra expressive\npower over standard distinguishing classes.\nWe begin by providing a result over in\ufb01nite graphs.\nTheorem 4. Let V = N. There exists a distinguishing graph class G, with sample complexity\n) (in fact |G| = 1) such that: for any 1-distinguishing class D with \ufb01nite VC\nm(\u0001, \u03b4) = O( log 1/\u03b4\ndimension, and every \u0001 > 0 there are two distributions p1, p2 such that IPMD(p1, p2) < \u0001 but\nIPMG(p1, p2) > 1/2\n\n\u00012\n\nTheorem 4 can be generalized to higher order distinguishing classes :\nTheorem 5. Let V = N. There exists a k-distinguishing class Gk, with sample complexity\n) such that: For any k \u2212 1-distinguishing class Gk\u22121 with bounded sample\nm(\u0001, \u03b4) = O( k2+log 1/\u03b4\ncomplexity, and every \u0001 > 0 there are two distributions p1, p2 such that IPMGk\u22121 (p1, p2) < \u0001 and\nIPMGk (p1, p2) > 1/4.\n\n\u00012\n\nFinite Graphs We next study the expressive power of distinguishing graphs over \ufb01nite domains.\nIt is known that, over a \ufb01nite domain V = {1, . . . , n}, we can learn with a sample complexity of\n\u00012 log 1/\u03b4) any distinguishing class. In fact, we can learn the total variation metric (indeed the\nO( n\nsample complexity of P(V) is bounded by log |P(V )| = n).\n\n7\n\n\fTherefore if we allow classes whose sample complexity scales linearly with n we cannot hope to\nshow any advantage for distinguishing graphs. However, in most natural problems n is considered to\nbe very large (for example, over the Boolean cube n is exponential in the dimension). We thus, in\ngeneral, would like to study classes that have better complexity in terms of n. In that sense, we can\nshow that indeed distinguishing graphs yield extra expressive power.\nIn particular, we show that for classes with sublogarithmic sample complexity, we can construct\ngraphs that are incomparable with a higher order distinguishing class.\nTheorem 6. Let |V| = n. There exists a k-distinguishing class Gk, with sample complexity m(\u0001, \u03b4) =\n) (in fact |G| = 1) such that: For any \u0001 > 0 and any k \u2212 1 distinguishing class Gk\u22121 if:\nO( k2+log 1/\u03b4\n\n\u00012\n\nIPMGk\u22121 (cid:31) \u0001 \u00b7 IPMGk\n\n\u221a\n\nk2\n\nlog n).\n\nthen gVC(Gk\u22121) = \u2126( \u00012\nWe can improve the bound in theorem 6 for the case k = 1 .\nTheorem 7. Let |V| = n. There exists a 2-distinguishing class G, with sample complexity m(\u0001, \u03b4) =\nO( log 1/\u03b4\n\n) (in fact |G| = 1) such that: For any \u0001 > 0 and any distinguishing class D if:\n\n\u00012\n\nIPMD (cid:31) \u0001 \u00b7 IPMG\n\nthen gVC(D) = \u02dc\u2126(\u00012 log n).\n\n4 Discussion and open problems\n\nIn this work we developed a generalization of the standard framework of discrimination to graph-based\ndistinguishers that discriminate between two distributions by considering multi-ary tests. Several\nopen question arise from our results:\n\n\u03c1\n\n\u221a\n\nImproving Sample Complexity Bounds\nIn terms of sample complexity, while we give a natural\nupper bound of O(\u03c1k2), the lower bound we provide are not tight neither in d nor in k and we provide\na lower bound of \u2126(\n2poly(k) ) This leave room for improvement both in terms of \u03c1 and in terms of k.\nImproving Expressiveness Bounds We also showed that, over \ufb01nite domains, we can construct\na graph that is incomparable with any class with VC dimension \u2126(\u00012 log n). The best upper bound\nwe can provide (the VC of a class that competes with any graph) is the naive O(n) which is the VC\ndimension of the total variation metric.\n\nAdditionally, for the k-hypergraph case, our bounds deteriorate to \u2126(\u00012\u221a\n\nlog n). The improvement in\nthe graph case follows from using an argument in the spirit of Boosting [10] and Hardcore Lemma\n[13] to construct two indistinguishable probabilities with distinct support over a small domain. It\nwould be interesting to extend these techniques in order to achieve similar bounds for the k > 2 case.\n\nRelation to GANs and Extension to Online Setting Finally, a central motivation for learning the\nsample complexity of discriminators is in the context of GANs. It then raises interesting questions as\nto the foolability of graph-based distinguishers.\nThe work of [6] suggests a framework for studying sequential games between generators and\ndiscriminators (GAM-Fooling). In a nutshell, the GAM setting considers a sequential game between\na generator G that outputs distributions and a discriminator D that has access to data from some\ndistribution p\u2217 (not known to G). At each round of the game, the generator proposes a distribution\nand the discriminator outputs a d \u2208 D which distinguishes between the distribution of G and the true\ndistribution p\u2217. The class D is said to be GAM-Foolable if the generator outputs after \ufb01nitely many\nrounds a distribution p that is D\u2013indistinguishable from p\u2217\n[6] showed that a class D is GAM\u2013foolable if and only if it has \ufb01nite Littlestone dimension. We then\nask, similarly, which classes of graph\u2013based distinguishers are GAM-Foolable? A characterization of\nsuch classes can potentially lead to a natural extension of the Littlestone notion and online prediction,\nto graph-based classes analogously to this work w.r.t VC dimension\n\nAckgnoweledgements Y.M is supported in part by a grant from the ISF\n\n8\n\n\fReferences\n[1] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations.\n\ncambridge university press, 2009.\n\n[2] Sanjeev Arora and Yi Zhang. Do gans actually learn the distribution? an empirical study. arXiv\n\npreprint arXiv:1706.08224, 2017.\n\n[3] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing\nthat distributions are close. In Proceedings 41st Annual Symposium on Foundations of Computer\nScience, pages 259\u2013269. IEEE, 2000.\n\n[4] Shai Ben-David. 2 notes on classes with vapnik-chervonenkis dimension 1. arXiv preprint\n\narXiv:1507.05307, 2015.\n\n[5] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability\nand the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929\u2013965, 1989.\n\n[6] Olivier Bousquet, Roi Livni, and Shay Moran. Passing tests without memorizing: Two models\n\nfor fooling discriminators. arXiv preprint arXiv:1902.03468, 2019.\n\n[7] Siu-On Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for\ntesting closeness of discrete distributions. In Proceedings of the twenty-\ufb01fth annual ACM-SIAM\nsymposium on Discrete algorithms, pages 1193\u20131203. SIAM, 2014.\n\n[8] St\u00e9phan Cl\u00e9men\u00e7on, Igor Colin, and Aur\u00e9lien Bellet. Scaling-up empirical risk minimization:\noptimization of incomplete u-statistics. The Journal of Machine Learning Research, 17(1):2682\u2013\n2717, 2016.\n\n[9] St\u00e9phan Cl\u00e9men\u00e7on, G\u00e1bor Lugosi, Nicolas Vayatis, et al. Ranking and empirical minimization\n\nof u-statistics. The Annals of Statistics, 36(2):844\u2013874, 2008.\n\n[10] Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In COLT,\n\nvolume 96, pages 325\u2013332. Citeseer, 1996.\n\n[11] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies\nin Complexity and Cryptography. Miscellanea on the Interplay between Randomness and\nComputation, pages 68\u201375. Springer, 2011.\n\n[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural\ninformation processing systems, pages 2672\u20132680, 2014.\n\n[13] Russell Impagliazzo. Hard-core distributions for somewhat hard problems. In Proceedings of\n\nIEEE 36th Annual Foundations of Computer Science, pages 538\u2013545. IEEE, 1995.\n\n[14] Pravesh K Kothari and Roi Livni. Agnostic learning by refuting.\n\nIn 9th Innovations in\nTheoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl-Leibniz-Zentrum\nfuer Informatik, 2018.\n\n[15] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. Pacgan: The power of two samples\nin generative adversarial networks. In Advances in Neural Information Processing Systems,\npages 1498\u20131507, 2018.\n\n[16] Alfred M\u00fcller. Integral probability metrics and their generating classes of functions. Advances\n\nin Applied Probability, 29(2):429\u2013443, 1997.\n\n[17] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete\n\ndata. IEEE Transactions on Information Theory, 54(10):4750\u20134755, 2008.\n\n[18] Eitan Richardson and Yair Weiss. On gans and gmms. In Advances in Neural Information\n\nProcessing Systems, pages 5847\u20135858, 2018.\n\n[19] Salil P. Vadhan. On learning vs. refutation. In Proceedings of the 30th Conference on Learning\n\nTheory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 1835\u20131848, 2017.\n\n9\n\n\f[20] Vladimir N Vapnik and Aleksei Yakovlevich Chervonenkis. The uniform convergence of\nfrequencies of the appearance of events to their probabilities. In Doklady Akademii Nauk,\nvolume 181, pages 781\u2013783. Russian Academy of Sciences, 1968.\n\n10\n\n\f", "award": [], "sourceid": 3634, "authors": [{"given_name": "Roi", "family_name": "Livni", "institution": "Tel Aviv University"}, {"given_name": "Yishay", "family_name": "Mansour", "institution": "Tel Aviv University / Google"}]}