Submodular Functions: Learnability, Structure & Optimization
Nick Harvey, UBC CS
Maria-Florina Balcan, Georgia Tech
Who studies submodular functions?
• CS, Approximation Algorithms
• Machine Learning
• OR, Optimization
• AGT, Economics
Valuation Functions
A first step in economic modeling:
• Individuals have valuation functions giving utility for different outcomes or events (f(outcome) ∈ ℝ).
Focus on combinatorial settings:
• n items, {1,2,…,n} = [n]
• f : 2^[n] → ℝ
Learning Valuation Functions
This talk: learning valuation functions from past data. Applications:
• Package travel deals
• Bundle pricing
Submodular valuations
• [n] = {1,…,n}; a function f : 2^[n] → ℝ is submodular if
  for all S, T ⊆ [n]: f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T)
• Equivalent to decreasing marginal returns:
  for T ⊆ S and x ∉ S: f(T ∪ {x}) − f(T) ≥ f(S ∪ {x}) − f(S)
  (adding x to the small set T gives a large improvement; adding it to the large set S gives only a small improvement)
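A minimal sketch (mine, not the deck's) that makes the definition executable: a brute-force check of the lattice inequality over all pairs of subsets of a small ground set. The budget-additive test function min(|S|, k) is a standard example, assumed here purely for illustration.

```python
from itertools import combinations

def subsets(ground):
    """All subsets of `ground`, as frozensets."""
    return [frozenset(c) for r in range(len(ground) + 1)
            for c in combinations(sorted(ground), r)]

def is_submodular(f, ground):
    """Brute-force check of f(S) + f(T) >= f(S | T) + f(S & T)."""
    all_sets = subsets(ground)
    return all(f(S) + f(T) >= f(S | T) + f(S & T)
               for S in all_sets for T in all_sets)

# Example: the budget-additive valuation f(S) = min(|S|, k).
ground = frozenset(range(5))
print(is_submodular(lambda S: min(len(S), 2), ground))  # True
```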
Submodular valuations
Examples:
• Concave functions: let h : ℝ → ℝ be concave; for each S ⊆ [n], let f(S) = h(|S|).
• Vector spaces: let V = {v_1,…,v_n}, each v_i ∈ F^n; for each S ⊆ [n], let f(S) = rank({ v_i : i ∈ S }).
Both satisfy decreasing marginal returns: for T ⊆ S and x ∉ S, f(T ∪ {x}) − f(T) ≥ f(S ∪ {x}) − f(S).
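Both examples can be checked the same way. The sketch below (again illustrative, not from the deck) instantiates them for small n, building the rank example from random 0/1 vectors with rank taken over the reals; the 1e-9 slack only absorbs floating-point error in the concave case.

```python
import math
from itertools import combinations
import numpy as np

def is_submodular(f, n):
    sets = [frozenset(c) for r in range(n + 1)
            for c in combinations(range(n), r)]
    return all(f(S) + f(T) >= f(S | T) + f(S & T) - 1e-9
               for S in sets for T in sets)

n = 5
# 1) Concave of cardinality: f(S) = h(|S|) with h concave, e.g. sqrt.
f_concave = lambda S: math.sqrt(len(S))

# 2) Rank function: f(S) = rank({v_i : i in S}) for random 0/1 vectors
#    v_1, ..., v_n (a matroid rank function).
V = np.random.default_rng(0).integers(0, 2, size=(n, n))
f_rank = lambda S: int(np.linalg.matrix_rank(V[sorted(S)])) if S else 0

print(is_submodular(f_concave, n), is_submodular(f_rank, n))  # True True
```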
Passive Supervised Learning
• Data source: distribution D on 2^[n]
• Expert / oracle: target function f : 2^[n] → ℝ₊
• Labeled examples: (S_1, f(S_1)), …, (S_k, f(S_k)) for samples S_1, …, S_k from D
• Learning algorithm outputs a hypothesis g : 2^[n] → ℝ₊
PMAC model for learning real-valued functions
• Data source: distribution D on 2^[n]; expert / oracle: target f : 2^[n] → ℝ₊
• Algorithm sees labeled examples (S_1, f(S_1)), …, (S_k, f(S_k)), with the S_i i.i.d. from D, and produces a hypothesis g : 2^[n] → ℝ₊
• Probably Mostly Approximately Correct: with probability ≥ 1 − δ, we have Pr_S[ g(S) ≤ f(S) ≤ α·g(S) ] ≥ 1 − ε
• The PAC model is the Boolean special case: f, g : 2^[n] → {0,1} and α = 1
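A hedged sketch of how one might measure the PMAC criterion empirically; the names pmac_success_rate and sample_D are my own, and the toy target/hypothesis pair is chosen so that the factor-2 guarantee holds everywhere.

```python
import random

def pmac_success_rate(f, g, sample_D, alpha, trials=10_000):
    """Estimate Pr_S[ g(S) <= f(S) <= alpha * g(S) ] on fresh samples."""
    hits = 0
    for _ in range(trials):
        S = sample_D()
        if g(S) <= f(S) <= alpha * g(S):
            hits += 1
    return hits / trials

# Toy instance: f(S) = |S|, D = uniform over 2^[n], and the hypothesis
# g(S) = |S| / 2, which 2-approximates f on every set.
n = 20
sample_D = lambda: frozenset(i for i in range(n) if random.random() < 0.5)
print(pmac_success_rate(lambda S: len(S), lambda S: len(S) / 2,
                        sample_D, alpha=2))  # 1.0
```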
Learning submodular functions
Theorem (our general upper bound): Monotone, submodular functions can be PMAC-learned (w.r.t. an arbitrary distribution) with approximation factor α = O(n^{1/2}).
Theorem (our general lower bound): Monotone, submodular functions cannot be PMAC-learned with approximation factor õ(n^{1/3}).
Theorem (product distributions): Lipschitz, monotone submodular functions can be PMAC-learned under a product distribution with approximation factor O(1).
Corollary: Gross substitutes functions do not have a concise, approximate representation.
Computing Linear Separators
[Figure: {+,−}-labeled points in the plane with a separating hyperplane]
• Given {+,−}-labeled points in ℝⁿ, find a hyperplane cᵀx = b that separates the +s and −s.
• Easily solved by linear programming.
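One possible rendering of that LP, assuming SciPy's linprog is available: a feasibility program with a unit margin (the margin only rules out the trivial solution c = 0). The helper name lp_separator and the toy data are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def lp_separator(X_pos, X_neg):
    """Return (c, b) with c.x > b on X_pos and c.x < b on X_neg."""
    n = X_pos.shape[1]
    # Variables z = (c_1, ..., c_n, b); constraints written as A z <= u:
    # for + points, c.x - b >= 1; for - points, c.x - b <= -1.
    A = np.vstack([np.hstack([-X_pos, np.ones((len(X_pos), 1))]),
                   np.hstack([X_neg, -np.ones((len(X_neg), 1))])])
    u = -np.ones(len(X_pos) + len(X_neg))
    res = linprog(c=np.zeros(n + 1), A_ub=A, b_ub=u,
                  bounds=[(None, None)] * (n + 1))
    if not res.success:
        return None                      # points are not separable
    return res.x[:n], res.x[n]

# Toy data: positives well above the line x_1 + x_2 = 1, negatives below.
rng = np.random.default_rng(1)
P = rng.uniform(0, 1, (30, 2)); P = P[P.sum(axis=1) > 1.2]
N = rng.uniform(0, 1, (30, 2)); N = N[N.sum(axis=1) < 0.8]
c, b = lp_separator(P, N)
print((P @ c > b).all(), (N @ c < b).all())  # True True
```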
Learning Linear Separators
[Figure: {+,−}-labeled sample points with a separating hyperplane; one misclassified point is marked "Error!"]
• Given a random sample of {+,−}-labeled points in ℝⁿ, find a hyperplane cᵀx = b that separates most of the +s and −s.
• Classic machine learning problem.
Learning Linear Separators
• Classic Theorem [Vapnik-Chervonenkis 1971]: Õ(n/ε²) samples suffice to get error ε.
Submodular Functions are Approximately Linear
• Let f be non-negative, monotone and submodular:
  – Non-negativity: f(S) ≥ 0 ∀ S ⊆ V
  – Monotonicity: f(S) ≤ f(T) ∀ S ⊆ T
  – Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) ∀ S, T ⊆ V
• Claim: f can be approximated to within a factor n by a linear function g.
• Proof sketch: Let g(S) = Σ_{s∈S} f({s}). Then f(S) ≤ g(S) ≤ n·f(S): the first inequality is subadditivity (which follows from submodularity and non-negativity), and the second holds because monotonicity gives f({s}) ≤ f(S) for each s ∈ S, and |S| ≤ n.
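A quick illustrative check of the sandwich f ≤ g ≤ n·f, using a small coverage function (coverage functions are non-negative, monotone, and submodular); the particular covers dictionary is made up for the demo.

```python
from itertools import combinations

n = 5
covers = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {0, 4}}
f = lambda S: len(set().union(*(covers[s] for s in S))) if S else 0
g = lambda S: sum(f(frozenset([s])) for s in S)   # the linear surrogate

ok = all(f(S) <= g(S) <= n * f(S)
         for r in range(n + 1)
         for S in map(frozenset, combinations(range(n), r)))
print(ok)  # True
```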
Submodular Functions are Approximately Linear
[Figure: the values f(S) (marked +) lie below the line of g, and the values n·f(S) (marked −) lie above it]
• Randomly sample {S_1, …, S_k} from the distribution
• Create a + point for each f(S_i) and a − point for each n·f(S_i)
• Now just learn a linear separator!
• Theorem: g approximates f to within a factor n on a 1 − ε fraction of the distribution.
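A sketch of this pipeline on the same toy coverage function, under my assumptions above: each sampled S yields a + point (χ_S, f(S)) and a − point (χ_S, n·f(S)) in ℝ^{n+1}, and any strict linear separator through those points induces a g with g(S) ≤ f(S) ≤ n·g(S) on the sample (the constraints force the value-coordinate weight w0 to be negative, since n·f(S) > f(S) > 0).

```python
import numpy as np
from scipy.optimize import linprog

n = 5
covers = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {0, 4}}
f = lambda S: len(set().union(*(covers[s] for s in S))) if S else 0

rng = np.random.default_rng(2)
samples = [frozenset(i for i in range(n) if rng.random() < 0.5)
           for _ in range(40)]
samples = [S for S in samples if S]                  # keep f(S) > 0
chi = lambda S: [1.0 if i in S else 0.0 for i in range(n)]
pos = np.array([chi(S) + [f(S)] for S in samples])       # label +
neg = np.array([chi(S) + [n * f(S)] for S in samples])   # label -

# Feasibility LP for z = (w, w0, theta): require w.chi + w0*t >= theta+1
# on + points and w.chi + w0*t <= theta-1 on - points.
A = np.vstack([np.hstack([-pos, np.ones((len(pos), 1))]),
               np.hstack([neg, -np.ones((len(neg), 1))])])
res = linprog(np.zeros(n + 2), A_ub=A, b_ub=-np.ones(len(A)),
              bounds=[(None, None)] * (n + 2))
assert res.success
w, w0, theta = res.x[:n], res.x[n], res.x[n + 1]

# Induced hypothesis: g(S) = (theta - w.chi_S) / (w0 * n).
g = lambda S: (theta - np.dot(w, chi(S))) / (w0 * n)
print(all(g(S) <= f(S) <= n * g(S) + 1e-6 for S in samples))  # True
```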
• Can improve to O(n^{1/2}): in fact, f² and n·f² are separated by a linear function [Goemans et al. '09]
• John's Ellipsoid Theorem: any centrally symmetric convex body is approximated by an ellipsoid to within a factor n^{1/2}
Learning submodular functions
Theorem (our general upper bound): Monotone, submodular functions can be PMAC-learned (w.r.t. an arbitrary distribution) with approximation factor α = O(n^{1/2}).
Theorem (our general lower bound): Monotone, submodular functions cannot be PMAC-learned with approximation factor õ(n^{1/3}).
Theorem (product distributions): Lipschitz, monotone submodular functions can be PMAC-learned under a product distribution with approximation factor O(1).
Corollary: Gross substitutes functions do not have a concise, approximate representation.
The lower bound construction, step by step:
• Start with f(S) = min{ |S|, k }, i.e. f(S) = |S| if |S| ≤ k, and k otherwise.
• Add one "bump" at a set A with |A| = k: f(S) = k − 1 if S = A, and min{ |S|, k } otherwise.
• Many bumps: for a family A = {A_1, …, A_m} with |A_i| = k,
  f(S) = k − 1 if S ∈ A, and min{ |S|, k } otherwise.
• Claim: f is submodular if |A_i ∩ A_j| ≤ k − 2 for all i ≠ j.
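The claim is easy to test by brute force on a toy instance (my own, not the deck's): one family respecting |A_i ∩ A_j| ≤ k − 2, and one violating it.

```python
from itertools import combinations

def is_submodular(f, n):
    sets = [frozenset(c) for r in range(n + 1)
            for c in combinations(range(n), r)]
    return all(f(S) + f(T) >= f(S | T) + f(S & T)
               for S in sets for T in sets)

n, k = 6, 3
# Two bumps with |A1 & A2| = 1 <= k - 2, as the claim requires.
A = {frozenset({0, 1, 2}), frozenset({2, 3, 4})}
f = lambda S: k - 1 if S in A else min(len(S), k)
print(is_submodular(f, n))       # True

# A violating family: |A1 & A2| = 2 > k - 2 breaks submodularity.
B = {frozenset({0, 1, 2}), frozenset({1, 2, 3})}
f_bad = lambda S: k - 1 if S in B else min(len(S), k)
print(is_submodular(f_bad, n))   # False
```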
• Delete half of the bumps at random: f(S) = k − 1 if S ∈ A and the bump at S wasn't deleted, and min{ |S|, k } otherwise.
• Then f is very unconcentrated on A ⇒ any algorithm to learn f has additive error 1.
  [Figure: if the algorithm sees only examples from the surviving bumps, then f can't be predicted on the deleted ones]
• Can we force a bigger error with bigger bumps? Yes, if the A_i's are very "far apart". This can be achieved by picking them randomly.
Plan:
• Choose two values High = n^{1/3} and Low = O(log² n).
• Choose random sets A_1, …, A_m ⊆ [n], with |A_i| = High and m = n^{log n}.
• D is the uniform distribution on {A_1, …, A_m}.
• Create a function f : 2^[n] → ℝ: for each i, randomly set f(A_i) = High or f(A_i) = Low, then extend f to a monotone, submodular function on 2^[n].
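A sketch of the sampling plan only, with illustrative parameters; the essential (and hard) step of extending these values to a monotone submodular function is exactly what the matroid construction on the next slides provides, and is not attempted here.

```python
import math, random

n = 8_000_000_000          # n must be huge before n^{1/3} exceeds log^2 n
high = round(n ** (1 / 3))              # High = n^{1/3}, about 2000
low = round(math.log2(n) ** 2)          # Low = O(log^2 n), about 1083
m = n ** round(math.log2(n))            # m = n^{log n} candidate sets

# Materializing n^{log n} sets is hopeless, so index them lazily: index i
# determines a random size-High set A_i and a value f(A_i) in {High, Low}.
def A_and_value(i):
    rng = random.Random(i)
    return frozenset(rng.sample(range(n), high)), rng.choice((high, low))

# D is uniform on {A_1, ..., A_m}. Any polynomial number of samples
# (here 10^6 stands in for poly(n)) almost surely misses a fresh draw,
# so the High/Low coin flip at a new A_i is unpredictable from the data.
seen = {random.randrange(m) for _ in range(10 ** 6)}
print(random.randrange(m) in seen)      # almost surely False
```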
Theorem (main lower bound construction): There is a distribution D and a randomly chosen function f s.t.
• f is monotone and submodular;
• knowing the value of f on poly(n) random samples from D does not suffice to predict the value of f on future samples from D, even to within a factor õ(n^{1/3}).
Creating the function f
• We choose f to be a matroid rank function; such functions have a rich combinatorial structure, and are always submodular.
• The randomly chosen A_i's form an expander: [expansion inequality not captured in the transcript], where H = { j : f(A_j) = High }.
• The expansion property can be leveraged to ensure f(A_i) = High or f(A_i) = Low, as desired.
Learning submodular functions
Theorem (our general upper bound): Monotone, submodular functions can be PMAC-learned (w.r.t. an arbitrary distribution) with approximation factor α = O(n^{1/2}).
Theorem (our general lower bound): Monotone, submodular functions cannot be PMAC-learned with approximation factor õ(n^{1/3}).
Theorem (product distributions): Lipschitz, monotone submodular functions can be PMAC-learned under a product distribution with approximation factor O(1).
Corollary: Gross substitutes functions do not have a concise, approximate representation.
Gross Substitutes Functions
• A class of utility functions commonly used in mechanism design [Kelso-Crawford '82, Gul-Stacchetti '99, Milgrom '00, …].
• Intuitively, increasing the prices of some items does not decrease demand for the other items.
• Question [Blumrosen-Nisan, Bing-Lehman-Milgrom]: Do GS functions have a concise representation?
• Fact: Every matroid rank function is GS.
• Corollary: The answer to the question is no.
Theorem (main lower bound construction): There is a distribution D and a randomly chosen function f s.t.
• f is a matroid rank function;
• poly(n) bits of information do not suffice to predict the value of f on samples from D, even to within a factor õ(n^{1/3}).
Learning submodular functions
Theorem (our general upper bound): Monotone, submodular functions can be PMAC-learned (w.r.t. an arbitrary distribution) with approximation factor α = O(n^{1/2}).
Theorem (our general lower bound): Monotone, submodular functions cannot be PMAC-learned with approximation factor õ(n^{1/3}).
Theorem (product distributions): Lipschitz, monotone submodular functions can be PMAC-learned under a product distribution with approximation factor O(1).
Corollary: Gross substitutes functions do not have a concise, approximate representation.
Learning submodular functions
Theorem (product distributions): Lipschitz, monotone submodular functions can be PMAC-learned under a product distribution with approximation factor O(1).
Hypotheses:
• Pr_{X∼D}[ X = x ] = ∏_i Pr[ X_i = x_i ] ("product distribution")
• f({i}) ∈ [0,1] for all i ∈ [n] ("Lipschitz function")
• f({i}) ∈ {0,1} for all i ∈ [n]: a stronger condition!
Technical Theorem: For any ε > 0, there exists a concave function h : [0, n] → ℝ s.t. for every k ∈ [n], and for a 1 − ε fraction of S ⊆ V with |S| = k, we have:
h(k) ≤ f(S) ≤ O(log²(1/ε))·h(k).
In fact, h(k) is just E[ f(S) ], where S is uniform on sets of size k.
Algorithm:
• Let μ = Σ_{i=1}^m f(x_i) / m
• Let g be the constant function with value μ
This achieves approximation factor O(log²(1/ε)) on a 1 − ε fraction of points, with high probability.
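A minimal sketch of this algorithm on an assumed toy target f(S) = min(|S|, k), which is monotone, submodular, and Lipschitz (all singleton values and marginals lie in {0, 1}); under the product distribution, f concentrates sharply, so the constant hypothesis is a good multiplicative approximation.

```python
import random

n, k, m = 100, 30, 500
f = lambda S: min(len(S), k)
sample = lambda: frozenset(i for i in range(n) if random.random() < 0.5)

mu = sum(f(sample()) for _ in range(m)) / m     # empirical mean
g = mu                                          # constant hypothesis

# f is tightly concentrated under the product distribution, so f/g is
# close to 1 on essentially all fresh samples.
ratios = [f(sample()) / g for _ in range(10_000)]
print(min(ratios), max(ratios))   # both very close to 1
```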
Concentration Lemma: Let X have a product distribution. For any α ∈ [0,1]: [inequality not captured in the transcript].
Proof: based on Talagrand's concentration inequality.
Follow-up work
• Subadditive & XOS functions [Badanidiyuru et al., Balcan et al.]:
  – O(n^{1/2}) approximation
  – Ω(n^{1/2}) inapproximability
• Symmetric submodular functions [Balcan et al.]:
  – O(n^{1/2}) approximation
  – Ω(n^{1/3}) inapproximability
Conclusions
• Learning-theoretic view of submodular functions
• Structural properties:
  – Very "bumpy" under arbitrary distributions
  – Very "smooth" under product distributions
• Learnability in the PMAC model:
  – O(n^{1/2}) approximation algorithm
  – Ω(n^{1/3}) inapproximability
  – O(1) approximation for Lipschitz functions & product distributions
• No concise representation for gross substitutes