July 17, 2013

PODS 2013

Filed under: Uncategorized — paris koutris @ 9:49 am

There has been a long time since we last posted at the UWDB blog. However, coming back from SIGMOD/PODS 2013 in New York is a good opportunity to start posting again.

Our group participated at PODS with the paper “Communication Steps for Parallel Query Processing” by Paul Beame, Paris Koutris and Dan Suciu. Here is the link for the PODS paper and here is the link for full version.

In this paper, we study the problem of computing a relational query q on a large input database of size n, using a large number p of servers. In our model, the computation is performed in synchronous rounds, where each round includes some local computation followed by global communication between all servers. Further, we restrict the amount of data each server can receive during some computation: each server can receive only O(n/p^(1-ε)) bits, where ε is a parameter in [0,1) that controls replication and we call the space exponent. For example, if ε=1, one could send all data to one server and compute the result locally in one round. On the other end, if ε=0, each server receives O(n/p) data and so the data is equally distributed between the servers.  The restriction on communication/replication creates a trade-off between the number of rounds (r) and space exponent (ε), and computing this trade-off for Conjunctive Queries is the main subject of our work.

For a single round, r=1,  we show lower bounds and upper bounds that are matching. We prove that a conjunctive query q cannot be computed with space exponent ε < 1-1/τ*(q), where τ*(q) is the fractional vertex cover number for the hypergraph of query q. Our lower bounds are particularly strong, since we make no assumption about the way information is exchanged or about the computational capabilities of the servers. Additionally, we present an algorithm for one round that matches the lower bound, i.e. works with space exponent ε >= 1-1/τ*(q), for a specific class of skew-free databases. As an application of our 1-round result, consider the “triangle” query Q(x,y,z) = R(x,y), S(y,z), T(z,x). The hypegraph of Q has a fractional vertex cover of size 3/2, so the triangle query needs space exponent at least ε=1/3 to be computed. For multiple rounds of communication, we present lower bounds in a weaker model where the routing decisions for data are tuple-based. The lower bounds for multiple rounds are, to the best of our knowledge, the first of their kind. Interestingly, we also obtain as a corollary that transitive closure of a graph needs a non-constant number of rounds to be computed, for any space exponent.

There are many exciting questions that remain open. Can we show lower bounds for multiple rounds in the case where there is no assumptions on the kind of communication? What happens in the presence of data skew? Also, what can we say about the space exponent for other problems apart from relational queries?

April 12, 2011

PODS 2011

Filed under: Publication News — paris koutris @ 12:34 am

Apart from SIGMOD 2011, our group will also participate in PODS 2011 with the following paper:

Parallel Evaluation of Conjunctive Queries (Paraschos Koutris and Dan Suciu)

The availability of large data centers with tens of thousands of servers has led to the popular adoption of massive parallelism for data analysis on large datasets. Several query languages exist for running queries on massively parallel architectures, some based on the MapReduce infrastructure, others using proprietary implementations. Motivated by this trend, we analyze the parallel complexity of conjunctive queries. We propose a very simple model of parallel computation that captures these architectures, in which the complexity parameter is the number of parallel steps requiring synchronization of all servers. We study the complexity of conjunctive queries and give a complete characterization of the queries which can be computed in one parallel step. These form a strict subset of hierarchical queries, and include flat queries like R(x,y),S(x,z),T(x,v),U(x,w), tall queries like R(x),S(x,y),T(x,y,z),U(x,y,z,w), and combinations thereof, which we call tall-flat queries. We describe an algorithm for computing in parallel any tall-flat query, and prove that any query that is not tall-flat cannot be computed in one step in this model. Finally, we present extensions of our results to queries that are not tall-flat.

Blog at WordPress.com.