This paper proposes a class of dependencies for graphs, referred to as graph entity dependencies (GEDs). A GED is defined as a combination of a graph pattern and an attribute dependency. In a uniform format, GEDs can express graph functional dependencies with constant literals to catch inconsistencies, and keys carrying id literals to identify entities (vertices) in a graph. We revise the chase for GEDs and prove its Church-Rosser property. We characterize GED satisfiability and implication, and establish the complexity of these problems and the validation problem for GEDs, in the presence and absence of constant literals and id literals. We also develop a sound and complete axiom system for finite implication of GEDs. In addition, we extend GEDs with built-in predicates or disjunctions, to strike a balance between the expressive power and complexity. We settle the complexity of the satisfiability, implication and validation problems for the extensions.
This paper presents GRAPE, a parallel GRAPh Engine for graph computations. GRAPE differs from prior systems in its ability to parallelize existing sequential graph algorithms as a whole, without the need for recasting the entire algorithms into a new model. Underlying GRAPE are a simple programming model, and a principled approach based on fixpoint computation that starts with partial evaluation and uses an incremental function as the intermediate consequence operator. We show that sequential graph algorithms can be ``plugged into'' GRAPE with minor additions, and get parallelized. Under a monotonic condition, the GRAPE parallelization guarantees to converge at correct answers as long as the sequential algorithms are correct. Moreover, we show that algorithms in MapReduce, BSP and PRAM can be optimally simulated on GRAPE. In addition to the ease of programming, we experimentally verify that GRAPE achieves comparable performance to the state-of-the-art graph systems, using real-life and synthetic graphs.
Widespread usage of advanced mobile devices has led to the emergence of a new class of crowdsourcing called spatial crowdsourcing. Spatial crowdsourcing advances the potential of a crowd to perform tasks related to real-world scenarios involving physical locations, which were not feasible with conventional crowdsourcing methods. The main feature of spatial crowdsourcing is the presence of spatial tasks that require workers to be physically present at a particular location for the task fulfillment. Research related to this new paradigm has gained momentum in recent years, thus necessitating a comprehensive survey to offer a bird's eye view of the current state of spatial crowdsourcing literature. In this paper, we discuss the spatial crowdsourcing infrastructure and identify the fundamental differences between spatial and conventional crowdsourcing. Furthermore, we provide a comprehensive view of the existing literature by introducing a taxonomy, elucidate the issues/challenges faced by different components of spatial crowdsourcing, and suggest potential research directions for the future.
Many emerging applications are based on finding interesting subsequences from sequence data. Finding prominent streaks, a set of longest contiguous subsequences with values all above (or below) a certain threshold, from sequence data, is one of that kind that receives much attention. Motivated from real applications, we observe that prominent streaks alone are not insightful enough but require the discovery of something we coined as historic moments as companion. In this paper, we present an algorithm to efficiently compute historic moments from sequence data. The algorithm is incremental and space-optimal, meaning that when facing new data arrival, it is able to efficiently refresh the results by keeping minimal information. Case studies show that historic moments can significantly improve the insights offered by prominent streaks alone. Furthermore, experiments show that our algorithm can outperform the baseline in both time and space.
Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even needs unrealistic assumptions (e.g., tuples in a table are stored in random order). This paper proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Compared with ripple join, wander join is particularly efficient for equality joins involving multiple tables, but also supports ¸- joins. Selection predicates and group-by clauses can be handled as well. To demonstrate the usefulness of wander join, we have designed and implemented XDB (approXimate DB) by integrating wander join into various systems including PostgreSQL, Spark, and a stand-alone plug-in version using PL/SQL. The design and implementation of XDB has demonstrated wander joins practicality in a full-fledged database system. Extensive experiments using the TPC-H benchmark have demonstrated the superior performance of wander join over ripple join.
Time-Decaying Bloom Filters are probabilistic structures to answer queries on inserted items. The memory of older items decays over time, causing both false positives and false negatives. Users suffer penalties for wrong responses that are both application- and item-specific. Current filters, however, are typically tuned only for static penalties. They also ignore Bayesian priors and much information latent in the filter. We address these issues by introducing Inferential Filters, which integrate Bayesian priors and information latent in filters to make penalty-optimal, query-specific decisions. We also show how to properly infer insertion times in such filters. Our methods are general, but here we apply them to Inferential time-decaying filters, and show how to support novel query types and sliding window queries with varying error penalties. We present inferential versions of the existing Timing Bloom Filter and Generalized Bloom Filter. Our experiments on real and synthetic datasets show that when penalties are dynamic and prior probabilities are considered, these filters reduce penalties for incorrect responses to sliding-window queries by up to 70%.
Today's streaming applications demand increasingly high event throughput rates and are often subject to strict latency constraints. To allow for more complex workloads, such as window-based aggregations, streaming systems need to support stateful event processing. This introduces new challenges for streaming engines as the state needs to be maintained in a consistent and durable manner and simultaneously accessed by complex queries for real-time analytics. Modern streaming systems, such as Apache Flink, do not allow for efficiently exposing the state to analytical queries. Thus, data engineers are forced to keep the state in external data stores, which significantly increases the latencies until events are visible to analytical queries. Proprietary solutions have been created to meet data freshness constraints. These solutions are expensive, error-prone, and difficult to maintain. Main-memory database systems, such as HyPer, achieve extremely low query response times while maintaining high update rates, which makes them well-suited for analytical streaming workloads. In this paper, we explore extensions to database systems to match the performance and usability of streaming systems.
Parallel dataflow engines such as Apache Hadoop, Apache Spark, and Apache Flink are an established alternative of relational databases for modern data analysis applications. Characteristic for these systems is a scalable programming model based on distributed collections and parallel transformations expressed by means of second oder functions such as map and reduce. Notable examples are Flink's DataSet and Spark's RDD programming abstractions. These programming models are realized as eDSLs -- domain specific languages embedded in a general-purpose host language such as Java, Scala, or Python. This approach has several advantages over traditional stand-alone DSLs such as SQL or XQuery. First, linguistic constructs from the host language (e.g. anonymous functions syntax, value definitions, and fluent syntax via method chaining) can be reused in the eDSL. This eases the learning curve for developers already familiar with the host language. Second, it allows for seamless integration of library methods written in the host language via the function parameters passed to the parallel dataflow operators. This reduces the effort for developing analytics dataflows that go beyond pure SQL and require domain-specific logic. At the same time, however, state-of-the-art parallel dataflow eDSLs exhibit a number of shortcomings. First, one of the main advantages of a stand-alone DSL such as SQL -- the high-level, declarative Select-From-Where syntax -- is either lost completely or mimicked in a non-standard way. Second, execution aspects such as caching, join order, and partial aggregation have to be decided by the programmer. Optimizing them automatically is not possible due to the limited program context available in the intermediate representation of the DSL. In this paper, we argue that the limitations listed above are a side effect of the type-based embedding approach adopted by these state-of-the-art eDSLs. As a solution, we propose an alternative eDSL design based on quotations. We present a DSL embedded in Scala and discuss its compiler pipeline, intermediate representation, and some of the enabled optimizations. We promote the algebraic type of bags in union representation as a model for distributed collections, and its associated structural recursion scheme and monad for parallel collection processing. At the source code level, Scala's comprehension syntax over the bags monad can be used to encode Select-From-Where expressions in a standard way. At the intermediate representation level, maintaining comprehensions as a first-class citizen can be used to simplify the design and implementation of holistic dataflow optimizations that accommodate for nesting and control-flow. The proposed DSL design therefore reconciles the benefits of embedded parallel dataflow DSLs with the declarativity and optimization potential of stand-alone DSLs like SQL.