As data volumes continue to rise, manual inspection is becoming increasingly untenable. In response, we present MacroBase, a data analytics engine that prioritizes end-user attention in high-volume fast data streams. MacroBase enables efficient, accurate, and modular analyses that highlight and aggregate important and unusual behavior, acting as a search engine for fast data. MacroBase is able to deliver order-of-magnitude speedups over alternatives by optimizing the combination of explanation and classification tasks and by leveraging a new reservoir sampler and heavy-hitters sketch specialized for fast data streams. As a result, MacroBase delivers accurate results at speeds of up to 2M events per second per query on a single core. The system has delivered meaningful results in production, including at a telematics company monitoring hundreds of thousands of vehicles.
Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs, but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.
Many emerging applications are based on finding interesting subsequences from sequence data. Finding prominent streaks, a set of longest contiguous subsequences with values all above (or below) a certain threshold, from sequence data, is one of that kind that receives much attention. Motivated from real applications, we observe that prominent streaks alone are not insightful enough but require the discovery of something we coined as historic moments as companion. In this paper, we present an algorithm to efficiently compute historic moments from sequence data. The algorithm is incremental and space-optimal, meaning that when facing new data arrival, it is able to efficiently refresh the results by keeping minimal information. Case studies show that historic moments can significantly improve the insights offered by prominent streaks alone. Furthermore, experiments show that our algorithm can outperform the baseline in both time and space.
This article studies dynamic complexity under definable change operations in the DynFO framework by Patnaik and Immerman. It is shown that for changes definable by parameter-free first-order formulas, all (uniform) $\AC^1$ queries can be maintained by first-order dynamic programs. Furthermore, previous maintenance results for single-tuple changes are extended to more powerful change operations: The undirected reachability query is first-order maintainable under single tuple changes and first-order defined insertions, likewise the directed reachability query for directed acyclic graphs under quantifier-free insertions. Both results rely on a bounded bridge property. Towards practical feasibility it is shown that for insertion queries defined by unions of conjunctive queries the bridge bounds are small, unlike in the general case. These theoretical findings are complemented by a practical study that compares dynamic programs which allow complex changes with programs allowing only single changes and with recomputation from scratch. The picture is completed by several inexpressibility results, for example, that the reachability query cannot be maintained by quantifier-free programs under definable, quantifier-free changes.
In this paper, we show that key-value stores backed by an LSM-tree exhibit an intrinsic trade-off between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult to tune trade-off among these metrics. We pinpoint the problem to the fact that all modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters' false positive rates in each level. We present Monkey, an LSM-based key-value store that strikes the optimal balance between the costs of updates and lookups with any given main memory budget. The insight is that worst-case lookup cost is proportional to the sum of the false positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state-of-the-art key-value stores that assign a fixed number of bits-per-element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize this sum. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically using an implementation on top of LevelDB that Monkey reduces lookup latency by an increasing margin as the data volume grows 50%-80% for the data sizes we experimented with). Furthermore, we map the LSM-tree design space onto a closed-form model that enables co-tuning the merge policy, the buffer size and the filters' false positive rates to trade among lookup cost, update cost and/or main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the various LSM-tree design elements accordingly.
Tuple-independent and disjoint-independent probabilistic databases (TI-, and DI-PDBs) represent uncertain data in a factorized form as a product of independent probabilistic variables that represent tuples or sets of tuples, respectively. When the user submits a query, the database derives the marginal probabilities of each output-tuple, under this assumption of independence on the inputs. While query processing in TI- and DI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main contribution of this paper. We rst re-introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. Building on B-PDBs, we introduce our new contribution, Dirichlet Probabilistic Databases (D-PDBs), a generalization of DI-PDBs with similar properties. We use both models to provide the following key contributions: (i) we show how to scalably compute the posterior densities of the parameters given new evidence; (ii) we study the complexity of performing Bayesian belief updates, devising e cient algorithms for tractable classes of queries; (iii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iv) we show how to embed the proposed algorithms into a standard relational engine; (v) we support our conclusions with extensive experimental results.
In the design of analytical procedures and machine-learning solutions, a critical and time-consuming task is that of feature engineering, for which various recipes and tooling approaches have been developed. In this manuscript, we embark on the establishment of database foundations for feature engineering. We propose a formal framework for classification in the context of a relational database. The goal of this framework is to open the way to research and techniques to assist developers with the task of feature engineering by utilizing the database's modeling and understanding of data and queries, and by deploying the well studied principles of database management. As a first step, we demonstrate the usefulness of this framework by formally defining three key algorithmic challenges. The first challenge is that of separability, which is the problem of determining the existence of feature queries that agree with the training examples. The second is that of evaluating the VC dimension of the model class with respect to a given sequence of feature queries. The third challenge is identifiability, which is the task of testing for a property of independence among features that are represented as database queries. We give preliminary results on these challenges for the case where features are defined by means of conjunctive queries, and in particular, we study the implication of various traditional syntactic restrictions on the inherent computational complexity.
The problem of querying RDF data is a central issue for the development of the Semantic Web. The query language SPARQL has become the standard language for querying RDF, since its W3C standardization in 2008. However, the 2008 version of this language missed some important functionalities: reasoning capabilities to deal with RDFS and OWL vocabularies, navigational capabilities to exploit the graph structure of RDF data, and a general form of recursion much needed to express some natural queries. To overcome these limitations, a new version of SPARQL, called SPARQL 1.1, was recently released, which includes entailment regimes for RDFS and OWL vocabularies, and a mechanism to express navigation patterns through regular expressions. Unfortunately, there is a number of useful navigation patterns that cannot be expressed in SPARQL 1.1, and the language lacks a general mechanism to express recursive queries. To the best of our knowledge, no efficient RDF query language that combines the above functionalities is known. It is the aim of this work to fill this gap. Towards this direction, we focus on the OWL 2 QL profile of OWL 2, and we show that every SPARQL query enriched with the above features can be naturally translated into a query expressed in a language that is based on an extension of Datalog, which allows for value invention and stratified negation. However, the query evaluation problem for this language is highly intractable, which is not surprising since it is expressive enough to encode some inherently hard queries. We identify a natural fragment of it, and we show it to be tractable and powerful enough to define SPARQL queries enhanced with the desired functionalities.