Latest Articles

## From a Comprehensive Experimental Survey to a Cost-based Selection Strategy for Lightweight Integer Compression Algorithms

Lightweight integer compression algorithms are frequently applied in in-memory database systems to... (more)

## Interactive Mapping Specification with Exemplar Tuples

While schema mapping specification is a cumbersome task for data curation specialists, it becomes unfeasible for non-expert users, who are... (more)

## A Unified Framework for Frequent Sequence Mining with Subsequence Constraints

Frequent sequence mining methods often make use of constraints to control which subsequences should be mined. A variety of such subsequence... (more)

## Verification of Hierarchical Artifact Systems

Data-driven workflows, of which IBM’s Business Artifacts are a prime exponent, have been successfully deployed in practice, adopted in industrial standards, and have spawned a rich body of research in academia, focused primarily on static analysis. The present work represents a significant advance on the problem of artifact verification by... (more)

### Updates to the TODS Editorial Board

As of January 1, 2018, five Associate Editors—Walid Aref, Graham Cormode, Gautam Das, Sabrina De Capitani di Vimercati, and Dirk Van Gucht—ended their terms, each having served on the editorial board for six years. Walid, Graham, Gautam, Sabrina, and Dirk have provided very substantial, high-caliber service to the journal and the database community. Also five new Associate Editors have joined the editorial board: Angela Bonifati, Université Claude Bernard Lyon 1, Wolfgang Lehner, TU Dresden, Dan Olteanu, University of Oxford, Evaggelia Pitoura, University of Ioannina, and Bernhard Seeger, University of Marburg. All five are highly regarded scholars in database systems.

### Updates to the TODS Editorial Board

As of January 1, 2017, three Associate Editors, Paolo Ciaccia, Divyakant Agrawal, and Sihem Amer-Yahia, ended their terms, each having served on the editorial board for some six years. In addition, they will stay on until they complete their current loads. We are fortunate that they have donated their time and world-class expertise during these years. Also three new Associate Editors have joined the editorial board: Feifei Li, University of Utah, Kian-Lee Tan, National University of Singapore, and Jeffrey Xu Yu, Chinese University of Hong Kong. All three are highly regarded scholars in database systems.

##### Forthcoming Articles
Dichotomies for Evaluating Simple Regular Path Queries

Regular path queries (RPQs) are a central component of graph databases. We investigate decision- and enumeration problems concerning the evaluation of RPQs under several semantics that have recently been considered: arbitrary paths, shortest paths, paths without node repetitions (simple paths), and paths without edge repetitions (trails). Whereas arbitrary and shortest paths can be dealt with efficiently, simple paths and trails become computationally difficult already for very small RPQs. We study RPQ evaluation for simple paths and trails from a parameterized complexity perspective and define a class of simple transitive expressions that is prominent in practice and for which we can prove dichotomies for the evaluation problem. We observe that, even though simple path and trail semantics are intractable for RPQs in general, they are feasible for the vast majority of RPQs that are used in practice. At the heart of this study is a result of independent interest: the two disjoint paths problem in directed graphs is W[1]-hard if parameterized by the length of one of the two paths.

Computing Optimal Repairs for Functional Dependencies

We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair) that is obtained by a minimum number of tuple deletions, and an optimal update repair (optimal U-repair) that is obtained by a minimum number of value (cell) updates. For computing an optimal S-repair, we present a polynomial-time algorithm that succeeds on certain sets of FDs and fails on others. We prove the following about the algorithm. When it succeeds, it can also incorporate weighted tuples and duplicate tuples. When it fails, the problem is NP-hard, and in fact, APX-complete (hence, cannot be approximated better than some constant). Thus, we establish a dichotomy in the complexity of computing an optimal S-repair. We present general analysis techniques for the complexity of computing an optimal U-repair, some based on the dichotomy for S-repairs. We also draw a connection to a past dichotomy in the complexity of finding a "most probable database" that satisfies a set of FDs with a single attribute on the left hand side; the case of general FDs was left open, and we show how our dichotomy provides the missing generalization and thereby settles the open problem.

ChronicleDB: A High-Performance Event Store

Reactive security monitoring, self-driving cars, the Internet of Things (IoT) and many other novel applications require systems for both writing events arriving at very high and fluctuating rates to persistent storage as well as supporting analytical ad-hoc queries. As standard database systems are not capable to deliver the required write performance, log-based systems, key-value stores and other write-optimized data stores have emerged recently. However, the drawbacks of these systems are a fair query performance and the lack of suitable instant recovery mechanisms in case of system failures.In this paper, we present ChronicleDB, a novel database system with a well-designed storage layout to achieve high write-performance under fluctuating data rates and powerful indexing capabilities to support ad-hoc queries. In addition, ChronicleDB offers low-cost fault tolerance and instant recovery within milliseconds.Unlike previous work, ChronicleDB is designed either as a serverless library to be tightly integrated in an application or as a standalone database server. Our results of an experimental evaluation with real and synthetic data reveal that ChronicleDB clearly outperforms competing systems with respect to both write and query performance.

Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems

The commoditization of high-performance networking has sparked research interest in the RDMA capability of this hardware. One-sided RDMA primitives, in particular, have generated substantial excitement due to the ability to directly access remote memory from within an application without involving the TCP/IP stack or the remote CPU. This paper considers how to leverage RDMA to improve the analytical performance of parallel database systems. To shuffle data efficiently using RDMA, one needs to consider a complex design space that includes (1) the number of open connections, (2) the contention for the shared network interface, (3) the RDMA transport function, and (4) how much memory should be reserved to exchange data between nodes during query processing. We contribute eight designs that capture salient trade-offs in this design space as well as an adaptive algorithm to dynamically manage RDMA-registered memory. We comprehensively evaluate how transport-layer decisions impact the query performance of a database system for different generations of InfiniBand. We find that a shuffling operator that uses the RDMA Send/Receive transport function over the Unreliable Datagram transport service can transmit data up to 4× faster than an RDMA-capable MPI implementation in a 16-node cluster. The response time of TPC-H queries improves by as much as 2×.

Temporally-Biased Sampling Schemes for Online Model Management

To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified decay function''. We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while---unlike in a sliding-window approach---still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a trade-off between sample footprint and sample-size stability. R-TBS rests on the notion of a fractional sample'' and allows for data arrival rates that are unknown and time varying (unlike T-TBS). The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.

Efficient Algorithms for Approximate Single-Source Personalized PageRank Queries

Given a graph $G$, a source node $s$ and a target node $t$, the \emph{personalized PageRank} (\emph{PPR}) of $t$ with respect to $s$ is the probability that a random walk starting from $s$ terminates at $t$. An important variant of the PPR query is \emph{single-source PPR} (\emph{SSPPR}), which enumerates all nodes in $G$, and returns the top-$k$ nodes with the highest PPR values with respect to a given source $s$. PPR in general and SSPPR in particular have important applications in web search and social networks, e.g., in Twitter's Who-To-Follow recommendation service. However, PPR computation is known to be expensive on large graphs, and resistant to indexing. Consequently, previous solutions either use heuristics, which do not guarantee result quality, or rely on the strong computing power of modern data centers, which is costly. Motivated by this, we propose effective index-free and index-based algorithms for approximate PPR processing, with rigorous guarantees on result quality. We first present {\ssppr}, an approximate SSPPR solution that combines two existing methods Forward Push (which is fast but does not guarantee quality) and Monte Carlo Random Walk (accurate but slow) in a simple and yet non-trivial way, leading to both high accuracy and efficiency. Further, {\ssppr} includes a simple and effective indexing scheme, as well as a module for top-$k$ selection with high pruning power. Extensive experiments demonstrate that the proposed solutions are orders of magnitude more efficient than their respective competitors. Notably, on a billion-edge Twitter dataset, FORA answers a top-500 approximate SSPPR query within 1 second, using a single commodity server.

On the Expressive Power of Query Languages for Matrices

We investigate the expressive power of MATLANG, a formal language for matrix manipulation based on common matrix operations and linear algebra. The language can be extended with the operation inv of inverting a matrix. In MATLANG+inv we can compute the transitive closure of directed graphs, whereas we show that this is not possible without inversion. Indeed we show that the basic language can be simulated in the relational algebra with arithmetic operations, grouping, and summation. We also consider an operation eigen for diagonalizing a matrix, which is defined so that different eigenvectors returned for a same eigenvalue are orthogonal. We show that inv can be expressed in MATLANG+eigen. We put forward the open question whether there are boolean queries about matrices, or generic queries about graphs, expressible in MATLANG+eigen but not in MATLANG+inv. The evaluation problem for MATLANG+eigen is shown to be complete for the complexity class exists-R.

Efficient enumeration algorithms for regular document spanners

Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in a quick succession, and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant delay enumeration of a spanner's output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable set automata, we also study how it can be applied when the spanner is specified by general variable set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner, providing a fine grained analysis of the classes of document spanners that support efficient enumeration of their results.

?KTELO: A Framework for Defining Differentially-Private Computations

The adoption of differential privacy is growing but the complexity of designing private, efficient and accurate algorithms is still high. We propose a novel programming framework and system, ?ktelo, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming to one of a small number of operator classes. While past programming frameworks have helped to ensure the privacy of programs, the novelty of our framework is its significant support for authoring accurate and efficient (as well as private) programs. After describing the design and architecture of the ?ktelo system, we show that ?ktelo is expressive, allows for safer implementations through code reuse, and that it allows both privacy novices and experts to easily design algorithms. We provide a number of novel implementation techniques to support the generality and scalability of ?ktelo operators. These include methods to automatically compute lossless reductions of the data representation, implicit matrices that avoid materialized state but still support computations, and iterative inference implementations which generalize techniques from the privacy literature. We demonstrate the utility of ?ktelo by designing several new state-of-the-art algorithms, most of which result from simple re-combinations of operators defined in the framework. We study the accuracy and scalability of ?ktelo plans in a thorough empirical evaluation.

A Game-theoretic Approach to Data Interaction

As most users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. The database management systems (DBMS) may interact with users and use their feedback on the returned results to learn the information needs behind their queries. Current query interfaces assume that users do not learn and modify the way way they express their information needs in form of queries during their interaction with the DBMS. Using a real-world interaction workload, we show that users learn and modify how to express their information needs during their interactions with the DBMS and their learning is accurately modeled by a well-known reinforcement learning mechanism. As current data interaction systems assume that users do not modify their strategies, they cannot discover the information needs behind users' queries effectively. We model the interaction between users and DBMS as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in form of queries. We propose a reinforcement learning method that learns and answers the information needs behind queries and adapts to the changes in users' strategies and prove that it improves the effectiveness of answering queries stochastically speaking. We propose two efficient implementation of this method over large relational databases. Our extensive empirical studies over real-world query workloads indicate that our algorithms are efficient and effective.

###### All ACM Journals | See Full Journal Index

Search TODS
enter search term and/or author name