Voice-enabled intelligent assistants, such as Amazon Alexa, Google Assistant, Microsoft Cortana or Apple Siri, are on their way to revolutionize the way humans interact with machines. Their ubiquitous presence in our homes, offices, cars, etc. and their ease of use have the potential to fully democratize access to information and services, making them available to all, from young children to senior citizens. To this effect, Alexa is offering an open service available on tens of millions of devices that enables developers to build voice-enabled applications in a multitude of domains from home automation to entertainment.
One domain that Alexa is pioneering in particular is the shopping domain. Customers can ask Alexa to order garlic from their kitchen while they are crushing their last clove, and they can as easily ask her about the best surveillance camera. We see then that in the shopping domain, Alexa addresses not only transactional needs but also informational needs, thus covering two of the Web search users' needs defined by Broder in . Yet the usual Web search techniques cannot be applied "as is" in voice-driven product discovery, since as demonstrated by Ingber et al. in , users' behavior differs significantly between Web and voice.
Consequently, for Alexa to naturally interact with users, and act as the ultimate virtual shopping assistant, new methods need to be invented and a number of open research challenges across various domains need to be addressed. These domains include automatic speech recognition, natural language understanding, search and question answering, and most importantly, user experience, which is critical in such a new and still evolving interaction paradigm. In this talk, we will share with the audience our vision of an intelligent shopping assistant escorting customers in their holistic shopping journey. We will also discuss the involved technical challenges that establish voice shopping as a new area of research in the AI and search communities at large.
Modern information retrieval systems, such as search engines, recommender systems, and conversational agents, are best thought of as interactive systems, that is, systems that interact with and learn from user behavior. The ways in which people interact with information continue to change, with different devices, different presentation formats, and different information seeking scenarios.
These changes give rise to new algorithmic and conceptual questions. For instance, how can we learn to rank good results if the display preferences are not known? How might we automatically generate questions to elicit a user's preferences so that an information retrieval system can adjust its results as efficiently as possible? And how should we understand information seeking dialogues?
The talk is based on joint work with Claudio Di Ciccio, Julia Kiseleva, Harrie Oosterhuis, Filip Radlinski, Kate Revoredo, Anna Sepliarskaia, and Svitlana Vakulenko.
Recent progress in Deep Reinforcement Learning has shown that agents can be taught complex behaviour and solve difficult tasks, such as playing video games from pixel observations, or mastering the game of Go without observing human games, with relatively little prior information. Building on these successes, researchers such as Hermann and colleagues have sought to apply these methods to teach - in simulation - agents to complete a variety of tasks specified by combinatorially rich instruction languages. In this talk, we discuss some of these highlights and some of the limitations which inhibit scalability of such approaches to more complex instruction languages (including natural language). Following this, we introduce a new approach, inspired by recent work in adversarial reward modelling, which constitutes a first step towards scaling instruction-conditional agent training to "real world" language, unlocking the possibility of applying these techniques within a wide range of industrial applications.
In this paper, we study the product title summarization problem in E-commerce applications for display on mobile devices. Comparing with conventional sentence summarization, product title summarization has some extra and essential constraints. For example, factual errors or loss of the key information are intolerable for E-commerce applications. Therefore, we abstract two more constraints for product title summarization: (i) do not introduce irrelevant information; (ii) retain the key information (e.g., brand name and commodity name). To address these issues, we propose a novel multi-source pointer network by adding a new knowledge encoder for pointer network. The first constraint is handled by pointer mechanism. For the second constraint, we restore the key information by copying words from the knowledge encoder with the help of the soft gating mechanism. For evaluation, we build a large collection of real-world product titles along with human-written short titles. Experimental results demonstrate that our model significantly outperforms the other baselines. Finally, online deployment of our proposed model has yielded a significant business impact, as measured by the click-through rate.
Unavoidable noise in real-world categorical data presents significant challenges to existing outlier detection methods because they normally fail to separate noisy values from outlying values. Feature subspace-based methods inevitably mix noisy values when retaining an entire feature because a feature may contain both outlying values and noisy values. Pattern-based methods are normally based on frequency and are easily misled by noisy values, resulting in many faulty patterns. This paper introduces a novel unsupervised framework termed OUVAS, and its parameter-free instantiation RHAC to explore a high-quality outlying value set for detecting outliers in noisy categorical data. Based on the observation that the relations between values reflect their essence, OUVAS investigates value similarities to cluster values into different groups and combines cluster-level analysis and value-level refinement to identify an outlying value set. RHAC instantiates OUVAS by three successive modules (i.e., the combination of Ochiai coefficient and LOUVAIN algorithm to cluster values, hierarchical value coupling learning to perform cluster-level analysis, and a threshold to divide fake and real outlying values in value-level refinement). We show that (i) RHAC-based outlier detector significantly outperforms five state-of-the-art outlier detection methods; (ii) Extended RHAC-based feature selection method successfully improves the performance of existing outlier detectors and performs better than two latest outlying feature selection methods.
Topic modelling and citation recommendation of scientific articles are important yet challenging research problems in scientific article analysis. In particular, the inference on coherent topics can be easily affected by irrelevant contents in articles. Meanwhile, the extreme sparsity of citation networks brings difficulty to a valid citation recommendation. Intuitively, articles with similar topics are more likely to cite each other, and cited articles tend to share similar themes. Motivated from this intuition, we aim to boost the performance of both topic modelling and citation recommendation by effectively leverage this underlying correlation between latent topics and citation networks. To this end, we propose a novel Bayesian deep generative model termed as Neural Relational Topic Model (NRTM), which is composed with a Stacked Variational Auto-Encoder (SVAE) and a multilayer perception (MLP). Specifically, the SVAE utilizes an inference network to learn more representative topics of document contents, which can help to enrich the latent factors in collaborative filtering of citations. Furthermore, the MLP network conducts nonlinear collaborative filtering of citations, which can further benefit the inference of topics by leveraging the knowledge of citation networks. Extensive experiments on two real-world datasets demonstrate that our model can effectively take advantages of the coherence between topic learning and citation recommendation, and significantly outperform the state-of-the-art methods on both tasks.
Although the scientific digital library is growing at a rapid pace, scholars/students often find reading Science, Technology, Engineering, and Mathematics (STEM) literature daunting, especially for the math-content/formula. In this paper, we propose a novel problem, "mathematics content understanding", for cyberlearning and cyberreading. To address this problem, we create a Formula Evolution Map (FEM) offline and implement a novel online learning/reading environment, PDF Reader with Math-Assistant (PRMA), which incorporates innovative math-scaffolding methods. The proposed algorithm/system can auto-characterize student emerging math-information need while reading a paper and enable students to readily explore the formula evolution trajectory in FEM. Based on a math-information need, PRMA utilizes innovative joint embedding, formula evolution mining, and heterogeneous graph mining algorithms to recommend high quality Open Educational Resources (OERs), e.g., video, Wikipedia page, or slides, to help students better understand the math-content in the paper. Evaluation and exit surveys show that the PRMA system and the proposed formula understanding algorithm can effectively assist master and PhD students better understand the complex math-content in the class readings.
The range skyline query retrieves the dynamic skyline for every individual query point in a range by generalizing the point-based dynamic skyline query. Its wide-ranging applications enable users to submit their preferences within an interval of 'ideally sought' values across every dimension, instead of being limited to submit their preference in relation to a single sought value. This paper considers the query as a hyper-rectangle iso-oriented towards the axes of the multi-dimensional space and proposes: (i) main-memory algorithmic strategies, which are simple to implement and (ii) secondary-memory pruning mechanisms for processing range skyline queries efficiently. The proposed approach is progressive and I/O optimal. A performance evaluation of the proposed technique demonstrates its robustness and practicability.
The problem of aggregating scores, so as to provide a ranking of objects in a dataset according to different evaluation criteria, is central to many modern data-intensive applications. Although efficient (instance optimal) algorithms exist to this purpose (such as the Threshold Algorithm TA and its variants) none of them is able to deal with scenarios in which the function used to aggregate scores is only partially specified. This is the typical case when the function is a weighted sum, and the user is unable to provide precise values for the weights. In this paper, we consider the problem of processing multi-source top-k queries, when only constraints, rather than precise values, are available for the weights. After observing that the so-called Fagin's Algorithm (FA) can be adapted to solve the problem, yet only when no constraints at all are present (a case in which our queries will return the k-skyband of the dataset), we introduce the novel FSA algorithm, which we prove to be instance optimal for any set of constraints on the weights. We also propose several optimizations to the basic FSA logic so as to improve execution times. Experimental analysis on both real and synthetic datasets shows that our optimizations are indeed highly effective and that the increased flexibility provided by FSA introduces little overhead with respect to the case of classical top-k queries.
Citation KNN is an important but compute-intensive algorithm for multiple instance learning (MIL). This paper presents FALCON, a fast replacement of Citation KNN. FALCON accelerates Citation KNN by removing unnecessary distance calculations through two novel optimizations, \em multi-level triangle inequality-based distance filtering and \em heap optimization. The careful design allows it to produce the same results as the original Citation KNN does while avoiding 84--99.8% distance calculations. On seven datasets of various sizes and dimensions, FALCON consistently outperforms Citation KNN by one or two orders of magnitude, making it a promising drop-in replacement of Citation KNN for multiple instance learning.
Secure top-k inner product retrieval allows the users to outsource encrypted data vectors to a cloud server and at some later time find the k vectors producing largest inner products giving an encrypted query vector. Existing solutions suffer poor performance raised by the client's filtering out top-k results. To enable the server-side filtering, we introduce an asymmetric inner product encryption AIPE that allows the server to compute inner products from encrypted data and query vectors. To solve AIPE's vulnerability under known plaintext attack, we present a packing approach IP Packing that allows the server to obtain the entire set of inner products between the query and all data vectors but prevents the server from associating any data vector with its inner product. Based on IP Packing, we present our solution SKIP to secure top-k inner product retrieval that further speeds up retrieval process using sequential scan. Experiments on real recommendation datasets demonstrate that our protocols outperform alternatives by several orders of magnitude.
Semi-supervised learning (SSL) is effectively used for numerous classification problems, thanks to its ability to make use of abundant unlabeled data. The main assumption of various SSL algorithms is that the nearby points on the data manifold are likely to share a label. Graph-based SSL constructs a graph from point-cloud data as an approximation to the underlying manifold, followed by label inference. It is no surprise that the quality of the constructed graph in capturing the essential structure of the data is critical to the accuracy of the subsequent inference step .
How should one construct a graph from the input point-cloud data for graph-based SSL? In this work we introduce a new, parallel graph learning framework (called PG-learn) for the graph construction step of SSL. Our solution has two main ingredients: (1) a gradient-based optimization of the edge weights (more specifically, different kernel bandwidths in each dimension) based on a validation loss function, and (2) a parallel hyperparameter search algorithm with an adaptive resource allocation scheme. In essence, (1) allows us to search around a (random) initial hyperparameter configuration for a better one with lower validation loss. Since the search space of hyperparameters is huge for high-dimensional problems, (2) empowers our gradient-based search to go through as many different initial configurations as possible, where runs for relatively unpromising starting configurations are terminated early to allocate the time for others. As such, PG-learn is a carefully-designed hybrid of random and adaptive search. Through experiments on multi-class classification problems, we show that PG-learn significantly outperforms a variety of existing graph construction schemes in accuracy (per fixed time budget for hyperparameter tuning), and scales more effectively to high dimensional problems.
Node ranking in temporal networks are often impacted by heterogeneous context from node content, temporal, and structural dimensions. This paper introduces TGNet , a deep learning framework for node ranking in heterogeneous temporal graphs. TGNet utilizes a variant of Recurrent Neural Network to adapt context evolution and extract context features for nodes. It incorporates a novel influence network to dynamically estimate temporal and structural influence among nodes over time. To cope with label sparsity, it integrates graph smoothness constraints as a weak form of supervision. We show that the application of TGNet is feasible for large-scale networks by developing efficient learning and inference algorithms with optimization techniques. Using real-life data, we experimentally verify the effectiveness and efficiency of TGNet techniques. We also show that TGNet yields intuitive explanations for applications such as alert detection and academic impact ranking, as verified by our case study.
When analyzing temporal networks, a fundamental task is the identification of dense structures (i.e., groups of vertices that exhibit a large number of links), together with their temporal span (i.e., the period of time for which the high density holds). We tackle this task by introducing a notion of temporal core decomposition where each core is associated with its span: we call such cores span-cores. As the total number of time intervals is quadratic in the size of the temporal domain T under analysis, the total number of span-cores is quadratic in $|T|$ as well. Our first contribution is an algorithm that, by exploiting containment properties among span-cores, computes all the span-cores efficiently. Then, we focus on the problem of finding only the maximal span-cores, i.e., span-cores that are not dominated by any other span-core by both the coreness property and the span. We devise a very efficient algorithm that exploits theoretical findings on the maximality condition to directly compute the maximal ones without computing all span-cores. Experimentation on several real-world temporal networks confirms the efficiency and scalability of our methods. Applications on temporal networks, gathered by a proximity-sensing infrastructure recording face-to-face interactions in schools, highlight the relevance of the notion of (maximal) span-core in analyzing social dynamics and detecting/correcting anomalies in the data.
Problems involving multiple networks are prevalent in many scientific and other domains. In particular, network alignment, or the task of identifying corresponding nodes in different networks, has applications across the social and natural sciences. Motivated by recent advancements in node representation learning for single-graph tasks, we propose REGAL (REpresentation learning-based Graph ALignment), a framework that leverages the power of automatically-learned node representations to match nodes across different graphs. Within REGAL we devise xNetMF, an elegant and principled node embedding formulation that uniquely generalizes to multi-network problems. Our results demonstrate the utility and promise of unsupervised representation learning-based network alignment in terms of both speed and accuracy. REGAL runs up to 30x faster in the representation learning stage than comparable methods, outperforms existing network alignment methods by 20 to 30% accuracy on average, and scales to networks with millions of nodes each.
Nowadays, recommender systems provide essential web services on the Internet. There are mainly two categories of traditional recommendation algorithms: Content-Based (CB) and Collaborative Filtering (CF). CF methods make recommendations mainly according to the historical feedback information. They usually perform better when there is sufficient feedback information but less successful on new users and items, which is called the "cold-start'' problem. However, CB methods help in this scenario because of using content information. To take both advantages of CF and CB, how to combine them is a challenging issue. To the best of our knowledge, little previous work has been done to solve the problem in one unified recommendation model. In this work, we study how to integrate CF and CB, which utilizes both types of information in model-level but not in result-level and makes recommendations adaptively. A novel attention-based model named Attentional Content&Collaborate Model (ACCM) is proposed. Attention mechanism helps adaptively adjust for each user-item pair from which source information the recommendation is made. Especially, a "cold sampling'' learning strategy is designed to handle the cold-start problem. Experimental results on two benchmark datasets show that the ACCM performs better on both warm and cold tests compared to the state-of-the-art algorithms.
Generative Adversarial Networks (GAN) have achieved big success in various domains such as image generation, music generation, and natural language generation. In this paper, we propose a novel GAN-based collaborative filtering (CF) framework to provide higher accuracy in recommendation. We first identify a fundamental problem of existing GAN-based methods in CF and highlight it quantitatively via a series of experiments. Next, we suggest a new direction of vector-wise adversarial training to solve the problem and propose our GAN-based CF framework, called CFGAN, based on the direction. We identify a unique challenge that arises when vector-wise adversarial training is employed in CF. We then propose three CF methods realized on top of our CFGAN that are able to address the challenge. Finally, via extensive experiments on real-world datasets, we validate that vector-wise adversarial training employed in CFGAN is really effective to solve the problem of existing GAN-based CF methods. Furthermore, we demonstrate that our proposed CF methods on CFGAN provide recommendation accuracy consistently and universally higher than those of the state-of-the-art recommenders.
Textual reviews, which are readily available on many e-commerce and review websites such as Amazon and Yelp, serve as an invaluable source of information for recommender systems. However, not all parts of the reviews are equally important, and the same choice of words may reflect a different meaning based on its context. In this paper, we propose a novel end-to-end Aspect-based Neural Recommender (ANR) to perform aspect-based representation learning for both users and items via an attention-based component. Furthermore, we model the multi-faceted process behind how users rate items by estimating the aspect-level user and item importance by adapting the neural co-attention mechanism. Our proposed model concurrently address several shortcomings of existing recommender systems, and a thorough experimental study on 25 benchmark datasets from Amazon and Yelp shows that ANR significantly outperforms recently proposed state-of-the-art baselines such as DeepCoNN, D-Attn and ALFM.
Context-aware Recommendations (CARS) have attracted a lot of attention recently because of the impact of contextual information on user behaviors. Recent state-of-the-art methods represent the relations between users/items and contexts as a tensor, with which it is difficult to distinguish the impacts of different contextual factors and to model complex, non-linear interactions between contexts and users/items. In this paper, we propose a novel neural model, named Attentive Interaction Network (AIN), to enhance CARS through adaptively capturing the interactions between contexts and users/items. Specifically, AIN contains an Interaction-Centric Module to capture the interaction effects of contexts on users/items; a User-Centric Module and an Item-Centric Module to model respectively how the interaction effects influence the user and item representations. The user and item representations under interaction effects are combined to predict the recommendation scores. We further employ effect-level attention mechanism to aggregate multiple interaction effects. Extensive experiments on two rating datasets and one ranking dataset show that the proposed AIN outperforms state-of-the-art CARS methods. In addition, we also find that AIN provides recommendations with better explanation ability with respect to contexts than the existing approaches.
The field of Search as Learning addresses questions surrounding human learning during the search process. Existing research has largely focused on observing how users with learning-oriented information needs behave and interact with search engines. What is not yet quantified is the extent to which search is a viable learning activity compared to instructor-designed learning. Can a search session be as effective as a lecture video - our instructor-designed learning artefact - or learning? To answer this question, we designed a user study that pits instructor-designed learning (a short high-quality video lecture as commonly found in online learning platforms) against three instances of search, specifically (i) single-user search, (ii) search as a support tool for instructor-designed learning, and, (iii) collaborative search. We measured the learning gains of 151 study participants in a vocabulary learning task and report three main results: (i) lecture video watching yields up to 24% higher learning gains than single-user search, (ii) collaborative search for learning does not lead to increased learning, and (iii) lecture video watching supported by search leads up to a 41% improvement in learning gains over instructor-designed learning without a subsequent search phase.
Conversational search and recommendation based on user-system dialogs exhibit major differences from conventional search and recommendation tasks in that 1) the user and system can interact for multiple semantically coherent rounds on a task through natural language dialog, and 2) it becomes possible for the system to understand the user needs or to help users clarify their needs by asking appropriate questions from the users directly. We believe the ability to ask questions so as to actively clarify the user needs is one of the most important advantages of conversational search and recommendation. In this paper, we propose and evaluate a unified conversational search/recommendation framework, in an attempt to make the research problem doable under a standard formalization. Specifically, we propose a System Ask -- User Respond (SAUR) paradigm for conversational search, define the major components of the paradigm, and design a unified implementation of the framework for product search and recommendation in e-commerce. To accomplish this, we propose the Multi-Memory Network (MMN) architecture, which can be trained based on large-scale collections of user reviews in e-commerce. The system is capable of asking aspect-based questions in the right order so as to understand the user needs, while (personalized) search is conducted during the conversation, and results are provided when the system feels confident. Experiments on real-world user purchasing data verified the advantages of conversational search and recommendation against conventional search and recommendation algorithms in terms of standard evaluation measures such as NDCG.
High-recall retrieval --- finding all or nearly all relevant documents --- is critical to applications such as electronic discovery, systematic review, and the construction of test collections for information retrieval tasks. The effectiveness of current methods for high-recall information retrieval is limited by their reliance on human input, either to generate queries, or to assess the relevance of documents. Past research has shown that humans can assess the relevance of documents faster and with little loss in accuracy by judging shorter document surrogates, e.g.\ extractive summaries, in place of full documents. To test the hypothesis that short document surrogates can reduce assessment time and effort for high-recall retrieval, we conducted a 50-person, controlled, user study. We designed a high-recall retrieval system using continuous active learning (CAL) that could display either full documents or short document excerpts for relevance assessment. In addition, we tested the value of integrating a search engine with CAL. In the experiment, we asked participants to try to find as many relevant documents as possible within one hour. We observed that our study participants were able to find significantly more relevant documents when they used the system with document excerpts as opposed to full documents. We also found that allowing participants to compose and execute their own search queries did not improve their ability to find relevant documents and, by some measures, impaired performance. These results suggest that for high-recall systems to maximize performance, system designers should think carefully about the amount and nature of user interaction incorporated into the system.
Search engine users always endeavor to reformulate queries during search sessions for articulating their information needs because it is not always easy to articulate the search intents. To further ameliorate the reformulation process, search engines may provide some query suggestions based on previous queries. In this paper, we propose Reformulation Inference Network (RIN) to learn how users reformulate queries, thereby benefiting context-aware query suggestion. Instead of categorizing reformulations into predefined patterns, we represent queries and reformulations in a homomorphic hidden space through heterogeneous network embedding. To capture the structure of the session context, a recurrent neural network (RNN) with the attention mechanism is employed to encode the search session by reading the homomorphic query and reformulation embeddings. It enables the model to explicitly captures the former reformulation for each query in the search session and directly learn user reformulation behaviors, from which query suggestion may benefit as shown in previous studies. To generate query suggestions, a binary classifier and an RNN-based decoder are introduced as the query discriminator and the query generator. Inspired by the intuition that model accurately predicting the next reformulation can also correctly infer the next intended query, a reformulation inferencer is then designed for inferring the next reformulation in the latent space of homomorphic embeddings. Therefore, both question suggestion and reformulation prediction can be simultaneously optimized by multi-task learning. Extensive experiments are conducted on publicly available AOL search engine logs. The experimental results demonstrate that RIN outperforms competitive baselines across various situations for both discriminative and generative tasks of context-aware query suggestion.
The detection of all inclusion dependencies (INDs) in an unknown dataset is at the core of any data profiling effort. Apart from the discovery of foreign key relationships, INDs can help perform data integration, integrity checking, schema (re-)design, and query optimization. With the advent of Big Data, the demand increases for efficient INDs discovery algorithms that can scale with the input data size. To this end, we propose S-indd++ as a scalable system for detecting unary INDs in large datasets. S-indd++ applies a new stepwise partitioning technique that helps discard a large number of attributes in early phases of the detection by processing the first partitions of smaller sizes. S-indd++ also extends the concept of the attribute clustering to decide which attributes to be discarded based on the clustering result of each partition. Moreover, in contrast to the state-of-the-art, S-indd++ does not require the partition to fit into the main memory- which is a highly appreciable property in the face of the ever growing datasets. We conducted an exhaustive evaluation of S-indd ++ by applying it to large datasets with thousands attributes and more than 266 million tuples. The results show the high superiority of S-indd++ over the state-of-the-art. S-indd++ reduced up to 50~% of the runtime in comparison with Binder, and up to 98~% in comparison with S-indd.
Web tables have become very popular and important in many real applications, such as search engines and knowledge base enrichment. Due to its benefit, it is very urgent to understand web tables. An important task in web table understanding is the column-type detection, which detects the most likely types (categories) to describe the columns in the web table. Some existing studies use knowledge bases to determine the column types. However, this problem has three challenges. (i) Web tables are too dirty to be understood. (ii) Knowledge bases are not comprehensive enough to cover all the columns. (iii) The size of both knowledge bases and web tables are extremely huge. Thus, traditional approaches encounter the limitations with low quality and poor scalability. Also, they cannot extract the best type from top-k types automatically. To address these limitations, we propose a collective inference approach (CIA) based on Topic Sensitive PageRank, which considers not only the types of detected columns, but also the collective information of web tables to automatically produce more accurate top-k types, especially the top-1 type, for both incorrectly detected columns and undetectable columns whose cells do not exist in the knowledge base. We also propose three methods to improve the inference performance and implemented techniques of CIA in MapReduce. Experimental results on real-world datasets show that our CIA achieves much higher quality in top-1 type detection as well as the entity enrichment, and outperforms state-of-the-art approaches significantly.
The plethora of publicly available data sources has given birth to a wealth of new needs and opportunities. The ever increasing amount of data has shifted the analysts' attention from optimizing the operators for specific business cases, to focusing on datasets per se, selecting the ones that are most suitable for specific operators, i.e., they make an operator produce a specific output. Yet, predicting the output of a given operator executed for different input datasets is not an easy task: It entails executing the operator for all of them, something that requires excessive computational power and time. To tackle this challenge, we propose a novel dataset profiling methodology that infers an operator's outcome based on examining the similarity of the available input datasets in specific attributes. Our methodology quantifies dataset similarities and projects them into a low-dimensional space. The operator is then executed for a mere subset of the available datasets and its output for the rest of them is approximated using Neural Networks trained using this space as input. Our experimental evaluation thoroughly examines the performance of our scheme using both synthetic and real-world datasets, indicating that the suggested approach is capable of predicting an operator's output with high accuracy. Moreover, it massively accelerates operator profiling in comparison to approaches that require an exhaustive operator execution, rendering our work ideal for cases where a multitude of operators need to be executed to a set of given datasets.
Rich high-quality annotated data is critical for semantic segmentation learning, yet acquiring dense and pixel-wise ground-truth is both labor- and time-consuming. Coarse annotations (e.g., scribbles, coarse polygons) offer an economical alternative, with which training phase could hardly generate satisfactory performance unfortunately. In order to generate high-quality annotated data with a low time cost for accurate segmentation, in this paper, we propose a novel annotation enrichment strategy, which expands existing coarse annotations of training data to a finer scale. Extensive experiments on the Cityscapes and PASCAL VOC 2012 benchmarks have shown that the neural networks trained with the enriched annotations from our framework yield a significant improvement over that trained with the original coarse labels. It is highly competitive to the performance obtained by using human annotated dense annotations. The proposed method also outperforms among other state-of-the-art weakly-supervised segmentation methods.
Knowledge Graphs (KGs) have facilitated many real-world applications (e.g., vertical search and intelligent question answering). However, they are usually incomplete, which affects the performance of such KG based applications. To alleviate this problem, a number of Knowledge Graph Completion (KGC) methods have been developed to predict those implicit triples. Tensor/matrix based methods and translation based methods have attracted great attention for a long time. Recently, neural network has been introduced into KGC due to its extensive superiority in many fields (e.g., natural language processing and computer vision), and achieves promising results. In this paper, we propose a Shared Embedding based Neural Network (SENN) model for KGC. It integrates the prediction tasks of head entities, relations and tail entities into a neural network based framework with shared embeddings of entities and relations, while explicitly considering the differences among these prediction tasks. Moreover, we propose an adaptively weighted loss mechanism, which dynamically adjusts the weights of losses according to the mapping properties of relations, and the prediction tasks. Since relation prediction usually performs better than head and tail entity predictions, we further extend SENN to SENN+ by employing it to assist head and tail entity predictions. Experiments on benchmark datasets validate the effectiveness and merits of the proposed SENN and SENN+ methods. The shared embeddings and the adaptively weighted loss mechanism are also testified to be effective.
The main focus of relational learning for knowledge graph completion (KGC) lies in exploiting rich contextual information for facts. Many state-of-the-art models incorporate fact sequences, entity types, and even textual information. Unfortunately, most of them do not fully take advantage of rich structural information in a KG, i.e., connectivity patterns around each entity. In this paper, we propose a context-aware convolutional learning (CACL) model which jointly learns from entities and their multi-hop neighborhoods. Since we directly utilize the connectivity patterns contained in each multi-hop neighborhood, the structural role similarity among entities can be better captured, resulting in more informative entity and relation embeddings. Specifically, CACL collects entities and relations from the multi-hop neighborhood as contextual information according to their relative importance and uniquely maps them to a linear vector space. Our convolutional architecture leverages a deep learning technique to represent each entity along with its linearly mapped contextual information. Thus, we can elaborately extract the features of key connectivity patterns from the context and incorporate them into a score function which evaluates the validity of facts. Experimental results on the newest datasets show that CACL outperforms existing approaches by successfully enriching embeddings with neighborhood information.
We propose smooth q-gram, the first variant of q-gram that captures q-gram pair within a small edit distance. We apply smooth q-gram to the problem of detecting overlapping pairs of error-prone reads produced by single molecule real time sequencing (SMRT), which is the first and most critical step of the de novo fragment assembly of SMRT reads. We have implemented and tested our algorithm on a set of real world benchmarks. Our empirical results demonstrated the significant superiority of our algorithm over the existing q-gram based algorithms in accuracy.
Multi-view anomaly detection is a challenging issue due to diverse data generation mechanisms and inconsistent cluster structures of different views. Existing methods of point anomaly detection are ineffective for scenarios where individual instances are normal, but their collective behavior as a group is abnormal. In this paper, we formalize this group anomaly detection issue, and propose a novel non-parametric bayesian model, named Multi-view Group Anomaly Detection (MGAD). By representing the multi-view data with different latent group and topic structures, MGAD first discovers the distribution of groups or topics in each view, then detects group anomalies effectively. In order to solve the proposed model, we conduct the collapsed Gibbs sampling algorithm for model inference. We evaluate our model on both synthetic and real-world datasets with different anomaly settings. The experimental results demonstrate the effectiveness of the proposed approach on detecting multi-view group anomalies.
Advances in sensor technology have enabled the collection of large-scale datasets. Such datasets can be extremely noisy and often contain a significant amount of outliers that result from sensor malfunction or human operation faults. In order to utilize such data for real-world applications, it is critical to detect outliers so that models built from these datasets will not be skewed by outliers. In this paper, we propose a new outlier detection method that utilizes the correlations in the data (e.g., taxi trip distance vs. trip time). Different from existing outlier detection methods, we build a robust regression model that explicitly models the outliers and detects outliers simultaneously with the model fitting. We validate our approach on real-world datasets against methods specifically designed for each dataset as well as the state of the art outlier detectors. Our outlier detection method achieves better performances, demonstrating the robustness and generality of our method. Last, we report interesting case studies on some outliers that result from atypical events.
This paper proposes an approach to learn robust behavior representations in online platforms by addressing the challenges of user behavior skew and sparse participation. Latent behavior models are important in a wide variety of applications: recommender systems; prediction; user profiling; community characterization. Our framework is the first to jointly address skew and sparsity across graphical behavior models. We propose a generalizable bayesian approach to partition users in the presence of skew while simultaneously learning latent behavior profiles over these partitions to address user-level sparsity. Our behavior profiles incorporate the temporal activity and links between participants, although the proposed framework is flexible to introduce other definitions of participant behavior. Our approach explicitly discounts frequent behaviors and learns variable size partitions capturing diverse behavior trends. The partitioning approach is data-driven with no rigid assumptions, adapting to varying degrees of skew and sparsity.
A qualitative analysis indicates our ability to discover niche and informative user groups on large online platforms. Results on User Characterization (+6-22% AUC); Content Recommendation (+6-43% AUC) and Future Activity Prediction (+12-25% RMSE) indicate significant gains over state-of-the-art baselines. Furthermore, user cluster quality is validated with magnified gains in the characterization of users with sparse activity.
Low rank tensor completion is a well studied problem and has applications in various fields. However, in many real world applications the data is dynamic, i.e., new data arrives at different time intervals. As a result, the tensors used to represent the data grow in size. Besides the tensors, in many real world scenarios, side information is also available in the form of matrices which also grow in size with time. The problem of predicting missing values in the dynamically growing tensor is called dynamic tensor completion. Most of the previous work in dynamic tensor completion make an assumption that the tensor grows only in one mode. To the best of our Knowledge, there is no previous work which incorporates side information with dynamic tensor completion. We bridge this gap in this paper by proposing a dynamic tensor completion framework called Side Information infused Incremental Tensor Analysis (SIITA), which incorporates side information and works for general incremental tensors. We also show how non-negative constraints can be incorporated with SIITA, which is essential for mining interpretable latent clusters. We carry out extensive experiments on multiple real world datasets to demonstrate the effectiveness of SIITA in various different settings.
The rollout of new versions of a feature in modern applications is a manual multi-stage process, as the feature is released to ever larger groups of users, while its performance is carefully monitored. This kind of A/B testing is ubiquitous, but suboptimal, as the monitoring requires heavy human intervention, is not guaranteed to capture consistent, but short-term fluctuations in performance, and is inefficient, as better versions take a long time to reach the full population.
In this work we formulate this question as that of expert learning, and give a new algorithm Follow-The-Best-Interval, FTBI, that works in dynamic, non-stationary environments. Our approach is practical, simple, and efficient, and has rigorous guarantees on its performance. Finally, we perform a thorough evaluation on synthetic and real world datasets and show that our approach outperforms current state-of-the-art methods.
Aligning users across multiple heterogeneous social networks is a fundamental issue in many data mining applications. Methods that incorporate user attributes and network structure have received much attention. However, most of them suffer from error propagation or the noise from diverse neighbors in the network. To effectively model the influence from neighbors, we propose a graph neural network to directly represent the ego networks of two users to be aligned into an embedding, based on which we predict the alignment label. Three major mechanisms in the model are designed to unitedly represent different attributes, distinguish different neighbors and capture the structure information of the ego networks respectively.
Systematically, we evaluate the proposed model on a number of academia and social networking datasets with collected alignment labels. Experimental results show that the proposed model achieves significantly better performance than the state-of-the-art comparison methods (+3.12-30.57% in terms of F1 score).
With online calendar services gaining popularity worldwide, calendar data has become one of the richest context sources for understanding human behavior. However, event scheduling is still time-consuming even with the development of online calendars. Although machine learning based event scheduling models have automated scheduling processes to some extent, they often fail to understand subtle user preferences and complex calendar contexts with event titles written in natural language. In this paper, we propose Neural Event Scheduling Assistant (NESA) which learns user preferences and understands calendar contexts, directly from raw online calendars for fully automated and highly effective event scheduling. We leverage over 593K calendar events for NESA to learn scheduling personal events, and we further utilize NESA for multi-attendee event scheduling. NESA successfully incorporates deep neural networks such as Bidirectional Long Short-Term Memory, Convolutional Neural Network, and Highway Network for learning the preferences of each user and understanding calendar context based on natural languages. The experimental results show that NESA significantly outperforms previous baseline models in terms of various evaluation metrics on both personal and multi-attendee event scheduling tasks. Our qualitative analysis demonstrates the effectiveness of each layer in NESA and learned user preferences.
Search results personalization has become an effective way to improve the quality of search engines. Previous studies extracted information such as past clicks, user topical interests, query click entropy and so on to tailor the original ranking. However, few studies have taken into account the sequential information underlying previous queries and sessions. Intuitively, the order of issued queries is important in inferring the real user interests. And more recent sessions should provide more reliable personal signals than older sessions. In addition, the previous search history and user behaviors should influence the personalization of the current query depending on their relatedness. To implement these intuitions, in this paper we employ a hierarchical recurrent neural network to exploit such sequential information and automatically generate user profile from historical data. We propose a query-aware attention model to generate a dynamic user profile based on the input query. Significant improvement is observed in the experiment with data from a commercial search engine when compared with several traditional personalization models. Our analysis reveals that the attention model is able to attribute higher weights to more related past sessions after fine training.
The explicitly observed social relations from online social platforms have been widely incorporated into recommender systems to mitigate the data sparsity issue. However, the direct usage of explicit social relations may lead to an inferior performance due to the unreliability (e.g., noises) of observed links. To this end, the discovery of reliable relations among users plays a central role in advancing social recommendation. In this paper, we propose a novel approach to adaptively identify implicit friends toward discovering more credible user relations. Particularly, implicit friends are those who share similar tastes but could be distant from each other on the network topology of social relations. Methodologically, to find the implicit friends for each user, we first model the whole system as a heterogeneous information network, and then capture the similarity of users through the meta-path based embedding representation learning. Finally, based on the intuition that social relations have varying degrees of impact on different users, our approach adaptively incorporates different numbers of similar users as implicit friends for each user to alleviate the adverse impact of unreliable social relations for a more effective recommendation. Experimental analysis on three real-world datasets demonstrates the superiority of our method and explain why implicit friends are helpful in improving social recommendation.
Modelling user voting intention in social media is an important research area, with applications in analysing electorate behaviour, online political campaigning and advertising. Previous approaches mainly focus on predicting national general elections, which are regularly scheduled and where data of past results and opinion polls are available. However, there is no evidence of how such models would perform during a sudden vote under time-constrained circumstances. That poses a more challenging task compared to traditional elections, due to its spontaneous nature. In this paper, we focus on the 2015 Greek bailout referendum, aiming to nowcast on a daily basis the voting intention of 2,197 Twitter users. We propose a semi-supervised multiple convolution kernel learning approach, leveraging temporally sensitive text and network information. Our evaluation under a real-time simulation framework demonstrates the effectiveness and robustness of our approach against competitive baselines, achieving a significant 20% increase in F-score compared to solely text-based models.
The problem of inferring unknown parameters of a networked social system is of considerable practical importance. We consider this problem for the independent cascade model using an active query framework. More specifically, given a network whose edge probabilities are unknown, the goal is to infer the probability value on each edge by querying the system. The optimization objective is to use as few queries as possible in carrying out the inference. We present approximation algorithms that provide provably good estimates of edge probabilities. We also present results from an experimental evaluation of our algorithms on several real-world networks.
Failure to accurately measure the outcomes of an experiment can lead to bias and incorrect conclusions. Online controlled experiments (aka AB tests) are increasingly being used to make decisions to improve websites as well as mobile and desktop applications. We argue that loss of telemetry data (during upload or post-processing) can skew the results of experiments, leading to loss of statistical power and inaccurate or erroneous conclusions. By systematically investigating the causes of telemetry loss, we argue that it is not practical to entirely eliminate it. Consequently, experimentation systems need to be robust to its effects. Furthermore, we note that it is nontrivial to measure the absolute level of telemetry loss in an experimentation system. In this paper, we take a top-down approach towards solving this problem. We motivate the impact of loss qualitatively using experiments in real applications deployed at scale, and formalize the problem by presenting a theoretical breakdown of the bias introduced by loss. Based on this foundation, we present a general framework for quantitatively evaluating the impact of telemetry loss, and present two solutions to measure the absolute levels of loss. This framework is used by well-known applications at Microsoft, with millions of users and billions of sessions. These general principles can be adopted by any application to improve the overall trustworthiness of experimentation and data-driven decision making.
Because it is expensive to construct test collections for Cranfield-based evaluation of information retrieval systems, a variety of lower-cost methods have been proposed. The reliability of these methods is often validated by measuring rank correlation (e.g., Kendall's tau) between known system rankings on the full test collection vs. observed system rankings on the lower-cost one. However, existing rank correlation measures do not consider the statistical significance of score differences between systems in the observed rankings. To address this, we propose two statistical-significance-aware rank correlation measures, one of which is a head-weighted version of the other. We first show empirical differences between our proposed measures and existing ones. We then compare the measures while benchmarking four system evaluation methods: pooling, crowdsourcing, evaluation with incomplete judgments, and automatic system ranking. We show that use of our measures can lead to different experimental conclusions regarding reliability of alternative low-cost evaluation methods.
While test collections are a vital piece of the research infrastructure for information retrieval, constructing fair, reusable test collections for large data sets is challenging because of the number of human relevance assessments required. Various approaches for minimizing the number of judgments required have been proposed including a suite of methods based on multi-arm bandit optimization techniques. However, most of these approaches look to maximize the total number of relevant documents found, which is not necessarily fair, and they have only been demonstrated in simulation on existing test collections. The TREC 2017 Common Core track provided the opportunity to build a collection de novo using a bandit method. Doing so required addressing two problems not encountered in simulation: giving the human judges time to learn a topic and allocating the overall judgment budget across topics. The resulting modified bandit technique was used to build the 2017 Common Core test collection consisting of approximately 1.8 million news articles, 50 topics, and 30,030 judgments. Unfortunately, the constructed collection is of lower quality than anticipated: a large percentage of the known relevant documents were retrieved by only one team, and for 21 topics, more than a third of the judged documents are relevant. As such the collection is less reusable than desired. Further analysis demonstrates that the greedy approach common to most bandit methods can be unfair even to the runs participating in the collection-building process when the judgment budget is small relative to the (unknown) number of relevant documents.
To address the sparsity and cold start problem of collaborative filtering, researchers usually make use of side information, such as social networks or item attributes, to improve recommendation performance. This paper considers the knowledge graph as the source of side information. To address the limitations of existing embedding-based and path-based methods for knowledge-graph-aware recommendation, we propose RippleNet, an end-to-end framework that naturally incorporates the knowledge graph into recommender systems. Similar to actual ripples propagating on the water, RippleNet stimulates the propagation of user preferences over the set of knowledge entities by automatically and iteratively extending a user's potential interests along links in the knowledge graph. The multiple "ripples" activated by a user's historically clicked items are thus superposed to form the preference distribution of the user with respect to a candidate item, which could be used for predicting the final clicking probability. Through extensive experiments on real-world datasets, we demonstrate that RippleNet achieves substantial gains in a variety of scenarios, including movie, book and news recommendation, over several state-of-the-art baselines.
Link prediction in dynamic networks is an important task with many real-life applications in different domains, such as social networks, cyber-physical systems, and bioinformatics. There are two key processes in dynamic networks: network structural evolution and network temporal evolution, where the former represents interdependency between entities and their neighbors in the network at each timestamp, while the latter captures the evolving behavior of the entire network from the current timestamp to the next. Structural evolution generally assumes that a node is more likely to co-evolve with its neighbors in the near future. Temporal evolution focuses on the trend of network evolution as a whole, based on the accumulation of historical data. It is thus essential to use characteristics of both structural and temporal evolutions to emulate complex behaviors of dynamic networks. However, very few existing work considered both processes. In addition, real-life networks are often very sparse with limited observed links. A missing link between two nodes does not always imply that the two nodes do not have a relation in reality, especially when they share many common neighbors. Most existing methods only focus on the first-order proximity of networks, which is usually insufficient to capture the relationships among nodes.
In this work, we propose a novel framework named STEP, to simultaneously integrate both structural and temporal information in link prediction in dynamic networks. STEP first constructs a sequence of higher-order proximity matrices to better capture the implicit relationships among nodes. A regularized optimization problem is then formulated to model those higher-order proximity matrices along with additional structural and temporal constraints. Given the large scale of modern networks, we also develop an efficient block coordinate gradient descent approach to solve the optimization problem efficiently. STEP can be used to solve the link prediction problem in directed or undirected, weighted or unweighted dynamic networks. Extensive experiments on several real world datasets demonstrate that STEP can effectively model link propagation over entire time-varying networks and its superiority over some state-of-the-art algorithms.
The graph embedding paradigm projects nodes of a graph into a vector space, which can facilitate various downstream graph analysis tasks such as node classification and clustering. To efficiently learn node embeddings from a graph, graph embedding techniques usually preserve the proximity between node pairs sampled from the graph using random walks. In the context of a heterogeneous graph, which contains nodes from different domains, classical random walks are biased towards highly visible domains where nodes are associated with a dominant number of paths. To overcome this bias, existing heterogeneous graph embedding techniques typically rely on meta-paths (i.e., fixed sequences of node types) to guide random walks. However, using these meta-paths either requires prior knowledge from domain experts for optimal meta-path selection, or requires extended computations to combine all meta-paths shorter than a predefined length. In this paper, we propose an alternative solution that does not involve any meta-path. Specifically, we propose JUST, a heterogeneous graph embedding technique using random walks with JUmp and STay strategies to overcome the aforementioned bias in an more efficient manner. JUST can not only gracefully balance between homogeneous and heterogeneous edges, it can also balance the node distribution over different domains (i.e., node types). By conducting a thorough empirical evaluation of our method on three heterogeneous graph datasets, we show the superiority of our proposed technique. In particular, compared to a state-of-the-art heterogeneous graph embedding technique Hin2vec, which tries to optimally combine all meta-paths shorter than a predefined length, our technique yields better results in most experiments, with a dramatically reduced embedding learning time (about 3x speedup).
Nowadays, it is common for one natural person to join multiple social networks to enjoy different services. Linking identical users across different social networks, also known as the User Identity Linkage (UIL), is an important problem of great research challenges and practical value. Most existing UIL models are supervised or semi-supervised and a considerable number of manually matched user identity pairs are required, which is costly in terms of labor and time. In addition, existing methods generally rely heavily on some discriminative common user attributes, and thus are hard to be generalized. Motivated by the isomorphism across social networks, in this paper we consider all the users in a social network as a whole and perform UIL from the user space distribution level. The insight is that we convert the unsupervised UIL problem to the learning of a projection function to minimize the distance between the distributions of user identities in two social networks. We propose to use the earth mover's distance (EMD) as the measure of distribution closeness, and propose two models UUIL$_gan $ and UUIL$_omt $ to efficiently learn the distribution projection function. Empirically, we evaluate the proposed models over multiple social network datasets, and the results demonstrate that our proposal significantly outperforms state-of-the-art methods.
A wide range of web corpora are in the form of short text, such as QA queries, search queries and news titles. Entity linking for these short texts is quite important. Most of supervised approaches are not effective for short text entity linking. The training data for supervised approaches are not suitable for short text and insufficient for low-resourced languages. Previous unsupervised methods are incapable of handling the sparsity and noisy problem of short text. We try to solve the problem by mapping the sparse short text to a topic space. We notice that the concepts of entities have rich topic information and characterize entities in a very fine-grained granularity. Hence, we use the concepts of entities as topics to explicitly represent the context, which helps improve the performance of entity linking for short text. We leverage our linking approach to segment the short text semantically, and build a system for short entity text recognition and linking. Our entity linking approach exhibits the state-of-the-art performance on several datasets for the realistic short text entity linking problem.
Recent knowledge extraction methods are moving towards ternary and higher-arity relations to capture more information about binary facts. An example is to include the time, the location, and the duration of a specific fact. These relations can be even more complex to extract in advanced domains such as news, where events typically come with different facets including reasons, consequences, purposes, involved parties, and related events. The main challenge consists in first finding the set of facets related to each fact, and second tagging those facets to the relevant category.
In this paper, we tackle the above problems by proposing StuffIE, a fine-grained information extraction approach which is facet-centric. We exploit the Stanford dependency parsing enhanced by lexical databases such as WordNet to extract nested triple relations. Then, we exploit the syntactical dependencies to semantically tag facets using distant learning based on Oxford dictionary. We have tested the accuracy of the extracted facets and their semantic tags using DUC'04 dataset. The results show the high accuracy and coverage of our approach with respect to ClausIE, OLLIE, SEMAFOR SRL and Illinois SRL.
Efficient implementations of range and nearest neighbor queries are critical in many large multimedia applications. Locality Sensitive Hashing (LSH) is a popular technique for performing approximate searches in high-dimensional multimedia, such as image or sensory data. Often times, these multimedia data are represented as a collection of important spatio-temporal features which are extracted by using localized feature extraction algorithms. When a user wants to search for a given entity (object, event, or observation), individual similarity search queries, which collectively form a set query, need to be performed on the features that represent the particular search entity. Existing LSH techniques require that users provide an accuracy guarantee for each query in the set query, instead of an overall guarantee for the entire set query, which can lead to misses or wasteful work. We propose a novel index structure, Point Set LSH (PSLSH), which is able to execute a similarity search for a given set of search points in the high-dimensional space with a user-provided guarantee for the entire set query. Experimental evaluation shows significant gains in efficiency and accuracy trade-offs for executing set queries in high-dimensional spaces.
In this work, we describe GYANI (gyan stands for knowledge in Hindi), an indexing infrastructure for search and analysis of large semantically annotated document collections. To facilitate the search for sentences or text regions for many knowledge-centric tasks such as information extraction, question answering, and relationship extraction, it is required that one can query large annotated document collections interactively. However, currently such an indexing infrastructure that scales to millions of documents and provides fast query execution times does not exist. To alleviate this problem, we describe how we can effectively index layers of annotations (e.g., part-of-speech, named entities, temporal expressions, and numerical values) that can be attached to sequences of words. Furthermore, we describe a query language that provides the ability to express regular expressions between word sequences and semantic annotations to ease search for sentences and text regions for enabling knowledge acquisition at scale. We build our infrastructure on a state-of-the-art distributed extensible record store. We extensively evaluate GYANI over two large news archives and the entire Wikipedia amounting to more than fifteen million documents. We observe that using GYANI we can achieve significant speed ups of more than 95x in information extraction, 53x on extracting answer candidates for questions, and 12x on relationship extraction task.
The availability of massive data and computing power allowing for effective data driven neural approaches is having a major impact on machine learning and information retrieval research, but these models have a basic problem with efficiency. Current neural ranking models are implemented as multistage rankers: for efficiency reasons, the neural model only re-ranks the top ranked documents retrieved by a first-stage efficient ranker in response to a given query. Neural ranking models learn dense representations causing essentially every query term to match every document term, making it highly inefficient or intractable to rank the whole collection. The reliance on a first stage ranker creates a dual problem: First, the interaction and combination effects are not well understood. Second, the first stage ranker serves as a "gate-keeper" or filter, effectively blocking the potential of neural models to uncover new relevant documents. In this work, we propose a standalone neural ranking model (SNRM) by introducing a sparsity property to learn a latent sparse representation for each query and document. This representation captures the semantic relationship between the query and documents, but is also sparse enough to enable constructing an inverted index for the whole collection. We parameterize the sparsity of the model to yield a retrieval model as efficient as conventional term based models. Our model gains in efficiency without loss of effectiveness: it not only outperforms the existing term matching baselines, but also performs similarly to the recent re-ranking based neural models with dense representations. Our model can also take advantage of pseudo-relevance feedback for further improvements. More generally, our results demonstrate the importance of sparsity in neural IR models and show that dense representations can be pruned effectively, giving new insights about essential semantic features and their distributions.
Given electrical sensors placed on the power grid, how can we automatically determine when electrical components (e.g. power lines) fail? Or, given traffic sensors which measure the speed of vehicles passing over them, how can we determine when traffic accidents occur? Both these problems involve detecting change points in a set of sensors on the nodes or edges of a graph. To this end, we propose ChangeDAR (Change Detection And Resolution), which detects changes in an online manner, and reports when and where the change occurred in the graph.
Our contributions are: 1) Algorithm : we propose novel information-theoretic optimization objectives for scoring and detecting localized changes, and propose two algorithms, ChangeDAR-S and ChangeDAR-D respectively, to optimize them. 2) Theoretical Guarantees : we show that both methods provide constant-factor approximation guarantees (Theorems 5.2 and 6.2). 3) Effectiveness : in experiments, ChangeDAR detects traffic accidents and power line failures with 75% higher F-measure than comparable baselines. 4) Scalability : ChangeDAR is online and near-linear in the graph size and the number of time ticks.
This paper targets a general popularity prediction problem for event sequence, which has recently gained great attention due to its extensive applications in various domains. Feature driven method and point process method are two basic thinking paradigms to tackle the prediction problem, but both of them suffer from limitations. In this paper, we propose PreNets unifying the two thinking paradigms in an adversarial manner. On one side, feature driven model acts like a 'critic' who aims to discriminate the predicted popularity from the real one based on a set of temporal features from the sequence. On the other side, point process model acts like an 'interpreter' who recognizes the dynamic patterns in sequence to generate a predicted popularity that can fool the 'critic'. Through a Wasserstein learning based two-player game, the training loss of the 'critic' guides the 'interpreter' to better exploit the sequence patterns and enhance prediction, while the 'interpreter' pushes the 'critic' to select effective early features that helps discrimination. This mechanism enables the framework to absorb the advantages of both feature driven and point process methods. Empirical results show that PreNets achieves significant MAPE improvement for both Twitter cascade and Amazon review prediction.
Huge amounts of textual streams are generated nowadays, especially in social networks like Twitter and Facebook. As the discussion topics and user opinions on those topics change drastically with time, those streams undergo changes in data distribution, leading to changes in the concept to be learned, a phenomenon called concept drift. One particular type of drift, that has not yet attracted a lot of attention is feature drift, i.e., changes in the features that are relevant for the learning task at hand. In this work, we propose an approach for handling feature drifts in textual streams. Our approach integrates i) an ensemble-based mechanism to accurately predict the feature/word values for the next time-point by taking into account the different features might be subject to different temporal trends and ii) a sketch-based feature space maintenance mechanism that allows for a memory-bounded maintenance of the feature space over the stream. Experiments with textual streams from the sentiment analysis, email preference and spam detection demonstrate that our approach achieves significantly better or competitive performance compared to baselines.
Accurate modeling of how the visibility of a piece of information varies across time has a wide variety of applications. For example, in an e-commerce site like Amazon, it can help to identify which product is preferred over others; in Twitter, it can predict which hashtag may go viral against others. Visibility of a piece of information, therefore, indicates the ability of a piece of information to attract the attention of the users, against the rest. Therefore, apart from the individual information diffusion processes, the information visibility dynamics also involves a competition process, where each information diffusion process competes against each other to draw the attention of users. Despite models of individual information diffusion processes abounding in literature, modeling the competition process is left unaddressed. In this paper, we propose Competing Recurrent Point Process (CRPP), a probabilistic deep machinery that unifies the nonlinear generative dynamics of a collection of diffusion processes, and inter-process competition - the two ingredients of visibility dynamics. To design this model, we rely on a recurrent neural network (RNN) guided generative framework, where the recurrent unit captures the joint temporal dynamics of a group of processes. This is aided by a discriminative model which captures the underlying competition process by discriminating among the various processes using several ranking functions. On ten diverse datasets crawled from Amazon and Twitter, CRPP offers a substantial performance boost in predicting item visibility against several baselines, thereby achieving significant accuracy in predicting both the collective diffusion mechanism and the underlying competition processes.
Community detection is one of the fundamental tasks in graph mining, which has many real-world applications in diverse domains. In this study, we propose an optimization model for finding a community that is densely connected internally but sparsely connected to the rest of the graph. The model extends the densest subgraph problem, in which we maximize the density while minimizing the average cut size. We first show that our proposed model can be solved efficiently. Then we design two polynomial-time exact algorithms based on linear programming and a maximum flow algorithm, respectively. Moreover, to deal with larger-sized graphs in practice, we present a scalable greedy algorithm that runs in almost linear time with theoretical performance guarantee of the output. In addition, as our model is closely related to a quality function called the modularity density, we show that our algorithms can also be used to find global community structure in a graph. With thorough experiments using well-known real-world graphs, we demonstrate that our algorithms are highly effective in finding a suitable community in a graph. For example, for web-Google, our algorithm finds a solution with more than 99.1% density and less than 3.1% cut size, compared with a solution obtained by a baseline algorithm for the densest subgraph problem.
The modeling of networks, specifically generative models, has been shown to provide a plethora of information about the underlying network structures, as well as many other benefits behind their construction. There has been a considerable increase in interest for the better understanding and modeling of networks, and the vast majority of existing work has been for unsigned networks. However, many networks can have positive and negative links (or signed networks), especially in online social media. It is evident from recent work that signed networks present unique properties and principles from unsigned networks due to the added complexity, which pose tremendous challenges on existing unsigned network models. Hence, in this paper, we investigate the problem of modeling signed networks. In particular, we provide a principled approach to capture important properties and principles of signed networks and propose a novel signed network model guided by Structural Balance Theory. Empirical experiments on three real-world signed networks demonstrate the effectiveness of the proposed model.
Many scale-free networks exhibit a "rich club" structure, where high degree vertices form tightly interconnected subgraphs. In this paper, we explore the emergence of "rich clubs" in the context of shortest path based centrality metrics. We term these subgraphs of connected high closeness or high betweeness vertices as rich centrality clubs (RCC). Our experiments on real world and synthetic networks high- light the inter-relations between RCCs, expander graphs, and the core-periphery structure of the network. We show empirically and theoretically that RCCs exist, if the core-periphery structure of the network is such that each shell is an expander graph, and their density decreases from inner to outer shells. We further demonstrate that in addition to being an interesting topological feature, the presence of RCCs is useful in several appli- cations. The vertices in the subgraph forming the RCC are effective seed nodes for spreading information. Moreover, networks with RCCs are robust under perturbations to their structure. Given these useful properties of RCCs, we present a network modification model that can efficiently create a RCC within net- works where they are not present, while retaining other structural properties of the original network. The main contributions of our paper are: (i) we demonstrate that the formation of RCC is related to the core-periphery structure and particularly the expander like properties of each shell, (ii) we show that the RCC property can be used to find effective seed nodes for spreading information and for improving the resilience of the network under perturbation and, finally, (iii) we present a modification algorithm that can insert RCC within networks, while not affecting their other structural properties. Taken together, these contributions present one of the first comprehensive studies of the properties and applications of rich clubs for path based centralities.
Dynamic networks are very common in urban systems today. As data are acquired, unfortunately, they are rarely complete observations of the whole system. It is important to reliably infer the unobserved attribute values anywhere in the graphs, at certain times---either in the past or in the future. Previous work does not sufficiently capture the correlations inherent with graph topology and with time. We propose a machine learning approach using a novel probabilistic graphical model. We devise a series of algorithms to efficiently group the vertices, to learn the model parameters, and to infer the unobserved values for query processing. Furthermore, we propose a method to incrementally and automatically update the model. Finally, we perform an extensive experimental study using two real-world dynamic graph datasets to evaluate our approach.
With the rapid growth of online information services, a sheer volume of news data becomes available. To help people quickly digest the explosive information, we define a new problem - schema-based news event profiling - profiling events reported in open-domain news corpora, with a set of slots and slot-value pairs for each event, where the set of slots forms the schema of an event type. Such profiling not only provides readers with concise views of events, but also facilitates various applications such as information retrieval, knowledge graph construction and question answering. It is however a quite challenging task. The first challenge is to find out events and event types because they are both initially unknown. The second difficulty is the lack of pre-defined event-type schemas. Lastly, even with the schemas extracted, to generate event profiles from them is still essential yet demanding.
To address these challenges, we propose a fully automatic, unsupervised, three-step framework to obtain event profiles. First, we develop a Bayesian non-parametric model to detect events and event types by exploiting the slot expressions of the entities mentioned in news articles. Second, we propose an unsupervised embedding model for schema induction that encodes the insight: an entity may serve as the values of multiple slots in an event, but if it appears in more sentences along with the same set of more entities in the event, its slots in these sentences tend to be similar. Finally, we build event profiles by extracting slot values for each event based on the slots' expression patterns. To the best of our knowledge, this is the first work on schema-based profiling for news events. Experimental results on a large news corpus demonstrate the superior performance of our method against the state-of-the-art baselines on event detection, schema induction and event profiling.
While the popularity of online social network (OSN) apps continues to grow, little attention has been drawn to the increasing cases of Social Network Addictions (SNAs). In this paper, we argue that by mining OSN data in support of online intervention treatment, data scientists may assist mental healthcare professionals to alleviate the symptoms of users with SNA in early stages. Our idea, based on behavioral therapy, is to incrementally substitute highly addictive newsfeeds with safer, less addictive, and more supportive newsfeeds. To realize this idea, we propose a novel framework, called Newsfeed Substituting and Supporting System (N3S), for newsfeed filtering and dissemination in support of SNA interventions. New research challenges arise in 1) measuring the addictive degree of a newsfeed to an SNA patient, and 2) properly substituting addictive newsfeeds with safe ones based on psychological theories. To address these issues, we first propose the Additive Degree Model (ADM) to measure the addictive degrees of newsfeeds to different users. We then formulate a new optimization problem aiming to maximize the efficacy of behavioral therapy without sacrificing user preferences. Accordingly, we design a randomized algorithm with a theoretical bound. A user study with 716 Facebook users and 11 mental healthcare professionals around the world manifests that the addictive scores can be reduced by more than 30%. Moreover, experiments show that the correlation between the SNA scores and the addictive degrees quantified by the proposed model is much greater than that of state-of-the-art preference based models.
An important metric of users' satisfaction and engagement within on-line streaming services is the user session length, i.e. the amount of time they spend on a service continuously without interruption. Being able to predict this value directly benefits the recommendation and ad pacing contexts in music and video streaming services. Recent research has shown that predicting the exact amount of time spent is highly nontrivial due to many external factors for which a user can end a session, and the lack of predictive covariates. Most of the other related literature on duration based user engagement has focused on dwell time for websites, for search and display ads, mainly for post-click satisfaction prediction or ad ranking. In this work we present a novel framework inspired by hierarchical Bayesian modeling to predict, at the moment of login, the amount of time a user will spend in the streaming service. The time spent by a user on a platform depends upon user-specific latent variables which are learned via hierarchical shrinkage. Our framework enjoys theoretical guarantees and naturally incorporates flexible parametric/nonparametric models on the covariates, including models robust to outliers. Our proposal is found to outperform state-of-the-art estimators in terms of efficiency and predictive performance on real world public and private datasets.
In this paper, we introduce and tackle the Question Headline Generation (QHG) task. The motivation comes from the investigation of a real-world news portal where we find that news articles with question headlines often receive much higher click-through ratio than those with non-question headlines. The QHG task can be viewed as a specific form of the Question Generation (QG) task, with the emphasis on creating a natural question from a given news article by taking the entire article as the answer. A good QHG model thus should be able to generate a question by summarizing the essential topics of an article. Based on this idea, we propose a novel dual-attention sequence-to-sequence model (DASeq2Seq) for the QHG task. Unlike traditional sequence-to-sequence models which only employ the attention mechanism in the decoding phase for better generation, our DASeq2Seq further introduces a self-attention mechanism in the encoding phase to help generate a good summary of the article. We investigate two ways of the self-attention mechanism, namely global self-attention and distributed self-attention. Besides, we employ a vocabulary gate over both generic and question vocabularies to better capture the question patterns. Through the offline experiments, we show that our approach can significantly outperform the state-of-the-art question generation or headline generation models. Furthermore, we also conduct online evaluation to demonstrate the effectiveness of our approach using A/B test.
Relevance estimation is among the most important tasks in the ranking of search results because most search engines follow the Probability Ranking Principle. Current relevance estimation methodologies mainly concentrate on text matching between the query and Web documents, link analysis and user behavior models. However, users judge the relevance of search results directly from Search Engine Result Pages (SERPs), which provide valuable signals for reranking. Morden search engines aggregate heterogeneous information items (such as images, news, and hyperlinks) to a single ranking list on SERPs. The aggregated search results have different visual patterns, textual semantics and presentation structures, and a better strategy should rely on all these information sources to improve ranking performance. In this paper, we propose a novel framework named Joint Relevance Estimation model (JRE), which learns the visual patterns from screenshots of search results, explores the presentation structures from HTML source codes and also adopts the semantic information of textual contents. To evaluate the performance of the proposed model, we construct a large scale practical Search Result Relevance (SRR) dataset which consists of multiple information sources and 4-grade relevance scores of over 60,000 search results. Experimental results show that the proposed JRE model achieves better performance than state-of-the-art ranking solutions as well as the original ranking of commercial search engines.
Previous work has shown that popular trending events are important external factors which pose significant influence on user search behavior and also provided a way to computationally model this influence. However, their problem formulation was based on the strong assumption that each event poses its influence independently. This assumption is unrealistic as there are many correlated events in the real world which influence each other and thus, would pose a joint influence on the user search behavior rather than posing influence independently. In this paper, we study this novel problem of Modeling the Joint Influences posed by multiple correlated events on user search behavior. We propose a Joint Influence Model based on the Multivariate Hawkes Process which captures the inter-dependency among multiple events in terms of their influence upon user search behavior. We evaluate the proposed Joint Influence Model using two months query-log data from https://search.yahoo.com/. Experimental results show that the model can indeed capture the temporal dynamics of the joint influence over time and also achieves superior performance over different baseline methods when applied to solve various interesting prediction problems as well as real-word application scenarios, e.g., query auto-completion.
This study considers the task of machine reading at scale (MRS) wherein, given a question, a system first performs the information retrieval (IR) task of finding relevant passages in a knowledge source and then carries out the reading comprehension (RC) task of extracting an answer span from the passages. Previous MRS studies, in which the IR component was trained without considering answer spans, struggled to accurately find a small number of relevant passages from a large set of passages. In this paper, we propose a simple and effective approach that incorporates the IR and RC tasks by using supervised multi-task learning in order that the IR component can be trained by considering answer spans. Experimental results on the standard benchmark, answering SQuAD questions using the full Wikipedia as the knowledge source, showed that our model achieved state-of-the-art performance. Moreover, we thoroughly evaluated the individual contributions of our model components with our new Japanese dataset and SQuAD. The results showed significant improvements in the IR task and provided a new perspective on IR for RC: it is effective to teach which part of the passage answers the question rather than to give only a relevance score to the whole passage.
In software development, bug localization is the process finding portions of source code associated to a submitted bug report. This task has been modeled as an information retrieval task at source code file, where the report is the query. In this work, we propose a model that, instead of working at file level, learns feature representations from source changes extracted from the project history at both syntactic and code change dependency perspectives to support bug localization.
To that end, we structured an end-to-end architecture able to integrate feature learning and ranking between sets of bug reports and source code changes.
We evaluated our model against the state of the art of bug localization on several real world software projects obtaining competitive results in both intra-project and cross-project settings. Besides the positive results in terms of model accuracy, as we are giving the developer not only the location of the bug associated to the report, but also the change that introduced, we believe this could give a broader context for supporting fixing tasks.
The cross-domain recommendation technique is an effective way of alleviating the data sparse issue in recommender systems by leveraging the knowledge from relevant domains. Transfer learning is a class of algorithms underlying these techniques. In this paper, we propose a novel transfer learning approach for cross-domain recommendation by using neural networks as the base model. In contrast to the matrix factorization based cross-domain techniques, our method is deep transfer learning, which can learn complex user-item interaction relationships. We assume that hidden layers in two base networks are connected by cross mappings, leading to the collaborative cross networks (CoNet). CoNet enables dual knowledge transfer across domains by introducing cross connections from one base network to another and vice versa. CoNet is achieved in multi-layer feedforward networks by adding dual connections and joint loss functions, which can be trained efficiently by back-propagation. The proposed model is thoroughly evaluated on two large real-world datasets. It outperforms baselines by relative improvements of 7.84% in NDCG. We demonstrate the necessity of adaptively selecting representations to transfer. Our model can reduce tens of thousands training examples comparing with non-transfer methods and still has the competitive performance with them.
Review-based methods are one of the dominant methods to address the data sparsity problem of recommender system. However, the performance of most existing review-based methods will degrade when the review is also sparse. To this end, we propose a method to exploit user-item p air-dependent features from a uxiliary r eviews written by l ike-minded users (PARL) to address such problem. That is, both the reviews written by the user and the reviews written for the item are incorporated to highlight the useful features covered by the auxiliary reviews. PARL not only alleviates the sparsity problem of reviews but also produce extra informative features to further improve the accuracy of rating prediction. More importantly, it is designed as a plug-and-play model which can be plugged into various deep recommender systems to improve recommendations provided by them. Extensive experiments on five real-world datasets show that PARL achieves better prediction accuracy than other state-of-the-art alternatives. Also, with the exploitation of auxiliary reviews, the performance of PARL is robust on datasets with different characteristics.
Following recent successes in exploiting both latent factor and word embedding models in recommendation, we propose a novel Regularized Multi-Embedding (RME) based recommendation model that simultaneously encapsulates the following ideas via decomposition: (1) which items a user likes, (2) which two users co-like the same items, (3) which two items users often co-liked, and (4) which two items users often co-disliked. In experimental validation, the RME outperforms competing state-of-the-art models in both explicit and implicit feedback datasets, significantly improving [email protected] by 5.9~7.0%, [email protected] by 4.3~5.6%, and [email protected] by 7.9~8.9%. In addition, under the cold-start scenario for users with the lowest number of interactions, against the competing models, the RME outperforms [email protected] by 20.2% and 29.4% in MovieLens-10M and MovieLens-20M datasets, respectively. Our datasets and source code are available at: https://github.com/thanhdtran/RME.git.
The rapid growth of Location-based Social Networks (LBSNs) provides a great opportunity to satisfy the strong demand for personalized Point-of-Interest (POI) recommendation services. However, with the tremendous increase of users and POIs, POI recommender systems still face several challenging problems: (1) the hardness of modeling complex user-POI interactions from sparse implicit feedback; (2) the difficulty of incorporating the geographical context information. To cope with these challenges, we propose a novel autoencoder-based model to learn the complex user-POI relations, namely SAE-NAD, which consists of a self-attentive encoder (SAE) and a neighbor-aware decoder (NAD). In particular, unlike previous works equally treat users' checked-in POIs, our self-attentive encoder adaptively differentiates the user preference degrees in multiple aspects, by adopting a multi-dimensional attention mechanism. To incorporate the geographical context information, we propose a neighbor-aware decoder to make users' reachability higher on the similar and nearby neighbors of checked-in POIs, which is achieved by the inner product of POI embeddings together with the radial basis function (RBF) kernel. To evaluate the proposed model, we conduct extensive experiments on three real-world datasets with many state-of-the-art methods and evaluation metrics. The experimental results demonstrate the effectiveness of our model.
Traditional practice recommends that information retrieval experiments be run over multiple test collections, to support, if not prove, that gains in performance are likely to generalize to other collections or tasks. However, because of the pooling assumptions, evaluation scores are not directly comparable across different test collections. We present a widely-used statistical tool, \em meta-analysis, as a framework for reporting results from IR experiments using multiple test collections. We demonstrate the meta-analytical approach through two standard experiments on stemming and pseudo-relevance feedback, and compare the results to those obtained from score standardization. Meta-analysis incorporates several recent recommendations in the literature, including score standardization, reporting effect sizes rather than score differences, and avoiding a reliance on null-hypothesis statistical testing, in a unified approach. It therefore represents an important methodological improvement over using these techniques in isolation.
Consistency of relevance judgments is a vital issue for the construction of test collections in information retrieval. As human relevance assessments are costly, and large collections can contain many documents of varying relevance, collecting reliable judgments is a critical component to building reusable test collections. We explore the impact of document presentation order on human relevance assessments. Our primary goal is to determine if assessor disagreement can be minimized through the order in which documents are presented to assessors. To achieve this goal, we compare two commonly used presentation orderings with a new ordering designed to aid assessors to more easily discriminate between relevant and non-relevant documents. By carefully controlling the presentation ordering, assessors can more quickly converge on a consistent notion of relevance during the assessment exercise, leading to higher overall judging agreement. In addition, important interactions between presentation ordering and topic difficulty on assessor agreement are highlighted. Our findings suggest that document presentation order does indeed have a substantial impact on assessor agreement , and that our new ordering is more robust than previous approaches across a variety of different topic types.
Reading is a complex cognitive activity in many information retrieval related scenarios, such as relevance judgement and question answering. There exists plenty of works which model these processes as a matching problem, which focuses on how to estimate the relevance score between a document and a query. However, little is known about what happened during the reading process, i.e., how users allocate their attention while reading a document during a specific information retrieval task. We believe that a better understanding of this process can help us design better weighting functions inside the document and contributes to the improvement of ranking performance. In this paper, we focus on the reading process during relevance judgement task. We designed a lab-based user study to investigate human reading patterns in assessing a document, where users' eye movements and their labeled relevant text were collected, respectively. Through a systematic analysis into the collected data, we propose a two-stage reading model which consists of a preliminary relevance judgement stage (Stage 1) and a reading with preliminary relevance stage (Stage 2). In addition, we investigate how different behavior biases affect users' reading behaviors in these two stages. Taking these biases into consideration, we further build prediction models for user's reading attention. Experiment results show that query independent features outperform query dependent features, which indicates that users allocate attentions based on many signals other than query terms in this process. Our study sheds light on the understanding of users' attention allocation during relevance judgement and provides implications for improving the design of existing ranking models.
The goal of diagnosis prediction task is to predict the future health information of patients from their historical Electronic Healthcare Records (EHR). The most important and challenging problem of diagnosis prediction is to design an accurate, robust and interpretable predictive model. Existing work solves this problem by employing recurrent neural networks (RNNs) with attention mechanisms, but these approaches suffer from the data sufficiency problem. To obtain good performance with insufficient data, graph-based attention models are proposed. However, when the training data are sufficient, they do not offer any improvement in performance compared with ordinary attention-based models. To address these issues, we propose KAME, an end-to-end, accurate and robust model for predicting patients' future health information. KAME not only learns reasonable embeddings for nodes in the knowledge graph, but also exploits general knowledge to improve the prediction accuracy with the proposed knowledge attention mechanism. With the learned attention weights, KAME allows us to interpret the importance of each piece of knowledge in the graph. Experimental results on three real world datasets show that the proposed KAME significantly improves the prediction performance compared with the state-of-the-art approaches, guarantees the robustness with both sufficient and insufficient data, and learns interpretable disease representations.
Social media platforms are increasingly being used to share and seek advice on mental health issues. In particular, Reddit users freely discuss such issues on various subreddits, whose structure and content can be leveraged to formally interpret and relate subreddits and their posts in terms of mental health diagnostic categories. There is prior research on the extraction of mental health-related information, including symptoms, diagnosis, and treatments from social media; however, our approach can additionally provide actionable information to clinicians about the mental health of a patient in diagnostic terms for web-based intervention. Specifically, we provide a detailed analysis of the nature of subreddit content from domain expert's perspective and introduce a novel approach to map each subreddit to the best matching DSM-5 (Diagnostic and Statistical Manual of Mental Disorders - 5th Edition) category using multi-class classifier. Our classification algorithm analyzes all the posts of a subreddit by adapting topic modeling and word-embedding techniques, and utilizing curated medical knowledge bases to quantify relationship to DSM-5 categories. Our semantic encoding-decoding optimization approach reduces the false-alarm-rate from 30% to 2.5% over a comparable heuristic baseline, and our mapping results have been verified by domain experts achieving a kappa score of 0.84.
With the recent availability of Electronic Health Records (EHR) and great opportunities they offer for advancing medical informatics, there has been growing interest in mining EHR for improving quality of care. Disease diagnosis due to its sensitive nature, huge costs of error, and complexity has become an increasingly important focus of research in past years. Existing studies model EHR by capturing co-occurrence of clinical events to learn their latent embeddings. However, relations among clinical events carry various semantics and contribute differently to disease diagnosis which gives precedence to a more advanced modeling of heterogeneous data types and relations in EHR data than existing solutions. To address these issues, we represent how high-dimensional EHR data and its rich relationships can be suitably translated into HeteroMed, a heterogeneous information network for robust medical diagnosis. Our modeling approach allows for straightforward handling of missing values and heterogeneity of data. HeteroMed exploits metapaths to capture higher level and semantically important relations contributing to disease diagnosis. Furthermore, it employs a joint embedding framework to tailor clinical event representations to the disease diagnosis goal. To the best of our knowledge, this is the first study to use Heterogeneous Information Network for modeling clinical data and disease diagnosis. Experimental results of our study show superior performance of HeteroMed compared to prior methods in prediction of exact diagnosis codes and general disease cohorts. Moreover, HeteroMed outperforms baseline models in capturing similarities of clinical events which are examined qualitatively through case studies.
Signed directed networks with positive or negative links convey rich information such as like or dislike, trust or distrust. Existing work of sign prediction mainly focuses on triangles (triadic nodes) motivated by balance theory to predict positive and negative links. However, real-world signed directed networks can contain a good number of "bridge'' edges which, by definition, are not included in any triangles. Such edges are ignored in previous work, but may play an important role in signed directed network analysis.%Such edges serve as fundamental building blocks and may play an important role in signed network analysis.
In this paper, we investigate the problem of learning representations for signed directed networks. We present a novel deep learning approach to incorporating two social-psychologic theories, balance and status theories, to model both triangles and "bridge'' edges in a complementary manner. The proposed framework learns effective embeddings for nodes and edges which can be applied to diverse tasks such as sign prediction and node ranking. Experimental results on three real-world datasets of signed directed social networks verify the essential role of "bridge" edges in signed directed network analysis by achieving the state-of-the-art performance.
Semi-supervised learning and multi-label learning pose different challenges for feature selection, which is one of the core techniques for dimension reduction, and the exploration of reducing feature space for multi-label learning with incomplete label information is far from satisfactory. Existing feature selection approaches devote attention to either of two issues, namely, alleviating negative effects of imperfectly predicted labels and quantitatively evaluating label correlations, exclusively for semi-supervised or multi-label scenarios. A unified framework to extract label correlation information with incomplete prior knowledge and embed this information in feature selection however, is rarely touched. In this paper, we propose a space consistency-based feature selection model to address this issue. Specifically, correlation information in feature space is learned based on the probabilistic neighborhood similarities, and correlation information in label space is optimized by preserving feature-label space consistency. This mechanism contributes to appropriately extracting label information in semi-supervised multi-label learning scenario and effectively employing this information to select discriminative features. An extensive experimental evaluation on real-world data shows the superiority of the proposed approach under various evaluation metrics.
PARAFAC2 has demonstrated success in modeling irregular tensors, where the tensor dimensions vary across one of the modes. An example scenario is modeling treatments across a set of patients with the varying number of medical encounters over time. Despite recent improvements on unconstrained PARAFAC2, its model factors are usually dense and sensitive to noise which limits their interpretability. As a result, the following open challenges remain: a) various modeling constraints, such as temporal smoothness, sparsity and non-negativity, are needed to be imposed for interpretable temporal modeling and b) a scalable approach is required to support those constraints efficiently for large datasets. To tackle these challenges, we propose a COnstrained PARAFAC2 (COPA) method, which carefully incorporates optimization constraints such as temporal smoothness, sparsity, and non-negativity in the resulting factors. To efficiently support all those constraints, COPA adopts a hybrid optimization framework using alternating optimization and alternating direction method of multiplier (AO-ADMM). As evaluated on large electronic health record (EHR) datasets with hundreds of thousands of patients, COPA achieves significant speedups (up to 36 times faster) over prior PARAFAC2 approaches that only attempt to handle a subset of the constraints that COPA enables. Overall, our method outperforms all the baselines attempting to handle a subset of the constraints in terms of speed, while achieving the same level of accuracy. Through a case study on temporal phenotyping of medically complex children, we demonstrate how the constraints imposed by COPA reveal concise phenotypes and meaningful temporal profiles of patients. The clinical interpretation of both the phenotypes and the temporal profiles was confirmed by a medical expert.
In this work, we address the problem of blocking in the context of author name disambiguation. We describe a framework that formalizes different ways of name-matching to determine which names could potentially refer to the same author. We focus on name variations that follow from specifying a name with different completeness (i.e. full first name or only initial). We extend this framework by a simple way to define traditional, new and custom blocking schemes. Then, we evaluate different old and new schemes in the Web of Science. In this context we define and compare a new type of blocking schemes. Based on these results, we discuss the question whether name-matching can be used in blocking evaluation as a replacement of annotated author identifiers. Finally, we argue that blocking can have a strong impact on the application and evaluation of author disambiguation.
The goal of hashing is to learn a low-dimensional binary representation of high-dimensional information, leading to a tremendous reduction of computational cost. Previous studies usually achieved this goal by applying projection or quantization methods. However, the projection method fails to capture the intrinsic data structures, and the quantization method cannot make full use of complete information by its strategy of partitioning original space. To combine their advantages and avoid their drawbacks, we propose a novel algorithm, termed as representative points quantization (RPQ), by using the representative points defined as the barycenters of points in the hyperoctants. To settle the problem of exponential time complexity with the growth of the coding length, for long hashing codes, we further propose a parallel RPQ (PRPQ) algorithm, by separating a long code into several short codes, re-coding the short codes in different low dimensional subspaces, and then concatenating them to a long code. Experiments on image retrieval tasks demonstrate that RPQ and PRPQ can well capture the main topology structure of data, showing that our algorithm achieves better performance than state-of-the-art methods.
Attributed network embedding focuses on learning low-dimensional latent representations of nodes which can well preserve the original topological and node attributed proximity at the same time. Existing works usually assume that nodes with similar topology or similar attributes should also be close in the embedding space. This assumption ignores the phenomenon of partial correlation between network topological and node attributed similarities i.e. nodes with similar topology may be dissimilar in their attributes and vice versa. Partial correlation between the two information sources should be considered especially when there exist fraudulent edges (i.e., information from one source is vague) or unbalanced data distributions (i.e, topology structure similarity and node attribute similarity have different distributions). However, it is very challenging to consider the partial correlation between topology and attributes due to the heterogeneity of these two information sources. In this paper, we take partial correlation between topology and attributes into account and propose the Personalized Relation Ranking Embedding (PRRE) method for attributed networks which is capable of exploiting the partial correlation between node topology and attributes. The proposed PRRE model utilizes two thresholds to define different node relations and employs the Expectation-Maximization (EM) algorithm to learn these thresholds as well as other embedding parameters. Extensive experiments results on multiple real-world datasets show that the proposed PRRE model significantly outperforms the state-of-the-art methods in terms of various evaluation metrics.
Heterogeneous Information Network(HIN) has been employed in recommender system to represent heterogeneous types of data, and meta path has been proposed to capture semantic relationship among objects. When applying HIN to the recommendation, there are two problems: how to extract features from meta paths and how to properly fuse these features to further improve recommendations. Some recent work has employed deep neural network to learn user and item representation, and attention mechanism has been explored to integrate information for recommendation. Inspired by these work, in this paper, we propose Heterogeneous Neural Attentive Factorization Machine(HNAFM) to solve above problems. Specifically, we first calculate the commuting matrices based on meta paths and use multilayer perceptrons to learn user and item features. A hierarchical attention mechanism is employed to find the meta path that best describes user's preference and item's property. Comprehensive experiments based on real-world datasets demonstrate that the proposed HNAFM significantly outperforms state-of-the-art rating prediction methods.
RNNs have been shown to be excellent models for sequential data and in particular for data that is generated by users in an session-based manner. The use of RNNs provides impressive performance benefits over classical methods in session-based recommendations. In this work we introduce novel ranking loss functions tailored to RNNs in the recommendation setting. The improved performance of these losses over alternatives, along with further tricks and refinements described in this work, allow for an overall improvement of up to 35% in terms of MRR and [email protected] over previous session-based RNN solutions and up to 53% over classical collaborative filtering approaches. Unlike data augmentation-based improvements, our method does not increase training times significantly. We further demonstrate the performance gain of the RNN over baselines in an online A/B test.
Multi-task Multi-view (MTMV) learning has recently undergone noticeable development for dealing with heterogeneous data. To exploit information from both related tasks and related views, a common strategy is to model task relatedness and view consistency separately. The drawback of this strategy is that it did not consider the interactions between tasks and views. To remedy this, we propose a novel method, racBFA, by adding rank constraints to asymmetric bilinear factor analyzers (aBFA). We then adapt racBFA to our MTMV learning problem and design a new MTMV learning algorithm, racMTMV. We evaluated racMTMV on 3 real-world data sets. The experimental results demonstrated the effectiveness of our proposed method.
We consider a scenario that occurs often in the auto insurance industry. We are given a large collection of trajectories that stem from many different drivers. Only a small number of the trajectories are labeled with driver identifiers, and only some drivers are used in labels. The problem is to label correctly the unlabeled trajectories with driver identifiers. This is important in auto insurance to detect possible fraud and to identify the driver in, e.g., pay-as-you-drive settings when a vehicle has been involved in an incident. To solve the problem, we first propose a Trajectory-to-Image( T2I) encoding scheme that captures both geographic features and driving behavior features of trajectories in 3D images. Next, we propose a multi-task, deep learning model called T2INet for estimating the total number of drivers in the unlabeled trajectories, and then we partition the unlabeled trajectories into groups so that the trajectories in a group belong to the same driver. Experimental results on a large trajectory data set offer insight into the design properties of T2INet and demonstrate that T2INet is capable of outperforming baselines and the state-of-the-art method.
The massively available data about user engagement with online information service systems provides a gold mine about users' latent intents. It calls for quantitative user behavior modeling. In this paper, we study the problem by looking into users' sequential interactive behaviors. Inspired by the concepts of episodic memory and semantic memory in cognitive psychology, which describe how users' behaviors are differently influenced by past experience, we propose a Long- and Short-term Hawkes Process model. It models the short-term dependency between users' actions within a period of time via a multi-dimensional Hawkes process and the long-term dependency between actions across different periods of time via a one dimensional Hawkes process. Experiments on two real-world user activity log datasets (one from an e-commerce website and one from a MOOC website) demonstrate the effectiveness of our model in capturing the temporal dependency between actions in a sequence of user behaviors. It directly leads to improved accuracy in predicting the type and the time of the next action. Interestingly, the inferred dependency between actions in a sequence sheds light on the underlying user intent behind direct observations and provides insights for downstream applications.
An Open Information Extraction (OIE) system processes textual data to extract assertions, which are structured data typically represented in the form of (subject;relation; object) triples. An Open Knowledge Base (OKB) is a collection of such assertions. We study the problem of canonicalizing an OKB, which is defined as the problem of mapping each name (a textual term such as "the rockies", "colorado rockies") to a canonical form (such as "rockies"). Galárraga et al.  proposed a hierarchical agglomerative clustering algorithm using canopy clustering to tackle the canonicalization problem. The algorithm was shown to be very effective. However, it is not efficient enough to practically handle large OKBs due to the large number of similarity score computations. We propose the FAC algorithm for solving the canonicalization problem. FAC employs pruning techniques to avoid unnecessary similarity computations, and bounding techniques to efficiently approximate and identify small similarities. In our experiments, FAC registers ordersof-magnitude speedups over other approaches.
In this paper, we advance the state-of-the-art in topic modeling by means of the design and development of a novel (semi-formal) general topic modeling framework. The novel contributions of our solution include: (i) the introduction of new semantically-enhanced data representations for topic modeling based on pooling, and (ii) the proposal of a novel topic extraction strategy - ASToC - that solves the difficulty in representing topics in our semantically-enhanced information space. In our extensive experimentation evaluation, covering 12 datasets and 12 state-of-the-art baselines, totalizing 108 tests, we exceed (with a few ties) in almost 100 cases, with gains of more than 50% against the best baselines (achieving up to 80% against some runner-ups). We provide qualitative and quantitative statistical analyses of why our solutions work so well. Finally, we show that our method is able to improve document representation in automatic text classification.
This paper addresses the problem ofmulti-instance entity typing from corpus. Current approaches mainly rely on the structured features (\textitattributes, attribute-value pairs andtags ) of the entities. However, their effectiveness is largely dependent on the completeness of structured features, which unfortunately is not guaranteed in KBs. In this paper, we therefore propose to use the text corpus of an entity to infer its types, and propose a multi-instance method to tackle this problem. We take each mention of an entity in KBs as an instance of the entity, and learn the types of these entities from multiple instances. Specifically, we first use an end-to-end neural network model to type each instance of an entity, and then use an integer linear programming (ILP) method to aggregate the predicted type results from multiple instances. Experimental results show the effectiveness of our method.
We investigate how generative adversarial nets (GANs) can help semi-supervised learning on graphs. We first provide insights on working principles of adversarial learning over graphs and then present GraphSGAN, a novel approach to semi-supervised learning on graphs. In GraphSGAN, generator and classifier networks play a novel competitive game. At equilibrium, generator generates fake samples in low-density areas between subgraphs. In order to discriminate fake samples from the real, classifier implicitly takes the density property of subgraph into consideration. An efficient adversarial learning algorithm has been developed to improve traditional normalized graph Laplacian regularization with a theoretical guarantee. Experimental results on several different genres of datasets show that the proposed GraphSGAN significantly outperforms several state-of-the-art methods. GraphSGAN can be also trained using mini-batch, thus enjoys the scalability advantage.
Given a labeled graph, the frequent-subgraph mining (FSM) problem asks to find all the k-vertex subgraphs that appear with frequency greater than a given threshold. FSM has numerous applications ranging from biology to network science, as it provides a compact summary of the characteristics of the graph. However, the task is challenging, even more so for evolving graphs due to the streaming nature of the input and the exponential time complexity of the problem.
In this paper, we initiate the study of the approximate FSM problem in both incremental and fully-dynamic streaming settings, where arbitrary edges can be added or removed from the graph. For each streaming setting, we propose algorithms that can extract a high-quality approximation of the frequent k-vertex subgraphs for a given threshold, at any given time instance, with high probability. In contrast to the existing state-of-the-art solutions that require iterating over the entire set of subgraphs for any update, our algorithms operate by maintaining a uniform sample of k-vertex subgraphs with optimized neighborhood-exploration procedures local to the updates. We provide theoretical analysis of the proposed algorithms and empirically demonstrate that the proposed algorithms generate high-quality results compared to baselines.
A large number of deep learning models have been proposed for the text matching problem, which is at the core of various typical natural language processing (NLP) tasks. However, existing deep models are mainly designed for the semantic matching between a pair of short texts, such as paraphrase identification and question answering, and do not perform well on the task of relevance matching between short-long text pairs. This is partially due to the fact that the essential characteristics of short-long text matching have not been well considered in these deep models. More specifically, these methods fail to handle extreme length discrepancy between text pieces and neither can they fully characterize the underlying structural information in long text documents. In this paper, we are especially interested in relevance matching between a piece of short text and a long document, which is critical to problems like query-document matching in information retrieval and web searching. To extract the structural information of documents, an undirected graph is constructed, with each vertex representing a keyword and the weight of an edge indicating the degree of interaction between keywords. Based on the keyword graph, we further propose a Multiresolution Graph Attention Network to learn multi-layered representations of vertices through a Graph Convolutional Network (GCN), and then match the short text snippet with the graphical representation of the document with an attention mechanism applied over each layer of the GCN. Experimental results on two datasets demonstrate that our graph approach outperforms other state-of-the-art deep matching models.
Microblogs have become one of the most popular platforms for news sharing. However, due to its openness and lack of supervision, rumors could also be easily posted and propagated on social networks, which could cause huge panic and threat during its propagation. In this paper, we detect rumors by leveraging hierarchical representations at different levels and the social contexts. Specifically, we propose a novel hierarchical neural network combined with social information (HSA-BLSTM). We first build a hierarchical bidirectional long short-term memory model for representation learning. Then, the social contexts are incorporated into the network via attention mechanism, such that important semantic information is introduced to the framework for more robust rumor detection. Experimental results on two real world datasets demonstrate that the proposed method outperforms several state-of-the-arts in both rumor detection and early detection scenarios.
Users' consumption behaviors are affected by both their personal preference and their exposure to items (i.e. whether a user knows the items).Most of the recent works in social recommendation assume that people share similar preference with their socially connected friends. However, this assumption may not hold due to the diversity of social relations, and modeling social influence on users' preference may not be suitable for implicit feedback data (i.e. whether a user has consumed certain items). Since users often share item information with their social relations, it will be less restrictive to model social influence on users' exposure to items. We notice that a user's exposure is affected by the exposure of the other users in his social communities and by the consumption of his connected friends. In this paper, we propose a novel social exposure-based recommendation model SoEXBMF by integrating two kinds of social influence on users' exposure, i.e. social knowledge influence and social consumption influence, into basic EXMF model for better recommendation performance. Furthermore, SoEXBMF uses Bernoulli distribution instead of Gaussian distribution in EXMF to better model the binary implicit feedback data. A variational inference method has been developed for the proposed SoEXBMF model to infer the posterior and make the recommendations. Extensive experiments on three real-world datasets demonstrate the superiority of our method over existing methods in various evaluation metrics.
This study investigates how people carefully search for the Web to obtain credible and accurate information. The goal of this study is to better understand people's attitudes toward careful information seeking via Web search, and the relationship between such attitudes and their daily search behaviors. To this end, we conducted two experiments. We first administrated an online questionnaire to investigate how people's attitudes toward using the strategies for verifying information in the Web search process differ based on various factors such as their credulity toward Web information, individual thinking styles, educational background, and search expertise. We then analyzed their one-year and one-month query logs of a commercial Web search engine to explore how their daily search behaviors are different according to their attitudes. The analysis of the questionnaire and the query logs obtained from ¥subjects participants revealed that (i) the people's attitudes toward using the verification strategies in Web search are positively correlated to their Need for Cognition (NFC), educational background, and search expertise; (ii) people with strong attitudes are likely to click lower-ranked search results than those with intermediate levels of attitude; (iii) people with strong attitudes are more likely to use the terms such as "evidence'' or "truth'' in their queries, possibly to scrutinize the uncertain or incredible information; and (iv) the behavioral differences found in (ii) and (iii) are not identified from the differences in the participants' educational backgrounds. These findings help us explore future directions for a new Web search system that encourages people to be more careful in Web search, and suggest the need for an educational program or training to facilitate the attitudes and skills for using Web search engines to obtain accurate information.
Recently, dataless text classification has attracted increasing attention. It trains a classifier using seed words of categories, rather than labeled documents that are expensive to obtain. However, a small set of seed words may provide very limited and noisy supervision information, because many documents contain no seed words or only irrelevant seed words. In this paper, we address these issues using document manifold, assuming that neighboring documents tend to be assigned to a same category label. Following this idea, we propose a novel Laplacian seed word topic model (LapSWTM). In LapSWTM, we model each document as a mixture of hidden category topics, each of which corresponds to a distinctive category. Also, we assume that neighboring documents tend to have similar category topic distributions. This is achieved by incorporating a manifold regularizer into the log-likelihood function of the model, and then maximizing this regularized objective. Experimental results show that our LapSWTM significantly outperforms the existing dataless text classification algorithms and is even competitive with supervised algorithms to some extent. More importantly, it performs extremely well when the seed words are scarce.
Deep neural networks are gaining increasing popularity for the classic text classification task, due to their strong expressive power and less requirement for feature engineering. Despite such attractiveness, neural text classification models suffer from the lack of training data in many real-world applications. Although many semi-supervised and weakly-supervised text classification models exist, they cannot be easily applied to deep neural models and meanwhile support limited supervision types. In this paper, we propose a weakly-supervised method that addresses the lack of training data in neural text classification. Our method consists of two modules: (1) a pseudo-document generator that leverages seed information to generate pseudo-labeled documents for model pre-training, and (2) a self-training module that bootstraps on real unlabeled data for model refinement. Our method has the flexibility to handle different types of weak supervision and can be easily integrated into existing deep neural models for text classification. We have performed extensive experiments on three real-world datasets from different domains. The results demonstrate that our proposed method achieves inspiring performance without requiring excessive training data and outperforms baseline methods significantly.
Automatic short answer grading remains one of the key challenges of any dialog-based tutoring system due to the variability in the student answers. Typically, each question may have no or few expert authored exemplary answers which make it difficult to (1) generalize to all correct ways of answering the question, or (2) represent answers which are either partially correct or incorrect. In this paper, we propose an affinity propagation based clustering technique to obtain class-specific representative answers from the graded student answers. Our novelty lies in formulating the Scoring Rubric by incorporating class-specific representatives obtained after proposed clustering, selecting, and ranking of graded student answers. We experiment with baseline as well as stateof-the-art sentence-embedding based features to demonstrate the feature-agnostic utility of class-specific representative answers. Experimental evaluations on our large-scale industry dataset and a benchmarking dataset show that the Scoring Rubric significantly improves the classification performance of short answer grading.
Mastering the dynamics of social influence requires separating, in a database of information propagation traces, the genuine causal processes from temporal correlation, i.e., homophily and other spurious causes. However, most studies to characterize social influence, and, in general, most data-science analyses focus on correlations, statistical independence, or conditional independence. Only recently, there has been a resurgence of interest in "causal data science,'' e.g., grounded on causality theories. In this paper we adopt a principled causal approach to the analysis of social influence from information-propagation data, rooted in the theory of probabilistic causation. Our approach consists of two phases. In the first one, in order to avoid the pitfalls of misinterpreting causation when the data spans a mixture of several subtypes ("Simpson's paradox''), we partition the set of propagation traces into groups, in such a way that each group is as less contradictory as possible in terms of the hierarchical structure of information propagation. To achieve this goal, we borrow the notion of "agony'' and define the Agony-bounded Partitioning problem, which we prove being hard, and for which we develop two efficient algorithms with approximation guarantees. In the second phase, for each group from the first phase, we apply a constrained MLE approach to ultimately learn a minimal causal topology. Experiments on synthetic data show that our method is able to retrieve the genuine causal arcs w.r.t. a ground-truth generative model. Experiments on real data show that, by focusing only on the extracted causal structures instead of the whole social graph, the effectiveness of predicting influence spread is significantly improved.
Visual Question Answering (VQA) aims to learn a joint embedding of the question sentence and the corresponding image to infer the answer. Existing approaches learn the joint embedding don't consider the answer-related information, which results in that the learned representation is not effective to reflect the answer of the question. To address this problem, this paper proposes a novel method, i.e., Adversarial Learning of Answer-Related Representation (ALARR) for visual question answering, which seeks an effective answer-related representation for the question-image pair based on adversarial learning between two processes. The embedding learning process aims to generate modality-invariant joint representations for the question-image and question-answer pairs, respectively. Meanwhile, it tries to confuse the other process, embedding discriminator, which tries to discriminate the two representations from different modalities of pairs. Specifically, the joint embedding of the question-image pair is learned by a three-level attention model, and the joint representation of the question-answer pair is learned by a semantic integration model. Through the adversarial leaning, the answer-related representation are better preserved. Then an answer predictor is proposed to infer the answer from the answer-related representation. Experiments conducted on two widely used VQA benchmark datasets demonstrate that the proposed model outperforms the state-of-the-art approaches.
CMO Council reports that 71% of internet users in the U.S. were influenced by coupons and discounts when making their purchase decisions. It has also been shown that offering coupons to a small fraction of users may affect the purchase decisions of many other users in a social network. This motivates us to study stochastic coupon probing problem in social networks. Assume there is a social network and a set of coupons. We can offer coupons to some users adaptively and those users who accept the offer will act as seeds and influence their friends in the social network. There are two constraints which are called the inner and outer constraints, respectively. The set of coupons redeemed by users must satisfy inner constraints, and the set of all probed users must satisfy outer constraints. One seeks to develop a coupon probing policy that achieves the maximum influence while satisfying both inner and outer constraints. Our main result is a constant approximation policy for the stochastic coupon probing problem for any monotone submodular utility function.
Linked Open Data (LOD) and social media often contain the representations of the same real-world entities, such as persons and organizations. These representations are increasingly interlinked, making it possible to combine and leverage both LOD and social media data in prediction problems, complementing their relative strengths: while LOD knowledge is highly structured but also scarce and obsolete for some entities, social media data provide real-time updates and increased coverage, albeit being mostly unstructured. In this paper, we investigate the feasibility of using social media data to perform type prediction for entities in a LOD knowledge graph. We discuss how to gather training data for such a task, and how to build an efficient domain-independent vector representation of entities based on social media data. Our experiments on several type prediction tasks using DBpedia and Twitter data show the effectiveness of this representation, both alone and combined with knowledge graph-based features, suggesting its potential for ontology population.
Sequences of microscopic images feature the dynamics of developing embryos. Automatically tracking the cells from such sequences of images allows understanding the dynamics which a living element demands to know its cells movement, which ideally should take place in real-time. The traditional tracking pipeline starts with image acquisition, data transfer, image segmentation to separate cells from the background, and then the actual tracking step. To speed up this pipeline, we hypothesize that a process capable of predicting the cell motion according to previous observations is useful. The solution must be accurate, fast and lightweight, and be able to iterate between the various components. In this work we propose CM-Predictor, which takes advantage of previous positions of cells to estimate their motion. When estimation takes place, we can omit costly acquisition, transfer and process of images, speeding up the tracking pipeline. The designed solution monitors the error of prediction, adapting the model whenever needed. For validation, we use four different datasets with sequences of images with developing embryos. Then we compare the estimated motion vectors of CM-Predictor with traditional tracking methods. Experimental results show that CM-Predictor is able to accurately estimate the motion vectors. In fact, CM-Predictor maintains the prediction quality of other algorithms and performs faster than them.
In large and medium-sized cities, detecting unusual changes of crowds of people on the streets is needed for public security, transportation management, emergency control, and terrorism prevention. As public transportation has the capability to bring a large number of people to an area in a short amount of time, real-time discovery of anomalies in passenger numbers is an effective way to detect crowd anomalies. In this paper, we devise an approach called Kochab. Kochab adopts a generative model and combines the prior knowledge about passenger flows. Hence, it can detect anomalies in the numbers of incoming and outgoing passengers within a certain time and spatial area, including anomalous events along with their durations and severities. Through well-designed inference algorithms, Kochab requires only a moderate amount of historical data to be sample data. As such, Kochab shows good performance in real time and makes prompt responses to user' s interactive analysis requests. In particular, based on the recognized anomalous events, we capture event patterns which give us hints to link to activities or status in cities. In addition, for the convenience of method evaluation and comparison, we create an open Stream Anomaly Benchmark on the basis of large-scale real-world data. This benchmark will prove useful for other researchers too. Using this benchmark, we compare Kochab with four other methods. The experimental results show that Kochab is sensitive to population flow anomalies and has superior accuracy in detecting anomalies in terms of precision, recall and the F1 score.
As data streams become more prevalent, the necessity for online algorithms that mine this transient and dynamic data becomes clearer. Multi-label data stream classification is a supervised learning problem where each instance in the data stream is classified into one or more pre-defined sets of labels. Many methods have been proposed to tackle this problem, including but not limited to ensemble-based methods. Some of these ensemble-based methods are specifically designed to work with certain multi-label base classifiers; some others employ online bagging schemes to build their ensembles. In this study, we introduce a novel online and dynamically-weighted stacked ensemble for multi-label classification, called GOOWE-ML, that utilizes spatial modeling to assign optimal weights to its component classifiers. Our model can be used with any existing incremental multi-label classification algorithm as its base classifier. We conduct experiments with 4 GOOWE-ML-based multi-label ensembles and 7 baseline models on 7 real-world datasets from diverse areas of interest. Our experiments show that GOOWE-ML ensembles yield consistently better results in terms of predictive performance in almost all of the datasets, with respect to the other prominent ensemble models.
Leveraging historical behavioral data (e.g., sales volume and email communication) for future prediction is of fundamental importance for practical domains ranging from sales to temporal link prediction. Current forecasting approaches often use only a single time resolution (e.g., daily or weekly), which truncates the range of observable temporal patterns. However, real-world behavioral time series typically exhibit patterns across multi-dimensional temporal patterns, yielding dependencies at each level. To fully exploit these underlying dynamics, this paper studies the forecasting problem for behavioral time series data with the consideration of multiple time resolutions and proposes a multi-resolution time series forecasting framework, RESolution-aware Time series Forecasting (RESTFul). In particular, we first develop a recurrent framework to encode the temporal patterns at each resolution. In the fusion process, a convolutional fusion framework is proposed, which is capable of learning conclusive temporal patterns for modeling behavioral time series data to predict future time steps. Our extensive experiments demonstrate that the RESTFul model significantly outperforms the state-of-the-art time series prediction techniques on both numerical and categorical behavioral time series data.
Given multiple time series data, how can we efficiently find latent patterns in an arbitrary time range? Singular value decomposition (SVD) is a crucial tool to discover hidden factors in multiple time series data, and has been used in many data mining applications including dimensionality reduction, principal component analysis, recommender systems, etc. Along with its static version, incremental SVD has been used to deal with multiple semi-infinite time series data and to identify patterns of the data. However, existing SVD methods for the multiple time series data analysis do not provide functionality for detecting patterns of data in an arbitrary time range: standard SVD requires data for all intervals corresponding to a time range query, and incremental SVD does not consider an arbitrary time range.
In this paper, we propose Zoom-SVD, a fast and memory efficient method for finding latent factors of time series data in an arbitrary time range. Zoom-SVD incrementally compresses multiple time series data block by block to reduce the space cost in storage phase, and efficiently computes singular value decomposition (SVD) for a given time range query in query phase by carefully stitching stored SVD results. Through extensive experiments, we demonstrate that Zoom-SVD is up to 15x faster, and requires 15x less space than existing methods. Our case study shows that Zoom-SVD is useful for capturing past time ranges whose patterns are similar to a query time range.
More and more data need to be processed or analyzed within mobile devices for efficiency or privacy reasons, but performing machine learning tasks with large data within the devices is challenging because of their limited memory resources. For this reason, disk-based machine learning methods have been actively researched, which utilize storage resources without holding all the data in memory. This paper proposes D-MC2, a novel disk-based matrix completion method that (1) supports incremental data update (i.e., data insertion and deletion) and (2) spills both data and model to disk when necessary; these functionalities are not supported by existing methods. First, D-MC2 builds a two-layered index to efficiently support incremental data update; there exists a trade-off relationship between model learning and data update costs, and our two-layered index simultaneously optimizes the two costs. Second, we develop a window-based stochastic gradient descent (SGD) scheduler to efficiently support the dual spilling; a huge amount of disk I/O is incurred when the size of model is larger than that of memory, and our new scheduler substantially reduces it. Our evaluation results show that D-MC2 is significantly more scalable and faster than other disk-based competitors under the limited memory environment. In terms of the co-optimization, D-MC2 outperforms the baselines that only optimize one of the two costs up to 48x. Furthermore, the window-based scheduler improves the training speed 12.4x faster compared to a naive scheduler.
It is well known that a direct parallelization of sequential optimization methods (e.g., coordinate descent and stochastic gradient methods) is often not effective. The reason is that at each iteration, the number of operations may be too small. We point out that this common understanding may not be true if the algorithm sequentially accesses the data in a feature-wise manner. For almost all real-world sparse sets we have examined, some features are much denser than others. Thus a direct parallelization of loops in a sequential method may result in excellent speedup. This approach possesses an advantage of retaining all convergence results because the algorithm is not changed at all. We apply this idea on coordinate descent (CD) methods, which are effective single-thread technique for L1-regularized classification. Further, an investigation on the shrinking technique commonly used to remove some features in the training process shows that this technique helps the parallelization of CD methods. Experiments indicate that a naive parallelization achieves better speedup than existing methods that laboriously modify the algorithm to achieve parallelism. Though a bit ironic, we conclude that the naive parallelization of the CD method is a highly competitive and robust multi-core implementation for L1-regularized classification.
Given a high-order, large-scale and sparse data from big data and industrial applications, how can we acquire useful patterns in a real-time and low memory overhead manner? Sparse Non-negative tensor factorization (SNTF) possesses high-order representation, non-negativity and dimension reduction inherence. Thus, SNTF has become a useful tool to represent and analyze the sparse data, which has been incorporated with extra contextual information, i.e., time and location, etc, more than the matrix, which can only model the 2 ways data. However, current SNTF techniques suffer from a) non-linear time and space overhead, b) intermediate data explosion, and c) inability on GPU and multi-GPU. To address these issues, a single-thread-based SNTF is proposed, which involves the feature elements rather than on the whole factor matrices, and can avoid the forming of large-scale intermediate matrices. Then, a CUDA parallelizing single-thread-based SNTF (CUSNTF) model is proposed for industrial applications on GPU and multi-GPU (MCUSNTF). Thus, CUSNTF has linear computing and space complexity, and linear communication cost on multi-GPU. We implement CUSNTF and MCUSNTF on 8 P100 GPUs, and compare it with state-of-the-art parallel and distributed methods. Experimental results from several industrial datasets demonstrate that the linear scalability and efficiency of CUSNTF.
To support effective data exploration, there has been a growing interest in developing solutions that can automatically recommend data visualizations that reveal interesting and useful data-driven insights. In such solutions, a large number of possible data visualization views are generated and ranked according to some metric of importance (e.g., a deviation-based metric), then the top-k most important views are recommended. However, one drawback of that approach is that it often recommends similar views, leaving the data analyst with a limited amount of gained insights. To address that limitation, in this work we posit that employing diversification techniques in the process of view recommendation allows eliminating that redundancy and provides a good and concise coverage of the possible insights to be discovered. To that end, we propose a hybrid objective utility function, which captures both the importance, as well as the diversity of the insights revealed by the recommended views. While in principle, traditional diversification methods (e.g., Greedy Construction) provide plausible solutions under our proposed utility function, they suffer from a significantly high query processing cost. In particular, directly applying such methods leads to a "process-first-diversify-next" approach, in which all possible data visualization are generated first via executing a large number of aggregate queries. To address that challenge, we propose an integrated scheme called DiVE, which efficiently selects the top-k recommended view based on our hybrid utility function. DiVE leverages the properties of both the importance and diversity metrics to prune a large number of query executions without compromising the quality of recommendations. Our experimental evaluation on real datasets shows the performance gains provided by DiVE.
We study the problem of representing and recommending products for grocery shopping. We carefully investigate grocery transaction data and observe three important patterns: products within the same basket complement each other in terms of functionality (complementarity); users tend to purchase products that match their preferences (compatibility); and a significant fraction of users repeatedly purchase the same products over time (loyalty). Unlike conventional e-commerce settings, complementarity and loyalty are particularly predominant in the grocery shopping domain. This motivates a new representation learning approach to leverage complementarity and compatibility holistically, as well as a new recommendation approach to explicitly account for users' 'must-buy' purchases in addition to their overall preferences and needs. Doing so not only improves product classification and recommendation performance on both public and proprietary transaction data covering various grocery store types, but also reveals interesting findings about the relationships between preferences, necessity, and loyalty in consumer purchases.
Recommender Systems have proliferated as general-purpose approaches to model a wide variety of consumer interaction data. Specific instances make use of signals ranging from user feedback, item relationships, geographic locality, social influence (etc.). Typically, research proceeds by showing that making use of a specific signal (within a carefully designed model) allows for higher-fidelity recommendations on a particular dataset. Of course, the real situation is more nuanced, in which a combination of many signals may be at play, or favored in different proportion by individual users. Here we seek to develop a framework that is capable of combining such heterogeneous item relationships by simultaneously modeling (a) what modality of recommendation is a user likely to be susceptible to at a particular point in time; and (b) what is the best recommendation from each modality. Our method borrows ideas from mixtures-of-experts approaches as well as knowledge graph embeddings. We find that our approach naturally yields more accurate recommendations than alternatives, while also providing intuitive 'explanations' behind the recommendations it provides.
Tensor-based methods have shown promise in improving upon traditional matrix factorization methods for recommender systems. But tensors may achieve improved recommendation quality while worsening the fairness of the recommendations. Hence, we propose a novel fairness-aware tensor recommendation framework that is designed to maintain quality while dramatically improving fairness. Four key aspects of the proposed framework are: (i) a new sensitive latent factor matrix for isolating sensitive features; (ii) a sensitive information regularizer that extracts sensitive information which can taint other latent factors; (iii) an effective algorithm to solve the proposed optimization model; and (iv) extension to multi-feature and multi-category cases which previous efforts have not addressed. Extensive experiments on real-world and synthetic datasets show that the framework enhances recommendation fairness while preserving recommendation quality in comparison with state-of-the-art alternatives.
In recent years, the task of reformulating natural language queries has received considerable attention from both industry and academic communities. Because of the lexical chasm problem between natural language queries and web documents, if we directly use natural language queries as inputs for retrieval, the results are usually unsatisfactory. In this work, we formulated the task as a translation problem to convert natural language queries into keyword queries. Since the nature language queries users input are diverse and multi-faceted, general encoder-decoder models cannot effectively handle low-frequency words and out-of-vocabulary words. We propose a novel encoder-decoder method with two decoders: the pointer decoder firstly extracts query terms directly from the source text via copying mechanism, then the generator decoder generates query terms using two attention modules simultaneously considering the source text and extracted query terms. For evaluation and training, we also proposed a semi-automatic method to construct a large-scale dataset about natural language query-keyword query pairs. Experimental results on this dataset demonstrated that our model could achieve better performance than the previous state-of-the-art methods.
The problem of ad-hoc structured document retrieval arises in many information access scenarios, from Web to product search. Yet neither deep neural networks, which have been successfully applied to ad-hoc information retrieval and Web search, nor the attention mechanism, which has been shown to significantly improve the performance of deep neural networks on natural language processing tasks, have been explored in the context of this problem. In this paper, we propose a deep neural architecture for ad-hoc structured document retrieval, which utilizes attention mechanism to determine important phrases in keyword queries as well as the relative importance of matching those phrases in different fields of structured documents. Experimental evaluation on publicly available collections for Web document, product and entity retrieval from knowledge graphs indicates superior retrieval accuracy of the proposed neural architecture relative to both state-of-the-art neural architectures for ad-hoc document retrieval and probabilistic models for ad-hoc structured document retrieval.
Intelligent assistants are increasingly being used on smart speaker devices, such as Amazon Echo, Google Home, Apple Homepod, and Harmon Kardon Invoke with Cortana. Typically, user satisfaction measurement relies on user interaction signals, such as clicks and scroll movements, in order to determine if a user was satisfied. However, these signals do not exist for smart speakers, which creates a challenge for user satisfaction evaluation on these devices. In this paper, we propose a new signal, user intent, as a means to measure user satisfaction. We propose to use this signal to model user satisfaction in two ways: 1) by developing intent sensitive word embeddings and then using sequences of these intent sensitive query representations to measure user satisfaction; 2) by representing a user's interactions with a smart speaker as a sequence of user intents and thus using this sequence to identify user satisfaction. Our experimental results indicate that our proposed user satisfaction models based on the intent-sensitive query representations have statistically significant improvements over several baselines in terms of common classification evaluation metrics. In particular, our proposed task satisfaction prediction model based on intent-sensitive word embeddings has a 11.81% improvement over a generative model baseline and 6.63% improvement over a user satisfaction prediction model based on Skip-gram word embeddings in terms of the F1 metric. Our findings have implications for the evaluation of Intelligent Assistant systems.
Task and session identification is a key element of system evaluation and user behavior modeling in Intelligent Assistant (IA) systems. However, identifying task and sessions for IAs is challenging due to the multi-task nature of IAs and the differences in the ways they are used on different platforms, such as smart-phones, cars, and smart speakers. Furthermore, usage behavior may differ among users depending on their expertise with the system and the tasks they are interested in performing. In this study, we investigate how to identify tasks and sessions in IAs given these differences. To do this, we analyze data based on the interaction logs of two IAs integrated with smart-speakers. We fit Gaussian Mixture Models to estimate task and session boundaries and show how a model with 3 components models user interactivity time better than a model with 2 components. We then show how session boundaries differ for users depending on whether they are in a learning-phase or not. Finally, we study how user inter-activity times differs depending on the task that the user is trying to perform. Our findings show that there is no single task or session boundary that can be used for IA evaluation. Instead, these boundaries are influenced by the experience of the user and the task they are trying to perform. Our findings have implications for the study and evaluation of Intelligent Agent Systems.
The Min-Hashing approach to sketching has become an important tool in data analysis, information retrial, and classification. To apply it to real-valued datasets, the ICWS algorithm has become a seminal approach that is widely used, and provides state-of-the-art performance for this problem space. However, ICWS suffers a computational burden as the sketch size K increases. We develop a new Simplified approach to the ICWS algorithm, that enables us to obtain over 20x speedups compared to the standard algorithm. The veracity of our approach is demonstrated empirically on multiple datasets and scenarios, showing that our new Simplified CWS obtains the same quality of results while being an order of magnitude faster.
Linear models have been widely used in the industry for their low computation time, small memory footprint and interpretability. However, linear models are not capable of leveraging non-linear feature interactions in predicting the target. This limits their performance. A classical approach to overcome this limitation is to use combinations of the original features, referred as higher-order features, to capture non-linearity. The number of higher-order features can be very large. Selecting the informative ones among them that are predictive of the target is essential for scalability. This is computationally expensive, requiring large memory footprint. In this paper, we propose a novel scalable MinHash based scheme to select informative higher-order features. Unlike typical use of MinHash for near-duplicate entity detection and association-rule mining, we use MinHash signature of features to approximate mutual information between higher-order features and target to enable their selection. By analyzing the running time and memory requirements, we show that our proposal is highly efficient in terms of running time and storage compared to existing alternatives. We demonstrate through experiments on multiple benchmark datasets that our proposed approach is not only scalable, but also able to identify the most important feature interactions resulting in improved model performance.
Determining the similarity between two objects is pertinent to many applications. When the basis for similarity is a set of object-to-object relationships, it is natural to rely on graph-theoretic measures. One seminal technique for measuring the structural-context similarity between a pair of graph vertices is SimRank, whose underlying intuition is that two objects are similar if they are connected by similar objects. However, by design, SimRank as well as its variants capture only a single view or perspective of similarity. Meanwhile, in many real-world scenarios, there emerge multiple perspectives of similarity, i.e., two objects may be similar from one perspective, but dissimilar from another. For instance, human subjects may generate varied, yet valid, clusterings of objects. In this work, we propose a graph-theoretic similarity measure that is natively multiperspective. In our approach, the observed object-to-object relationships due to various perspectives are integrated into a unified graph-based representation, stylised as a hypergraph to retain the distinct perspectives. We then introduce a novel model for learning and reflecting diverse similarity perceptions given the hypergraph, yielding the similarity score between any pair of objects from any perspective. In addition to proposing an algorithm for computing the similarity scores, we also provide theoretical guarantees on the convergence of the algorithm. Experiments on public datasets show that the proposed model deals better with multiperspectivity than the baselines.
Current crowdsourcing platforms provide little support for worker feedback. Workers are sometimes invited to post free text describing their experience and preferences in completing tasks. They can also use forums such as Turker Nation1 to exchange preferences on tasks and requesters. In fact, crowdsourcing platforms rely heavily on observing workers and inferring their preferences implicitly. On the contrary, we believe that asking workers to indicate their preferences explicitly will allow us to improve different processes in crowdsourcing platforms. We initiate a study that leverages explicit elicitation from workers to capture the evolving nature of worker preferences and we propose an optimization framework to better understand and estimate task completion time. We design a Worker model to estimate task completion time whose accuracy is improved iteratively by requesting worker preferences for task factors, such as, required skills, task payment, and task relevance. We develop efficient solutions with guarantees, run extensive experiments with large-scale real-world data that show the benefit of explicit preference elicitation over implicit ones with statistical significance.
Crowd Flow Prediction (CFP) is one major challenge in the intelligent transportation systems of the Sydney Trains Network. However, most advanced CFP methods only focus on entrance and exit flows at the major stations or a few subway lines, neglecting Crowd Flow Distribution (CFD) forecasting problem across the entire city network. CFD prediction plays an irreplaceable role in metro management as a tool that can help authorities plan route schedules and avoid congestion. In this paper, we propose three online non-negative matrix factorization (ONMF) models. ONMF-AO incorporates an Average Optimization strategy that adapts to stable passenger flows. ONMF-MR captures the Most Recent trends to achieve better performance when sudden changes in crowd flow occur. The Hybrid model, ONMF-H, integrates both ONMF-AO and ONMF-MR to exploit the strengths of each model in different scenarios and enhance the models' applicability to real-world situations. Given a series of CFD snapshots, both models learn the latent attributes of the train stations and, therefore, are able to capture transition patterns from one timestamp to the next by combining historic guidance. Intensive experiments on a large-scale, real-world dataset containing transactional data demonstrate the superiority of our ONMF models.
Information Retrieval systems rely on large test collections to measure their effectiveness in retrieving relevant documents. While the demand is high, the task of creating such test collections is laborious due to the large amounts of data that need to be annotated, and due to the intrinsic subjectivity of the task itself. In this paper we study the topical relevance from a user perspective by addressing the problems of subjectivity and ambiguity. We compare our approach and results with the established TREC annotation guidelines and results. The comparison is based on a series of crowdsourcing pilots experimenting with variables, such as relevance scale, document granularity, annotation template and the number of workers. Our results show correlation between relevance assessment accuracy and smaller document granularity, i.e., aggregation of relevance on paragraph level results in a better relevance accuracy, compared to assessment done at the level of the full document. As expected, our results also show that collecting binary relevance judgments results in a higher accuracy compared to the ternary scale used in the TREC annotation guidelines. Finally, the crowdsourced annotation tasks provided a more accurate document relevance ranking than a single assessor relevance label. This work resulted is a reliable test collection around the TREC Common Core track.
Recently, many methods have been proposed to prevent privacy leakage in record linkage by encoding record pair data into another anonymous space. Nevertheless, they cannot perform well in some circumstances due to high computational complexities, low privacy guarantees or loss of data utility.
In this paper, we propose distance-aware encoding mechanisms to compare numerical values in the anonymous space. We first embed numerical values into Hamming space by a low-computational encoding algorithm with randomized bit vector. To provide rigorous privacy guarantees, we use the random response based on differential privacy to keep global indistinguishability of original data and use Laplace noises via pufferfish mechanism to provide local indistinguishability. Besides, we provide an approach for embedding and privacy-related parameters selection to improve data utility. Experiments on datasets from different data distributions and application contexts validate that our approaches can be used efficiently in privacy-preserving record linkage tasks compared with previous works and have excellent performance even under very small privacy budgets.
Data privacy is a major concern in modern society. In this work, we propose two solutions to the privacy-preserving problem of regression models on medical data. We focus on flexible parametric models which are powerful alternatives to the well-known Cox regression model. For the first approach, we propose a sampling mechanism which guarantees differential privacy for flexible parametric survival models. We first transform the likelihood function of the models to guarantee that likelihood values are bounded. We then use a Hamiltonian Monte-Carlo sampler to sample a random parameter vector from the posterior distribution. As a result, this random vector satisfies the requirement for differential privacy. For the second approach, as predictions with high accuracy and high confidence are very important for medical applications, we propose a mechanism which protects privacy by randomly perturbing the posterior distribution. We can then use the sampler to draw multiple random samples of the perturbed posterior to estimate the credible intervals of the parameters. The proposed mechanism does not guarantee differential privacy for the perturbed posterior. However, it allows controlling the contribution of each individual data record to the posterior. In the worst case scenario, when all data records are revealed except the target data record, the random noise added to the posterior would make it extremely difficult to obtain the target data record. The experiments conducted on two real datasets show that our proposed approaches outperform state-of-the-art methods in predicting the survival rate of individuals.
Triangle count is a critical parameter in mining relationships among people in social networks. However, directly publishing the findings obtained from triangle counts may bring potential privacy concern, which raises great challenges and opportunities for privacy-preserving triangle counting. In this paper, we choose to use differential privacy to protect triangle counting for large scale graphs. To reduce the large sensitivity caused in large graphs, we propose a novel graph projection method that can be used to obtain an upper bound for sensitivity in different distributions. In particular, we publish the triangle counts satisfying the node-differential privacy with two kinds of histograms: the triangle count distribution and the cumulative distribution. Moreover, we extend the research on privacy preserving triangle counting to one of its applications, the local clustering coefficient. Experimental results show that the cumulative distribution can fit the real statistical information better, and our proposed mechanism has achieved better accuracy for triangle counts while maintaining the requirement of differential privacy.
Online Learning to Rank (OLTR) methods optimize rankers based on user interactions. State-of-the-art OLTR methods are built specifically for linear models. Their approaches do not extend well to non-linear models such as neural networks. We introduce an entirely novel approach to OLTR that constructs a weighted differentiable pairwise loss after each interaction: Pairwise Differentiable Gradient Descent (PDGD). PDGD breaks away from the traditional approach that relies on interleaving or multileaving and extensive sampling of models to estimate gradients. Instead, its gradient is based on inferring preferences between document pairs from user clicks and can optimize any differentiable model. We prove that the gradient of PDGD is unbiased w.r.t. user document pair preferences. Our experiments on the largest publicly available Learning to Rank (LTR) datasets show considerable and significant improvements under all levels of interaction noise. PDGD outperforms existing OLTR methods both in terms of learning speed as well as final convergence. Furthermore, unlike previous OLTR methods, PDGD also allows for non-linear models to be optimized effectively. Our results show that using a neural network leads to even better performance at convergence than a linear model. In summary, PDGD is an efficient and unbiased OLTR approach that provides a better user experience than previously possible.
The recently proposed quantum language model (QLM) aimed at a principled approach to modeling term dependency by applying the quantum probability theory. The latest development for a more effective QLM has adopted word embeddings as a kind of global dependency information and integrated the quantum-inspired idea in a neural network architecture. While these quantum-inspired LMs are theoretically more general and also practically effective, they have two major limitations. First, they have not taken into account the interaction among words with multiple meanings, which is common and important in understanding natural language text. Second, the integration of the quantum-inspired LM with the neural network was mainly for effective training of parameters, yet lacking a theoretical foundation accounting for such integration. To address these two issues, in this paper, we propose a Quantum Many-body Wave Function (QMWF) inspired language modeling approach. The QMWF inspired LM can adopt the tensor product to model the aforesaid interaction among words. It also enables us to reveal the inherent necessity of using Convolutional Neural Network (CNN) in QMWF language modeling. Furthermore, our approach delivers a simple algorithm to represent and match text/sentence pairs. Systematic evaluation shows the effectiveness of the proposed QMWF-LM algorithm, in comparison with the state of the art quantum-inspired LMs and a couple of CNN-based methods, on three typical Question Answering (QA) datasets.
How to optimize ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) is an important but challenging problem, because ranking metrics are either flat or discontinuous everywhere, which makes them hard to be optimized directly. Among existing approaches, LambdaRank is a novel algorithm that incorporates ranking metrics into its learning procedure. Though empirically effective, it still lacks theoretical justification. For example, the underlying loss that LambdaRank optimizes for remains unknown until now. Due to this, there is no principled way to advance the LambdaRank algorithm further. In this paper, we present LambdaLoss, a probabilistic framework for ranking metric optimization. We show that LambdaRank is a special configuration with a well-defined loss in the LambdaLoss framework, and thus provide theoretical justification for it. More importantly, the LambdaLoss framework allows us to define metric-driven loss functions that have clear connection to different ranking metrics. We show a few cases in this paper and evaluate them on three publicly available data sets. Experimental results show that our metric-driven loss functions can significantly improve the state-of-the-art learning-to-rank algorithms.
Relational join is a central data management operation that influences the performance of almost every database query. In this paper, we show that different input features and hardware settings necessitate different main-memory hash join models. Subsequently, we identify four particular models by which hash-based join algorithms can be executed and propose a novel polymorphic paradigm that dynamically subscribes to the best model given workload and hardware characteristics. We refer to our polymorphic paradigm as PolyHJ and suggest a corresponding implementation, which consists of two mechanisms, namely, in-place, cache-aware partitioning (ICP) and collaborative building and probing (ColBP). ICP and ColBP serve substantially in reducing multi-core cache misses, memory bandwidth usage, and cross-socket traffic. Our experimental results demonstrate that PolyHJ can successfully select the right models for the tested workloads and significantly outperform the current state-of-the-art hash-based join schemes.
Recent studies show that table scans are increasingly more common than using secondary indices. Given that the optimizer may choose table scans when the selectivity is as low as 0.5% with large data, it is important to make initial query results faster for interactive data explorations. We formulate it as a query result timeliness problem, and propose two complementary approaches. The first approach builds lightweight statistics and judiciously determines an access order to data blocks for a given query. The second approach performs adaptive microscopic tuple reordering online without relying on pre-built statistics. Our systematic experimental evaluation further verifies the efficiency and efficacy of our approaches.
This paper presents a new method for constructing an optimal feature set from sequential data. It creates a dictionary of n-grams of variable length (we call them v-grams), based on the minimum description length principle. The proposed method is a dictionary coder and works simultaneously as both a compression algorithm and as unsupervised feature extraction. The length of constructed v-grams is not limited by any bound and exceeds 100 characters in provided experiments. Constructed v-grams can be used for any sequential data analysis and allows transfer bag-of-word techniques to non-text data types. The method demonstrates a high compression rate on various real-life datasets. Extracted features generate a practical basis for text classification, that shows competitive results on standard text classification collections without using the text structure. Combining extracted character v-grams with the words from the original text we achieved substantially better classification quality than on words or v-grams alone.
Recommender systems are aimed at generating a personalized ranked list of items that an end user might be interested in. With the unprecedented success of deep learning in computer vision and speech recognition, recently it has been a hot topic to bridge the gap between recommender systems and deep neural network. And deep learning methods have been shown to achieve state-of-the-art on many recommendation tasks. For example, a recent model, NeuMF, first projects users and items into some shared low-dimensional latent feature space, and then employs neural nets to model the interaction between the user and item latent features to obtain state-of-the-art performance on the recommendation tasks. NeuMF assumes that the non-interacted items are inherent negative and uses negative sampling to relax this assumption. In this paper, we examine an alternative approach which does not assume that the non-interacted items are necessarily negative, just that they are less preferred than interacted items. Specifically, we develop a new classification strategy based on the widely used pairwise ranking assumption. We combine our classification strategy with the recently proposed neural collaborative filtering framework, and propose a general collaborative ranking framework called Neural Network based Collaborative Ranking (NCR). We resort to a neural network architecture to model a user's pairwise preference between items, with the belief that neural network will effectively capture the latent structure of latent factors. The experimental results on two real-world datasets show the superior performance of our models in comparison with several state-of-the-art approaches.
Two products are substitutes if both can satisfy the same consumer need. Intrinsic incorporation of product substitutability - where substitutability is integrated within latent vector space models - is in contrast to the extrinsic re-ranking of result lists. The fusion of text matching and product substitutability objectives allows latent vector space models to mix and match regularities contained within text descriptions and substitution relations. We introduce a method for intrinsically incorporating product substitutability within latent vector space models for product search that are estimated using gradient descent; it integrates flawlessly with state-of-the-art vector space models. We compare our method to existing methods for incorporating structural entity relations, where product substitutability is incorporated extrinsically by re-ranking. Our method outperforms the best extrinsic method on four benchmarks. We investigate the effect of different levels of text matching and product similarity objectives, and provide an analysis of the effect of incorporating product substitutability on product search ranking diversity. Incorporating product substitutability information improves search relevance at the cost of diversity.
With the recent growth in the use of conversational systems and intelligent assistants such as Google Assistant and Microsoft Cortana, mobile devices are becoming even more pervasive in our lives. As a consequence, users are getting engaged with mobile apps and frequently search for an information need using different apps. Recent work has stated the need for a unified mobile search system that would act as meta search on users' mobile devices: it would identify the target apps for the user's query, submit the query to the apps, and present the results to the user. Moreover, mobile devices provide rich contextual information about users and their whereabouts. In this paper, we introduce the task of context-aware target apps selection as part of a unified mobile search framework. To this aim, we designed an in situ study to collect thousands of mobile queries enriched with mobile sensor data from 255 users during a three month period. With the aid of this dataset, we were able to study user behavior as they performed cross-app search. We finally study the performance of state-of-the-art retrieval models for this task and propose a simple yet effective neural model that significantly outperforms the baselines. Our neural approach is based on learning high-dimensional representations for mobile apps and contextual information. Furthermore, we show that incorporating context improves the performance by 20% in terms of [email protected], enabling the model to perform better for 57% of users. Our data is publicly available for research purposes.
Community structure is ubiquitous in real-world complex networks. The task of community detection over these networks is of paramount importance in a variety of applications. Recently, nonnegative matrix factorization (NMF) has been widely adopted for community detection due to its great interpretability and its natural fitness for capturing the community membership of nodes. However, the existing NMF-based community detection approaches are shallow methods. They learn the community assignment by mapping the original network to the community membership space directly. Considering the complicated and diversified topology structures of real-world networks, it is highly possible that the mapping between the original network and the community membership space contains rather complex hierarchical information, which cannot be interpreted by classic shallow NMF-based approaches. Inspired by the unique feature representation learning capability of deep autoencoder, we propose a novel model, named Deep Autoencoder-like NMF (DANMF), for community detection. Similar to deep autoencoder, DANMF consists of an encoder component and a decoder component. This architecture empowers DANMF to learn the hierarchical mappings between the original network and the final community assignment with implicit low-to-high level hidden attributes of the original network learnt in the intermediate layers. Thus, DANMF should be better suited to the community detection task. Extensive experiments on benchmark datasets demonstrate that DANMF can achieve better performance than the state-of-the-art NMF-based community detection approaches.
The task of dialogue generation aims to automatically provide responses given previous utterances. Tracking dialogue states is an important ingredient in dialogue generation for estimating users' intention. However, the expensive nature of state labeling and the weak interpretability make the dialogue state tracking a challenging problem for both task-oriented and non-task-oriented dialogue generation: For generating responses in task-oriented dialogues, state tracking is usually learned from manually annotated corpora, where the human annotation is expensive for training; for generating responses in non-task-oriented dialogues, most of existing work neglects the explicit state tracking due to the unlimited number of dialogue states.
In this paper, we propose the semi-supervised explicit dialogue state tracker (SEDST) for neural dialogue generation. To this end, our approach has two core ingredients: CopyFlowNet and posterior regularization. Specifically, we propose an encoder-decoder architecture, named CopyFlowNet, to represent an explicit dialogue state with a probabilistic distribution over the vocabulary space. To optimize the training procedure, we apply a posterior regularization strategy to integrate indirect supervision. Extensive experiments conducted on both task-oriented and non-task-oriented dialogue corpora demonstrate the effectiveness of our proposed model. Moreover, we find that our proposed semi-supervised dialogue state tracker achieves a comparable performance as state-of-the-art supervised learning baselines in state tracking procedure.
Destination prediction is known as an important problem for many location based services (LBSs). Existing solutions generally apply probabilistic models to predict destinations over a sub-trajectory, but their accuracies in fine-granularity prediction are always not satisfactory due to the data sparsity problem. This paper presents a carefully designed deep learning model called TALL model for destination prediction. It not only takes advantage of the bidirectional Long Short-Term Memory (LSTM) network for sequence modeling, but also gives more attention to meaningful locations that have strong correlations w.r.t. destination by adopting attention mechanism. Furthermore, a hierarchical model that explores the fusion of multi-granularity learning capability is further proposed to improve the accuracy of prediction. Extensive experiments on Beijing and Chengdu real datasets finally demonstrate that our proposed models outperform existing methods without considering external features.
As urban crimes (e.g., burglary and robbery) negatively impact our everyday life and must be addressed in a timely manner, predicting crime occurrences is of great importance for public safety and urban sustainability. However, existing methods do not fully explore dynamic crime patterns as factors underlying crimes may change over time. In this paper, we develop a new crime prediction framework--DeepCrime, a deep neural network architecture that uncovers dynamic crime patterns and carefully explores the evolving inter-dependencies between crimes and other ubiquitous data in urban space. Furthermore, our DeepCrime framework is capable of automatically capturing the relevance of crime occurrences across different time periods. In particular, our DeepCrime framework enables predicting crime occurrences of different categories in each region of a city by i) jointly embedding all spatial, temporal, and categorical signals into hidden representation vectors, and ii) capturing crime dynamics with an attentive hierarchical recurrent network. Extensive experiments on real-world datasets demonstrate the superiority of our framework over many competitive baselines across various settings.
In online advertising, the Internet users may be exposed to a sequence of different ad campaigns, i.e., display ads, search, or referrals from multiple channels, before led up to any final sales conversion and transaction. For both campaigners and publishers, it is fundamentally critical to estimate the contribution from ad campaign touch-points during the customer journey (conversion funnel) and assign the right credit to the right ad exposure accordingly. However, the existing research on the multi-touch attribution problem lacks a principled way of utilizing the users' pre-conversion actions (i.e., clicks), and quite often fails to model the sequential patterns among the touch points from a user's behavior data. To make it worse, the current industry practice is merely employing a set of arbitrary rules as the attribution model, e.g., the popular last-touch model assigns 100% credit to the final touch-point regardless of actual attributions. In this paper, we propose a Dual-attention Recurrent Neural Network (DARNN) for the multi-touch attribution problem. It learns the attribution values through an attention mechanism directly from the conversion estimation objective. To achieve this, we utilize sequence-to-sequence prediction for user clicks, and combine both post-view and post-click attribution patterns together for the final conversion estimation. To quantitatively benchmark attribution models, we also propose a novel yet practical attribution evaluation scheme through the proxy of budget allocation (under the estimated attributions) over ad channels. The experimental results on two real datasets demonstrate the significant performance gains of our attribution model against the state of the art.
Real-time bidding (RTB) is an important mechanism in online display advertising, where a proper bid for each page view plays an essential role for good marketing results. Budget constrained bidding is a typical scenario in RTB where the advertisers hope to maximize the total value of the winning impressions under a pre-set budget constraint. However, the optimal bidding strategy is hard to be derived due to the complexity and volatility of the auction environment. To address these challenges, in this paper, we formulate budget constrained bidding as a Markov Decision Process and propose a model-free reinforcement learning framework to resolve the optimization problem. Our analysis shows that the immediate reward from environment is misleading under a critical resource constraint. Therefore, we innovate a reward function design methodology for the reinforcement learning problems with constraints. Based on the new reward design, we employ a deep neural network to learn the appropriate reward so that the optimal policy can be learned effectively. Different from the prior model-based work, which suffers from the scalability problem, our framework is easy to be deployed in large-scale industrial applications. The experimental evaluations demonstrate the effectiveness of our framework on large-scale real datasets.
With the amount of data has been rapidly growing over recent decades, binary hashing has become an attractive approach for fast search over large databases, in which the high-dimensional data such as image, video or text is mapped into a low-dimensional binary code. Searching in this hamming space is extremely efficient which is independent of the data size. A lot of methods have been proposed to learn this binary mapping. However, to make the binary codes conserves the input information, previous works mostly resort to mean squared error, which is prone to lose a lot of input information . On the other hand, most of the previous works adopt the norm constraint or approximation on the hidden representation to make it as close as possible to binary, but the norm constraint is too strict that harms the expressiveness and flexibility of the code.
In this paper, to generate desirable binary codes, we introduce two adversarial training procedures to the hashing process. We replace the L2 reconstruction error with an adversarial training process to make the codes reserve its input information, and we apply another adversarial learning discriminator on the hidden codes to make it proximate to binary. With the adversarial training process, the generated codes are getting close to binary while also conserves the input information. We conduct comprehensive experiments on both supervised and unsupervised hashing applications and achieves a new state of the arts result on many image hashing benchmarks.
Deep metric learning is widely used in extreme classification and image retrieval because of its powerful ability to learn the semantic low-dimensional embedding of high-dimensional data. However, the heavy computational cost of mining valuable pair or triplet of training data and updating models frequently in existing deep metric learning approaches becomes a barrier to apply such methods to a large-scale real-world context in a distributed environment. Moreover, existing distributed deep learning framework is not designed for deep metric learning tasks, because it is difficult to implement a smart mining policy of valuable training data. In this paper, we introduce a novel distributed framework to speed up the training process of the deep metric learning using multiple machines. Specifically, we first design a distributed sampling method to find the hard-negative samples from a broader scope of candidate samples compared to the single-machine solution. Then, we design a hybrid communication pattern and implement a decentralized data-parallel framework to reduce the communication workload while the quality of the trained deep metric models is preserved. In experiments, we show excellent performance gain compared to a full spectrum of state-of-the-art deep metric learning models on multiple datasets in terms of image clustering and image retrieval tasks.
This paper poses a non-linear dynamical system on bipartite graphs and shows its stability under certain conditions. The dynamical system changes the weights on the nodes of the graph in each time step. The underlying weight transformation is non-linear, motivated by information gain in a document retrieval setting. Stability analysis of this problem is therefore more involved than that of PageRank-like algorithms. We show convergence using methods from Lyapunov theory and also provide some examples of how the algorithm performs when ranking keywords and sentences in a set of documents.
Heteregeneous information networks (HINs) are ubiquitous in the real world, and discovering the abnormal events plays an important role in understanding and analyzing the HIN. The abnormal event usually implies that the number of co-occurrences of entities in a HIN are very rare, so most of the existing works are based on detecting the rare patterns of events. However, we find that the number of co-occurrences of majority entities in events are the same, which brings great challenge to distinguish the normal and abnormal events. Therefore, we argue that considering the heterogeneous information structure only is not sufficient for abnormal event detection and introducing additional valuable information is necessary. In this paper, we propose a novel deep heterogeneous network embedding method which incorporates the entity attributes and second-order structures simultaneously to address this problem. Specifically, we utilize type-aware Multilayer Perceptron (MLP) component to learn the attribute embedding, and adopt the autoencoder framework to learn the second-order aware embedding. Then based on the mixed embeddings, we are able to model the pairwise interactions of different entities, such that the events with small entity compatibilities have large abnormal event score. The experimental results on real world network demonstrate the effectiveness of our proposed method.
Most existing knowledge graphs (KGs) in academic domains suffer from problems of insufficient multi-relational information, name ambiguity and improper data format for large-scale machine processing. In this paper, we present AceKG, a new large-scale KG in academic domain. AceKG not only provides clean academic information, but also offers a large-scale benchmark dataset for researchers to conduct challenging data mining projects including link prediction, community detection and scholar classification. Specifically, AceKG describes 3.13 billion triples of academic facts based on a consistent ontology, including necessary properties of papers, authors, fields of study, venues and institutes, as well as the relations among them. To enrich the proposed knowledge graph, we also perform entity alignment with existing databases and rule-based inference. Based on AceKG, we conduct experiments of three typical academic data mining tasks and evaluate several state-of-the-art knowledge embedding and network representation learning approaches on the benchmark datasets built from AceKG. Finally, we discuss promising research directions that benefit from AceKG.
In recent times, deep neural networks have found success in Collaborative Filtering (CF) based recommendation tasks. By parametrizing latent factor interactions of users and items with neural architectures, they achieve significant gains in scalability and performance over matrix factorization. However, the long-tail phenomenon in recommender performance persists on the massive inventories of online media or retail platforms. Given the diversity of neural architectures and applications, there is a need to develop a generalizable and principled strategy to enhance long-tail item coverage.
In this paper, we propose a novel adversarial training strategy to enhance long-tail recommendations for users with Neural CF (NCF) models. The adversary network learns the implicit association structure of entities in the feedback data while the NCF model is simultaneously trained to reproduce these associations and avoid the adversarial penalty, resulting in enhanced long-tail performance. Experimental results show that even without auxiliary data, adversarial training can boost long-tail recall of state-of-the-art NCF models by up to 25%, without trading-off overall performance. We evaluate our approach on two diverse platforms, content tag recommendation in Q&A forums and movie recommendation.
Search queries issued over the Web increasingly look like questions, especially as the domain becomes more specific. Finding good response to such queries amounts to finding relevant passages from Web documents. Traditional information retrieval based Web search still matches the query to the words in the entire document. With the advent of machine reading comprehension techniques, Web search is moving more towards identifying the best sentence / group of sentences in the document. We present AQuPR an A ttention based Qu ery P assage R etrieval system to find human acceptable answer containing passages to technology queries issued over the Web. We train character level embeddings for the query and passage pairs, train a deep recurrent network with a novel simplified attention mechanism and incorporate additional signals present in Web documents to improve the performance of such a system. We collect a database of human issued queries along with their answer passages and learn an end to end system to enable automated query resolution. We present results for answering human issued search queries which show considerable promise against basic versions of current generation question answering systems.
In previous work on text summarization, encoder-decoder architectures and attention mechanisms have both been widely used. Attention-based encoder-decoder approaches typically focus on taking the sentences preceding a given sentence in a document into account for document representation, failing to capture the relationships between a sentence and sentences that follow it in a document in the encoder. We propose an attentive encoder-based summarization (AES) model to generate article summaries. AES can generate a rich document representation by considering both the global information of a document and the relationships of sentences in the document. A unidirectional recurrent neural network (RNN) and a bidirectional RNN are considered to construct the encoders, giving rise to unidirectional attentive encoder-based summarization (Uni-AES) and bidirectional attentive encoder-based summarization (Bi-AES), respectively. Our experimental results show that Bi-AES outperforms Uni-AES. We obtain substantial improvements over a relevant start-of-the-art baseline.
We show that click models trained with suboptimal hyperparameters suffer from the issue of bad calibration. This means that their predicted click probabilities do not agree with the observed proportions of clicks in the held-out data. To repair this discrepancy, we adapt a non-parametric calibration method called isotonic regression. Our experimental results show that isotonic regression significantly improves click models trained with suboptimal hyperparameters in terms of perplexity, and that it makes click models less sensitive to the choice of hyperparameters. Interestingly, the relative ranking of existing click models in terms of their predictive performance changes depending on whether or not their predictions are calibrated. Therefore, we advocate that calibration becomes a mandatory part of the click model evaluation protocol.
Serendipity is highly valued as a process for developing original solutions to problems and for innovation. However, it is difficult to capture and thus difficult to measure, but novelty is a key and critical indicator. In this work, we investigate the relationship between user behavioural actions and perceived novelty in the context of browsing. 180 participants completed an open-ended browsing task, while their behaviour actions were tracked. Each seven-action sequence was analysed with respect to the participant's perception of Novelty. Results showed that 6 of the 7 actions map to a sub-sequence that discriminates between high and low novelty. Notably, switching between exploration and immersion, and checking SERPs about the same request in-depth are indicative of highly perceived novelty. The results show that analysing behavioural action sequences leads to better prediction of novelty, and thus the potential for serendipity, than individual browsing actions.
The accurate prediction of users' future topics of interests on social networks can facilitate content recommendation and platform engagement. However, researchers have found that future interest prediction, especially on social networks such as Twitter, is quite challenging due to the rapid changes in community topics and evolution of user interactions. In this context, temporal collaborative filtering methods have already been used to perform user interest prediction, which benefit from similar user behavioral patterns over time to predict how a user's interests might evolve in the future. In this paper, we propose that instead of considering the whole user base within a collaborative filtering framework to predict user interests, it is possible to much more accurately predict such interests by only considering the behavioral patterns of the most influential users related to the user of interest. We model influence as a form of causal dependency between users. To this end, we employ the concept of Granger causality to identify causal dependencies. We show through extensive experimentation that the consideration of only one causally dependent user leads to much more accurate prediction of users' future interests in a host of measures including ranking and rating accuracy metrics.
This paper discusses challenges of an online evaluation technique, multileaved comparison, based on the analysis of evaluation results in a community question-answering (cQA) search service. NTCIR-13 OpenLiveQ task offered a shared task in which participants addressed an ad-hoc retrieval task in a cQA service, and evaluated their rankers by multileaved comparison, which combines multiple rankings to generate a single search result page, and simultaneously evaluates the different rankings based on users' clicks on the search result page. Since the number of search result impressions during the evaluation period might not suffice to evaluate a hundred of rankers, we conducted the online evaluation only for rankers that achieved high performance in offline evaluation. The analysis of evaluation results showed that offline and online evaluation results did not fully agree, and a large number of users' clicks were necessary to find a statistically significant difference for every ranker pair. To cope with these problems in large-scale multileaved comparison, we propose a new experimental design that evaluates all the rankers online but intensively tests only the top-k rankers. Simulation-based experiments demonstrated that Copeland counting algorithm could achieve high top-k recall in the top-k identification problem for multileaved comparison.
In the educational framework, knowledge assessment is a critical component, and quizzes (sets of questions with concise answers) are a popular tool for this purpose. This paper focuses on the generation of balanced quizzes, i.e., quizzes that relate to a given set of documents, and to the central concepts described by the documents, in an evenly distributed manner. Our approach leverages a graph representing the relationships between questions, documents, and concepts, and phrases quiz construction as a node selection problem in this graph. We provide algorithms for constructing the graph and for selecting a good set of quiz questions. In our concrete implementation, we build quizzes for a collection of Wikipedia articles and evaluate them both with simulated students and with real human quiz takers, finding that our balanced quizzes are better suited at determining which articles the user has not read (corresponding to their knowledge gaps) than reasonable baselines.
In this paper we explore the use of Continuation Methods and Curriculum Learning techniques in the area of Learning to Rank. The basic idea is to design the training process as a learning path across increasingly complex training instances and objective functions. We propose to instantiate continuation methods in Learning to Rank by changing the IR measure to optimize during training, and we present two different curriculum learning strategies to identify easy training examples. Experimental results show that simple continuation methods are more promising than curriculum learning ones since they allow for slightly improving the performance of state-of-the-art λ-MART models and provide a faster convergence speed.
Cyber-physical systems often consist of entities that interact with each other over time. Meanwhile, as part of the continued digitization of industrial processes, various sensor technologies are deployed that enable us to record time-varying attributes (a.k.a., time series) of such entities, thus producing correlated time series. To enable accurate forecasting on such correlated time series, this paper proposes two models that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The first model employs a CNN on each individual time series, combines the convoluted features, and then applies an RNN on top of the convoluted features in the end to enable forecasting. The second model adds additional auto-encoders into the individual CNNs, making the second model a multi-task learning model, which provides accurate and robust forecasting. Experiments on a large real-world correlated time series data set suggest that the proposed two models are effective and outperform baselines in most settings.
This study takes the lead to study the aspect/sentiment-aware abstractive review summarization in domain adaptation scenario. The proposed model CASAS (neural attentive model for Cross-domain Aspect/Sentiment-aware Abstractive review Summarization) leverages domain classification task, working on datasets of both source and target domains, to recognize the domain information of texts and transfer knowledge from source domains to target domains. The extensive experiments on Amazon reviews demonstrate that CASAS outperforms the compared methods in both out-of-domain and in-domain setups.
Given the great amounts of data being transmitted between devices in the 21st century, existing channels of wireless communication are getting congested. In the wireless space, the focus up to now has been on the microwave frequency range. An alternative for high-speed medium- and long-range communication is the millimeter wave spectrum, which is most effectively used through point-to-point links. In this paper, we develop and compare methods for verifying the Line of Sight (LOS) constraint between two points in a city. To be useful for online wireless network planning systems, the methods must be able to process terabytes of 3D city geolocation data and provide answers in milliseconds. We evaluate our methods using data for the city of San Jose, a major metropolitan area in Silicon Valley, California. Our results indicate that our Hierarchical Polygon Aggregation (HPA) method is able to achieve millisecond-level query times with very little loss of precision.
Parkinson's disease (PD) is a slowly progressing neurodegenerative disease with early manifestation of motor signs. Recently, there has been a growing interest in developing automatic tools that can assess motor function in PD patients. Here we show that mouse tracking data collected during people's interaction with a search engine can be used to distinguish PD patients from similar, non-diseased users and present a methodology developed for the diagnosis of PD from these data. The main challenge we address is the extraction of informative features from raw mouse tracking data. We do so in two complementary ways: First, we manually construct expert-recommended features, aiming to identify abnormalities in motor behaviors. Second, we use an unsupervised representation learning technique to map these raw data to high-level features. Using all the extracted features, a Random Forest classifier is then used to distinguish PD patients from controls, achieving an AUC of 0.92, while results using only expert-generated or auto-generated features are 0.87 and 0.83, respectively. Our results indicate that mouse tracking data can help in detecting users at early stages of the disease and that both expert-generated features and unsupervised techniques for feature generation are required to achieve the best possible performance.
Missing values in real world datasets are a common issue. Handling missing values is one of the most key aspects in data mining, as it can seriously impact the performance of predictive models. In this paper we proposed a unified Boosting framework that consolidates model construction and missing value handling. At each Boosting iteration, weights are assigned to both the samples and features. The sample weights make difficult samples become the learning focus, while the feature weights enable critical features to be compensated by less critical features when they are unavailable. A weak classifier that abstains (i.e, produce no prediction when required feature value is missing) is learned on a data subset determined by the feature weights. Experimental results demonstrate the efficacy and robustness of the proposed method over existing Boosting algorithms.
Several previous approaches attempted to predict bursty topics on Twitter. Such approaches have usually reported that the time information (e.g. the topic popularity over time) of hashtag topics contribute the most to the prediction of bursty topics. In this paper, we propose a novel approach to use time features to predict bursty topics on Twitter. We model the popularity of topics as density curves described by the density function of a beta distribution with different parameters. We then propose various approaches to predict/classify the bursty topics by estimating the parameters of topics, using estimators such as Gradient Decent or Likelihood Maximization. In our experiments, we show that the estimated parameters of topics have a positive effect on classifying bursty topics. In particular, our estimators when combined together improve the bursty topic classification by 6.9 in terms of micro F1 compared to a baseline classifier using hashtag content features.
Query Expansion (QE) techniques expand the user queries with additional terms, e.g., synonyms and acronyms, to enhance the system recall. State-of-the-art solutions employ machine learning methods to select the most suitable terms. However, most of them neglect the cost of processing the expanded queries, thus selecting effective, yet very expensive, terms. The goal of this paper is to enable QE in scenarios with tight time constraints proposing a QE framework based on structured queries and efficiency-aware term selection strategies. In particular, the proposed expansion selection strategies aim at capturing the efficiency and the effectiveness of the expansion candidates, as well as the dependencies among them. We evaluate our proposals by conducting an extensive experimental assessment on real-world search engine data and public TREC data. Results confirm that our approach leads to a remarkable efficiency improvement w.r.t. the state-of-the-art: a reduction of the retrieval time up to 30 times, with only a small loss of effectiveness.
Distributed Web search engines (WSEs) require warehouse-scale computers to deal with the ever-increasing size of the Web and the large amount of user queries they daily receive. The energy consumption of this infrastructure has a major impact on the economic profitability of WSEs. Recently several approaches to reduce the energy consumption of WSEs have been proposed. Such solutions leverage dynamic voltage and frequency scaling techniques in modern CPUs to adapt the WSEs' query processing to the incoming query traffic without negative impacts on latencies.
A state-of-the-art research approach is the PESOS (Predictive Energy Saving Online Scheduling) algorithm, which can reduce the energy consumption of a WSE' single server by up to 50%. We evaluate PESOS on a simulated distributed WSE composed of a thousand of servers, and we compare its performance w.r.t. an industry-level baseline, called PEGASUS. Our results show that PESOS can reduce the CPU energy consumption of a distributed WSE by up to 18% with respect to PEGASUS, while providing query response times which are in line with user expectations.
This paper addresses the pipeline processing of sequential workflows in crowdsourcing. Sequential workflows consisting of several subtasks are ubiquitous in crowdsourcing. Our approach is to control the budget distribution to subtasks in order to balance the execution speed of the subtasks and to improve throughput of overall sequential workflows. As we cannot control the price for earlier steps retrospectively in the stepwise batch execution, we explore pipeline processing schemes. Our experimental results show that our pipeline processing scheme with price control achieves significantly higher throughput of sequential workflows.
A similarity join aims to find all similar pairs between two collections of records. Established approaches usually deal with synthetic differences like typos and abbreviations, but neglect the semantic relations between words. Such relations, however, are helpful for obtaining high-quality joining results. In this paper, we leverage the taxonomy knowledge (i.e., a set of IS-A hierarchical relations) to define a similarity measure which finds semantic-similar records from two datasets. Based on this measure, we develop a similarity join algorithm with prefix filtering framework to prune away irrelevant pairs effectively. Our technical contribution here is an algorithm that judiciously selects critical parameters in a prefix filter to maximise its filtering power, supported by an estimation technique and Monte Carlo simulation process. Empirical experiments show that our proposed methods exhibit high efficiency and scalability, outperforming the state-of-art by a large margin.
As one of the most widely used clustering techniques, the fuzzy K-Means (also called FKM or FCM) assigns every data point to each cluster with a certain degree of membership. However, conventional FKM approach relies on the square data fitting term which is not robust to data outliers and ignores the prior information, which leads to unsatisfactory clustering results. In this paper, we present a novel and robust fuzzy K-Means clustering algorithm, namely Embedding Fuzzy K-Means with Nonnegative Spectral Clustering via Incorporating Side Information. The proposed method combines fuzzy K-Means with nonnegative spectral clustering into a unified model, and further takes the advantage of the prior knowledge of data pairs such that the quality of similarity graph is enhanced and the clustering performance is effectively improved. Besides, the ℓ2,1-norm loss function is adopted in the objective function, which achieves better robustness to outliers. Last, experimental results on benchmark datasets verify the effectiveness and superiority of the proposed clustering method.
Given a SERP in response to a user-originated query, Moffat et al. (CIKM 2013; TOIS 2017) suggest that C(i), the conditional continuation probability of the user examining the (i+1)st element presented in the SERP, given that they are known to have examined the ith one, is positively correlated with both i and with the user's initial estimate of the volume of answer pages they are looking for, and negatively correlated with the extent to which suitable answer pages have been identified in the SERP at positions 1 through i. Here we first describe a methodology for specifying how C(i) should be defined in practical (as against ideal) settings, and then evaluate the applicability of the approach using three large search interaction logs from two different sources.
Sub-event detection can help faster and deeper understanding of an event by providing human-friendly clusters, and thus has become an important research topic in Web mining and knowledge management. In existing sub-event detection methods, clustering based methods are brittle for using heuristic similarity metric to judge whether documents belong to the same sub-event, while topic model based methods are limited to the bag of words assumption. To overcome these drawbacks in previous research, in this paper, we propose an encoder-memory-decoder framework for sub-event detection. Our model learns document and sub-event representations suitable for the similarity metric in a data-driven manner, and transforms sub-event detection into selecting the most proper sub-event representation that can maximize text reconstruction probability. Considering the case of over-fitting, we also apply transfer learning in our model. To the best of our knowledge, our model is the first to develop an unsupervised deep neural model for sub-event detection. We use Twitter as an examplar social media platform for our study, and experimental results show that our model outperforms baseline methods for sub-event detection.
Network embedding methods aim at learning low-dimensional latent representation of nodes in a network. While achieving competitive performance on a variety of network inference tasks such as node classification and link prediction, these methods treat the relations between nodes as a binary variable and ignore the rich semantics of edges. In this work, we attempt to learn network embeddings which simultaneously preserve network structure and relations between nodes. Experiments on several real-world networks illustrate that by considering different relations between different node pairs, our method is capable of producing node embeddings of higher quality than a number of state-of-the-art network embedding methods, as evaluated on a challenging multi-label node classification task.
Graph kernels have recently emerged as a promising approach to perform machine learning on graph-structured data. A graph kernel implicitly embedds graphs in a Hilbert space and computes the inner product between these representations. However, the inner product operation greatly limits the representational power of kernels between graphs. In this paper, we propose to perform a series of successive embeddings in order to improve the performance of existing graph kernels and derive more expressive kernels. We first embed the input graphs in a Hilbert space using a graph kernel and then we embed them into another space by employing popular kernels for vector data (e.g., gaussian kernel). Our experiments on several datasets show that by composing kernels, we can achieve significant improvements in classification accuracy.
Recently, there has been considerable interest in the use of historical logged user interaction data—queries and clicks—for evaluation of search systems in the context of counterfactual analysis [8,10]. Recent approaches attempt to de-bias the historical log data by conducting randomization experiments and modeling the bias in user behavior. Thus far, the focus has been on addressing bias that arises due to the position of the document being clicked (position-bias) or sparsity of clicks on certain query-document pairs (selection-bias). However, there is another source of bias that could arise: the bias due to the context in which a document was presented to the user. The propensity of the user clicking on a document depends not only on its position but also on many other contextual factors. In this work, we show that the existing counterfactual estimators fail to capture one type of bias, specifically, the effect on click-through rates due to the relevance of documents ranked above. Further, we propose a modification to the existing estimator that takes into account this bias. We rely on full result randomization that allows us to control for the click context at various ranks; we demonstrate the effectiveness of our methods in evaluating retrieval system through experiments on a simulation setup that is designed to cover a wide variety of scenarios.
This paper explores a neural network-based approach to computing similarity of two texts written in different languages. Such similarity can be useful for a variety of applications including cross-lingual information retrieval and cross-lingual text classification. To compute similarity, we focus on neural machine translation models and examine the utility of their intermediate states. Through experiments on an English-Japanese translation corpus, it is demonstrated that the intermediate states of input texts are indeed beneficial for computing cross-lingual text similarity, outperforming other approaches including a strong machine translation-based baseline.
Figures and captions convey essential information in scientific publications. As such, there is a growing interest in mining published figures and in utilizing their respective captions as a source of knowledge. There is also much interest in image captioning systems that can automatically generate captions for images, whose training requires large datasets of image-caption pairs. Notably, the first fundamental step of obtaining figures and captions from publications is neither well-studied nor yet well-addressed. In this paper, we introduce a new and effective system for figure and caption extraction, PDFigCapX. Unlike current methods that extract figures by handling raw encoded contents of PDF documents, we separate text from graphical contents and utilize layout information to detect and disambiguate figures and captions. Files containing the figures and their associated captions are then produced as output to the end-user. We test PDFigCapX on both a previously used generic dataset and on two new sets of publications within the biomedical domain. Our experiments and results show a significant improvement in performance compared to the state-of-the-art, and demonstrate the effectiveness of our approach. Our system will be available for use at: https://www.eecis.udel.edu/~compbio/PDFigCapX.
With the increasing uptake of knowledge graphs comes an increasing need for validating the knowledge contained in these graphs. However, the sheer size and number of knowledge bases used in real-world applications makes manual fact checking impractical. In this paper, we employ sentence coherence features gathered from trustworthy source documents to outperform the state of the art in fact checking. Our approach, FactCheck, uses this information to score how likely a fact is to be true and provides the user the evidence used to validate the input facts. We evaluated our approach on two different benchmark datasets and two different corpora. Our results show that FactCheck outperforms the state of the art by up to 13.3% in F-measure and 19.3% AUC. FactCheck is open-source and is available at https://github.com/dice-group/FactCheck.
It has been shown that stock price movements are influenced by news. To predict stock movements with news, many existing works rely only on the news title since the news content may contain irrelevancies which seriously degrade the prediction accuracy. However, we observe that there is still useful information in the content which is not reflected in the title, and simply ignoring the content will result in poor performance. In this paper, taking advantage of neural representation learning, we propose a hierarchical complementary attention network (HCAN) to capture valuable complementary information in news title and content for stock movement prediction. In HCAN, we adopt a two-level attention mechanism to quantify the importances of the words and sentences in a given news. Moreover, we design a novel measurement for calculating the attention weights to avoid capturing redundant information in the news title and content. Experimental results on news datasets show that our proposed model outperforms the state-of-the-art techniques.
We revisit the fundamental problem of sorting objects using crowdsourced pairwise comparisons. Prior work either treats these comparisons as independent tasks -- in which case the resulting judgments may end up being inconsistent, or fails to capture the accuracies of workers or difficulties of the pairwise comparisons -- in which case the resulting judgments may end up being consistent with each other, but ultimately more inaccurate. We adopt a holistic approach that constructs a graph across the set of objects respecting consistency constraints. Our key contribution is a novel method of encoding difficulty of comparisons in the form of constraints on edges. We couple that with an iterative E-M-style procedure to uncover information about latent variables and constraints, along with the graph structure. We show that our approach predicts edge directions as well as difficulty values more accurately than baselines on both real and simulated data, across graphs of various sizes.
Scholars' homepages are important places to show personal research interest and academic achievement through the Web. However, according to our observation, only a small portion of scholars update their publications and related events on their homepages in time. In this paper, we propose a homepage augmentation technique, which automatically shows the newest academic events related to a scholar on his/her homepage. Specifically, we model the relations between homepages and the events collected from the Web as a complex heterogenous network, and propose an Embedding-based Heterogenous random Walk algorithm, namely EHWalk, to predict the links between homepages and events. Compared with existing embedding-based link prediction algorithms, EHWalk supports more efficient modeling of complex heterogenous relations in a dynamically changing network, which helps link the massive new updated events to homepages precisely and efficiently. Comprehensive experiments on a real-world dataset are conducted and the results show that our algorithm can achieve both good effectiveness and efficiency for real-world deployment.
Search activities involving knowledge acquisition, investigation and synthesis are collectively known as exploratory search. Exploratory search is challenging for users, who may be unable to formulate search queries, have ill-defined search goals or may even struggle to understand search results. To ameliorate these difficulties, reinforcement learning-based information retrieval systems were developed to provide adaptive support to users. Reinforcement learning is used to build a model of user intent based on relevance feedback provided by the user. But how reliable is relevance feedback in this context? To answer this question, we developed a novel permutation-based metric for scoring the consistency of relevance feedback. We used this metric to perform a retrospective analysis of interaction data from lookup and exploratory search experiments. Our analysis shows that for lookup search relevance judgments are highly consistent, supporting previous findings that relevance feedback improves retrieval performance. For exploratory search, however, the distribution of consistency scores shows considerable inconsistency.
Popular methods for news recommendation which are based on collaborative filtering and content-based filtering have multiple drawbacks. The former method does not account for the sequential nature of news reading and suffers from the problem of cold-start, while the latter, suffers from over-specialization. In order to address these issues for news recommendation we propose a Hybrid Recurrent Attention Machine (HRAM). HRAM consists of two components. The first component utilizes a neural network for matrix factorization. While in the second component, we first learn the distributed representation of each news article. We then use the historical data of the user in a sequential manner and feed it to an attention-based recurrent layer. Finally, we concatenate the outputs from both these components and use further hidden layers in order to make predictions. In this way, we harness the information present in the user reading history and boost it with the information available through collaborative filtering for providing better news recommendations. Extensive experiments over two real-world datasets show that the proposed model provides significant improvement over the state-of-the-art.
One of the challenges of automating machine learning applications is the automatic selection of an algorithmic model for a given problem. We present AutoDi, a novel and resource-efficient approach for model selection. Our approach combines two sources of information: metafeatures extracted from the data itself and word-embedding features extracted from a large corpus of academic publications. This hybrid approach enables AutoDi to select top-performing algorithms both for widely and rarely used datasets by utilizing its two types of feature sets. We demonstrate the effectiveness of our proposed approach on a large dataset of 119 datasets and 179 classification algorithms grouped into 17 families. We show that AutoDi can reach an average of 98.8% of optimal accuracy and select the optimal classification algorithm in 49.5% of all cases.
In addition to only considering stocks' price series, utilizing short and instant texts from social medias like Twitter has potential to yield better stock market prediction. While some previous approaches have explored this direction, their results are still far from satisfactory due to their reliance on performance of sentiment analysis and limited capabilities of learning direct relations between target stock trends and their daily social texts. To bridge this gap, we propose a novel Cross-modal attention based Hybrid Recurrent Neural Network (CH-RNN), which is inspired by the recent proposed DA-RNN model. Specifically, CH-RNN consists of two essential modules. One adopts DA-RNN to gain stock trend representations for different stocks. The other utilizes recurrent neural network to model daily aggregated social texts. These two modules interact seamlessly by the following two manners: 1) daily representations of target stock trends from the first module are leveraged to select trend-related social texts through a cross-modal attention mechanism, and 2) representations of text sequences and trend series are further integrated. The comprehensive experiments on the real dataset we build demonstrate the effectiveness of CH-RNN and benefit of considering social texts.
Supervised learning methods are widely used in sentiment classification. However, when sentiment distribution is imbalanced, the performance of these methods declines. In this paper, we propose an effective approach for imbalanced sentiment classification. In our approach, multiple balanced subsets are sampled from the imbalanced training data and a multi-task learning based framework is proposed to learn robust sentiment classifier from these subsets collaboratively. In addition, we incorporate prior knowledge of sentiment expressions extracted from both existing sentiment lexicons and massive unlabeled data into our approach to enhance the learning of sentiment classifier in imbalanced scenario. Experimental results on benchmark datasets validate the effectiveness of our approach in improving imbalanced sentiment classification.
Neural embeddings have been effectively integrated into information retrieval tasks including ad hoc retrieval. One of the benefits of neural embeddings is they allow for the calculation of the similarity between queries and documents through vector similarity calculation methods. While such methods have been effective for document matching, they have an inherent bias towards documents that are sized relatively similarly. Therefore, the difference between the query and document lengths, referred to as the query-document size imbalance problem, becomes an issue when incorporating neural embeddings and their associated similarity calculation models into the ad hoc document retrieval process. In this paper, we propose that document representation methods need to be used to address the size imbalance problem and empirically show their impact on the performance of neural embedding-based ad hoc retrieval. In addition, we explore several types of document representation methods and investigate their impact on the retrieval process. We conduct our experiments on three widely used standard corpora, namely Clueweb09B, Clueweb12B and Robust04 and their associated topics. Summarily, we find that document representation methods are able to effectively address the query-document size imbalance problem and significantly improve the performance of neural ad hoc retrieval. In addition, we find that a document representation method based on a simple term-frequency shows significantly better performance compared to more sophisticated representation methods such as neural composition and aspect-based methods.
The standard bag-of-words vector space model (VSM) is efficient, and ubiquitous in information retrieval, but it underestimates the similarity of documents with the same meaning, but different terminology. To overcome this limitation, Sidorov et al. proposed the Soft Cosine Measure (SCM) that incorporates term similarity relations. Charlet and Damnati showed that the SCM is highly effective in question answering (QA) systems. However, the orthonormalization algorithm proposed by Sidorov et al. has an impractical time complexity of O(n^4), where n is the size of the vocabulary. In this paper, we prove a tighter lower worst-case time complexity bound of O(n^3). We also present an algorithm for computing the similarity between documents and we show that its worst-case time complexity is O(1) given realistic conditions. Lastly, we describe implementation in general-purpose vector databases such as Annoy, and Faiss and in the inverted indices of text search engines such as Apache Lucene, and ElasticSearch. Our results enable the deployment of the SCM in real-world information retrieval systems.
Learning network representations is essential for many downstream tasks such as node classification, link prediction, and recommendation. Many algorithms derived from SGNS (skip-gram with negative sampling) have been proposed, such as LINE, DeepWalk, and node2vec. In this paper, we show that these algorithms suffer from norm convergence problem, and propose to use L2 regularization to rectify the problem. The proposed method improves the embeddings consistently. This is verified on seven different datasets with various sizes and structures. The best improvement is 46.41% for the task of node classification.
Detecting controversy in general web pages is a daunting task, but increasingly essential to efficiently moderate discussions and effectively filter problematic content. Unfortunately, controversies occur across many topics and domains, with great changes over time. This paper investigates neural classifiers as a more robust methodology for controversy detection in general web pages. Current models have often cast controversy detection on general web pages as Wikipedia linking, or exact lexical matching tasks. The diverse and changing nature of controversies suggest that semantic approaches are better able to detect controversy. We train neural networks that can capture semantic information from texts using weak signal data. By leveraging the semantic properties of word embeddings we robustly improve on existing controversy detection methods. To evaluate model stability over time and to unseen topics, we asses model performance under varying training conditions to test cross-temporal, cross-topic, cross-domain performance and annotator congruence. In doing so, we demonstrate that weak-signal based neural approaches are closer to human estimates of controversy and are more robust to the inherent variability of controversies.
In this paper, we improve the low-rank matrix completion algorithm by assuming that the data points lie in a union of low dimensional subspaces. We applied the self-expressiveness, which is a property of a dataset when the data points lie in a union of low dimensional subspaces, to the low-rank matrix completion. By considering self-expressiveness of low dimensional subspaces, the proposed low-rank matrix completion may perform well even with little information, leading to the robust completion on a dataset with high missing rate. In our experiments on movie rating datasets, the proposed model outperforms state-of-the-art matrix completion models. In clustering experiments conducted on MNIST dataset, the result indicates that our method closely recovers the subspaces of original dataset even with the high missing rate.
In this paper, we propose to incorporate information of related corporations of a target company for its stock price prediction. We first construct a graph including all involved corporations based on investment facts from real market and learn a distributed representation for each corporation via node embedding methods applied on the graph. Two approaches are then explored to utilize information of related corporations based on a pipeline model and a joint model via graph convolutional neural networks respectively. Experiments on the data collected from stock market in Mainland China show that the representation learned from our model is able to capture relationships between corporations, and prediction models incorporating related corporations' information are able to make more accurate predictions on stock market.
We address the problem of constructing a knowledge base of entity-oriented search intents. Search intents are defined on the level of entity types, each comprising of a high-level intent category (property, website, service, or other), along with a cluster of query terms used to express that intent. These machine-readable statements can be leveraged in various applications, e.g., for generating entity cards or query recommendations. By structuring service-oriented search intents, we take one step towards making entities actionable. The main contribution of this paper is a pipeline of components we develop to construct a knowledge base of entity intents. We evaluate performance both component-wise and end-to-end, and demonstrate that our approach is able to generate high-quality data.
With the increasing of multi-modal data on the internet, cross-modal retrieval has received a lot of attention in recent years. It aims to use one type of data as query and retrieve results of another type. For different modality data, how to reduce their heterogeneous property and preserve their local relationship are two main challenges. In this paper, we present a novel joint dictionary learning and semantic constrained latent subspace learning method for cross-modal retrieval~(JDSLC) to deal with above two issues. In this unified framework, samples from different modalities are encoded by their corresponding dictionaries to reduce the semantic gap. In the meantime, we learn modality-specific projection matrices to map the sparse coefficients into the shared latent subspace. Meanwhile, we impose a novel cross-modal similarity constraint to make the representations of samples that belong to same class but from different modalities as close as possible in the latent subspace. An efficient algorithm is proposed to jointly optimize the proposed model and learn the optimal dictionary, coefficients and projection matrix for each modality. Extensive experimental results on multiple benchmark datasets show that our method outperforms the state-of-the-art approaches.
In social networks, dense relationships among users contribute to stable networks. Breakdowns of some relationships may cause users to leave the network hence decrease the network stability. A popular metric to measure the stability of a network is k-core, i.e., the maximal subgraph of a social network in which each node has at least k neighbors. In this paper, we propose a novel problem, called k-core minimization. Given a graph G, an integer k and a budget b, we aim to identify a set B of edges with size b, so that we can get the minimum k-core by deleting B from G. We first formally define the problem and prove its NP-hardness. Then a baseline greedy algorithm is proposed. To handle large graphs, an optimized algorithm, named KC-Edge, is developed by adopting novel pruning rules. Finally, comprehensive experiments on 6 real social networks are conducted to demonstrate the efficiency and effectiveness of our proposed methods.
Label Propagation (LP) is a popular transductive learning method for very large datasets, in part due to its simplicity and ability to parallelize. However, it has limited ability to handle node features, and its accuracy can be sensitive to the number of iterations. We propose an algorithm called LPNN that solves these problems by a loose-coupling of LP with a feature-based classifier. We experimentally establish the effectiveness of LPNN.
Fine-grained geolocation of tweets has become an important feature for reliably performing a wide range of tasks such as real-time event detection, topic detection or disaster and emergency analysis. Recent work adopted a ranking approach to return a predicted location based on content-based similarity to already available individual geotagged tweets. However, this work made use of the IDF weighting model to compute the ranking, which can diminish the quality of the Top-N retrieved tweets. In this work, we adopt a learning to rank approach towards improving the effectiveness of the ranking and increasing the accuracy of fine-grained geolocalisation. To this end we propose a set of features extracted from pairs of geotagged tweets generated within the same fine-grained geographical area (squared areas of size 1 km). Using geotagged tweets from two cities (Chicago and New York, USA), our experimental results show that our learning to rank approach significantly outperforms previous work based on IDF ranking, and improves accuracy of tweet geolocalisation at a fine-grained level.
Modeling spillover effects from observational data is an important problem in economics, business, and other fields of research. It helps us infer the causality between two seemingly unrelated set of events. For example, if consumer spending in the United States declines, it has spillover effects on economies that depend on the U.S. as their largest export market. In this paper, we aim to infer the causation that results in spillover effects between pairs of entities (or units); we call this effect as paired spillover. To achieve this, we leverage the recent developments in variational inference and deep learning techniques to propose a generative model called Linked Causal Variational Autoencoder (LCVA). Similar to variational autoencoders (VAE), LCVA incorporates an encoder neural network to learn the latent attributes and a decoder network to reconstruct the inputs. However, unlike VAE, LCVA treats the latent attributes as confounders that are assumed to affect both the treatment and the outcome of units. Specifically, given a pair of units u and $\baru $, their individual treatment and outcomes, the encoder network of LCVA samples the confounders by conditioning on the observed covariates of u, the treatments of both u and $\baru $ and the outcome of u. Once inferred, the latent attributes (or confounders) of u captures the spillover effect of $\baru $ on u. Using a network of users from job training dataset (LaLonde (1986)) and co-purchase dataset from Amazon e-commerce domain, we show that LCVA is significantly more robust than existing methods in capturing spillover effects.
Since heterogeneous information network (HIN) is able to integrate complex information and contain rich semantics, there is a surge of HIN based recommendation in recent years. Although existing methods have achieved performance improvement to some extent, they still face the following problems: how to extensively exploit and comprehensively explore the local and global information in HIN for recommendation. To address these issues, we propose a unified model LGRec to fuse local and global information for top-N recommendation in HIN. We firstly model most informative local neighbor information for users and items respectively with a co-attention mechanism. In addition, our model learns effective relation representations between users and items to capture rich information in HIN by optimizing a multi-label classification problem. Finally, we combine the two parts into an unified model for top-N recommendation. Extensive experiments on four real-world datasets demonstrate the effectiveness of the proposed model.
Failure event prediction is becoming increasingly important in wide applications, such as the planning of proactive maintenance, the active investment management, and disease surveillance. To address the issue, the hazard function in survival analysis has been employed to describe the pattern of failures. Different from traditional survival analysis, this paper discovers how to apply recurrent neural network (RNN) to the long-term hazard function prediction. The proposed Long-Term RNN (LT-RNN) is able to leverage the precedent information shared by other entities, leading to more reliable long-term predictions. Specifically, our method allows a black-box treatment for modelling the hazard function which is often a pre-defined parametric form in typical survival analysis. The key idea of our approach is to model the hazard function as a nonparameteric function of the history. The same precedent information from other entities is embedded to a stitched vector for LT-RNN to automatically learn a representation of the long-term hazard function. We apply our model to the proactive maintenance problem using a large dataset from a water utility in Australia.
Combining multiple retrieval functions can lead to notable gains in retrieval performance. Learning to Rank (LETOR) techniques achieve outstanding retrieval results, by learning models with no bounds on model complexity. Often, minor retrieval gains are attained at a significant cost in model complexity. This paper focuses on the research question:can less complex models achieve results comparable to LETOR models? In this paper, we investigate an approach for the selection and fusion of rank lists with low-complexity models. The described Learning to Fuse (L2F) algorithm, is a supervised rank fusion procedure that controls the model complexity by discarding rank lists that bring minor improvements to final rank. Evaluation results, on two different datasets, show that it is indeed possible to achieve a retrieval performance comparable to LETOR methods, using only 3-5% of the rank lists of the number of rank lists used by LETOR methods.
Today electronic communications have become the prime medium for people to express their opinions and influence the policy preferences. One such popular channel reflecting the voice of the masses is electronic petitions. To understand people's perspective on various issues it is important to know what petitions say. However, due to the sheer volume of the petitions it is difficult to process each petition manually. As each petition talks about a different issue, prioritizing on over other is difficult. To alleviate these challenges, we present an end to end system for generating comprehensive and concise summaries from e-petitions. A petition contains multiple aspects, the core problem, evidence(s) in support of the problem and potential solutions. Therefore, it is imperative that an useful summary should contain information about all these three aspects explicitly. To achieve this, our system generates three aspect based summaries for each petition for better understanding. We also introduce a new annotated petition dataset, developed through crowd-sourcing, that served as gold standard. Our model is tested through quantitative and qualitative evaluations.
In this paper, we proposed a framework to evaluate information retrieval systems in presence of multidimensional relevance. This is an important problem in tasks such as consumer health search, where the understandability and trustworthiness of information greatly influence people's decisions based on the search engine results, but common topicality-only evaluation measures ignore these aspects. We used synthetic and real data to compare our proposed framework, named MM, to the understandability-biased information evaluation (UBIRE), an existing framework used in the context of consumer health search. We showed how the proposed approach diverges from the UBIRE framework, and how MM can be used to better understand the trade-offs between topical relevance and the other relevance dimensions.
Although marketing researchers and sociologists have recognized the importance of buying decision process and its significant influence on consumer's purchasing behaviors, existing recommender systems do not explicitly model the consumer buying decision process or capture the sequential regularities of what happens before and after each purchase. In this paper, we try to bridge the gap and improve recommendation systems by explicitly modeling consumer buying decision process and corresponding stages. In particular, we propose a multi-task learning model with long short-term memory networks (LSTM) to learn consumer buying decision process. It maps items, users, product categories, and the behavior sequences into real valued vectors, with which the probability of purchasing a product can be estimated. In this way, the model can capture user intentions and preferences, predicts the conversion rate of each candidate product, and makes recommendations accordingly. Experiments on real world data demonstrate the effectiveness of the proposed approach.
Hypergraph is a data structure commonly used to represent connections and relations between multiple objects. Embedding a hypergraph into a low-dimensional space and representing each vertex as a vector is useful in various tasks such as visualization, classification, and link prediction. However, most hypergraph embedding or learning algorithms reduce multi-way relations to pairwise ones, which turn hypergraphs into graphs and lose a lot of information. Inspired by Laplacian tensors of uniform hypergraphs, we propose in this paper a novel method that incorporates multi-way relations into an optimization problem. We design an objective that is applicable to both uniform and non-uniform hypergraphs with the constraint of having non-negative embedding vectors. For scalability, we apply negative sampling and use constrained stochastic gradient descent to solve the optimization problem. We test our method in a context-aware recommendation task on a real-world dataset. Experimental results show that our method outperforms a few well-known graph and hypergraph embedding methods.
In the typical state of an ever growing mailbox, it becomes essential to assist the user to better organize and quickly look up the content of his electronic life. Our work addresses this challenge, by identifying related messages within a user's mailbox. We study the notion of semantic relatedness between email messages and aim to offer the user with a wider context of the message he selects or reads. The context is represented by a small set of messages that are semantically related to the given message. We conduct experiments on a large-scale mail dataset obtained from a major Web mail service and demonstrate the effectiveness of our model in this task.
Completing knowledge bases (KBs) with missing facts is of great importance, since most existing KBs are far from complete. To this end, many knowledge base completion (KBC) methods have been proposed. However, most existing methods embed each relation into a vector separately, while ignoring the correlations among different relations. Actually, in large-scale KBs, there always exist some relations that are semantically related, and we believe this can help to facilitate the knowledge sharing when learning the embedding of related relations simultaneously. Along this line, we propose a novel KBC model by Multi -Task E mbedding, named MultiE. In this model, semantically related relations are first clustered into the same group, and then learning the embedding of each relation can leverage the knowledge among different relations. Moreover, we propose a three-layer network to predict the missing values of incomplete knowledge triples. Finally, experiments on three popular benchmarks FB15k, FB15k-237 and WN18 are conducted to demonstrate the effectiveness of MultiE against some state-of-the-art baseline competitors.
Sentiment analysis and opinion mining are significant and valuable for subject information extraction from the text. Word embedding that can map the words to low-dimensional vector representations has been widely used in natural language processing tasks. But the word embedding based on context such as Word2Vec and GloVe is lack of capturing the sentiment information. Most of existing sentiment analysis methods incorporate sentiment polarity (positive and negative) to improve the sentiment embedding for sentiment tasks. Instead of making a new word embedding model, we introduce the multi-emotion category (MEC) model to improve the pre-trained word vectors which aims to move target word vectors closer to the words from both similar semantics and similar emotions. The MEC model can give eight-dimensional vector for one word in emotion space that can capture more sentiment information than the binary polarity labels. In addition, the obvious advantage of the MEC model is that it can be fit for any pre-trained word embedding. The experimental results on several Chinese and English data sets show that this new model can improve the conventional word embedding and some existing sentiment embedding for sentiment classification.
Multi-view clustering has received an increasing attention in many applications, where different views of objects can provide complementary information to each other. Existing approaches on multi-view clustering mainly focus on extending Non-negative Matrix Factorization (NMF) by enforcing the constraint over the coefficient matrices from different views in order to preserve their consensus. In this paper, we argue that it is more reasonable to utilize the high-level manifold consensus rather than the low-level coefficient matrix consensus to better capture the underlying clustering structure of the data. Moreover, it is also effective to utilize the sparse coding framework, instead of the NMF framework, to deal with the sparsity issue. To this end, we propose a novel approach, named Multiple Manifold Regularized Sparse Coding (MMRSC). Experimental results on two publicly available real-world image datasets demonstrate that our proposed approach can significantly outperform the state-of-the-art approaches for the multi-view image clustering task.
As users implicitly express their preferences to items on many real-world applications, the implicit feedback based collaborative filtering has attracted much attention in recent years. Pairwise methods have shown state-of-the-art solutions for dealing with the implicit feedback, with the assumption that users prefer the observed items to the unobserved items. However, for each user, the huge unobserved items are not equal to represent her preference. In this paper, we propose a Multiple Pairwise Ranking (MPR) approach, which relaxes the simple pairwise preference assumption in previous works by further tapping the connections among items with multiple pairwise ranking criteria. Specifically, we exploit the preference difference among multiple pairs of items by dividing the unobserved items into different parts. Empirical studies show that our algorithms outperform the state-of-the-art methods on real-world datasets.
Hashing techniques for approximate nearest neighbor search (ANNS) encode data points into a set of short binary codes, while trying to preserve the neighborhood structure of the original data as much as possible. With the binary codes, the task of ANNS can be easily conducted over large-scale dataset, due to the high efficiency of pairwise comparison with the Hamming distance. Although binary codes have low computation and storage cost, the data are heavily compressed so that partial neighborhood structure information would be inevitably lost. To address this issue, we propose to introduce the k-nearest neighbors (k-NNs) in the original space into the Hamming space (i.e., associating a binary code with its original k-NNs) to enhance the effectiveness of existing hashing techniques with little overhead. Based on this idea, we develop a novel search scheme for hashing techniques namely neighborhood voting, i.e., each point retrieved by a query code will vote for its neighbors and itself, and the more voted, the better candidates. In this way, search in hashing is not simply the collision between codes (i.e., query code and candidate code), but also the collision between neighbors (i.e., neighbors of candidate points). The underlying assumption is that the true neighbors of a query point should be close to each other, while points with similar binary codes but seldom be the neighbors of other candidate points would be false positives. We introduce a novel data structure called aggregated hash table for implementing our idea and accelerating the online search process. Experimental results show that our search scheme can significantly improve the search effectiveness while having good efficiency over different existing hashing techniques.
Most existing author disambiguation work relies heavily on feature engineering or cannot use multiple paper relationships. In this work, we propose a network-embedding based method for author disambiguation. For each ambiguous name, we construct networks among papers sharing an ambiguous name, and connect papers with multiple relationships (e.g., co-authoring a paper). We focus on maximizing the gap between positive paper edges and negative edges, and propose a graph coarsening technique to learn global information. Further, we design a clustering algorithm which partitions paper representations into disjoint sets such that each set contains all papers of a unique author. Through extensive experiments, we show that our method is significantly better than the state-of-the-art author disambiguation and network-embedding methods.
One category of neural information retrieval models tries to learn text representation in a common embedding space for both queries and documents. However, a single embedding space is not always sufficient, since queries and documents are different in terms of length, number of topics covered, etc. We argue that queries and documents should be mapped into different but overlapping embedding spaces, which is named Partially Shared Embedding Space (PSES) model in this paper. PSES consists of two embedding spaces respectively for queries and documents, and a shared embedding space capturing common features of two sources. Those three embeddings are learned by jointly obeying three constraints: a feature separation constraint, a pairwise matching constraint, and a reconstruction constraint. Experiments on standard TREC collections indicate that PSES leads to significant better performance of retrieval over traditional IR models and several neural IR models with only one embedding space.
In machine reading comprehension (MRC) tasks, sentence inference is an important but extremely difficult problem. Most of MRC models directly interact articles with questions from the word level, which ignores inter and intra information of sentences and cannot well focus on problems about sentence reasoning and inference, especially when the answer clues are far apart in the article. In this paper, we propose an option gate approach for reading comprehension. We consider applying a sentence-level option gate module to make the model incorporate sentence information. In our approach we (1) extract key sentences in the article to filter out noise unrelated to the question and the options, (2) encode each sentence in articles, questions and options with dot-product self-attention to obtain intra sentence representations, (3) model inter relationships between the article and the question with bilinear attention and (4) apply an option gate with sentence inference information to each option representation with the question-aware article representation. This module can help better reasoning instead of directly word matching or paraphrasing. And this module can easily supply sentence information for most of the existing reading comprehension models. Experimental results on the RACE dataset show that this easy and simple module helps outperform the baseline models by 2.5% at most (single model), and achieve state-of-the-art results on the RACE-H dataset.
Clustering is a central task in unsupervised learning. Recent advances that perform clustering into learned deep features (such as DEC, IDEC  or VaDe ) have shown improvements over classical algorithms, but most of them are based on the Euclidean distance. Moreover, symmetry-based distances have shown to be a powerful tool to distinguish symmetric shapes -- such as circles, ellipses, squares, etc. This paper presents an adaptation of symmetry-based distances into deep clustering algorithms, named SymDEC. Our results show that the proposed strategy outperforms significantly the existing Euclidean-based deep clustering as well as recent symmetry-based algorithms in several of the synthetic symmetric and UCI studied datasets.
Researchers have shown that it is possible to identify reported instances of personal life events from users' social content, e.g., tweets. This is known as personal life event detection. In this paper, we take a step forward and explore the possibility of predicting users' next personal life event based solely on the their historically reported personal life events, a task which we refer to as personal life event prediction. We present a framework for modeling streaming social content for the purpose of personal life event prediction and describe how various instantiations of the framework can be developed to build a life event prediction model. In our extensive experiments, we find that (i) historical personal life events of a user have strong predictive power for determining the user's future life event; (ii) the consideration of sequence in historically reported personal life events shows inferior performance compared to models that do not consider sequence, and (iii) the number of historical life events and the length of the past time intervals that are taken into account for making life event predictions can impact prediction performance whereby more recent life events show more relevance for the prediction of future life events.
With the development of dialog techniques, conversational search has attracted more and more attention as it enables users to interact with the search engine in a natural and efficient manner. However, comparing with the natural language understanding in traditional task-oriented dialog which focuses on slot filling and tracking, the query understanding in E-commerce conversational search is quite different and more challenging due to more diverse user expressions and complex intentions. In this work, we define the real-world problem of query tracking in E-commerce conversational search, in which the goal is to update the internal query after each round of interaction. We also propose a self attention based neural network to handle the task in a machine comprehension perspective. Further more we build a novel E-commerce query tracking dataset from an operational E-commerce Search Engine, and experimental results on this dataset suggest that our proposed model outperforms several baseline methods by a substantial gain for Exact Match accuracy and F1 score, showing the potential of machine comprehension like model for this task.
Understanding searchers' queries is an essential component of semantic search systems. In many cases, search queries involve specific attributes of an entity in a knowledge base (KB), which can be further used to find query answers. In this study, we aim to move forward the understanding of queries by identifying their related entity attributes from a knowledge base. To this end, we introduce the task of entity attribute identification and propose two methods to address it: (i) a model based on Markov Random Field, and (ii) a learning to rank model. We develop a human annotated test collection and show that our proposed methods can bring significant improvements over the baseline methods.
Brain-Computer Interface (BCI) enables human to communicate with and intuitively control an external device through brain signals. Movement intention recognition paves the path for developing BCI applications. The current state-of-the-art in EEG based BCI usually involves subject-specific adaptation before ready to use. However, the subject-independent scenario, in which a well-trained model is directly applied to new subjects without any pre-calibration, is particularly desired yet rarely explored. In order to fill the gap, we present a Convolutional Attention Model (CAM) for EEG-based human movement intention recognition in the subject-independent scenario. The convolutional network is designed to capture the spatio-temporal features of EEG signals, while the integrated attention mechanism is utilized to focus on the most discriminative information of EEG signals during the period of movement imagination while omitting other less relative parts. Experiments conducted on a real-world EEG dataset containing 55 subjects show that our model is capable of mining the underlying invariant EEG patterns across different subjects and generalizing to unseen subjects. Our model achieves better performance than a series of state-of-the-art and baseline approaches.
Social recommendation, which utilizes social relations to enhance recommender systems, has been gaining increasing attention recently with the rapid development of online social network. Existing social recommendation methods are based on the fact that users preference or decision is influenced by their social friends' behaviors. However, they assume that the influences of social relation are always the same, which violates the fact that users are likely to share preference on diverse products with different friends. In this paper, we present a novel CSR (short for C haracterized S ocial R egularization) model by designing a universal regularization term for modeling variable social influence. Our proposed model can be applied to both explicit and implicit iteration. Extensive experiments on a real-world dataset demonstrate that CSR significantly outperforms state-of-the-art social recommendation methods.
Most recommender algorithms are designed to suggest relevant items, but suggesting these items does not always result in user satisfaction. Therefore, the efforts in recommender systems recently shifted towards serendipity, but generating serendipitous recommendations is difficult due to the lack of training data. To the best of our knowledge, there are many large datasets containing relevance scores (relevance oriented) and only one publicly available dataset containing a relatively small number of serendipity scores (serendipity oriented). This limits the learning capabilities of serendipity oriented algorithms. Therefore, in the absence of any known deep learning algorithms for recommending serendipitous items and the lack of large serendipity oriented datasets, we introduce SerRec our novel transfer learning method to recommend serendipitous items. SerRec uses transfer learning to firstly train a deep neural network for relevance scores using a large dataset and then tunes it for serendipity scores using a smaller dataset. Our method shows benefits of transfer learning for recommending serendipitous items as well as performance gains over the state-of-the-art serendipity oriented algorithms
Quantification is a supervised learning task that consists in predicting, given a set of classes C and a set D of unlabelled items, the prevalence (or relative frequency) p_c(D) of each class c\in\mathcalC in D. Quantification can in principle be solved by classifying all the unlabelled items and counting how many of them have been attributed to each class. However, this "classify and count" approach has been shown to yield suboptimal quantification accuracy; this has established quantification as a task of its own, and given rise to a number of methods specifically devised for it. We propose a recurrent neural network architecture for quantification (that we call QuaNet) that observes the classification predictions to learn higher-order "quantification embeddings", which are then refined by incorporating quantification predictions of simple classify-and-count-like methods. We test QuaNet on sentiment quantification on text, showing that it substantially outperforms several state-of-the-art baselines.
Incompleteness of large knowledge graphs (KG) has motivated many researchers to propose methods to automatically find missing edges in KGs. A promising approach for KG completion (link prediction) is embedding a KG into a continuous vector space. There are different methods in the literature that learn a continuous representation of KG (latent features of KG). The benchmark dataset FB15k has been widely employed to evaluate these methods. However, It has been noted that FB15k contains many pairs of edges in which a pair represents the same relationship in reverse directions. Therefore, the inverse of numerous test triples occurs in the training set. To address this problem, FB15k-237, a subset of FB15k, was created by removing those inverse-duplicate relations to form a more challenging, realistic dataset. There is not any study that investigates how the aforementioned bias in this widely used benchmark dataset affects the results of embedding-based knowledge graph completion methods and whether their promising results are largely due to the bias. Motivated by this question, we conducted extensive experiments and report the link prediction results on FB15K and FB15k-237 using several embedding-based methods. We compare the results of different methods to see how their performances change in absence of inverse relations. Our experiment results demonstrate that the performance of embedding models in link prediction task diminishes tremendously when the inverse relationships do not exist anymore.
Science journalism is the art of conveying a detailed scientific research paper in a form that non-scientists can understand and appreciate while ensuring that its underlying information is conveyed accurately. It plays a crucial role in making scientific content suitable for consumption by the public at large. In this work, we introduce the problem of automating some parts of the science journalism workflow by automatically generating the 'title' of a blog version of a scientific paper. We have built a corpus of $87,328$ pairs of research papers and their corresponding blogs from two science news aggregators and have used it to buildSci ence-Blogger - a pipeline-based architecture consisting of a two-stage mechanism to generate the blog titles. Evaluation using standard metrics indicate viability of the proposed system.
It is important to detect social spammers and spam messages in microblogging platforms. Existing methods usually handle the detection of social spammers and spam messages as two separate tasks using supervised learning techniques. However, labeled samples are usually scarce and manual annotation is expensive. In this paper, we propose a semi-supervised collaborative learning approach to jointly detect social spammers and spam messages in microblogging platforms. In our approach, the social spammer classifier and spam message classifier are collaboratively trained by exploiting the inherent relatedness between these tasks. In addition, unlabeled samples are incorporated into model training with the help of social contexts of users and messages. Experiments on real-world dataset show our approach can effectively improve the performance of both social spammer detection and spam message detection.
In this paper, we propose a novel sequential neural network with structure attention to model information diffusion. The proposed model explores both sequential nature of an information diffusion process and structural characteristics of user connection graph. The recurrent neural network framework is employed to model the sequential information. The attention mechanism is incorporated to capture the structural dependency among users, which is defined as the diffusion context of a user. A gating mechanism is further developed to effectively integrate the sequential and structural information. The proposed model is evaluated on the diffusion prediction task. The performances on both synthetic and real datasets demonstrate its superiority over popular baselines and state-of-the-art sequence-based models.
Adverse drug-drug interaction has been a critical issue for the development of drugs. In Traditional Chinese Medicine, adverse herb-herb interaction is a negative reaction in patients after the absorption of decoction of Incompatible Herb Pair (IHP). Recently, many methods have been proposed for IHP research, but most of them focused on revealing and analyzing the adverse reaction of some known IHPs, despite that a number of new IHPs have been discovered by accidents. Up to now, IHPs have been a serious threat to public health in the TCM medication. In this paper, we propose a novel supervised learning framework for potential IHP prediction. In this framework, we model the prediction task as a non-negative matrix tri-factorization problem, in which two important herb attributes (efficacy and flavor) and their correlation are incorporated to characterize the incompatible relationship among herbs. A hypothetical test method is adopted to evaluate the statistical significance of dissimilar characteristics of two attributes and the results are used as a regularization term to improve the accuracy of IHP prediction. Experiments on the real-world IHP dataset demonstrate that the proposed framework is very effective for prediction of potential IHPs.
Known-item search is an everyday natural scenario that we search for a specific thing (maybe a song) while only remembering some details about it. Existing benchmarks generally focus on brief user requests which specify some metadata like the title, or the time. However, in most cases, the users can hardly recall such information accurately. In order to embrace the research of known-item search, we present a new publicly available known-item speech video search benchmark, namely TED-KISS, which takes TED talks as an example. The video collection is constructed with up-to-date nearly 80,000 TED and TEDx talks on Youtube. These talks cover various topics, and their titles, speakers, descriptions, full-text subtitles, as well as original links are extracted as metadata, which makes the researches on text-based retrieval and multimedia retrieval feasible. Unlike other benchmarks concerning visual contents in segments, the user requests in TED-KISS are generated through a more natural process, partly through original related topics posted on Reddit and Baidu Tieba, and partly through manual imitative requests annotated by volunteers in a scenario simulation. In addition, we analyze the characteristics of our benchmark through evaluations of several existing text-based IR and Neural-IR models, which also can be served as baselines for this task.
Question answering over knowledge bases (KB-QA) poses challenges in handling complex questions that need to be decomposed into sub-questions. An important case, addressed here, is that of temporal questions, where cues for temporal relations need to be discovered and handled. We present TEQUILA, an enabler method for temporal QA that can run on top of any KB-QA engine. TEQUILA has four stages. It detects if a question has temporal intent. It decomposes and rewrites the question into non-temporal sub-questions and temporal constraints. Answers to sub-questions are then retrieved from the underlying KB-QA engine. Finally, TEQUILA uses constraint reasoning on temporal intervals to compute final answers to the full question. Comparisons against state-of-the-art baselines show the viability of our method.
In an effort to support users' decision making process in regards to shared and co-managed online images, in this paper we present a novel model to early detect images which may be subject to possible conflicting access control decisions. We present a group-based stochastic model able to identify potential privacy conflicts among multiple stakeholders of an image. We discuss experiments on a dataset of over 3000 online images, and compare our results with several baselines. Our approach outperforms all baselines, even the strong ones based on a Convolutional Neural Network architecture.
Machine Learning models learn the relationship between input and output by examples and then apply the learned models to relate unseen input. Although ML has successfully been used in almost every field, there is always room for improvement. To this end, researchers have recently been trying to implement Quantum Mechanics(QM) in ML, since it is believed that quantum-inspired ML can enhance learning rate and effectiveness. In this paper, we address a specific task of ML and present a binary classification model inspired by the quantum detection framework. We compared the model to the state of the art. Our experimental results suggest that the use of the quantum detection framework in binary classification can improve effectiveness for a number of topics of the RCV-1 test collection and that it may still provide ways to improve effectiveness for the other topics.
Networked prediction has attracted lots of research attention in recent years. Compared with the traditional learning setting, networked prediction is even harder to understand due to its coupled, \em multi-level nature. The learning process propagates top-down through the underlying network from the macro level (the entire learning system), to meso level (learning tasks), and to micro level (individual learning examples). In the meanwhile, the networked prediction setting also offers rich context to explain the learning process through the lens of \em multi-aspect, including training examples ( e.g., what are the most influential examples ), the learning tasks ( e.g., which tasks are most important ) and the task network ( e.g., which task connections are the keys ). Thus, we propose a multi-aspect, multi-level approach to explain networked prediction. The key idea is to efficiently quantify the influence on different levels of the learning system due to the perturbation of various aspects. The proposed method offers two distinctive advantages: (1) multi-aspect, multi-level: it is able to explain networked prediction from multiple aspects (i.e., example-task-network) at multiple levels (i.e., macro-meso-micro); (2) efficiency: it has a linear complexity by efficiently evaluating the influences of changes to the networked prediction without retraining.
There is an ever increasing number of rule learning algorithms and tools for automatic knowledge base (KB) construction. These tools often produce weighted rules and facts that make up a probabilistic KB (PKB). In such a PKB, probabilistic inference is used in order to perform marginal inference, consistency checking and other tasks. However, in general, inference is known to be intractable. Hence, recently, there are a number of studies aimed at lifting (making tractable or approximating) inference by exploiting symmetries in the structure of a PKB. These studies alleviate grounding entirely a given PKB which can generate a sizable factor graph for inference (e.g. to compute the probability of a query). In line with this, we propose a novel technique to automatically partition rules based on their structure for efficient parallel grounding. In addition, we perform query expansion so as to generate a factor graph small enough to be used for efficient probability computation. We present a novel approximate marginal inference algorithm that uses N-hop subgraph extraction and query expansion. Moreover, we show that our system is much faster than state-of-the-art systems.
Text clustering, which allows to divide a dataset into groups of similar documents, plays an important role at various stages of the information retrieval process. Co-clustering is an extension of one-side clustering, and consists in simultaneously clustering the rows and columns of a data matrix. However, while co-clustering algorithms consider both dimensions of a document-term matrix, they are usually evaluated on the quality of the obtained document clusters alone. In this paper, we therefore propose an evaluation scheme that accounts for the two-dimensional nature of co-clustering algorithms, thus allowing for a more precise evaluation of their performance. Another important benefit of the proposed approach is that it does not require the use of any prior labels. This is achieved by leveraging large, public domain embedding matrices (GloVe, word2vec, FastText) to compute comparable representations of both document and term clusters. Experiments carried out on several textual datasets show that the proposed measures are both reliable and stable, and can even provide hints to improve co-clustering performance.
User Identification with Spatio-Temporal awareness has been attracting much attention from academia. The existing methods not only handle temporal and spatial data separately but also do not consider conflictive check-in records. To tackle these problems, we propose a novel approach that consists of three parts. 1) Measure the similarity of users with a kernel density estimation based method, which handles the spatial and temporal data together; 2) Assign weights to check-in records following the idea of TFIDF, which highlights the discriminative check-in records; 3) Penalize the similarity of users based on the number of conflictive check-in records. The pair-wise users with similarity higher than a predefined threshold are considered to belong to the same individual. Experiments demonstrate the superiority of the proposed approach.
Neural word embedding approaches, due to their ability to capture semantic meanings of vocabulary terms, have recently gained attention of the information retrieval (IR) community and have shown promising results in improving ad hoc retrieval performance. It has been observed that these approaches are sensitive to various choices made during the learning of word embeddings and their usage, often leading to poor reproducibility. We study the effect of varying following two parameters, viz., i) the term normalization and ii) the choice of training collection, on ad hoc retrieval performance with word2vec and fastText embeddings. We present quantitative estimates of similarity of word vectors obtained under different settings, and use embeddings based query expansion task to understand the effects of these parameters on IR effectiveness.
Session-based recommendation performance has been significantly improved by Recurrent Neural Networks (RNN). However, existing RNN-based models do not expose the global knowledge of frequent click patterns or consider variability of sequential behaviors in sessions. In this paper, we propose a novel Variational Recurrent Model (VRM), which employs the stochastic latent variable to capture the knowledge of frequent click patterns and impose variability for the sequential behavior modeling. A stochastic generative process of session sequence is specified, where the latent variable modulates the generation of session sequences in RNN. We further extend VRM to a Conditional Variational Recurrent Model (CVRM) by considering additional information (e.g., focused category in sessions) as the generative condition. When evaluated on a public benchmark dataset, VRM and its extension clearly demonstrate their superiority over popular baselines and state-of-the-art models.
Recent advances in network representation learning have enabled significant improvements in the link prediction task, which is at the core of many downstream applications. As an increasing amount of mobility data becoming available due to the development of location technologies, we argue that this resourceful user mobility data can be used to improve link prediction performance. In this paper, we propose a novel link prediction framework that utilizes user offline check-in behavior combined with user online social relations. We model user offline location preference via probabilistic factor model and represent user social relations using neural network embedding. Furthermore, we employ locality-sensitive hashing to project the aggregated user representation into a binary matrix, which not only preserves the data structure but also speeds up the followed convolutional network learning. By comparing with several baseline methods that solely rely on social network or mobility data, we show that our unified approach significantly improves the performance.
Topic detection and tracking in document streams is a critical task in many important applications, hence has been attracting research interest in recent decades. With the large size of data streams, there have been a number of works from different approaches that propose automatic methods for the task. However, there is only a few small benchmark datasets that are publicly available for evaluating the proposed methods. The lack of large datasets with fine-grained groundtruth implicitly restrains the development of more advanced methods. In this work, we address this issue by collecting and publishing W2E - a large dataset consisting of news articles from more than 50 prominent mass media channels worldwide. The articles cover a large set of popular events within a full year. W2E is more than 15 times larger than TREC's TDT2 dataset, which is widely used in prior work. We further conduct exploratory analysis to examine the dynamics and diversity of W2E and propose potential uses of the dataset in other research.
Wireless coverage is the received signal strength of a particular region, which is a key prerequisite to provide high quality mobile communication service. In this paper, we aim to estimate the wireless coverage of an area based on the randomly distributed samples of received signal strength collected within the area. Specifically, we propose a weakly-supervised generative adversarial nets with auxiliary information (WS-GAN), which is a data-driven model to make estimation on the wireless coverage. Different from conventional methods (e.g., k-NN) that predict the missing values by simply replacing them with the average of nearby observations, WS-GAN adopts a GAN-based structure, in which the generator approximates the real distribution of observations under the supervision of the discriminator, while the discriminator tries to distinguish the ground truth from "faked" data produced by the generator. WS-GAN differs from the literature in two aspects: (1) WS-GAN is able to use the auxiliary information of local radio environment, e.g., geographical information as terrain and buildings, to improve the estimation performance, since such auxiliary information has significant impact on the variation of received signal strength. (2) The objective function of WS-GAN combines adversarial loss and a new light-weight reconstruction error term for weakly supervision, which are jointly optimized during training. The experiments on a real dataset show that WS-GAN can generally achieve more accurate results than baselines. Moreover, through a case study, we demonstrate that the wireless coverage maps generated by WS-GAN are more rational and practical than those obtained by baselines.
An effective news recommendation system should harness the historical information of the user based on her interactions as well as the content of the articles. In this paper we propose a novel deep learning model for news recommendation which utilizes the content of the news articles as well as the sequence in which the articles were read by the user. To model both of these information, which are essentially of different types, we propose a simple yet effective architecture which utilizes a 3-dimensional Convolutional Neural Network which takes the word embeddings of the articles present in the user history as its input. Using such a method endows the model with the capability to automatically learn spatial (features of a particular article) as well as temporal features (features across articles read by a user) which signify the interest of the user. At test time, we use this in combination with a 2-dimensional Convolutional Neural Network for recommending articles to users. On a real-world dataset our method outperformed strong baselines which also model the news recommendation problem using neural networks.
Recently, convolutional neural networks(CNNs) has been demonstrated to effectively model reviews in recommender systems, due to the learning of contextual features such as surrounding words and word order for reviews. However, CNNs with max-pooling fails to capture the count information of contextual features, since the feature map generated by a convolution filter can only get a max feature value with max-pooling. If the max feature value appears more than once in the feature map, CNNs will lose the count information of the contextual feature. The count information is quite critical for modeling reviews, for example, ten "a good quality" is more credible than one "a good quality" for representing item properties, and five "the scenery is" shows that a user may pay more attention to the scenery than one "the scenery is" does. Our model, called WCN-MF, extends CNNs by introducing a new module named Deep Latent Dirichlet Allocation (DLDA) to capture the count information of contextual features. DLDA is inspired by the fact that contextual features consist of words, hence, we can first capture the count information of words, and second generate contextual features with words. By combining DLDA with CNNs, we can get a word-driven and context-aware review representation. Further, we incorporate the review representation with Matrix Factorization for recommendation. Our evaluations on three real- world datasets reveal that our model can significantly outperform the state-of-the-art recommendation models.
The rise of the Internet of Things (IoT) and the recent focus on a gamut of 'Smart City' initiatives world-wide have pushed for new advances in data stream systems to (1) support complex analytics and evolving graph applications as continuous queries, and (2) deliver fast and scalable processing on large data streams. Unfortunately current continuous query languages (CQL) lack the features and constructs needed to support the more advanced applications. For example recursive queries are now part of SQL, Datalog, and other query languages, but they are not supported by most CQLs, a fact that caused a significant loss of expressive power, which is further aggravated by the limitation that only non-blocking queries can be supported. To overcome these limitations we have developed an a dvanced st ream r easo ning system ASTRO that builds on recent advances in supporting aggregates in recursive queries. In this demo, we will briefly elucidate the formal Streamlog semantics, which combined with the Pre-Mappability (PreM) concept, allows the declarative specification of many complex continuous queries, which are then efficiently executed in real-time by the portable ASTRO architecture. Using different case studies, we demonstrate (i) the ease-of-use, (ii) the expressive power and (iii) the robustness of our system, as compared to other state-of-the-art declarative CQL systems.
BBoxDB is a distributed and highly available key-bounding-box-value store which enhances the classical key-value data model with an axis-parallel bounding box. The bounding box describes the location of the values in an n-dimensional space, and enables BBoxDB to efficiently distribute multi-dimensional data across a cluster of nodes. Well-known geometric algorithms (such as the K-D Tree) are used to create distribution regions (multi-dimensional shards). Distribution regions are created dynamically, based on the stored data. BBoxDB stores data of multiple tables co-partitioned, which enables efficient distributed spatial joins. Spatial joins on co-partitioned tables can be executed without data shuffling between nodes. A two-level index structure is employed to retrieve stored data quickly. We demonstrate the interaction with the system, the dynamic creation of distribution regions and the data redistribution feature of BBoxDB.
The large amount of heterogeneous data in these email corpora renders experts' investigations by hand infeasible. Auditors or journalists, e.g., who are looking for irregular or inappropriate content or suspicious patterns, are in desperate need for computer-aided exploration tools to support their investigations.
We present our Beacon system for the exploration of such corpora at different levels of detail. A distributed processing pipeline combines text mining methods and social network analysis to augment the already semi-structured nature of emails. The user interface ties into the resulting cleaned and enriched dataset. For the interface design we identify three objectives expert users have: gain an initial overview of the data to identify leads to investigate, understand the context of the information at hand, and have meaningful filters to iteratively focus onto a subset of emails. To this end we make use of interactive visualisations based on rearranged and aggregated extracted information to reveal salient patterns.
The improvement of collision avoidance for ship navigation in encounter situation is an important topic in maritime traffic safety. Most research on maritime collision avoidance has focused on planning a safe path for a ship to keep away from the approaching ship under the requirements of the International Regulations for Preventing Collision at Sea (COLREGs). However, the specific anti-collision actions are actually carried out by the navigators' own experience according to the local encounter situation.
In this paper, different from the existing works, we discover the collision avoidance behavior from real ships' movement, i.e., AIS trajectory data. However, the uncertainty of maritime trajectory data brings the challenge of collision avoidance behavior mining. ollision behavior, which is effective in the encounter situation, and generate the discovered behavior in form of collision avoidance pattern. Furthermore, a prototype of CAPatternMiner is built for pattern analysis and visualization and also benefits a deeper understanding of collision avoidance behavior on maritime traffic. The proposed framework will be applied to the developing of pattern-aware collision avoidance system to improve the maritime traffic safety.
Explaining the results of data-intensive computation via provenance has been extensively studied in the literature. We focus here on explaining the output of Machine Learning Classifiers, which are main components of many contemporary Data Science applications. We have developed a simple generic approach for explaining classification results, by looking for constrained perturbations to parts of the input that would have the most significant effect on the classification. Our solution requires white-box access to the model internals and a specification of constraints that define which perturbations are "reasonable" in the application domain; both are typically available to the data scientist.
We propose to demonstrate CEC, a system prototype that is based on these foundations to provide generic explanations for Neural Networks and Random Forests. We will demonstrate the system usefulness in the context of two application domains: bank marketing campaigns, and visually clear explanations for image classifications. We will highlight the benefit that such explanations could yield to the data scientist and interactively engage the audience in computing and viewing explanations for different cases and different sets of constraints.
The integration of diverse structured and unstructured information sources into a unified, domain-specific knowledge base is an important task in many areas. A well-maintained knowledge base enables data analysis in complex scenarios, such as risk analysis in the financial sector or investigating large data leaks, such as the Paradise or Panama papers. Both the creation of such knowledge bases, as well as their continuous maintenance and curation involves many complex tasks and considerable manual effort.
With CurEx, we present a modular system that allows structured and unstructured data sources to be integrated into a domain-specific knowledge base. In particular, we (i) enable the incremental improvement of each individual integration component; (ii) enable the selective generation of multiple knowledge graphs from the information contained in the knowledge base; and (iii) provide two distinct user interfaces tailored to the needs of data engineers and end-users respectively. The former has curation capabilities and controls the integration process, whereas the latter focuses on the exploration of the generated knowledge graph.
In the last ten years, genomic computing has made gigantic steps due to Next Generation Sequencing (NGS), a high-throughput, massively parallel technology; the cost of producing a complete human sequence dropped to 1000 US$ in 2015 and is expected to drop below 100 US$ by 2020. Several new methods have recently become available for extracting heterogeneous datasets from the genome, revealing data signals such as variations from a reference sequence, levels of expression of coding regions, or protein binding enrichments ('peaks') with their statistical or geometric properties. Huge collections of such datasets are made available by large international consortia.
In this new context, we developed GenoMetric Query Language (GMQL), a new data extraction and integration language. GMQL supports queries over thousands of heterogeneous datasets; as such, it is key to genomic data analysis. GMQL queries are executed on the cloud, after being translated and optimized; our best deployment uses Spark over Hadoop. Datasets are described by the Genomic Data Model (GDM), which provides interoperability between many data formats; GDM combines abstractions for genomic region data with the associated experimental, biological and clinical metadata.
GMQL is targeted to the bio-informatics community for facilitating data exploration and for integrating data extraction and data analysis; this demonstration highlights its usability and expressive power. We show GMQL at work from a Web-based user interface and from a language embedding (Python).
Drug-drug interaction related adverse events (DIAE) signals are a major public health issue. Drug safety analysts must sift through thousands of adverse event reports submitted daily to U.S. Food and Drug Administration (FDA) to discover unexpected DIAE signals, which if addressed can lead to life-saving actions. To facilitate the DIAE discovery from these massive data sets, we design several technological innovations that together are integrated into an interactive visual analytics system called DEVES 1. First, our state- of-the-art DIAE mining algorithm efficiently infers potential DIAE signals from these reports, and then ranks them based on their significance score. For interpretability of these inferred DIAE signals, domain knowledge of adverse events and already known drug interactions is extracted from external authoritative data sources and then seamlessly integrated with the inferred signal set. Guided by this augmented signal model, DEVES supports advanced signal analytics - empowering the analyst to interact with linked visualizations offering complementary perspectives into the signal set and its supporting evidence in the form of reports. Our demonstration showcases the technological innovations of DEVES using real-world FDA datasets, demonstrating that DEVES effectively supports the core regulatory tasks from signal screening, signal prioritization to signal validation.
Logistics management plays a crucial role in the execution and success of international trades. Current solutions for logistics management suffer from a number of issues. One of the major areas for improvement is inconsistent data and lack of trust and transparency between multiple participants. Distributed Ledger Technology (DLT) or blockchain has the inherent characteristics to address this issue, and is well suited to be applied to supply chain document and workflow management. This paper demonstrates how to apply DLT to achieve a higher level of efficiency through consistent data store, automated workflow process, and tamper-proof transaction history for provenance in the supply chain.
Human language constantly evolves due to the changing world and the need for easier forms of expression and communication. Our knowledge of language evolution is however still fragmentary despite significant interest of both researchers as well as wider public in the evolution of language. In this paper, we present an interactive framework that permits users study the evolution of words and concepts. The system we propose offers a rich online interface allowing arbitrary queries and complex analytics over large scale historical textual data, letting users investigate changes in meaning, context and word relationships across time.
Exploring large medical image sets by means of traditional similarity query criteria (e.g., neighborhood) can be fruitless if retrieved images are too similar among themselves. This demonstration introduces Kundaha, an exploration tool that assists experts in retrieving and navigating on results from a diversified similarity perspective of user-posed queries. Its implementation includes a wide set of metrics, descriptors, and indexes for enhancing query execution. Users can combine such features with diversified similarity criteria for the organized exploration of result sets and also employ relevance feedback cycles for finding new query-based viewpoints.
Navigating large knowledge bases made of billions of triples is very challenging. In this demonstration, we showcase Fouilla, a topical Knowledge Base browser that offers a seamless navigational experience of DBpedia. We propose an original approach that leverages both structural and semantic contents of Wikipedia to enable a topic-oriented filter on DBpedia entities. We devise an approach to drastically reduce the query time and to ensure a seamless browsing experience to the end user. We demonstrate how our system offers a novel and meaningful experience of DBpedia browsing by challenging the user to search for relevant information within the Knowledge Base in different use cases.
Copernicus is the European program for monitoring the Earth. It consists of a set of complex systems that collect data from satellites and in-situ sensors, process this data and provide users with reliable and up-to-date information on a range of environmental and security issues. The data collected by Copernicus is made available freely following an open access policy. Information extracted from Copernicus data is disseminated to users through the Copernicus services which address six thematic areas: land, marine, atmosphere, climate, emergency and security. We present a demo from the Horizon 2020 Copernicus App Lab project which takes big data from the Copernicus land service, makes it available on the Web as linked geospatial data and interlinks it with other useful public data to aid the development of applications by developers that might not be Earth Observation experts. Our demo targets a scenario where we want to study the "greenness" of Paris.
The massive captured data from industrial sensors (time-series data) that could serve as relevant indicators for predictive maintenance of equipment, fault diagnosis, etc. is generating a problem related to the considerable costs associated with their storage.
In this paper we present a system called I4TSRS1, available as a Web Application, that efficiently guides a data engineer in the task of obtaining industrial time-series data reduced representations that preserve their main characteristics. Dealing with those reduced representations, data storage and transmission costs can be decreased, without limiting the future exploitation of the data in different processes. The novel contribution of the I4TSRS is that it is an intelligent system that recommends the best techniques to achieve a reduced representation of time-series captured in industrial settings. Its core element is a machine learning model that combines time-series reduction techniques with extracted features from industrial time-series. We have built the model using several heterogeneous real industrial time-series.
Influence Maximization (IM) is the problem of finding a set of influential users in a social network, so that their aggregated influence is maximized. IM has natural applications in viral marketing and has been the focus of extensive recent research. One critical problem, however, is that while existing IM algorithms serve the goal of reaching a large audience, they may obliviously focus on certain well-connected populations, at the expense of key demographics, creating an undesirable imbalance, an illustration of a broad phenomenon referred to as algorithmic discrimination. Indeed, we demonstrate an inherent trade-off between two objectives: (1) maximizing the overall influence and (2) maximizing influence over a predefined "protected" demographic, with the optimal balance between the two being open to different interpretations. To this end, we present IM-Balanced, a system enabling end users to declaratively specify the desired trade-off between these objectives w.r.t. an emphasized population. IM-Balanced provides theoretical guarantees for the proximity to the optimal solution in terms of both objectives and ensures an efficient, scalable computation via careful adaptation of existing state-of-the-art IM algorithms. Our demonstration illustrates the effectiveness of our approach through real-life viral marketing scenarios in an academic social network.
Digital mathematical libraries (DMLs) such as arXiv, Numdam, and EuDML contain mainly documents from STEM fields, where mathematical formulae are often more important than text for understanding. Conventional information retrieval (IR) systems are unable to represent formulae and they are therefore ill-suited for math information retrieval (MIR). To fill the gap, we have developed, and open-sourced the MIaS MIR system. MIaS is based on the full-text search engine Apache Lucene. On top of text retrieval, MIaS also incorporates a set of tools for preprocessing mathematical formulae. We describe the design of the system and present speed, and quality evaluation results. We show that MIaS is both efficient, and effective, as evidenced by our victory in the NTCIR-11 Math-2 task.
We present Ontop-temporal, an extension of the ontology-based data access system Ontop for query answering with temporal data and ontologies. Ontop is a system to answer SPARQL queries over various data stores, using standard R2RML mappings and an OWL2QL domain ontology to produce high-level conceptual views over the raw data. The Ontop-temporal extension is designed to handle timestamped log data, by additionally using (i) mappings supporting validity time specification, and (ii) rules based on metric temporal logic to define temporalised concepts. In this demo we present how Ontop-temporal can be used to facilitate the access to the MIMIC-III critical care unit dataset containing log data on hospital admissions, procedures, and diagnoses. We use the ICD9CM diagnoses ontology and temporal rules formalising the selection of patients for clinical trials taken from the clinicaltrials.gov database. We demonstrate how high-level queries can be answered by Ontop-temporal to identify patients eligible for the trials.
Manually constructing rankings is a tedious ad-hoc process, requiring extensive user effort to evaluate data attribute importance, and often leading to undesirable results. Meanwhile, sophisticated learning-to-rank algorithms are able to leverage large amounts of data to generate high quality rankings automatically. In this work we present RanKit, a novel technology that brings the power of automatic learning-to-rank to the public. RanKit serves as a personal recommender system for building rankings from partial user knowledge in the form of item preferences. A user-friendly rank building interface provides rich input modes for preference specification. Visual feedback on the quality of the learned ranking model is given in real time, empowering the user to guide the underlying learn-to-rank algorithm. Users are actively involved with every step of the rank generation process, developing trust in the model and producing personalized rankings suitable for real-world decision making. In this demonstration, the audience works directly with the RanKit system on public domain datasets ranging from college rankings and economic indicators to movies and sports.
The collaboration of financial institutes against fraudsters is a promising path for reducing resource investments and increasing coverage. Yet, such collaboration is held back by two somewhat conflicting challenges: effective knowledge sharing and limiting leakage of private information. While the censorship of private information is likely to reduce knowledge sharing effectiveness, the generalization of private information to a desired degree can potentially allow, on one hand, to limit the leakage, and on the other hand, to reveal some properties of the private information that can be beneficial for sharing.
In this demo we present a system that allows knowledge sharing via effective adaptation of fraud detection rules while preserving privacy. The system uses taxonomies to generalize concrete values appearing in fraud detection rules to higher level concepts which conform to some privacy/utility requirements set by the owner. Our demonstration will engage the CIKM'18 audience by showing that private information can be abstracted to enforce privacy while maintaining its usage by (partially) trusted allies.
We present an open source tool, searchrefiner, for researchers that conduct medical systematic reviews to assist in formulating, visualising, and understanding Boolean queries. The searchrefiner web interface allows researchers to explore how Boolean queries retrieve citations in existing, popular query syntaxes used in systematic review literature search. The web interface allows researchers to perform tasks such as using validation citations to ensure queries are retrieving a minimum set of known relevant citations, and editing Boolean queries by dragging and dropping clauses in a structured editor. In addition, the tools provided by the searchrefiner interface allow researchers to visualise why the queries they formulate retrieve citations, and ways to understand how to refine queries into more effective ones. searchrefiner is targeted at both experts and novices, as a tool for query formulation and refinement, and as a tool for training users to search for literature to compile systematic reviews.
The searchrefiner website located at https://ielab.io/searchrefiner contains information about how to download, install, and use the tool, as well as a link to an online hosted version for demonstration purposes.
Strong member privacy in technology enterprises involves, among other objectives, deleting or anonymizing various kinds of data that a company controls. Those requirements are complicated in a heterogeneous data ecosystem where data is stored in multiple stores with different semantics: different indexing or update capabilities require processes specific to a store or even schema. In this demo we showcase a method to enforce record controls of arbitrary data stores via a three step process: generate an offline snapshot, run a policy mechanism to select rows to delete/update, and apply the changes to the original store. The first and third steps work on any store by leveraging Apache Gobblin, an open source data integration framework. The policy computation step runs as a batch Gobblin job where each table can be customized via a dataset metadata tracking system and SQL expressions providing table-specific business logic. This setup allows enforcement of highly-customizable privacy requirements in a variety of systems from hosted databases to third party data storage systems.
Data scientists are usually interested in a subset of sources with properties that are most aligned to intended data use. The SOURCERY system supports interactive multi-criteria user-driven source selection. SOURCERY allows a user to identify criteria they consider of importance and indicate their relative importance, and seeks a source selection result aligned to the user-supplied criteria preferences. The user is given an overview of the properties of the sources that are selected along with visual analyses contextualizing the result in relation to what is theoretically possible and what is possible given the set of available sources. The system also enables a user to interactively perform iterative fine-tuning to explore how changes to preferences may impact results.
A growing number of domains (finance, seismology, internet-of-things, etc.) collect massive time series. When the number of series grow to the hundreds of millions or even billions, similarity queries become intractable on a single machine. Further, naive (quadratic) parallelization won't work well. So, we need both efficient indexing and parallelization. We propose a demonstration of Spark-parSketch, a complete solution based on sketches / random projections to efficiently perform both the parallel indexing of large sets of time series and a similarity search on them. Because our method is approximate, we explore the tradeoff between time and precision. A video showing the dynamics of the demonstration can be found by the link http://parsketch.gforge.inria.fr/video/parSketchdemo_720p.mov.
As road transportation supports both economic and social activities in developed cities, it is important to maintain smooth traffic on all highways and local roads. Whenever possible, traffic congestions should be detected early and resolved quickly. While existing traffic monitoring dashboard systems have been put in place in many cities, these systems require high-cost vehicle speed monitoring instruments and detect traffic congestion as independent events. There is a lack of low-cost dashboards to inspect and analyze the lifecycle of traffic congestion which is critical in assessing the overall impact of congestion, determining the possible the source(s) of congestion and its evolution. In the absence of publicly available sophisticated road sensor data which measures on-road vehicle speed, we make use of publicly available vehicle trajectory data to detect the lifecycle of traffic congestion, also known as congestion cascade. We have developed Traffic-Cascade, a dashboard system to identify traffic congestion events, compile them into congestion cascades, and visualize them on a web dashboard. Traffic-Cascade unveils spatio-temporal insights of the congestion cascades.
In this paper we present a web-based prototype for an explainable ranking algorithm in multi-layered networks, incorporating both network topology and knowledge information. While traditional ranking algorithms such as PageRank and HITS are important tools for exploring the underlying structure of networks, they have two fundamental limitations in their efforts to generate high accuracy rankings. First, they are primarily focused on network topology, leaving out additional sources of information (e.g. attributes, knowledge). Secondly, most algorithms do not provide explanations to the end-users on why the algorithm gives the specific ranking results, hindering the usability of the ranking information. We developed Xrank, an explainable ranking tool, to address these drawbacks. Empirical results indicate that our explainable ranking method not only improves ranking accuracy, but facilitates user understanding of the ranking by exploring the top influential elements in multi-layered networks. The web-based prototype (Xrank: http://www.x-rank.net) is currently online - we believe it will assist both researchers and practitioners looking to explore and exploit multi-layered network data.
Personal information is typically fragmented across multiple, heterogeneous, distributed sources and saved as small, heterogeneous data objects, or traces. The DigitalSelf project at Rutgers University focuses on developing tools and techniques to manage (organize, search, summarize, make inferences on and personalize) such heterogeneous collections of personal digital traces. We propose to demonstrate YourDigitalSelf, a mobile phone-based personal information organization application developed as part of the DigitalSelf project. The demonstration will use a sample user data set to show how several disparate data traces can be integrated and combined to create personal narratives, or coherent episodes, of the user's activities. Conference attendees will be given the option to install YourDigitalSelf on their own devices to interact with their own data.
Helpdesk is a key component of any large IT organization, where users can log a ticket about any issue they face related to IT infrastructure, administrative services, human resource services, etc. Normally, users have to assign appropriate set of labels to a ticket so that it could be routed to right domain expert who can help resolve the issue. In practice, the number of labels are very large and organized in form of a tree. It is non-trivial to describe the issue completely and attach appropriate labels unless one knows the cause of the problem and the related labels. Sometimes domain experts discuss the issue with the users and change the ticket labels accordingly, without modifying the ticket description. This results in inconsistent and badly labeled data, making it hard for supervised algorithms to learn from. In this paper, we propose a novel approach of creating a conversational helpdesk system, which will ask relevant questions to the user, for identification of the right category and will then raise a ticket on users' behalf. We use attention based seq2seq model to assign the hierarchical categories to tickets. We use a slot filling model to help us decide what questions to ask to the user, if the top-k model predictions are not consistent. We also present a novel approach to generate training data for the slot filling model automatically based on attention in the hierarchical classification model. We demonstrate via a simulated user that the proposed approach can give us a significant gain in accuracy on ticket-data without asking too many questions to users. Finally, we also show that our seq2seq model is as versatile as other approaches on publicly available datasets, as state of the art approaches.
Community detection in complex networks is a fundamental problem that attracts much attention across various disciplines. Previous studies have been mostly focusing on external connections between nodes (i.e., topology structure) in the network whereas largely ignoring internal intricacies (i.e., local behavior) of each node. A pair of nodes without any interaction can still share similar internal behaviors. For example, in an enterprise information network, compromised computers controlled by the same intruder often demonstrate similar abnormal behaviors even if they do not connect with each other. In this paper, we study the problem of community detection in enterprise information networks, where large-scale internal events and external events coexist on each host. The discovered host communities, capturing behavioral affinity, can benefit many comparative analysis tasks such as host anomaly assessment. In particular, we propose a novel community detection framework to identify behavior-based host communities in enterprise information networks, purely based on large-scale heterogeneous event data. We continue proposing an efficient method for assessing host's anomaly level by leveraging the detected host communities. Experimental results on enterprise networks demonstrate the effectiveness of our model.
Given a large number of low-quality heterogeneous categorical alerts collected from an anomaly detection system, how to characterize the complex relationships between different alerts and deliver trustworthy rankings to end users? While existing techniques focus on either mining alert patterns or filtering out false positive alerts, it can be more advantageous to consider the two perspectives simultaneously in order to improve detection accuracy and better understand abnormal system behaviors. In this paper, we propose CAR, a collaborative alert ranking framework that exploits both temporal and content correlations from heterogeneous categorical alerts. CAR first builds a hierarchical Bayesian model to capture both short-term and long-term dependencies in each alert sequence. Then, an entity embedding-based model is proposed to learn the content correlations between alerts via their heterogeneous categorical attributes. Finally, by incorporating both temporal and content dependencies into a unified optimization framework, CAR ranks both alerts and their corresponding alert patterns. Our experiments-using both synthetic and real-world enterprise security alert data-show that CAR can accurately identify true positive alerts and successfully reconstruct the attack scenarios at the same time.
Job recommendation is an important task for the modern recruitment industry. An excellent job recommender system not only enables to recommend a higher paying job which is maximally aligned with the skill-set of the current job, but also suggests to acquire few additional skills which are required to assume the new position. In this work, we created three types of information net- works from the historical job data: (i) job transition network, (ii) job-skill network, and (iii) skill co-occurrence network. We provide a representation learning model which can utilize the information from all three networks to jointly learn the representation of the jobs and skills in the shared k-dimensional latent space. In our experiments, we show that by jointly learning the representation for the jobs and skills, our model provides better recommendation for both jobs and skills. Additionally, we also show some case studies which validate our claims.
Matching buyers with most suitable sellers providing relevant items (e.g., products) is essential for e-commerce platforms to guarantee customer experience. This matching process is usually achieved through modeling inter-group (buyer-seller) proximity by e-commerce ranking systems. However, current ranking systems often match buyers with sellers of various qualities, and the mismatch is detrimental to not only buyers' level of satisfaction but also the platforms' return on investment (ROI). In this paper, we address this problem by incorporating intra-group structural information (e.g., buyer-buyer proximity implied by buyer attributes) into the ranking systems. Specifically, we propose De ep Gr aph E mbe dding (DEGREE), a deep learning based method, to exploit both inter-group and intra-group proximities jointly for structural learning. With a sparse filtering technique, DEGREE can significantly improve the matching performance with computation resources less than that of alternative deep learning based methods. Experimental results demonstrate that DEGREE outperforms state-of-the-art graph embedding methods on real-world e-commence datasets. In particular, our solution boosts the average unit price in purchases during an online A/B test by up to 11.93%, leading to better operational efficiency and shopping experience.
The success of applications that process data critically depends on the quality of the ingested data. Completeness of a data source is essential in many cases. Yet, most missing value imputation approaches suffer from severe limitations. They are almost exclusively restricted to numerical data, and they either offer only simple imputation methods or are difficult to scale and maintain in production. Here we present a robust and scalable approach to imputation that extends to tables with non-numerical values, including unstructured text data in diverse languages. Experiments on public data sets as well as data sets sampled from a large product catalog in different languages (English and Japanese) demonstrate that the proposed approach is both scalable and yields more accurate imputations than previous approaches. Training on data sets with several million rows is a matter of minutes on a single machine. With a median imputation F1 score of 0.93 across a broad selection of data sets our approach achieves on average a 23-fold improvement compared to mode imputation. While our system allows users to apply state-of-the-art deep learning models if needed, we find that often simple linear n-gram models perform on par with deep learning methods at a much lower operational cost. The proposed method learns all parameters of the entire imputation pipeline automatically in an end-to-end fashion, rendering it attractive as a generic plugin both for engineers in charge of data pipelines where data completeness is relevant, as well as for practitioners without expertise in machine learning who need to impute missing values in tables with non-numerical data.
With the increasing volume of transactions taking place online, mobile fraud has also increased. Mobile applications often authenticate the user only at install time. The user may then remain logged in for hours or weeks. Any unauthorized access may lead to financial, criminal or privacy losses. In this work, we leverage currently available built-in motion sensors in smartphones to learn users' behavioral characteristics while interacting with the mobile device to provide an implicit re-authentication mechanism that enables a frictionless and secure user experience in the application. This approach improves the generality as well as power efficiency of the authentication mechanism compared to using the camera feed which involves (a) specific hardware, (b) higher battery usage and (c) privacy concerns. We present DeepAuth as a generic framework for re-authenticating users in a mobile app. In our approach, we use time and frequency domain features extracted from motion sensors and a long short-term memory (LSTM) model with negative sampling to build a re-authentication framework. The framework is able to re-authenticate a user with 96.70% accuracy in 20 seconds from a set of data collected from 47 volunteers.
With over 34 billion IoT devices to be installed by 2020, the Internet of Things (IoT) is fundamentally changing our lives. One of the greatest benefits of the IoT is the powerful automations achieved by applying rules to IoT devices. For instance, a rule named "Make me a cup of coffee when I wake up'' automatically turns on the coffee machine when the sensor in the bedroom detects motion in the morning. With large numbers of possible rules out there, a recommendation system is of great necessity to help users find rules they need. However, little effort has been made to design a model tailored for the IoT rule recommendation, which comes with lots of new challenges compared with traditional recommendation tasks. We not only need to re-define "users'' and "items'' in the recommendation task, but also have to consider a new type of entities, devices, and the extra information and constraints brought by them. To handle these challenges, we propose a novel efficient recommendation algorithm, which not only considers the implicit feedback of users on rules, but also takes user-rule-device interactions and the match between rule device requirements and user device possessions into account. In collaboration with Samsung, one of the leading companies in this field, we have designed an IoT rule recommendation framework and evaluated our algorithm on a real-life industry dataset. Experiments show the effectiveness and efficiency of our method.
User action modeling and prediction has long been a topic of importance to recommender systems and user profiling. The quality of the model or accuracy of prediction plays a vital role in related applications like recommendation, advertisement displaying, searching, etc. For large scale systems with a massive number of users, beside the pure prediction performance, there are other practical factors like training and prediction latency, memory overhead, that must be optimized to ensure smooth operation of the system. We propose a fast linear computational framework to handle a vast number of second order crossed features with dimensionality reduction. By leveraging the training and serving system architecture, we shift heavy calculation burden from online serving to offline preprocessing, at the cost of a reasonable amount of memory overhead. The experiments on a 15-day data trace from Tencent MyApp shows that our proposed framework can achieve comparable prediction performance to much complex models like the field-aware factorization machine (FFM) while being served in 2 ms with a reasonable amount of memory overhead.
Mobile devices (e.g., smartphones) play a crucial role in our daily lives nowadays. People rely heavily on mobile devices for searching online, sending emails, chatting with friends, etc. As a result, input efficiency becomes increasingly important for real-time communication on mobile devices. Due to the small size of the screen on mobile devices, however, it is oftentimes frustrating for users to correct or update the input sequences on an even smaller input area on the screen. This often causes poor user experience. In this paper, we focus on improving the input efficiency on mobile devices to offer better user experience. In order to achieve efficient input, there are multiple challenges: 1) how to employ a single, unified representation of the keyboard layouts for different input languages; 2) how to build a framework to correct a mistouch immediately and predict the coming input texts (words or phrases) effectively; 3) how to deploy and evaluate the model on mobile devices with limited computational power. To address these challenges, we introduce \em FastInput to improve the user input efficiency on mobile devices. Three key techniques are developed in FastInput -- layout modeling, instant mistouch correction and user input text prediction. We also design solutions for efficient deployment and evaluation of FastInput on mobile devices. The proposed FastInput achieves higher efficiency compared to the traditional input system over millions of user input sequences in different languages.
Paraphrase identification (PI) aims at determining whether two natural language sentences roughly have identical meaning. PI has been conventionally formalized as a binary classification task and widely used in many talks such as text summarization, plagiarism detection, etc. The emergence of deep neural networks (DNNs) renovates and dominates the learning paradigm of PI, as DNNs do not rely on lexical nor syntactic knowledge of a language, unlike traditional methods. State-of-the-art DNNs-based approaches to PI mainly adopt multi-layer convolutional neural networks (CNNs) to model paraphrastic sentences, which could discover alignments of phrases with the same length (unigram-to-unigram, bigram-to-bigram, trigram-to-trigram, etc.) at each layer. However, paraphrasing phenomena globally exist at all levels of granularity between a pair of paraphrastic sentences, i.e., word-to-word, word-to-phrase, phrase-to-phrase, and even sentence-to-sentence.
In this paper, we contribute a globalization-semantic matching neural network (GSMNN) paradigm which has been deployed in Baidu.com to tackle practical PI problems. Established on a weight-sharing single-layer CNN, GSMNN is composed of a multi-granular matching layer with the attention mechanism and a sentence-level matching layer. These layers are designed to capture essentially all phenomena of semantic matching. Evaluations are conducted on a public large-scale dataset for PI: Quora-QP which contains more than 400,000 paraphrasing and non-paraphrasing question pairs from Quora.com. Experimental results show that GSMNN outperforms the state-of-the-art model by a substantial margin.
We present, GEM, the first heterogeneous graph neural network approach for detecting malicious accounts at Alipay, one of the world's leading mobile cashless payment platform. Our approach, inspired from a connected subgraph approach, adaptively learns discriminative embeddings from heterogeneous account-device graphs based on two fundamental weaknesses of attackers, i.e. device aggregation and activity aggregation. For the heterogeneous graph consists of various types of nodes, we propose an attention mechanism to learn the importance of different types of nodes, while using the sum operator for modeling the aggregation patterns of nodes in each type. Experiments show that our approaches consistently perform promising results compared with competitive methods over time.
In Taobao, the largest e-commerce platform in China, billions of items are provided and typically displayed with their images.For better user experience and business effectiveness, Click Through Rate (CTR) prediction in online advertising system exploits abundant user historical behaviors to identify whether a user is interested in a candidate ad. Enhancing behavior representations with user behavior images will help understand user's visual preference and improve the accuracy of CTR prediction greatly. So we propose to model user preference jointly with user behavior ID features and behavior images. However, training with user behavior images brings tens to hundreds of images in one sample, giving rise to a great challenge in both communication and computation. To handle these challenges, we propose a novel and efficient distributed machine learning paradigm called Advanced Model Server (AMS). With the well-known Parameter Server (PS) framework, each server node handles a separate part of parameters and updates them independently. AMS goes beyond this and is designed to be capable of learning a unified image descriptor model shared by all server nodes which embeds large images into low dimensional high level features before transmitting images to worker nodes. AMS thus dramatically reduces the communication load and enables the arduous joint training process. Based on AMS, the methods of effectively combining the images and ID features are carefully studied, and then we propose a Deep Image CTR Model. Our approach is shown to achieve significant improvements in both online and offline evaluations, and has been deployed in Taobao display advertising system serving the main traffic.
The knowledge of all occupied and unoccupied trips made by self-employed drivers are essential for optimized vehicle dispatch by ride-hailing services (e.g., Didi Dache, Uber, Lyft, Grab, etc.). However, the occupancy status of vehicles is not always known to the service operators due to adoption of multiple ride-hailing apps. In this paper, we propose a novel framework, Learning to INfer Trips (LINT), to infer occupancy of car trips by exploring characteristics of observed occupied trips. Two main research steps, stop point classification and structural segmentation, are included in LINT. In the stop point classification step, we represent a vehicle trajectory as a sequence of stop points, and assign stop points with pick-up, drop-off, and intermediate labels. The classification of vehicle trajectory stop points produces a stop point label sequence. For structural segmentation, we further propose several segmentation algorithms, including greedy segmentation (GS), efficient greedy segmentation (EGS), and dynamic programming-based segmentation (DP) to infer occupied trip from stop point label sequences. Our comprehensive experiments on real vehicle trajectories from self-employed drivers show that (1) the proposed stop point classifier predicts stop point labels with high accuracy, and (2) the proposed segmentation algorithm GS delivers the best accuracy performance with efficient running time.
Previous efforts in recommendation of candidates for talent search followed the general pattern of receiving an initial search criteria and generating a set of candidates utilizing a pre-trained model. Traditionally, the generated recommendations are final, that is, the list of potential candidates is not modified unless the user explicitly changes his/her search criteria. In this paper, we are proposing a candidate recommendation model which takes into account the immediate feedback of the user, and updates the candidate recommendations at each step. This setting also allows for very uninformative initial search queries, since we pinpoint the user's intent due to the feedback during the search session. To achieve our goal, we employ an intent clustering method based on topic modeling which separates the candidate space into meaningful, possibly overlapping, subsets (which we call intent clusters) for each position. On top of the candidate segments, we apply a multi-armed bandit approach to choose which intent cluster is more appropriate for the current session. We also present an online learning scheme which updates the intent clusters within the session, due to user feedback, to achieve further personalization. Our offline experiments as well as the results from the online deployment of our solution demonstrate the benefits of our proposed methodology.
Recent years have witnessed a widespread increase of rumor news generated by humans and machines. Therefore, tools for investigating rumor news have become an urgent necessity. One useful function of such tools is to see ways a specific topic or event is represented by presenting different points of view from multiple sources. In this paper, we propose Maester, a novel agreement-aware search framework for investigating rumor news. Given an investigative question, Maester will retrieve related articles to that question, assign and display top articles from agree, disagree, and discuss categories to users. Splitting the results into these three categories provides the user a holistic view towards the investigative question. We build Maester based on the following two key observations: (1) relatedness can commonly be determined by keywords and entities occurring in both questions and articles, and (2) the level of agreement between the investigative question and the related news article can often be decided by a few key sentences. Accordingly, we use gradient boosting tree models with keyword/entity matching features for relatedness detection, and leverage recurrent neural network to infer the level of agreement. Our experiments on the Fake News Challenge (FNC) dataset demonstrate up to an order of magnitude improvement of Maester over the original FNC winning solution, for agreement-aware search.
User information needs vary significantly across different tasks, and therefore their queries will also differ considerably in their expressiveness and semantics. Many studies have been proposed to model such query diversity by obtaining query types and building query-dependent ranking models. These studies typically require either a labeled query dataset or clicks from multiple users aggregated over the same document. These techniques, however, are not applicable when manual query labeling is not viable, and aggregated clicks are unavailable due to the private nature of the document collection, e.g., in email search scenarios. In this paper, we study how to obtain query type in an unsupervised fashion and how to incorporate this information into query-dependent ranking models. We first develop a hierarchical clustering algorithm based on truncated SVD and varimax rotation to obtain coarse-to-fine query types. Then, we study three query-dependent ranking models, including two neural models that leverage query type information as additional features, and one novel multi-task neural model that views query type as the label for the auxiliary query cluster prediction task. This multi-task model is trained to simultaneously rank documents and predict query types. Our experiments on tens of millions of real-world email search queries demonstrate that the proposed multi-task model can significantly outperform the baseline neural ranking models, which either do not incorporate query type information or just simply feed query type as an additional feature.
Receivable financing -- the process whereby cash is advanced to firms against receivables their customers have yet to pay -- is a well-established funding source for businesses. In this paper we present a novel, collaborative approach to receivable financing: unlike existing centralized approaches where the financing company acts as a server handling each request individually, our approach employs a network perspective where money flow is triggered among customers themselves. The main contribution of this work is to provide a principled formulation of the network-based receivable-settlement strategy, and show how all algorithmic challenges posed by the design of this strategy are achieved in practice. Extensive experiments on real receivable data attest the effectiveness of the proposed methods.
In coal-fired power plants, it is critical to improve the operational efficiency of boilers for sustainability. In this work, we formulate real-time boiler control as an optimization problem that looks for the best distribution of temperature in different zones and oxygen content from the flue to improve the boiler's stability and energy efficiency. We employ an efficient algorithm by integrating appropriate machine learning and optimization techniques. We obtain a large dataset collected from a real boiler for more than two months from our industry partner, and conduct extensive experiments to demonstrate the effectiveness and efficiency of the proposed algorithm.
The use of large-scale machine learning~(ML) is becoming ubiquitous in various domains ranging from business intelligence to self-driving cars. Many companies are building ML pipelines in a unified data processing environment, and leveraging well-tuned numerical optimization packages for obtaining model parameters. However, most existing optimization tools are specifically designed for a single machine setup, and cannot handle vast volume of data. In this work, we build a distributed computing framework towards optimizing generalized linear models with billions of variables. We at first design a new distributed vector to represent data points from extremely large feature space. Then, we introduce an efficient and scalable approach to compute the second order derivatives of loss function, and optimizes model parameters with limited memory requirement. Experiments on real-world datasets demonstrate that our proposed techniques can scale up for ML models with billions of variables, and achieves better performance than state-of-the-art systems on a wide range of applications, e.g., ad CTR prediction and rideshare price bidding.
Many recommendation systems produce result sets with large numbers of highly similar items. Diversifying these results is often accomplished with heuristics, which are impoverished models of users' desire for diversity. However, integrating more complex statistical models of diversity into large-scale, mature systems is challenging. Without a good match between the model's definition of diversity and users' perception of diversity, the model can easily degrade users' perception of the recommendations. In this work we present a statistical model of diversity based on determinantal point processes (DPPs). We train this model from examples of user preferences with a simple procedure that can be integrated into large and complex production systems relatively easily. We use an approximate inference algorithm to serve the model at scale, and empirical results on live YouTube homepage traffic show that this model, coupled with a re-ranking algorithm, yields substantial short- and long-term increases in user engagement.
The rapid growth of mobile devices has resulted in the generation of a large number of user behavior logs that contain latent intentions and user interests. However, exploiting such data in real-world applications is still difficult for service providers due to the complexities of user behavior over a sheer number of possible actions that can vary according to time. In this work, a time-aware RNN model, TRNN, is proposed for predictive analysis from user behavior data. First, our approach predicts next user action more accurately than the baselines including the n-gram models as well as two recently introduced time-aware RNN approaches. Second, we use TRNN to learn user embeddings from sequences of user actions and show that overall the TRNN embeddings outperform conventional RNN embeddings. Similar to how word embeddings benefit a wide range of task in natural language processing, the learned user embeddings are general and could be used in a variety of tasks in the digital marketing area. This claim is supported empirically by evaluating their utility in user conversion prediction, and preferred application prediction. According to the evaluation results, TRNN embeddings perform better than the baselines including Bag of Words (BoW), TFIDF and Doc2Vec. We believe that TRNN embeddings provide an effective representation for solving practical tasks such as recommendation, user segmentation and predictive analysis of business metrics.
Preserving privacy of users is a key requirement of web-scale analytics and reporting applications, and has witnessed a renewed focus in light of recent data breaches and new regulations such as GDPR. We focus on the problem of computing robust, reliable analytics in a privacy-preserving manner, while satisfying product requirements. We present PriPeARL, a framework for privacy-preserving analytics and reporting, inspired by differential privacy. We describe the overall design and architecture, and the key modeling components, focusing on the unique challenges associated with privacy, coverage, utility, and consistency. We perform an experimental study in the context of ads analytics and reporting at LinkedIn, thereby demonstrating the tradeoffs between privacy and utility needs, and the applicability of privacy-preserving mechanisms to real-world data. We also highlight the lessons learned from the production deployment of our system at LinkedIn.
Real-time advertising allows advertisers to bid for each impression for a visiting user. To optimize specific goals such as maximizing revenue and return on investment (ROI) led by ad placements, advertisers not only need to estimate the relevance between the ads and user's interests, but most importantly require a strategic response with respect to other advertisers bidding in the market. In this paper, we formulate bidding optimization with multi-agent reinforcement learning. To deal with a large number of advertisers, we propose a clustering method and assign each cluster with a strategic bidding agent. A practical Distributed Coordinated Multi-Agent Bidding (DCMAB) has been proposed and implemented to balance the tradeoff between the competition and cooperation among advertisers. The empirical study on our industry-scaled real-world data has demonstrated the effectiveness of our methods. Our results show cluster-based bidding would largely outperform single-agent and bandit approaches, and the coordinated bidding achieves better overall objectives than purely self-interested bidding agents.
We introduce a new problem, namely, check-in time prediction where the goal is to predict the time when a given user will check-in to a location of interest. We design a novel Recurrent Spatio-Temporal Point Process (RSTPP) model for check-in time prediction. RSTPP addresses two key challenges: 1) Data scarcity due to uneven distribution of check-ins among users/locations. 2) User trajectories contain valuable information that is ignored by standard temporal point process which only considers historical event times. RSTPP is designed to learn the latent dependencies of event times over both historical events and spatio-temporal information about locations a user visited before check-in to the location of interest. We evaluate RSTPP on several real-world datasets, and it significantly outperforms state-of-the-art event time predicting techniques. Our work derives a set of practical implications that can benefit a wide spectrum of applications.
Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a first-principles formulation of entity resolution, this paper presents a novel entity resolution algorithm that introduces a data-driven blocking and record linkage technique based on the probabilistic identification of entity signatures in data. The scalability and accuracy of the proposed algorithm are evaluated using benchmark datasets and shown to achieve state-of-the-art results. The proposed algorithm can be implemented simply on modern parallel databases, which we have done in the financial intelligence domain with tens of Terabytes of noisy data.
During the past few years advancements in sports information systems and technology has allowed the collection of a number of detailed spatio-temporal data that capture various aspects of basketball. For example, shot charts, that is, maps capturing locations of (made or missed) shots, and spatio-temporal trajectories for the players on the court can capture information about the offensive and defensive tendencies, as well as, schemes used by a team. Characterization of these processes is important for player and team comparisons, scouting, game preparation etc. Team and player tendencies have traditionally been compared in a heuristic manner, which inevitably can lead to subtle but crucial information being ignored. Recently automated ways for these comparisons have appeared in the sports analytics literature. However, these approaches are almost exclusively focused on the spatial distribution of the underlying actions (usually shots taken), ignoring a multitude of other parameters that can affect the action studied. In this study, we propose a framework based on tensor decomposition for obtaining a set of prototype spatio-temporal patterns based on the core spatio-temporal information and contextual meta-data. At the epicenter of our work is a 3D tensor $\tensor$, whose dimensions represent the entity under consideration (team, player, possession etc.), the location on the court and time. We make use of the PARAFAC decomposition and we decompose the tensor into several interpretable patterns, that can be thought of as prototype patterns of the process examined (e.g., shot selection, offensive schemes etc.). We also introduce an approach for choosing the number of components to be considered. Using the tensor components, we can then express every entity as a weighted combination of these components. Finally, the framework introduced in this paper has applications that go beyond purely pattern analysis. In particular, it can facilitate a variety of tasks in the work-flow of a franchise's basketball operations as well as in the sports analytics research community.
Search relevance is a very critical component in e-commerce applications. One of the strongest signals that determine the relevance of an item listing to an e-commerce query is the title of the item. Traditional methods for capturing this signal compare words in listing titles and the user query using tf-idf scores, or use a machine learned model with words as features and target clicks or relevance labels. Contrary to these approaches, we build a parameterized model to determine the weights of popular title terms for a query and then use these title term weights to compute the relevance of a listing title to the query. For this, we use human judged binary relevance labels of query and item title pairs as labeled data and train a model leveraging a variety of features to learn these query specific title term weights. We propose two novel approaches to model these title term weights using the relevance target and explore several novel features specific to e-commerce for this term weighting model. We use the resulting title relevance score as a feature in eBay's machine learned ranker for e-commerce search serving millions of queries each day. We observe a significant improvement over a baseline click-based binary independence model for capturing item title relevance in several metrics including model accuracy and overall relevance and engagement observed through A/B testing. We also experimentally illustrate that this feature optimized for relevance works well in conjunction with textual features optimized for demand.
Two-sided marketplaces are platforms that have customers not only on the demand side (e.g. users), but also on the supply side (e.g. retailer, artists). While traditional recommender systems focused specifically towards increasing consumer satisfaction by providing relevant content to consumers, two-sided marketplaces face the problem of additionally optimizing for supplier preferences, and visibility. Indeed, the suppliers would want afair opportunity to be presented to users. Blindly optimizing for consumer relevance may have a detrimental impact on supplier fairness. Motivated by this problem, we focus on the trade-off between objectives of consumers and suppliers in the case of music streaming services, and consider the trade-off betweenrelevance of recommendations to the consumer (i.e. user) andfairness of representation of suppliers (i.e. artists) and measure their impact on consumersatisfaction.
We propose a conceptual and computational framework using counterfactual estimation techniques to understand, and evaluate different recommendation policies, specifically around the trade-off between relevance and fairness, without the need for running many costly A/B tests. We propose a number of recommendation policies which jointly optimize relevance and fairness, thereby achieving substantial improvement in supplier fairness without noticeable decline in user satisfaction. Additionally, we consider user disposition towards fair content, and propose a personalized recommendation policy which takes into account consumer's tolerance towards fair content. Our findings could guide the design of algorithms powering two-sided marketplaces, as well as guide future research on sophisticated algorithms for joint optimization of user relevance, satisfaction and fairness.
Talent search and recommendation systems at LinkedIn strive to match the potential candidates to the hiring needs of a recruiter or a hiring manager expressed in terms of a search query or a job posting. Recent work in this domain has mainly focused on linear models, which do not take complex relationships between features into account, as well as ensemble tree models, which introduce non-linearity but are still insufficient for exploring all the potential feature interactions, and strictly separate feature generation from modeling. In this paper, we present the results of our application of deep and representation learning models on LinkedIn Recruiter. Our key contributions include: (i) Learning semantic representations of sparse entities within the talent search domain, such as recruiter ids, candidate ids, and skill entity ids, for which we utilize neural network models that take advantage of LinkedIn Economic Graph, and (ii) Deep models for learning recruiter engagement and candidate response in talent search applications. We also explore learning to rank approaches applied to deep models, and show the benefits for the talent search use case. Finally, we present offline and online evaluation results for LinkedIn talent search and recommendation systems, and discuss potential challenges along the path to a fully deep model architecture. The challenges and approaches discussed generalize to any multi-faceted search engine.
Software support tickets contain short and noisy text from the customers. Software products are often represented by various surface forms and informal abbreviations. Automatically identifying software mentions from support tickets and determining the official names and versions are helpful for many downstream applications, \eg routing the support tickets to the right expert groups for support. In this work, we study the problem ofsoftware product name extraction andlinking from support tickets. We first annotate and analyze sampled tickets to understand the language patterns. Next, we design features using local, contextual, and external information sources, for extraction and linking models. In experiments, we show that linear models with the proposed features are able to deliver better and more consistent results, compared with the state-of-the-art baseline models, even on dataset with sparse labels.
Industrial telecommunication applications prefer to run at the optimal capacity configuration to achieve the required Quality of Service (QoS) at the minimum cost. The optimal capacity configuration is usually achieved through the selection of cell towers capacities and locations. Given a set of service providers (e.g., cell towers) and a set of customers (e.g., major residential areas), where each customer has an amount of demand and each provider has multiple candidate capacities and corresponding costs, the optimal capacity selection is configured through spatial matching to satisfy the demand of each customer at the minimum cost. However, existing solutions developed for spatial matching, in which each provider's capacity is fixed, cannot be directly applied to the capacity configuration problem with multiple capacities and location selections. In this paper, we are the first to study Service Provider Configuration and Planning with Optimal Spatial Matching (SPC-POSM) problem, in which the objectives are 1) to select the proper capacity for each provider at the minimum total cost and 2) to assign providers' service to satisfy the demand of each customer on a condition that the matching distance is no more than service quality requirement. We prove that SPC-POSM problem is NP-hard and design an efficient two-layer meta-heuristic framework to solve the problem. Unsupervised learning technique is designed to accelerate the calculation and a novel local search mechanism is embedded to further improve solution quality. Extensive experimental results on both real and synthetic datasets verify the effectiveness and efficiency of the proposed framework.
We consider the problem of predicting the success of startup companies at their early development stages. We formulate the task as predicting whether a company that has already secured initial (seed or angel) funding will attract a further round of investment in a given period of time. Previous work on this task has mostly been restricted to mining structured data sources, such as databases of the startup ecosystem consisting of investors, incubators and startups. Instead, we investigate the potential of using web-based open sources for the startup success prediction task and model the task using a very rich set of signals from such sources. In particular, we enrich structured data about the startup ecosystem with information from a business- and employment-oriented social networking service and from the web in general. Using these signals, we train a robust machine learning pipeline encompassing multiple base models using gradient boosting. We show that utilizing companies' mentions on the Web yields a substantial performance boost in comparison to only using structured data about the startup ecosystem. We also provide a thorough analysis of the obtained model that allows one to obtain insights into both the types of useful signals discoverable on the Web and market mechanisms underlying the funding process.
Some particularly important rich sources of open and free big geospatial data are the Earth observation (EO) programs of various countries such as the Landsat program of the US and the Copernicus programme of the European Union. EO data is a paradigmatic case of big data and the same is true for the big information and big knowledge extracted from it. EO data (satellite images and in-situ data), and the information and knowledge extracted from it, can be utilized in many applications with financial and environmental impact in areas such as emergency management, climate change, agriculture and security.
Graphs have been widely used as modeling tools in Natural Language Processing (NLP), Text Mining (TM) and Information Retrieval (IR). Traditionally, the unigram bag-of-words representation is applied; that way, a document is represented as a multiset of its terms, disregarding dependencies between the terms. Although several variants and extensions of this modeling approach have been proposed, the main weakness comes from the underlying term independence assumption; the order of the terms within a document is completely disregarded and any relationship between terms is not taken into account in the final task. To deal with this problem, the research community has explored various representations, and to this direction, graphs constitute a well-developed model for text representation. The goal of this tutorial is to offer a comprehensive presentation of recent methods that rely on graph-based text representations to deal with various tasks in Text Mining, NLP and IR.
Many applications from various disciplines are now required to analyze fast evolving big data in real time. Various approaches for incremental processing of queries have been proposed over the years. Traditional approaches rely on updating the results of a query when updates are streamed rather than re-computing these queries, and therefore, higher execution performance is expected. However, they do not perform well for large databases that are updated at high frequencies. Therefore, new algorithms and approaches have been proposed in the literature to address these challenges by, for instance, reducing the complexity of processing updates. Moreover, many of these algorithms are now leveraging distributed streaming platforms such as Spark Streaming and Flink. In this tutorial, we briefly discuss legacy approaches for incremental query processing, and then give an overview of the new challenges introduced due to processing big data streams. We then discuss in detail the recently proposed algorithms that address some of these challenges. We emphasize the characteristics and algorithmic analysis of various proposed approaches and conclude by discussing future research directions.
The process of extracting, structuring, and organizing knowledge requires processing large and originally heterogeneous data sources. Offering existing data as Linked Data increases its shareability, extensibility, and reusability. However, using Linking Data as a means to represent knowledge can be easier said than done. In this tutorial, we elaborate on how to semantically annotate data, and generate and publish Linked Data. We introduce [R2]RML languages to generate Linked Data. We also show how to easily publish Linked Data on the Web as Triple Pattern Fragments. As a result, participants, independently of their knowledge background, can model, annotate and publish Linked Data on their own.
One of the most challenging issues in the era of Big Data is the Variety of the data. In general, there are two solutions to directly manage multi-model data currently: a single integrated multi-model database system or a tightly-integrated middleware over multiple single-model data stores. In this tutorial, we review and compare these two approaches giving insights on their advantages, trade-offs, and research opportunities. In particular, we dive into four key aspects of technology for both types of systems, namely (1) theoretical foundation of multi-model data management, (2) storage strategies for multi-model data, (3) query languages across models, and (4) query evaluation and its optimization. We provide a comparison of performance for the two approaches and discuss related open problems and remaining challenges. The slides of this tutorial can be found at http://udbms.cs.helsinki.fi/?tutorials/CIKM2018.
Recently, semantic technologies have been successfully deployed to overcome the typical difficulties in accessing and integrating data stored in different kinds of legacy sources. In particular, knowledge graphs are being used as a mechanism to provide a uniform representation of heterogeneous information. Such graphs represent data in the RDF format, which is complemented by an ontology and can be queried using the standard SPARQL language. The RDF graph is often obtained by materializing source data, following the traditional extract-transform-load workflow. Alternatively, the sources are declaratively mapped to the ontology, and the RDF graph is maintained virtual. In such an approach, usually called ontology-based data access/integration (OBDA/I), query answering is based on sophisticated query transformation techniques. In this tutorial: (i) we provide a general introduction to semantic technologies; (ii) we illustrate the principles underlying OBDA/I, providing insights into its theoretical foundations, and describing well-established algorithms, techniques, and tools; (iii) we discuss relevant use-cases for OBDA/I; (iv) we provide an overview on some recent advancements.
Implicit feedback (e.g., user clicks) is an important source of data for modern search engines. While heavily biased [8, 9, 11, 27], it is cheap to collect and particularly useful for user-centric retrieval applications such as search ranking. To develop an unbiased learning-to-rank system with biased feedback, previous studies have focused on constructing probabilistic graphical models (e.g., click models) with user behavior hypothesis to extract and train ranking systems with unbiased relevance signals. Recently, a novel counterfactual learning framework that estimates and adopts examination propensity for unbiased learning to rank has attracted much attention. Despite its popularity, there is no systematic comparison of the unbiased learning-to-rank frameworks based on counterfactual learning and graphical models. In this tutorial, we aim to provide an overview of the fundamental mechanism for unbiased learning to rank. We will describe the theory behind existing frameworks, and give detailed instructions on how to conduct unbiased learning to rank in practice.
User data is becoming increasingly available in various domains from the social Web to patient health records. User data is characterized by a combination of demographics (e.g., age, gender, occupation) and user actions (e.g., rating a movie, following a diet). User data analytics is usually based on identifying group-level behaviors such as "countryside teachers who watch Woody Allen movies." User Group Analytics (UGA) addresses peculiarities of user data such as noise and sparsity. This tutorial reviews research on UGA and discusses different approaches and open challenges for group discovery, exploration, and visualization.
This paper provides an overview of the workshops co-located with the 27th ACM International Conference on Information and Knowledge Management (CIKM 2018), held during October 22-26, 2018 in Turin, Italy.