prof. RNDr. Tomáš Skopal, Ph.D.

Publications

Modular framework for similarity-based dataset discovery using external knowledge

Authors
Nečaský, M.; Škoda, P.; Bernhauer, D.; Klímek, J.; Skopal, T.
Year
2022
Published in
Data Technologies and Applications. 2022, 56(4), 506-535. ISSN 2514-9288.
Type
Article
Abstract
Purpose: Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models, and so forth. Design/methodology/approach: In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery. Findings: The study proposes several proof-of-concept pipelines, including experimental evaluation, which showcase the usage of the framework. Originality/value: To the best of the authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.

Open dataset discovery using context-enhanced similarity search

Authors
Bernhauer, D.; Nečaský, M.; Škoda, P.; Klímek, J.; Skopal, T.
Year
2022
Published in
Knowledge and Information Systems. 2022, 64(12), 3265-3291. ISSN 0219-1377.
Type
Article
Abstract
Today, open data catalogs enable users to search for datasets with full-text queries in metadata records combined with simple faceted filtering. Using this combination, a user is able to discover a significant number of the datasets relevant to the user’s search intent. However, there still remain relevant datasets that are hard to find because of the enormous sparsity of their metadata (e.g., several keywords). As an alternative, in this paper, we propose an approach to dataset discovery based on similarity search over metadata descriptions enhanced by various semantic contexts. In general, the semantic contexts enrich the dataset metadata in a way that enables the identification of additional datasets relevant to a query that could not be retrieved using just the keyword or full-text search. In an experimental evaluation, we show that context-enhanced similarity retrieval methods increase the findability of relevant datasets, thus improving the retrieval recall that is critical in dataset discovery scenarios. As a part of the evaluation, we created a catalog-like user interface for dataset discovery and recorded streams of user actions that we used to create the ground truth. For the sake of reproducibility, we published the entire evaluation testbed.

Similarity vs. Relevance: From Simple Searches to Complex Discovery

Authors
Skopal, T.; Bernhauer, D.; Škoda, P.; Klímek, J.; Nečaský, M.
Year
2021
Published in
Similarity Search and Applications. Springer, Cham, 2021. p. 104-117. ISSN 0302-9743. ISBN 978-3-030-89656-0.
Type
Conference paper
Abstract
Similarity queries play a crucial role in content-based retrieval. The similarity function itself is regarded as the function of relevance between a query object and objects from a database; the most similar objects are understood as the most relevant. However, such an automatic adoption of similarity as relevance leads to limited applicability of similarity search in domains like entity discovery, where relevant objects are not supposed to be similar in the traditional meaning. In this paper, we propose the meta-model of data-transitive similarity operating on top of a particular similarity model and a database. This meta-model makes it possible to treat objects x, y that are not directly similar as similar if there exists a chain of objects x, i_1, ..., i_n, y whose neighboring members are similar enough. Hence, this approach places the similarity in the role of relevance, where objects do not need to be directly similar but still remain relevant to each other (transitively similar). The data-transitive similarity concept allows standard similarity-search methods (queries, joins, rankings, analytics) to be used in more complex tasks, like entity discovery, where relevant results are often complementary or orthogonal to the query, rather than directly similar. Moreover, we show that data-transitive similarity is inherently self-explainable and non-metric. We discuss the approach in the domain of open dataset discovery.
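
The chain condition described in this abstract can be illustrated with a minimal sketch. It is not the paper's formulation: it reduces data-transitive similarity to a yes/no reachability test, and the names `transitively_similar`, `sim`, and the `threshold` parameter are assumptions introduced only for this example.

```python
from collections import deque


def transitively_similar(x, y, database, similarity, threshold):
    """Return a chain [x, i_1, ..., i_n, y] whose neighbouring members have
    similarity >= threshold, or None if no such chain exists.

    Hypothetical helper: a breadth-first search over the database, so the
    returned chain has the fewest intermediate objects.
    """
    parents = {x: None}
    queue = deque([x])
    while queue:
        current = queue.popleft()
        if current == y:
            chain = []
            while current is not None:
                chain.append(current)
                current = parents[current]
            return chain[::-1]
        # Intermediate objects come from the database; y is the target.
        for candidate in list(database) + [y]:
            if candidate not in parents and similarity(current, candidate) >= threshold:
                parents[candidate] = current
                queue.append(candidate)
    return None


def sim(a, b):
    """Toy similarity on numbers (an assumption for the example)."""
    return 1.0 / (1.0 + abs(a - b))


database = [1.0, 2.0, 3.0]
direct = sim(0.0, 4.0)  # below the threshold: not directly similar
chain = transitively_similar(0.0, 4.0, database, sim, threshold=0.5)
print(direct, chain)
```

The returned chain doubles as an explanation of why the two objects are considered related, which is in the spirit of the self-explainability mentioned in the abstract.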

Analysing Indexability of Intrinsically High-dimensional Data using TriGen

Authors
Bernhauer, D.; Skopal, T.
Year
2020
Published in
Similarity Search and Applications. Springer, Cham, 2020. p. 261-269. ISSN 0302-9743. ISBN 978-3-030-60935-1.
Type
Conference paper
Abstract
The TriGen algorithm is a general approach to transform distance spaces in order to provide both exact and approximate similarity search in metric and non-metric spaces. This paper focuses on the reduction of intrinsic dimensionality using TriGen. Besides the well-known intrinsic dimensionality based on distance distribution, we inspect properties of triangles used in metric indexing (the triangularity) as well as properties of quadrilaterals used in ptolemaic indexing (the ptolemaicity). We also show how LAESA with triangle and ptolemaic filtering behaves on several datasets with respect to the proposed indicators.

Evaluation Framework for Search Methods Focused on Dataset Findability in Open Data Catalogs

Authors
Škoda, P.; Bernhauer, D.; Nečaský, M.; Klímek, J.; Skopal, T.
Year
2020
Published in
Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services. New York: Association for Computing Machinery, 2020. p. 200-209. ISBN 978-1-4503-8922-8.
Type
Conference paper
Abstract
Many institutions publish datasets as Open Data in catalogs; however, their retrieval remains a problematic issue due to the absence of dataset search benchmarking. We propose a framework for evaluating findability of datasets, regardless of the retrieval models used. As task-agnostic labeling of datasets by ground truth turns out to be infeasible in the general domain of open data datasets, the proposed framework is based on evaluation of entire retrieval scenarios that mimic complex retrieval tasks. In addition to the framework, we present a proof-of-concept specification and evaluation on several similarity-based retrieval models and several dataset discovery scenarios within a catalog, using our experimental evaluation tool. Instead of the traditional matching of a query with the metadata of all the datasets, in similarity-based retrieval the query is formulated using a set of datasets (query by example) and the datasets most similar to the query set are retrieved from the catalog as a result.
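
The query-by-example idea in the last sentence can be sketched as follows. This is only an illustration under stated assumptions: datasets are represented by keyword sets, similarity is Jaccard, and a candidate is scored by its best match to any query dataset. None of these choices, nor the names `jaccard`, `query_by_example`, and the toy `catalog`, come from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections (assumed similarity model)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def query_by_example(query_ids, catalog, k=2):
    """Rank datasets outside the query set by their best similarity to any
    dataset in the query set and return the top k as (id, score) pairs."""
    if not query_ids:
        raise ValueError("the query set must contain at least one dataset")
    scores = {}
    for dataset_id, keywords in catalog.items():
        if dataset_id in query_ids:
            continue  # the examples themselves are not returned
        scores[dataset_id] = max(jaccard(keywords, catalog[q]) for q in query_ids)
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical catalog: dataset id -> metadata keywords.
catalog = {
    "bus-stops": {"transport", "bus", "stops"},
    "tram-stops": {"transport", "tram", "stops"},
    "bus-timetables": {"transport", "bus", "timetable"},
    "air-quality": {"environment", "air", "sensors"},
}
results = query_by_example({"bus-stops"}, catalog, k=2)
print(results)
```

Taking the maximum over the query set favours datasets close to at least one example; averaging instead would favour datasets close to the whole set. Which aggregation suits a given discovery scenario is a design choice the sketch leaves open.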

Advanced Behavioral Analyses Using Inferred Social Networks: A Vision

Authors
Holubová, I.; Svoboda, M.; Skopal, T.; Bernhauer, D.; Peška, L.
Year
2019
Published in
Database and Expert Systems Applications. Springer, Cham, 2019. p. 210-219. ISSN 1865-0929. ISBN 978-3-030-27683-6.
Type
Conference paper
Abstract
The success of many businesses is based on a thorough knowledge of their clients. There exist a number of supervised as well as unsupervised data mining or other approaches that allow analyzing data about clients, their behavior or environment. In our ongoing project focusing primarily on bank clients, we propose an innovative strategy that will overcome shortcomings of the existing methods. From a given set of user activities, we infer their social network in order to analyze user relationships and behavior. For this purpose, not just the traditional direct facts are incorporated, but also relationships inferred using similarity measures and statistical approaches, with both possibly limited measures of reliability and validity in time. Such networks would enable analyses of client characteristics from a new perspective and could provide otherwise impossible insights. However, there are several research and technical challenges making the outlined pursuit novel, complex and challenging, as we outline in this vision paper.

Approximate search in dissimilarity spaces using GA

Authors
Bernhauer, D.; Skopal, T.
Year
2019
Published in
GECCO 2019 Companion - Proceedings of the 2019 Genetic and Evolutionary Computation Conference Companion. New York: Association for Computing Machinery, 2019. p. 279-280. ISBN 978-1-4503-6748-6.
Type
Conference paper
Abstract
Nowadays, the metric space properties limit the methods of indexing for content-based similarity search. The target of this paper is a data-driven transformation of a semimetric model to a metric one while keeping the data indexability high. We have proposed a genetic algorithm for evolutionary design of semimetric-to-metric modifiers. The precision of our algorithm is near the specified error threshold and indexability is still good. The paper's contribution is a proof of concept showing that genetic algorithms can effectively design semimetric modifiers applicable in similarity search engines.

Inferred Social Networks: A Case Study

Authors
Holubová, I.; Svoboda, M.; Bernhauer, D.; Skopal, T.; Paščenko, P.
Year
2019
Published in
19th IEEE International Conference on Data Mining Workshops. Los Alamitos: IEEE Computer Society, 2019. p. 65-68. ISBN 978-1-7281-4603-4.
Type
Conference paper
Abstract
The behavior, environment, and characteristics of clients form a crucial source of information for various businesses. There exists a number of supervised as well as unsupervised data mining or other approaches that allow analyzing the respective data. In our ongoing project, focusing primarily on the financial sector, we suggest an innovative strategy that will overcome persisting shortcomings of the state-of-the-art methods using an analysis of a social network of clients. In addition, we do not assume the existence of such a network, but from a given set of client financial activities, we are able to infer a social network representing their relationships and behavior. Using real-world data and selected use cases from our domain, we show (a part of) the process of construction of an inferred social network, i.e., what kind of "hidden" information can, for example, be found and exploited.

Non-metric Similarity Search Using Genetic TriGen

Authors
Bernhauer, D.; Skopal, T.
Year
2019
Published in
Similarity Search and Applications. Springer, Cham, 2019. p. 86-93. ISBN 978-3-030-32046-1.
Type
Conference paper
Abstract
The metric space model is a popular and extensible model for indexing data for fast similarity search. However, there is often a need for broader concepts of similarity (beyond the metric space model), which cannot directly benefit from metric indexing. This paper focuses on approximate search in semi-metric spaces using a genetic variant of the TriGen algorithm. The original TriGen algorithm generates metric modifications of semi-metric distance functions, thus allowing metric indexes to index non-metric models. However, the “analytic” modifications provided by TriGen are not stable in predicting the retrieval error. In our approach, the genetic variant of TriGen, called TriGenGA, uses genetically learned semi-metric modifiers (piecewise linear functions) that lead to better estimates of the retrieval error. Additionally, the TriGenGA modifiers result in better overall performance than the original TriGen modifiers.
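
The effect of a metric modification of a semi-metric distance can be seen on a small example. The sketch below is not TriGen or TriGenGA: it neither searches for nor learns a modifier. It only measures how many triplets violate the triangle inequality before and after applying a concave modifier, here a fixed square root chosen as an assumption for illustration. The names `squared_distance` and `triangle_violation_ratio` are hypothetical.

```python
from itertools import combinations


def squared_distance(a, b):
    """Squared difference: symmetric and non-negative, but it breaks the
    triangle inequality, so it is a semi-metric rather than a metric."""
    return (a - b) ** 2


def triangle_violation_ratio(points, distance, modifier=lambda d: d):
    """Fraction of point triplets whose (modified) distances violate the
    triangle inequality."""
    triplets = list(combinations(points, 3))
    violations = 0
    for a, b, c in triplets:
        sides = sorted(
            modifier(distance(p, q)) for p, q in ((a, b), (b, c), (a, c))
        )
        # Small tolerance guards against floating-point noise.
        if sides[2] > sides[0] + sides[1] + 1e-12:
            violations += 1
    return violations / len(triplets)


points = [0.0, 1.0, 2.0, 4.0]
raw_ratio = triangle_violation_ratio(points, squared_distance)
modified_ratio = triangle_violation_ratio(
    points, squared_distance, modifier=lambda d: d ** 0.5
)
print(raw_ratio, modified_ratio)
```

On this toy data every triplet violates the inequality under the raw squared distance and none does after the square-root modifier, because the modified distance coincides with the absolute difference. A stronger concave modifier removes more violations but also flattens the distance distribution, which is the indexability trade-off such modifiers have to balance.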

Recommender System as the Support for Binaural Audio

Authors
Bernhauer, D.; Skopal, T.
Year
2019
Published in
Augmented Reality and Virtual Reality. Cham: Springer International Publishing AG, 2019. p. 233-246. ISSN 2196-8705. ISBN 978-3-030-06245-3.
Type
Book chapter
Abstract
Virtual reality devices nowadays can effectively utilise other senses besides vision, too. The most often used secondary sense is hearing, with binaural audio as the VR engine. Currently, practical usage of binaural audio as the source of VR is impossible because of the inaccuracy of a general model. On the other hand, measuring the personalised parameters can be time-consuming. Our task was to prove the possibility of reconstructing the binaural audio parameters in domestic conditions. We have focused on the design of a user interface that can be used independently of the platform. Our proposed browser-based application uses collaborative filtering as a recommender system. We have proven that sound-based navigation in the axial plane is possible with 6.6° inaccuracy. The gamification and browser-based implementation make it easier for all people to find the best possible parameters. The resulting profile can be used both with a fully VR environment and with semi-VR games.

SIMILANT: An Analytic Tool for Similarity Modeling

Authors
Bernhauer, D.; Skopal, T.; Holubová, I.; Peška, L.; Svoboda, M.
Year
2019
Published in
Proceedings of the 28th ACM International Conference on Information and Knowledge Management. New York: Association for Computing Machinery, 2019. p. 2889-2892. ISBN 978-1-4503-6976-3.
Type
Conference paper
Abstract
We present SIMILANT, a data analytics tool for modeling similarity in content-based retrieval scenarios. In similarity search, data elements are modeled using black-box descriptors, where a pair-wise similarity function is the only way to relate data elements to each other. Only these relations provide information about the dataset structure. Data analysts need to identify meaningful combinations of descriptors and similarity functions effectively. Therefore, we proposed a tool enabling a data analyst to systematically browse, tune, and analyze similarity models for a specific domain.