prof. RNDr. Tomáš Skopal, Ph.D.

Modular framework for similarity-based dataset discovery using external knowledge

Authors

Nečaský, M.; Škoda, P.; Bernhauer, D.; Klímek, J.; Skopal, T.

Year

2022

Published

Data Technologies and Applications. 2022, 56(4), 506-535. ISSN 2514-9288.

Type

Article

DOI

10.1108/DTA-09-2021-0261

Departments

Department of Software Engineering

Annotation

Purpose: Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth. Design/methodology/approach: In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery. Findings: The study proposes several proof-of-concept pipelines including experimental evaluation, which showcase the usage of the framework. Originality/value: To the best of the authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.

Open dataset discovery using context-enhanced similarity search

Authors

Bernhauer, D.; Nečaský, M.; Škoda, P.; Klímek, J.; Skopal, T.

Year

2022

Published

Knowledge and Information Systems. 2022, 64(12), 3265-3291. ISSN 0219-1377.

Type

Article

DOI

10.1007/s10115-022-01751-z

Departments

Department of Software Engineering

Annotation

Today, open data catalogs enable users to search for datasets with full-text queries in metadata records combined with simple faceted filtering. Using this combination, a user is able to discover a significant number of the datasets relevant to the user’s search intent. However, there still remain relevant datasets that are hard to find because of the enormous sparsity of their metadata (e.g., several keywords). As an alternative, in this paper, we propose an approach to dataset discovery based on similarity search over metadata descriptions enhanced by various semantic contexts. In general, the semantic contexts enrich the dataset metadata in a way that enables the identification of additional datasets relevant to a query that could not be retrieved using just the keyword or full-text search. In an experimental evaluation, we show that context-enhanced similarity retrieval methods increase the findability of relevant datasets, thus improving the retrieval recall that is critical in dataset discovery scenarios. As a part of the evaluation, we created a catalog-like user interface for dataset discovery and recorded streams of user actions that we used to create the ground truth. For the sake of reproducibility, we published the entire evaluation testbed.

Similarity vs. Relevance: From Simple Searches to Complex Discovery

Authors

Skopal, T.; Bernhauer, D.; Škoda, P.; Klímek, J.; Nečaský, M.

Year

2021

Published

Similarity Search and Applications. Springer, Cham, 2021. p. 104-117. ISSN 0302-9743. ISBN 978-3-030-89656-0.

Type

Proceedings paper

DOI

10.1007/978-3-030-89657-7_9

Departments

Department of Software Engineering

Annotation

Similarity queries play a crucial role in content-based retrieval. The similarity function itself is regarded as the function of relevance between a query object and objects from a database; the most similar objects are understood as the most relevant. However, such an automatic adoption of similarity as relevance leads to limited applicability of similarity search in domains like entity discovery, where relevant objects are not supposed to be similar in the traditional meaning. In this paper, we propose the meta-model of data-transitive similarity operating on top of a particular similarity model and a database. This meta-model makes it possible to treat objects x, y that are not directly similar as similar if there exists a chain of objects x, i_1, ..., i_n, y whose neighboring members are similar enough. Hence, this approach places the similarity in the role of relevance, where objects do not need to be directly similar but still remain relevant to each other (transitively similar). The data-transitive similarity concept allows the use of standard similarity-search methods (queries, joins, rankings, analytics) in more complex tasks, like entity discovery, where relevant results are often complementary or orthogonal to the query, rather than directly similar. Moreover, we show that the data-transitive similarity is inherently self-explainable and non-metric. We discuss the approach in the domain of open dataset discovery.
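The chain condition in this annotation can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `transitive_chain`, the fixed `threshold` for "similar enough", the `max_hops` limit and the breadth-first search strategy are all assumptions made for illustration only.

```python
from collections import deque
from typing import Callable, Hashable, List, Optional, Sequence


def transitive_chain(
    x: Hashable,
    y: Hashable,
    database: Sequence[Hashable],
    sim: Callable[[Hashable, Hashable], float],
    threshold: float,      # hypothetical cut-off for "similar enough"
    max_hops: int = 3,     # hypothetical limit on the number of chain links
) -> Optional[List[Hashable]]:
    """Return a chain [x, i_1, ..., i_n, y] in which every pair of
    neighbouring members has similarity >= threshold, or None."""
    if sim(x, y) >= threshold:
        return [x, y]  # directly similar, no intermediate objects needed
    queue = deque([[x]])
    visited = {x}
    while queue:
        chain = queue.popleft()
        if len(chain) - 1 >= max_hops:
            continue  # extending this chain would exceed the hop limit
        last = chain[-1]
        for obj in [*database, y]:
            if obj in visited:
                continue
            if sim(last, obj) >= threshold:
                if obj == y:
                    return chain + [y]
                visited.add(obj)
                queue.append(chain + [obj])
    return None


# Toy example: integers, similarity decreases with absolute difference.
toy_sim = lambda a, b: 10 - abs(a - b)
chain = transitive_chain(0, 5, [1, 2, 3, 4], toy_sim, threshold=9, max_hops=5)
print(chain)
```

In this toy setting, 0 and 5 are not directly similar at the chosen threshold, yet a chain through the database objects connects them. The returned chain also lists the intermediate objects that link the two endpoints, which is one way to read the self-explainability mentioned in the annotation.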

Analysing Indexability of Intrinsically High-dimensional Data using TriGen

Authors

Bernhauer, D.; Skopal, T.

Year

2020

Published

Similarity Search and Applications. Springer, Cham, 2020. p. 261-269. ISSN 0302-9743. ISBN 978-3-030-60935-1.

Type

Proceedings paper

DOI

10.1007/978-3-030-60936-8_20

Departments

Department of Software Engineering

Annotation

The TriGen algorithm is a general approach to transform distance spaces in order to provide both exact and approximate similarity search in metric and non-metric spaces. This paper focuses on the reduction of intrinsic dimensionality using TriGen. Besides the well-known intrinsic dimensionality based on distance distribution, we inspect properties of triangles used in metric indexing (the triangularity) as well as properties of quadrilaterals used in ptolemaic indexing (the ptolemaicity). We also show how LAESA with triangle and ptolemaic filtering behaves on several datasets with respect to the proposed indicators.

Evaluation Framework for Search Methods Focused on Dataset Findability in Open Data Catalogs

Authors

Škoda, P.; Bernhauer, D.; Nečaský, M.; Klímek, J.; Skopal, T.

Year

2020

Published

Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services. New York: Association for Computing Machinery, 2020. p. 200-209. ISBN 978-1-4503-8922-8.

Type

Proceedings paper

DOI

10.1145/3428757.3429973

Departments

Department of Software Engineering

Annotation

Many institutions publish datasets as Open Data in catalogs; however, their retrieval remains problematic due to the absence of dataset search benchmarking. We propose a framework for evaluating findability of datasets, regardless of retrieval models used. As task-agnostic labeling of datasets by ground truth turns out to be infeasible in the general domain of open data datasets, the proposed framework is based on evaluation of entire retrieval scenarios that mimic complex retrieval tasks. In addition to the framework, we present a proof-of-concept specification and evaluation on several similarity-based retrieval models and several dataset discovery scenarios within a catalog, using our experimental evaluation tool. Instead of traditional matching of a query with metadata of all the datasets, in similarity-based retrieval the query is formulated using a set of datasets (query by example) and the datasets most similar to the query set are retrieved from the catalog as a result.
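The query-by-example formulation described at the end of this annotation can be sketched as follows. This is only an illustration under stated assumptions: datasets are represented by hypothetical keyword sets, similarity is Jaccard similarity, and the similarity to the query set is taken as the maximum over its members. None of these choices comes from the paper.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two keyword sets (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def query_by_example(
    catalog: Dict[str, Set[str]],  # hypothetical: dataset id -> keywords
    query_ids: List[str],          # non-empty list of example dataset ids
    k: int = 3,
) -> List[str]:
    """Rank datasets outside the query set by their best similarity
    to any example dataset and return the top-k ids."""
    query_sets = [catalog[q] for q in query_ids]
    scored = []
    for ds_id, keywords in catalog.items():
        if ds_id in query_ids:
            continue  # the examples themselves are not returned
        score = max(jaccard(keywords, q) for q in query_sets)
        scored.append((score, ds_id))
    # Highest score first; ties broken by dataset id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [ds_id for _, ds_id in scored[:k]]


toy_catalog = {
    "air": {"air", "quality", "prague"},
    "noise": {"noise", "prague"},
    "budget": {"budget", "city"},
    "smog": {"air", "quality", "smog"},
}
result = query_by_example(toy_catalog, ["air"], k=2)
print(result)
```

Other aggregations over the query set (for example, the average similarity) would be equally plausible; the maximum is used here only because it is the simplest to reason about.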

Advanced Behavioral Analyses Using Inferred Social Networks: A Vision

Authors

Holubová, I.; Svoboda, M.; Skopal, T.; Bernhauer, D.; Peška, L.

Year

2019

Published

Database and Expert Systems Applications. Springer, Cham, 2019. p. 210-219. ISSN 1865-0929. ISBN 978-3-030-27683-6.

Type

Proceedings paper

DOI

10.1007/978-3-030-27684-3_26

Departments

Department of Software Engineering

Annotation

The success of many businesses is based on a thorough knowledge of their clients. There exist a number of supervised as well as unsupervised data mining or other approaches that allow analyzing data about clients, their behavior or environment. In our ongoing project focusing primarily on bank clients, we propose an innovative strategy that will overcome shortcomings of the existing methods. From a given set of user activities, we infer their social network in order to analyze user relationships and behavior. For this purpose, not just the traditional direct facts are incorporated, but also relationships inferred using similarity measures and statistical approaches, with both possibly limited measures of reliability and validity in time. Such networks would enable analyses of client characteristics from a new perspective and could provide otherwise impossible insights. However, there are several research and technical challenges making the outlined pursuit novel, complex and challenging, as we outline in this vision paper.

Approximate search in dissimilarity spaces using GA

Authors

Bernhauer, D.; Skopal, T.

Year

2019

Published

GECCO 2019 Companion - Proceedings of the 2019 Genetic and Evolutionary Computation Conference Companion. New York: Association for Computing Machinery, 2019. p. 279-280. ISBN 978-1-4503-6748-6.

Type

Proceedings paper

DOI

10.1145/3319619.3321907

Departments

Department of Software Engineering

Annotation

Nowadays, the metric space properties limit the methods of indexing for content-based similarity search. The target of this paper is a data-driven transformation of a semimetric model to a metric one while keeping the data indexability high. We have proposed a genetic algorithm for evolutionary design of semimetric-to-metric modifiers. The precision of our algorithm is near the specified error threshold and indexability is still good. The paper's contribution is a proof of concept showing that genetic algorithms can effectively design semimetric modifiers applicable in similarity search engines.

Inferred Social Networks: A Case Study

Authors

Holubová, I.; Svoboda, M.; Bernhauer, D.; Skopal, T.; Paščenko, P.

Year

2019

Published

19th IEEE International Conference on Data Mining Workshops. Los Alamitos: IEEE Computer Society, 2019. p. 65-68. ISBN 978-1-7281-4603-4.

Type

Proceedings paper

DOI

10.1109/ICDMW.2019.00019

Departments

Department of Software Engineering

Annotation

The behavior, environment, and characteristics of clients form a crucial source of information for various businesses. There exists a number of supervised as well as unsupervised data mining or other approaches that allow analyzing the respective data. In our ongoing project, focusing primarily on the financial sector, we suggest an innovative strategy that will overcome persisting shortcomings of the state-of-the-art methods using an analysis of a social network of clients. In addition, we do not assume the existence of such a network, but from a given set of client financial activities, we are able to infer a social network representing their relationships and behavior. Using real-world data and selected use cases from our domain, we show (a part of) the process of construction of an inferred social network, i.e., what kind of "hidden" information can, for example, be found and exploited.

Non-metric Similarity Search Using Genetic TriGen

Authors

Bernhauer, D.; Skopal, T.

Year

2019

Published

Similarity Search and Applications. Springer, Cham, 2019. p. 86-93. ISBN 978-3-030-32046-1.

Type

Proceedings paper

DOI

10.1007/978-3-030-32047-8_8

Departments

Department of Software Engineering

Annotation

The metric space model is a popular and extensible model for indexing data for fast similarity search. However, there is often a need for broader concepts of similarities (beyond the metric space model), while these cannot directly benefit from metric indexing. This paper focuses on approximate search in semi-metric spaces using a genetic variant of the TriGen algorithm. The original TriGen algorithm generates metric modifications of semi-metric distance functions, thus allowing metric indexes to index non-metric models. However, “analytic” modifications provided by TriGen are not stable in predicting the retrieval error. In our approach, the genetic variant of TriGen – the TriGenGA – uses genetically learned semi-metric modifiers (piecewise linear functions) that lead to better estimates of the retrieval error. Additionally, the TriGenGA modifiers result in better overall performance than the original TriGen modifiers.
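The idea of a modifier that turns a semi-metric distance into a metric one can be illustrated with a small sketch. It does not implement TriGen or TriGenGA: the helper names, the toy squared-difference distance and the power modifier are assumptions chosen only to show how a concave transformation of distances can remove triangle inequality violations on sampled triplets.

```python
from itertools import combinations
from typing import Callable, Sequence


def triangle_violation_rate(
    points: Sequence[float],
    dist: Callable[[float, float], float],
    modifier: Callable[[float], float] = lambda d: d,
    eps: float = 1e-12,  # tolerance against floating-point noise
) -> float:
    """Fraction of triplets whose modified distances break the
    triangle inequality (longest side > sum of the other two)."""
    triplets = list(combinations(points, 3))
    violations = 0
    for a, b, c in triplets:
        sides = sorted(
            modifier(dist(p, q)) for p, q in ((a, b), (b, c), (a, c))
        )
        if sides[0] + sides[1] < sides[2] - eps:
            violations += 1
    return violations / len(triplets)


def squared_diff(p: float, q: float) -> float:
    """Toy semi-metric: squared difference violates the triangle inequality."""
    return (p - q) ** 2


def power_modifier(exponent: float) -> Callable[[float], float]:
    """Hypothetical concave modifier d -> d ** exponent, with 0 < exponent <= 1."""
    return lambda d: d ** exponent


sample = [0.0, 1.0, 2.0, 3.0]
raw_rate = triangle_violation_rate(sample, squared_diff)
fixed_rate = triangle_violation_rate(sample, squared_diff, power_modifier(0.5))
print(raw_rate, fixed_rate)
```

On this sample, every triplet violates the triangle inequality under the raw squared difference, and none does after the square-root modifier, which simply recovers the absolute difference. A stronger (more concave) modifier generally makes distances more uniform, which is why the annotations above speak of balancing retrieval error against indexability.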

Recommender System as the Support for Binaural Audio

Authors

Bernhauer, D.; Skopal, T.

Year

2019

Published

Augmented Reality and Virtual Reality. Cham: Springer International Publishing AG, 2019. p. 233-246. ISSN 2196-8705. ISBN 978-3-030-06245-3.

Type

Book chapter

DOI

10.1007/978-3-030-06246-0_17

Departments

Department of Software Engineering

Annotation

Virtual reality devices nowadays can effectively utilise other senses besides vision, too. The most often used secondary sense is hearing, with binaural audio as the VR engine. Currently, practical usage of binaural audio as the source of VR is impossible because of the inaccuracy of a general model. On the other hand, measuring the personalised parameters can be time-consuming. Our task was to prove the possibility of reconstruction of the binaural audio parameters in domestic conditions. We have focused on the design of a user interface that can be used independently of the platform. Our proposed browser-based application uses collaborative filtering as a recommender system. We have proven that sound-based navigation in the axial plane is possible with 6.6° inaccuracy. The gamification and browser-based implementation make it easier for all people to find the best possible parameters. The resulting profile can be used both with a fully VR environment and with semi-VR games.

SIMILANT: An Analytic Tool for Similarity Modeling

Authors

Bernhauer, D.; Skopal, T.; Holubová, I.; Peška, L.; Svoboda, M.

Year

2019

Published

Proceedings of the 28th ACM International Conference on Information and Knowledge Management. New York: Association for Computing Machinery, 2019. p. 2889-2892. ISBN 978-1-4503-6976-3.

Type

Proceedings paper

DOI

10.1145/3357384.3357852

Departments

Department of Software Engineering

Annotation

We present SIMILANT, a data analytics tool for modeling similarity in content-based retrieval scenarios. In similarity search, data elements are modeled using black-box descriptors, where a pair-wise similarity function is the only way to relate data elements to each other. Only these relations provide information about the dataset structure. Data analysts need to identify meaningful combinations of descriptors and similarity functions effectively. Therefore, we proposed a tool enabling a data analyst to systematically browse, tune, and analyze similarity models for a specific domain.
