Inexact tree pattern matching with 1-degree edit distance using finite automata
Autoři
Šestáková, E.; Guth, O.; Janoušek, J.
Rok
2023
Publikováno
Discrete Applied Mathematics. 2023, 330 78-97. ISSN 0166-218X.
Typ
Článek
Pracoviště
Anotace
Given an input tree and a tree pattern, the inexact tree pattern matching problem is about finding all subtrees in the input tree that match the tree pattern with up to errors. To measure the number of errors between two labeled ordered trees, we use the 1-degree edit distance that uses operations node relabel, leaf insert, and leaf delete. The proposed solution is based on a finite automaton that reads a given input tree represented in linear, prefix bar, notation. First, we show its construction for a constrained variant of 1-degree edit distance where leaf insert/delete operations cannot be applied to the tree pattern recursively. Then, we extend this approach to both unit cost and non-unit cost 1-degree edit distance. The deterministic version of the proposed finite automaton finds all inexact occurrences of the tree pattern in time linear to the size of the input tree. However, since the size of such automaton can be exponential in the number of nodes of the tree pattern, it is practical only for small patterns. Therefore, we also consider the dynamic programming approach as a way of simulating the nondeterministic finite automaton. This approach comes with the space complexity and time complexity where is the number of nodes of the tree pattern, is the number of nodes of the input tree, and is the maximum number of errors allowed.
Automata Approach to Inexact Tree Pattern Matching Using 1-degree Edit Distance
Autoři
Šestáková, E.; Guth, O.; Janoušek, J.
Rok
2021
Publikováno
Proceedings of the Prague Stringology Conference 2021. Praha: CESKE VYSOKE UCENI TECHNICKE V PRAZE, 2021. p. 1-15. ISBN 978-80-01-06869-4.
Typ
Stať ve sborníku
Pracoviště
Anotace
We compare labeled ordered trees based on unit cost 1-degree edit distance that uses operations vertex relabeling, leaf insertion, and leaf deletion. Given an input tree T and a tree pattern P, we find all subtrees in T that match P with up to k errors. We show that this problem can be solved by finite automaton when T and P are represented in linear, prefix bar, notation. First, we solve this problem by a pushdown automaton. Then, we show that it can be transformed into a nondeterministic finite automaton due to its restricted use of the pushdown store. We also show a simulation of the nondeterministic finite automaton by dynamic programming.
Co-Teaching Computer Science Across Borders: Human-Centric Learning at Scale
Autoři
Piech, C.; Yan, L.; Einstein, L.; Saavedra, A.; Bozkurt, B.; Šestáková, E.; Guth, O.; McKeown, N.
Rok
2020
Publikováno
L@S '20: Proceedings of the Seventh ACM Conference on Learning @ Scale. New York: Association for Computing Machinery, 2020. p. 103-113. ISBN 978-1-4503-7951-9.
Typ
Stať ve sborníku
Pracoviště
Anotace
Programming is fast becoming a required skill set for students in every country. We present CS Bridge, a model for cross-border co-teaching of CS1, along with a corresponding open-source course-in-a-box curriculum made for easy localization. In the CS Bridge model, instructors and student-teachers from different countries come together to teach a short, stand-alone CS1 course to hundreds of local high school students. The corresponding open-source curriculum has been specifically designed to be easily adapted to a wide variety of local teaching practices, languages, and cultures.
Over the past six years, the curriculum has been used to teach CS1 material to over 1,000 high school students in Colombia, the Czech Republic, Turkey, and Guinea. A large majority of our students continue on to study CS or CS-related fields in university. More importantly, many of our undergraduate student-teachers stay involved with teaching beyond the program. Joint teaching creates a positive, high-quality learning experience for students around the world and a powerful, high-impact professional development experience for the teaching team---instructors and student-teachers alike.
Automata Approach to XML Data Indexing
Typ
Článek
Pracoviště
Anotace
The internal structure of XML documents can be viewed as a tree. Trees are among the fundamental and well-studied data structures in computer science. They express a hierarchical structure and are widely used in many applications. This paper focuses on the problem of processing tree data structures; particularly, it studies the XML index problem. Although there exist many state-of-the-art methods, the XML index problem still belongs to the active research areas. However, existing methods usually lack clear references to a systematic approach to the standard theory of formal languages and automata. Therefore, we present some new methods solving the XML index problem using the automata theory. These methods are simple and allow one to efficiently process a small subset of XPath. Thus, having an XML data structure, our methods can be used efficiently as auxiliary data structures that enable answering a particular set of queries, e.g., XPath queries using any combination of the child and descendant-or-self axes. Given an XML tree model with n nodes, the searching phase uses the index, reads an input query of size m, finds the answer in time O(m) and does not depend on the size of the original XML document.
Constrained Approximate Subtree Matching by Finite Automata
Autoři
Šestáková, E.; Melichar, B.; Janoušek, J.
Rok
2018
Publikováno
Proceedings of the Prague Stringology Conference 2018. Praha: Czech Technical University in Prague, 2018. p. 79-90. ISBN 978-80-01-06484-9.
Typ
Stať ve sborníku
Pracoviště
Anotace
Processing tree data structures usually requires a pushdown automaton as a model of computation. Therefore, it is interesting that a finite automaton can be used to solve the constrained approximate subtree pattern matching problem. A systematic approach to the construction of such matcher by finite automaton, which reads input trees in prefix bar notation, is presented. Given a tree pattern and an input tree with m and n nodes, respectively, the nondeterministic finite automaton for the pattern is constructed and it is able to find all occurrences of the pattern to subtrees of the input tree with maximum given distance k. The distance between the pattern and subtrees of an input tree is measured by minimal number of restricted tree edit operations, called leaf nodes edit operations. The corresponding deterministic finite automaton finds all occurrences in time O(n) and has O(|A|^k m^(k+1)) states, where A is an alphabet containing all possible node labels. Note that the size is not exponential in the number of nodes of the tree pattern but only in the number of errors. In practice, the number of errors is expected to be a small constant that is much smaller than the size of the pattern. To achieve better space complexity, it is also shown how dynamic programming approach can be used to simulate the nondeterministic automaton. The space complexity of this approach is O(m), while the time complexity is O(mn).
Indexing XML Documents Using Tree Paths Automaton
Autoři
Rok
2017
Publikováno
6th Symposium on Languages, Applications and Technologies. Saarbrücken: Dagstuhl Publishing,, 2017. p. 10:1-10:14. ISSN 1868-8969. ISBN 978-3-95977-056-9.
Typ
Stať ve sborníku
Pracoviště
Anotace
An XML document can be viewed as a tree in a natural way. Processing tree data structures usually requires a pushdown automaton as a model of computation. Therefore, it is interesting that a finite automaton can be used to solve the XML index problem. In this paper, we attempt to support a significant fragment of XPath queries which may use any combination of child (i.e., /) and descendant-or-self (i.e., //) axis. A systematic approach to the construction of such XML index, which is a finite automaton called Tree Paths Automaton, is presented. Given an XML tree model T, the tree is first of all preprocessed by means of its linear fragments called string paths. Since only path queries are considered, the branching structure of the XML tree model can be omitted. For individual string paths, smaller Tree Paths Automata are built, and they are afterwards combined to form the index. The searching phase uses the index, reads an input query Q of size m, and computes the list of positions of all occurrences of Q in the tree T. The searching is performed in time O(m) and does not depend on the size of the XML document. Although the number of queries is clearly exponential in the number of nodes of the XML tree model, the size of the index seems to be, according to our experimental results, usually only about 2.5 times larger than the size of the original document.
Tree String Path Subsequences Automaton and Its Use for Indexing XML Documents
Autoři
Rok
2015
Publikováno
Languages, Applications and Technologies. Cham: Springer International Publishing AG, 2015. pp. 171-181. ISSN 1865-0929. ISBN 978-3-319-27652-6.
Typ
Stať ve sborníku
Pracoviště
Anotace
The theory of indexing texts is well-researched, which does not hold for indexing other data structures, such as trees for example. In this paper a simple method of indexing a tree for subsequences of string paths in the tree by finite automaton is presented. The use of the index is shown on indexing XML documents for XPath descendant-or-self axis inspired queries. Given a subject tree T with n nodes, the tree is preprocessed and an index, which is a directed acyclic subsequence graph for a set of strings, is constructed. The searching phase uses the index, reads an input string path subsequence Q inspired by the specific XPath query of size m and computes the list of positions of all occurrences of Q in the tree T. The searching is performed in time O(m) and does not depend on n. Although the number of distinct valid queries is O(2n), the size of the index is O(hk), where h is the height of the tree T and k is the number of its leaves. Moreover, we discuss that in the case of indexing a common XML document the size of the index is even smaller O(h⋅2k).