Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests
Autoři
Bohuslavová, D.; Ahmed, O.; Gagie, T.; Holub, J.; Langmead, B.; Manzini, G.; Navarro, G.
Rok
2024
Publikováno
Symposium on Experimental Algorithms. Dagstuhl: Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2024. p. 10:1-10:13. vol. 301. ISSN 1868-8969.
Typ
Stať ve sborníku
Pracoviště
Anotace
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM’s occurrences in those genomes; find the minimum and maximum values stored in that interval; take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.
This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes’ concatenation: a KATKA kernel, which discards characters that are not in the first or last occurrence of any kmax-tuple, for a parameter kmax; a minimizer digest; a KATKA kernel of a minimizer digest.
With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads’ longest MEMs occurred only in the sequences from which those reads were generated (“true positive” reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
Backward Pattern Matching on Elastic-Degenerate Strings
Autoři
Procházka, P.; Cvacho, O.; Krčál, L.; Holub, J.
Rok
2023
Publikováno
SN Computer Science. 2023, 4(5), ISSN 2662-995X.
Typ
Článek
Pracoviště
Anotace
Recently, the concept of Elastic Degenerate Strings (EDS) was introduced as a way of representing a sequenced population of the same species. Several on-line Elastic Degenerate String Matching (EDSM) algorithms were presented so far. Some of them provide practical implementation. We propose a new on-line EDSM algorithm BNDM-EDS. Our algorithm combines two traditional algorithms BNDM and the Shift-And that were adapted to the specifics needed by Elastic Degenerate Strings. BNDM-EDS is running in O (Nmd w m e) worst-case time.
This implies O (Nm) time for small patterns, where m is the length of the searched pattern, N is the size of EDS, and w is the size of the computer word. The algorithm uses O (N + n) space, where n is the length of EDS. BNDM-EDS requires a simple preprocessing step with time and space O(m). Experimental results on real genomic data show superiority of BNDM-EDS over state-of-the-art algorithms.
Hierarchical Bitmap Indexing for Range Queries on Multidimensional Arrays
Autoři
Krčál, L.; Ho, S.-S.; Holub, J.
Rok
2022
Publikováno
Database Systems for Advanced Applications. Springer, Cham, 2022. p. 509-525. Lecture Notes in Computer Science. vol. 13245. ISSN 0302-9743. ISBN 978-3-031-00122-2.
Typ
Stať ve sborníku
Pracoviště
Anotace
Bitmap indices are widely used in commercial databases for processing complex queries, due to their efficient use of bit-wise operations. Bitmap indices apply natively to relational and linear datasets, with distinct separation of the columns or attributes, but do not perform well on multidimensional array scientific data.
We propose a new method for multidimensional array indexing that considers the spatial component of multidimensional arrays. The hierarchical indexing method is based on sparse n-dimensional trees for dimension partitioning, and bitmap indexing with adaptive binning for attribute partitioning. This indexing performs well on range queries involving both dimension and attribute constraints, as it prunes the search space early. Moreover, the indexing is easily extensible to membership queries.
The indexing method was implemented on top of a state of the art bitmap indexing library Fastbit, using tables partitioned along any subset of the data dimensions. We show that the hierarchical bitmap index outperforms conventional bitmap indexing, where an auxiliary attribute is required for each dimension. Furthermore, the adaptive binning significantly reduces the amount of bins and therefore memory requirements.
Backward Pattern Matching on Elastic Degenerate Strings
Autoři
Procházka, P.; Cvacho, O.; Krčál, L.; Holub, J.
Rok
2021
Publikováno
Proceedings of 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021). Lisboa: SCITEPRESS – Science and Technology Publications, Lda, 2021. p. 50-59. vol. 3. ISSN 2184-4305. ISBN 978-989-758-490-9.
Typ
Stať ve sborníku
Pracoviště
Anotace
Recently, the concept of Elastic Degenerate Strings (EDS) was introduced as a way of representing a sequenced population of the same species. Several on-line Elastic Degenerate String Matching (EDSM) algorithms were presented so far. Some of them provide practical implementation. We propose a new on-line EDSM algorithm BNDM-EDS. Our algorithm combines two traditional algorithms BNDM and the Shift-And that were adapted to the specifics needed by Elastic Degenerate Strings. BNDM-EDS is running in O (Nmd m w e) worst-case time.
This implies O (Nm) time for small patterns, where m is the length of the searched pattern, N is the size of EDS, and w is the size of the computer word. The algorithm uses O (N + n) space, where n is the length of EDS. BNDM-EDS requires a simple preprocessing step with time and space O (m). Experimental results on real genomic data show superiority of BNDM-EDS over state-of-the-art algorithms.
Data Structures to Represent a Set of k-long DNA Sequences
Autoři
Chikhi, R.; Holub, J.; Medvedev, P.
Rok
2021
Publikováno
ACM Computing Surveys. 2021, 54(1), 17:1-17:22. ISSN 0360-0300.
Typ
Článek
DOI
Pracoviště
Anotace
The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
PFP Compressed Suffix Trees
Autoři
Boucher, C.; Cvacho, O.; Gagie, T.; Holub, J.; Manzini, G.; Navarro, G.; Rossi, M.
Rok
2021
Publikováno
Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX2021). Philadelphia: SIAM, 2021. p. 60-72. ISSN 2164-0300. ISBN 978-1-61197-647-2.
Typ
Stať ve sborníku
Pracoviště
Anotace
Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a parse P of overlapping phrases such that BWT(S) can be computed from D and P in time and workspace
bounded in terms of their combined size |PFP(S)|. In practice D and P are significantly smaller than S and computing BWT(S) from them is more efficient than computing
it from S directly, at least when S is the concatenation of many genomes. In this paper, we consider PFP(S) as a data structure and show how it can be augmented to sup-
port full suffix tree functionality, still built and fitting within O(|PFP(S)|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in S; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and
computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for
very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.
On-line Searching in IUPAC nucleotide sequences
Autoři
Procházka, P.; Holub, J.
Rok
2019
Publikováno
Proceedings of 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2019). Lisboa: SCITEPRESS – Science and Technology Publications, Lda, 2019. p. 66-77. vol. 3. ISSN 2184-4305. ISBN 978-989-758-353-7.
Typ
Stať ve sborníku
Pracoviště
Anotace
We propose a novel pattern matching algorithm for consensus nucleotide sequences over IUPAC alphabet, called BADPM (Byte-Aligned Degenerate Pattern Matching). The consensus nucleotide sequences represent a consensus obtained by sequencing a population of the same species and they are considered as so-called degenerate strings. BADPM works at the level of single bytes and it achieves sublinear search time on average. The algorithm is based on tabulating all possible factors of the searched pattern. It needs O (m + mα2 log m)-space data structure and O(mα2 ) time for preprocessing where m is a length of the pattern and α represents a maximum number of variants implied from a 4-gram over IUPAC alphabet. The worst-case locate time is
bounded by O (nm2α4 ) for BADPM where n is the length of the input text. However, the experiments performed on real genomic data proved the sublinear search time. BADPM can easily cooperate with the block q-gram inverted index and so achieve still better locate time. We implemented two other pattern matching algorithms for IUPAC nucleotide sequences as a baseline: Boyer-Moore-Horspool (BMH) and Parallel Naive Search (PNS). Especially PNS proves its efficiency insensitive to the length of the searched pattern m. BADPM proved its strong superiority for searching middle and long patterns.
Filtering Invalid Off-Targets in CRISPR/Cas9 Design Tools
Autoři
Cvacho, O.; Holub, J.
Rok
2018
Publikováno
Proceedings of Data Compression Conference 2018. New York: IEEE Computer Society Press, 2018. p. 403. ISSN 2375-0359. ISBN 978-1-5386-4883-4.
Typ
Stať ve sborníku
Pracoviště
Anotace
The paper deals with an approximate string matching in CRISPR/Cas9 design tools. The approximate string matching is transformed into sequence of exact string matching tasks processed by fast succinct (compressed) self-index. The number of the tasks is rapidly decreased by filtering technique based on de Bruijn graph.
Reconstructing a String from its Lyndon Arrays
Autoři
Daykin, J.W.; Franek, F; Holub, J.; Sohidull Islam, A.S.M.; Smyth, W.F.
Rok
2018
Publikováno
Theoretical Computer Science. 2018, 710 44-51. ISSN 0304-3975.
Typ
Článek
Pracoviště
Anotace
Given a string x = x[1.. n] on an ordered alphabet of size σ , the Lyndon array λ =
λx [1..n] of x is an array of positive integers such that λ[i], 1 ≤ i ≤ n, is the length of the
maximal Lyndon word over the ordering of that begins at position i in x. The Lyndon
array has recently attracted considerable attention due to its pivotal role in establishing the
long-standing conjecture that ρ (n ) < n, where ρ ( n) is the maximum number of maximal
periodicities (runs) in any string of length n. Here we first describe two linear-time
algorithms that, given a valid Lyndon array λ, compute a corresponding string — one for
an alphabet of size n, the other for a smaller alphabet. We go on to describe another
linear-time algorithm that determines whether or not a given integer array is a Lyndon
array of some string. Finally we show how σ Lyndon arrays λ = {λ1 = λ, λ2 , . . . , λσ }
corresponding to σ “rotations” of the alphabet can be used to determine uniquely the
string x on such that λx = λ.
SOPanG: online text searching over a pan-genome
Autoři
Cislak, A.; Grabowski, S.; Holub, J.
Rok
2018
Publikováno
Bioinformatics. 2018, 34(24), 4290-4292. ISSN 1367-4803.
Typ
Článek
Pracoviště
Anotace
Motivation: The many thousands of high-quality genomes available now-a-days imply a shift from single genome to pan-genomic analyses. A basic algorithmic building brick for such a scenario is online search over a collection of similar texts, a problem with surprisingly few solutions presented so far.
Byte-Aligned Pattern Matching in Encoded Genomic Sequences
Autoři
Procházka, P.; Holub, J.
Rok
2017
Publikováno
Proceedings of 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Saarbrücken: Dagstuhl Publishing,, 2017. p. 20:1-20:13. ISSN 1868-8969. ISBN 978-3-95977-050-7.
Typ
Stať ve sborníku
Pracoviště
Anotace
In this article, we propose a novel pattern matching algorithm, called BAPM, that performs searching in the encoded genomic sequences. The algorithm works at the level of single bytes and it achieves sublinear performance on average. The preprocessing phase of the algorithm is linear with respect to the size of the searched pattern $m$. A simple $\mathcal{O}(m)$-space data structure is used to store all factors (with a defined length) of the searched pattern. These factors are later searched during the searching phase which ensures sublinear time on average. Our algorithm significantly overcomes the state-of-the-art pattern matching algorithms in the locate time on middle and long patterns. Furthermore, it is able to cooperate very easily with the block $q$-gram inverted index. The block $q$-gram inverted index together with our pattern matching algorithm achieve superior results in terms of locate time to the current index data structures for less frequent patterns. We present experimental results using real genomic data. These results prove efficiency of our algorithm.
Technology Beats Algorithms (in Exact String Matching)
Autoři
Tarhio, J; Holub, J.; Giaquinta, E
Rok
2017
Publikováno
Software - Practice and Experience. 2017, 47(12), 1877-1885. ISSN 0038-0644.
Typ
Článek
DOI
Pracoviště
Anotace
More than 120 algorithms have been developed for exact string matching within the last 40 years. We show by experiments that the \naive{} algorithm exploiting SIMD instructions of modern CPUs (with symbols compared in a special order) is the fastest one for patterns of length up to about 50 symbols and extremely good for longer patterns and small alphabets. The algorithm compares 16 or 32 characters in parallel by applying SSE2 or AVX2 instructions, respectively. Moreover, it uses loop peeling to further speed up the searching phase. We tried several orders for comparisons of pattern symbols and the increasing order of their probabilities in the text was the best.
Towards Efficient Positional Inverted Index
Typ
Článek
Pracoviště
Anotace
We address the problem of positional indexing in the natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. Inverted lists of single terms are combined into one inverted list that represents the backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with a presentation layer that stores separators, stop words, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The space complexity of the PISI inverted list is O((N−n)⌈log2bN⌉+(⌊N−nα⌋+n)×(⌈log2bn⌉+1)) where N is a number of stems, n is a number of unique stems, α is a step/period of the back pointers in the inverted list and b is the size of the word of computer memory given in bits. The space complexity of the presentation layer is O(−∑Ni=1⌈log2pn(i)i⌉−∑N′j=1⌈log2p′j⌉+N) with respect to pn(i)i as a probability of a stem variant at position i, p′j as the probability of separator or stop word at position j and N′ as the number of separators and stop words
Approximate String Matching for Self-Indexes
Autoři
Hrbek, L.; Holub, J.
Rok
2016
Publikováno
Proceedings of Data Compression Conference 2016. New York: IEEE Computer Society Press, 2016. pp. 604. ISSN 1068-0314. ISBN 978-1-5090-1853-6.
Typ
Stať ve sborníku
Pracoviště
Anotace
Self-index is a data structure that allows to find all exact occurrences of given pattern of length $m$ in $m$ steps. The paper presents an approach how to use self-index for approximate string matching with no more than $k$~differences (given by Hamming or Levenshtein distance). Our approach solution uses filtering based on the pigeonhole principle. For implementation we have selected FM-index. Our program is faster than BLAST for large texts (hundreds of MB) and more space efficient for shorter texts.
Positional Inverted Self-index
Autoři
Procházka, P.; Holub, J.
Rok
2016
Publikováno
Proceedings of Data Compression Conference 2016. New York: IEEE Computer Society Press, 2016. pp. 627. ISSN 1068-0314. ISBN 978-1-5090-1853-6.
Typ
Stať ve sborníku
Pracoviště
Anotace
We address the problem of positional indexing in natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. The inverted lists of single terms are combined into one inverted list that represents a backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with presentation layer that stores separators, stopwords, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The trade-off between compression effectiveness and the decompression speed can be tuned by different parameters of the index algorithm.
⌉ ×
Towards Efficient Positional Inverted Index
Autoři
Procházka, P.; Holub, J.
Rok
2016
Publikováno
Proceedings of International Conference on Data Compression, Communication, Processing and Security 2016. Salerno: Universita di Salerno, 2016.
Typ
Stať ve sborníku
Pracoviště
Anotace
Abstract—We address the problem of positional indexing in the natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to
store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. Inverted lists of single terms are combined into one inverted list that represents the backbone of the text file since it stores the
sequence of the indexed words of the original file. The inverted list is synchronized with a presentation layer that stores separators, stop words, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The trade-
off between compression effectiveness and decompression speed can be tuned by different parameters of the index algorithm. The space complexity of the PISI inverted list is O((N − n)⌈log 2b N ⌉+(⌊ N α −n ⌋+n)×(⌈log2b n⌉+1)) where N is a number of stems, n is a number of unique stems, α is a step/period of the back pointers in the inverted list and b is the size of the word of computer memory given ∑ in N bits. The n(i) space
complexity N′′ of the presentation layer is O(− i=1 ⌈log2 pi ⌉− j=1 ⌈log2 pj ⌉+N ) n(i) with respect to pi as a probability of a stem variant at position i, p′ j as the probability of separator or stop word at ′position j and N as the number of separators and stop words. Our experiments prove that PISI is close to standard positional inverted index in terms of search speed and, at the same time, it is more efficient in memory consumption. PISI also proved that it is significantly faster than its close competitor FWCSA [17] in terms of search speed at the same level of memory consumption.
Compression of a Set of Files with Natural Language Content
Autoři
Procházka, P.; Holub, J.
Rok
2015
Publikováno
Computer Journal. 2015, 58(5), 1169-1185. ISSN 0010-4620.
Typ
Článek
Pracoviště
Anotace
An algorithm for very efficient compression of a set of natural language text files is presented. Not only a very good compression ratio is reached, the used compression method allows fast pattern matching in compressed text, which is an attractive property especially for search engines. Much information is stored in the form of a large collection of text files. The web search engines can store the web pages in the raw text form to build so-called snippets or to perform so-called positional ranking functions on them. Furthermore, there exist many other similar contexts such as the storage of emails, application logs or the databases of text files (literary works or technical reports). In this paper, we address the problem of the compression of a large collection of text files distributed in cluster of computers, where the single files need to be randomly accessed in very short time. The compression algorithm is based on a word-based approach and the idea of combination of two statistical models: global model (common for all the files of the set) and local model. The latter is built as a set of changes that transform the global model to the proper model of the single compressed file.
Incremental Locality & Clustering-Based Compression
Autoři
Krčál, L.; Holub, J.
Rok
2015
Publikováno
Proceedings of Data Compression Conference 2015. Los Alamitos: IEEE Computer Society Press, 2015. pp. 203-212. ISSN 1068-0314. ISBN 978-1-4799-8430-5.
Typ
Stať ve sborníku
Pracoviště
Anotace
Current compression solutions either use a limited size locality-based context or the entire input, to which the compressors adapt. This results in suboptimal compression effectiveness due to missing similarities further apart in the former case, or due to too generic adaptation. There are many deduplication and near deduplication systems that search for similarity across the entire input. Although most of these systems excel with their simplicity and speed, none of those goes deeper in terms of full-scale redundancy removal. We propose a novel compression and archival system called ICBCS. Our system goes beyond standard measures for similarity detection, using extended similarity hash and incremental clustering techniques to determine groups of sufficiently similar chunks designated for compression. ICBCS outperforms conventional file compression solutions on datasets consisting of at least mildly redundant files. It also shows that selective application of weak compressor results in better compression ratio and speed than conventional application of a strong compressor.
Compressing Similar Biological Sequences using FM-index
Autoři
Procházka, P.; Holub, J.
Rok
2014
Publikováno
Proceedings of Data Compression Conference 2014. Los Alamitos: IEEE Computer Society, 2014. p. 312-321. ISSN 1068-0314. ISBN 978-1-4799-3882-7.
Typ
Stať ve sborníku
Pracoviště
Anotace
Nowadays, decreasing cost and better accessibility of sequencing methods have enabled studies of genetic variation between individuals of the same species and also between two related species. This has led to a rapid increase in biological data consisting of sequences that are very similar to each other, these sequences usually being stored together in one database. We propose a compression method based on Wavelet Tree FM-index optimized for compression of a set of similar biological sequences. The compression method is based on tracking single changes (together with their context) between every single sequence and the chosen reference sequence. We call our compression method BIO-FMI.
It gives very promising results in compression ratio and in locate time when
performed on an extremely repetitive data set (less than 0.5 % mutations) and when the searched patterns are of smaller lengths (less than 20 bases). BIO-FMI is competitive in extraction speed and it seems to be superior in time needed to build the index, especially in the case when the alignments of single sequences are given in advance.
ACM Student Project of the Year 2013 Competition
Autoři
Rok
2013
Publikováno
Information Sciences and Technologies, Bulletin of the ACM Slovakia. 2013, 5(4), 55-56. ISSN 1338-1237.
Typ
Článek
Pracoviště
Anotace
The paper reports on finals of ACM Student Project of the Year 2013 Competition.
Natural Language Compression Optimized for Large Set of Files
Autoři
Procházka, P.; Holub, J.
Rok
2013
Publikováno
Proceedings of Data Compression Conference 2013. Los Alamitos, CA: IEEE Computer Soc., 2013. p. 514. ISSN 1068-0314. ISBN 978-0-7695-4965-1.
Typ
Stať ve sborníku
Pracoviště
Anotace
The paper describes one of the typical scenarios of data compression for large collections of files. We address the problem of the compression of a large collection of text files distributed in cluster of computers, where the single files need to be randomly accessed in very short time.
ODC: Frame for definition of Dense codes
Autoři
Procházka, P.; Holub, J.
Rok
2013
Publikováno
European Journal of Combinatorics. 2013, 2013(1), 52-68. ISSN 0195-6698.
Typ
Článek
Pracoviště
Anotace
The natural language compression made a great progress in last two decades. The main step in this evolution was the introduction of word-based compression by Moffat. Another improvement came with so called Dense codes which proved to be very fast in compression and decompression while keeping good compression ratio and direct search capability. Many variants of the Dense codes were described, each of them using its own definition. In this paper we present generalized concept of dense coding called Open Dense Code (ODC) which aims to be a frame for definition of many other dense code schemas. ODC underlines common features of the dense code schemas but at the same time allows to express the divergences of each of them. Using the frame of ODC we present two new word-based statistical compression algorithms based on dense coding idea: Two Byte Dense Code (TBDC) and Self-Tuning Dense Code (STDC). Our algorithms improve the compression ratio and are considerate to smaller files which are very often omitted.
Suffix Tree of Alignment: An Efficient Index for Similar Data
Autoři
Na, J. C.; Park, H.; Crochemore, M.; Holub, J.; Iliopoulos, C. S.; Mouchard, L.; Park, K.
Rok
2013
Publikováno
Proceedings of the 24th Workshop on Combinatorial Algorithms (IWOCA 2013). Berlin: Springer, 2013. pp. 337-348. ISSN 0302-9743. ISBN 978-3-642-45277-2.
Typ
Stať ve sborníku
Pracoviště
Anotace
We consider an index data structure for similar strings. The
generalized suffix tree can be a solution for this. The generalized suffix
tree of two strings A and B is a compacted trie representing all suffixes
in A and B. It has |A|+|B| leaves and can be constructed in O(|A|+|B|)
time. However, if the two strings are similar, the generalized suffix tree
is not efficient because it does not exploit the similarity which is usually
represented as an alignment of A and B.
In this paper we propose a space/time-efficient suffix tree of alignment
which wisely exploits the similarity in an alignment. Our suffix tree for
an alignment of A and B has |A|+ld +l1 leaves where ld is the sum of the
lengths of all parts of B different from A and l1 is the sum of the lengths
of some common parts of A and B. We did not compromise the pattern
search to reduce the space. Our suffix tree can be searched for a pattern
P in O(|P | + occ) time where occ is the number of occurrences of P in
A and B. We also present an efficient algorithm to construct the suffix
tree of alignment. When the suffix tree is constructed from scratch, the
algorithm requires O(|A| + ld + l1 + l2 ) time where l2 is the sum of the
lengths of other common substrings of A and B. When the suffix tree of
A is already given, it requires O(ld + l1 + l2 ) time.
The Finite Automata Approaches for Bioinformatics
Autoři
Rok
2012
Publikováno
Proceedings of 22nd Theorietag Automata and Formal Languages. Praha: Matfyzpress, 2012. pp. 1-2. ISBN 978-80-7378-221-4.
Typ
Stať ve sborníku vyzvaná či oceněná
Pracoviště
Anotace
Stringology is a computer science on string and sequence processing. Finite automata can solve efficiently many tasks in stringology. An overview of four approaches of the finite automata use in stringology is presented: deterministic finite automaton, deterministic simulation of nondeterministic finite automaton, finite automaton as a model of computation, and compositions of finite automata solutions. The way the finite automata can process strings build over more complex alphabet than just single symbols (degenerate symbols, strings, variables) is also shown.
Those approaches and complex alphabet use are demonstrated in various tasks of bioinformatics. The tasks work with sequences on various alphabets: DNA, RNA, amino acids. The searched patterns may be explicitly defined or may be defined by their properties. Some patterns may allow errors some must match exactly. Some pattern symbols may match exactly one symbol, some may match a subset of alphabet as they were not read exactly or their practical behaviour is the same.
The Finite Automata Approaches in Stringology
Typ
Článek
Pracoviště
Anotace
We present an overview of four approaches of the finite automata use in stringology: deterministic finite automaton, deterministic simulation of nondeterministic finite automaton, finite automaton as a model of computation, and compositions of finite automata solutions. We also show how the finite automata can process strings build over more complex alphabet than just single symbols (degenerate symbols, strings, variables).
Block-oriented Dense Compressor
Autoři
Procházka, P.; Holub, J.
Rok
2011
Publikováno
Proceedings Data Compression Conference. Los Alamitos, CA: IEEE Computer Soc., 2011. pp. 372. ISSN 1068-0314. ISBN 978-0-7695-4352-9.
Typ
Stať ve sborníku
Pracoviště
Anotace
We address the problem of block-oriented natural language compression. The block-oriented compression is semi-adaptive in terms of one block but it is adaptive in terms of whole input. Our block-oriented compression method is based on Dense Code idea. It achieves very good compression ratio around 32 % on natural language text. We show
that our compression method has some interesting properties which could be applied in digital libraries and other textual databases. The compression method allows direct searching on compressed text. Moreover the vocabulary can be used as a block index which makes some kinds of searching very fast. Another property is that the compressor can send single blocks with corresponding vocabulary which is considerate to limited bandwidth. In addition the compressed file can be continuously extended without need of previous decompression.
Finite Automata in Pattern Matching
Autoři
Rok
2011
Publikováno
Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications. New York: J. Willey, 2011. p. 51-71. ISBN 978-0-470-50519-9.
Typ
Kapitola v knize
Pracoviště
Anotace
Stringology is a part of Computer Science dealing with processing strings and sequences. It finds many important applications in various fields used by Information Society. Biological sequences processing is one of the most important fields. On the other hand the finite automata theory is a well developed formal system used for a long time in the area of compiler construction.
The chapter aims to show various approaches of the finite automata use in Stringology. The approaches are demonstrated on practical examples. Of course it is impossible to describe all the approaches as it would be out of scope of the chapter. However, we would like to present basic approaches that the reader can modify and combine to a given task.
Lossless Data Compression Testbed: ExCom and Prague Corpus
Autoři
Holub, J.; Řezníček, Jakub
Rok
2011
Publikováno
Proceedings Data Compression Conference. Los Alamitos, CA: IEEE Computer Soc., 2011. pp. 457. ISSN 1068-0314. ISBN 978-0-7695-4352-9.
Typ
Stať ve sborníku
Pracoviště
Anotace
Authors developing new algorithms have to compare their implementations with the implementations of the other algorithms. They can do a testing on standard corpora and compare the achieved compression ratio with published compression ratios made on the same corpus. The corpora are static collections of files so they are getting older and do not contain new file types that appear. Moreover one can compare only the compression ratio. The times of compression and decompression are very important as well. In order to compare the times one have to run the implementation on the same platform and the same hardware. So one have to ask the authors for the source codes and incorporate them into his system. Then the tests could be run on the same machine.
Natural Language Compression per Blocks
Autoři
Procházka, P.; Holub, J.
Rok
2011
Publikováno
First International Conference on Data Compression, Communications and Processing. Los Alamitos: IEEE Computer Society, 2011. pp. 67-75. ISBN 978-0-7695-4528-8.
Typ
Stať ve sborníku
Pracoviště
Anotace
We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC performs compression per blocks. It means that the input is divided into the several blocks and each of the blocks is compressed separately according to its own statistical model. To avoid the redundancy the final vocabulary file is composed is
the sequence of the changes in the model of the two consecutive blocks. STBDC belongs to the family of Dense codes and keeps all their attractive properties including very high compression and decompression speed and acceptable compression ratio around 32 % on natural language text. Moreover STBDC provides other properties applicable in digital libraries and other textual databases. The compression method allows direct searching on the compressed text, whereas the vocabulary can be used as a block index. STBDC is very easy on limited bandwidth in the client/server architecture.
An algorithm for mapping short reads to a dynamically changing genomic sequence
Autoři
Flouri, T.; Holub, J.; Iliopoulos, C.S.I.; Pissis, S.P.P.
Rok
2010
Publikováno
2010 IEEE International Conference on Bioinformatics and Biomedicine. Los Alamitos: IEEE Computer Society, 2010. p. 133-136. ISBN 978-1-4244-8305-1.
Typ
Stať ve sborníku
Pracoviště
Anotace
The constant advances in sequencing technology have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads), during a single experiment, and with a much lower cost than previously possible. Due to this massive amount of data, efficient algorithms for mapping these reads to reference sequences are in great demand, and recently, there has been ample work for publishing such algorithms. In this paper, we study a different version of this problem: mapping these reads to a dynamically changing genomic sequence. We propose a new practical algorithm, which employs a suitable data structure that takes into account potential dynamic effects (replacements, insertions, deletions) on the genomic sequence. The presented experimental results demonstrate that the proposed approach can be applied to address the problem of mapping millions of reads to multiple genomic sequences.
Data compression of Natural languages
Autoři
Jaroš, J.; Holub, J.
Rok
2010
Publikováno
Workshop 2010. Praha: České vysoké učení technické v Praze, 2010. pp. 100-101. CTU Reports. ISBN 978-80-01-04513-8.
Typ
Stať ve sborníku
Pracoviště
Anotace
In this contribution we present the new data compression schema for natural languages compression. We use two-part compression model in this schema: the static part of the model and the adaptive part of the model. The adaptive part of model is similar to the common adaptive model. The best-suited compression algorithms for the adaptive part with respect to the static part of the model are very efficient PPM algorithms. The static part of the model can be represented by many possibilities.
Improving Practical Exact String Matching
Autoři
Ďurian, B.; Holub, J.; Peltola, H.; Tarhio, J.
Rok
2010
Publikováno
Information Processing Letters. 2010, 110(4), 148-152. ISSN 0020-0190.
Typ
Článek
Pracoviště
Anotace
We present improved variations of the BNDM algorithm for exact string matching. At each alignment the algorithms read a $q$-gram before testing the state variable. In addition we apply reading a 2-gram in one instruction.
Semi-static Word-based Natural Language Compression
Autoři
Jaroš, J.; Holub, J.
Rok
2010
Publikováno
Proceedings of the 11th International PhD Workshop on Systems and Control A Young Generation Viewpoint. Veszprém: University of Pannonia, 2010. pp. 57-62. ISBN 978-615-5044-00-7.
Typ
Stať ve sborníku
Pracoviště
Anotace
The main technique of reducing loads of the space needed to store the data or reducing the time needed to transmit the data is the data compression. The novel semi-static word-based data compression schema for natural languages compression is presented in this paper. The two-part compression model (the static part of the model and the adaptive part of the model) is used in this schema. The compression ratio it achieves is the same as the compression ratio of used specific adaptive data compression algorithms in the worst case.
A Different Proof of the Crochemore-Ilie Lemma Concerning Microruns
Autoři
Franek, F.; Holub, J.
Rok
2009
Publikováno
London Algorithmics 2008: Theory and Practice. London: College Publications, 2009. pp. 1-9. ISBN 978-1-904987-97-0.
Typ
Stať ve sborníku
Pracoviště
Anotace
In a resent paper, Crochemore and Ilie improved the upper bound of the maxrun function ro(n)=max{r(x) : |x| = n}, where
r(x) denotes the number of runs in a string x, to 1.6n (computational results on Ilie's web page claim the improvement to 1.048n.) Their proof is essentially in two major steps: first step shows that there is at most 1center of a d-run on average in each interval of length d, in the second step they showed that ro9(n) <= n, where ro9(n) is the count limited to runs of size <= 9. The proof of ro9(n) <= n they presented relies on computation. We present an alternative simpler proof, also computational, to provide
an independent verification of Crochemore-Ilie's result. The method used is completely different and has a potential to be verified without an aid of computer.
New Word-based Adaptive Dense Compressors
Autoři
Procházka, P.; Holub, J.
Rok
2009
Publikováno
Combinatorial Algorithms. Berlin: Springer, 2009. pp. 420-431. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-642-10216-5.
Typ
Stať ve sborníku
Pracoviště
Anotace
In the last two decades the natural language compression
made a great progress. The main step in this evolution was the introduction
of word-based compression by Mofiat. The word-based statistical
compression algorithms are able to achieve 35% improvement in the compression
ratio in comparison with character-based ones. We present two
new word-based statistical compression algorithms based on dense coding
idea: Two Byte Dense Code (TBDC) and Self-Tuning Dense Code
(SCDC). TBDC uses the codewords with maximal size 2 bytes and must
be implemented with some pruning technique. STDC is able to tune its
code space during the compression process and so achieve better compression.
Our algorithms improve the compression ratio and are considerate
to smaller ffles which are very often omitted. We present also
a generalized concept of dense coding called Open Dense Code (ODC)
which provides a frame for deffnition of these two and many other dense
code schemas.
NFA Simulation Using Deterministic State Cache
Autoři
Holub, J.; Kadlec, T.
Rok
2009
Publikováno
London Algorithmics 2008: Theory and Practice. London: College Publications, 2009. pp. 152-166. ISBN 978-1-904987-97-0.
Typ
Stať ve sborníku
Anotace
The nondeterministic finite automaton (NFA) cannot be directly used because of its nondeterminism. There are two ways to use it: Firstly, transform it to the equivalent deterministic finite automaton, or secondly, simulate the run of NFA in a deterministic way. We present a deterministic state cache method that combines these two approaches. It uses the fast memory of already used deterministic states, which can under certain circumstances dramatically accelerate the basic simulation method, while we can control the amount of memory used. We present an implementation based on a hash array.
On Parallel Implementations of Deterministic Finite Automata
Autoři
Holub, J.; Štekr, S.
Rok
2009
Publikováno
Implementation and Application of Automata. Heidelberg: Springer, 2009. p. 54-64. ISSN 0302-9743. ISBN 978-3-642-02978-3.
Typ
Stať ve sborníku
Pracoviště
Anotace
We present implementations of parallel DFA run methods and find whether and under what conditions is worthy to use the parallel methods of simulation of run of finite automata.
First, we introduce the parallel DFA run methods for general DFA, which are universal, but due to the dependency of simulation time on the number of states vertical bar Q vertical bar of automaton being run, they are suitable only for run of automata with the smaller number of states.
Then we show that if we apply some restrictions to properties of automata being run, we can reach the linear speedup compared to the sequential simulation method. We designed methods benefiting from k-locality that allows optimum parallel run of exact and approximate pattern matching automata. Finally, we show the results of experiments conducted on two types of parallel computers (Cluster of workstations and Symmetric shared-memory multiprocessors).
Algorithms on Indeterminate Strings
Autoři
Holub, J.; Smyth, W.F.
Rok
2003
Publikováno
Proceedings of the 14th Australian Workshop on Combinatorial Algorithms AWOCA'03. Seoul: Seoul National University, 2003, pp. 36-45. ISBN 89-7282-326-0.
Typ
Stať ve sborníku
Pracoviště
Anotace
A recent paper describes an average-case linear-time algorithm for the calculation of all the borders of every prefix of a string x, where x can contain ''don't-care'' letters that match with any letter of the alphabet. In this paper we show that in fact two distinct cases arise for the border calculation - we call them ''quantum'' and ''deterministic'', and both of them can arise in practice. The existing algorithm applies to the quantum case, but we show how to modify it to deal with the deterministic case. We then extend the border algorithms to apply also to what we call ''indeterminate strings''. We also discuss other algorithms on these indeterminate strings, some of them left as open problems.