Ing. Dana Tomášková

Mobility ČVUT MSCA-F-CZ-I

Program

Operational Programme Jan Amos Komenský (Operační program Jan Amos Komenský)

Provider

European Commission

Workplace

Programming Research Laboratory
Department for Development

Investigators

Pierre Donat-Bouillud, Ph.D.

Code

EH22_010/0003405, CZ.02.01.01/00/22_010/0003405

Period

2023 - 2025

Description

In line with the call, the project will enable international mobility for researchers whose project under the Marie Skłodowska-Curie Individual / Postdoctoral Fellowships programme was approved in previous years but, owing to a lack of funds, was placed in the category of so-called no-money projects. The project will be carried out through working stays of foreign researchers at ČVUT (Czech Technical University in Prague). Its main goal is to support the professional growth of researchers, high-quality research, practice-oriented education, and the development of communication and cooperation. At FIT (the Faculty of Information Technology), this is a 24-month incoming mobility of one researcher, Pierre Donat-Bouillud, with a family allowance. The supervisor is prof. Jan Vitek, MSc., Ph.D. As part of this mobility at FIT, a 3-month secondment is planned in Austria (Paris Lodron University, Salzburg).

Reproducible Data Analysis for All

Program

Horizon Europe

Provider

European Commission

Workplace

Department for Science and Research
Programming Research Laboratory

Investigators

prof. Jan Vitek, MSc., Ph.D.

Code

101081989-R4R

Period

2024 - 2025

Description

Creating a reproducible environment for data-analysis pipelines is hard. The current practice is to assemble it manually. That is both labor-intensive and error-prone and requires skills and knowledge that data analysts do not usually have. While there exist tools that try to simplify this process, they all rely on some metadata that has to be provided by the user. Getting this metadata is not trivial. Not only does one have to include all the libraries directly imported with their transitive dependencies, but each of these libraries can depend on native libraries and tools, which themselves have their dependencies and configurations. Versions have to be pinned appropriately as libraries frequently update and change their behavior. There is no description of these dependencies, and thus the process of gathering the metadata is mainly based on experience and trial-and-error. The challenge that we are addressing is to build an automated system that can track all of the pipeline dependencies, data inputs, and other sources of non-determinism to prepare an environment where data-analysis pipelines can repeatedly run, producing identical results.

Rigorous Engineering of Data Analysis Pipelines (RiGiD)

Program

EXPRO grant projects of excellence in basic research

Provider

Czech Science Foundation (Grantová agentura České republiky)

Workplace

Programming Research Laboratory
Department for Science and Research

Investigators

prof. Jan Vitek, MSc., Ph.D.

Code

GX23-07580X

Period

2023 - 2027

Description

The RiGiD project lays the groundwork for this research programme and aims to develop a methodology for rigorous engineering of data analysis pipelines that can be adopted in practice. Our approach is pragmatic. Rather than chasing functional correctness, we hope to substantially reduce the incidence of errors in the wild. The research is structured in three overlapping chapters. First, identify the problem by carrying out user studies and large-scale program analysis of a corpus of over 100,000 data science pipelines. The outcome will be a catalog of error patterns as well as a labeled dataset to be shared with other researchers. The technical advances will focus on combining dynamic and static program analysis to approximate the behavior of partial programs and programs written in highly dynamic languages. The second part of our effort proposes a methodology and tooling for developing data science code with reduced error rates. The technical contributions of this part of the project focus on lightweight specification techniques and, in particular, the development of a novel gradual typing system that deals with common programming idioms found in our corpus. This includes various forms of object orientation, data frames, and rich value specifications. These specifications are complemented with an automated test generation technique that combines test and input synthesis with fuzzing and test minimization. Finally, the execution environment is extended to support automatic reproducibility and result audits through data lineage. The third and last part of the work evaluates the proposal by conducting user studies and developing tools for automating deployment. The contribution will be a qualitative and quantitative assessment of the RiGiD methodology and tooling. The technical contribution will be tools that leverage program analysis to infer approximate specifications to assist deployment and adoption. Our tools target R, a language for data analytics with 2 milli
