Enhanced Regular Expression as a DGL for Generation of Synthetic Big Data

Title

Enhanced Regular Expression as a DGL for Generation of Synthetic Big Data

Subject

Big data

Probability distributions

Data privacy

Petroleum reservoir evaluation

Data analytics

Pattern matching

Query languages

Description

Synthetic data generation is commonly used for performance evaluation and functional testing of data-intensive applications, as well as in areas of data analytics such as privacy-preserving data publishing (PPDP) and statistical disclosure limitation/control. A significant amount of research has been conducted on tools and languages for data generation. However, existing tools and languages were developed for specific purposes and are unsuitable for other domains. In this article, we propose a regular expression-based data generation language (DGL) for flexible big data generation. To achieve a general-purpose and powerful DGL, we enhanced standard regular expressions to support data domains, type/format inference, sequence and random generation, probability distributions, and resource references. To implement the proposed language efficiently, we propose caching techniques for both intermediate results and database queries. We evaluated the proposed improvements experimentally. © 2023 KIPS

Pages 1-16

Creator

Cheng, Kai

Abe, Keisuke

Publisher

Journal of Information Processing Systems

Date

2023

Type

journalArticle

Identifier

1976913X

10.3745/JIPS.04.0262

URL

http://dx.doi.org/10.3745/JIPS.04.0262

Collection

Big Data, AI Applications, and Predictive Maintenance

Citation

Cheng, Kai and Abe, Keisuke, “Enhanced Regular Expression as a DGL for Generation of Synthetic Big Data,” Lamar University Midstream Center Research, accessed May 18, 2024, https://lumc.omeka.net/items/show/29281.
