Dr Abdulhakim Qahtan - Compact Data Representation for Data Cleaning and Knowledge Discovery (venia legendi)

Video production: Ahti Saar 11.04.2019 2285 views Computer Science


Abstract: Many organizations collect vast amounts of data every day, either as data streams such as logs or continuous measurements, or as tables in relational databases. Processing this data to extract meaningful information is a challenging task, and constructing a compact representation of it becomes crucial for better data understanding and analysis. In this talk, I will describe how to extract a compact representation of numerical data by estimating its probability density function (PDF), and of categorical data by discovering inherent compact syntactic structures. These compact representations have applications in data stream mining and data cleaning. I will focus on three main applications: (i) change detection in data streams, where the distribution of the current data values differs from a reference distribution of values that arrived earlier in the stream; (ii) detecting disguised missing values, which are fake values used in place of missing values in a table and which do not reflect the actual data; and (iii) discovering pattern functional dependencies, where the dependency between attributes appears in partial values of the attributes rather than in their full values. For example, the first three digits of a fax number determine the state.
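
The fax-number example of a pattern functional dependency can be illustrated with a minimal sketch. It is not the method presented in the talk: it only checks whether a fixed-length prefix of one attribute maps to a single value of another. The function name `prefix_violations`, the `prefix_len` parameter, and the sample rows are all hypothetical, chosen for illustration.

```python
from collections import defaultdict


def prefix_violations(rows, prefix_len=3):
    """Return the prefixes that map to more than one state.

    rows is an iterable of (fax_number, state) pairs. An empty result
    means the dependency prefix -> state holds on the given rows.
    """
    states_by_prefix = defaultdict(set)
    for fax, state in rows:
        # Keep digits only, so "212-555-0101" and "2125550101" agree.
        digits = "".join(ch for ch in fax if ch.isdigit())
        states_by_prefix[digits[:prefix_len]].add(state)
    return {
        prefix: sorted(states)
        for prefix, states in states_by_prefix.items()
        if len(states) > 1
    }


# Hypothetical toy table; the last row is deliberately inconsistent.
rows = [
    ("212-555-0101", "NY"),
    ("212-555-0199", "NY"),
    ("415-555-0123", "CA"),
    ("415-555-0177", "NY"),
]

violations = prefix_violations(rows)
print(violations)
```

In a data cleaning setting, the rows behind each reported prefix would be candidates for manual review, since the sketch cannot tell which of the conflicting values is wrong.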