Result: Data Exploration with SQL using Machine Learning Techniques

Title:
Data Exploration with SQL using Machine Learning Techniques
Contributors:
Orange Labs Meylan, Orange Labs, Base de Données (BD), Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Université Lumière - Lyon 2 (UL2)-École Centrale de Lyon (ECL), Université de Lyon-Université de Lyon-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université Lumière - Lyon 2 (UL2)-École Centrale de Lyon (ECL), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA), Universitatea Babeş-Bolyai Cluj-Napoca
Source:
Proc. 20th International Conference on Extending Database Technology (EDBT) ; International Conference on Extending Database Technology - EDBT ; https://hal.science/hal-01455715 ; International Conference on Extending Database Technology - EDBT, Mar 2017, Venice, Italy. pp. 96-107 ; http://edbticdt2017.unive.it/
Publisher Information:
CCSD
Publication Year:
2017
Collection:
Portail HAL de l'Université Lumière Lyon 2
Subject Geographic:
Time:
Venice, Italy
Document Type:
Conference object
Language:
English
Rights:
info:eu-repo/semantics/OpenAccess
Accession Number:
edsbas.B647EC84
Database:
BASE

Further Information

International audience ; Data scientists now have access to very large datasets, many of which are accessible through SQL. Despite the inherent simplicity of SQL, writing relevant and efficient SQL queries is known to be difficult, especially for databases with a large number of attributes or meaningless attribute names. In this paper, we propose a "rewriting" technique that helps data scientists formulate SQL queries and explore their big data rapidly and intuitively, while keeping user input to a minimum, with no manual tuple specification or labeling. For a user-specified query, we define a negation query, which produces tuples that are not wanted in the initial query's answer. Since there is an exponential number of such negation queries, we describe a pseudo-polynomial heuristic to pick the negation closest in size to the initial query, and construct a balanced learning set whose positive examples correspond to the results desired by analysts and whose negative examples correspond to those they do not want. The initial query is then reformulated using machine learning techniques, yielding a new query that is more efficient and more diverse. We have implemented a prototype and conducted experiments on real-life datasets and synthetic query workloads to assess the scalability and precision of our proposal. A preliminary qualitative experiment conducted with astrophysicists is also described.
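To make the negation step in the abstract concrete, here is a minimal Python sketch under several assumptions that the abstract does not confirm. It assumes the initial query is a conjunction of predicates and that a negation query is obtained by negating a non-empty subset of them. It enumerates all such negations by brute force, which is exponential; this is not the pseudo-polynomial heuristic the paper describes. The table `ROWS`, the predicates, and the helper names `run`, `closest_negation`, and `learning_set` are hypothetical. The reformulation step with a learned classifier is left out.

```python
from itertools import product

# Hypothetical toy table; column names and values are illustrative only.
ROWS = [
    {"id": 1, "mag": 12.0, "period": 3.1},
    {"id": 2, "mag": 13.0, "period": 2.5},
    {"id": 3, "mag": 13.5, "period": 1.0},
    {"id": 4, "mag": 15.0, "period": 4.0},
    {"id": 5, "mag": 16.0, "period": 5.0},
    {"id": 6, "mag": 17.0, "period": 0.5},
]

# Assumed representation of the initial query: a conjunction of predicates,
# here standing for "WHERE mag < 14 AND period > 2".
PREDICATES = [
    ("mag < 14", lambda r: r["mag"] < 14),
    ("period > 2", lambda r: r["period"] > 2),
]


def run(rows, preds, negated):
    """Return rows satisfying every predicate, flipping those flagged in `negated`."""
    return [
        r for r in rows
        if all(test(r) != neg for (_, test), neg in zip(preds, negated))
    ]


def closest_negation(rows, preds):
    """Brute-force search over all non-empty sets of negated predicates.

    Picks the negation whose answer size is closest to the initial answer size.
    Exponential in len(preds); a stand-in for the paper's heuristic, not a copy of it.
    """
    positives = run(rows, preds, [False] * len(preds))
    best = None
    for flags in product([False, True], repeat=len(preds)):
        if not any(flags):
            continue  # the all-False case is the initial query itself
        negatives = run(rows, preds, flags)
        gap = abs(len(negatives) - len(positives))
        if best is None or gap < best[0]:
            best = (gap, flags, negatives)
    _, flags, negatives = best
    return positives, flags, negatives


def learning_set(positives, negatives):
    """Label initial-query tuples 1 and negation-query tuples 0."""
    return [(r, 1) for r in positives] + [(r, 0) for r in negatives]
```

Calling `closest_negation(ROWS, PREDICATES)` returns the positive tuples, the chosen negation flags, and the negative tuples; passing the two tuple lists to `learning_set` gives labeled examples that a classifier could be trained on to derive the reformulated query.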