Discover our team's recent publications.
Key Publications
Publications in Int’l Journals
Mining Association Rules from COVID-19 Related Twitter Data to Discover Word Patterns, Topics and Inferences
This work utilizes data from Twitter to mine association rules and extract knowledge about public attitudes regarding worldwide crises. It exploits the COVID-19 pandemic as a use case, and analyses tweets gathered between February and August 2020. The proposed methodology comprises topic extraction and visualization techniques, such as wordclouds, to form clusters or themes of opinions. It then uses Association Rule Mining (ARM) to discover frequent wordsets and generate rules that infer user attitudes. The goal is to utilize ARM as a postprocessing technique to enhance the output of any topic extraction method. Therefore, only strong wordsets are stored, after discarding trivial ones. We also employ frequent wordset identification to reduce the number of extracted topics. Our findings showcase that 50 initially retrieved topics are narrowed down to just 4 when combining Latent Dirichlet Allocation with ARM. Our methodology facilitates producing more accurate and generalizable results, whilst exposing implications regarding social media user attitudes.
Citation: P. Koukaras, C. Tjortjis and D. Rousidis, “Mining Association Rules from COVID-19 Related Twitter Data to Discover Word Patterns, Topics and Inferences”, Information Systems, p. 102054, Elsevier, 2022, Scimago Q2
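The ARM post-processing step described above can be illustrated with a short, hypothetical sketch; this is not the authors' code, and the mlxtend library, toy tokenised tweets and thresholds below are assumptions made purely for illustration.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy tokenised tweets standing in for preprocessed COVID-19 tweets.
tweets = [
    ["lockdown", "vaccine", "covid"],
    ["lockdown", "masks", "covid"],
    ["vaccine", "covid", "immunity"],
    ["lockdown", "covid", "masks"],
]

# One-hot encode the token "transactions".
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(tweets), columns=te.columns_)

# Keep only strong wordsets (high support), discarding trivial ones,
# then derive rules hinting at user attitudes.
wordsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(wordsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```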
An Interdisciplinary Approach on Efficient Virtual Microgrid to Virtual Microgrid Energy Balancing Incorporating Data Preprocessing Techniques
A way to improve energy management is to perform balancing both at the Peer-to-peer (P2P) level and then at the Virtual Microgrid-to-Virtual Microgrid (VMG2VMG) level, while considering the intermittency of available Renewable Energy Sources (RES). This paper proposes an interdisciplinary analytics-based approach for the formation of VMGs addressing energy balancing. Our approach incorporates Computer Science methods to address an Energy sector problem, utilizing data preprocessing techniques and Machine Learning concepts. It features P2P balancing, where each peer is a prosumer perceived as an individual entity, and Virtual Microgrids (VMGs) as clusters of peers. We conducted several simulations utilizing clustering and binning algorithms for preprocessing energy data. Our approach offers options for generating VMGs of prosumers, prior to using a customized Exhaustive brute-force Balancing Algorithm (EBA). EBA performs balancing at the cluster-to-cluster level, perceived as VMG2VMG balancing. To that end, the study simulates on data from 94 prosumers, and reports outcomes, biases, and prospects for scaling up and expanding this work. Finally, this paper outlines potential ideal usages for the approach, either standalone or integrated with other toolkits and technologies.
Citation: P. Koukaras, C. Tjortjis, P. Gkaidatzis, N. Bezas, D. Ioannidis, and D. Tzovaras, “An Interdisciplinary Approach on Efficient Virtual Microgrid to Virtual Microgrid Energy Balancing Incorporating Data Preprocessing Techniques”, Computing, Vol. 104, No. 1, pp. 209-250, Springer, 2022, Scimago Q2.
Using Classification for Traffic Prediction in Smart Cities, Featuring the Impact of COVID-19
This paper presents a novel methodology using classification for day-ahead traffic prediction. It addresses the research question of whether traffic state can be forecast based on meteorological conditions, seasonality, and time intervals, as well as COVID-19 related restrictions. We propose reliable models utilizing smaller data partitions. Apart from feature selection, we incorporate new features related to movement restrictions due to COVID-19, forming a novel data model. Our methodology explores which training subset is most suitable. Results showed that various models can be developed, with varying levels of success. The best outcome was achieved when factoring in all relevant features and training on a proposed subset. Accuracy improved significantly compared to previously published work.
Citation: S. Liapis, K. Christantonis, V. Chazan-Pantzalis, A. Manos, D.E. Filippidou, and C. Tjortjis, “Using Classification for Traffic Prediction in Smart Cities, Featuring the Impact of COVID-19”, Integrated Computer-Aided Engineering (ICAE), Vol. 28, pp. 417-435, IOS Press, 2021, Scimago Q1.
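To make the classification set-up concrete, here is a minimal scikit-learn sketch; the features (weather, seasonality, time slot, a lockdown flag) echo the paper's data model, but the toy values and the choice of Random Forest are assumptions, not the published models.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical day-ahead records with a binary COVID-restriction flag.
df = pd.DataFrame({
    "temperature":  [12.0, 25.5, 8.0, 30.1, 15.2, 22.3, 5.5, 28.0],
    "rainfall_mm":  [4.0, 0.0, 12.5, 0.0, 1.2, 0.0, 8.3, 0.0],
    "month":        [1, 6, 11, 7, 3, 5, 12, 8],
    "hour_slot":    [8, 17, 8, 13, 17, 8, 17, 13],
    "lockdown":     [0, 0, 1, 0, 1, 0, 1, 0],
    "traffic_high": [1, 1, 0, 1, 0, 1, 0, 1],  # target: congested or not
})
X, y = df.drop(columns="traffic_high"), df["traffic_high"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```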
A Methodology for Stock Movement Prediction Using Sentiment Analysis on Twitter and StockTwits Data
Application of Machine Learning (ML) and sentiment analysis on data from microblogging services has become a common approach for stock market prediction. In this paper, we propose a methodology using sentiment analysis on Twitter and StockTwits data for stock movement prediction. The methodology was evaluated by analyzing stock movement and sentiment data. We present a case study focusing on Microsoft stock. We collected tweets from Twitter and StockTwits, along with financial data extracted from Yahoo Finance. Sentiment analysis was applied to tweets, and two ML models, namely SVM and Logistic Regression, were implemented. Best results were achieved when using tweets from Twitter with VADER and SVM: the top F-score was 76.3% and the top Area Under Curve (AUC) was 67%. SVM also achieved the greatest accuracy, 65.8%, when using StockTwits with TextBlob on this imbalanced data set.
Citation: C. Nousi and C. Tjortjis, “A Methodology for Stock Movement Prediction Using Sentiment Analysis on Twitter and StockTwits Data,” 2021 6th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM), 2021, pp. 1-7, doi: 10.1109/SEEDA-CECNSM53056.2021.9566242.
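A minimal sketch of the VADER-plus-SVM combination reported above; the tweets, labels and single-feature design are toy assumptions, not the study's dataset or full feature set (which also drew on Yahoo Finance data).

```python
import numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.svm import SVC

analyzer = SentimentIntensityAnalyzer()
tweets = [
    "MSFT earnings beat expectations, great quarter!",
    "Microsoft guidance looks weak, selling my shares.",
    "Love the new Azure numbers, very bullish on MSFT.",
    "Disappointing cloud growth, MSFT is overvalued.",
]
moves = np.array([1, 0, 1, 0])  # toy labels: 1 = price rose next day

# Compound sentiment score as the sole feature (illustration only).
X = np.array([[analyzer.polarity_scores(t)["compound"]] for t in tweets])

svm = SVC(kernel="rbf").fit(X, moves)
new = [[analyzer.polarity_scores("MSFT to the moon!")["compound"]]]
print("predicted movement:", svm.predict(new))
```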
An Approach Utilizing Linguistic Features for Fake News Detection
Easy propagation of, and access to, information on the web has the potential to become a serious issue when it comes to disinformation. The term “fake news” describes the intentional propagation of news intended to mislead and harm the public, and has gained more attention recently. This paper proposes a style-based Machine Learning (ML) approach, which relies on the textual information from news, such as manually extracted lexical features, e.g. part-of-speech counts, and evaluates the performance of several ML algorithms. We identified a subset of the best performing linguistic features, using information-based metrics, which tend to agree with the literature. We also combined Named Entity Recognition (NER) functionality with the Frequent Pattern (FP) Growth association rule algorithm to gain a deeper perspective of the named entities used in the two classes. Both methods reinforce the claim that fake and real news have limited differences in content, setting limitations to style-based methods. Results showed that convolutional neural networks achieved the best accuracy, outperforming the other algorithms.
Citation: Kasseropoulos D.P., Tjortjis C. (2021) An Approach Utilizing Linguistic Features for Fake News Detection. In: Maglogiannis I., Macintyre J., Iliadis L. (eds) Artificial Intelligence Applications and Innovations. AIAI 2021. IFIP Advances in Information and Communication Technology, vol 627. Springer, Cham. https://doi.org/10.1007/978-3-030-79150-6_51
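One family of style-based features named above, part-of-speech counts, can be extracted along these lines with NLTK; the snippet is an illustrative assumption, not the authors' extraction code (NLTK resource names may vary across versions).

```python
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_counts(text: str) -> Counter:
    """Count part-of-speech tags: one bundle of lexical style features."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    return Counter(tags)

print(pos_counts("Scientists reportedly discovered a shocking secret cure."))
```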
Predicting CO2 Emissions for Buildings Using Regression and Classification
This paper presents the development of regression and classification algorithms to predict greenhouse gas emissions caused by the building sector, and to identify key building characteristics which lead to excessive emissions. More specifically, two problems are addressed: the prediction of metric tons of CO2 emitted annually by a building, and building compliance with environmental laws according to its physical characteristics, such as energy, fuel, and water consumption. The experimental results show that energy use intensity and natural gas use are significant factors for decarbonizing the building sector.
Citation: Avramidou A., Tjortjis C. (2021) Predicting CO2 Emissions for Buildings Using Regression and Classification. In: Maglogiannis I., Macintyre J., Iliadis L. (eds) Artificial Intelligence Applications and Innovations. AIAI 2021. IFIP Advances in Information and Communication Technology, vol 627. Springer, Cham. https://doi.org/10.1007/978-3-030-79150-6_43
Paper » Artificial Intelligence Applications and Innovations, June 2021
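A hedged regression sketch of the first problem above (annual tonnes of CO2 from building characteristics); the attributes, values and the gradient boosting model are invented stand-ins for the paper's data and algorithms.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical buildings: [energy use intensity, natural gas use,
# water use, floor area] -> annual metric tons of CO2 (toy values).
X = np.array([
    [120.0, 300.0, 40.0, 1500.0],
    [ 80.0, 100.0, 35.0, 1200.0],
    [200.0, 550.0, 60.0, 3000.0],
    [ 60.0,  50.0, 20.0,  800.0],
    [150.0, 420.0, 55.0, 2500.0],
])
y = np.array([210.0, 95.0, 430.0, 55.0, 330.0])

model = GradientBoostingRegressor(random_state=0).fit(X, y)
print("predicted tCO2:", model.predict([[100.0, 250.0, 30.0, 1400.0]]))
```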
Predicting Covid-19 ICU needs using Deep Learning, XGBoost and Random Forest Regression with the Sliding Window technique
The effects of COVID-19 have caused severe strains to healthcare systems globally. Healthcare infrastructures are tested to their limits in almost every country and city, smart or not. This article utilizes deep and machine learning forecasting algorithms, such as Artificial Neural Networks (ANN), XGBoost and Random Forest. Using the sliding window technique, we predict the expected number of Intensive Care Unit (ICU) beds required for short (one week), mid (two weeks) and longer term (three weeks) time frames. We consider daily confirmed COVID-19 cases, current ICU, regular and special bed occupation, hospitalized cases, recovered and intubated patients and deaths. Results show that the models demonstrate a very high coefficient of determination (R2) in the training phase, whilst providing accurate predictions in the forecasting phase. We report the weighted average output of ANN, XGBoost and Random Forest, which resulted in very low Mean Absolute Percentage Error (MAPE). The accurate and timely prediction of ICU beds can support decision making for healthcare systems, optimizing deployment of resources as needed. Our approach can be enhanced by incorporating non-clinical parameters, based on smart city infrastructures, such as data from smart sensors.
Citation: A. Mystakidis, N. Stasinos, A. Kousis, V. Sarlis, P. Koukaras, D. Rousidis, I. Kotsiopoulos, C. Tjortjis, ‘Predicting Covid-19 ICU needs using Deep Learning, XGBoost and Random Forest Regression with the Sliding Window technique’, IEEE Smart Cities, July 2021
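The sliding-window reframing and the weighted-average ensemble can be sketched as follows; the series, window width, weights, and the two scikit-learn regressors (stand-ins for the paper's ANN/XGBoost/Random Forest trio) are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

def sliding_window(series, width, horizon):
    """Reframe a series as supervised learning: past `width` days
    predict the value `horizon` days ahead."""
    X, y = [], []
    for i in range(len(series) - width - horizon + 1):
        X.append(series[i:i + width])
        y.append(series[i + width + horizon - 1])
    return np.array(X), np.array(y)

# Toy daily ICU bed occupancy; the paper also feeds cases, deaths, etc.
icu = np.array([40, 42, 45, 50, 58, 63, 70, 76, 80, 85,
                91, 98, 104, 110, 115, 118, 124, 130, 133, 140, 146])
X, y = sliding_window(icu, width=5, horizon=7)  # one-week-ahead target

rf = RandomForestRegressor(random_state=0).fit(X[:-1], y[:-1])
gb = GradientBoostingRegressor(random_state=0).fit(X[:-1], y[:-1])

# Weighted average of model outputs (weights here are arbitrary).
pred = 0.6 * rf.predict(X[-1:]) + 0.4 * gb.predict(X[-1:])
print("ICU beds expected in one week:", pred[0])
```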
Data Mining Algorithms for Smart Cities: A Bibliometric Analysis
Smart cities connect people and places using innovative technologies such as Data Mining (DM), Machine Learning (ML), big data, and the Internet of Things (IoT). This paper presents a bibliometric analysis to provide a comprehensive overview of studies associated with DM technologies used in smart cities applications. The study aims to identify the main DM techniques used in the context of smart cities and how the research field of DM for smart cities evolves over time. We adopted both qualitative and quantitative methods to explore the topic. We used the Scopus database to find relevant articles published in scientific journals. This study covers 197 articles published over the period from 2013 to 2021. For the bibliometric analysis, we used the Bibliometrix library, developed in R. Our findings show that there is a wide range of DM technologies used in every layer of a smart city project. Several ML algorithms, supervised or unsupervised, are adopted for operating the instrumentation, middleware, and application layer. The bibliometric analysis shows that DM for smart cities is a fast-growing scientific field. Scientists from all over the world show a great interest in researching and collaborating on this interdisciplinary scientific field.
Citation: A. Kousis and C. Tjortjis, ‘Data Mining Algorithms for Smart Cities: A Bibliometric Analysis’, Algorithms, Vol. 14, No. 8, 242, MDPI, 2021.
A Tri-layer Optimization Framework for One-day Ahead Energy Scheduling based on Cost and Discomfort Minimization
Over the past few decades, industry and academia have made great strides to improve aspects related to optimal energy management. These include better ways for efficient energy asset management, generating great opportunities for optimization of energy distribution, discomfort minimization, energy production, cost reduction and more. This paper proposes a framework for a multi-objective analysis, acting as a novel tool that offers responses for optimal energy management through a decision support system. The novelty lies in the structure of the methodology, since it considers two distinct optimization problems for two actors, consumers and aggregators, with the solution of each able to interact, completely or partly, with the other in the form of a demand response signal exchange. The overall optimization is formulated as a bi-objective optimization problem for the consumer side, aiming at cost minimization and discomfort reduction, and a single-objective optimization problem for the aggregator side, aiming at cost minimization. The framework consists of three architectural layers, namely the consumer, aggregator and decision support system (DSS), forming a tri-layer optimization framework with multiple interacting objects, such as objective functions, variables, constants and constraints. The DSS layer is responsible for decision support by forecasting the day-ahead energy management requirements. The main purpose of this study is to achieve optimal management of energy resources, considering both aggregator and consumer preferences and goals, whilst abiding by real-world system constraints. This is conducted through detailed simulations using real data from a pilot that is part of the Terni Distribution System portfolio.
Citation: P. Koukaras, P. Gkaidatzis, N. Bezas, T. Bragatto, M. Antal, F. Carere, D. Ioannidis, C. Tjortjis and D. Tzovaras, “A Tri-layer Optimization Framework for One-day Ahead Energy Scheduling based on Cost and Discomfort Minimization”, Energies, Vol. 14, No. 12, 3599, MDPI, 2021
A Data Science Approach Analysing the Impact of Injuries on Basketball Players and Team Performance
The sports industry utilizes science to improve short to long-term team and player management regarding budget, health, tactics, training, and most importantly performance. Data Science (DS) and Sports Analytics play key roles in supporting teams, players and experts to improve performance. This paper reviews the literature to identify important attributes correlated with injuries and attempts to quantify their impact on player and team performance, using analytics in the National Basketball Association (NBA) from 2010 up to 2020. It also provides an overview of Machine Learning (ML) and DS techniques and algorithms used to study injuries. Additionally, it provides information for coaches, sports and health scientists, managers and decision makers to recognize the most common injuries and investigate possible injury patterns during competitions. We identify teams and players who suffered the most, and the type of injuries requiring more attention. We found a high impact from injuries and pathologies on performance; musculoskeletal impairments are the most common ones that lead to decreased performance. Finally, we conclude that there is a weak positive relationship between performance and injuries based on a holistic multivariate model that describes player and team performance.
Citation: V. Sarlis, V. Chatziilias, C. Tjortjis, D. Mandalidis, “A Data Science Approach Analysing the Impact of Injuries on Basketball Players and Team Performance”, Information Systems, (Elsevier), 2021
Sports Analytics – Evaluation of Basketball Players and Team Performance
Given the recent trend in Data Science (DS) and Sports Analytics, an opportunity has arisen for utilizing Machine Learning (ML) and Data Mining (DM) techniques in sports. This paper reviews background and advanced basketball metrics used in National Basketball Association (NBA) and Euroleague games. The purpose of this paper is to benchmark existing performance analytics used in the literature for evaluating teams and players. Basketball is a sport that requires full set enumeration of parameters in order to understand the game in depth and analyze the strategy and decisions by minimizing unpredictability. This research provides valuable information for team and player performance basketball analytics to be used for better understanding of the game. Furthermore, these analytics can be used for team composition, athlete career improvement and assessing how this could be materialized for future predictions. Hence, critical analysis of these metrics are valuable tools for domain experts and decision makers to understand the strengths and weaknesses in the game, to better evaluate opponent teams, to see how to optimize performance indicators, to use them for team and player forecasting and finally to make better choices for team composition.
Citation: V. Sarlis and C. Tjortjis, ‘Sports Analytics – Evaluation of Basketball Players and Team Performance’, Information Systems, Elsevier, 2020
Smart Cities Data Classification for Electricity Consumption & Traffic Prediction
Smart cities continuously develop into highly sophisticated bionetworks, providing both smart services and ground-breaking solutions. These bionetworks treat Smart Cities as a mechanism that produces data from multiple sharing engines, creating new challenges towards the implementation of effective Smart Cities and innovative services. The purpose of this paper is to relate Data Mining techniques and Smart City projects through a systematic literature review that distinguishes the main topics and methods applied. The survey emphasizes various components of Smart Cities, such as data harvesting and data mining activities over collected city data. It also addresses two research questions: a) can we forecast electricity consumption and traffic load based on past data, as well as meteorological conditions? b) which attributes are more suitable for prediction and/or decision support upon energy consumption issues? Results have shown that, for both cases, various models can be built based on weather data collected.
Citation: K. Christantonis, C. Tjortjis, A. Manos, D. Filippidou and E. Christelis, ‘Smart Cities Data Classification for Electricity Consumption & Traffic Prediction’, Automatics & Software Enginery, 2020
Paper » Automatics & Software Enginery, 2020
Social Media Types: introducing a data driven taxonomy
Social Media (SM) have been established as multifunctional networking tools that tend to offer an increasingly wider variety of services, making it difficult to determine their core purpose and mission, and therefore their type. This paper assesses this evolution of Social Media Types (SMTs), and presents and evaluates a novel hypothesis-based data driven methodology for analyzing Social Media Platforms (SMPs) and categorizing SMTs. We review and update literature regarding the categorization of SMPs, based on their services. We develop a methodology to propose and evaluate a new taxonomy, comprising: (i) the hypothesis that the number of SMTs is smaller than what current literature suggests, (ii) observations on data regarding SM usage and (iii) experimentation using association rules and clustering algorithms. As a result, we propose three (3) SMTs, namely Social, Entertainment and Profiling networks, typically capturing emerging SMP services. Our results show that our hypothesis is validated by implementing our methodology and we discuss threats to validity.
Citation: Koukaras P., Tjortjis C., Rousidis D., ‘Social Media Types: Introducing a Data Driven Taxonomy’, Computing, Springer, 2020
Paper » Computing, Vol. 102, No. 1, pp. 295-340, Springer, 2020
AI in Greece: The Case of Research on Linked Geospatial Data
Artificial Intelligence has been an active research field in Greece for over forty years, and there are more than thirty AI groups throughout the country covering almost all subareas of AI. One milestone for AI research in Greece was in 1988, when the Hellenic Artificial Intelligence Society (EETN) was founded as a non-profit scientific organization devoted to organizing and promoting AI research in Greece and abroad. This article explores current lines of AI research in Greece and gives some history of Greek AI research since 1968.
Citation: Koubarakis M., Vouros G., Chalkiadakis G., Plagianakos V., Tjortjis C., Kavallieratou E., Vrakas D., Mavridis N., Petasis G., Blekas K., Krithara A., ‘AI in Greece: The Case of Research on Linked Geospatial Data’, AI Magazine, AAAI, 2018
Paper » AI Magazine, Vol. 39, No. 2, pp. 91-96, AAAI, 2018
Evaluating data mining algorithms using molecular dynamics trajectories
Molecular dynamics simulations provide a sample of a molecule’s conformational space. Experiments on the μs time scale, resulting in large amounts of data, are nowadays routine. Data mining techniques such as classification provide a way to analyse such data. In this work, we evaluate and compare several classification algorithms using three data sets which resulted from computer simulations, of a potential enzyme mimetic biomolecule. We evaluated 65 classifiers available in the well-known data mining toolkit Weka, using ‘classification’ errors to assess algorithmic performance. Results suggest that: (i) ‘meta’ classifiers perform better than the other groups, when applied to molecular dynamics data sets; (ii) Random Forest and Rotation Forest are the best classifiers for all three data sets; and (iii) classification via clustering yields the highest classification error. Our findings are consistent with bibliographic evidence, suggesting a ‘roadmap’ for dealing with such data.
Citation: Tatsis V.A., Tjortjis C., Tzirakis P., ‘Evaluating data mining algorithms using molecular dynamics trajectories’ , Int’l Journal of Data Mining and Bioinformatics, Inderscience, 2013
Paper » Int’l Journal of Data Mining and Bioinformatics, Vol. 8, No. 2 , pp. 169-187, Inderscience, 2013
CODE QUALITY EVALUATION METHODOLOGY USING THE ISO/IEC 9126 STANDARD
This work proposes a methodology for source code quality and static behaviour evaluation of a software system, based on the standard ISO/IEC-9126. It uses elements automatically derived from source code enhanced with expert knowledge in the form of quality characteristic rankings, allowing software engineers to assign weights to source code attributes. It is flexible in terms of the set of metrics and source code attributes employed, even in terms of the ISO/IEC-9126 characteristics to be assessed. We applied the methodology to two case studies, involving five open source and one proprietary system. Results demonstrated that the methodology can capture software quality trends and express expert perceptions concerning system quality in a quantitative and systematic manner.
Citation: Kanellopoulos Y., Antonellis P., Antoniou D., Makris C., Theodoridis E., Tjortjis C., Tsirakis N., ‘Code Quality Evaluation Methodology Using the ISO/IEC 9126 Standard’, Int’l Journal of Software Engineering & Applications, AIRCC, 2010
Paper » Int’l Journal of Software Engineering & Applications, Vol. 1, No. 2, pp. 17-36, AIRCC, 2010
Comparing data mining methods with logistic regression in childhood obesity prediction
The epidemiological question of concern here is “can young children at risk of obesity be identified from their early growth records?” Pilot work using logistic regression to predict overweight and obese children demonstrated relatively limited success. Hence we investigate the incorporation of non-linear interactions to help improve accuracy of prediction, by comparing the results of logistic regression with those of six mature data mining techniques.
The contributions of this paper are as follows: a) a comparison of logistic regression with six data mining techniques, specifically for the prediction of overweight and obese children at 3 years using data recorded at birth, 6 weeks, 8 months and 2 years respectively; b) improved accuracy of prediction: at 8 months, accuracy improves very slightly, in this case by using neural networks, whereas at 2 years accuracy improves by over 10%, in this case by using Bayesian methods. It has also been shown that incorporation of non-linear interactions could be important in epidemiological prediction, and that data mining techniques are becoming sufficiently well established to offer the medical research community a valid alternative to logistic regression.
Citation: Zhang S., Tjortjis C., Zeng X., Qiao H., Buchan I., and Keane J., ‘Comparing Data Mining Methods with Logistic Regression in Childhood Obesity Prediction’, Information Systems Frontiers Journal, Springer, 2009
Paper » Information Systems Frontiers Journal, Vol. 11, No. 4, pp. 449-460, Springer, 2009
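A hedged sketch of such a comparison in scikit-learn, run on synthetic data rather than the study's growth records; the three model choices merely illustrate linear versus non-linear learners.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for early growth measurements.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=0)

for name, clf in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("neural network", MLPClassifier(max_iter=2000, random_state=0)),
    ("random forest", RandomForestClassifier(random_state=0)),
]:
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))
```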
Using T3, an Improved Decision Tree Classifier, for Mining Stroke-related Medical Data
Medical data are a valuable resource from which novel and potentially useful knowledge can be discovered by using data mining. Data mining can assist and support medical decision making and enhance clinical management and investigative research. The objective of this work is to propose a method for building accurate descriptive and predictive models based on classification of past medical data. We also aim to compare this method with other well established data mining methods and identify strengths and weaknesses.
Citation: Tjortjis C., Saraee M., Theodoulidis B., Keane J.A., ‘Using T3, an Improved Decision Tree Classifier, for Mining Stroke-related Medical Data’, Methods of Information in Medicine, Thieme, 2007
Paper » Methods of Information in Medicine, Vol. 46, No. 5, pp. 523-529, Thieme, 2007
An improved methodology on information distillation by mining program source code
This paper presents a methodology for knowledge acquisition from source code. We use data mining to support semi-automated software maintenance and comprehension and provide practical insights into systems specifics, assuming one has limited prior familiarity with these systems. We propose a methodology and an associated model for extracting information from object oriented code by applying clustering and association rules mining. K-means clustering produces system overviews and deductions, which support further employment of an improved version of MMS Apriori that identifies hidden relationships between classes, methods and member data. The methodology is evaluated on an industrial case study, results are discussed and conclusions are drawn.
Citation: Kanellopoulos Y., Makris C. and Tjortjis C., ‘An Improved Methodology on Information Distillation by Mining Program Source Code’, Data & Knowledge Engineering, Elsevier, 2007
Paper » Data & Knowledge Engineering, Vol. 61, No. 2, pp. 359-383, Elsevier, 2007
Mining Association Rules from Code (MARC) to Support Legacy Software Management
This paper presents a methodology for Mining Association Rules from Code (MARC), aiming at capturing program structure, facilitating system understanding and supporting software management. MARC groups program entities (paragraphs or statements) based on similarities, such as variable use, data types and procedure calls. It comprises three stages: code parsing/analysis, association rule mining and rule grouping. Code is parsed to populate a database with records and respective attributes. Association rules are then extracted from this database and subsequently processed to abstract programs into groups containing interrelated entities. Entities are then grouped together if their attributes participate to common rules. This abstraction is performed at the program level or even the paragraph level, in contrast to other approaches, that work at the system level. Groups can then be visualized as collections of interrelated entities. The methodology was evaluated using real life COBOL programs. Results showed that the methodology facilitates program comprehension by using source code only, where domain knowledge and documentation are either unavailable or unreliable.
Citation: Tjortjis C., ‘Mining Association Rules from Code (MARC) to Support Legacy Software Management’ , Software Quality Journal, Springer, 2020
Paper » Software Quality Journal, Springer, 2020
Social media prediction: a literature review
Social Media Prediction (SMP) is an emerging powerful tool attracting the attention of researchers and practitioners alike. Despite its many merits, SMP has also several weaknesses, as it is limited by data issues, like bias and noise, and the lack of confident predictions and generalizable results. The goal of this paper is to survey popular and trending fields of SMP from 2015 onwards and discuss the predictive models used. We elaborate on results found in the literature, while categorizing the forecasting attempts, based on specific values (source of data, algorithm used, outcome of prediction etc.). Finally, we present our findings, conduct statistical analysis on our dataset, and critique the outcome of the attempted predictions reported by the reviewed papers. Our research indicates that results are ambiguous, as not all forecasting models can predict with high accuracy, and prediction seems dependent on the associated field, although some of the documented attempts are promising. More than half (53.1%) of the examined attempts achieved a valid prediction, nearly one fifth (18.8%) did not, while the remaining 28.1% is characterized as plausible or partially validated. By reviewing recent and up-to-date literature and by providing statistics, this paper provides SMP researchers with a guide on methods, algorithms, techniques, prediction success and challenges on three main categories that aid SMP exploration.
Citation: Rousidis D., Koukaras P., Tjortjis C., ‘Social Media Prediction: A Literature Review’, Multimedia Tools and Applications, Springer, 2020
Paper » Multimedia Tools and Applications, Springer, 2020
A survey on association rules mining using heuristics
Association rule mining (ARM) is a commonly encountered data mining method. There are many approaches to mining frequent rules and patterns from a database, and one among them is heuristics. Many heuristic approaches have been proposed but, to the best of our knowledge, there is as yet no comprehensive literature review of such approaches, only a limited attempt. This gap needs to be filled. This paper reviews heuristic approaches to ARM and points out their most significant strengths and weaknesses. We propose eight performance metrics, such as execution time, memory consumption, completeness, and interestingness, compare approaches against these performance metrics, and discuss our findings. For instance, comparison results indicate that SRmining, PMES, Ant‐ARM, and MDS‐H are the fastest heuristic ARM algorithms. HSBO‐TS is the most complete one, while SRmining and ACS require only one database scan. In addition, we propose a parameter, named GT‐Rank, for ranking heuristic ARM approaches, based on which ARMGA, ASC, and Kua emerge as the best approaches. We also consider ARM algorithms and their characteristics as transactions and items in a transactional database, respectively, and generate association rules that indicate research trends in this area.
Citation: Ghafari, S.M., Tjortjis, C., ‘A Survey on Association Rules Mining Using Heuristics’ , WIREs Data Mining and Knowledge Discovery, Wiley, 2019
Paper » WIREs Data Mining and Knowledge Discovery, Vol. 9, No. 4, Wiley, 2019
T3C: improving a decision tree classification algorithm’s interval splits on continuous attributes
This paper proposes, describes and evaluates T3C, a classification algorithm that builds decision trees of depth at most three, and results in high accuracy whilst keeping the size of the tree reasonably small. T3C is an improvement over algorithm T3 in the way it performs splits on continuous attributes. When run against publicly available data sets, T3C achieved lower generalisation error than T3 and the popular C4.5, and competitive results compared to Random Forest and Rotation Forest.
Citation: Tzirakis P. and Tjortjis C., ‘T3C: Improving a Decision Tree Classification Algorithm’s Interval Splits on Continuous Attributes’, Advances in Data Analysis and Classification, Springer, 2017
Paper » Advances in Data Analysis and Classification, Vol. 11, No. 2, pp. 353-370, Springer, 2017
k-ATTRACTORS: A PARTITIONAL CLUSTERING ALGORITHM FOR NUMERIC DATA ANALYSIS
Clustering is a data analysis technique, particularly useful when there are many dimensions and little prior information about the data. Partitional clustering algorithms are efficient but suffer from sensitivity to the initial partition and noise. We propose here k-attractors, a partitional clustering algorithm tailored to numeric data analysis. As a preprocessing (initialization) step, it uses maximal frequent item-set discovery and partitioning to define the number of clusters k and the initial cluster “attractors.” During its main phase the algorithm uses a distance measure, which is adapted with high precision to the way initial attractors are determined. We applied k-attractors as well as k-means, EM, and FarthestFirst clustering algorithms to several datasets and compared results. Comparison favored k-attractors in terms of convergence speed and cluster formation quality in most cases, as it outperforms these three algorithms except for cases of datasets with very small cardinality containing only a few frequent item sets. On the downside, its initialization phase adds an overhead that can be deemed acceptable only when it contributes significantly to the algorithm’s accuracy.
Citation: Kanellopoulos Y., Antonellis P., Tjortjis C., Makris C. and Tsirakis N., ‘k-Attractors: A Partitional Clustering Algorithm for Numeric Data Analysis’, Applied Artificial Intelligence, Taylor & Francis, 2011
Paper » Applied Artificial Intelligence, Vol. 25, No. 2, pp. 97-115, Taylor and Francis, 2011
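k-attractors is not in standard libraries; the sketch below only illustrates its seeded-initialisation idea, with hand-picked seeds standing in for the attractors that the algorithm would derive from maximal frequent itemsets over discretised data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy numeric data drawn around three latent centres.
data = np.vstack([
    rng.normal([0.0, 0.0], 0.3, (40, 2)),
    rng.normal([5.0, 5.0], 0.3, (40, 2)),
    rng.normal([0.0, 5.0], 0.3, (40, 2)),
])

# Stand-in for the frequent-itemset step: it would yield both k and
# the initial "attractors"; here we hard-code plausible seeds.
attractors = np.array([[0.1, 0.1], [4.9, 5.1], [0.0, 4.8]])

km = KMeans(n_clusters=len(attractors), init=attractors, n_init=1,
            random_state=0).fit(data)
print("cluster sizes:", np.bincount(km.labels_))
```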
A GO-driven semantic similarity measure for quantifying the biological relatedness of gene products
Advances in biological experiments, such as DNA microarrays, have produced large multidimensional data sets for examination and retrospective analysis. Scientists however, heavily rely on existing biomedical knowledge in order to fully analyze and comprehend such datasets. Our proposed framework relies on the Gene Ontology for integrating a priori biomedical knowledge into traditional data analysis approaches. We explore the impact of considering each aspect of the Gene Ontology individually for quantifying the biological relatedness between gene products. We discuss two figure of merit scores for quantifying the pair-wise biological relatedness between gene products and the intra-cluster biological coherency of groups of gene products. Finally, we perform cluster deterioration simulation experiments on a well scrutinized Saccharomyces cerevisiae data set consisting of hybridization measurements. The results presented illustrate a strong correlation between the devised cluster coherency figure of merit and the randomization of cluster membership.
Citation: Denaxas S. and Tjortjis C., ‘A GO-driven semantic similarity measure for quantifying the biological relatedness of gene products’, Intelligent Decision Technologies, IOS Press, 2009
Paper » Intelligent Decision Technologies, Vol. 3, No. 4, pp. 239-248, IOS Press, 2009
Clustering for Monitoring Software Systems Maintainability Evolution
This paper presents ongoing work on using data mining clustering to support the evaluation of software systems’ maintainability. As input for our analysis we employ software measurement data extracted from Java source code. We propose a two-step clustering process which facilitates the assessment of a system’s maintainability at first, and subsequently an in-cluster analysis in order to study the evolution of each cluster across the system’s versions. The process is evaluated on Apache Geronimo, a J2EE 1.4 open source Application Server. The evaluation involves analyzing several versions of this software system in order to assess its evolution and maintainability over time. The paper concludes with directions for future work.
Citation: Antonellis P., Antoniou D., Kanellopoulos Y., Makris C., Theodoridis E., Tjortjis C., Tsirakis N., ‘Clustering for Monitoring Software Systems Maintainability Evolution’, Electronic Notes in Theoretical Computer Science, Elsevier, 2009
Paper » Electronic Notes in Theoretical Computer Science, Vol. 233, pp. 43-57, Elsevier, 2009
Scoring and summarising gene product clusters using the Gene Ontology
We propose an approach for quantifying the biological relatedness between gene products, based on their properties, and measure their similarities using exclusively statistical NLP techniques and Gene Ontology (GO) annotations. We also present a novel similarity figure of merit, based on the vector space model, which assesses gene expression analysis results and scores gene product clusters’ biological coherency, making sole use of their annotation terms and textual descriptions. We define query profiles which rapidly detect a gene product cluster’s dominant biological properties. Experimental results validate our approach, and illustrate a strong correlation between our coherency score and gene expression patterns.
Citation: Denaxas S. and Tjortjis C., ‘Scoring and summarising gene product clusters using the Gene Ontology’, Int’l Journal of Data Mining and Bioinformatics, Inderscience, 2008
Paper » Int’l Journal of Data Mining and Bioinformatics, Vol. 2, No. 3, pp. 216-235, Inderscience, 2008
Re-engineering Academic Teams Toward a Network Organizational Structure
This article examines student teamwork in the academic field from a structural perspective. Student teams are often prearranged and then left to organize themselves and get on with their work, without any further structural support; this, however, can make teamwork a negative experience. Uneven contributions among team members often occur and unavoidably lead to friction and reduced performance. The aim of this project is to explore the main problems in academic teamwork and investigate tools that provide relevant solutions. We present the concept of network organizational structure and discuss how this can improve collaboration and communication. The main tools for achieving a structural transformation from the more traditional form of team organization to the fairer network form are discussed, along with their implications.
Citation: Kaldis E., Koukoravas K., and Tjortjis C., ‘Re-engineering Academic Teams toward a Network Organizational Structure’, Decision Sciences Journal of Innovative Education, Wiley, 2007
Paper » Decision Sciences Journal of Innovative Education, Vol. 5, No. 2, pp. 245-266, Wiley, 2007
Mining source code elements for comprehending object-oriented systems and evaluating their maintainability
Data mining and its capacity to deal with large volumes of data and to uncover hidden patterns has been proposed as a means to support industrial scale software maintenance and comprehension. This paper presents a methodology for knowledge acquisition from source code in order to comprehend an object-oriented system and evaluate its maintainability. We employ clustering in order to support semi-automated software maintenance and comprehension. A model and an associated process are provided, in order to extract elements from source code; K-Means clustering is then applied on these data, in order to produce system overviews and deductions. The methodology is evaluated on JBoss, a very large Open Source Application Server; results are discussed and conclusions are presented together with directions for future work.
Citation: Kanellopoulos Y., Dimopoulos T., Tjortjis C. and Makris C., ‘Mining Source Code Elements for Comprehending Object-Oriented Systems and Evaluating Their Maintainability’, ACM SIGKDD Explorations, ACM Press, 2006
Paper » ACM SIGKDD Explorations, Vol. 8, No. 1, pp. 33-40, ACM Press, 2006
Editorial for a Special issue on the 12th conference on software maintenance and reengineering (CSMR 2008)
Software maintenance and reengineering are vital software engineering activities for facilitating the evolution of large software systems. However, software maintenance is not only to be considered for existing systems, but also for new systems, where software models and artefacts evolve as part of iterative and incremental development processes. The Conference on Software Maintenance and Reengineering is the premier European forum to discuss the theory and practice of software maintenance, reengineering, and evolution of software systems. CSMR promotes fruitful discussion and exchange of experiences among researchers and practitioners about the development of maintainable systems, and their evolution, migration and reengineering. This special issue comprises extended versions of three papers, presented at CSMR 2008.
Citation: Kontogiannis K., Tjortjis C., and Winter A., ‘Editorial for a Special issue on the 12th conference on software maintenance and reengineering (CSMR 2008)’, Journal of Software Maintenance and Evolution: Research and Practice, Wiley, 2009
Paper » Journal of Software Maintenance and Evolution: Research and Practice, Vol. 21, No. 2, pp. 79-80, Wiley, 2009
Book Chapters
Mining Data to Deal with Epidemics: Case Studies to Demonstrate Real World AI Applications
Forecasting and Prevention mechanisms using Social Media in Healthcare
Social media (SM) is establishing a new era of tools with multi-usage capabilities. Governments, businesses, organizations, as well as individuals are engaging in, implementing their promotions, sharing opinions and propagating decisions on SM. We need filters, validators and a way of weighting expressed opinions in order to regulate this continuous data stream. This chapter presents trends and attempts by the research community regarding: (a) the influence of SM on attitudes towards a specific domain, related to public health and safety (e.g. diseases, vaccines, mental health), (b) frameworks and tools for monitoring their evolution and (c) techniques for suggesting useful interventions for nudging public sentiment towards best practices. Based on the state of the art, we discuss and assess whether SM can be used as means of prejudice or esteem regarding online opinions on health care. We group the state of the art in the following categories: virus–illness outbreaks, anti-vaccination, mental health, social trends and food and environment. Furthermore, we give more weight to virus–illness outbreaks and the anti-vaccination issues/trends in order to examine disease outbreak prevention methodologies and vaccination/anti-vaccination incentives, whilst discussing their performance. The goal is to consolidate the state of the art and give well-supported directions for future work. To sum up, this chapter discusses the aforementioned concepts and related biases, elaborating on forecasting and prevention attempts using SM data.
Citation: Koukaras P., Rousidis D., Tjortjis C., ‘Forecasting and Prevention mechanisms using Social Media in Healthcare’ , Advanced Computational Intelligence in Healthcare, 2020
Book » Advanced Computational Intelligence in Healthcare, 2020
Social Media Analytics, Types and Methodology
The rapid growth of Social Media Networks (SMN) initiated a new era for data analytics. We use various data mining and machine learning algorithms to analyze different types of data generated within these complex networks, attempting to produce usable knowledge. When engaging in descriptive analytics, we utilize data aggregation and mining techniques to provide an insight into the past or present, describing patterns, trends, incidents etc. and try to answer the question “What is happening or What has happened”. Diagnostic analytics come with a pack of techniques that act as tracking/monitoring tools aiming to understand “Why something is happening or Why it happened”. Predictive analytics come with a variety of forecasting techniques and statistical models, which combined, produce insights for the future, hopefully answering “What could happen”. Prescriptive analytics, utilize simulation and optimization methodologies and techniques to generate a helping/support mechanism, answering the question “What should we do”. In order to perform any type of analysis, we first need to identify the correct sources of information. Then, we need APIs to initialize data extraction. Once data are available, cleaning and preprocessing are performed, which involve dealing with noise, outliers, missing values, duplicate data and aggregation, discretization, feature selection, feature extraction, sampling. The next step involves analysis, depending on the Social Media Analytics (SMA) task, the choice of techniques and methodologies varies (e.g. similarity, clustering, classification, link prediction, ranking, recommendation, information fusion). Finally, it comes to human judgment to meaningfully interpret and draw valuable knowledge from the output of the analysis step. This chapter discusses these concepts elaborating on and categorizing various mining tasks (supervised and unsupervised) while presenting the required process and its steps to analyze data retrieved from the Social Media (SM) ecosystem.
Citation: Koukaras P., Tjortjis C., ‘Social Media Analytics, Types and Methodology’, Machine Learning Paradigms: Applications of Learning and Analytics in Intelligent Systems, Springer, 2019
Book » Machine Learning Paradigms: Applications of Learning and Analytics in Intelligent Systems, Springer, 2019
Refereed publications in Springer-Verlag Lecture Notes and ACM Int’l Conf. Proc. Series
An Introduction to Information Network Modeling Capabilities, Utilizing Graphs
This paper presents research on Information Network (IN) modeling using graph mining. The theoretical background along with a review of relevant literature is showcased, pertaining to the concepts of IN model types, network schemas and graph measures. Ongoing research involves experimentation and evaluation on bipartite and star network schemas, generating test subjects using Social Media, Energy or Healthcare data. Our contribution is showcased by two proof-of-concept simulations we plan to extend.
Citation: D. Rousidis, P. Koukaras and C. Tjortjis, ‘An Introduction to Information Network Modeling Capabilities, Utilizing Graphs’, 14th Int’l Conf. Metadata and Semantics Research (MTSR2020), Communications in Computer & Information Science (CCIS), Springer 2020
Paper » Communications in Computer & Information Science, Springer, 2020
A Semi-supervised Learning Approach for Complex Information Networks
Information Networks (INs) are abstract representations of realworld interactions among different entities. This paper focuses on a special
type of Information Networks, namely Heterogeneous Information
Networks (HINs). First, it presents a concise review of the recent work on
this field. Then, it proposes a novel method for querying such networks,
using a bi-functional machine learning algorithm for clustering and ranking.
It performs and elaborates on supervised and unsupervised, proof-ofconcept modelling experiments on multi-typed, interconnected data, while
retaining their semantic importance. The results show that this method
yields promising results and can be extended and utilized, using larger, realworld datasets.
Citation: P. Koukaras, C. Berberidis, and C. Tjortjis, ‘A Semi-supervised Learning Approach for Complex Information Networks’, 3rd Int’l Conf. Intelligent Data Communication Technologies and Internet of Things (ICICI 2020), 2020
Paper » Intelligent Data Communication Technologies and Internet of Things, 2020
Examination of NoSQL Transition and Data Mining capabilities
An estimated 2.5 quintillion bytes of data are created every day. This data explosion, along with new datatypes, objects, and the wide usage of social media networks, with an estimated 3.8 billion users worldwide, makes the exploitation and manipulation of data by relational databases cumbersome and problematic. NoSQL databases introduce new capabilities aiming at improving the functionalities offered by traditional SQL DBMS. This paper elaborates on ongoing research regarding NoSQL, focusing on the background behind their development, their basic characteristics, their categorization and the noticeable increase in popularity. Functional advantages and data mining capabilities that come with the usage of graph databases are also presented. Common data mining tasks with graphs are presented, facilitating implementation, as well as efficiency. The aim is to highlight concepts necessary for incorporating data mining techniques and graph database functionalities, eventually proposing an analytical framework offering a plethora of domain specific analytics, for example, a virus outbreak analytics framework allowing health and government officials to make appropriate decisions.
Citation: D. Rousidis, P. Koukaras and C. Tjortjis, ‘Examination of NoSQL Transition and Data Mining capabilities’, 14th Int’l Conf. Metadata and Semantics Research (MTSR2020), Communications in Computer & Information Science (CCIS), Springer 2020
Paper » Communications in Computer & Information Science, Springer, 2020
The 50/50 Recommender: A Method Incorporating Personality into Movie Recommender Systems
Recommendation systems offer valuable assistance with selecting products and services. This work tests the hypothesis that taking personality into account can improve recommendation quality. Our main goal is to examine the role of personality in movie recommender systems. We introduce the concept of combining collaborative techniques with a personality test to provide more personalized movie recommendations. Previous research attempted to incorporate personality in recommender systems, but no actual implementation appears to have been achieved. We propose a method and developed the 50/50 recommender system, which combines the Big Five personality test with an existing movie recommender, and used it on a renowned movie dataset. Evaluation results showed that users preferred the 50/50 system 3.6% more than the state of the art method. Our findings show that personalization provides better recommendations, even though some extra user input is required upfront.
Citation: Nalmpantis O. and Tjortjis C., ‘The 50/50 Recommender: a Method Incorporating Personality into Movie Recommender Systems’, Communications in Computer and Information Science (CCIS), Springer-Verlag, 2017
Paper » Communications in Computer and Information Science (CCIS), pp. 498-507, Springer-Verlag, 2017
Clustering Software Metric Values Extracted from C# Code for Maintainability Assessment
This paper proposes an automated approach for supporting software maintenance using software metrics and data mining. We gather metric values from C# source code elements, such as projects, files, namespaces, classes, and interfaces. These elements are clustered together, based on their similarity, with regards to these metrics, in order to identify problematic, complex classes that might be error prone. We applied this approach to two open source software systems in C#. Results show that it supports identification of potentially problematic code parts, which require further examination and proactive maintenance.
Citation: Arshad S., Tjortjis C., ‘Clustering Software Metric Values Extracted from C# Code for Maintainability Assessment’, ACM Int’l Conf. Proc. Series, ACM, 2016
Agent-Based Digital Networking in Furniture Manufacturing Enterprises
International competition and varying customer needs commonly cause small and medium furniture manufacturing enterprises to join dynamically-formed, ‘smart’ enterprise networks, established and operating using digital information technologies. In this paper, we propose a technological approach to support such enterprise networks which is primarily based on the use of software agents. First we outline the reasons motivating networking in furniture manufacturing enterprises and we briefly present core smart enterprise network concepts. Subsequently, we provide an overview of the main technologies currently used to support enterprise networks, and we make the case for utilising service-orientation and adaptive, (semi-) autonomous software components, such as software agents. Furthermore, we propose a four-tier software architectural framework based on software agents and web services, and we briefly describe the requirements, the architecture and main features of the e-Furn software system, which is based on that framework. Finally, we discuss the intelligent recommendation feature of e-Furn.
Citation: Karageorgos A., Avramouli D., Tjortjis C., Ntalos G., ‘Agent-based Digital Networking in Furniture Manufacturing Enterprises’, Communications in Computer and Information Science (CCIS), Springer-Verlag, 2010
Paper » Communications in Computer and Information Science (CCIS), Springer-Verlag, 2010
T3: an Improved Classification Algorithm for Data Mining
This paper describes and evaluates T3, an algorithm that builds trees of depth at most three, and results in high accuracy whilst keeping the size of the tree reasonably small. T3 is an improvement over T2 in that it builds larger trees and adopts a less greedy approach. T3 gave better results than both T2 and C4.5 when run against publicly available data sets: T3 decreased classification error on average by 47% and generalisation error by 29%, compared to T2; and T3 resulted in 46% smaller trees and 32% less classification error compared to C4.5. Due to its way of handling unknown values, T3 outperforms C4.5 in generalisation (99% vs. 66%) on a specific medical dataset.
Citation: Tjortjis C. and Keane J.A., ‘T3: an Improved Classification Algorithm for Data Mining’, Lecture Notes in Computer Science, Springer-Verlag, 2002
Paper » Lecture Notes in Computer Science, Vol. 2412, pp. 50-55, Springer-Verlag, 2002
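T3 itself is not publicly packaged, but the trade-off it studies, capping tree depth to keep models small while preserving accuracy, can be illustrated with any depth-bounded tree learner. The sketch below uses scikit-learn's CART implementation with max_depth=3 as a stand-in; the dataset and split are arbitrary choices, not the paper's benchmarks.

```python
# Depth-bounded vs. unbounded decision trees: a rough analogue of the
# small-tree/high-accuracy trade-off that T3 targets.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unbounded depth

print("depth<=3 accuracy:", shallow.score(X_te, y_te))
print("unbounded accuracy:", deep.score(X_te, y_te))
print("node counts:", shallow.tree_.node_count, "vs", deep.tree_.node_count)
```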
PRICES: An Efficient Algorithm for Mining Association Rules
In this paper, we present PRICES, an efficient algorithm for mining association rules, which first identifies all large itemsets and then generates association rules. Our approach reduces large itemset generation time, known to be the most time-consuming step, by scanning the database only once and using logical operations in the process. Experimental results and comparisons with the state-of-the-art algorithm Apriori show that PRICES is very efficient and, in some cases, up to ten times as fast as Apriori.
Citation: Wang C. and Tjortjis C., ‘PRICES: An Efficient Algorithm for Mining Association Rules’, Lecture Notes in Computer Science, Springer-Verlag, 2004
Paper » Lecture Notes in Computer Science, Vol. 3177, pp. 352-358, Springer-Verlag, 2004
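The single-scan, logical-operations idea can be illustrated in a few lines: encode each item's occurrences as a bit vector during one pass over the database, then compute any itemset's support with bitwise AND instead of rescanning. This is our own minimal sketch, not the published algorithm.

```python
# One scan builds a bit vector per item; itemset support then costs only
# AND operations and a popcount, with no further database passes.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

bits = {}
for i, basket in enumerate(transactions):
    for item in basket:
        bits[item] = bits.get(item, 0) | (1 << i)  # set bit i for this item

def support(itemset):
    """Support of an itemset via AND-ing bit vectors and counting set bits."""
    v = ~0
    for item in itemset:
        v &= bits[item]
    v &= (1 << len(transactions)) - 1  # mask to the number of transactions
    return bin(v).count("1") / len(transactions)

print(support({"bread", "milk"}))  # 0.5: present in 2 of 4 transactions
```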
ARMICA-Improved: A New Approach for Association Rule Mining
With the increasing amount of available data, researchers seek new approaches for extracting useful knowledge. Association Rule Mining (ARM) is one of the main approaches that became popular in this field. It can extract frequent rules and patterns from a database. Many approaches have been proposed for mining frequent patterns; heuristic algorithms are among the most promising, and many ARM algorithms are based on them. In this paper, we improve our previous approach, ARMICA, by considering more parameters, such as the number of database scans, the number of generated rules, and the quality of generated rules. We compare the proposed method with Apriori, ARMICA, and FP-growth, and the experimental results indicate that ARMICA-Improved is faster, produces fewer rules of higher quality, requires fewer database scans, is accurate, and is an automatic approach that does not need predefined minimum support and confidence values.
Citation: Yakhchi S., Ghafari S.M., Tjortjis C., Fazeli M., ‘ARMICA-Improved: A New Approach for Association Rule Mining’, Lecture Notes in Artificial Intelligence, Springer-Verlag, 2017
Paper » Lecture Notes in Artificial Intelligence, Vol. 10412, pp. 296-306, Springer-Verlag, 2017
Combining Clustering and Classification for Software Quality Evaluation
Source code and metric mining have been used to successfully assist with software quality evaluation. This paper presents a data mining approach which incorporates clustering Java classes, as well as classifying extracted clusters, in order to assess internal software quality. We use Java classes as entities and static metrics as attributes for data mining. We identify outliers and apply K-means clustering in order to establish clusters of classes. Outliers indicate potentially fault prone classes, whilst clusters are examined so that we can establish common characteristics. Subsequently, we apply C4.5 to build classification trees for identifying metrics which determine cluster membership. We evaluate the proposed approach with two well-known open source software systems, Jedit and Apache Geronimo. Results have consolidated key findings from previous work and indicated that combining clustering with classification produces better results than standalone clustering.
Citation: Papas D. and Tjortjis C., ‘Combining Clustering and Classification for Software Quality Evaluation’, LNCS 8445, Springer-Verlag, 2014
Paper » LNCS 8445, pp. 273-286, Springer-Verlag, 2014
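A hedged sketch of this two-stage pipeline follows: k-means assigns cluster labels to entities described by metrics, and a decision tree (CART with the entropy criterion here, standing in for C4.5) is then trained on those labels so its splits reveal which metrics determine cluster membership. The metric names and data are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# synthetic metric vectors; columns stand for e.g. WMC, CBO, DIT
X = np.vstack([rng.normal(5, 1, (30, 3)), rng.normal(15, 2, (30, 3))])

# stage 1: unsupervised clustering of classes by their metrics
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# stage 2: a tree trained on the cluster labels explains membership
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, labels)
print(export_text(tree, feature_names=["wmc", "cbo", "dit"]))
```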
Towards Healthy Association Rule Mining (HARM): A Fuzzy Quantitative Approach
Association Rule Mining (ARM) is a popular data mining technique that has been used to determine customer buying patterns. Although improving performance and efficiency of various ARM algorithms is important, determining Healthy Buying Patterns (HBP) from customer transactions and association rules is also important. This paper proposes a framework for mining fuzzy attributes to generate HBP and a method for analysing healthy buying patterns using ARM. Edible attributes are filtered from transactional input data by projections and are then converted to Required Daily Allowance (RDA) numeric values. Depending on a user query, primitive or hierarchical analysis of nutritional information is performed either from normal generated association rules or from a converted transactional database. Query and attribute representation can assume hierarchical or fuzzy values respectively. Our approach uses a general architecture for Healthy Association Rule Mining (HARM) and a prototype support tool that implements the architecture. The paper concludes with experimental results and discussion on evaluating the proposed framework.
Citation: Muyeba M., Khan M., Malik Z. and Tjortjis C., ‘Towards Healthy Association Rule Mining (HARM): A Fuzzy Quantitative Approach’, Lecture Notes in Computer Science, Springer, 2006
Paper » Lecture Notes in Computer Science, Vol. 4224, pp. 1014-1022, Springer, 2006
Experiences of Using a Quantitative Approach for Mining Association Rules
In recent years interest has grown in “mining” large databases to extract novel and interesting information. Knowledge Discovery in Databases (KDD) has been recognised as an emerging research area. Association rules discovery is an important KDD technique for better data understanding. This paper proposes enhancing a quantitative approach for mining association rules with a memory-efficient data structure. The best features of three algorithms (the Quantitative Approach, DHP, and Apriori) were combined to constitute our proposed approach. The obtained results accurately reflected knowledge hidden in the datasets under examination. Scale-up experiments indicated that the proposed algorithm scales linearly as the size of the dataset increases.
Citation: Dong L. and Tjortjis C., ‘Experiences of Using a Quantitative Approach for Mining Association Rules’, Lecture Notes in Computer Science, Springer-Verlag, 2003
Paper » Lecture Notes in Computer Science, Vol. 2690, pp. 693-700, Springer-Verlag, 2003
Publications in refereed Int’l Conferences
Sports Analytics for Football League Table and Player Performance Prediction
Common Machine Learning applications in sports analytics relate to player injury prediction and prevention, potential skill or market value evaluation, as well as team or player performance prediction. This paper focuses on football. Its scope is long-term team and player performance prediction. A reliable prediction of the final league table for certain leagues is presented, using past data and advanced statistics. Other team performance predictions address whether a team is going to have a better season than the last one. Furthermore, we approach the detection and recording of personal skills and statistical categories that separate an excellent from an average central defender. Experimental results range from encouraging to remarkable, especially given that predictions were based on data available at the beginning of the season.
Citation: V. Chazan-Pantzalis, C. Tjortjis, ‘Sports Analytics for Football League Table and Player Performance Prediction’, Proc. 11th IEEE Int’l Conf. on Information, Intelligence, Systems and Applications (IISA 20), 2020
Paper » Proc. 11th IEEE Int’l Conf. on Information, Intelligence, Systems and Applications (IISA 20), 2020
Big Data Mining for Smart Cities: Predicting Traffic Congestion using Classification
This paper provides an analysis and proposes a methodology for predicting traffic congestion. Several machine learning algorithms and approaches are compared to select the most appropriate one. The methodology was implemented using Data Mining and Big Data techniques along with Python, SQL, and GIS technologies, and was tested on data originating from one of the most problematic streets, regarding traffic congestion, in Thessaloniki, the 2nd most populated city in Greece. Evaluation results have shown that data quality and size were the most critical factors for algorithmic accuracy. Result comparison showed that Decision Trees were more accurate than Logistic Regression.
Citation: A. Mystakidis, C. Tjortjis, ‘Big Data Mining for Smart Cities: Predicting Traffic Congestion using Classification’, Proc. 11th IEEE Int’l Conf. on Information, Intelligence, Systems and Applications (IISA 20), 2020
Paper » Proc. 11th IEEE Int’l Conf. on Information, Intelligence, Systems and Applications (IISA 20), 2020
Promoting Diversity in Content Based Recommendation using Feature Weighting and LSH
This work proposes an efficient Content-Based (CB) product recommendation methodology that promotes diversity. A heuristic CB approach incorporating feature weighting and Locality-Sensitive Hashing (LSH) is used, along with the TF-IDF method and functionality of tuning the importance of product features to adjust its logic to the needs of various e-commerce sites. The problem of efficiently producing recommendations, without compromising similarity, is addressed by approximating product similarities via the LSH technique. The methodology is evaluated on two sets with real e-commerce data. The evaluation of the proposed methodology shows that the produced recommendations can help customers to continue browsing a site by providing them with the necessary “next step”. Finally, it is demonstrated that the methodology incorporates recommendation diversity which can be adjusted by tuning the appropriate feature weights.
Citation: D. Beleveslis, C. Tjortjis, ‘Promoting Diversity in Content Based Recommendation using Feature Weighting and LSH’, 16th Int’l Conf. on Artificial Intelligence Applications and Innovations, 2020
Paper » 16th Int’l Conf. on Artificial Intelligence Applications and Innovations, Springer, 2020
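To make the LSH step concrete, the sketch below hashes weighted TF-IDF-like product vectors with random hyperplanes, so items sharing a bucket become recommendation candidates without exhaustive pairwise similarity computation. The vectors, weights, and plane count are illustrative assumptions, not the paper's configuration.

```python
# Random-projection LSH sketch: products sharing a hash bucket are candidate
# recommendations, approximating similarity search in sub-linear time.
import numpy as np

rng = np.random.default_rng(42)
n_products, n_features, n_planes = 1000, 64, 12

tfidf = rng.random((n_products, n_features))       # stand-in for real TF-IDF
weights = np.ones(n_features); weights[:8] = 3.0   # boost e.g. brand/category
X = tfidf * weights

planes = rng.normal(size=(n_planes, n_features))
# Each product hashes to a bucket given by the sign pattern of its projections.
codes = ((X @ planes.T) > 0).astype(int) @ (1 << np.arange(n_planes))

buckets = {}
for idx, code in enumerate(codes):
    buckets.setdefault(int(code), []).append(idx)

query = 0
print("candidates sharing a bucket with product 0:", buckets[int(codes[query])][:10])
```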
Using Twitter to Predict Chart Position for Songs
With the advent of social media, concepts such as forecasting and nowcasting became part of the public debate. Past successes include predicting election results, stock prices and forecasting events or behaviors. This work aims at using Twitter data, related to songs and artists that appeared on the top 10 of the Billboard Hot 100 charts, performing sentiment analysis on the collected tweets, to predict the charts in the future. Our goal was to investigate the relation between the number of mentions of a song and its artist, as well as the semantic orientation of the relevant posts, and its performance on the subsequent chart. The problem was approached via regression analysis, which estimated the difference between the actual and predicted positions, yielding moderate results. We also focused on forecasting chart ranges, namely the top 5, 10 and 20. Given the accuracy and F-score achieved compared to previous research, our findings are deemed satisfactory, especially in predicting the top 20.
Citation: E. Tsiara, C. Tjortjis, ‘Using Twitter to Predict Chart Position for Songs’, 16th Int’l Conf. on Artificial Intelligence Applications and Innovations, 2020
Paper » 16th Int’l Conf. on Artificial Intelligence Applications and Innovations, Springer, 2020
Mining Traffic Accident Data for Hazard Causality Analysis
Over 1.25 million people are killed, and 20-50 million people are seriously injured by traffic accidents every year globally, according to the World Bank. This paper aims to identify patterns in traffic accident data, collected by Cyprus Police between 2007 and 2014. The dataset that was used includes information regarding 3 groups of accident properties: human, vehicle and general environmental or infrastructural information. Data mining techniques were used, and several patterns were identified. Five classifiers were evaluated using a preprocessed dataset, to extract accident patterns. Preliminary results indicate some of the main issues regarding accident causality in Cyprus that could be used for real time accident warnings.
Citation: Tasios D., Tjortjis C., Gregoriades A., ‘Mining Traffic Accident Data for Hazard Causality Analysis’, 4th IEEE SE Europe Design Automation, Computer Engineering, Computer Networks, and Social Media Conf., SEEDA-CECNSM, 2019
Paper » 4th IEEE SE Europe Design Automation, Computer Engineering, Computer Networks, and Social Media Conf., SEEDA-CECNSM, 2019
Sports Analytics algorithms for performance prediction
Sports Analytics is an emerging research area with several applications in a variety of fields. These could be, for example, the prediction of an athlete’s or a team’s performance, the estimation of an athlete’s talent and market value, as well as the prediction of a possible injury. Teams and coaches are increasingly willing to embed such “tools” in their training, in order to improve their tactics. This paper reviews the literature on Sports Analytics and proposes a new approach for prediction. We conducted experiments using suitable algorithms mainly on football related data, in order to predict a player’s position in the field. We also accumulated data from past years, to estimate a player’s goal scoring performance in the next season, as well as the number of a player’s shots during each match, known to be correlated with goal scoring probability. Results are very promising, showcasing high accuracy, particularly as the predicted number of goals was very close to the actual one.
Citation: Apostolou K., Tjortjis C., ‘Sports Analytics algorithms for performance prediction’, IEEE 10th Int’l Conf. on Information, Intelligence, Systems and Applications, pp. 67-73, 2019
Paper » IEEE 10th Int’l Conf. on Information, Intelligence, Systems and Applications, 2019
A Method for Predicting the Winner of the USA Presidential Elections using Data extracted from Twitter
This paper presents work on using data extracted from Twitter to predict the outcome of the latest USA presidential elections on 8th of November 2016 in three key states: Florida, Ohio and N. Carolina, focusing on the two dominant candidates: Donald J. Trump and Hillary Clinton. Our method comprises two steps: pre-processing and analysis. It succeeded in capturing negative and positive sentiment towards these candidates and predicted the winner in these states, who eventually won the presidency, where other similar attempts in the literature had failed. We discuss the strengths and weaknesses of our method, proposing directions for further work.
Citation: L. Oikonomou and C. Tjortjis, ‘A Method for Predicting the Winner of the USA Presidential Elections using Data extracted from Twitter’, 3rd IEEE SE Europe Design Automation, Computer Engineering, Computer Networks, and Social Media Conf. (IEEE SEEDA-CECNSM18), 2018
Paper » 3rd IEEE SE Europe Design Automation, Computer Engineering, Computer Networks, and Social Media Conf (IEEE SEEDA-CECNSM18), 2018
Association Rules Mining by Improving the Imperialism Competitive Algorithm (ARMICA)
Many algorithms have been proposed for Association Rules Mining (ARM), like Apriori. However, such algorithms often have a downside for real world use: they rely on users to set two parameters manually, namely minimum Support and Confidence. In this paper, we propose Association Rules Mining by improving the Imperialism Competitive Algorithm (ARMICA), a novel ARM method, based on the heuristic Imperialism Competitive Algorithm (ICA), for finding frequent itemsets and extracting rules from datasets, whilst setting support automatically. Its structure allows for producing only the strongest and most frequent rules, in contrast to many ARM algorithms, thus alleviating the need to define minimum support and confidence. Experimental results indicate that ARMICA generates accurate rules faster than Apriori.
Citation: Ghafari S.M. and Tjortjis C., ‘Association Rules Mining by Improving the Imperialism Competitive Algorithm (ARMICA)’, IFIP AICT Proc. 12th Int’l Conf. on Artificial Intelligence Applications and Innovations (AIAI 2016), Vol. 475, pp. 242-254, Springer, 2016
Paper » IFIP AICT Proc. 12th Int’l Conf. on Artificial Intelligence Applications and Innovations (AIAI 2016). Vol. 475, pp 242-254, Springer, 2016
Towards Agent-Based 'Smart' Collaboration in Enterprise Networks
International competition and dynamically changing customer demands lead SMEs to join dynamically formed, ‘smart’ enterprise networks aiming to increase their competitiveness and market share. Supporting such networks with decision making related to collaborations and providing adaptive user interfaces are key challenges. In this paper, we use furniture manufacturing SMEs as a case study and provide an overview of our ongoing work on e-Furn, an agent-based system for supporting ‘smart’ collaboration in enterprise networks. We outline two main features of the proposed approach: (a) assisting users in typical collaboration decisions, such as product bundling and task outsourcing, and (b) providing users with dynamically adaptive user interfaces.
Citation: Karageorgos A., Avramouli D., Vasilopoulou K., Tjortjis C., Ntalos G., ‘Towards Agent-Based ‘Smart’ Collaboration in Enterprise Networks’, 8th Int’l Workshop on Agent-based Computing for Enterprise Collaboration (ACEC) at WETICE 2010, pp. 35-40, 2010
Paper » 8th Int’l Workshop on Agent-based Computing for Enterprise Collaboration (ACEC) at WETICE, pp. 35-40, 2010
3rd International Workshop on Software Quality and Maintainability
Software is playing a crucial role in modern societies. Not only do people rely on it for their daily operations or business, but for their lives as well. For this reason, correct and consistent behavior of software systems is a fundamental part of end user expectations. Additionally, businesses require cost-effective production, maintenance, and operation of their systems. Thus, the demand for software quality is increasing, setting it as a differentiator for the success or failure of a software product. In fact, high quality software is becoming not just a competitive advantage but a necessary factor for companies to be successful. The main question that arises now is how quality is measured. What, where, and when to assess and assure quality are still open issues. Many views have been expressed about software quality attributes, including maintainability, evolvability, portability, robustness, reliability, usability, and efficiency. These have been formulated in standards such as ISO/IEC-9126 and CMMI. However, the debate about quality and maintainability between software producers, vendors and users is ongoing, while organizations need the ability to evaluate from multiple angles the software systems that they use or develop. So, is “software quality in the eye of the beholder”? This workshop aims at feeding into this debate by establishing the state of the practice and the way forward.
Citation: Tjortjis C., Visser J. and Ntalos G., ‘3rd International Workshop on Software Quality and Maintainability’, Proc. IEEE 13th European Conf. Software Maintenance and Reengineering (CSMR 2009), IEEE Comp. Soc. Press, pp. 271-272, 2009
Paper » Proc. IEEE 13th European Conf. Software Maintenance and Reengineering (CSMR 2009), IEEE Comp. Soc. Press, pp. 271-272, 2009
Employing Clustering for Assisting Source Code Maintainability Evaluation according to ISO/IEC-9126
This paper elaborates on how to use clustering for the evaluation of a software system’s maintainability according to the ISO/IEC-9126 quality standard. More specifically, it proposes a methodology that combines clustering and multicriteria decision aid techniques for knowledge acquisition, by integrating groups of data from source code with the expertise of a software system’s evaluators. A process for the extraction of elements from source code and Analytical Hierarchical Processing for assigning weights to these data are provided; the k-Attractors clustering algorithm is then applied on these data, in order to produce system overviews and deductions. The methodology is evaluated on Apache Geronimo, a large Open Source Application Server; results are discussed and conclusions are presented together with directions for future work.
Citation: Antonellis P., Antoniou D., Kanellopoulos Y., Makris C., Theodoridis E., Tjortjis C., Tsirakis N., ‘Employing Clustering for Assisting Source Code Maintainability Evaluation according to ISO/IEC-9126’, Artificial Intelligence Techniques in Software Engineering Workshop (AISEW 2008) in ECAI08, 2008
Paper » Artificial Intelligence Techniques in Software Engineering Workshop (AISEW 2008) in ECAI08, 2008
Monitoring the Evolution of Software Systems Maintainability
This paper presents ongoing work on using data mining clustering to support the evaluation of software systems’ maintainability. As input for our analysis we employ software measurement data extracted from Java source code.
Citation: Antonellis P., Antoniou D., Kanellopoulos Y., Makris C., Theodoridis E., Tjortjis C., Tsirakis N., ‘Monitoring the Evolution of Software Systems Maintainability’, Proc. special sessions in IEEE 12th European Conference on Software Maintenance and Reengineering (CSMR 2008), 2008
Paper » Proc. special sessions in IEEE 12th European Conference on Software Maintenance and Reengineering (CSMR 2008), 2008
HybridSet: An Effective Approach to Association Rule Mining
Citation: Tjortjis C. and Wang C., ‘HybridSet: An Effective Approach to Association Rule Mining’, Proc. 22nd European Conf. on Operational Research EURO XXII, 2007
Paper » Proc. 22nd European Conf. on Operational Research EURO XXII, 2007
An Effective Fuzzy Healthy Association Rule Mining Algorithm (FHARM)
This paper presents an effective Healthy Association Rule Mining (HARM) algorithm by introducing new quality measures to generate more interesting rules using itemset nutrient information. Our previous method for analyzing healthy buying patterns from quantitative attributes (or nutrient information) by interval partitions using the classical Apriori algorithm was unable to generate quality rules. This was partly because using basic itemset support is not an appropriate approach, as it only gives the total support of various fuzzy sets per nutrient and not the degree of support. In this paper we propose an effective and efficient new Fuzzy Healthy Association Rule Mining Algorithm (FHARM) that produces more interesting, quality rules. In this approach, edible attributes are filtered from transactional input data by projections and are then converted to Required Daily Allowance (RDA) numeric values. The average RDA database is then converted to a fuzzy database that contains normalized fuzzy attributes comprising different fuzzy sets. Analysis of nutritional information is then performed. The paper presents performance tests and interestingness measures to demonstrate the effectiveness of the approach and proposes further work on evaluating it against other generic fuzzy association rule algorithms.
Citation: Khan M.S., Muyeba M., Tjortjis C. and F. Coenen, ‘An Effective Fuzzy Healthy Association Rule Mining Algorithm (FHARM)’, 7th Annual Workshop on Computational Intelligence (UKCI 2007), 2007
Paper » 7th Annual Workshop on Computational Intelligence (UKCI 2007), 2007
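A minimal sketch of the fuzzification step follows, under assumed set boundaries: a triangular membership function maps an RDA percentage to degrees of membership in fuzzy sets such as "low", "ideal" and "high", rather than to crisp interval partitions.

```python
# Triangular fuzzy membership for RDA percentages; boundaries are assumptions.
def triangular(x, a, b, c):
    """Membership rising from a to a peak at b, falling back to zero at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify_rda(pct):
    return {
        "low":   triangular(pct, -1, 0, 60),
        "ideal": triangular(pct, 40, 100, 160),
        "high":  triangular(pct, 140, 200, 401),
    }

print(fuzzify_rda(50))   # partly "low", partly "ideal": degrees, not intervals
print(fuzzify_rda(100))  # fully "ideal"
```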
A Metric of Confidence in Requirements Gathered from Legacy Systems: Two Industrial Case Studies
It is known that well over 50% of replacement projects fail. Requirements gathering goes some way toward contributing to this statistic; if the requirements we gather for the new system do not match those of the system to be replaced, then the project is bound to fail, at least in part. This paper proposes an empirical metric that assists in measuring the confidence in the requirements extracted from a legacy system. This metric capitalises on five techniques for gathering requirements from legacy systems and caters for a number of different types of project. The metric can be used to estimate the likelihood of a project’s success or failure and is evaluated by two industrial case studies; conclusions are drawn from these and directions for further work are presented.
Citation: Marchant J., Tjortjis C. and Turega M., ‘A Metric of Confidence in Requirements Gathered from Legacy Systems: Two Industrial Case Studies’, IEEE 10th European Conf. Software Maintenance and Reengineering (CSMR 2006), pp. 353-359, 2006
Paper » IEEE 10th European Conf. Software Maintenance and Reengineering (CSMR 2006), pp. 353-359, 2006
Building a multi-level database for efficient information retrieval: A framework definition
With the explosive growth of the Internet and the World Wide Web, the amount of information available online is growing in an exponential manner. As the amount of information online constantly increases, it is becoming increasingly difficult and resource demanding to search and locate information in an efficient manner. Information overload has become a pressing research problem since current searching mechanisms, such as conventional search engines, suffer from both low precision and low recall. It is clear that a more dynamic, scalable and accurate searching methodology needs to be developed to overcome these limitations. This paper proposes a methodology consisting of an amalgamation of several research areas such as Web mining and relational database systems. We develop a proof of concept prototype which consists of an agent used to extract information from individual Web pages and a dynamic multi-level relational schema to encapsulate this information for later processing. The prototype provides users with a higher level of scalability and flexibility and can be utilized for searching the Internet and Intranets across large-scale organizations.
Citation: Denaxas S. and Tjortjis C., ‘Building a multi-level database for efficient information retrieval: A framework definition’, IASTED Int’l Conf. on Software Engineering (SE 2005), pp. 163-170, 2005
Paper » IASTED Int’l Conf. on Software Engineering (SE 2005), pp. 163-170, 2005
A New Fast Algorithm for Mining Association Rules Using Logical Operations
Citation: Wang C. and Tjortjis C., ‘A New Fast Algorithm for Mining Association Rules Using Logical Operations’, Proc. EPSRC/IEEE/IEE PG Research Conf. (PREP 2004), pp. 33-34, 2004
Paper » Proc. EPSRC/IEEE/IEE PG Research Conf. (PREP 2004), pp. 33-34, 2004
From System Comprehension to Program Comprehension
Program and system comprehension are vital parts of the software maintenance process. We discuss the need for both perspectives and describe two methods that may be integrated to provide a smooth transition in understanding from the system level to the program level. Results from a qualitative survey of expert industrial software maintainers, their information needs and requirements when comprehending software are initially presented. We then review existing software tools which facilitate system level and program comprehension. Two successful methods from the fields of data mining and concept assignment are discussed, each addressing some of these requirements. We also describe how these methods can be coupled to produce a broader software comprehension method which partly satisfies all the requirements. Future directions including the closer integration of the techniques are also identified.
Citation: Tjortjis C., Gold N., Layzell P.J. and Bennett K., ‘From System Comprehension to Program Comprehension’, IEEE 26th Int’l Computer Software Applications Conf. (COMPSAC 02), pp. 427-432, 2002
Paper » IEEE 26th Int’l Computer Software Applications Conf. (COMPSAC 02), pp. 427-432, 2002
A Model for Selecting CSCW Technologies for Distributed Software Maintenance Teams in Virtual Organisations
Software maintenance, just like any other software engineering activity, is being conducted in an increasingly distributed manner by teams which are often virtual. This paper critically reviews existing models for Virtual Organisations, investigates issues affecting Distributed Software Maintenance Teams (DSMT) and proposes a model for selecting the appropriate Computer Supported Cooperative Work (CSCW) and Groupware tools and technologies in order to facilitate communication and resource allocation for DSMT. This model builds on current theories, classifications and major concepts in the area of CSCW and advances the way DSMT are perceived. This theoretical model is yet to be empirically evaluated and enriched so that it includes Workflow management systems.
Citation: Tjortjis C., Dafoulas G., Layzell P.J., Macaulay L., ‘A Model for Selecting CSCW Technologies for Distributed Software Maintenance Teams in Virtual Organisations’, IEEE 26th Int’l Computer Software Applications Conf. (COMPSAC 02), pp. 1104-1108, 2002
Paper » IEEE 26th Int’l Computer Software Applications Conf. (COMPSAC 02), pp. 1104-1108, 2002
Using Data Mining to Assess Software Reliability
The paper investigates the applicability of data mining in software reliability assessment and maintenance. The proposed methodology comprises three steps. First the input models are defined by selecting parts of the source code, such as functions, routines and variables, to populate a database. Then Clustering is applied to identify sub-sets of source code that are grouped together according to custom-made similarity metrics. Finally Association rules are used to establish inter-group and intra-group relationships. Experimental results show that the methodology can assess modularity, detect complexity and predict the impact of changes.
Citation: Tjortjis C. and Layzell P.J., ‘Using Data Mining to Assess Software Reliability’, IEEE 12th Int’l Symposium on Software Reliability Engineering (ISSRE2001), pp. 221-223, 2001
Paper » IEEE 12th Int’l Symposium on Software Reliability Engineering (ISSRE2001), pp. 221-223, 2001
An Approach to Text Mining using Information Extraction
In this paper we describe our approach to Text Mining by introducing TextMiner. We perform term and event extraction on each document to find features that are likely to have meaning in the domain, and then apply mining on the extracted features, labelling each document. The system consists of two major components, the Text Analysis component and the Data Mining component. The Text Analysis component converts semi-structured data such as documents into structured data stored in a database. The second component applies data mining techniques on the output of the first component. We apply our approach in the financial domain (financial documents collection) and our main targets are: a) to manage all the available information, for example classify documents in appropriate categories, and b) to “mine” the data in order to “discover” useful knowledge. This work is designed to primarily support two languages, i.e. English and Greek.
Citation: Karanikas H., Tjortjis C., Theodoulidis B., ‘An Approach to Text Mining using Information Extraction’, Proc. PKDD 2000 Workshop on Knowledge Management Theory & Applications (KMTA2000), pp.165-178, 2000
Paper » Proc. PKDD 2000 Workshop on Knowledge Management Theory & Applications (KMTA2000), pp.165-178, 2000
Using Classification for Traffic Prediction in Smart Cities
Smart cities emerge as highly sophisticated bionetworks, providing smart services and ground-breaking solutions. This paper relates classification with Smart City projects, particularly focusing on traffic prediction. A systematic literature review identifies the main topics and methods used, emphasizing various Smart Cities components, such as data harvesting and data mining. It addresses the research question whether we can forecast traffic load based on past data, as well as meteorological conditions. Results have shown that various models can be developed based on weather data with varying levels of success.
Citation: K. Christantonis, C. Tjortjis, A. Manos, D.E. Filippidou, E. Mougiakou and E. Christelis, ‘Using Classification for Traffic Prediction in Smart Cities’, 16th Int’l Conf. on Artificial Intelligence Applications and Innovations, Springer, 2020
Paper » 16th Int’l Conf. on Artificial Intelligence Applications and Innovations, Springer, 2020
A Hybrid Method for Sentiment Analysis of Election Related Tweets
Political sentiment analysis using social media content has recently attracted significant interest. This paper focuses on analyzing tweets in Greek regarding the recent European Elections. A hybrid method that combines Greek lexicons and classification methods is presented. A probabilistic classification model, that predicts the sentiment of tweets, is combined with hashtag-based filtering. Based on the predictions, an analysis is implemented, that shows how the public sentiment was affected by specific events during the pre-election period.
Citation: Beleveslis D., Tjortjis C., Psaradelis D. and Nikoglou D., ‘A Hybrid Method for Sentiment Analysis of Election Related Tweets’, 4th IEEE SE Europe Design Automation, Computer Engineering, Computer Networks, and Social Media Conf., 2019
Paper » 4th IEEE SE Europe Design Automation, Computer Engineering, Computer Networks, and Social Media Conf., SEEDA-CECNSM, 2019
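The hybrid lexicon-plus-classifier idea can be sketched as a simple gate: trust the lexicon when its score is decisive, otherwise fall back to a trained probabilistic classifier. The toy lexicon, training texts, and the Naive Bayes choice below are our assumptions, not the paper's Greek-language resources.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

lexicon = {"great": 1, "good": 1, "bad": -1, "awful": -1}  # toy polarity lexicon

train_texts = ["great result", "good debate", "bad policy", "awful speech"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative
vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

def hybrid_sentiment(text):
    score = sum(lexicon.get(tok, 0) for tok in text.lower().split())
    if score != 0:  # lexicon is decisive: trust it
        return 1 if score > 0 else 0
    # otherwise fall back to the probabilistic classifier
    return int(clf.predict(vec.transform([text]))[0])

print(hybrid_sentiment("great great debate"))  # lexicon path
print(hybrid_sentiment("the debate result"))   # classifier path
```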
Data Mining for Smart Cities: Predicting Electricity Consumption by Classification
Data analysis can be applied to power consumption data for predictions that allow for the efficient scheduling and operation of electricity generation. This work focuses on the parameterization and evaluation of predictive algorithms utilizing metered data on predefined time intervals. More specifically, electricity consumption as a total, but also as a breakdown by main usages/spaces, and weather data are used to develop, train and test predictive models. A technical comparison between different classification algorithms and methodologies is provided. Several weather metrics, such as temperature and humidity, are exploited, along with explanatory past consumption variables. The target variable is binary and expresses the volume of consumption for each individual residence. The analysis is conducted for two different time intervals during a day, and the outcomes showcase the necessity of weather data for predicting residential electricity consumption. The results also indicate that the size of dwellings affects the accuracy of the model.
Citation: Christantonis K., Tjortjis C., ‘Data Mining for Smart Cities: Predicting Electricity Consumption by Classification’, IEEE 10th Int’l Conf. on Information, Intelligence, Systems and Applications, pp. 67-73, 2019
Paper » IEEE 10th Int’l Conf. on Information, Intelligence, Systems and Applications, 2019
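As an illustration of the predictive setup described above (not the paper's models or data), the sketch below classifies whether consumption in a time interval is high from weather and lagged-consumption features; the feature set and the synthetic labelling rule are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 500
temperature = rng.normal(18, 8, n)            # assumed weather metric
humidity = rng.uniform(30, 90, n)             # assumed weather metric
prev_consumption = rng.gamma(2.0, 5.0, n)     # lagged consumption feature

X = np.column_stack([temperature, humidity, prev_consumption])
# synthetic rule: extreme temperatures plus high past use drive high consumption
y = ((np.abs(temperature - 18) + 0.5 * prev_consumption) > 12).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```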
MuSIF: A Product Recommendation System Based on Multi-source Implicit Feedback
Collaborative Filtering (CF) is a well-established method in Recommendation Systems. Recent research focuses on extracting recommendations also based on implicitly gathered information. Implicit Feedback (IF) systems present several new challenges that need to be addressed. This paper reports on MuSIF, a product recommendation system based solely on IF. MuSIF incorporates CF with Matrix Factorization and Association Rule Mining. It implements a hybrid recommendation algorithm in a way that different methods can be used to increase accuracy. In addition, it is equipped with a new method to increase the accuracy of matrix factorization algorithms via initialization of factor vectors, which, as far as we know, is tested for the first time in an implicit model-based CF approach. Moreover, it includes methods for addressing data sparsity, a major issue for many recommendation engines. Evaluation shows that the proposed methodology is promising and can benefit customers and e-shop owners with personalization in real world scenarios.
Citation: I. Schoinas, C. Tjortjis, ‘MuSIF: A Product Recommendation System Based on Multi-source Implicit Feedback’, Proc. 15th Int’l Conf. on Artificial Intelligence Applications and Innovations (AIAI 19), IFIP AICT 559, pp. 660-672, Springer, 2019
Paper » Proc. 15th Int’l Conf. on Artificial Intelligence Applications and Innovations (AIAI 19), IFIP AICT 559, pp. 660-672, Springer, 2019
Short-Term Traffic Prediction under Both Typical and Atypical Traffic Conditions using a Pattern Transition Model
One of the most challenging goals of modern Intelligent Transportation Systems is accurate, real-time short-term traffic prediction. This goal becomes even more critical in the presence of atypical traffic conditions. In this paper, we propose a novel hybrid technique for short-term traffic prediction under both typical and atypical conditions. An Automatic Incident Detection (AID) algorithm, based on Support Vector Machines (SVM), is utilized to check for the presence of an atypical event (e.g. a traffic accident). If such an event occurs, the k-Nearest Neighbors (k-NN) non-parametric regression model is used for traffic prediction. Otherwise, the Autoregressive Integrated Moving Average (ARIMA) parametric model is activated for the same purpose. In order to evaluate the performance of the proposed model, we use open real world traffic data from the Caltrans Performance Measurement System (PeMS). We compare the proposed model with the unitary k-NN and ARIMA models, which represent the most commonly used non-parametric and parametric traffic prediction models. Preliminary results show that the proposed model achieves higher accuracy under both typical and atypical traffic conditions.
Citation: Theodorou T.I., Salamanis A., Kehagias D., Tzovaras D., and Tjortjis C., ‘Short-Term Traffic Prediction under Both Typical and Atypical Traffic Conditions using a Pattern Transition Model’, 3rd Int’l Conf. Vehicle Technology and Intelligent Transport Systems (VEHITS 17), pp. 79-89, 2017
Paper » 3rd Int’l Conf. Vehicle Technology and Intelligent Transport Systems (VEHITS 17), pp. 79-89, 2017
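The control flow of the hybrid scheme can be sketched as follows: an SVM-based incident detector inspects the latest window of speeds and routes prediction to k-NN regression when conditions look atypical, or to ARIMA otherwise. Everything below (window length, toy incident labelling, untuned models, synthetic data) is an assumption for illustration, not the paper's configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsRegressor
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
speeds = 60 + 5 * np.sin(np.arange(200) / 10) + rng.normal(0, 1, 200)
speeds[100:110] -= 15  # inject a simulated incident (sharp speed drop)

# Incident detector: label a window atypical if speeds collapse within it.
windows = np.lib.stride_tricks.sliding_window_view(speeds, 10)
incident = (windows.min(axis=1) < 50).astype(int)  # toy labelling rule
aid = SVC().fit(windows, incident)

def predict_next(history):
    window = history[-10:].reshape(1, -1)
    if aid.predict(window)[0] == 1:  # atypical: non-parametric k-NN regression
        knn = KNeighborsRegressor(n_neighbors=3).fit(windows[:-1], speeds[10:])
        return knn.predict(window)[0]
    # typical: parametric ARIMA forecast
    return ARIMA(history, order=(2, 0, 1)).fit().forecast(1)[0]

print(predict_next(speeds))
```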
Personalised Fuzzy Recommendation for High Involvement Products
In this paper we introduce a content-based recommendation approach for assisting buyers of high involvement products with their purchasing choice. The approach incorporates a group-based, fuzzy multi-criteria method and provides personalized recommendations to end-users of e-Furniture. E-Furniture is an agent-based system that offers decision making and process networking solutions to furniture manufacturing SMEs. The proposed approach has two main characteristics: (i) it handles vagueness in customer preferences and seller evaluations of furniture products by utilizing the 2-tuple fuzzy linguistic information processing model, and (ii) it follows a similarity degree-based aggregation technique to derive an objective assessment for furniture bundles and individual furniture products that can match the customer preferences. A numerical example is given as a proof of concept, to demonstrate the applicability of the approach for providing recommendations to customers.
Citation: Gerogiannis V.C., Karageorgos A., Liu L., and Tjortjis C., ‘Personalised Fuzzy Recommendation for High Involvement Products’, IEEE Int’l Conf. Systems, Man, and Cybernetics (SMC 2013), pp. 4884-4890, 2013
Paper » IEEE Int’l Conf. Systems, Man, and Cybernetics (SMC 2013), pp. 4884-4890, 2013
Using Software Metrics to Evaluate Static Single Assignment Form in GCC
Over the past 20 years, static single assignment form (SSA) has risen to become the compiler intermediate representation of choice. Compiler developers cite many qualitative reasons for choosing SSA. However, in this study, we present clear quantitative benefits of SSA, by applying several standard software metrics to compiler intermediate code in both SSA and non-SSA forms. The average complexity reduction achieved by using SSA in the GCC compiler is between 32% and 60% according to our software metrics, over a set of standard SPEC benchmarks.
Citation: Singer J., Tjortjis C. and Ward M., ‘Using Software Metrics to Evaluate Static Single Assignment Form in GCC’, Proc. 2nd Int’l Workshop on GCC Research Opportunities (GROW’10), pp. 73-88, 2010
Paper » Proc. 2nd Int’l Workshop on GCC Research Opportunities (GROW’10), pp. 73-88, 2010
Code4Thought Project: Employing the ISO/IEC-9126 Standard for Software Engineering-Product Quality Assessment
The aim of the Code4Thought project was to deliver a tool supported methodology that would facilitate the evaluation of a software product’s quality according to the ISO/IEC-9126 software engineering quality standard. It was a joint collaboration between Dynacomp S.A. and the Laboratory for Graphics, Multimedia and GIS of the Department of Computer Engineering and Informatics of the University of Patras. The Code4Thought project focused its research on extending the ISO/IEC-9126 standard by employing additional metrics and developing new methods for facilitating system evaluators to define their own set of evaluation attributes; on developing innovative and platform-free methods for the extraction of elements and metrics from source code data; and on designing and implementing new data mining algorithms tailored for the analysis of software engineering data.
Citation: Antonellis P., Antoniou D., Kanellopoulos Y., Makris C., Theodoridis E., Tjortjis C., Tsirakis N., ‘Code4Thought Project: Employing the ISO/IEC-9126 Standard for Software Engineering-Product Quality Assessment’, 13th European Conf. Software Maintenance and Reengineering (CSMR 09), pp. 297-300, 2009
Paper » 13th European Conf. Software Maintenance and Reengineering (CSMR 09), pp. 297-300, 2009
Interpretation of Source Code Clusters in Terms of the ISO/IEC-9126 Maintainability Characteristics
Clustering is a data mining technique that allows the grouping of data points on the basis of their similarity with respect to multiple dimensions of measurement. It has also been applied in the software engineering domain, in particular to support software quality assessment based on source code metrics. Unfortunately, since clusters emerge from metrics at the source code level, it is difficult to interpret the significance of clusters at the level of the quality of the entire system. In this paper, we propose a method for interpreting source code clusters using the ISO/IEC 9126 software product quality model. Several methods have been proposed to perform quantitative assessment of software systems in terms of the quality characteristics defined by ISO/IEC 9126. These methods perform mappings of low-level source code metrics to high-level quality characteristics by various aggregation and weighting procedures. We applied such a method to obtain quality profiles at various abstraction levels for each generated source code cluster. Subsequently, the plethora of quality profiles obtained is visualized such that conclusions about different quality problems in various clusters can be obtained at a glance.
Citation: Kanellopoulos Y., Heitlager I., Tjortjis C., and Visser J., ‘Interpretation of Source Code Clusters in Terms of the ISO/IEC-9126 Maintainability Characteristics’, 12th European Conf. Software Maintenance and Reengineering (CSMR 08), pp. 63-72, 2008
Paper » 12th European Conf. Software Maintenance and Reengineering (CSMR 08), pp. 63-72, 2008
k-Attractors: A Clustering Algorithm for Software Measurement Data Analysis
Clustering is particularly useful in problems where there is little prior information about the data under analysis. This is usually the case when attempting to evaluate a software system’s maintainability, as many dimensions must be taken into account in order to reach a conclusion. On the other hand, partitional clustering algorithms suffer from being sensitive to noise and to the initial partitioning. In this paper we propose a novel partitional clustering algorithm, k-Attractors. It employs maximal frequent itemset discovery and partitioning in order to define the number of desired clusters and the initial cluster attractors. Then it utilizes a similarity measure which is adapted to the way initial attractors are determined. We apply the k-Attractors algorithm to two custom industrial systems and compare it with WEKA’s implementation of K-Means. We present preliminary results that show our approach is better in terms of clustering accuracy and speed.
Citation: Kanellopoulos Y., Antonellis P., Tjortjis C. and Makris C., ‘k-Attractors: A Clustering Algorithm for Software Measurement Data Analysis’, Proc. 19th IEEE Int’l Conf. on Tools with Artificial Intelligence (ICTAI 07), pp. 358-365, 2007
Paper » Proc. 19th IEEE Int’l Conf. on Tools with Artificial Intelligence (ICTAI 07), pp. 358-365, 2007
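The seeding idea can be caricatured in a few lines (this is not the published algorithm, whose itemset discovery and similarity measure are more involved): discretise the metrics, treat frequent value combinations as stand-ins for maximal frequent itemsets, and use the centroid of each pattern's matching points as an initial attractor for k-means.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# two synthetic groups of software measurement vectors
X = np.vstack([rng.normal(2.5, 0.3, (40, 2)), rng.normal(8.5, 0.3, (40, 2))])

# discretise each metric; frequent bin combinations stand in for frequent itemsets
binned = np.floor(X).astype(int)
patterns = Counter(map(tuple, binned)).most_common(2)

# attractors: centroid of the points matching each frequent pattern
attractors = np.array(
    [X[np.all(binned == p, axis=1)].mean(axis=0) for p, _ in patterns]
)

km = KMeans(n_clusters=2, init=attractors, n_init=1).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```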
A Data Mining Methodology for Evaluating Maintainability according to ISO/IEC-9126 Software Engineering-Product Quality Standard
This paper presents ongoing work on using data mining to evaluate a software system’s maintainability according to the ISO/IEC-9126 quality standard. More specifically it proposes a methodology for knowledge acquisition by integrating data from source code with the expertise of a software system’s evaluators. A process for the extraction of elements from source code and Analytical Hierarchical Processing for assigning weights to these data are provided; K-Means clustering is then applied on these data, in order to produce system overviews and deductions. The methodology is evaluated on Apache Geronimo, a large Open Source Application Server; results are discussed and conclusions are presented together with directions for future work.
Citation: Antonellis P., Antoniou D., Kanellopoulos Y., Makris C., Theodoridis E., Tjortjis C., Tsirakis N., ‘A Data Mining Methodology for Evaluating Maintainability according to ISO/IEC-9126 Software Engineering-Product Quality Standard’, Proc. special sessions in IEEE 11th European Conf. on Software Maintenance and Reengineering (CSMR 2007), pp. 81-89, 2007
Paper » Proc. special sessions in IEEE 11th European Conf. on Software Maintenance and Reengineering (CSMR 2007), pp. 81-89, 2007
Quantifying the biological similarity between gene products using GO: an application of the vector space model
Recent advances in biological experiments, such as DNA microarrays, have produced large multidimensional data sets for examination and analysis. Scientists, however, heavily rely on existing biomedical knowledge in order to fully analyze and comprehend such datasets. The approach we propose combines statistical natural language processing techniques with the GO annotation ontology, for assessing the biological relatedness of gene products clusters. We explore the application of the vector space model as a means of quantifying this relatedness between gene products, based on their underlying biological properties, as indicated by the GO terms associated with them. We report on experimental results on a small subset of saccharomyces gene products. We also propose and validate a biological similarity figure of merit which can assess gene expression cluster analysis results. Finally, we deploy our approach combined with hierarchical clustering in order to illustrate its application to gene expression clustering experiments.
Citation: Denaxas S. and Tjortjis C., ‘Quantifying the biological similarity between gene products using GO: an application of the vector space model’, IEEE Information Technology in Biomedicine (ITAB), 2006
Paper » IEEE Information Technology in Biomedicine (ITAB), 2006
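A minimal sketch of the vector space model applied to GO annotations follows: each gene product becomes a TF-IDF vector over its GO terms, and relatedness is the cosine similarity between those vectors. The gene names and term lists are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# made-up gene products and GO term annotations, purely for illustration
annotations = {
    "YAL001C": "GO:0006351 GO:0003677 GO:0005634",
    "YBR019C": "GO:0006351 GO:0003677 GO:0000978",
    "YCL040W": "GO:0006096 GO:0004340 GO:0005737",
}

# each gene product becomes a TF-IDF vector over its GO terms
vec = TfidfVectorizer(token_pattern=r"GO:\d+", lowercase=False)
X = vec.fit_transform(list(annotations.values()))
sim = cosine_similarity(X)

names = list(annotations)
print(names[0], "vs", names[1], "->", round(sim[0, 1], 3))  # shared terms
print(names[0], "vs", names[2], "->", round(sim[0, 2], 3))  # no shared terms
```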
Clustering data retrieved from Java source code to support software maintenance: A case study
Data mining is a technology recently used in support of software maintenance in various contexts. Our work focuses on achieving a high level understanding of Java systems without prior familiarity with these. Our thesis is that system structure and interrelationships, as well as similarities among program components, can be derived by applying cluster analysis on data extracted from source code. This paper proposes a methodology suitable for Java code analysis. It comprises a Java code analyser which examines programs and constructs tables representing code syntax, and a clustering engine which operates on such tables and identifies relationships among code elements. We evaluate the methodology on a medium sized system, present initial results and discuss directions for further work.
Citation: Rousidis D. and Tjortjis C., ‘Clustering data retrieved from Java source code to support software maintenance: A case study’, IEEE 9th European Conf. Software Maintenance and Reengineering (CSMR 2005), pp. 276-279, 2005
Paper » IEEE 9th European Conf. Software Maintenance and Reengineering (CSMR 2005), pp. 276-279, 2005
Data Mining Source Code to Facilitate Program Comprehension: Experiments on Clustering Data Retrieved from C++ Programs
This paper presents ongoing work on using data mining to discover knowledge about software systems thus facilitating program comprehension. We discuss how this work fits in the context of tool supported maintenance and comprehension and report on applying a new methodology on C++ programs. The overall framework can provide practical insights and guide the maintainer through the specifics of systems, assuming little familiarity with these. The contribution of this work is two-fold: it provides a model and associated method to extract data from C++ source code which is subsequently to be mined, and evaluates a proposed framework for clustering such data to obtain useful knowledge. The methodology is evaluated on three open source applications, results are assessed and conclusions are presented. This paper concludes with directions for future work.
Citation: Kanellopoulos Y. and Tjortjis C., ‘Data Mining Source Code to Facilitate Program Comprehension: Experiments on Clustering Data Retrieved from C++ Programs’, IEEE 12th Int’l Workshop on Program Comprehension (IWPC 2004), pp. 214-223, 2004
Paper » IEEE 12th Int’l Workshop on Program Comprehension (IWPC 2004), pp. 214-223, 2004
Facilitating Program Comprehension by Mining Association Rules from Source Code
Program comprehension is an important part of software maintenance, especially when program structure is complex and documentation is unavailable or outdated. Data mining can produce structural views of source code, thus facilitating legacy systems understanding. This paper presents a method for mining association rules from code aiming at capturing program structure and achieving better system understanding. A tool was implemented to assess this method. It inputs data extracted from code and derives association rules. Rules are then processed to abstract programs into groups containing interrelated entities. Entities are grouped together if their attributes participate in common rules. The abstraction is performed at the function level, in contrast to other approaches that work at the program level. The method was evaluated using real, working programs. Programs are fed into a code analyser which produces the input needed for the mining tool. Results show that the method facilitates program comprehension using only source code, where domain knowledge and reliable documentation are unavailable.
Citation: Tjortjis C., Sinos L. and Layzell P.J., ‘Facilitating Program Comprehension by Mining Association Rules from Source Code’, IEEE 11th Int’l Workshop Program Comprehension (IWPC 03), pp. 125-132, 2003
Paper » IEEE 11th Int’l Workshop Program Comprehension (IWPC 03), pp. 125-132, 2003
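To make the mining step concrete, the sketch below applies a generic Apriori implementation (mlxtend's, not the paper's tool) to a made-up function-by-attribute usage table; functions whose attributes participate in common rules would then be grouped together.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# rows: functions; columns: attributes (variables, types) each function uses
usage = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1],
     [0, 1, 1, 1],
     [1, 1, 0, 1]],
    columns=["cust_id", "cust_name", "order_id", "order_total"],
).astype(bool)

itemsets = apriori(usage, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```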
A Method for Legacy Systems Maintenance by Mining Data Extracted from Source Code
This paper proposes a new method for understanding and maintaining legacy software systems. The method is based on the use of data mining for extracting interrelationships, patterns and groupings of code elements ranging from variables up to modules. Data mining techniques have been previously used for producing high-level system organizations of source code and legacy systems remodularization. Clustering and association rules were used to get an overview of legacy systems, when attempting understanding, maintenance or re-engineering. However, all previous approaches have addressed systems at the high level of files, programs and modules, failing to get an insight into systems at a lower level. The work presented here aims at addressing systems both at high and low level. It was motivated by the model of data mining in more conventional domains, which requires data preprocessing prior to the application of algorithms. The method comprises a systematic data preparation stage for extracting a number of data models and the relevant databases before applying data mining. Its viability is evaluated in COBOL systems by deriving records about variables, keywords and other grammatical information, which are then mined for association rules. Detailed examples are given; conclusions and further work are also presented.
Citation: Chen K., Tjortjis C. and Layzell P.J., ‘A Method for Legacy Systems Maintenance by Mining Data Extracted from Source Code’, IEEE 6th European Conf. on Software Maintenance and Reengineering (CSMR 2002), pp. 54-60, 2002
Paper » IEEE 6th European Conf. on Software Maintenance and Reengineering (CSMR 2002), pp. 54-60, 2002
Expert Maintainers' Strategies and Needs when Understanding Software: A Case Study Approach
Accelerating the learning curve of software maintainers working on systems with which they have little familiarity motivated this study. A working hypothesis was that automated methods are needed to provide a fast, rough grasp of a system, to enable practitioners not familiar with it to commence maintenance with a level of confidence as if they had this familiarity. Expert maintainers were interviewed regarding their strategies and information needs to test this hypothesis. The overriding message is their need for a “starting point” when analyzing code. They also need standardized, reliable and communicable information about a system as an equivalent to knowledge available only to developers or experienced maintainers. These needs are addressed by the proposed “rough-cut” approach to program comprehension. Work underway assesses the suitability of using data mining techniques on data derived from source code to provide high level models of a system and module interrelationships.
Citation: Tjortjis C. and Layzell P.J., ‘Expert Maintainers’ Strategies and Needs when Understanding Software: A Case Study Approach’, IEEE 8th Asia-Pacific Software Engineering Conf. (APSEC 2001), pp. 281-287, 2001
Paper » IEEE 8th Asia-Pacific Software Engineering Conf. (APSEC 2001), pp. 281-287, 2001
Experiences of using Data Mining in a Banking Application
In recent years, the ability to generate, capture and store data has increased enormously, and the information contained in this data can be of great business value. It is recognised that, to compete effectively in increasingly competitive global markets, banks must better understand and profile their customers. An unambiguous perspective on the behaviour and attributes of customers comes from their financial history. This data can be used to enable banks to acquire and retain good customers, where good customers are the most profitable ones. Knowledge Discovery in Databases (KDD), often called data mining, is the inference of knowledge hidden within large collections of operational data. This paper reports on experiences of applying the KDD process in a banking domain. A number of data mining techniques were used within the KDD process, and the results obtained have influenced the business activities of the banks. The procedures used are analysed with respect to the domain knowledge they utilise, in order to evaluate the input from a domain expert during the KDD process.
Citation: Scott R.I., Svinterikou S., Tjortjis C. and Keane J.A., ‘Experiences of using Data Mining in a Banking Application’, 2nd WSES/IEEE/IMCS Int’l Conf. on Circuits, Systems and Computers (CSC’98), pp. 343-346, 1998
Paper » 2nd WSES/IEEE/IMCS Int’l Conf. on Circuits, Systems and Computers (CSC’98), pp. 343-346, 1998
Publications in refereed National Conferences
Understanding Java Source Code by Using Data Mining Clustering
This paper presents ongoing work on using data mining clustering to facilitate software maintenance, program comprehension and software systems knowledge discovery. We propose a method for grouping Java code elements together according to their similarity. The method aims at providing maintainers with practical insights into, and guidance through, the specifics of a system, assuming they have little familiarity with it. Our method employs a preprocessing algorithm to identify the most significant syntactical and grammatical elements of Java programs, and then applies hierarchical agglomerative clustering to produce correlations among the extracted data. The proposed method successfully reveals similarities between classes and other code elements, thus facilitating software maintenance and Java program comprehension, as shown by the experimental results presented here. The paper concludes with directions for further work.
Citation: Rousidis D. and Tjortjis C., ‘Understanding Java Source Code by Using Data Mining Clustering’, Proc. 10th Pan’c Conf. on Informatics (PCI’2005), 2005
Paper » Proc. 10th Pan’c Conf. on Informatics (PCI’2005), 2005
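For illustration, here is a minimal sketch of the clustering step, assuming a preprocessing pass has already reduced each Java class to a boolean vector of syntactic and grammatical features. SciPy, average linkage and Jaccard distance are stand-in choices, not necessarily the paper's; the classes and features are toy data.

```python
# Minimal sketch: hierarchical agglomerative clustering of Java classes,
# each represented as a boolean feature vector. Data and distance metric
# are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

classes = ["Account", "SavingsAccount", "Logger", "FileLogger"]
features = ["balance", "deposit()", "withdraw()", "log()", "flush()"]

# Rows: classes; columns: whether the class uses that feature
X = np.array([
    [1, 1, 1, 0, 0],   # Account
    [1, 1, 1, 0, 0],   # SavingsAccount
    [0, 0, 0, 1, 1],   # Logger
    [0, 0, 0, 1, 1],   # FileLogger
], dtype=bool)

# Average-linkage agglomerative clustering over Jaccard distance
Z = linkage(X, method="average", metric="jaccard")
labels = fcluster(Z, t=2, criterion="maxclust")

for name, label in zip(classes, labels):
    print(f"{name}: cluster {label}")   # account classes vs. logger classes
```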
A Complete Model of an Information System for Hospitals
Citation: Gioldasis G., Panagopoulou G., Sirmakesis S., Tjortjis C. and Tsakalidis A., ‘A Complete Model of an Information System for Hospitals’, Proc. 3rd Nat’l Conf. of Medical Informatics, 1994
Paper » Proc. 3rd Nat’l Conf. of Medical Informatics, 1994
A Hybrid Knowledge-Driven Approach to Clustering Gene Expression Data
Microarray technology has enabled scientists to monitor and process the expression of thousands of genes in parallel, within a single experiment. However, the efficient interpretation and validation of the analysis results, based on current medical and biological knowledge, remains a challenge. Most gene expression analysis approaches do not incorporate existing background knowledge in the process, thus necessitating laborious manual interpretation. In this paper we propose a novel hybrid knowledge-driven approach for analyzing gene expression data which integrates currently available biological and medical knowledge within the actual clustering process. Existing published scientific information is correlated to create, validate and biologically interpret the resulting clusters. Some preliminary experimental results are supplied using a sample yeast genome data set.
Citation: Denaxas S. and Tjortjis C., ‘A Hybrid Knowledge-Driven Approach to Clustering Gene Expression Data’, Proc. 10th Pan’c Conf. on Informatics (PCI’2005), 2005
Paper » Proc. 10th Pan’c Conf. on Informatics (PCI’2005), 2005
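One plausible way to fold background knowledge into the clustering itself, sketched below under stated assumptions, is to blend an expression-profile distance with an annotation-overlap distance before hierarchical clustering. The blending weight, toy profiles and annotation sets are illustrative; the paper's actual integration may differ.

```python
# Minimal sketch of knowledge-driven clustering: combine expression-profile
# distance with a penalty for genes sharing no functional annotations.
# The weight ALPHA and all data are assumptions for illustration.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

genes = ["YAL001C", "YAL002W", "YBR010W", "YBR020C"]
expression = np.array([          # expression profiles over 4 conditions
    [0.1, 0.9, 0.2, 0.8],
    [0.2, 0.8, 0.1, 0.9],
    [0.9, 0.1, 0.8, 0.2],
    [0.8, 0.2, 0.9, 0.1],
])
annotations = [                  # hypothetical functional annotations
    {"ribosome"}, {"ribosome"}, {"metabolism"}, {"metabolism", "stress"},
]

ALPHA = 0.7   # weight of expression distance vs. knowledge distance

def jaccard_distance(a: set, b: set) -> float:
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0

n = len(genes)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d_expr = np.linalg.norm(expression[i] - expression[j])
        d_know = jaccard_distance(annotations[i], annotations[j])
        D[i, j] = D[j, i] = ALPHA * d_expr + (1 - ALPHA) * d_know

# Hierarchical clustering on the hybrid distance matrix
labels = fcluster(linkage(squareform(D), method="average"),
                  t=2, criterion="maxclust")
print(dict(zip(genes, labels)))
```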
Technical Reports
Data Mining Code Clustering (DMCC): An Approach Supporting Software Maintainers with Program Comprehension
Software maintainers face challenges when making decisions to modify programs with little understanding of the overall source code organisation and the full impact of changes. Most software systems are structured as a number of subsystems, consisting of code that collaborates to provide the composed functionality of the program. An important aspect of program understanding is to perceive this subsystem structure. Cluster analysis can be of use in deriving a meaningful subsystem structure of a program from its source code. The idea is to represent a program as a number of entities, which are grouped in clusters representing subsystems, based on their similarity, measured by means of related functionality or data use. Central issues for any clustering-based approach are the specification of program entities and their attributes, similarity metrics, and the clustering strategy. We propose here Data Mining Code Clustering (DMCC), an approach for supporting software maintenance and program comprehension that uses input data extracted from a C/C++ program and produces its abstraction as a number of subsystems. A main contribution is the introduction of weighting rationales, a framework for similarity metrics based on them, and an analysis of these from a novel point of view. The approach was evaluated by implementing a tool that was used for experimentation with programs of various sizes and languages, in collaboration with experts. Results showed that the approach is useful for deriving accurate subsystem abstractions and identifying interrelationships amongst modules. Factors influencing the feasibility of this approach are identified and directions for improvements are discussed.
Citation: Tjortjis C., ‘Data Mining Code Clustering (DMCC): An Approach Supporting Software Maintainers with Program Comprehension’, Technical Report, School of Science & Technology, International Hellenic University, 2019
Report » School of Science & Technology, International Hellenic University, 2019
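To illustrate the flavour of the weighting rationales, here is a hypothetical weighted similarity metric over attribute categories, where, for instance, a shared global variable counts more towards similarity than a shared function call. The categories, weights and data are assumptions for illustration, not DMCC's actual metric.

```python
# Hypothetical sketch of a weighted similarity metric in the spirit of DMCC:
# entities are C/C++ functions described by the attributes they use, and each
# attribute category carries a weight (the "weighting rationale"). Weights,
# categories and data are illustrative assumptions, not the report's values.
WEIGHTS = {"global_var": 3.0, "user_type": 2.0, "call": 1.0}

def weighted_similarity(a: dict, b: dict) -> float:
    """Weighted Jaccard similarity over attribute categories."""
    shared = total = 0.0
    for cat, w in WEIGHTS.items():
        sa, sb = a.get(cat, set()), b.get(cat, set())
        shared += w * len(sa & sb)
        total += w * len(sa | sb)
    return shared / total if total else 0.0

f1 = {"global_var": {"g_state"}, "user_type": {"Packet"}, "call": {"send", "log"}}
f2 = {"global_var": {"g_state"}, "user_type": {"Packet"}, "call": {"recv", "log"}}
f3 = {"call": {"log"}}

print(weighted_similarity(f1, f2))   # 0.75: shared global, type and one call
print(weighted_similarity(f1, f3))   # ~0.14: only one shared call
```

Pairwise similarities of this kind would then feed a clustering strategy that groups functions into candidate subsystems, which is the abstraction DMCC produces.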