Machine learning on small size samples: A synthetic knowledge synthesis
Peter Kokol, Marko Kokol*, Sašo Zagoranski*
Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor,
Slovenia
*Semantika, Maribor, Slovenia
INTRODUCTION
Periods of scientific knowledge doubling have become significantly shorter (Fuller, 1983; Bornmann, Anegón and Leydesdorff, 2010; McDeavitt, 2014). This phenomenon, combined with the information explosion, fast cycles of technological innovation, the Open Access and Open Science movements, and new Web/Internet-based methods of scholarly communication, has immensely increased the complexity of, and effort needed for, synthesising scientific evidence and knowledge. However, the above phenomena have also resulted in the growing availability of research literature in a digital, machine-readable format.
To address the emerging complexity of knowledge synthesis and exploit the new possibilities offered by the digital presentation of scientific literature, Blažun Vošner et al. (Blažun Vošner et al., 2017) and Kokol et al. (Kokol et al., 2020; Kokol, Zagoranski and Kokol, 2020) developed a novel synthetic knowledge synthesis methodology based on the triangulation of (1) distant reading (Moretti, 2013), an approach for understanding the canons of literature not by close (manual) reading, but by using computer-based technologies such as text mining and machine learning, (2) bibliometric mapping (Noyons, 2001) and (3) content analysis (Kyngäs, Mikkonen and Kääriäinen, 2020; Lindgren, Lundman and Graneheim, 2020). Such triangulation of technologies enables one to combine quantitative and qualitative knowledge synthesis in a manner that extends classical bibliometric analysis of publication metadata with machine learning supported understanding of the patterns, structure and content of publications (Atanassova, Bertin and Mayr, 2019).
One of the increasingly important technologies dealing with the growing complexity of the digitalisation of almost all human activities is artificial intelligence, more precisely machine learning (Krittanawong, 2018; Grant et al., 2020). Despite the fact that we live in a “Big data” world where almost “everything” is digitally stored, there are many real-world situations in which researchers are faced with small data samples. It is quite common that the database is limited, for example, by the number of subjects (e.g. patients with rare diseases); that the sample is small compared to the number of features, as in genetics or biomarker detection; that there is a lot of noisy or missing data; that measurements are extremely expensive; or that the data are imbalanced, meaning that one class in a data set has very few objects (Thomas et al., 2019; Vabalas et al., 2019).
Using machine learning on small datasets poses a problem because, in general, the “power” of machine learning in recognising patterns is proportional to the size of the
dataset: the smaller the dataset, the less powerful and less accurate the machine learning algorithms become. Despite the commonality of this problem and the various approaches to solving it, we did not find any holistic studies concerned with this important area of machine learning. To fill this gap we used the synthetic knowledge synthesis presented above to aggregate current evidence relating to the »small data set problem« in machine learning. In that manner we can overcome the problem of isolated findings, which might be incomplete or might lack possible overlap with other possible solutions. In our analysis, we aimed to extract, synthesise and multidimensionally structure the evidence of as complete a corpus of scholarship on small data sets as possible. Additionally, the study seeks to identify gaps which may require further research.
Thus the outputs of the study might help researchers and practitioners to understand the broader aspects of small size sample research and its translation to practice. On the other hand, it can help a novice or a data science professional without specific knowledge of small size samples to develop a perspective on the most important research themes, solutions and the relations between them.
METHODOLOGY
The following research question was set:
What is the small data problem and how is it solved?
Synthetic knowledge synthesis was performed following the steps below:
1. Harvest the research publications concerning small data sets in machine learning to
represent the content to analyse.
2. Condense and code the content using text mining.
3. Analyse the codes using bibliometric mapping and induce the small data set research
cluster landscape.
4. Analyse the connections between the codes in individual clusters and map them into sub-
categories.
5. Analyse sub-categories to label clusters with themes.
6. Analyse sub-categories to identify research dimensions.
7. Cross-tabulate themes and research dimensions and identify concepts.
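As an illustration only (a toy sketch we add here, with a hypothetical three-document corpus; the study itself used VOSviewer on Scopus exports), steps 2 and 3 amount to coding documents into terms and counting term co-occurrence, the raw material that is then clustered into a landscape:

```python
# Minimal sketch of steps 2-3: code documents into terms and count
# term co-occurrence. The corpus below is invented for illustration.
from collections import Counter
from itertools import combinations

corpus = [
    "transfer learning for image classification on small datasets",
    "support vector machines for small sample fault diagnosis",
    "data augmentation with transfer learning for image classification",
]

# Step 2: condense each record into codes (here: naive keyword terms).
stopwords = {"for", "on", "with", "the", "of", "and"}
coded = [[w for w in doc.split() if w not in stopwords] for doc in corpus]

# Step 3: count codes and their pairwise co-occurrence across documents.
code_freq = Counter(w for doc in coded for w in set(doc))
cooccurrence = Counter(
    pair for doc in coded for pair in combinations(sorted(set(doc)), 2)
)

print(code_freq["transfer"])                    # appears in 2 documents
print(cooccurrence[("image", "learning")])      # co-occur in 2 documents
```

A bibliometric mapping tool such as VOSviewer performs a far more sophisticated version of this counting, followed by similarity-based layout and clustering.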
The research publications were harvested from the Scopus database using the advanced search command TITLE-ABS-KEY(("small database" or "small dataset" or "small sample") and "machine learning" and not ("large database" or "large dataset" or "large sample")). The search was performed on 7 January 2021. The following metadata were exported as a CSV-formatted corpus file for each publication: Authors, Author affiliations, Publication Title, Year of Publication, Source Title, Abstract and Author Keywords.
Bibliometric mapping and text mining were performed using the VOSviewer software (Leiden University, Netherlands). VOSviewer uses text mining to recognise publication terms and then employs the mapping technique called Visualisation of Similarities (VOS) to create bibliometric maps or landscapes (van Eck and Waltman, 2014). Landscapes can be displayed in various ways to present different aspects of the content of the research publications. In this study the content analysis was performed on the author keyword cluster landscape, because previous research showed that author keywords most concisely present the content authors would like to communicate to the research community (Železnik, Blažun Vošner and Kokol, 2017). In this type of landscape VOSviewer merges author keywords which are closely associated into clusters, denoted by the same cluster colours (van Eck and Waltman, 2010). Using a customised thesaurus file, we excluded common terms like study, significance and experiment, and eliminated geographical names and time stamps from the analysis.
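The thesaurus file used in the study is not reproduced here; as an illustration, a hypothetical excerpt in the two-column, tab-separated (label, replace by) format described in the VOSviewer manual might look as follows, where an empty "replace by" value excludes the term and a filled one merges variants:

```
label	replace by
study	
experiment	
significance	
china	
2019	
neural networks	neural network
```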
RESULTS AND DISCUSSION
The search resulted in 1254 publications written by 3833 authors. Among them were 687 articles, 500 conference papers, 33 review papers, 17 book chapters and 17 other types of publications. The most productive countries among the 78 represented were China (n=439), the United States (n=297), the United Kingdom (n=91), Germany (n=63), India (n=52), Canada (n=51), Spain (n=44), Australia (n=42), Japan (n=34), South Korea (n=30) and France (n=30). The most productive institutions are located in China and the United States. Among 2857 institutions, the China Academy of Sciences is the most productive (n=52), followed by the Ministry of Education China (n=28), Harvard Medical School, USA (n=14), Harbin Institute of Technology, China (n=14), National Cheng Kung University (n=13), Shanghai University, China (n=13), Shanghai Jiao Tong University, China (n=12), Georgia Institute of Technology, USA (n=12) and Northwestern Polytechnical University, China (n=11). The first non-China/USA institution is the University of Granada (n=8), ranked 19th. The majority of the productive countries are members of the G9, with strong economies and research infrastructure.
The most prolific source titles (journals, proceedings, books) are Lecture Notes in Computer Science (including the subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (n=64), IEEE Access (n=17), Neurocomputing (n=17), PLOS ONE (n=17), Advances in Intelligent Systems and Computing (n=14), Expert Systems with Applications (n=14), ACM International Conference Proceeding Series (n=13) and Proceedings of SPIE - The International Society for Optical Engineering (n=13). The more productive source titles are mainly from the computer science research area and are ranked in the first half of the SCImago Journal Rank. On average, their h-indexes are between 120 and 380.
Figure 1. The research literature production dynamics
The first two papers concerning the use of machine learning on small datasets indexed in Scopus were published in 1995 (Forsström et al., 1995; Reich and Travitzky, 1995). After that, publications were rare until 2002; the trend started to rise linearly in 2003 and exponentially in 2016. Until the year 2000 all publications were published in journals; after that
[Figure 1 chart: annual number of publications, 1995-2020, y-axis 0-300; series: Total number of publications, Articles, Conference papers, Reviews]
conference papers started to emerge. Uninterrupted production of review papers started much later, in 2016. Based on the ratio between articles, reviews and conference papers, and the exponential trend in production for the first two, we assess that machine learning on small datasets is in the third stage of Shneider's scientific discipline evolution model (Shneider, 2009). That means that the terminology and methodologies are already highly developed, and that domain-specific original knowledge generation is heading toward optimal research productivity.
On the other hand, Pestana et al. (Pestana, Sánchez and Moutinho, 2019) characterised scientific discipline maturity as a set of extensive repeated connections between researchers that collaborate on the publication of papers on the same or related topics over time. In our study we analysed those connections with the co-author networks induced by VOSviewer. The co-author network showed that among the countries with a productivity of 10 or more papers (n=27), presented in Figure 2, an elaborate international co-authorship network has emerged, confirming that maturity has already reached a high level. The most intensive co-operation is between the United States and China, and the “newest” countries joining the network are the Russian Federation, South Korea and Brazil.
Figure 2. The co-authorship network
The content analysis of the keyword cluster landscape shown in Figure 3 resulted in the codes, sub-categories and themes presented in Table 1. The content analysis revealed four prevailing themes. The two largest themes are related to dimension reduction in complex big data analysis and data augmentation techniques in deep learning.
Figure 3. The author keyword cluster landscape
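The data augmentation theme can be illustrated with a minimal sketch (ours, added for illustration; not code from any cited paper): simple geometric transforms that multiply a small image set, here applied to a synthetic toy "image".

```python
# Illustrative sketch: simple array-level augmentations (flips, 90-degree
# rotation) of the kind used to enlarge small image datasets.
import numpy as np

rng = np.random.default_rng(2)
image = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)  # toy "image"

augmented = [
    image,
    np.fliplr(image),      # horizontal flip
    np.flipud(image),      # vertical flip
    np.rot90(image, k=1),  # 90-degree rotation
]
print(len(augmented))  # one original plus three augmented variants -> 4
```

Real pipelines add crops, colour jitter, or learned generators (e.g. GANs, as in the blue/yellow cluster below), but the principle is the same: each transform yields a new, label-preserving training example.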
Table 1: Small size sample research themes (numbers in parentheses give the number of papers in which an author keyword occurred)

| Theme | Colour | More frequent codes | Prevailing sub-categories |
| --- | --- | --- | --- |
| Dimension reduction in complex big data analysis | Red | Machine learning (339), Classification (51), Feature selection (50), Artificial neural network (28), Neural network (19), Natural language processing (16), Ensemble learning (14), Bioinformatics (13), Decision tree (12), Breast cancer (11), Machine learning algorithms (10), Computer vision (10), Microarray (9), Small datasets (9), Text classification (9), Big data (9), Dimensionality reduction (9), Social media (7) | Machine learning algorithms in bioinformatics; Classification on small samples with feature selection; Ensemble learning for solving under-sampling in personalised medicine; Solving missing data in breast cancer using deep neural networks; Dimensionality reduction with feature extraction in cancer classification; Natural language processing in social media and text analysis |
| Data augmentation techniques in deep learning for pattern recognition and classification | Blue and yellow | Deep learning (89), Convolutional neural networks (45), Transfer learning (38), Pattern recognition (14), Image classification (14), Data augmentation (12), Generative adversarial networks (8), LSTM (6), ResNet - residual neural network (4) | Deep and transfer learning in image classification using data augmentation and synthetic data; complex neural networks; Pattern recognition in mobile sensing using principal component analysis and LSTM; ResNet- and DenseNet-based deep learning using autoencoders in the case of glaucoma |
| Support vector machines, Random forest and Genetic algorithms in statistical learning | Green | Support vector machines (109), Random forest (24), Prediction (18), Fault diagnosing (18), Forecast (10), Genetic algorithms (8), Regression (8), Gaussian process (7), Optimisation (7), Time series (6), Statistical learning theory (6) | Support vector machines and random forests in prediction, forecasting and fault diagnosis; Genetic algorithms in optimisation and time series analysis |
| Data mining on small datasets with support vector regression | Violet | Small datasets (2), Data mining (15), Support vector regression (11), Virtual sample (7) | Small dataset augmentation with virtual samples; Data mining with support vector regression |
A higher-abstraction analysis of Table 1 revealed four categories, which we call research dimensions. The four identified dimensions are: Small data set problem, Machine learning algorithms, Small-data pre-processing technique and Application area. Cross-tabulation of themes and research dimensions resulted in the taxonomy presented in Table 2. The entries in the table present the most popular concepts in each taxonomical entity.
The most frequently reported difficulties causing the small data problem are the small size of the dataset, the high/low dimensionality of datasets and unbalanced data. Small datasets can cause problems when machine learning is applied in materials science (Zhang and Ling, 2018), engineering (Babič et al., 2014; Feng, Zhou and Dong, 2019) and various omics fields (Ko, Choi and Ahn, 2021), due to the high cost of sampling; in differentiating between autistic and non-autistic patients (Vabalas et al., 2019) and diagnosing rare diseases (Spiga et al., 2020), due to the small number of available patients or the unavailability of other subjects, for example PhD students in the prediction of their grades (Abu Zohair, 2019); and in new pandemic prediction, due to the fact that samples are scarce when a pandemic occurs and there is a lack of medical knowledge (Fong et al., 2020). Similar reasons can also lead to too high or too low dimensionality of the datasets. The first case can occur even if the sample size is not small but the ratio between the number of features and the sample size is large, as for example in particle physics (Komiske et al., 2018) or bioinformatics (Zhang et al., 2021). The second case can occur when the number of instances is not problematic but the number of features is very small, as for example in the characterisation of high-entropy alloys (Dai et al., 2020). Unbalanced data present a long-standing problem in machine learning and still remain a challenge in various applications such as face recognition, credit scoring, fault diagnosis and anomaly detection, where the majority class has many more instances than one or more of the remaining classes (He et al., 2020; Liang et al., 2020; Marceau et al., 2020).
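One of the simplest remedies for the imbalance problem described above can be sketched as follows (an illustration we add, with synthetic data; more elaborate schemes such as the SMOTE variants in the cited work interpolate new minority samples instead of duplicating them):

```python
# Illustrative sketch: random oversampling of the minority class on a
# synthetic imbalanced dataset (95 majority rows, 5 minority rows).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 95 + [1] * 5)

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Resample minority indices with replacement up to the majority size.
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])

X_bal, y_bal = X[idx], y[idx]
print(np.bincount(y_bal))  # both classes now have 95 rows
```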
The machine learning algorithms most used on small datasets are support vector machines (Alvarsson et al., 2016; Cao et al., 2017; Razzak et al., 2020), decision trees/forests (Zhang et al., 2017; Shaikhina et al., 2019), convolutional neural networks (Liu and Deng, 2016; Yamashita et al., 2018; Guo et al., 2020) and transfer learning (Hall et al., 2020; Tits, El Haddad and Dutoit, 2020).
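As a sketch of how such an algorithm is typically evaluated on a small sample (our illustration, not the cited authors' code), an RBF-kernel support vector machine can be scored with leave-one-out cross-validation, which wastes no scarce data on a fixed hold-out split; the 30-row subsample of the iris data is purely illustrative:

```python
# Illustrative sketch: SVM with leave-one-out cross-validation,
# a common evaluation choice when the sample size is small.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical small sample: keep every 5th iris row -> 30 samples.
X, y = load_iris(return_X_y=True)
X, y = X[::5], y[::5]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())  # one fold per sample
print(f"LOO accuracy over {len(scores)} folds: {scores.mean():.2f}")
```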
The data pre-processing techniques most frequently employed to overcome the small size problem are linear and nonlinear principal component analysis (Jalali, Mallipeddi and Lee, 2017; Feng, Zhou and Dong, 2019; Athanasopoulou, Papacharalampopoulos and Stavropoulos, 2020), discriminant analysis (Abu Zohair, 2019; Li et al., 2019; Silitonga et al., 2021), data augmentation (Han, Liu and Fan, 2018; Hagos and Kant, 2019; Fong et al., 2020), virtual samples (Gong et al., 2017; MacAllister, Kohl and Winer, 2020; Zhu et al., 2020), feature extraction (Kumar et al., 2018; Dai et al., 2020) and auto-encoders (Feng, Zhou and Dong, 2019; Pei et al., 2021).
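Of these techniques, principal component analysis is easy to sketch from first principles (an illustrative example we add here, on synthetic data; not the cited implementations): it projects a feature-rich, sample-poor matrix onto a few directions of maximal variance.

```python
# Minimal PCA sketch via SVD: reduce 50 features to 5 components for
# a synthetic "more features than samples" matrix (20 x 50).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))

Xc = X - X.mean(axis=0)                  # centre each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5
X_reduced = Xc @ Vt[:k].T                # scores on the top-k components

# Fraction of total variance retained by the k components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape)  # (20, 5)
```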
The most affected areas are bioinformatics (Giansanti et al., 2019; Khatun et al., 2020; Ko, Choi and Ahn, 2021), image classification and analysis (Han, Liu and Fan, 2018; Hagos and Kant, 2019; Yadav and Jadhav, 2019), fault diagnosis (Saufi et al., 2020; Wen, Li and Gao, 2020), forecasting and prediction (Croda, Romero and Morales, 2019; Fong et al., 2020), social media analysis (Khan et al., 2019; Renault, 2020) and health care (Rajpurkar et al., 2020; Subirana et al., 2020).
Table 2: Taxonomy of themes and research dimension categories (numbers in parentheses give the number of publications)

| Theme | Small data size problem | Machine learning algorithms | Small-data pre-processing technique | Application area |
| --- | --- | --- | --- | --- |
| Dimension reduction in complex big data analysis | Missing data (3), Unbalanced data (8), Under-sampling (3) | Bagging (9), Bayesian networks (7), Boosting (6), Decision trees and forests (43), deep neural network (12), Ensemble learning (21), manifold learning (5), neural network (30), naïve Bayes (2), semi-supervised learning (17), supervised learning (19), unsupervised learning (4) | Dimensionality reduction (9), Feature extraction (12), feature engineering (4), feature fusion (2), Lasso (6), Linear and multiple discriminant analysis (13), PCA (16), autoencoder (10) | Image analysis (18), genetics and bioinformatics (27), personalised and precision medicine (6), neuroimaging (7), fMRI (7), mental health (25) |
| Data augmentation for deep learning in pattern recognition and classification |  | Convolutional neural network (42), Dense CNN (3), Generative adversarial networks (8), LSTM (6), ResNet (4), transfer learning (40), semi-supervised learning (17) | Data augmentation (12), Domain adaptation (5), data fusion (3), synthetic data (3), autoencoder (10) | Image classification (22), genetics (3), mobile sensing (10), predictive modelling (6), semantic segmentation (3) |
| Statistical based machine learning | Small sample (98) | Support vector machines (89), Least squares SVM (10), genetic algorithms, random forest (24), Particle swarm (27), support vector regression (11) | Feature engineering (3), meta-learning (5), Monte Carlo (3), Small sample learning (5), virtual sample (11) | Forecasting (13), Fault diagnosis (18), regression (19), prediction (26), pattern recognition (17), rock mechanics (5), security (7), computer vision (10), data mining (15) |
| Data mining on small datasets | Big data (9) |  |  | Text and social media analysis (21), Precision medicine (3), Natural language processing (16), Sentiment analysis (9), clustering (8) |
Strengths and limitations
The main strength of the study is that it is the first bibliometric and content analysis of small dataset research. One of the limitations is that the analysis was limited to publications indexed in Scopus only; however, Scopus indexes the largest and most complete set of information titles, thus assuring the integrity of the data source. Additionally, the analysis also included qualitative components, which might bias the results of our study.
Conclusion
Our bibliometric study showed a positive trend in the number of research publications concerning the use of small datasets and substantial growth of the research community dealing with the small dataset problem, indicating that the research field is moving toward higher maturity levels. Despite notable international cooperation, a regional concentration of research literature production in economically more developed countries was observed.
The results of the study present a multi-dimensional facet and science landscape of the small dataset problem, which can help machine learning researchers and practitioners to improve their understanding of the area and can catalyse further knowledge development. On the other hand, it can inform novice researchers, interested readers, or research managers and evaluators without specific knowledge, and help them to develop a perspective on the most important small dataset research dimensions. Finally, the study output can serve as a guide to further research.
REFERENCES
Abu Zohair, L. M. (2019) ‘Prediction of Student’s performance by modelling small dataset size’, International Journal of Educational Technology in Higher Education, 16(1), p. 27. doi: 10.1186/s41239-019-0160-3.
Alvarsson, J. et al. (2016) ‘Large-scale ligand-based predictive modelling using support vector machines’, Journal of Cheminformatics, 8(1), p. 39. doi: 10.1186/s13321-016-0151-5.
Atanassova, I., Bertin, M. and Mayr, P. (2019) ‘Editorial: Mining Scientific Papers: NLP-enhanced Bibliometrics’, Frontiers in Research Metrics and Analytics, 4. doi: 10.3389/frma.2019.00002.
Athanasopoulou, L., Papacharalampopoulos, A. and Stavropoulos, P. (2020) ‘Context awareness system in the use phase of a smart mobility platform: A vision system for a light-weight approach’, in. Procedia CIRP, pp. 560–564. doi: 10.1016/j.procir.2020.05.097.
Babič, M. et al. (2014) ‘Using of genetic programming in engineering’, Elektrotehniski Vestnik/Electrotechnical Review, 81(3), pp. 143–147.
Blažun Vošner, H. et al. (2017) ‘Trends in nursing ethics research: Mapping the literature production’, Nursing Ethics, 24(8), pp. 892–907. doi: 10.1177/0969733016654314.
Bornmann, L., Anegón, F. de M. and Leydesdorff, L. (2010) ‘Do Scientific Advancements Lean on the Shoulders of Giants? A Bibliometric Investigation of the Ortega Hypothesis’, PLOS ONE, 5(10), p. e13327. doi: 10.1371/journal.pone.0013327.
Cao, J. et al. (2017) ‘An accurate traffic classification model based on support vector machines’, International Journal of Network Management, 27(1), p. e1962. doi: https://doi.org/10.1002/nem.1962.
Croda, R. M. C., Romero, D. E. G. and Morales, S.-O. C. (2019) ‘Sales Prediction through Neural Networks for a Small Dataset’, IJIMAI, 5(4), pp. 35–41.
Dai, D. et al. (2020) ‘Using machine learning and feature engineering to characterize limited material datasets of high-entropy alloys’, Computational Materials Science, 175. doi: 10.1016/j.commatsci.2020.109618.
van Eck, N. J. and Waltman, L. (2010) ‘Software survey: VOSviewer, a computer program for bibliometric mapping’, Scientometrics, 84(2), pp. 523–538. doi: 10.1007/s11192-009-0146-3.
van Eck, N. J. and Waltman, L. (2014) ‘Visualizing Bibliometric Networks’, in Ding, Y., Rousseau, R., and Wolfram, D. (eds) Measuring Scholarly Impact: Methods and Practice. Cham: Springer International Publishing, pp. 285–320. doi: 10.1007/978-3-319-10377-8_13.
Feng, S., Zhou, H. and Dong, H. (2019) ‘Using deep neural network with small dataset to predict material defects’, Materials and Design, 162, pp. 300–310. doi: 10.1016/j.matdes.2018.11.060.
Fong, S. J. et al. (2020) ‘Finding an Accurate Early Forecasting Model from Small Dataset: A Case of 2019-nCoV Novel Coronavirus Outbreak’, International Journal of Interactive Multimedia and Artificial Intelligence, 6(1), p. 132. doi: 10.9781/ijimai.2020.02.002.
Forsström, J. J. et al. (1995) ‘Using data preprocessing and single layer perceptron to analyze laboratory data’, Scandinavian Journal of Clinical and Laboratory Investigation, 55(S222), pp. 75–81. doi: 10.3109/00365519509088453.
Giansanti, V. et al. (2019) ‘Comparing Deep and Machine Learning Approaches in Bioinformatics: A miRNA-Target Prediction Case Study’, in Rodrigues, J. M. F. et al. (eds) Computational Science – ICCS 2019. Cham: Springer International Publishing (Lecture Notes in Computer Science), pp. 31–44. doi: 10.1007/978-3-030-22744-9_3.
Gong, H.-F. et al. (2017) ‘A Monte Carlo and PSO based virtual sample generation method for enhancing the energy prediction and energy optimization on small data problem: An empirical study of petrochemical industries’, Applied Energy, 197, pp. 405–415. doi: 10.1016/j.apenergy.2017.04.007.
Grant, K. et al. (2020) ‘Artificial Intelligence in Emergency Medicine: Surmountable Barriers With Revolutionary Potential’, Annals of Emergency Medicine, 75(6), pp. 721–726. doi: 10.1016/j.annemergmed.2019.12.024.
Guo, F. et al. (2020) ‘Improving cardiac MRI convolutional neural network segmentation on small training datasets and dataset shift: A continuous kernel cut approach’, Medical Image Analysis, 61, p. 101636. doi: 10.1016/j.media.2020.101636.
Hagos, M. T. and Kant, S. (2019) ‘Transfer Learning based Detection of Diabetic Retinopathy from Small Dataset’, arXiv:1905.07203 [cs]. Available at: http://arxiv.org/abs/1905.07203 (Accessed: 19 February 2021).
Hall, L. O. et al. (2020) ‘Finding Covid-19 from Chest X-rays using Deep Learning on a Small Dataset’, arXiv:2004.02060 [cs, eess]. Available at: http://arxiv.org/abs/2004.02060 (Accessed: 18 February 2021).
Han, D., Liu, Q. and Fan, W. (2018) ‘A new image classification method using CNN transfer learning and web data augmentation’, Expert Systems with Applications, 95, pp. 43–56. doi: 10.1016/j.eswa.2017.11.028.
He, Z. et al. (2020) ‘Support tensor machine with dynamic penalty factors and its application to the fault diagnosis of rotating machinery with unbalanced data’, Mechanical Systems and Signal Processing, 141, p. 106441. doi: 10.1016/j.ymssp.2019.106441.
Jalali, A., Mallipeddi, R. and Lee, M. (2017) ‘Sensitive deep convolutional neural network for face recognition at large standoffs with small dataset’, Expert Systems with Applications, 87, pp. 304–315. doi: 10.1016/j.eswa.2017.06.025.
Khan, J. Y. et al. (2019) ‘A Benchmark Study on Machine Learning Methods for Fake News Detection’, arXiv:1905.04749 [cs, stat]. Available at: http://arxiv.org/abs/1905.04749 (Accessed: 19 February 2021).
Khatun, Mst. S. et al. (2020) ‘Evolution of Sequence-based Bioinformatics Tools for Protein-protein Interaction Prediction’, Current Genomics, 21(6), pp. 454–463. doi: 10.2174/1389202921999200625103936.
Ko, S., Choi, J. and Ahn, J. (2021) ‘GVES: machine learning model for identification of prognostic genes with a small dataset’, Scientific Reports, 11(1), p. 439. doi: 10.1038/s41598-020-79889-5.
Kokol, P. et al. (2020) ‘Enhancing the role of academic librarians in conducting scoping reviews’, Library Philosophy and Practice (e-journal). Available at: https://digitalcommons.unl.edu/libphilprac/4293.
Kokol, P., Zagoranski, S. and Kokol, M. (2020) ‘Software Development with Scrum: A Bibliometric Analysis and Profile’, Library Philosophy and Practice (e-journal). Available at: https://digitalcommons.unl.edu/libphilprac/4705.
Komiske, P. T. et al. (2018) ‘Learning to classify from impure samples with high-dimensional data’, Physical Review D, 98(1), p. 011502. doi: 10.1103/PhysRevD.98.011502.
Krittanawong, C. (2018) ‘The rise of artificial intelligence and the uncertain future for physicians’, European Journal of Internal Medicine, 48, pp. e13–e14. doi: 10.1016/j.ejim.2017.06.017.
Kumar, I. et al. (2018) ‘A Comparative Study of Supervised Machine Learning Algorithms for Stock Market Trend Prediction’, in. Proceedings of the International Conference on Inventive Communication and Computational Technologies, ICICCT 2018, pp. 1003–1007. doi: 10.1109/ICICCT.2018.8473214.
Kyngäs, H., Mikkonen, K. and Kääriäinen, M. (2020) ‘The Application of Content Analysis in Nursing Science Research’. doi: 10.1007/978-3-030-30199-6.
Li, W. et al. (2019) ‘Detecting Alzheimer’s Disease on Small Dataset: A Knowledge Transfer Perspective’, IEEE Journal of Biomedical and Health Informatics, 23(3), pp. 1234–1242. doi: 10.1109/JBHI.2018.2839771.
Liang, X. W. et al. (2020) ‘LR-SMOTE — An improved unbalanced data set oversampling based on K-means and SVM’, Knowledge-Based Systems, 196, p. 105845. doi: 10.1016/j.knosys.2020.105845.
Lindgren, B.-M., Lundman, B. and Graneheim, U. H. (2020) ‘Abstraction and interpretation during the qualitative content analysis process’, International Journal of Nursing Studies, 108, p. 103632. doi: 10.1016/j.ijnurstu.2020.103632.
Liu, S. and Deng, W. (2016) ‘Very deep convolutional neural network based image classification using small training sample size’, in. Proceedings - 3rd IAPR Asian Conference on Pattern Recognition, ACPR 2015, pp. 730–734. doi: 10.1109/ACPR.2015.7486599.
MacAllister, A., Kohl, A. and Winer, E. (2020) ‘Using high-fidelity meta-models to improve performance of small dataset trained Bayesian Networks’, Expert Systems with Applications, 139, p. 112830. doi: 10.1016/j.eswa.2019.112830.
Marceau, L. et al. (2020) ‘A comparison of Deep Learning performances with other machine learning algorithms on credit scoring unbalanced data’, arXiv:1907.12363 [cs, stat]. Available at: http://arxiv.org/abs/1907.12363 (Accessed: 18 February 2021).
McDeavitt, J. T. (2014) ‘Medical Education: Toy Airplane or Stone Flywheel?’ Available at: http://wingofzock.org/2014/12/23/medical-education-toy-airplane-or-stone-flywheel/ (Accessed: 29 April 2017).
Moretti, F. (2013) Distant Reading. London: Verso.
Noyons, E. (2001) ‘Bibliometric mapping of science in a science policy context’, Scientometrics, 50(1), pp. 83–98. doi: 10.1023/A:1005694202977.
Pei, Z. et al. (2021) ‘Data augmentation for rolling bearing fault diagnosis using an enhanced few-shot Wasserstein auto-encoder with meta-learning’, Measurement Science and Technology. doi: 10.1088/1361-6501/abe5e3.
Pestana, M. H., Sánchez, A. V. and Moutinho, L. (2019) ‘The network science approach in determining the intellectual structure, emerging trends and future research opportunities – An application to senior tourism research’, Tourism Management Perspectives, 31, pp. 370–382. doi: 10.1016/j.tmp.2019.07.006.
Fuller, R. B. (1983) Critical Path. London: Century Hutchinson Ltd.
Rajpurkar, P. et al. (2020) ‘AppendiXNet: Deep Learning for Diagnosis of Appendicitis from A Small Dataset of CT Exams Using Video Pretraining’, Scientific Reports, 10(1), p. 3958. doi: 10.1038/s41598-020-61055-6.
Razzak, I. et al. (2020) ‘Randomized nonlinear one-class support vector machines with bounded loss function to detect of outliers for large scale IoT data’, Future Generation Computer Systems, 112, pp. 715–723. doi: 10.1016/j.future.2020.05.045.
Reich, Y. and Travitzky, N. (1995) ‘Machine learning of material behaviour knowledge from empirical data’, Materials and Design, 16(5), pp. 251–259. doi: 10.1016/0261-3069(96)00007-6.
Renault, T. (2020) ‘Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages’, Digital Finance, 2(1), pp. 1–13. doi: 10.1007/s42521-019-00014-x.
Saufi, S. R. et al. (2020) ‘Gearbox Fault Diagnosis Using a Deep Learning Model with Limited Data Sample’, IEEE Transactions on Industrial Informatics, 16(10), pp. 6263–6271. doi: 10.1109/TII.2020.2967822.
Shaikhina, T. et al. (2019) ‘Decision tree and random forest models for outcome prediction in antibody incompatible kidney transplantation’, Biomedical Signal Processing and Control, 52, pp. 456–462. doi: 10.1016/j.bspc.2017.01.012.
Shneider, A. M. (2009) ‘Four stages of a scientific discipline; four types of scientist’, Trends in Biochemical Sciences, 34(5), pp. 217–223. doi: 10.1016/j.tibs.2009.02.002.
Silitonga, P. et al. (2021) ‘Comparison of Dengue Predictive Models Developed Using Artificial Neural Network and Discriminant Analysis with Small Dataset’, Applied Sciences, 11(3), p. 943. doi: 10.3390/app11030943.
Spiga, O. et al. (2020) ‘Machine learning application for development of a data-driven predictive model able to investigate quality of life scores in a rare disease’, Orphanet Journal of Rare Diseases, 15(1). doi: 10.1186/s13023-020-1305-0.
Subirana, B. et al. (2020) ‘Hi Sigma, do I have the Coronavirus?: Call for a New Artificial Intelligence Approach to Support Health Care Professionals Dealing With The COVID-19 Pandemic’, arXiv:2004.06510 [cs]. Available at: http://arxiv.org/abs/2004.06510 (Accessed: 19 February 2021).
Thomas, R. M. et al. (2019) ‘Dealing with missing data, small sample sizes, and heterogeneity in machine learning studies of brain disorders’, in Machine Learning: Methods and Applications to Brain Disorders, pp. 249–266. doi: 10.1016/B978-0-12-815739-8.00014-6.
Tits, N., El Haddad, K. and Dutoit, T. (2020) ‘Exploring Transfer Learning for Low Resource Emotional TTS’, in Bi, Y., Bhatia, R., and Kapoor, S. (eds) Intelligent Systems and Applications. Cham: Springer International Publishing (Advances in Intelligent Systems and Computing), pp. 52–60. doi: 10.1007/978-3-030-29516-5_5.
Vabalas, A. et al. (2019) ‘Machine learning algorithm validation with a limited sample size’, PLoS ONE, 14(11). doi: 10.1371/journal.pone.0224365.
Wen, L., Li, X. and Gao, L. (2020) ‘A transfer convolutional neural network for fault diagnosis based on ResNet-50’, Neural Computing and Applications, 32(10), pp. 6111–6124. doi: 10.1007/s00521-019-04097-w.
Yadav, S. S. and Jadhav, S. M. (2019) ‘Deep convolutional neural network based medical image classification for disease diagnosis’, Journal of Big Data, 6(1), p. 113. doi: 10.1186/s40537-019-0276-2.
Yamashita, R. et al. (2018) ‘Convolutional neural networks: an overview and application in radiology’, Insights into Imaging, 9(4), pp. 611–629. doi: 10.1007/s13244-018-0639-9.
Železnik, D., Blažun Vošner, H. and Kokol, P. (2017) ‘A bibliometric analysis of the Journal of Advanced Nursing, 1976–2015’, Journal of Advanced Nursing, 73(10), pp. 2407–2419. doi: 10.1111/jan.13296.
Zhang, Y. et al. (2021) ‘Evaluating and selecting features via information theoretic lower bounds of feature inner correlations for high-dimensional data’, European Journal of Operational Research, 290(1), pp. 235–247. doi: 10.1016/j.ejor.2020.09.028.
Zhang, Y. and Ling, C. (2018) ‘A strategy to apply machine learning to small datasets in materials science’, npj Computational Materials, 4(1). doi: 10.1038/s41524-018-0081-z.
Zhang, Z. et al. (2017) ‘Topological Analysis and Gaussian Decision Tree: Effective Representation and Classification of Biosignals of Small Sample Size’, IEEE Transactions on Biomedical Engineering, 64(9), pp. 2288–2299. doi: 10.1109/TBME.2016.2634531.
Zhu, Q.-X. et al. (2020) ‘Dealing with small sample size problems in process industry using virtual sample generation: a Kriging-based approach’, Soft Computing, 24(9), pp. 6889–6902. doi: 10.1007/s00500-019-04326-3.