Published Date: 18/07/2025
The increasing prevalence of artificial intelligence has sparked considerable interest in how to extract meaningful insights from limited data, a situation known as the ‘small data’ problem. Maren Hackenberg, Sophia G Connor, and Fabian Kabus, all from the Institute of Medical Biometry and Statistics at the University of Freiburg, alongside colleagues from The Royal Society and other institutions, address this challenge by exploring the potential of small data methods in everyday life. Their work contrasts this approach with the more familiar ‘big data’ paradigm, identifying key themes and successful applications across diverse fields, from public policy to personal health monitoring. By bridging conceptual understanding with practical techniques, including statistical modelling and computer science approaches, the researchers demonstrate what is currently achievable and outline a roadmap for fully realising the benefits of small data analysis.
Recent advances in artificial intelligence have sparked renewed interest in extracting meaningful insights from limited data, a field known as small data analysis. This is increasingly important as researchers and policymakers recognise the limitations of relying solely on “big data” approaches, and seek ways to include under-represented groups in data-driven decision-making. While big data leverages vast quantities of information to identify broad trends, small data focuses on extracting knowledge from smaller, often more focused datasets, offering unique opportunities to address specific questions and uncover nuanced understandings. This shift is particularly relevant in areas like healthcare, where data on rare diseases is scarce, and in the development of assistive technologies, such as wearables, where individualised data is paramount.
The core difference between big and small data lies in the approach to analysis and the types of insights they yield. Big data excels at identifying general patterns, but can struggle with extreme values or subgroups within a population, potentially overlooking critical information. This means that minority or under-represented groups can become invisible when decisions are informed by big data outputs, raising concerns about fairness and inclusivity. Small data, conversely, prioritises detailed understanding within limited contexts, allowing researchers to focus on specific characteristics and address diversity in a targeted way.
This is not to say that big data is obsolete, but rather that small data offers a complementary approach, particularly when large datasets are unavailable or when unique data points hold significant value. The growing interest in small data also stems from its potential to overcome limitations inherent in big data methodologies. While big data relies on identifying overarching trends, it can reinforce existing biases if the data itself is not representative. Small data, by focusing on specific subgroups or individual cases, can help to mitigate these biases and ensure that diverse perspectives are considered.
This is crucial in areas like policy development, where decisions can have a significant impact on vulnerable populations. Furthermore, small data methods are often more adaptable to complex questions and can provide deeper insights when dealing with limited information, making them valuable in a wide range of applications. However, working with small data presents its own set of challenges. Researchers must carefully consider the potential for overfitting, where a model is too closely tailored to the available data and fails to generalise to new cases. Validating findings from small datasets also requires innovative approaches, as classical statistical procedures that rely on large-sample approximations may not be reliable. To address these challenges, researchers are drawing on expertise from diverse fields, including statistics, computer science, and mathematics, fostering interdisciplinary collaboration and developing new techniques for analysing limited data. This collaborative effort is essential for unlocking the full potential of small data and ensuring its responsible application in a variety of contexts.
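To make the overfitting risk concrete, consider a minimal sketch in Python with scikit-learn (an illustrative toolkit choice; the synthetic dataset and model are assumptions, not drawn from the study). A flexible model fit to thirty samples scores near-perfectly on the data it has seen, while leave-one-out cross-validation, which predicts each sample from a model that never saw it, reveals how much of that performance is illusory.

```python
# Minimal sketch: overfitting on small data, exposed by leave-one-out
# cross-validation. Dataset and model are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# A deliberately small dataset: 30 samples, 20 mostly uninformative features.
X, y = make_classification(n_samples=30, n_features=20, n_informative=2,
                           random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Training accuracy: fit and score on the same data (overoptimistic).
model.fit(X, y)
train_acc = model.score(X, y)

# Leave-one-out CV: each sample is predicted by a model that never saw it.
loo_acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()

print(f"training accuracy:      {train_acc:.2f}")  # typically near 1.00
print(f"leave-one-out accuracy: {loo_acc:.2f}")    # substantially lower
```

Leave-one-out is a natural choice here precisely because, with so few samples, holding out a large test set would waste data the model needs.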
Researchers are increasingly focusing on the potential of small data, recognising its value as a complement to big data approaches. This shift stems from the realisation that vast datasets aren’t always available, or may exclude important subgroups, and that meaningful insights can be derived from limited information. The methodology centres on extracting knowledge from datasets where the number of observations is limited, or where inherent variability within the data requires careful analysis, acknowledging that what constitutes “small” depends heavily on the context and complexity of the research question. The approach distinguishes itself from traditional big data analytics by prioritising depth over breadth, focusing on understanding specific patterns within smaller, potentially more homogenous datasets.
Unlike big data, which excels at identifying overarching trends, small data methods aim to uncover nuanced insights that might be obscured in larger datasets, particularly within under-represented populations. This is crucial for addressing biases inherent in large-scale analyses and ensuring that data-driven decisions are equitable and inclusive, as small data can highlight the unique characteristics of specific groups. A key aspect of this methodology is its interdisciplinary nature, drawing on techniques from computer science, statistics, and mathematics. Researchers recognise that a fragmented landscape of approaches hinders progress, and are actively working to establish a common language and framework for small data analysis.
Recent research highlights the significant potential of applying advanced artificial intelligence (AI) techniques to situations where data is limited, often referred to as “small data” settings. While much attention has focused on the benefits of AI with large datasets, this work demonstrates that substantial gains are also possible when information is scarce, with implications for areas like personalised medicine, assistive technologies, and inclusive policy-making. This is particularly important for representing under-represented groups, where comprehensive data collection is challenging. A key pitfall in small data scenarios is “data leakage”, where information from the data used to evaluate a model inadvertently influences its training or preprocessing, so that the model appears to perform well on the available data but fails to generalise to new, unseen cases.
This issue is more pronounced with small datasets because a flexible model can essentially memorise the training data rather than learning underlying patterns, compounding leakage with overfitting and leading to overoptimistic performance estimates. Researchers emphasise the importance of rigorous validation, including testing models on independent datasets, to ensure reliability and avoid biased results, though obtaining sufficient external data can be difficult. To overcome these limitations, the research advocates a fusion of approaches traditionally used in statistics and computer science. Statistical methods, which prioritise knowledge-driven modelling, can provide a strong foundation for understanding data even with limited samples. These can be combined with data-driven techniques, such as deep learning, to extract maximum information from the available data. Foundation models, like large language models, are emerging as particularly promising tools, offering a way to incorporate external knowledge and improve performance in small data settings.
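A classic way to see leakage in action is to run a preprocessing step, such as feature selection, on the full dataset before cross-validating. The hedged sketch below (scikit-learn again; the data are pure noise by construction, an assumption made so that honest accuracy should sit near chance) contrasts that leaky analysis with a leak-free one in which selection is refit inside each training fold via a pipeline.

```python
# Sketch of data leakage: selecting features on the full dataset before
# cross-validation lets test-fold information shape the model. With pure
# noise, the honest accuracy should hover around chance (0.5).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(40, 500)            # 40 samples, 500 noise features
y = rng.randint(0, 2, size=40)    # random labels: no real signal

# Leaky: feature selection sees all labels, including future test folds.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Leak-free: selection is refit inside each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")   # often well above 0.5
print(f"honest CV accuracy: {honest:.2f}")  # near chance, as it should be
```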
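One simple way to realise the statistics-plus-machine-learning fusion the authors describe, sketched here on assumed toy data rather than anything from the paper, is a two-stage model: a linear regression encodes the knowledge-driven structure, and a flexible learner is trained only on its residuals, so the data-driven component captures what the assumed structure misses instead of relearning everything from scratch.

```python
# Sketch: combine a knowledge-driven statistical model with a data-driven
# learner. The linear stage encodes assumed domain structure; the forest
# fits only the residuals. The setup is an illustrative assumption.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n = 60
X = rng.randn(n, 3)
# True signal: a known linear effect plus a small nonlinear wrinkle.
y = 2.0 * X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.randn(n)

# Stage 1: the statistical model captures the assumed linear structure.
lin = LinearRegression().fit(X, y)
residuals = y - lin.predict(X)

# Stage 2: a flexible learner mops up what the linear model missed.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, residuals)

def predict(X_new):
    # Combined prediction: knowledge-driven term plus data-driven correction.
    return lin.predict(X_new) + forest.predict(X_new)

print(predict(X[:3]))
```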
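And as a hedged illustration of the foundation-model route (the library, the model name, the tiny labelled set, and the labels themselves are all illustrative assumptions, not the authors' setup), a common pattern is to reuse pretrained text embeddings as features, so that knowledge distilled from vast external corpora does the heavy lifting and only a simple classifier needs to be fit to the handful of local examples.

```python
# Sketch: reuse a pretrained foundation model's embeddings as features,
# then train a simple classifier on a tiny labelled set. The model name
# and the toy examples are illustrative assumptions.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.linear_model import LogisticRegression

texts = ["persistent cough and fever",
         "sharp knee pain after running",
         "shortness of breath at night",
         "swollen ankle from a fall",
         "chest tightness when climbing stairs"]
labels = [1, 0, 1, 0, 1]  # toy labels: 1 = cardiorespiratory, 0 = musculoskeletal

# The pretrained encoder carries knowledge from vast external corpora,
# so five examples can suffice for a usable decision boundary.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)

clf = LogisticRegression().fit(X, labels)
print(clf.predict(encoder.encode(["wheezing and a tight chest"])))  # expect [1]
```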
This work provides a conceptual overview of small data, contrasting it with big data approaches and identifying key themes across various application areas. The research highlights that while big data excels at identifying general trends, it can marginalise individuals and scenarios that deviate from the norm, potentially leading to ineffective or inequitable outcomes. This is particularly relevant given the historical roots of statistical methods, which often prioritise the ‘average’ and may obscure important nuances within populations. The study demonstrates the value of considering alternatives to big data, especially when dealing with limited information or under-represented groups. The authors acknowledge that big data approaches are not universally superior and can struggle with extreme values or unique scenarios. Future work, as suggested by this analysis, involves a more nuanced application of statistical methodologies and a greater awareness of the limitations inherent in relying solely on large datasets, particularly when addressing complex social phenomena or individual needs.
Q: What is the main difference between small data and big data?
A: Small data focuses on extracting detailed insights from limited datasets, while big data leverages vast amounts of information to identify broad trends. Small data is particularly useful for addressing specific questions and uncovering nuanced understandings, especially in under-represented groups.
Q: Why is small data important in healthcare?
A: Small data is crucial in healthcare for understanding rare diseases and individualised patient care. It allows for the extraction of meaningful insights from limited datasets, which can be particularly valuable when comprehensive data is scarce.
Q: What are the challenges of working with small data?
A: Working with small data presents challenges such as the risk of overfitting, where models are too closely tailored to the available data and fail to generalise. Validating findings from small datasets also requires innovative approaches, as classical statistical procedures that rely on large-sample approximations may not be reliable.
Q: How does AI help in small data analysis?
A: AI techniques, such as deep learning and large language models, can help extract maximum information from limited data. These methods can incorporate external knowledge and improve performance in small data settings, making them valuable for a wide range of applications.
Q: What are the future directions for small data research?
A: Future research in small data involves developing more nuanced statistical methodologies and a greater awareness of the limitations of big data. This includes fostering interdisciplinary collaboration and establishing a common language and framework for small data analysis to ensure its responsible application in various fields.