Dr Li Xiaoli, Head of Data Analytics Department, Institute for Infocomm Research (I2R), A*Star
Big data analytics has become critical in recent years as ever more data is collected across application domains. Enterprises and business owners want to analyse these large volumes of data to extract knowledge and insights that could improve their productivity and profitability and support business decision-making.
For example, in the manufacturing domain, various sensors (e.g. vibration, acoustic and temperature sensors) have been deployed to collect huge amounts of time-series sensor data. These data are used to monitor the performance of manufacturing equipment, predict potential failures and remaining useful life, and forecast product yield in complex manufacturing processes.
Clearly, many technical challenges in big data analytics need to be addressed before we can reap these benefits. We elaborate on the generic and the specific challenges separately.
Generic challenges for big data analytics
1) Scalability issue: the volume, or sheer size, of the data can exceed the capacity of a single present-day computer. We need to deploy scalable big data analytics platforms, built on distributed systems, to handle this issue.
2) Complexity issue: we may need to handle a variety of data types, both structured and unstructured, whereas many analytics techniques are designed for structured data only. In other words, for certain applications, heterogeneous data (transaction data, free text, images and video, social network data, etc.) need to be integrated and processed effectively. Data integration, i.e. linking and merging data from multiple sources, is often a very powerful tool, as it can provide a comprehensive picture of the actual application scenario. It can also simply provide richer data to be leveraged by various applications.
3) Velocity issue: data are not static but change dynamically over time. We need to design new algorithms that can process data quickly, so that they respond promptly and update models rapidly as new data arrive.
4) Privacy issue: in healthcare applications, electronic health records from different hospitals and clinics can be merged to build diagnostic models for investigating certain diseases.
It is also useful to understand users’ travel patterns by integrating traffic data, call detail record (CDR) data and Wi-Fi data for various transportation and smart nation applications. At the same time, however, people now have serious concerns about the privacy of their personal health data and their trajectory or location data. How to perform privacy-preserving data analytics needs more research.
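As a small illustration of the velocity issue (item 3 above), the following is a minimal sketch of an incrementally updated summary, assuming a stream of numeric sensor readings. It uses Welford's online algorithm for the mean and variance, which is only one simple choice and not a method named in this article; the class name `RunningStats` and the readings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean and variance updated one observation at a time (Welford's algorithm)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Fold one new observation into the summary in O(1) time and memory,
        # so past data never has to be stored or re-read.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 when fewer than two values were seen.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for reading in [20.0, 22.0, 21.0, 23.0, 24.0]:  # made-up temperature readings
    stats.update(reading)
print(stats.count, stats.mean, stats.variance)
```

The same pattern, a compact state plus a constant-time update, is what allows a summary or model to keep pace with incoming data without reprocessing the full history.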
Specific challenges for big data analytics
1) Handling imbalanced data: several application domains have an imbalanced data issue that needs more attention. Examples include fraud detection in financial data analytics, machine or aircraft component failure diagnosis and prognosis in manufacturing and aerospace, disease diagnosis in healthcare, earthquake prediction, and intrusion detection in network forensics. In these domains, most of the data represent the normal class (i.e. non-fraud, non-disease or the negative class), and few examples are available to train a model to recognize abnormal conditions (i.e. fraud, disease or the positive class). In other words, fraud, disease and other abnormal cases are rare compared with normal cases, which leads to highly imbalanced data for learning: we have big data for the normal class but small data for the abnormal class, so the majority class is far better represented than the minority class. Clearly, accurately predicting the abnormal cases (e.g. fraud cases) is very important. However, analytics models designed to optimize overall prediction accuracy will be biased in these applications: they tend to classify unknown cases into the majority, or normal, class, and thus cannot effectively detect fraud, disease, faults or failures. Techniques that can effectively handle imbalanced data are therefore urgently needed. While a few research works aim to balance the data by generating artificial minority class examples (i.e. oversampling), the problem is far from sufficiently addressed, and more research is needed to tackle this challenging issue.
2) Edge analytics and cloud computing: in manufacturing, aerospace and healthcare environments, big data are typically collected from sensors or other devices, and the traditional centralized analytics mechanism (using cloud computing) is not suitable for such time-critical applications. Edge analytics is very appealing here, as it performs analytical computation at the level of a sensor, network switch or other device instead of waiting for the data to be sent back to a centralized data server or cloud. In other words, edge analytics pushes data analytics to the edge devices where the data are generated, so that heavy data communication and transmission costs are largely reduced and big data analytics can be conducted more promptly, even in real time. As a result, edge analytics has recently become very popular (e.g. in Industry 4.0 applications). Nevertheless, we need to design analytics techniques that can run in resource-constrained computational environments (with less computing power, less memory and a limited electricity supply) and still achieve efficient and reasonably accurate results. In addition, if we can find ‘interesting’ data using edge analytics, we can send these small amounts of data to the cloud for more detailed and advanced analytics, i.e. an edge and cloud collaboration mechanism for sensor big data analytics.
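The imbalanced data challenge (item 1 above) can be made concrete with a small sketch on made-up labels: 990 normal cases and 10 fraud cases. It first shows why overall accuracy is misleading here, then balances the classes by random oversampling. Note that this sketch simply duplicates existing minority examples, which is a simpler stand-in for generating artificial ones; the function name `random_oversample` and all data are hypothetical.

```python
import random

# Made-up labels: 990 normal (0) and 10 fraud (1) cases.
labels = [0] * 990 + [1] * 10

# A trivial "model" that always predicts the majority (normal) class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)  # fraction of fraud cases that were caught
print(f"accuracy={accuracy:.2f}, fraud recall={recall:.2f}")


def random_oversample(samples, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority examples until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_count = len(labels) - len(minority_idx)
    # Number of extra minority copies needed to match the majority class.
    needed = majority_count - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(needed)]
    keep = list(range(len(labels))) + extra
    return [samples[i] for i in keep], [labels[i] for i in keep]


samples = list(range(len(labels)))  # placeholder feature values
balanced_samples, balanced_labels = random_oversample(samples, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

The always-normal model scores 99% accuracy while detecting no fraud at all, which is why recall on the rare class, rather than accuracy alone, is the quantity to watch in such applications.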
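The edge and cloud collaboration described in item 2 above can be sketched as follows, under several assumptions: the edge device keeps only a small rolling window of readings, treats a reading as ‘interesting’ when it deviates from the window mean by more than a chosen number of standard deviations, and forwards only those readings. The rolling z-score rule, the class `EdgeFilter`, the function `send_to_cloud` and the readings are all illustrative choices, not a mechanism specified in this article.

```python
from collections import deque
from statistics import mean, pstdev


class EdgeFilter:
    """Keeps a small rolling window on the device and flags unusual readings."""

    def __init__(self, window=20, threshold=3.0, warmup=5):
        self.window = deque(maxlen=window)  # bounded memory for a constrained device
        self.threshold = threshold
        self.warmup = warmup  # readings required before anything can be flagged

    def is_interesting(self, value):
        flagged = False
        if len(self.window) >= self.warmup:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            flagged = sigma > 0 and abs(value - mu) > self.threshold * sigma
        if not flagged:
            # Only ordinary readings update the baseline window.
            self.window.append(value)
        return flagged


def send_to_cloud(batch):
    # Hypothetical stand-in for an upload call; here it only reports the batch.
    print(f"uploading {len(batch)} reading(s): {batch}")


# Made-up vibration readings with one obvious spike.
readings = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 0.95, 1.05, 9.0, 1.0, 1.1]
edge = EdgeFilter(window=8, threshold=3.0)
interesting = [r for r in readings if edge.is_interesting(r)]
send_to_cloud(interesting)
```

In this run only the spike is forwarded, so the cloud receives one reading instead of eleven; the window size and threshold would need tuning for any real sensor.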
Overall, there are many challenging issues and plenty of opportunities in big data analytics. Typically, rich relevant data can lead to more useful insights or more accurate prediction models. When we have a huge amount of data for model building, whether for classification or regression models, we can also perform intelligent pre-processing to turn big data into small data, keeping only the important data points, and still achieve fairly accurate prediction results.