Data Science and Statistical Literacy

One of the most interesting research areas in modern statistics is the class imbalance problem in classification algorithms for Big Data.

What does it mean?

Let me divide it into three parts: (1) the class imbalance problem, (2) the challenges of imbalanced Big Data, and (3) credit scoring. By following these three steps, I hope to make clear how effective technical solutions and scientific suggestions could lead to better results for credit scoring systems built on imbalanced learning from Big Data.

Ready?

Basically, classification is an important task in machine learning. A classifier, trained on a set of training examples with class labels, can then be used to predict the class labels of new examples. A class is a collection of things that might reasonably be grouped together. If we discover that something belongs to a class, we suddenly know quite a lot about it, even if we have not encountered that particular example before. Isn’t it interesting? Still, this useful area of machine learning has its problems, and one of them is the Class Imbalance Problem.
Data are said to suffer from the Class Imbalance Problem when the class distribution is highly skewed, i.e., there is a minority class and a majority class. This may be due to the rarity of a given concept, or to restrictions on gathering data for a particular class.
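
To make the definition concrete, here is a minimal sketch that counts the examples per class and reports the majority-to-minority ratio. The function name `imbalance_ratio` and the toy labels (95 examples of class 0, 5 of class 1) are assumptions made up for illustration; they are not from any real data set.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (per-class counts, majority count / minority count)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority

# Hypothetical toy labels: 0 = majority class, 1 = minority class.
labels = [0] * 95 + [1] * 5
counts, ratio = imbalance_ratio(labels)
print(counts[0], counts[1], ratio)
```

A ratio close to 1 means the classes are roughly balanced; the larger the ratio, the more skewed the distribution.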

Good news and bad news!
The good news is that there are some solutions. To address imbalanced classification, a number of solutions have been proposed, which mainly fall into three categories: (1) pre-processing techniques, such as resampling the data, (2) algorithmic approaches that alter the learning mechanism to take the class distribution into account, and (3) cost-sensitive learning approaches. The bad news is that these approaches often have low predictive accuracy for the infrequent class in the new Big Data era.
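
As a rough illustration of two of these categories, the sketch below shows random oversampling (a pre-processing technique) and inverse-frequency class weights (one simple input to cost-sensitive learning). Both functions, their names, and the toy data are assumptions chosen for illustration. The weight formula n / (k * count) is a common "balanced" heuristic, not the only possible choice, and real projects would usually rely on an established library instead.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate random minority examples until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels

def balanced_class_weights(labels):
    """Weight each class by n / (k * count), so rare classes cost more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical toy data: 8 majority examples (0) and 2 minority examples (1).
rows = list(range(10))
labels = [0] * 8 + [1] * 2

new_rows, new_labels = random_oversample(rows, labels)
weights = balanced_class_weights(labels)
print(Counter(new_labels), weights)
```

Oversampling changes the data and leaves the learner untouched, while class weights leave the data untouched and change how much each mistake costs during training.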

Actually, little research has been conducted on imbalanced classification for Big Data. The main reasons are the difficulty of adapting standard techniques to the MapReduce programming style and the newness of the subject. As a result, imbalanced learning is still a young discipline in Big Data and needs more research and development.
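
One way to get a feel for the MapReduce style is to compute the class distribution over data that is split into partitions. The sketch below only imitates the map and reduce steps in plain Python; it does not use Hadoop or any real MapReduce framework, and the partitions and function names are assumptions made up for illustration.

```python
from collections import Counter
from functools import reduce

def map_partition(labels):
    """Map step: each partition counts its own class labels locally."""
    return Counter(labels)

def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

# Hypothetical partitions of one data set; the second has no minority example.
partitions = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]

partial_counts = [map_partition(p) for p in partitions]
total = reduce(reduce_counts, partial_counts, Counter())
print(partial_counts, total)
```

Note that the second partition contains no minority example at all, so any step that works on that partition alone never sees the rare class. This is one plausible reason why techniques designed for a single in-memory data set need care when the data is distributed.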

But is it really important?

Absolutely YES! For instance, Big Data gives the banking industry a chance to boost business outcomes, and it offers a great competitive advantage in risk management systems. Given the huge practical potential of applications such as credit scoring in risk management, it is extremely important to design novel approaches to imbalanced learning problems in Big Data. Recently, cellphones, core banking, and payment systems have become Big Data sources for banks. These sources can be used to monitor different kinds of risk, but distress situations are relatively infrequent events! The very limited information available for distinguishing dynamic fraud from genuine customers in an extremely sparse and imbalanced data environment is making credit scoring more and more challenging.

What is the solution?

New research that focuses on the class imbalance problem in classification algorithms for Big Data!

Afshin Ashofteh
