Class Imbalance in Fault Detection Benchmarks


Because faults are rare yet important, fault detection in engineering systems is a typical problem of learning from imbalanced data. We examined how well existing online class imbalance methods perform on fault detection applications from real-world projects. The chosen data sets are highly imbalanced. We designed and studied a series of practical scenarios, covering not only data streams that are constantly imbalanced but also data streams whose class imbalance status fluctuates over short periods.

The benchmarks used in these experiments - comprising three real-world data sets - are available under an open-source license. They were collected in real time from complex engineering systems.

The common features of the datasets are:

  1. the basic task is to discriminate between two classes, faulty and non-faulty;
  2. the faulty examples are the minority in the long run, but may arrive at high frequency within a short period of time, depending on the nature of the faults;
  3. the goal is to find faults accurately without degrading the performance on the other class; thus, a high G-mean (i.e. a good performance balance between classes) is desired for the obtained online predictor.
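
As a minimal sketch of the metric named in item 3, G-mean is commonly computed as the geometric mean of the recall on the faulty class and the recall on the non-faulty class. The function name `g_mean` and the label encoding (1 = faulty, 0 = non-faulty) are assumptions made for illustration; they are not taken from the benchmark files or from the cited publication.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of the recall on each of the two classes.

    Assumes binary labels; `positive` marks the faulty class (hypothetical encoding).
    """
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    recall_faulty = tp / (tp + fn)
    recall_non_faulty = tn / (tn + fp)
    return math.sqrt(recall_faulty * recall_non_faulty)


# A predictor that never reports a fault is 80% accurate here,
# yet its G-mean is 0 because recall on the faulty class is 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print(g_mean(y_true, y_pred))  # 0.0
```

The example illustrates why accuracy alone is misleading on imbalanced streams: ignoring the minority class entirely still yields a high accuracy, while G-mean drops to zero.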




For more details on the datasets and their usage scenarios, please refer to the following publication:

Wang, S., Minku, L.L., and Yao, X. (2013), "Online Class Imbalance Learning and Its Applications in Fault Detection", Special Issue of International Journal of Computational Intelligence and Applications (To appear)

Gearbox Benchmark


Gearbox is the fault detection competition data from the PHM Society in 2009. The task is to detect faults in a running gearbox using accelerometer data and information about bearing geometry. Data were sampled synchronously from accelerometers mounted on both the input and output shaft retaining plates. The original data contain more than one type of fault that can happen to the gears and bearings inside the gearbox. To simplify the problem, we select one type of gear fault - a chipped tooth, which occurs in the helical (spiral cut) gear with 24 teeth. The data set is thus a 2-class fault detection problem - the gear is either in good condition (non-faulty class) or has a chipped tooth (faulty class).

Download the Gearbox Benchmark



Smart Building Benchmark

Smart building is a 2-class fault detection data set that aims to identify sensor faults in smart buildings. The sensors monitor the concentration of a contaminant of interest (such as CO2) in different zones of a building environment, in case any safety-critical event happens. In this data set, the sensor placed in the kitchen can be faulty. A wrong signal can lead to improper ventilation and unfavorable working conditions.


Download the Smart Building benchmark




iNemo Benchmark

iNemo is a multi-sensing platform developed by STMicroelectronics for robotic applications, human-machine interfaces and so on. It combines accelerometers, gyroscopes and magnetometers with pressure and temperature sensors to provide 3-axis sensing of linear, angular and magnetic motion in real time, complemented with temperature and barometer/altitude readings. To avoid any functional disruption caused by signalling faults in iNemo, a fault emulator was developed for producing and analysing different types of faults. A fault is defined as an unpermitted deviation of at least one characteristic property or parameter of the system from the acceptable/usual/standard condition. Given the real data sequence coming from the sensors, the emulator can introduce a fault into any sensor of iNemo. For this study, we generate offset faults in the gyroscope x-axis feature by adding an offset to its normal signal.
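
The offset fault described above can be sketched as follows. This is only an illustration under stated assumptions: the interface of the actual fault emulator is not described on this page, so the function name `inject_offset_fault`, the window parameters `start` and `end`, and the label encoding (1 = faulty, 0 = non-faulty) are all hypothetical.

```python
def inject_offset_fault(signal, start, end, offset):
    """Add a constant offset to signal[start:end] and label those samples faulty.

    Hypothetical helper: returns a new list of readings plus a 0/1 label per
    sample; the input sequence is left unchanged.
    """
    faulty = list(signal)
    labels = [0] * len(signal)
    # Clamp the window so out-of-range indices are ignored.
    for i in range(max(start, 0), min(end, len(signal))):
        faulty[i] = signal[i] + offset
        labels[i] = 1
    return faulty, labels


# Made-up gyroscope x-axis readings, used purely for illustration.
gyro_x = [0.0, 0.25, -0.25, 0.0, 0.25]
faulty_x, labels = inject_offset_fault(gyro_x, start=2, end=4, offset=0.5)
print(faulty_x)  # [0.0, 0.25, 0.25, 0.5, 0.25]
print(labels)    # [0, 0, 1, 1, 0]
```

Keeping the fault window short relative to the stream length reproduces the imbalance the benchmarks are concerned with: faulty samples are rare overall but arrive in a burst.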

Download the iNemo benchmark






We are committed to maintaining a public repository of i-Sense benchmarks in the spirit of cooperative scientific progress, as promoted by the digital open access guidelines for EU-funded research, which are oriented towards the dissemination of results and outcomes.

You are free to download a portion of the datasets for non-commercial research and educational purposes. Work based on the datasets should cite the reference papers listed with each benchmark.

All datasets and benchmarks on this page are published under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
