SafetyPredictor predicts the safety level of driving behavior for drivers on online ride-hailing platforms.

It’s a project I worked on as a software engineer intern at Didi Chuxing in Beijing, China, where I was supervised by Yashu Liu.


1. Problems and Motivations

Driving safety is an important problem, closely tied to residents' daily lives and to the business of online ride-hailing companies. Data shows that human factors are involved in about 93% of traffic accidents and road factors in about 35%; the two overlap, since a single accident can involve both.

Thanks to Didi’s Big Data Platform, we are able to collect information on both drivers and roads. Our intuition is that certain information has a strong relationship with driving safety:

  1. Road information: road width, number of lanes, road speed limit, etc.
  2. Dangerous driving behaviors: rapid acceleration, sharp turns, speeding, etc.
  3. Combinations of the two: certain behaviors happening on certain roads. For example, rapid acceleration on a bridge may be more dangerous than on a normal road.

2. Aims

We want to build systems that:

  1. Collect raw data from a past time window;
  2. Generate numeric features from road information, dangerous driving behaviors, and the combination of the two;
  3. Build prediction models to predict future driving safety.

The systems run on millions of online ride-hailing orders on Didi’s Big Data Platform, so they must support:

  1. parallelism when generating features: Spark and Hadoop;
  2. parallelism when training models: XGBoost and LightGBM.

3. Methods and Solutions

We are applying for a patent on the algorithms. I will present the detailed algorithms once the patent is published (which should be soon). The following parts describe the algorithms at a high level.

1. Feature Engineering

The most complex part of the algorithms is combining road information with dangerous driving behaviors. It relies heavily on Spark RDDs, especially operations on key-value pair RDDs (since we want to generate features for each driver).

We generate numeric features like the average number of times a driver braked sharply on a road with a speed limit of 80 km/h in the most recent month.
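The per-driver aggregation described above can be sketched in a few lines. This is only an illustration in plain Python, not the patented algorithm: the record layout, the feature names, and the choice to average per order are all assumptions. On Spark, the counting step would correspond to a `reduceByKey` over a pair RDD keyed by driver and feature.

```python
from collections import defaultdict

# Hypothetical event records: (driver_id, behavior, road_speed_limit_kmh, order_id)
events = [
    ("d1", "sharp_brake", 80, "o1"),
    ("d1", "sharp_brake", 80, "o1"),
    ("d1", "sharp_brake", 80, "o2"),
    ("d1", "rapid_acceleration", 60, "o2"),
    ("d2", "sharp_brake", 80, "o3"),
]
# Hypothetical number of orders each driver completed in the time window
orders_per_driver = {"d1": 2, "d2": 4}


def combined_features(events, orders_per_driver):
    """Count each (behavior, road attribute) pair per driver, then average per order."""
    counts = defaultdict(int)
    for driver_id, behavior, speed_limit, _order_id in events:
        # Key-value step: key = (driver, combined feature name), value = count
        counts[(driver_id, f"{behavior}_on_limit_{speed_limit}")] += 1

    features = defaultdict(dict)
    for (driver_id, name), count in counts.items():
        # Assumption: "average" means per order in the window
        features[driver_id][f"avg_{name}_per_order"] = count / orders_per_driver[driver_id]
    return dict(features)


features = combined_features(events, orders_per_driver)
```

Keying on the pair (driver, feature name) keeps every combination independent, so the same pattern extends to other road attributes such as lane count or road width.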

We label drivers who have a traffic accident within a certain future time window as dangerous drivers (label 1). Drivers without traffic accidents in that window get label 0.
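The labeling rule can be sketched as follows. The function name, the accident record layout, and the dates are hypothetical, and the half-open window is an assumption made for the example.

```python
from datetime import date


def label_drivers(driver_ids, accidents, window_start, window_end):
    """Return {driver_id: label}: 1 if the driver had an accident in
    [window_start, window_end), otherwise 0."""
    dangerous = {
        driver_id
        for driver_id, day in accidents
        if window_start <= day < window_end
    }
    return {driver_id: int(driver_id in dangerous) for driver_id in driver_ids}


# Hypothetical accident records: (driver_id, accident_date)
accidents = [("d1", date(2018, 7, 3)), ("d3", date(2018, 9, 1))]

labels = label_drivers(
    ["d1", "d2", "d3"],
    accidents,
    window_start=date(2018, 7, 1),
    window_end=date(2018, 8, 1),
)
```

Here `d3` gets label 0 because its accident falls outside the label window, even though the driver does appear in the accident records.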

2. Prediction Model

In the end, we generate about 800 features for each driver, and there are about 5 million monthly active drivers.

Given the generated features and labels, we treat the prediction as a regression problem. We build our models in XGBoost (we also tried LightGBM) and apply parameter tuning (we wrote auto-tuning scripts, available here).
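As an illustration of what an auto-tuning loop can look like, here is a minimal exhaustive grid search. It is not the author's script: `grid_search` and `toy_evaluate` are hypothetical, and `toy_evaluate` is a stand-in for "train a model with these parameters and return its validation score". The parameter names `max_depth` and `learning_rate` are real XGBoost parameters, but the candidate values are made up.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and keep the highest-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    # Stand-in scoring function (assumption): a real script would train a
    # model with `params` and return a validation metric such as AUC.
    return 1.0 - abs(params["max_depth"] - 6) * 0.01 - abs(params["learning_rate"] - 0.1)


param_grid = {"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(param_grid, toy_evaluate)
```

Passing the evaluation step in as a function keeps the search loop independent of the model library, so the same loop could drive either XGBoost or LightGBM.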

4. Performance

The prediction AUC reached 0.803, and the Top 10% Recall reached 46.11%.

This means that the 10% of drivers our model ranks as most dangerous include about 46% of the drivers who go on to have traffic accidents. This matters for online ride-hailing companies, which can take action (such as targeted safety education) for those dangerous drivers.
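To make the two metrics concrete, here is a small sketch of how they can be computed. It assumes that Top 10% Recall means the share of all accident drivers that fall within the 10% of drivers with the highest predicted scores. The function names and the toy data are hypothetical, and the pairwise AUC is written for readability rather than for millions of drivers.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_fraction_recall(labels, scores, fraction=0.10):
    """Share of all positives found among the top `fraction` of scores."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(y for _score, y in ranked[:k])
    return hits / sum(labels)


# Toy data (hypothetical): 10 drivers, 2 of whom later had an accident
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.1, 0.2, 0.05, 0.4, 0.15, 0.25]

toy_auc = auc(labels, scores)
toy_recall = top_fraction_recall(labels, scores)
```

In this toy example the top 10% is a single driver, who did have an accident, so the recall is 1 of 2 accident drivers.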

Compared to the previous model, which used only dangerous behaviors, the AUC is 3 percentage points higher and the Top 10% Recall is 5 percentage points higher.

Related Technologies

  1. Spark and Hadoop: All the data lives on Didi’s Data Platform, which is based on Spark and Hadoop. We use Spark heavily to generate features.
  2. LightGBM and XGBoost: We use LightGBM and XGBoost to build our prediction models.
  3. Algorithms: We built the whole project from scratch. The algorithms cover a typical machine learning workflow: data cleaning, feature engineering, feature selection, model validation, and parameter tuning.

Programming Languages

  1. Java & Python: 90%. Most of the code consists of RDD operations in Java. The rest is the prediction models in Python.
  2. Bash: 5%. I wrote Bash scripts for testing.
  3. HQL: 5%. I wrote small HQL snippets in Hive to fetch raw data.

Links to Code

The auto-tuning scripts are now available here.