Product Announcement

The Challenges of Developing a Mandarin Model for AI Crisis Detection


The project at a glance

Hi! I’m Sky Sun, one of the Data Intelligence Managers here at samdesk. Today I’m excited to announce the launch of our native Mandarin event detection models. Here is a bit about the project and our journey to launch.

China has the world's second-largest economy by nominal GDP, behind only the U.S., and the largest by purchasing power parity (PPP).

For many people, monitoring the Greater China area is critical because of its importance and unique position in the global supply chain, geopolitics, and more. However, it remains challenging to obtain proper situational awareness during crises within China, for a number of reasons.

While most analysts rely on traditional media monitoring, an increasing number look to Open Source Intelligence (OSINT) to understand what is happening to their supply chains and other interests in the area. If we separate sources of information into two categories, media and OSINT, we can see how they play out differently across the landscape of China:

1. Media Monitoring

News and media in China can be biased and used as a vehicle to disseminate information that is favorable to the government and its ideology. Official news sources are also not the fastest way to obtain breaking news in China: social media dominates information sharing, with over 930 million people actively using social media platforms in China as of January 2021.

2. OSINT Monitoring

China puts a tremendous amount of energy into protecting and advocating for local technology and innovation. E-commerce companies such as Alibaba and Jingdong dominate online shopping, while mobile payment apps such as WeChat Pay and Alipay have become the main method for day-to-day transactions. A similar protective approach can be seen in the realm of social media. While much of the western world uses Facebook and Twitter, China has adopted its home-grown Weibo and WeChat platforms. According to the 2020 Weibo Users Development Report, Weibo reached over 500 million users in September 2020, while WeChat reported 1.2 billion monthly active users in its 2020 Financial Statement. As these figures show, the two platforms dominate the social media industry and have become a ubiquitous part of daily life in China. While these locally run sites are important to monitor, extracting data from them is difficult to automate. Other social media platforms, key news sites, and blogs offer insights into the larger region.


How we're supporting greater coverage in Asia

Here at samdesk, we support an ever-growing number of companies, supply chains, and travelers in the Asia region as a whole. We’ve been hard at work expanding our language support and coverage in Asia, and more specifically in Greater China.

Our goal at samdesk is to alert our clients to crisis events all over the world, helping them make confident decisions that can have the greatest impact. That’s why today we’re excited to announce that our reach will extend even further to include enhanced coverage in China with the launch of our native support for Mandarin.

Taking on Mandarin was a huge task for our Data Intelligence and Machine Learning teams. Languages are complicated and ever-evolving. Add in the complexities of the Chinese media and social media landscape described above, and an artificial intelligence system quickly comes up against some challenges. As project lead on this initiative, I wanted to share some of the challenges we faced and, ultimately, how we’re supporting greater coverage.

Source Types

One of the biggest challenges is the triangulation of information. At samdesk, we are proud to provide speedy and accurate alerts to clients, and we hold ourselves to the highest possible standards with every new language we add. For languages that are popular on social media, such as English, the abundance of information lets us corroborate and confirm the legitimacy of a crisis. That becomes a challenge for Mandarin, which sees relatively little usage on western social media. We combat this by integrating a variety of sources, from social media to the web to RSS feeds (with more on the way). A combination of data sources helps corroborate data that might be sparse in a particular region - ensuring that we not only catch more critical events but also validate them with higher confidence.
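To make the idea of corroboration concrete, here is a minimal Python sketch. It is an illustration only, not a description of samdesk's actual pipeline: the `Report` type, the `event_id` grouping key, the source type labels, and the `min_source_types` threshold are all assumptions made up for this example. It treats an event as corroborated once it has been reported by at least two different kinds of source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    event_id: str     # hypothetical key grouping reports about the same incident
    source_type: str  # hypothetical labels, e.g. "social", "web", "rss"


def corroborated_events(reports, min_source_types=2):
    """Return ids of events seen in at least `min_source_types` distinct source types."""
    source_types_by_event = defaultdict(set)
    for report in reports:
        source_types_by_event[report.event_id].add(report.source_type)
    return sorted(
        event_id
        for event_id, source_types in source_types_by_event.items()
        if len(source_types) >= min_source_types
    )


reports = [
    Report("factory-fire", "social"),
    Report("factory-fire", "rss"),
    Report("road-closure", "social"),
    Report("road-closure", "social"),  # same source type twice: not corroborated
]
confirmed = corroborated_events(reports)
print(confirmed)
```

In this toy data, only the event reported by two different source types passes the check. Repeated mentions from a single source type do not count as corroboration, which is the point of combining sources when any one of them is sparse.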


Spoken Language Detection

Another challenge we came across during the development phase was language detection itself. When the Chinese language is discussed, it is usually reduced to Simplified Chinese and Traditional Chinese (the two versions of Chinese that Twitter recognizes). While these two written forms are recognized, what’s often neglected is how Chinese is spoken, and spoken language is what truly matters when it comes to language model training. Our model learns from training data and builds an understanding of how people talk about a topic, so we needed a way to distinguish how Chinese is spoken. Spoken Chinese has hundreds of local varieties, among which Mandarin and Cantonese are the two dominant ones. When we realized Twitter doesn’t differentiate between Mandarin and Cantonese, we had to get creative and configure this on our own. So for this launch, we want to define it more accurately and be transparent: what we launched is Mandarin in the simplified script, which is the form most widely used in Mainland China among the Han ethnicity. We will continue to develop Mandarin in the traditional script commonly used in Taiwan, and will develop Cantonese in the traditional script commonly used in Hong Kong and Guangdong province.
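As a rough illustration of why script and spoken variety are separate questions, here is a small Python sketch. It is a toy heuristic, not the approach samdesk uses: the character sets are a tiny hand-picked sample, the function names and the `min_markers` threshold are hypothetical, and the text is assumed to already be known to be Chinese. It guesses the script from characters that differ between simplified and traditional forms, and guesses the variety from characters that mostly appear in written Cantonese.

```python
# Tiny hand-picked samples for illustration; a real system would need far more data.
SIMPLIFIED_ONLY = set("这们国会说时来发")
TRADITIONAL_ONLY = set("這們國會說時來發")
CANTONESE_MARKERS = set("嘅咗喺冇佢哋嘢唔睇")  # characters typical of written Cantonese


def guess_script(text):
    """Guess 'simplified' or 'traditional' by counting script-specific characters."""
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified > traditional:
        return "simplified"
    if traditional > simplified:
        return "traditional"
    return "unknown"


def guess_variety(text, min_markers=2):
    """Guess 'cantonese' if enough marker characters appear, else assume 'mandarin'."""
    markers = sum(ch in CANTONESE_MARKERS for ch in text)
    return "cantonese" if markers >= min_markers else "mandarin"


samples = [
    "他们说工厂发生了火灾",  # Mandarin, simplified script
    "他們說工廠發生了火災",  # Mandarin, traditional script
    "佢哋話工廠發生咗火警",  # Cantonese, traditional script
]
results = [(guess_script(s), guess_variety(s)) for s in samples]
for sample, result in zip(samples, results):
    print(sample, result)
```

The second and third samples share a script but not a variety, which is the distinction a script-only label cannot capture.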

With the ever-increasing political and economic importance of China as a region, we’re excited to expand our ability to detect and cover critical events. A great example of our expanded coverage in Mandarin is a recent event that the samdesk platform picked up in Shenzhen, Guangdong on June 1, 2021. We alerted clients only 3 minutes after the first mention of a fire and explosion at the Foxconn factory where parts for leading electronics brands are manufactured. Our AI detection system not only picked up on the tweet, written in the simplified script used in Mainland China, but was also able to corroborate details quickly by natively ingesting a real-time video of the incident. Western and international news outlets didn’t start picking up this event until almost an hour after we triggered an alert.


Are you a current samdesk customer?

To set up a China-based Samdesk Stream, simply head over to your Stream settings and set the location to China, Taiwan, and Hong Kong, then let our system do the rest! We hope this launch enriches your overall experience with Samdesk Alerts!

About the Author
Sky Sun is a Data Intelligence Manager at samdesk.