Data science has become the backbone of today’s intelligent world. From data acquisition and analysis to real-world deployment and transformation, the discipline is increasingly embedded across every sector of society. In response to this accelerating momentum, the University of Macau (UM) established the Centre for Data Science (CDS) as an interdisciplinary platform dedicated to advancing data-driven research and innovation. Through cross-disciplinary collaboration, UM brings together expertise spanning the entire data science pipeline and is committed to fostering a more sustainable and intelligent future.

Uncovering Patterns Through Large-Scale Data

Data is a critical resource driving social development, and data mining is the key technology that unlocks its value. Through advanced algorithms and modelling techniques, researchers can move beyond surface‑level information to uncover hidden patterns, reshaping how complex social systems are understood and analysed. A research team led by Cai Tianji, associate dean of UM’s Faculty of Social Sciences and professor in CDS, has applied data mining techniques to examine judicial decision‑making using a vast corpus of court judgments.

Since 2015, Prof Cai’s team has extracted publicly available, legally effective judgments (excluding cases unsuitable for online publication) from China’s People’s Courts at all levels via the China Judgments Online platform. To date, team members have collected more than 20 million judgments and conducted large-scale, systematic analyses to examine factors influencing sentencing and identify patterns of criminal behaviour across China. Drawing on this extensive dataset, team members have published more than ten papers in leading international journals, including the Journal of Quantitative Criminology and the International Journal of Drug Policy, covering topics such as human trafficking, drug offences, and sexual assault. These studies provide quantitative insights into sentencing practices across different categories of criminal cases.

For example, the paper ‘Paying Money for Freedom: Effects of Monetary Compensation on Sentencing for Criminal Traffic Offenses’, co-authored by Prof Cai and his former doctoral student Xin Yanyu (now a lecturer at the Research Institute of Social Development, Southwestern University of Finance and Economics), uses data mining to obtain over 140,000 valid sentencing documents as the basis for analysis. The study employs a joint modelling approach that treats both sentence length and probation as outcomes. To handle the distributional characteristics of sentence length, such as the absence of zero values and the concentration of observations at specific points, it adopts a zero-truncated generalised inflated Poisson model. The analysis reveals a potentially significant impact of monetary compensation on sentencing in traffic accident cases.
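To make the two distributional features concrete, the sketch below shows one common way to write a count distribution that has no zeros and carries extra probability mass at chosen values. It is an illustration only: the function names, the inflation points (6 and 12 months), and all numbers are assumptions made here, and the parameterisation used in the paper may differ.

```python
import math

def zero_truncated_poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for a Poisson(lam) variable conditioned on Y >= 1."""
    if y < 1:
        return 0.0  # zero (and negative) counts are excluded by construction
    log_p = y * math.log(lam) - lam - math.lgamma(y + 1)
    return math.exp(log_p) / (1.0 - math.exp(-lam))

def inflated_pmf(y: int, lam: float, inflation: dict[int, float]) -> float:
    """Mixture of point masses at selected values and a zero-truncated Poisson."""
    extra = sum(inflation.values())
    if not 0.0 <= extra < 1.0:
        raise ValueError("inflation weights must sum to a value in [0, 1)")
    base = (1.0 - extra) * zero_truncated_poisson_pmf(y, lam)
    return inflation.get(y, 0.0) + base

# Hypothetical example: sentence length in months, with extra mass at 6 and 12.
weights = {6: 0.15, 12: 0.10}
total = sum(inflated_pmf(y, lam=9.0, inflation=weights) for y in range(0, 200))
```

The probabilities sum to one over the positive integers, a zero-month sentence has probability zero, and the inflation weights let the model place more mass on round sentence lengths than a plain Poisson curve would allow.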

Transitioning from an era of ‘data scarcity’ to one of data-driven research has not been without challenges. Following several redesigns of the China Judgments Online platform, the team has continuously refined its web-scraping code and data extraction strategies to improve efficiency while ensuring legal compliance. To date, they have collected over 10 terabytes of data and upgraded their infrastructure from traditional hard drives to a more secure and scalable Network Attached Storage (NAS) system. Extracting meaningful structured information from massive volumes of unstructured legal text presents an additional technical challenge. For example, identifying prison terms requires the application of Named Entity Recognition (NER) techniques, followed by text processing and value normalisation to accurately extract sentencing information.
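The value-normalisation step can be pictured with a minimal sketch. It assumes that an earlier NER step has already isolated a sentencing span such as "有期徒刑三年六个月" (fixed-term imprisonment of three years and six months), and it converts that span to a number of months. The function names and the handling of numerals are hypothetical and are not the team's code; the sketch covers only Arabic numerals and Chinese numerals below 100.

```python
import re

# Chinese digit characters mapped to their values ("两" is a variant of two).
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

NUM = r"([0-9]+|[零一二两三四五六七八九十]+)"
YEARS = re.compile(NUM + r"年")      # e.g. "三年" (three years)
MONTHS = re.compile(NUM + r"个月")   # e.g. "六个月" (six months)

def parse_number(text: str) -> int:
    """Parse an Arabic numeral, or a Chinese numeral below 100."""
    if text.isascii() and text.isdigit():
        return int(text)
    if "十" in text:
        tens, _, ones = text.partition("十")
        return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0)
    return DIGITS[text]

def term_in_months(span: str):
    """Return the prison term in months, or None if no term is found."""
    years = YEARS.search(span)
    months = MONTHS.search(span)
    if years is None and months is None:
        return None
    total = 0
    if years is not None:
        total += 12 * parse_number(years.group(1))
    if months is not None:
        total += parse_number(months.group(1))
    return total
```

For instance, `term_in_months("有期徒刑三年六个月")` yields 42 and `term_in_months("拘役四个月")` yields 4, while a life sentence ("无期徒刑") has no numeric term and yields `None`. A production pipeline would also need to handle dates, combined sentences, and suspended terms, which this sketch does not attempt.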

Benefiting from UM’s open and collaborative research environment, the team has continued to refine its methodologies through academic exchange. Members are now exploring the use of advanced artificial intelligence (AI) technologies to further enhance text extraction, moving closer to their goal of building a comprehensive and reliable sentencing database to support future research.

Transforming Large-Scale Data Into Intelligent Decisions

At the intersection of social development and technological innovation, large-scale data processing provides an evidence-based foundation for collective decision-making and advanced research. By enabling data to be measured, modelled, and continuously refined, data processing expands the boundaries of human understanding.

A research team led by U Leong Hou, head of CDS, focuses on frontier challenges in large-scale data processing and intelligent decision-making. The team develops efficient methods for analysing and optimising structured, spatiotemporal, and graph‑structured data, while addressing scalability and data-efficiency bottlenecks in reinforcement learning for high-dimensional and dynamic environments. One representative contribution is LIBKDV, a kernel density visualisation library that enables high-resolution geospatial analytics on large-scale datasets.
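As background, kernel density visualisation colours each pixel of a map by the summed influence of nearby data points. The sketch below is a naive baseline written for illustration, assuming a Gaussian kernel and a rectangular grid of at least two pixels per side; it does not use or reproduce LIBKDV's API, and the function and parameter names are hypothetical.

```python
import math

def kernel_density_grid(points, bandwidth, width, height, bounds):
    """Evaluate an unnormalised Gaussian kernel density on a width x height grid.

    points: iterable of (x, y) pairs; bounds: (xmin, xmax, ymin, ymax).
    """
    xmin, xmax, ymin, ymax = bounds
    grid = []
    for row in range(height):
        y = ymin + (ymax - ymin) * row / (height - 1)
        line = []
        for col in range(width):
            x = xmin + (xmax - xmin) * col / (width - 1)
            # Every pixel sums a kernel contribution from every data point.
            density = sum(
                math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * bandwidth ** 2))
                for px, py in points
            )
            line.append(density)
        grid.append(line)
    return grid

# Hypothetical usage: one point at the centre of the unit square.
demo = kernel_density_grid([(0.5, 0.5)], bandwidth=0.2, width=3, height=3,
                           bounds=(0.0, 1.0, 0.0, 1.0))
```

The nested loops make the cost grow with the number of pixels multiplied by the number of points, which becomes expensive for high-resolution maps over large datasets and is the kind of bottleneck that efficient methods in this area aim to address.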

The team has integrated this work with intelligent decision-making technologies for data-intensive scenarios, including traffic signal control, transportation systems, and resource allocation. By reducing data requirements and computational costs, the project accelerates the deployment of reinforcement learning systems in real-world settings, contributing to advances in smart mobility and urban governance.

At the academic level, the team has made substantial contributions to algorithm design, theoretical analysis, and scalable system architecture, thereby establishing a forward-looking technical foundation for data-intensive intelligent decision-making systems. The team’s work has been recognised at leading international AI conferences, with multiple papers selected for spotlight distinction and oral presentation. Recent studies further extend to key topics such as cross-scenario generalisation and large-scale real‑world deployment.

The project has progressed through a multi-pronged strategy addressing two major challenges. The first is the high demand for computational resources. The team prioritises data-efficient and computation-aware algorithm design to reduce resource consumption while maintaining performance. It has also developed reusable software frameworks and modular components to enhance research productivity and manage engineering complexity. The second challenge is the shortage of interdisciplinary engineering talent. To address this, the team strengthens interdisciplinary training to cultivate researchers capable of bridging theory, systems, and implementation.

Prof U emphasises that UM’s institutional support—including sustained research funding, strong doctoral training programmes, and the advanced GPU infrastructure of the Super Intelligent Computing Centre—has been instrumental in the project’s progress. Looking ahead, the team will continue advancing the ‘AI for Data’ paradigm, integrating reinforcement learning with real-world data systems to optimise adaptive control strategies. A key focus will be traffic trend analysis and intelligent signal control in Macao, alongside broader validation of data-centric reinforcement learning frameworks in complex, dynamic environments.

Advancing Multilingual Intelligence Through Data Science

From academic scholarship to the preservation of ancient tribal songs, from the standardisation of multilingual medical research data to language mediation at emergency rescue sites, machine translation is quietly transforming the flow of knowledge and cross-cultural communication. Wong Fai, head of UM’s Natural Language Processing & Portuguese‑Chinese Machine Translation Laboratory (NLP2CT) and professor in CDS, draws on his expertise in computational linguistics and natural language processing to lead research focused on overcoming the challenges of translating low-resource languages in response to Macao’s multilingual needs.

Leveraging the laboratory’s research strengths, the team has developed Macao’s first Chinese-Portuguese machine translation system, PCT, as well as the online Chinese-Portuguese-English computer-aided translation platform (UM-CAT), which incorporates Macao’s linguistic and cultural context. These systems have been adopted by various departments of the Macao SAR government for official document translation, and the underlying technological innovations received the Macao Science and Technology Award in 2012 and 2022. In addition, UTran-i Technology Ltd., founded by graduates of NLP2CT and incubated by UM’s Centre for Innovation and Entrepreneurship, has developed a range of multilingual products and services in Portuguese, English, Mandarin, and Cantonese based on independently developed core technologies.

Behind these achievements lie more than 20 years of sustained research and development by Prof Wong and his team. As he explains, Macao, as a multilingual yet relatively low-resource environment, lacks large‑scale, high-quality parallel corpora for language pairs such as Chinese-Portuguese and Cantonese‑Mandarin. In addition, different professional domains exhibit fragmented terminology, stylistic variation, and distinct pragmatic conventions, all of which increase the complexity of translation tasks. Real-world applications further demand integration with multilingual, multimodal, and domain-specific knowledge bases, raising the technical threshold for system development. In this context, UM’s Institute of Collaborative Innovation promotes interdisciplinary collaboration across computer science, linguistics, and data science, enabling the team to mitigate data scarcity and expand the scope of its research. Coordinated internal and external research funding mechanisms have also provided sustained financial support for the team.

As language demands evolve, the team remains committed to positioning machine translation as a bridge for sharing Chinese culture with the world. This vision is reflected in the translation pipeline developed specifically for Chinese literature, which won first place in the literary translation task at the Conference on Machine Translation (WMT 2024). Looking ahead, NLP2CT will further leverage its expertise to develop high-precision detection tools for AI-generated content. By addressing emerging concerns about AI credibility, the team aims to promote the responsible application of AI in scientific research, education, and cultural tourism.

Innovation Beyond Constraints

Extracting patterns from vast datasets, enabling informed decisions from complex information, and creating opportunities within resource constraints require not only technological innovation but also vision and perseverance. Through forward‑looking research and sustained interdisciplinary collaboration, UM contributes to sustainable social development in the era of data intelligence.

Text: Bell Leong, UM reporter Zhang Tongyu

Photo: Editorial board, with some provided by the interviewees

English Translation: Bess Che

Source: UMagazine Issue 33
