2021eResearchReport - Flipbook - Page 21
Reclaiming Africa’s
histories through
machine learning
Artificial intelligence and machine
learning have a built-in Western
bias because of the source of their
training data. Drawing on data from EMANDULO,
postgraduate researcher Jarryd Dunn is
using UCT’s High Performance Computing
(HPC) facilities to train machine-learning
systems in isiZulu to make them better
versed in languages of the Global South.
EMANDULO is a digital research platform
(currently in a testing phase) developed
by UCT’s Archive and Public Culture
(APC) research initiative. The APC byline
is that there can be no transformation of
knowledge without interrogation of the
archive. EMANDULO and the APC’s other
platform, the Five Hundred Year Archive
(FHYA), challenge an enduring archival
focus on European colonialism with a
focus on archival matter relevant to the
five hundred years of southern African
history before European colonialism.
EMANDULO prioritises material in African
languages and uses digital innovations to
challenge Eurocentric assumptions built
into the very “DNA” of archiving, and into
its gold-standard archival software.
“We re-curate archival material online
in novel ways to open it up to new
kinds of historical enquiries. We do this
both in how we identify and present
archival content, and through critical
software innovation, achieved through a
dynamic partnership with digital archive
researchers in computer science,” says
Debra Pryor, FHYA Archival Content
Manager.
A practical issue when accessing an
archive like this is ambiguous names
contained in the archived materials.
This can be due to people having the
same names as well as one person being
referred to by different names.
Jarryd Dunn, who is doing his master’s
in Data Science with a dissertation in
computer science, is using the HPC
facilities to try to address this issue for
the FHYA and other local archival efforts.
“Part of the thinking with using the
FHYA data is that it is a useful problem to
solve,” says Dunn.
On the HPC facilities, Dunn works with
transformer-based language models,
a kind of deep-learning tool that
can be pre-trained on large datasets to
understand natural language texts. These
models work by capturing the context
surrounding a word to identify what that
word means. The representations of
natural language that these models
produce can then be used to build
a Named Entity Disambiguation (NED)
system, which resolves ambiguous names
appearing in the text.
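The idea of using surrounding context to decide who a name refers to can be illustrated with a toy sketch. This is not Dunn's system: a simple bag-of-words count stands in for the contextual representation a transformer model would provide, and the name, the two candidate people and their descriptions are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict


def context_vector(text: str) -> Counter:
    """Word counts for a passage: a crude stand-in for a transformer embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(mention_context: str, candidates: Dict[str, str]) -> str:
    """Return the id of the candidate whose description best matches the context."""
    mention_vec = context_vector(mention_context)
    return max(
        candidates,
        key=lambda cid: cosine(mention_vec, context_vector(candidates[cid])),
    )


# Hypothetical knowledge base: two different people who share one name.
CANDIDATES = {
    "person_A": "Sipho, a trader who sold cattle along the coast",
    "person_B": "Sipho, an interpreter at the magistrate's court",
}

# Hypothetical sentence from an archival text containing the ambiguous name.
MENTION = "Sipho translated the testimony at the court."

BEST = disambiguate(MENTION, CANDIDATES)
print(BEST)
```

In a real NED system the word counts would be replaced by learned representations, which can also match contexts that share meaning but not exact words.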
“Here, dealing with a name is very
helpful because context tells you a lot
about it. For instance, if you get a letter in
your postbox with your first name on it,
you know it is for you, whereas a letter at
the post office with only your first name
does not mean very much at all.”
While it is very straightforward to use
these language models on English data,
it becomes much trickier if working on
languages like isiZulu.
“All these language models really are is
a statistical representation of language,”
explains Dunn. “And they are really good
because of the sheer volume of English
data available online. isiZulu, for instance,
is much more challenging because there
is almost no data available to train the
language models. These are referred to as
low-resource languages.”
In practice, Dunn is
training the system on a small part of
the FHYA data and then liaising with
historical experts to feed the nuanced
rules for local name recognition into the
system, to determine how these rules
might make up for the lack
of training data. He and his colleagues
hope that the same system will perform
well on a larger data set. To do this, he is
using a Python library called PyTorch as
the primary framework. This allows him
to run code on a Graphics Processing Unit
(GPU), which greatly speeds up training.
“The GPU gives you a massive speedup
especially for the transformer models,”
explains Dunn. “It is maybe 10 to 30 times
faster, running on a GPU compared to
running it on a CPU [Central Processing
Unit]. And that is a lifesaver.”
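As a rough illustration of what running code on a GPU looks like in PyTorch, the sketch below picks a GPU when one is visible and otherwise falls back to the CPU. It is a minimal example under assumptions, not Dunn's code: the helper names pick_device_name and demo_matmul are hypothetical, and the fallback for a machine without PyTorch installed is added only so the snippet runs anywhere.

```python
import importlib.util


def pick_device_name() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


def demo_matmul(size: int = 256):
    """Multiply two random matrices on the chosen device.

    Returns the shape of the result, or None when PyTorch is unavailable.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    device = torch.device(pick_device_name())
    a = torch.randn(size, size, device=device)  # tensors are created on the device
    b = torch.randn(size, size, device=device)
    return tuple((a @ b).shape)


DEVICE_NAME = pick_device_name()
print("Running on:", DEVICE_NAME)
```

The same pattern extends to models: a PyTorch module is moved with `model.to(device)`, and its input tensors must be on the same device.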
UCT’s HPC facility
A wide array of researchers across
disciplines rely on UCT’s
High-Performance Computing (HPC) facility
for their research. The HPC team asks
researchers who use the facilities
to acknowledge UCT HPC in their
publications. In this reporting period UCT
HPC received approximately 20 such
acknowledgments. These include:
Malaria control interventions in Ghana
This research focused on developing
population-level mathematical models for
malaria transmission for three different
malaria transmission zones in Ghana.
All three models were calibrated using
five malaria-related data sets, including
diagnosed uncomplicated malaria,
severe malaria, malaria in pregnancy and
malaria-attributable deaths in children
and adults. Each of these case series
spans the period from 2008 to 2017, with
a prediction period from 2018 to 2030.
For each model, 11 parameters
were simulated and 15,000 iterations
were performed for each intervention
scenario that was tested for each zone.
The interventions tested were
long-lasting insecticide-treated bednets,
indoor residual spraying,
seasonal malaria chemotherapy, and
mass screening and treatment, first as
single interventions and then in
combination. In all, 76 scenarios were
investigated using the three models, one
for each of the three zones.
The findings of the research point to
various scenarios under which malaria
incidence in Ghana could be reduced to
the point where it is no longer a public
health concern, if not eliminated. These
results will help policy makers decide
which interventions to deploy in which
regions, so that limited resources are
used optimally to reduce
morbidity in the population.
Awine T, Silal S. (2020 Nov 23). Accounting
for regional transmission variability and the
impact of malaria control interventions in Ghana:
a population level mathematical modelling
approach. doi.org/10.1186/s12936-020-03496-y
Supporting reproductive health of African
women
Common wisdom is that Lactobacillus
species are a hallmark of healthy female
genital tract bacterial communities.
They can promote various aspects of
cervicovaginal health by reducing the
risk of bacterial vaginosis (BV), a vaginal
disorder, and sexually transmitted
infections (STIs).
Most African women lack
Lactobacillus-dominated cervicovaginal
microbiota (CVM). Instead, they have
high-diversity CVM that are known to
be associated with BV and STIs such
as cancer-causing (high-risk) human
papillomavirus (HR-HPV), which is a global
health concern. Among the few African women
with Lactobacillus-dominated CVM, the L.
iners type, a less protective CVM, is the
most prevalent. CVM functions remain
poorly characterised, yet we know that
functional profiling of the CVM is vital
for investigating human host-microbiota
interactions in health and disease.
In this study, we therefore investigated
the functional potential of CVM of
75 African women with and without
lactobacilli (L. iners) dominance, BV, and
HR-HPV infection.