A bipartite network of an author group and research topics
In the spring of 2023, I gave the job talk for what would end up being my current faculty appointment at the University of Waterloo. At the end of my presentation, I had a final section showing a plot like this one:
This is a bipartite network with two types of entities: the faculty members of the School of Public Health Sciences (in blue), plus myself (in green), and a collection of research topics (in red) derived from the abstracts we published over the last 10 years. The nodes of the network represent the faculty members and the research topics, and the edges (the links between nodes) mark whether a given faculty member is associated with a research topic based on their published work.
The names of the faculty have been de-identified, but I used their real names in my presentation.
I hope the goal of having something like this is clear. First, I did this for myself, wanting to know where and how I belong within the faculty group (in terms of my research). But then I thought it would be a great part of my presentation to show how I fit, what new things I bring, and my skills, in a fun and original way. I remember getting good laughs and giggles when I reached this part of the talk. I wanted to post how I did it in case it is helpful to anyone. So, let's get coding!
This is what we need: a list of authors with their Google Scholar IDs, the PubMed records of their papers, and a handful of R packages for text mining, topic clustering, and network plotting.
A note on Large Language Models: when I was preparing this for my talk, I didn't have the time to learn how to set up an LLM for this (I know now!), but you can definitely use one to group papers and extract research topics, too.
Here is the list of all the libraries I used:
library(scholar) # For parsing Google Scholar information from authors
library(tidyverse) # For data wrangling and other tasks
library(easyPubMed) # For querying PubMed
# dealing with text and topic modeling
library(textclean) # useful text cleaning functions
library(tm) # text mining functions such as data structures
library(tidytext) # more text cleaning and mining functions
library(topicmodels) # for topic modeling functions
library(textmineR) # further text mining functions
# plotting
library(igraph) # Extensive graph library
library(ggnetwork) #ggplot extension for graph plotting
# Parallel
library(parallel) # for parallel processing
library(pbapply) # for pblapply(), a parallel lapply with a progress bar
We start by extracting the list of papers for each author we want. The raw_df dataset was created manually and contains each individual's name, research interest, and Google Scholar ID. I collected this information by painstakingly going through each faculty member's profile!
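The raw_df table itself isn't shown in the post. As a rough sketch of its assumed structure: the name and g_scholar columns are implied by the code below, while the interest column name and the rows are hypothetical placeholders.
# Rough sketch of raw_df (hypothetical rows; "interest" is a guessed column name,
# "name" and "g_scholar" are the columns used in the code below)
raw_df <- tribble(
  ~name,       ~interest,            ~g_scholar,
  "Faculty A", "epidemiology",       "AAAAAAAAAAAA", # Google Scholar user ID
  "Faculty B", "health informatics", NA              # NA if no Scholar profile
)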
# Filter the raw_df dataset containing names, research interests, and Google IDs.
g_id_df<-raw_df%>%
filter(!is.na(g_scholar))
# Query Google Scholar for the publication list of each author
pubs<-lapply(g_id_df$g_scholar, get_publications)
# Add names of the author to each df
pubs_name<-lapply(1:length(pubs), function(i){
pubs[[i]]$name<-g_id_df$name[i]
pubs[[i]]$id<-g_id_df$g_scholar[i]
pubs[[i]]
})
# collapse to a single dataset and save in disk
pubs_df<-bind_rows(pubs_name)
saveRDS(pubs_df, file = "pubs_df.Rds")
Now, we can extract the information for each paper. Google Scholar rate-limits requests quite aggressively, which makes it difficult to pull everything per author that way. However, we can use the PubMed API to query paper by paper.
# Filter for the last 10 years (from 2013 to 2023) to keep the network relevant
pubs_df_f<-pubs_df%>%
filter(year>=2013)
# Create a cluster of parallel workers (14 cores here; adjust to your machine)
cl<-makeCluster(14)
parallel::clusterEvalQ(cl, expr = {
library(easyPubMed)
})
clusterExport(cl, varlist = "pubs_df_f")
# Iterate the paper list in parallel to extract all information (title, abstract, etc) from PubMed
pubmed_data<-pblapply(1:nrow(pubs_df_f), function(i){
pubmed_ids<-get_pubmed_ids_by_fulltitle(pubs_df_f$title[i])
if(pubmed_ids$Count==0){
return(NA)
}else{
fetch_pubmed_data(pubmed_ids, format = "medline" )
}
}, cl = cl)
# Save for later use. This process can take a while. Saving the results so we don't need to rerun the API calls
saveRDS(pubmed_data, "pubmed_data.Rds")
Now, we can process each abstract. We do two things: first, isolate the abstract for each paper; second, clean the text to facilitate downstream analysis. Text cleaning is a common task in natural language processing; for example, we transform all text to lowercase. Both the extraction function (extract_abs()) and the text-cleaning function (clean_text()) are shown at the end of the post.
Extract abstracts
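The extraction step itself isn't shown in the original code. A minimal sketch, assuming pubmed_data is the list of MEDLINE records saved above (one element per row of pubs_df_f) and using the extract_abs() function defined at the end of this post:
# Apply extract_abs() to each MEDLINE record; papers not found on PubMed
# (stored as NA above) simply yield NA abstracts
pubmed_data<-readRDS("pubmed_data.Rds")
pubs_df_f$clean_abs<-unlist(lapply(pubmed_data, extract_abs))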
Clean the abstracts and the titles
pubs_df_f$clean_abs<-clean_text(pubs_df_f$clean_abs)$t
pubs_df_f$clean_title<-clean_text(pubs_df_f$title)$t
saveRDS(pubs_df_f, "pubs_df_f.Rds")
Now, we need to add my abstracts to the list. I could have added my Google Scholar information in the process above and processed my papers like every other author's, but I did not.
ATE_pubs_ids<-get_pubmed_ids("abel torres espin")
ATE_pubmed_data<-fetch_pubmed_data(ATE_pubs_ids, format = "medline")
# get PMIDs
starts<-which(str_detect(ATE_pubmed_data, "^PMID"))
pmid<-ATE_pubmed_data[starts]
#extract each paper's MEDLINE record (from its PMID line to the line before the next PMID)
starts<-c(starts, length(ATE_pubmed_data)+1)
ATE_pubmed_data_list<-list()
for(i in 1:(length(starts)-1)){
ATE_pubmed_data_list[[i]]<-ATE_pubmed_data[starts[i]:(starts[i+1]-1)]
}
# Save my records separately (read back in the next chunk)
saveRDS(ATE_pubmed_data_list, "ATE_pubmed_data_list.Rds")
pubmed_data<-readRDS("pubmed_data.Rds")
pubmed_data<-append(pubmed_data,ATE_pubmed_data_list)
# Save again with the addition of my abstracts
saveRDS(pubmed_data, "pubmed_data.Rds")
Adding my information to the pubs_df_f dataframe:
ATE_pubmed_data_list<-readRDS("ATE_pubmed_data_list.Rds")
ATE_pub_df<-data.frame(name = "Abel Torres Espin",
clean_abs = unlist(lapply(ATE_pubmed_data_list, extract_abs)),
pubid = pmid)
ATE_pub_df$clean_abs<-clean_text(ATE_pub_df$clean_abs)$t
pubs_df_f<-bind_rows(pubs_df_f, ATE_pub_df)
saveRDS(pubs_df_f, "pubs_df_f.Rds")
Now that we have all the data that we need, we can start the process of clustering all abstracts given their similarities.
# Read previous dataframe
pubs_df_f<-readRDS("pubs_df_f.Rds")
# Filter out correction notices (we keep rows with missing abstracts and fill them from the titles below)
pubs_df_clustering<-pubs_df_f%>%
filter(!str_detect(clean_abs, "this corrects the article")|is.na(clean_abs))
#If the abstract is empty, use the title. Also, remove some very common words:
#these appear in almost every abstract, so they would drive the whole clustering
#solution without adding much information
pubs_df_clustering<-pubs_df_clustering%>%
mutate(clean_abs2 = ifelse(is.na(clean_abs), clean_text(title)$t, clean_abs),
clean_abs2 = removeWords(clean_abs2, c("health","care",
"use","studi", "data", "associ", "result",
"research")))
# Drop abstracts that appear to be in French (based on common French articles/prepositions)
pubs_df_clustering<-pubs_df_clustering%>%
filter(!str_detect(clean_abs2, " des | de | la "))
Now, we can create a document term matrix, a data structure that stores the token counts for each document. In this case, we use an n-gram window of 1, which means each token corresponds to a single word.
# create document term matrix (tokens)
set.seed(55)
dtm_clean_abs<-CreateDtm(pubs_df_clustering$clean_abs2,pubs_df_clustering$pubid,
ngram_window = c(1,1),
stopword_vec = stopwords("en"),
verbose = F,
stem_lemma_function = function(x)
SnowballC::wordStem(x, "en"))
## Save so that this doesn't need to run every time
saveRDS(dtm_clean_abs, "dtm_clean_abs.Rds")
We create a frequency matrix counting how many times each token shows up across abstracts, do some further filtering, and compute the TF-IDF-weighted cosine similarity between abstracts. This measures how similar two documents are given the frequency of the words they use, with the idea that related abstracts tend to use similar words. Finally, we perform hierarchical clustering on the distance (1 - similarity).
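For reference, if $v_i$ and $v_j$ are the TF-IDF vectors of two abstracts $d_i$ and $d_j$, the quantities computed below are
$$\text{sim}(d_i, d_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}, \qquad \text{dist}(d_i, d_j) = 1 - \text{sim}(d_i, d_j)$$
In the code, each row of the TF-IDF matrix is normalized to unit length first, so the matrix product csim %*% t(csim) gives all pairwise cosine similarities at once.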
dtm_clean_abs<-readRDS("dtm_clean_abs.Rds")
tf_mat <- TermDocFreq(dtm_clean_abs)
## keep tokens that appear in more than one document but not in (nearly) all of them
## (663 is presumably close to the total number of abstracts)
tf_mat_f<-tf_mat%>%
filter(doc_freq>1 & doc_freq<663)%>%
filter(!term%in%c("n", "p", "patient",
"factor", "increas", "use", "intervent",
"studi", "result", ""))
# TF-IDF and cosine similarity
tfidf <- t(dtm_clean_abs[ , tf_mat_f$term ]) * tf_mat_f$idf
tfidf <- t(tfidf)
csim <- tfidf / sqrt(rowSums(tfidf * tfidf))
csim <- csim %*% t(csim)
# calculate distance
cdist <- as.dist(1 - csim)
cdist[is.na(cdist)]<-1
# Hierarchical clustering
hc <- hclust(cdist, method = "ward.D2")
plot(hc, labels = FALSE, hang = -1)
Now that we have clusters of documents based on their similarity, we can attempt to explain what each cluster means by checking the most frequent words in the documents for each cluster.
After extracting that list of words per cluster, I manually named each cluster according to the themes I considered appropriate from the list. You can see the five top-scoring words (by mean TF-IDF) per cluster below:
tfidf_df<-as.data.frame(as.matrix(tfidf))
# cut the dendrogram into 60 clusters
tfidf_df$clust<-cutree(hc, 60)
tfidf_df_sum<-tfidf_df%>%
group_by(clust)%>%
summarise(across(everything(), \(x) mean(x,na.rm = TRUE)))%>%
pivot_longer(cols = -clust)
tfidf_df_sum_f<-tfidf_df_sum%>%
group_by(clust)%>%
slice_max(value, n = 5)%>%
filter(value>0)
# Export the top words so the cluster names can be added by hand (read back below as tfidf_df_sum_f2.csv)
#write.csv(tfidf_df_sum_f, "tfidf_df_sum_f2.csv")
tfidf_df_sum_f %>%
mutate(name = reorder_within(name, value, clust)) %>%
ggplot(aes(value, name, fill = factor(clust))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ clust, scales = "free") +
scale_y_reordered()
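The tfidf_df_sum_f2.csv file read in the next chunk is this exported table after I added the cluster names by hand. The clust_name and clust_level2 columns are implied by the code that reads the file; the rows and labels shown here are purely hypothetical placeholders to illustrate the expected shape.
# Hypothetical illustration of the manually annotated file (placeholder labels only);
# the real clust_name / clust_level2 values are the topic names I assigned
cluster_names_example<-tribble(
  ~clust, ~name,   ~value, ~clust_name, ~clust_level2,
  1,      "word1", 0.12,   "Topic A",   "Broader theme A",
  1,      "word2", 0.10,   "Topic A",   NA,               # NA falls back to clust_name
  2,      "word3", 0.09,   "Topic B",   "Broader theme B"
)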
Now we can plot a graph by associating each document, and its author, with a cluster.
set.seed(123)
# Read the manually annotated cluster names and format the data for a network
cluster_names<-read.csv("tfidf_df_sum_f2.csv", na.strings = "")%>%
mutate(clust_level2 = ifelse(is.na(clust_level2), clust_name, clust_level2))%>%
group_by(clust)%>%
summarise(clust_name = unique(clust_name),
clust_level2 = unique(clust_level2))
pubs_df_clustering$clust <- cutree(hc, 60)
pubs_df_clustering2<-pubs_df_clustering%>%
left_join(cluster_names%>%
select(clust, clust_level2))%>%
filter(!is.na(clust_level2))
pubs_df_clustering2$name<-str_trim(pubs_df_clustering2$name)
net_df<-pubs_df_clustering2%>%
select(name, clust_level2)
# Extract a graph
g<-graph_from_data_frame(net_df, directed = F)
# Create the bipartite graph and format as a ggnetwork for plotting with ggplot
V(g)$type <- bipartite_mapping(g)$type
V(g)$size <- degree(g)
g<-simplify(g)
gg_net<-ggnetwork(g)
# Add further information to the network dataset (node types, sizes, de-identified names)
gg_net<-gg_net%>%
mutate(type2 = case_when(
name == "Abel Torres Espin"~"me",
type == FALSE~"name",
type == TRUE~"cluster"
),
size2 = ifelse(type == FALSE, 1, 1.1),
line_size = ifelse(type2 == "me", 0.15, 0.1))%>%
group_by(name)%>%
# De-identify author nodes by replacing each name with a random 5-letter code
mutate(name = case_when(
type == FALSE~paste0(sample(letters, 5, replace = T), collapse = ""),
TRUE~name
))
#Finally, plot the network
ggplot(gg_net, aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(aes(color = type2), alpha=0.4,
curvature = 0.1, linewidth = 0.2) +
geom_nodes(aes(color = type2, shape = type2))+
geom_nodelabel_repel(aes(label = name, fill = type2),
show.legend = F, force = 0.04,size = 6)+
scale_color_manual(values = c("grey","green3", "grey"))+
scale_fill_manual(values = c("lightpink2", "green3", "steelblue1"))+
theme_blank()+
theme(legend.position = "none")
# ggsave(filename = "bipartite_network.png", width = 16, height = 12)
Here is another way to plot the same network, using a bipartite layout.
gg_net<-ggnetwork(g, layout = igraph::layout_as_bipartite(g))
gg_net<-gg_net%>%
mutate(type2 = case_when(
name == "Abel Torres Espin"~"me",
type == FALSE~"name",
type == TRUE~"cluster"
),
size2 = ifelse(type == FALSE, 1, 1.1),
line_size = ifelse(type2 == "me", 0.15, 0.1))%>%
group_by(name)%>%
mutate(name = case_when(
type == FALSE~paste0(sample(letters, 5, replace = T), collapse = ""),
TRUE~name
))
ggplot(gg_net, aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(aes(color = type2), alpha=0.4,
curvature = 0.1, linewidth = 0.2) +
geom_nodes(aes(color = type2, shape = type2))+
geom_nodelabel_repel(aes(label = name, fill = type2),
show.legend = F, force = 0.04,size = 4)+
scale_color_manual(values = c("grey","green3", "grey"))+
scale_fill_manual(values = c("lightpink2", "green3", "steelblue1"))+
theme_blank()+
theme(legend.position = "none")
And that is it! I hope this is useful to someone. I cannot say I got the job because of it, but I can tell you I got some good smiles in the room! I achieved what I wanted: an original way to show my skills, who I am as a researcher, and how I fit among all these incredible people.
# Function to extract the abstract (AB field) from a MEDLINE record
extract_abs<-function(x){
if (is.null(x)) return (NA)
start<-which(str_detect(x, "^AB -"))
if (length(start) == 0){
return(NA)
}
count = start + 1
# advance through the continuation lines of the abstract (they start with a space)
while (count <= length(x) && str_detect(x[count], "^ ")){
count = count + 1
}
abstract<-paste0(x[start:(count-1)],collapse = "")
abstract<-str_remove(abstract,"^AB -")
return(abstract)
}
# Function for cleaning
clean_text<- function(t){
t <- as.character(t)
t <- t %>%
str_to_lower() %>% # convert all text to lowercase
replace_contraction() %>% # replace contractions with their multi-word forms
replace_internet_slang() %>% # replace internet slang with standard words
replace_word_elongation() %>% # normalize informal word elongations
replace_number(remove = T) %>% # remove numbers
replace_date(replacement = "") %>% # remove dates
replace_time(replacement = "") %>% # remove times
str_remove_all(pattern = "[[:punct:]]") %>% # remove punctuation
str_squish() %>% # reduce repeated whitespace inside the string
str_trim() # remove whitespace from the start and end of the string
return(as.data.frame(t))
}
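As a quick illustration of what clean_text() does (the exact output may vary slightly across textclean versions):
clean_text("We enrolled 120 Participants in the study!")$t
# roughly: "we enrolled participants in the study"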