Week 6
Learning Objectives
- Clustering with Leiden, creating legible graphs
- The Leiden algorithm is used to cluster networks and, unlike its predecessor, the Louvain algorithm, guarantees that communities are connected. Creating clusters allows for the identification and observation of community structure within networks; items within a cluster are more likely to have commonalities than items in different clusters. This is important when examining citation networks, as clusters of citations are likely centered around a specific research question or topic. (A minimal sketch of running Leiden in Python appears after this list.)
- Randomizing a citation network
- Random networks, or null networks, are created as pseudo-data to compare results and methods against. The data are fake, but should resemble the original network as much as possible. For example, it is impossible for a paper published in 2021 to cite a paper published in 2022. My attempts at randomization focused on i) maintaining the year of publication for each cited paper and ii) maintaining the in-degree of each cited paper. This way, if a paper cites a paper from 1990, it is guaranteed to cite a paper from 1990 after randomization, and if a paper has been cited 20 times, it will be cited 20 times in the random network.
- The team discussed various options for creating citation graphs and weighed their pros and cons.
- We discussed Cytoscape, an open-source software platform built by biologists for visualizing networks. It is free to use but crashes frequently with exceptionally large networks. Being open source and built by volunteers, it lacks some of the power that commercial options might offer, but those can be expensive. We also discussed CiteSpace, but because CiteSpace was built by a single person, Chaomei Chen, the software cannot be maintained, updated, or documented as frequently or as thoroughly as software produced by a team.
- We also discussed why someone would visualize a network. Visualization is best used to answer a specific question; framing that question in advance makes it easier to filter, sort, and arrange the data beforehand, which cuts down on the computational power required to build the visualization and produces a cleaner, more legible image.
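As promised above, here is a minimal sketch of running Leiden clustering in Python with the igraph and leidenalg packages. The edge list and paper names are made-up placeholders, and the modularity partition type is an assumption rather than the exact configuration I used.

```python
# Minimal sketch: Leiden clustering of a small citation graph with
# python-igraph + leidenalg. Edges and names below are placeholders.
import igraph as ig
import leidenalg as la

# Directed edges of the form (citing paper, cited paper).
edges = [("p1", "p2"), ("p1", "p3"), ("p4", "p2"), ("p5", "p6"), ("p6", "p7")]
graph = ig.Graph.TupleList(edges, directed=True)

# Leiden guarantees that each returned community is internally connected.
partition = la.find_partition(graph, la.ModularityVertexPartition)

for community_id, members in enumerate(partition):
    print(community_id, [graph.vs[v]["name"] for v in members])
```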
Journal Articles
Bu, Y., Waltman, L., & Huang, Y. (2021). A multidimensional framework for characterizing the citation impact of scientific publications. Quantitative Science Studies, 2(1), 155–183. 10.1162/qss_a_00109
- Bu et al. propose four new measures for citation research: breadth, depth, dependence, and independence. These metrics are not an evaluation of the quality of a work, but simply a way of measuring impact beyond citation count, which on its own provides a very limited view of a paper's impact.
- Breadth and depth measure whether papers that cite paper A also cite each other. If they do, these papers are aware of each other and likely in the same research community, so paper A is said to have depth due to its deep impact on one community. If citing papers don't cite each other, paper A has breadth, as it has been cited by a range of research communities.
- Dependence and independence are measures of how many references are shared between paper A and its citing papers. If there is significant overlap, paper A is dependent, as citing paper A's references is essential to contextualizing it. If there is little or no overlap, paper A is independent of the previous literature. (A toy sketch of both ideas follows these notes.)
- The authors apply their metrics to i) all WoS articles published between 1980 and 2017 and ii) articles within that set determined, through algorithmic clustering, to be related to scientometrics. This approach seems unfocused: since the authors are introducing new metrics, how do they know their results are reasonable? They discuss four papers in the scientometrics literature, but why these papers? Would the results look different if the topic were more restricted?
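As a toy illustration of the intuition behind Bu et al.'s measures (not the authors' actual formulas, which are defined in the paper), the snippet below checks how many of a focal paper's citing papers cite one another and how many references they share with it. All of the data are invented.

```python
# Toy illustration of the breadth/depth and dependence/independence intuition;
# NOT the formulas from Bu et al. (2021). All data below are made up.
references = {
    "A": {"R1", "R2", "R3"},    # references cited by the focal paper A
    "C1": {"A", "R1", "R2"},    # citing papers and their reference lists
    "C2": {"A", "C1", "R9"},
    "C3": {"A", "X1", "X2"},
}
citing_papers = [p for p, refs in references.items() if "A" in refs]

# Depth-leaning signal: citing papers that also cite another citing paper.
linked = sum(any(other in references[p] for other in citing_papers if other != p)
             for p in citing_papers)

# Dependence-leaning signal: average overlap with A's own reference list.
overlap = sum(len(references[p] & references["A"]) for p in citing_papers) / len(citing_papers)

print(f"{linked} of {len(citing_papers)} citing papers cite another citing paper")
print(f"average shared references with A: {overlap:.2f}")
```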
Hug, S. E. (2022). Towards theorizing peer review. Quantitative Science Studies, 3(3), 1–17. 10.1162/qss_a_00195
- Hug (2022) calls for more theorizing on the topic of peer review, arguing that previous studies have spent too much time identifying biases and problems with peer review and not enough time discussing why these problems occur or how to address them. Hug argues that peer review needs a unifying theory to address these problems, but it remains unclear how this could be achieved or why.
Things I Made
- I wrote a report on graphing networks and creating clusters with the Leiden algorithm and my results doing so, which can be found here. My code for clustering the networks can be found here.
- Although I didn't end up using it in my report, I created this graph of my entire citation network! The nodes are filtered to include only papers cited more than 500 times (n=21) and their citing papers (a rough sketch of the filtering step is included below). Much of the graph is connected (barring a cluster on the right-hand edge), and some node clusters share many edges, indicating that they are more closely connected than clusters with few edges between them.
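Here is a rough sketch of how that filtering could be done from a citation edge list; the file name and column names are assumptions, not my actual pipeline.

```python
# Rough sketch: keep only papers cited more than 500 times and the papers
# that cite them. The file name and column names are assumptions.
import pandas as pd

edges = pd.read_csv("citations.csv")          # hypothetical columns: citing, cited

# In-degree = number of times each paper is cited.
in_degree = edges["cited"].value_counts()
hubs = set(in_degree[in_degree > 500].index)  # the highly cited papers

# Keeping only edges that point at a hub retains the hubs and their citers.
filtered = edges[edges["cited"].isin(hubs)]
filtered.to_csv("filtered_citations.csv", index=False)
```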
Other
Failed randomization strategies
- My first attempt at randomization involved storing all the citations in a dictionary keyed by year, of the form {publication_year: {'citation1', 'citation2'...}}, where the citations for each year were stored in a set. For each row in the data frame, I would use the publication year to retrieve the set of citations for that year, make a copy of the set, delete the entry matching the item I wanted to replace, then pull a random item from the set as the replacement. This approach maintained the publication year but not the in-degree of each node: because a set stores each cited paper only once, a paper cited 20 times had the same chance of being drawn as a paper cited once.
- My second approach substituted a list for the set, storing every citation for a given year and allowing duplicates for papers cited more than once. It would pull a random item from the list and, if the item matched the one I wanted to replace, call itself recursively until it picked an item that did not match; once it found a non-matching item, it deleted that item from the list so it couldn't be used again. This approach caused many indexing errors, as well as infinite loops when every remaining option in the list matched the item to be replaced. A sketch of a shuffle that avoids both problems appears below.
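For reference, here is a minimal sketch of the kind of shuffle described in the learning objectives above: within each cited-publication year, the cited entries (duplicates included) are permuted as a block, so every cited paper keeps both its year and its in-degree, without recursion or manual deletion. The column names are assumptions, and this sketch does not guard against an edge mapping back to its original target or against self-citations.

```python
# Minimal sketch: shuffle cited papers within each cited-publication year.
# A within-year permutation preserves both the year of each cited paper and
# the number of times it is cited (its in-degree). Column names are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def randomize_citations(edges: pd.DataFrame) -> pd.DataFrame:
    """edges has hypothetical columns: citing, cited, cited_year."""
    shuffled = edges.copy()
    for _, idx in edges.groupby("cited_year").groups.items():
        # Permute the 'cited' column within this year's rows only.
        shuffled.loc[idx, "cited"] = rng.permutation(edges.loc[idx, "cited"].to_numpy())
    return shuffled
```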
Written on June 24, 2022