Week 7
Learning Objectives
- More data wrangling: removing reciprocal citations and self-citations from a data set
- Reciprocal citations occur when two papers cite each other, which is very difficult to do without collusion. In my data, I’ve observed this happening when an author is working on multiple papers at the same time and cites their other papers ahead of publication. Unfortunately, this creates a cycle in the directed citation graph, which shouldn’t occur; the Leiden algorithm counts these instances as duplicates and rejects the input network.
- One solution is to sort the data so that if A -> B and B -> A are both in the network, sorting leaves two instances of A -> B, at which point removing duplicates is easy. The problem with this solution is that detecting duplicates in advance is a challenge, so the method currently requires sorting the entire network, after which the result can no longer honestly be called a directed graph: if paper D cites paper C, the sorted data will show paper C citing paper D, which is not true. A sketch of a variant of this idea appears after this list.
- Working with code written by Franklin Moy at UIUC to calculate Bu et al.’s metrics
- I haven’t worked with code written by others before, so this is an exercise in reading documentation, understanding the parameters, and interpreting error messages as they arise. Franklin’s pipeline runs in three stages, and my files are being rejected at the second stage; I’m currently figuring out why.
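As a concrete illustration of the reciprocal-citation cleanup described above, here is a minimal sketch in Python with pandas. It is a slight variant of the sort-based approach: instead of sorting the whole edge list, it sorts each citing/cited pair only to build a duplicate-detection key, so the edges that survive keep their original direction. The toy DataFrame and the column names citing and cited are assumptions for illustration, not the actual format of my data.

```python
import pandas as pd

# Toy edge list with hypothetical column names ("citing" cites "cited");
# a real run would load the network from a file instead.
edges = pd.DataFrame({
    "citing": ["A", "B", "C", "D", "E"],
    "cited":  ["B", "A", "D", "D", "C"],
})

# Remove self-citations (a paper citing itself).
edges = edges[edges["citing"] != edges["cited"]]

# Build an order-independent key for each edge: A -> B and B -> A both map
# to ("A", "B"), so reciprocal pairs show up as duplicate keys without
# sorting (and thereby re-directing) the edge list itself.
key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(edges["citing"], edges["cited"])],
    index=edges.index,
)

# Drop the second edge of every reciprocal pair; unique edges are untouched,
# and the edges that remain keep their original direction.
deduped = edges[~key.duplicated(keep="first")]

print(deduped)
#   citing cited
# 0      A     B
# 2      C     D
# 4      E     C
```

Whether to keep one edge of each reciprocal pair or drop both probably depends on what the Leiden implementation expects; the sketch keeps one, mirroring the sort-then-remove-duplicates idea above.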
Journal Articles
Small, H. (2022). The confirmation of scientific theories using Bayesian causal networks and citation sentiments. Quantitative Science Studies, 3(2), 393–419. https://doi.org/10.1162/qss_a_00189
- Small (2022) aims to identify instances of the confirmation of a scientific theory using Bayesian probability methods and citing-sentence analysis. Citing sentences are examined via machine learning for sentiments pertaining to “evidence” or “uncertainty,” two terms used to indicate the degree of confidence that the evidence supports a given theory.
- I found this paper quite difficult to follow. I think this is likely because I’m unfamiliar with the theory behind it (Bayesian causal inference) as well as machine learning methodology.
- A section of the paper was dedicated to figuring out which words in a sentence relate to the theory and which relate to the evidence, since word order is not a foolproof indicator of causation. For example, passive voice uses an “X is caused by Y” construction while active voice uses “Y causes X.” Small argues that in some fields this can create ambiguity about which item is the cause and which is the effect. I suspect this could be the case when examining sentences from an unfamiliar field, but I’m skeptical that an expert in a field would be unable to tell cause from effect, even when passive voice is used.
- Another item of note is that Small includes several review papers in his set of “top 20 highly-cited papers.” Review papers are synopses of many different papers, so including them relies on the assumption that the review’s author both understood the objectives and results of the papers being reviewed and was able to communicate those findings effectively.
Other
UIUC Seminar
This week’s UIUC research seminar focused on academic writing and publishing, with Professors Reyhaneh Jabbarvand, Charith Mendis, and Tandy Warnow as speakers. Topics included the structure of a journal article, the different responses you may receive when you begin publishing (e.g., “accepted with minor revisions”), and some of the motivations one might have for publishing. One thing that stuck out to me was how important it is to write throughout the research process and keep notes on what you’ve done, why you did it, and what you found out. It’s hard to do in the moment, but when you rely purely on memory it’s easy to look back and miss details.
Struggles
This week I’ve been struggling to develop a research question that I’m happy with. I know the type of project and data I want to work with, but I’m just not sure what I want to do with it, what I can do with it, and whether there’s any overlap between the two. Since the end of the project is approaching, I’ll just have to pick something and work with it, even if it’s not the research idea of my dreams.