Week 12

Objectives

  • Submit my final report!
    • My final report is finished and can be found here. I’m pleased with how the project turned out and think there’s opportunity to expand on my results in the future and potentially create a manuscript out of it. It’s been a very intense but rewarding project to take on, and I hope to be able to develop it more over the fall semester.
  • One last figure
    • This was a shortened and writing-intensive week, so I don’t have much to report. However, I did make this figure this week, which compares the citation counts of Paul et al. (1963) and Kannel et al. (1964), two highly cited papers published around the same time, both focused on coronary heart disease research. The figure shows that both papers continued to receive citations over the course of twenty years (which is somewhat unusual, but not unheard of), with a sharp decline in citations received in 1984.

Week 11

Objectives

  • Making progress on finishing the final report
    • This week my objectives are less learning-focused and more task-focused. The items on the agenda are completing a rough draft of my final report and creating a conference-style presentation on it. George and I talked about potentially turning my final report into a manuscript, and about ways to archive my work in case I don’t. I still haven’t decided whether I’d like to turn my report into a manuscript; it may depend on the time I have available this next year as I complete the Master’s program in CS here at UIUC. However, I’d like to archive my data and code at the very least. The easiest option is to host it here, but there are other options to consider. George suggested looking into PubPub if I choose not to publish my work in a journal.

Week 10

Objectives

  • Refining figures to be used in my final report
    • The figure below shows the number of citations that Paul et al. (1963) earned per year over the course of its first 20 years. Citations received per year peaked in 1967, three years after publication, followed by peaks and valleys over the remaining years. Paul et al. (1963) received more citations than expected over the course of 20 years, as many papers stop receiving citations after 10 years (de Solla Price, 1965). Leng (2022) attributes the spike in citations in the mid-to-late 1970s to an increased interest in how alcohol consumption may contribute to the development of coronary heart disease, as Paul et al. (1963) was one of the first papers to report findings related to alcohol.
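A per-year citation count like the one plotted above can be computed directly from an edge list. A minimal sketch in Python, assuming a hypothetical edge shape of (citing_doi, cited_doi, citing_year); the DOIs and years below are made up for illustration:

```python
from collections import Counter

# Hypothetical edge list: (citing_doi, cited_doi, year the citing paper appeared).
edges = [
    ("10.1/a", "10.1/paul1963", 1964),
    ("10.1/b", "10.1/paul1963", 1964),
    ("10.1/c", "10.1/paul1963", 1967),
    ("10.1/d", "10.1/other", 1967),
]

def citations_per_year(edges, focal_doi):
    """Count how many citations focal_doi received in each year."""
    return Counter(year for _, cited, year in edges if cited == focal_doi)

counts = citations_per_year(edges, "10.1/paul1963")
```

The resulting Counter maps year to citation count and can be plotted as-is.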

Week 9

Objectives

  • Data Wrangling and Lo-Fi visualizations
    • This week I’ve been focusing on refining my research question and generating some lo-fi data visualization prototypes, as seen below.
    • I created this cluster graph, which I’m not sure I’ll use in my final report as I don’t believe it to be relevant to my research question(s) at this point. However, I think it’s a cool image, as it shows different clusters representing the different communities citing Paul et al.’s findings. Most of these clusters are linked by citations, indicating inter-community awareness.

Week 8

Objectives

  • Duplicates problem
    • As mentioned in the previous week’s entry, I was dealing with issues with duplicates in the data. I solved this by removing every instance where both A -> B and B -> A are present in the data. The ideal solution would be to keep only one instance, either A -> B or B -> A, but the approach I took excludes both. This is computationally simpler, and the number of duplicates in the histone data was small (n = 13,000 out of 250,000). However, due to changes in the project, I am no longer using this data set, so the duplication issue is no longer relevant (see Project below).
  • Segmentation Faults
    • The code I was using to calculate the metrics was failing with the error message “Segmentation fault: 11,” which I discovered comes up when the code tries to retrieve an item from memory that cannot be accessed. Because the message is so vague, I had a lot of trouble figuring out what the exact issue was. I found the line the code was failing on, a line that retrieves the in-degree value of a node, but still was unable to determine why. Franklin Moy, the author of the code, was able to diagnose the issue as a failure to properly handle singleton clusters. Previous versions of the code operated on the assumption that singleton clusters had been removed from the data, which did not occur to me to do in advance. The issue is now resolved, and I am able to move forward with data analysis.
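The reciprocal-duplicate filter from the first bullet (dropping both A -> B and B -> A whenever both appear) can be sketched in a few lines of Python; the edge list below is a made-up toy example, not the histone data:

```python
# Toy edge list: ("A", "B") means paper A cites paper B.
edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")]

def drop_reciprocal_pairs(edges):
    """Remove every edge whose reverse also appears in the data.

    Note this drops BOTH directions of a reciprocal pair, matching the
    computationally simpler approach described above, rather than keeping one.
    """
    edge_set = set(edges)
    return [(u, v) for (u, v) in edges if (v, u) not in edge_set]

cleaned = drop_reciprocal_pairs(edges)
```

Keeping exactly one direction instead would require picking a canonical orientation (e.g. keep the pair whose citing paper is older), which is why excluding both is the easier filter.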

Week 6

Learning Objectives

  • Clustering with Leiden, creating legible graphs
    • The Leiden algorithm is an algorithm used to cluster networks that guarantees connected communities, unlike its predecessor, the Louvain algorithm. Creating clusters allows for the identification and observation of community structure within networks; items within a cluster are more likely to have commonalities than items in different clusters. This is important when examining citation networks, as clusters of citations are likely centered around a specific research question or topic.
  • Randomizing a citation network
    • Random networks, or null networks, are created as pseudo-data against which to compare results and methods. The data are synthetic but should resemble the original network as closely as possible; for example, it is impossible for a paper published in 2021 to cite a paper published in 2022, so a realistic null network must respect publication years. My attempts at randomization focused on (i) maintaining the year of publication of each cited paper and (ii) maintaining the in-degree of each cited paper. This way, if a paper cites a paper from 1990, it is guaranteed to cite a paper from 1990 after randomization, and if a paper has been cited 20 times, it will be cited 20 times in the random network.
  • The team discussed various options for creating citation graphs and the pros and cons of different options.
    • We discussed Cytoscape, an open-source software platform built by biologists for visualizing networks. It is free to use but crashes frequently on exceptionally large networks. Being open source and built by volunteers, it lacks some of the power that commercial options might offer, but those can be expensive. We also discussed CiteSpace; because CiteSpace was built by a single person, Chaomei Chen, the software cannot be maintained, updated, or documented as frequently or thoroughly as software produced by a team.
    • We also discussed why someone would visualize a network. Visualization is best used to answer a specific question. That way, it’s easier to filter, sort, and arrange the data in advance, which can cut down on the computational power required to build the visualization, as well as produce a cleaner, more legible image.
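The randomization scheme described above (preserving each cited paper’s publication year and in-degree) amounts to shuffling the cited endpoints within each year stratum. A minimal sketch, using made-up edges and a hypothetical pub_year lookup:

```python
import random
from collections import defaultdict

# Toy data: (citing_doi, cited_doi) edges, and the cited papers' publication years.
edges = [("p1", "a"), ("p2", "a"), ("p2", "b"), ("p3", "c")]
pub_year = {"a": 1990, "b": 1990, "c": 1995}

def randomize_citations(edges, pub_year, seed=0):
    """Shuffle cited endpoints within each publication year.

    Because cited papers are only permuted among edges whose targets share a
    year, every cited paper keeps its year constraint AND its in-degree.
    """
    rng = random.Random(seed)
    by_year = defaultdict(list)  # year -> positions of edges citing that year
    for i, (_, cited) in enumerate(edges):
        by_year[pub_year[cited]].append(i)
    new_edges = list(edges)
    for positions in by_year.values():
        cited_papers = [edges[i][1] for i in positions]
        rng.shuffle(cited_papers)
        for i, cited in zip(positions, cited_papers):
            new_edges[i] = (new_edges[i][0], cited)
    return new_edges
```

A quick sanity check after running it: the multiset of cited papers (hence each in-degree) is unchanged, and every edge still points at a paper from the same year as before.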

Week 5

Learning Objectives

  • Finish histone data co-citation frequency counts and graphs.
    • Final product linked below under "Co-citation count table and rudimentary graphs"
  • Read and present Leng (2021), more information provided below.
  • Start on creating a citation network using the Leiden algorithm
    • The Leiden algorithm is used to create connected clusters within networks, but it takes an input file with a very specific format. My databases so far have been of the form [citing_doi, cited_doi], where each DOI is a string. The Leiden implementation takes a TSV file without a header or indices as input, and each node must be represented by an integer starting at 0. One of my challenges this week will be manipulating my data so that it’s accepted by the algorithm.
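The conversion described above (DOI strings to 0-based integer node IDs, written as a headerless, index-free TSV) can be sketched as follows; the DOIs are made up, and the TSV is written to a string here rather than a file for illustration:

```python
import csv
import io

# Toy (citing_doi, cited_doi) rows.
rows = [("10.1/x", "10.1/y"), ("10.1/x", "10.1/z"), ("10.1/y", "10.1/z")]

def relabel_for_leiden(rows):
    """Map each DOI to a consecutive integer starting at 0, in order of first appearance."""
    ids = {}

    def node_id(doi):
        return ids.setdefault(doi, len(ids))

    int_edges = [(node_id(a), node_id(b)) for a, b in rows]
    return int_edges, ids

int_edges, doi_to_id = relabel_for_leiden(rows)

# Write a headerless TSV with no index column.
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(int_edges)
tsv_text = buf.getvalue()
```

Keeping the doi_to_id mapping around is important: it is the only way to translate the algorithm’s integer cluster assignments back into DOIs afterward.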

Week 4

Learning Objectives

  • Present Uzzi et al. (2013)
  • Count the number of co-citation pairs in the histone data generated previously, then graph the co-citation network.
  • Create a script that will generate keyword summaries of graph clusters using frequently occurring title words. These summaries would provide insight into the topics defining a research cluster.
  • The team discussed research ethics and went over tricky but common ethical dilemmas in research.
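Counting co-citation pairs, as in the second objective above, can be sketched by collecting each citing paper’s references and counting every pair among them; the edge list is a toy example, and this is only one possible approach:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy (citing_doi, cited_doi) edges.
edges = [("p1", "a"), ("p1", "b"), ("p2", "a"), ("p2", "b"), ("p2", "c")]

def cocitation_counts(edges):
    """Count how often each pair of papers appears together in one bibliography."""
    refs = defaultdict(set)
    for citing, cited in edges:
        refs[citing].add(cited)
    pairs = Counter()
    for cited_set in refs.values():
        for pair in combinations(sorted(cited_set), 2):
            pairs[pair] += 1
    return pairs

pairs = cocitation_counts(edges)
```

Sorting each reference set before taking combinations ensures ("a", "b") and ("b", "a") are counted as the same pair. Note the pair count grows quadratically with bibliography size, so review papers with long reference lists dominate the runtime.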

Week 3

Learning Objectives

  • Data wrangling with PostgreSQL and Python: working with joins and data aggregation.
  • Creating a presentation on Uzzi et al. (2013).

Week 2

Learning Objectives

  • Filtering and aggregations in PostgreSQL, used to find citation counts
  • Creating CSV files in Linux; moving files between my machine and the team server.
  • Learning about the habanero Python library, which can be used to retrieve titles for a list of DOIs.

Journal Articles

Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404. https://doi.org/10.1002/asi.21419

  • This study aimed to identify the most accurate way to cluster over 2 million biomedical articles published between 2004 and 2008 by comparing the clustering results of direct citation, co-citation, bibliographic coupling, and a bibliographic coupling hybrid approach. “Accuracy” here was determined by “textual coherence” and a new measure introduced in this paper that uses grant funding information. The logic behind the new measure is that articles funded by the same grant are more likely to focus on similar topics than those that are not. The researchers concluded that the bibliographic coupling hybrid was most accurate, followed by bibliographic coupling. Given that the time window of papers included in the study was very small (5 years, if the boundaries are inclusive), I find this unsurprising, as bibliographic coupling is a static measure. Additionally, I am skeptical of the use of a new accuracy measure as a means to draw broad conclusions about the accuracy of bibliographic coupling and co-citation in general.

Week 1

Learning Objectives

  • Introduction to Scientometrics
  • Working in Linux
  • Intro to PostgreSQL and creating databases
  • Working with the Entrez API to extract metadata from PubMed (or other databases)

Journal Articles

The team discussed measures like co-citation and bibliographic coupling by reading “Networks of Scientific Papers” by Derek J. de Solla Price (1965) 10.1126/science.149.3683.510 and “Co-citation in the scientific literature” by Henry Small (1973) 10.1002/asi.4630240406.

Direct citation, co-citation, and bibliographic coupling are three ways to establish a connection between papers. Direct citations are simple: a direct citation is created when one paper cites another. A co-citation occurs when a single paper links two previous papers by including both of them in its bibliography. Bibliographic coupling links two newer papers that both cite the same earlier paper.

Some takeaways from these two papers: Bibliographic coupling is a static measure of how strongly two newer papers are connected, since new citations cannot be added to an existing bibliography. Co-citation is a dynamic measure that can change over time depending on how the authors of new papers perceive the connection between two existing papers (Small, 1973). Most papers receive all the citations they will ever receive within the first 10 years after publication, and the vast majority of papers receive few to no citations. The papers that cite most frequently are review papers, which serve to summarize an entire body of research within a single paper (de Solla Price, 1965).
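The three relationships above can be made concrete with a toy network; the papers X, Y (newer) and A, B (older) are hypothetical:

```python
# Toy citation network: paper -> set of papers it cites.
citations = {
    "X": {"A", "B"},
    "Y": {"A"},
}

def direct_citation(citing, cited):
    """True if citing directly cites cited."""
    return cited in citations.get(citing, set())

def cocited(p, q):
    """True if some single paper cites both p and q (co-citation)."""
    return any(p in refs and q in refs for refs in citations.values())

def coupled(p, q):
    """True if p and q share at least one reference (bibliographic coupling)."""
    return bool(citations.get(p, set()) & citations.get(q, set()))
```

Here A and B are co-cited (X cites both), while X and Y are bibliographically coupled (both cite A). Co-citation can change as new papers are added to `citations`, but X and Y’s coupling is fixed once their bibliographies exist, matching the static/dynamic distinction above.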
