Week 2

Learning Objectives

Filtering and aggregations in PostgreSQL, used to find citation counts
Creating csv files in Linux; moving files between my machine and the team server.
Learning about the Habanero Python library, which can be used to get the titles from a list of dois.

Journal Articles

Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404. 10.1002/asi.21419

This study aimed to identify the most accurate way to cluster over 2 million biomedical articles published between 2004 and 2008 by comparing the clustering results of direct citation, co-citation, bibliographic coupling, and bibliographic coupling hybrid approach. "Accuracy" here was determined by "textual coherence" and a new measure introduced in this paper which uses grant funding information. The logic behind the new method is that articles funded by the same grant are more likely to focus on similar topics than those that do not. The researchers concluded that the bibliographic coupling hybrid was most accurate, followed by bibliographic coupling. Given that the time window of papers included in the study was very small (5 years, if the boundaries are inclusive), then I find this unsurprising as bibliographic coupling is a static measure. Additionally, I am also skeptical of the use of a new accuracy measure as a means to make broad conclusions about the accuracy of bibliographic coupling and co-citation in general.

Klavans, R., & Boyack, K. W. (2016). Which Type of Citation Analysis Generates the Most Accurate Taxonomy of Scientific and Technical Knowledge? Journal of the Association for Information Science and Technology, 68(4), 984–998. 10.1002/asi.23734

The objective of this paper was to accurately identify taxonomies, or "the history of a topic," in contrast to Boyack & Klavans (2010) above, which focused on identifying research fronts. Here, the authors use "gold standard" papers as a means to section the data by topic. The gold standard papers were papers with more than 100 references, selected as these papers make up a small percentage of data and are usually written by authors at the "core" of a research community. The paper concludes that direct citation is the most accurate way to identify a taxonomy and co-citation analyses produced poor cluster coverage and accuracy. However, the data used for each analysis varied wildly: direct citation analysis was conducted on data from 1996-2012, co-citation on data from 2011-2013, and bibliographic coupling on data from 2010 alone. Because of this, I'm not sure the results between the three approaches can be compared in a meaningful way.

Things I made

In collaboration with George, I developed a script using the Habanero Python library that will take a list of dois and return a csv file containing both the doi searched and the title of the paper. Because the Crossref module in habanero will only allow you to search so many dois at one time, doi searches need to be done in batches. This script will read a csv file containing dois, run doi searches in batches of a desired increment, and return the total results as a csv file.

Written on May 27, 2022