23 case study: the network structure of computational social science

Bibliometric networks are models of the structure of scholarly disciplines. There are a variety of methods for developing such networks. We’ll use an approach that is both simple and powerful: examining shared references, or bibliographic coupling. In such a network, papers are strongly linked if they have many references in common, and distant from each other when they share few or no references.
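To make the idea concrete, here is a minimal base-R sketch on a made-up example (the paper and reference names are hypothetical, not drawn from the Web of Science data). Rows of the incidence matrix are source papers, columns are cited references, and multiplying the matrix by its transpose counts the references that each pair of papers shares.

```r
# Toy paper-by-reference incidence matrix (hypothetical data):
# 1 means the source paper (row) cites the reference (column)
cites <- matrix(
    c(1, 1, 1, 0,
      1, 1, 0, 0,
      0, 0, 0, 1),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("paperA", "paperB", "paperC"),
                    c("ref1", "ref2", "ref3", "ref4")))

# bibliographic coupling: number of references shared by each pair of papers
coupling <- cites %*% t(cites)
# a paper's overlap with itself is just its own reference count, so drop it
diag(coupling) <- 0
coupling
```

In this toy case paperA and paperB share two references and so are strongly linked, while paperC shares none with either and would appear as an isolate.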

We begin with a collection of records of research articles that we have downloaded from the Web of Science, using the phrase “computational social science” as a search term. (A video on YouTube illustrates each step of this process.)

A second video accompanies the code below, which was originally developed for the Summer Institute in Computational Social Science held in Jupiter, Florida in 2023.

After loading the needed libraries, we begin with a dataframe consisting of a list of records downloaded from the Web of Science. These are all papers which included the phrase ‘computational social science,’ as downloaded on 6/5/2023. The data include a ‘source’ field and a set of cited references, as well as a number of other metadata fields.

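# packages used in this chapter
library(tidyverse)
library(bibliometrix)
library(knitr)
library(kableExtra)
library(igraph)
library(tidygraph)
# for large jobs, set to TRUE to run this on only one reference file
debugging <- FALSE
dataDir <- "data/CompSocSci2023"
# pattern is a regular expression, so anchor on a literal ".txt" suffix
filenames <- list.files(dataDir, pattern = "\\.txt$", 
                        full.names = TRUE)
if (debugging) {filenames <- filenames[1]}
biblioDF <- convert2df(filenames, dbsource = "wos", format = "plaintext")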
## 
## Converting your wos collection into a bibliographic dataframe
## 
## Done!
## 
## 
## Generating affiliation field tag AU_UN from C1:  Done!
biblioDF %>%  
    rownames_to_column("source") %>% 
    select(source,CR) %>% 
    head(1) %>% 
    kable(caption = "Cited refs in one document") %>% 
    kable_styling()
Table 23.1: Cited refs in one document
| source | CR |
| --- | --- |
| HOX JJ, 2017, METHODOLOGY-EUR | ATTEWELL P., 2015, DATA MINING SOCIAL S;AXELROD R, 1981, SCIENCE, V211, P1390, DOI 10.1126/SCIENCE.7466396;BOND RM, 2012, NATURE, V489, P295, DOI 10.1038/NATURE11421;CAMSTRA A, 1992, SOCIOL METHOD RES, V21, P89, DOI 10.1177/0049124192021001004;CIOFFI-REVILLA CLAUDIO., 2014, INTRO COMPUTATIONAL;DAAS P., 2012, NYENR S BIG DAT;DAWSON R. J. M., 1995, J STAT ED, V3;DONOHO D., 2015, 50 YEARS DATA SCI;FOSTER I., 2016, BIG DATA SOCIAL SCI;FRED MORSTATTER, 2013, IS SAMPLE GOOD ENOUG;GHANI R, 2017, STAT SOC BEHAV SCI, P147;GINSBERG J, 2009, NATURE, V457, P1012, DOI 10.1038/NATURE07634;GONCALVES-SA J., 2015, 2015 WINT S COMP SOC;GRANOVETTER MS, 1973, AM J SOCIOL, V78, P1360, DOI 10.1086/225469;GRISHENKO A., 2014, TWITTER ARCHITECTU 1;IBM, 2016, 4 VS BIG DAT;KOLLANYI B., 2016, 2016219 COMPROP;LANEY D., 2001, 3 D DATA MANAGEMENT, DOI DOI 10.1016/J.INFSOF.2008.09.005;LAZER D, 2014, SCIENCE, V343, P1203, DOI 10.1126/SCIENCE.1248506;LOEHLIN JC, 1965, J PERS SOC PSYCHOL, V2, P580, DOI 10.1037/H0022457;MADDEN M., 2013, TEENS SOCIAL MEDIA P;ONNELA JP, 2007, P NATL ACAD SCI USA, V104, P7332, DOI 10.1073/PNAS.0610245104;OREILLY RC, 2000, COMPUTATIONAL EXPLOR;PHELPS S., 2012, EMERGENCE SOCIAL NET, DOI 10.2139/SSRN.2109553, DOI 10.2139/SSRN.2109553;PRESENTI J., 2015, 5 NEW SERVICES EXPAN;ROBERTS M. E., 2016, COMPUTATIONAL SOCIAL, P51;SILBERZAHN R., 2015, CROWDSOURCING DATA A;TUKEY JW, 1962, ANN MATH STAT, V33, P1, DOI 10.1214/AOMS/1177704711;TUKEY JW, 1977, EXPLORATORY DATA ANA;VO H, 2017, STAT SOC BEHAV SCI, P125;WEBB EJ, 1966, UNOBTRUSIVE MEASURES;ZHU XIAOJIN, 2009, INTRO SEMISUPERVISED, DOI 10.2200/S00196ED1V01Y200906AIM006 |

23.0.1 from citation network to structural network

A citation network is a type of two-mode or bipartite network which consists of source papers, referenced papers, and the directed edges which link a subset of these. Here’s a fragment of our citation network, first as a list of citations (edges), then as a graph showing links between the two types of vertices or nodes (i.e., source papers in red, cited papers in green). Notice that the edges are directed (e.g., Hox 2017 cites Lazer 2014).

# illustration of a small fragment of the citation network
# take the first four sources
someSources <- biblioDF %>% 
    slice(1:4)

smallBiblioDF <- someSources %>% 
    # make them into a rectangular source * citation matrix
    cocMatrix(Field = "CR", sep = ";") %>% 
    as.matrix() %>% 
    # then a list of edges
    reshape2::melt() %>% 
    filter(value > 0) %>% 
    rename(source = 1, citation = 2) %>% 
    select(-value) %>% 
    # then choose only papers cited more than once
    group_by(citation) %>% 
    filter(n() > 1) %>% 
    ungroup()

smallBiblioDF %>% 
    kable(caption = paste("A small citation network:",
                          "Edge list of papers cited > 1 time in four sources in CSS")) %>% 
    kable_styling()
Table 23.3: A small citation network: Edge list of papers cited > 1 time in four sources in CSS
| source | citation |
| --- | --- |
| TORNBERG P, 2021, BIG DATA SOC | CENTOLA D 2010 SCIENCE |
| ZHANG J, 2020, BIG DATA RES | CENTOLA D 2010 SCIENCE |
| TORNBERG P, 2021, BIG DATA SOC | CONTE R 2012 EUR PHYS J-SPEC TOP |
| BALTAR R, 2021, SIMBIOTICA | CONTE R 2012 EUR PHYS J-SPEC TOP |
| TORNBERG P, 2021, BIG DATA SOC | EDELMANN A 2020 ANNU REV SOCIOL |
| BALTAR R, 2021, SIMBIOTICA | EDELMANN A 2020 ANNU REV SOCIOL |
| TORNBERG P, 2021, BIG DATA SOC | KITCHIN R 2014 BIG DATA SOC |
| BALTAR R, 2021, SIMBIOTICA | KITCHIN R 2014 BIG DATA SOC |
| TORNBERG P, 2021, BIG DATA SOC | KRAMER ADI 2014 P NATL ACAD SCI USA |
| ZHANG J, 2020, BIG DATA RES | KRAMER ADI 2014 P NATL ACAD SCI USA |
| TORNBERG P, 2021, BIG DATA SOC | LAZER D 2009 SCIENCE |
| ZHANG J, 2020, BIG DATA RES | LAZER D 2009 SCIENCE |
| HOX JJ, 2017, METHODOLOGY-EUR | LAZER D 2014 SCIENCE |
| ZHANG J, 2020, BIG DATA RES | LAZER D 2014 SCIENCE |
| TORNBERG P, 2021, BIG DATA SOC | LAZER DMJ 2020 SCIENCE |
| BALTAR R, 2021, SIMBIOTICA | LAZER DMJ 2020 SCIENCE |
# plotted using qgraph
smallgraph <- smallBiblioDF %>%
    qgraph::qgraph(labels = TRUE, 
           label.cex = 4,
           vTrans = 30,
           border.color = "white",
           edge.width = 4,
           color = c(2,2,2,2,3,3,3,3,3,3,3,3,3),
           title = "A small citation network: Common cites among four papers")

The two-mode citation network can be collapsed to either of two one-mode networks. A co-citation network keeps the cited papers (green nodes) and links them by the number of times they are cited together by the same source paper. We will look instead at bibliographic coupling, or links between the (red) source papers. This is a one-mode, weighted, undirected network, with source papers as vertices and the number of shared references as edge weights. Here’s the collapsed network, shown as a matrix, for the four source papers above:

smallOneModeMatrix <- biblioNetwork(someSources, analysis = "coupling", 
                           network = "references", 
                           sep = ";") %>% 
# normalizeSimilarity adjusts the weights; Salton's cosine divides the number
# of shared refs by the geometric mean of the two papers' reference counts,
# sqrt(n_i * n_j). This is what I have always done.
    normalizeSimilarity(type = "salton") %>%
    as.matrix() %>% 
    round(2)
diag(smallOneModeMatrix) <- 0
smallOneModeMatrix %>% 
    kable(caption = "A small structural network: Matrix showing shared references among 4 papers") %>% 
    kable_styling()
Table 23.5: A small structural network: Matrix showing shared references among 4 papers
| | HOX JJ, 2017 | TORNBERG P, 2021 | ZHANG J, 2020 | BALTAR R, 2021 |
| --- | --- | --- | --- | --- |
| HOX JJ, 2017 | 0.00 | 0.00 | 0.01 | 0.00 |
| TORNBERG P, 2021 | 0.00 | 0.00 | 0.02 | 0.05 |
| ZHANG J, 2020 | 0.01 | 0.02 | 0.00 | 0.00 |
| BALTAR R, 2021 | 0.00 | 0.05 | 0.00 | 0.00 |

Based on the structure of shared references, Baltar and Hox appear to be quite different: they have no references in common. (But don’t trust the result too much; small networks like this aren’t particularly stable.) Regardless, this is the beginning of an understanding of the structure of scholarship in computational social science.
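The normalization step can be checked by hand. The sketch below uses made-up counts (not the real data) and assumes, as I understand the Salton option of normalizeSimilarity to work, that each shared-reference count is divided by the square root of the product of the two papers’ reference-list lengths.

```r
# Hypothetical coupling counts: off-diagonal = shared references,
# diagonal = length of each paper's own reference list
shared <- matrix(
    c(32,   1,  0,
       1, 100,  5,
       0,   5, 25),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("p1", "p2", "p3"), c("p1", "p2", "p3")))

nRefs <- diag(shared)
# Salton's cosine: shared refs / sqrt(n_i * n_j)
salton <- shared / sqrt(outer(nRefs, nRefs))
# self-similarity is not of interest, so zero the diagonal
diag(salton) <- 0
round(salton, 2)
```

Here p2 and p3 share five references out of lists of 100 and 25, giving 5 / sqrt(100 * 25) = 0.10, while the single reference shared by p1 and p2 yields only about 0.02. Long reference lists are thus penalized, so a review article is not treated as central merely because it cites everything.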

23.0.2 looking at the whole citation network

The bibliometrix package can quickly give summary statistics for citation networks. Here are some characteristics of the entire set of papers.

twoModeStats <- biblioAnalysis(biblioDF)
# summary() prints the tables below and also returns them as a list
twoModeSummary <- twoModeStats %>% 
    summary()
## 
## 
## MAIN INFORMATION ABOUT DATA
## 
##  Timespan                              1999 : 2023 
##  Sources (Journals, Books, etc)        310 
##  Documents                             794 
##  Annual Growth Rate %                  18.76 
##  Document Average Age                  6.71 
##  Average citations per doc             21.14 
##  Average citations per year per doc    2.023 
##  References                            35298 
##  
## DOCUMENT TYPES                     
##  article                                676 
##  article; book chapter                  3 
##  article; data paper; early access      1 
##  article; early access                  45 
##  article; proceedings paper             2 
##  book review                            7 
##  book review; early access              1 
##  correction                             5 
##  editorial material                     28 
##  editorial material; book chapter       1 
##  letter                                 1 
##  letter; early access                   1 
##  review                                 22 
##  review; early access                   1 
##  
## DOCUMENT CONTENTS
##  Keywords Plus (ID)                    1400 
##  Author's Keywords (DE)                2228 
##  
## AUTHORS
##  Authors                               2055 
##  Author Appearances                    2555 
##  Authors of single-authored docs       136 
##  
## AUTHORS COLLABORATION
##  Single-authored docs                  149 
##  Documents per Author                  0.386 
##  Co-Authors per Doc                    3.22 
##  International co-authorships %        32.87 
##  
## 
## Annual Scientific Production
## 
##  Year    Articles
##     1999        1
##     2000        1
##     2002        2
##     2005        2
##     2006        2
##     2008        2
##     2009        2
##     2010        3
##     2011        7
##     2012       10
##     2013       13
##     2014       21
##     2015       26
##     2016       38
##     2017       49
##     2018       78
##     2019       82
##     2020       83
##     2021      133
##     2022      177
##     2023       62
## 
## Annual Percentage Growth Rate 18.76 
## 
## 
## Most Productive Authors
## 
##      Authors        Articles   Authors        Articles Fractionalized
## 1  CIOFFI-REVILLA C       12 CIOFFI-REVILLA C                    6.10
## 2  MOAT HS                12 BAIL CA                             3.95
## 3  PREIS T                12 O'BRIEN DT                          3.83
## 4  PENTLAND A              9 MOAT HS                             3.70
## 5  BAIL CA                 8 PREIS T                             3.70
## 6  FERRARA E               7 FARRELL J                           3.00
## 7  JUNGHERR A              7 ITO N                               2.92
## 8  ITO N                   6 JUNGHERR A                          2.92
## 9  JURGENS P               6 SASAHARA K                          2.67
## 10 KOSINSKI M              6 MA J                                2.64
## 
## 
## Top manuscripts per citations
## 
##                           Paper                                  DOI   TC TCperYear   NTC
## 1  LAZER D, 2009, SCIENCE                10.1126/science.1167742     1825     101.4  1.98
## 2  KOSINSKI M, 2013, P NATL ACAD SCI USA 10.1073/pnas.1218772110     1183      84.5  8.08
## 3  LUKE S, 2005, SIMUL-T SOC MOD SIM     10.1177/0037549705058073     584      26.5  2.00
## 4  WU YY, 2015, P NATL ACAD SCI USA      10.1073/pnas.1418680112      469      39.1  8.70
## 5  BAIL CA, 2018, P NATL ACAD SCI USA    10.1073/pnas.1804840115      467      51.9 14.43
## 6  VARGO CJ, 2018, NEW MEDIA SOC         10.1177/1461444817712086     228      25.3  7.04
## 7  FARRELL J, 2016, P NATL ACAD SCI USA  10.1073/pnas.1509433112      221      20.1  5.17
## 8  WANG Y, 2018, J PERS SOC PSYCHOL      10.1037/pspa0000098          218      24.2  6.73
## 9  CHANG RM, 2014, DECIS SUPPORT SYST    10.1016/j.dss.2013.08.008    196      15.1  3.61
## 10 CONTE R, 2012, EUR PHYS J-SPEC TOP    10.1140/epjst/e2012-01697-8  182      12.1  4.07
## 
## 
## Corresponding Author's Countries
## 
##           Country Articles   Freq SCP MCP MCP_Ratio
## 1  USA                 307 0.3906 238  69     0.225
## 2  UNITED KINGDOM       72 0.0916  41  31     0.431
## 3  CHINA                51 0.0649  29  22     0.431
## 4  GERMANY              49 0.0623  32  17     0.347
## 5  JAPAN                42 0.0534  34   8     0.190
## 6  ITALY                33 0.0420  16  17     0.515
## 7  NETHERLANDS          20 0.0254   8  12     0.600
## 8  SWITZERLAND          18 0.0229  13   5     0.278
## 9  CANADA               17 0.0216  11   6     0.353
## 10 AUSTRALIA            16 0.0204   7   9     0.562
## 
## 
## SCP: Single Country Publications
## 
## MCP: Multiple Country Publications
## 
## 
## Total Citations per Country
## 
##      Country      Total Citations Average Article Citations
## 1  USA                       8628                     28.10
## 2  UNITED KINGDOM            3051                     42.38
## 3  ITALY                      861                     26.09
## 4  GERMANY                    732                     14.94
## 5  CHINA                      420                      8.24
## 6  NETHERLANDS                331                     16.55
## 7  SWITZERLAND                304                     16.89
## 8  DENMARK                    249                     24.90
## 9  KOREA                      232                     25.78
## 10 SPAIN                      227                     18.92
## 
## 
## Most Relevant Sources
## 
##                                                                     Sources        Articles
## 1  JOURNAL OF COMPUTATIONAL SOCIAL SCIENCE                                              199
## 2  PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA       30
## 3  EPJ DATA SCIENCE                                                                      21
## 4  SOCIAL SCIENCE COMPUTER REVIEW                                                        21
## 5  ROYAL SOCIETY OPEN SCIENCE                                                            14
## 6  JASSS-THE JOURNAL OF ARTIFICIAL SOCIETIES AND SOCIAL SIMULATION                       11
## 7  BIG DATA & SOCIETY                                                                     9
## 8  NATURE HUMAN BEHAVIOUR                                                                 9
## 9  PLOS ONE                                                                               9
## 10 SCIENTIFIC REPORTS                                                                     8
## 
## 
## Most Relevant Keywords
## 
##       Author Keywords (DE)      Articles Keywords-Plus (ID)     Articles
## 1  COMPUTATIONAL SOCIAL SCIENCE      367          MODEL               43
## 2  SOCIAL MEDIA                       91          SCIENCE             43
## 3  BIG DATA                           62          BEHAVIOR            42
## 4  MACHINE LEARNING                   60          SOCIAL MEDIA        42
## 5  TWITTER                            47          COMMUNICATION       37
## 6  NATURAL LANGUAGE PROCESSING        38          BIG DATA            36
## 7  SOCIAL NETWORKS                    29          DYNAMICS            36
## 8  COVID-19                           28          MEDIA               36
## 9  AGENT-BASED MODELING               26          TWITTER             35
## 10 DATA SCIENCE                       21          NETWORKS            34
twoModeSummary |> 
  pluck(1) |> 
  unlist() |> 
  kable(
    caption = "Citation network: Key features") %>%  
    kable_styling() 
Table 23.7: Citation network: Key features
| | x |
| --- | --- |
| MAIN INFORMATION ABOUT DATA | |
| Timespan | 1999 : 2023 |
| Sources (Journals, Books, etc) | 310 |
| Documents | 794 |
| Annual Growth Rate % | 18.76 |
| Document Average Age | 6.71 |
| Average citations per doc | 21.14 |
| Average citations per year per doc | 2.023 |
| References | 35298 |
| DOCUMENT TYPES | |
| article | 676 |
| article; book chapter | 3 |
| article; data paper; early access | 1 |
| article; early access | 45 |
| article; proceedings paper | 2 |
| book review | 7 |
| book review; early access | 1 |
| correction | 5 |
| editorial material | 28 |
| editorial material; book chapter | 1 |
| letter | 1 |
| letter; early access | 1 |
| review | 22 |
| review; early access | 1 |
| DOCUMENT CONTENTS | |
| Keywords Plus (ID) | 1400 |
| Author’s Keywords (DE) | 2228 |
| AUTHORS | |
| Authors | 2055 |
| Author Appearances | 2555 |
| Authors of single-authored docs | 136 |
| AUTHORS COLLABORATION | |
| Single-authored docs | 149 |
| Documents per Author | 0.386 |
| Co-Authors per Doc | 3.22 |
| International co-authorships % | 32.87 |

Here’s a manually-constructed table of the “most cited journals.”

sourceJournals <- biblioDF %>% 
    citations(field = "article", sep = ";") %>% 
# output of above is a list... pluck gets the third element
# which holds the source (journal) of each cited reference
    pluck(3) %>%
    as_tibble() %>%
    rename(CitedJournal = 1) 

sourceJournals %>% 
    count(CitedJournal) %>% 
    arrange(desc(n)) %>% 
    head(10) %>% 
    kable (caption = "Most cited journals in Comp Soc Sci") %>% 
    kable_styling()
Table 23.9: Most cited journals in Comp Soc Sci
| CitedJournal | n |
| --- | --- |
| PLOS ONE | 425 |
| P NATL ACAD SCI USA | 388 |
| NA | 319 |
| SCIENCE | 250 |
| AM SOCIOL REV | 227 |
| J PERS SOC PSYCHOL | 224 |
| NATURE | 223 |
| AM J SOCIOL | 182 |
| LECT NOTES COMPUT SC | 159 |
| SCI REP-UK | 152 |

23.0.2.1 the need to inspect/clean data

One drawback of bibliometrix and packages like it is that they remove us somewhat from the data. Here, ‘NA’ is the third most common cited source. We look more closely at these records to see if there is a problem. Here’s one way…

# comparing against "" or "NA" finds nothing, because any comparison
# with a missing value returns NA and filter() drops those rows
# sourceJournals %>% 
#        filter(CitedJournal == "" |
#               CitedJournal == "NA" )
set.seed(33458)
countsonly <- sourceJournals %>% 
    count(CitedJournal) 
# number of cited references whose source is missing
countNA <- countsonly %>% 
    filter(is.na(CitedJournal)) %>% 
    pull(n)
sources <- sourceJournals %>% 
    left_join(countsonly, by = "CitedJournal")
biblioDF %>% 
    citations(field = "article", sep = ";") %>% 
    pluck(1) %>% 
    as_tibble() %>% 
    bind_cols(sources) %>% 
    rename(papercites = 2, journalcites = 4) %>% 
    filter(journalcites == countNA) %>% 
    select(CR) %>% 
    sample_n(20) %>% 
    kable (caption = "Random sample of references with source journal coded as NA") %>% 
    kable_styling()
Table 23.11: Random sample of references with source journal coded as NA
CR
ANONYMOUS, REPRESENTING SOCIAL
THE UNITED NATIONS OFFICE FOR DISASTER, DISASTER STATISTICS
ANONYMOUS, ALLSIDES MED BIAS RA
U.S. DEPARTMENT OF HEALTH AND HUMAN SERVICES ADMINISTRATION FOR CHILDREN AND FAMILIES ADMINISTRATION ON CHILDREN YOUTH AND FAMILIES CHILDREN’S BUREAU (2021), CHILD MALTREATMENT
WORLD BANK, POPULATION DENSITY
ANONYMOUS, PARTITIONING PREDICT
WHO, WHO COR COVID 19 DAS
ADRIAN M., TERADATA MAGAZINE
ANONYMOUS, 10 INT AAAI C WEB SO
ROMIR, DYN HAPP IND RUSS WO
MACARTHUR A., LIFEWIRE 0206
NATIONAL AUDUBON SOCIETY, CHRISTM BIRD COUNT
ARTEFACT GROUP, ETH EXPL PACK
NAI J., J PERSONALITY SOCIAL
CRANDALL D., P 14 ACM SIGKDD INT
HARTFORD, FINANCIAL TIMES
STEPHENS N, PALGRAVE COMMUNICATI
BHARAT K, GOOGLE NEWS COVERAGE
TERRA DOTTA, TACKLING GENDER GAP
STATISTA, MOST POPS SOC NETW W

These NA entries appear to be mostly detritus (cites to anonymous sources, etc.). There are better ways of cleaning the data (including restricting the set of papers to articles and reviews), but we’ll ignore these for now.
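One such cleaning step, restricting the records to articles and reviews, might look like the sketch below. It runs on a small made-up data frame; the DT column name follows the Web of Science field tag for document type, but the exact values (and their capitalization) in a real export are an assumption to check before use.

```r
# Hypothetical records with a document-type field (DT)
toyRecords <- data.frame(
    SR = c("AAA B, 2020", "CCC D, 2021", "EEE F, 2019", "GGG H, 2022"),
    DT = c("ARTICLE", "BOOK REVIEW", "REVIEW; EARLY ACCESS", "CORRECTION"),
    stringsAsFactors = FALSE)

# keep a record if any of its semicolon-separated types is
# exactly "article" or "review" (so "book review" is excluded)
keepTypes <- c("article", "review")
isWanted <- vapply(
    strsplit(tolower(toyRecords$DT), ";"),
    function(types) any(trimws(types) %in% keepTypes),
    logical(1))
cleanRecords <- toyRecords[isWanted, ]
cleanRecords$SR
```

Matching whole type labels, rather than searching for the substring "review", is a deliberate choice here: it keeps early-access reviews while dropping book reviews.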

23.0.3 the structural network: Centrality and community structure

We now reduce the two-mode network to a one-mode network for the whole dataset, normalizing the edge weights as before and dropping the diagonal entries when the matrix is converted to a graph.

We then use the tidygraph package to compute several measures of centrality and community structure.

# random seed to ensure reproducible results
# esp for centrality and community analyses
set.seed(33458)
oneModeMatrix <- biblioNetwork(biblioDF, 
                            analysis = "coupling", 
                            network = "references", 
                            sep = ";") %>% 
    normalizeSimilarity(type = "salton") %>%
    as.matrix()  

str(oneModeMatrix)
##  num [1:794, 1:794] 1 0 0.0128 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:794] "HOX JJ, 2017" "TORNBERG P, 2021" "ZHANG J, 2020" "BALTAR R, 2021" ...
##   ..$ : chr [1:794] "HOX JJ, 2017" "TORNBERG P, 2021" "ZHANG J, 2020" "BALTAR R, 2021" ...
# save it as a graph for igraph/tidygraph
oneModeGraph <- oneModeMatrix %>% 
    graph_from_adjacency_matrix(mode = "undirected", 
                                diag = FALSE,
                                weighted = TRUE)
# tidygraph allows us to look at graphs
# as data frames of nodes and edges
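tidyBibGraph <- oneModeGraph %>% 
    as_tbl_graph() %>% 
    # we activate nodes to assign new measures for each node
    activate(nodes) %>% 
    # weighted PageRank centrality, and its rank (1 = most central)
    mutate(centralPR = centrality_pagerank(weights = weight)) %>% 
    mutate(nodePRRank = rank(-centralPR)) %>% 
#    mutate(centralWD = centrality_degree(weights = weight)) %>% 
    # unweighted degree: number of papers sharing at least one reference
    mutate(central0D = centrality_degree(weights = NULL)) %>% 
    # Louvain communities, using the normalized weights
    mutate(communityLouv = group_louvain(weights = weight)) %>% 
    # a second Louvain run, stored as a factor
    mutate(group = as.factor(group_louvain())) %>% 
#    mutate(communityWalk = group_walktrap(weights = weight)) %>% 
    # numeric ID matching the row order of biblioDF
    mutate(ID = row_number()) 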

edgeList <- tidyBibGraph %>% 
    activate(edges) %>% 
    as_tibble()
nodeList <- tidyBibGraph %>% 
    activate(nodes) %>% 
    as_tibble()
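
For intuition about what the PageRank centrality is measuring, here is a rough base-R sketch of weighted PageRank by power iteration on a made-up four-node network. It is a simplified illustration rather than igraph's implementation; the damping factor of 0.85 is the conventional default, and the sketch assumes every node has at least one edge.

```r
# Hypothetical symmetric weight matrix for four papers
w <- matrix(
    c(0.0, 0.5, 0.2, 0.0,
      0.5, 0.0, 0.3, 0.1,
      0.2, 0.3, 0.0, 0.0,
      0.0, 0.1, 0.0, 0.0),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("p", 1:4), paste0("p", 1:4)))

damping <- 0.85
n <- nrow(w)
# each row gives the share of a node's total weight going to each neighbour
transition <- w / rowSums(w)
# start from equal scores and iterate until they settle
pr <- rep(1 / n, n)
for (i in 1:100) {
    # a node's score is a small baseline plus the damped scores
    # flowing in from its neighbours
    pr <- (1 - damping) / n + damping * as.vector(t(transition) %*% pr)
}
names(pr) <- rownames(w)
# rank 1 = most central, mirroring rank(-centralPR) above
rank(-pr)
```

In this toy network p2, which has the largest total edge weight, comes out as the most central paper, and p4, tied to the rest by a single weak link, as the least.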

23.0.4 visualizing the network

I use ggraph to try a few visualizations within R; it plays nicely with tidygraph. In the first visualization, all but the 22 isolates are plotted. It suggests that there is a community structure, but nothing beyond this.

nNodesToPlot <- 794
minEdgeWeightToPlot <- .00001
library (ggraph)
tidyBibGraph %>% 
    activate(nodes) %>% 
    filter(central0D > 0) %>% 
    filter(nodePRRank <= nNodesToPlot) %>% 
    activate(edges) %>% 
    filter(weight > minEdgeWeightToPlot) %>% 
  ggraph(layout = 'stress')  +
  geom_edge_link(alpha = .1) +  
  geom_node_point(aes(size = (centralPR),
                      color = as.factor(communityLouv))) +
#  geom_node_label(aes(label = name,
#                      color = as.factor(communityLouv))) +
     theme(legend.position = "none")

In the second plot, I show just the 20 nodes with the highest PageRank, and label them.

nNodesToPlot <- 20
minEdgeWeightToPlot <- .05
tidyBibGraph %>% 
    activate(nodes) %>% 
    filter(nodePRRank <= nNodesToPlot) %>% 
    activate(edges) %>% 
    filter(weight > minEdgeWeightToPlot) %>% 
  ggraph(layout = 'stress')  +
  geom_edge_link(alpha = .1) +  
  geom_node_point(aes(size = (centralPR),
                      color = as.factor(communityLouv))) +
  geom_node_label(aes(label = name,
                      color = as.factor(communityLouv))) +
     theme(legend.position = "none")

The graph shows papers in four communities, three of which are represented by just one or two papers. Note that some of the papers are odd (a book review, for example).

nodeInfo <- biblioDF %>% 
    select(AB, TI, DE, ID, J9) %>% 
# ID is a text field of keywords (Keywords Plus) in the bib data.
# I combine it with the other text fields, then replace it
# with a numeric ID (the row number) to join with
# the node centrality etc
    mutate(alltext = paste(TI, AB, DE, ID, sep = " ")) %>% 
    select(-AB, -DE, -ID) %>% 
    mutate(ID = row_number()) 
allNodeInfo <- nodeList %>% 
#    filter(nodePRRank < 21) %>% 
    left_join(nodeInfo, by = 'ID')

nodeList %>% 
    filter(nodePRRank < 21) %>% 
    left_join(nodeInfo, by = 'ID') %>% 
    select (name, nodePRRank, communityLouv, TI) %>% 
    arrange(communityLouv, nodePRRank) %>% 
    kable(caption = "Top papers in CSS by PR and community") %>% 
    kable_styling()
Table 23.13: Top papers in CSS by PR and community
| name | nodePRRank | communityLouv | TI |
| --- | --- | --- | --- |
| KEUSCHNIGG M, 2018 | 3 | 1 | ANALYTICAL SOCIOLOGY AND COMPUTATIONAL SOCIAL SCIENCE |
| KOSINSKI M, 2016 | 12 | 1 | MINING BIG DATA TO EXTRACT PATTERNS AND PREDICT REAL-LIFE OUTCOMES |
| BADHAM J, 2020 | 1 | 2 | INTRODUCTION TO COMPUTATIONAL SOCIAL SCIENCE, 2ND EDITION |
| PREIS T, 2014 | 4 | 2 | ADAPTIVE NOWCASTING OF INFLUENZA OUTBREAKS USING GOOGLE SEARCHES |
| BOTTA F, 2015-1 | 5 | 2 | QUANTIFYING CROWD SIZE WITH MOBILE PHONE AND TWITTER DATA |
| CEBRIAN M, 2011 | 6 | 2 | ENGINEERING TRADE-OFFS IN SOCIAL ORGANIZATION: THE BEGINNINGS OF A COMPUTATIONAL SOCIAL SCIENCE |
| MANN A, 2016 | 7 | 2 | COMPUTATIONAL SOCIAL SCIENCE |
| BAIL CA, 2017-1 | 8 | 2 | TAMING BIG DATA: USING APP TECHNOLOGY TO STUDY ORGANIZATIONAL BEHAVIOR ON SOCIAL MEDIA |
| LAZER D, 2017 | 10 | 2 | DATA EX MACHINA: INTRODUCTION TO BIG DATA |
| LETCHFORD A, 2015 | 11 | 2 | THE ADVANTAGE OF SHORT PAPER TITLES |
| GUBA K, 2018 | 13 | 2 | BIG DATA IN SOCIOLOGY: NEW DATA, NEW SOCIOLOGY? |
| STELLA M, 2018 | 14 | 2 | BOTS INCREASE EXPOSURE TO NEGATIVE AND INFLAMMATORY CONTENT IN ONLINE SOCIAL SYSTEMS |
| CURME C, 2014 | 15 | 2 | QUANTIFYING THE SEMANTICS OF SEARCH BEHAVIOR BEFORE STOCK MARKET MOVES |
| JUNGHERR A, 2017-1 | 16 | 2 | THE EMPIRICIST’S CHALLENGE: ASKING MEANINGFUL QUESTIONS IN POLITICAL SCIENCE IN THE AGE OF BIG DATA |
| DE ZUNIGA 2017 | 18 | 2 | CITIZENSHIP, SOCIAL MEDIA, AND BIG DATA: CURRENT AND FUTURE RESEARCH IN THE SOCIAL SCIENCES |
| THEOCHARIS Y, 2021 | 19 | 2 | COMPUTATIONAL SOCIAL SCIENCE AND THE STUDY OF POLITICAL COMMUNICATION |
| PREIS T, 2013 | 20 | 2 | QUANTIFYING THE DIGITAL TRACES OF HURRICANE SANDY ON FLICKR |
| WENG LL, 2013 | 2 | 3 | VIRALITY PREDICTION AND COMMUNITY STRUCTURE IN SOCIAL NETWORKS |
| COTA W, 2019 | 17 | 3 | QUANTIFYING ECHO CHAMBER EFFECTS IN INFORMATION SPREADING OVER POLITICAL COMMUNICATION NETWORKS |
| CONTE R, 2016 | 9 | 5 | TOWARDS COMPUTATIONAL AND BEHAVIORAL SOCIAL SCIENCE |

23.0.5 exploring the communities

One way to see and understand this community structure is to explore the data using an interactive network tool (Gephi). To do this, we’ll write the nodes and edges to disk as two separate files.

Another way to understand the communities is to compare the language of different communities, using the combined text fields. That is explored in a second R Markdown script, CSS2_NetworkToText.Rmd.

allNodeInfo %>% 
  write_csv(file.path(dataDir, "CSSNodesText.csv"))
edgeList %>% 
  rename(source = from, target = to) %>% 
  write_csv(file.path(dataDir, "CSSEdges.csv"))