About Robert

Robert Vesco

I'm doctoral student at the University of Maryland's Robert H. Smith School of Business.

I want people to pursue opportunities and happiness and I believe entrepreneurship is one important way people can do that.

My research interests include understanding (a) why some areas are more entrepreneurial than others, (b) what drives people to pursue entrepreneurship, and (c) how can policies foster effective and sustainable entrepreneurship.

[Read More]

Using R to Extract Excel/csv File of Citation Data from Zotero

Zotero is a handy piece of software used to manage citations. However, one of issues I have with it is that there is no convenient way to export citation into excel. There are some folks that have put together some solutions here, here, and here, but they haven't been able to accomplish what I needed. Essentially, want I wanted was a way to review my citations, abstracts and a few other details easily. Two things about my solution. One it uses R, so you probably want to have some experience using that software, and two, you may want to have Zotero closed or make a copy of the sqlite file. Sometimes there are locking issues with the database if you try to access it from two different ways at the same time.


# Purpose: To extract zotero citation data into excel.
# Version: 1.0
# Platform: Written on R 2.11, Win 7, Zotero 2.01, but should work on all platforms.
###############################################################################

############################
## The following has variables you need to replace for your setup
##########################

# Replace with the path to your Zotero sqlite file
dbZot <- "C:\\path\\to\\your\\zotero\\zotero-current.sqlite"

# Put a character of the top level folders you want to extract
# This will include subfolders as well
vTopFolders <- c("TopFolder1","TopFolder2")

##########################

#libraries
require(sqldf)
require(gdata)
require(reshape2)
require(reshape)

# SQL query that gets basic data.
sqlC <- "SELECT
items.itemID,
collections.collectionID,
collections.collectionName,
collections.parentCollectionID,
fields.fieldName,
itemDataValues.value,
creatorData.firstName,
creatorData.lastName,
itemNotes.title,
itemNotes.note
FROM
collectionItems
INNER JOIN items ON (collectionItems.itemID = items.itemID)
INNER JOIN collections ON (collectionItems.collectionID = collections.collectionID)
INNER JOIN itemCreators ON (items.itemID = itemCreators.itemID)
INNER JOIN itemData ON (itemData.itemID = items.itemID)
INNER JOIN itemDataValues ON (itemData.valueID = itemDataValues.valueID)
LEFT OUTER JOIN itemNotes ON (items.itemID = itemNotes.sourceItemID)
INNER JOIN creators ON (itemCreators.creatorID = creators.creatorID)
INNER JOIN creatorData ON (creators.creatorDataID = creatorData.creatorDataID)
INNER JOIN fields ON (itemData.fieldID = fields.fieldID)"

# This sends query to sqlite db
# Notice bug, stringAsFactor=F results in 0 vars, but T with method="raw" gets desired outcome
dfData <- sqldf(sqlC,, stringsAsFactors = F, dbname=dbZot, method="raw")

#This filter the data from sqlite and extracts the ID of the top level folder
vParentID <- unique(dfData[dfData$collectionName %in% vTopFolders,"collectionID"])

#This extracts a mapping of parent-child IDs
dfParentChildID <-  unique(dfData[vParentID %in% dfData$parentCollectionID,c("itemID","collectionID","parentCollectionID","collectionName")])

# Self join
dfParentChildID2 <- merge(dfParentChildID,dfParentChildID,by.x=c("itemID","collectionID"),by.y=c("itemID","parentCollectionID"),all.x=TRUE)

# Creates a field that concat top folder - sub folder
dfParentChildID2 <- transform(dfParentChildID2, top.sub.folder  = paste(collectionName.x,collectionName.y,sep="-"))

# Creates a unique ID
dfParentChildID2$ID <- with(dfParentChildID2,paste(itemID,collectionID.1,collectionID,sep="-"))

# Creates another unique ID for the original dataset from sqlite
dfData$ID <- with(dfData,paste(itemID,collectionID,parentCollectionID,sep="-"))

# merge the datasset
dfData2 <- merge(dfData,dfParentChildID2,by=c("ID"),all=TRUE)

# Get rid of the NA row in data and extract some key variables
dfData3 <- dfData2[!is.na(dfData2$parentCollectionID.x),c("ID","top.sub.folder", "collectionName",
				"fieldName", "value", "firstName", "lastName", "title", "note")]

#rename vars to prevent name conflict later on
dfData3 <- rename.vars(dfData3, from="title",to="notetitle")

# now need to reshape data, but trick was how to aggregate text. Other functions caused issue, but
# max seems to work for some reason.
# This take the fieldName which is the field in zotero that has vars like author, year, etc
# we want to turn this into wide form so that they become columns
dfData4 <- cast(data = dfData3, ID + top.sub.folder + collectionName +
				firstName + lastName + notetitle + note ~ fieldName, value = "value", fun.aggregate=max)

# Make author last name only
dfData4$author <- with(dfData4, lastName)
# Alternative: both names dfData4$author <- with(dfData4, paste(lastName,firstName,sep=","))

# Create essential a group_concat on author names.
# TODO: Fix so that authors are reversed.
dfNames <- ddply(dfData4, .(ID), summarise, authors = paste(author,sep=";",collapse=";"))

# Merge new names back into dataset
dfData5 <- merge(dfData4,dfNames)

# Make the prior data frame unique and extract the vars of interest
dfData6 <- unique(dfData5[,c("ID", "top.sub.folder","authors","publicationTitle","title",
 "abstractNote", "bookTitle", "date", "notetitle", "note")])

# Extract the year from the data
dfData6$date <- substr(as.character(dfData6$date),1,4)

#order by top folder
dfData6 <- dfData6[order(dfData6$top.sub.folder),]

#convert all factor to character
dfData6 <- as.data.frame(lapply(dfData6,as.character),stringsAsFactors=F)

#Combine book-article so that I know what book an article or chapter came from.
dfData6$bookjournal <- with(dfData6, ifelse(is.na(bookTitle), title,paste(bookTitle,title,sep="::")))

# Drop unnecessary Var
dfData7 <- dfData6[,-7]

write.csv(dfData7,"Zotero Citations.csv")
# TODO: find some excel package that doesn't have 256 character limit because
# some abstracts are really long....

4 comments to Using R to Extract Excel/csv File of Citation Data from Zotero

  • Joe

    Thanks Robert.

    You need to update your package loading bit from
    022 reshape(reshape2)
    023 reshape(reshape)

    to

    022 require(reshape2)
    023 require(reshape)

  • Robert Vesco

    Thanks! Not sure how that got in there ;)

  • Wouter

    Thanks, helped me a lot! I had to remove "bookTitle" from line 101 and comment out line 113 to make it work, otherwise it would give an "undefined columns selected" error (perhaps because that field is not used in any of my citations?).

  • Joe

    Wouter, I had to do the same thing as well.

    Robert, I'd like to display some of your code on a site for an R package that I am developing (http://code.google.com/p/r-systematic-review). I use it in addition to some of my functions.

    Can you please get in touch with me?

    Ta,
    Joe

Leave a Reply

  

  

  

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>