Publication Harvester

A Brief Guide to Troubleshooting the People File

Description

The Publication Harvester software relies on the People file, an input file that defines the people to be harvested. Sometimes errors in the People file can make it seem as if the software is not working. This brief guide can help you figure out how to troubleshoot your People file.

This page was last updated on 20-Oct-2006.

Sample Files

This example relies on two sample People files: Pubtrialfinal-bad.xls (the bad People file) and Pubtrialfinal-good.xls (the good People file).

The remainder of this document shows you how to troubleshoot the bad People file and turn it into the good one.

How to Troubleshoot the Bad People File

The first thing to do when troubleshooting a People file is to load it into the Publication Harvester and see what publications get downloaded. It can be pretty discouraging to see everything work just fine with sample data, only to run into problems when you try to create your own People file. Did that happen to you? If so, don't worry -- there's a good chance that your problem is easy to fix by making a few small changes to your People file.

If the Publication Harvester gives you a message like this:

Error window: People file contains a bad setnb

then you've run into an oddity in Excel where it inserts blank lines at the end of the file. Select the last ten or so lines in your People file, right-click on them, and choose "Delete". This should remove the extra lines, so you can save this new file and load it into the Harvester.
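If you happen to process the People file with a script before loading it, the same cleanup can be sketched in Python. This is only an illustration: the function name is made up, and it assumes the rows have already been read into lists of cell values by whatever spreadsheet library you use.

```python
def drop_trailing_blank_rows(rows):
    """Remove empty rows from the end of a sheet (hypothetical helper).

    A row counts as blank when every cell is None or only whitespace.
    """
    rows = list(rows)
    while rows and all(c is None or str(c).strip() == "" for c in rows[-1]):
        rows.pop()
    return rows


# Made-up sheet contents: one real row followed by two blank ones.
sheet = [
    ["78", "GEORGE", "S", "ABELA"],
    ["", "", "", ""],
    [None, None, None, None],
]
print(len(drop_trailing_blank_rows(sheet)))
```

Blank rows in the middle of the file are left alone, since the message above is about the extra lines at the end.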

When you run Pubtrialfinal-bad.xls through the Publication Harvester, only one of its rows yields any publications: the row with Michel Baudry. It doesn't pick up publications for anyone else. If we go through the file row by row, we can start to see why that happened.

First, here's a quick reminder of how the Publication Harvester works. It executes the query in the medline_search1 column against PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed). It looks at the list of authors for each publication returned and tries to match them to each of the values in the name1, name2, name3 and name4 columns. Any time there's a match, it adds that publication to the database. Note that the matching is not case-sensitive, so both "Stellman AB" and "STELLMAN AB" will match the same publications.
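To make the matching step concrete, here is a minimal Python sketch of a case-insensitive comparison like the one described above. It is not the Harvester's actual code: the function name and the sample author list are made up, and it assumes a match means the whole author string is equal apart from case.

```python
def matches_person(authors, names):
    """Return True if any author on a publication equals one of the
    person's name1..name4 values, ignoring case (hypothetical helper)."""
    wanted = {n.lower() for n in names if n}  # skip empty name columns
    return any(a.lower() in wanted for a in authors)


# Author list for one publication (made-up sample data).
authors = ["Smith J", "Stellman AB", "Jones K"]

print(matches_person(authors, ["STELLMAN AB"]))   # differs only by case
print(matches_person(authors, ["STELLMAN A B"]))  # extra space between initials
```

The first call matches and the second does not, which is the kind of difference the rest of this guide tracks down.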

Now take a look at the first row in the file:

78 | GEORGE | S | ABELA | ABELA, GEORGE S | ABELA G S | ("abela, george s"[au] OR "abela gs "[au] )

Take a look at what happens when you execute the query ("abela, george s"[au] OR "abela gs "[au] ) against PubMed. Go to the PubMed web page and paste in the entire query, including the parentheses. I just executed that query and it returned 59 results. But when I looked at the author lists, none of them contained either "ABELA, GEORGE S" or "ABELA G S". Instead, a bunch of them contained "Abela GS", with no space between the two initials. So I added "Abela GS" to the Name3 column.

I did the same thing with the next row. I pasted ("aberhart, donald j"[au] OR "aberhart dj "[au] ) into the PubMed query page, saw that 20 results were returned, noticed that they all contained "Aberhart DJ" in the results, and so I added "Aberhart DJ" to the Name3 column.
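If you have a lot of rows with this problem, the fix for the first two rows can be sketched in Python. This is just an illustration with a made-up function name; it assumes the last name is a single word followed only by one-letter initials, so it leaves anything else unchanged.

```python
def collapse_initials(name):
    """Join space-separated initials, e.g. 'ABELA G S' -> 'ABELA GS'
    (hypothetical helper; assumes a one-word last name)."""
    parts = name.split()
    initials = parts[1:]
    if len(initials) > 1 and all(len(p) == 1 and p.isalpha() for p in initials):
        return parts[0] + " " + "".join(initials)
    return name


for value in ["ABELA G S", "ABERHART D J", "ABELA, GEORGE S"]:
    print(value, "->", collapse_initials(value))
```

You would still want to check the result against the author lists that PubMed actually returns before putting it in the Name3 column.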

I did the same thing for the third row, adding "Aberth WH" to the Name3 column, but it still didn't work when I loaded it into the Publication Harvester. Then I took another look at the name and the query:

name3: Aberth WH
medline_search1: ("aberth, william j"[au] OR "aberth wj" [au])

Did you notice how the middle initial in the Medline search and the name don't match? I searched Medline for "aberth wj[au]" but didn't come up with anything. Then I searched for "aberth wh[au]" and came up with one publication! So I changed the medline search:

medline_search1: aberth wh [au]
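A mismatch like this one is easy to miss by eye, so here is a rough Python sketch that flags it. The function names are made up, and it only looks at quoted phrases tagged with [au]; a name that doesn't appear in the query isn't always an error, so treat a False result as a hint to look closer.

```python
import re


def quoted_author_terms(query):
    """Return the quoted phrases tagged [au] in a search string,
    lowercased and trimmed (hypothetical helper)."""
    found = re.findall(r'"([^"]*)"\s*\[au\]', query, flags=re.IGNORECASE)
    return [term.strip().lower() for term in found]


def name_in_query(name, query):
    """True if the name appears as one of the quoted [au] phrases."""
    return name.strip().lower() in quoted_author_terms(query)


query = '("aberth, william j"[au] OR "aberth wj" [au])'
print(name_in_query("Aberth WH", query))  # the name3 value from this row
print(name_in_query("Aberth WJ", query))  # what the query actually asks for
```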

The next row (Michel Baudry) works just fine. The reason is that he doesn't have a middle initial, so the problem with the extra space between the first and middle initial doesn't come up. But then why doesn't "CAMPISI J" work, since she doesn't have a middle initial either? Take a look at her values:

name2: 'CAMPISI, J
medline_search1: ("'CAMPISI, JUDITH "[au] OR "'CAMPISI, J"[au] )

There are two problems with this row. First, there's a single quote at the start of name2 and in the medline search. But even if I remove the single quotes, the harvester doesn't turn up anything. If you paste that search into the PubMed search web page, nothing comes up. The reason is that there's a comma between the last name and first initial. If I remove that comma from name2 and from the matching term in medline_search1, I get:

name2: Campisi J
medline_search1: ("CAMPISI, JUDITH "[au] OR "CAMPISI J"[au] )

Now this works! I made that same change to the last two rows, adding name3 values that don't have a space between the first and middle initial.
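The two fixes for this row can also be sketched in Python. The function name is made up, and the rule is a guess based only on this example: it strips a leading single quote and removes a comma only when one or two letters follow it, so "CAMPISI, JUDITH" keeps its comma.

```python
import re


def clean_short_name(value):
    """Strip a leading single quote and the comma before bare initials,
    e.g. "'CAMPISI, J" -> "CAMPISI J" (hypothetical helper)."""
    value = value.strip().lstrip("'")
    # Replace the comma only when nothing but one or two letters follows it.
    return re.sub(r",\s*(?=[A-Za-z]{1,2}$)", " ", value)


for raw in ["'CAMPISI, J", "'CAMPISI, JUDITH "]:
    print(repr(raw), "->", repr(clean_short_name(raw)))
```

A very short first name after the comma would be treated as initials by this rule, so check the output before relying on it.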

I made all of these changes, and when I ran the harvester it got everyone. But I saw these lines in the log file:

10/18/2006 11:39:11 AM: Getting publications for ABELA (78), number 1 of 7
10/18/2006 11:39:12 AM: Publication 14743849 does not contain author 78
10/18/2006 11:39:14 AM: Wrote 58 publications, average write time 21.3

So I searched on PubMed for 14743849[pmid], and I saw the following name in the author list: Abela GS 2nd -- it looks like George Abela also publishes under that name. So I added that as the name4 value for his row, and the warning disappeared.
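If the log is long, a small Python sketch like this one can pull out every warning of that kind. It assumes the log lines look exactly like the ones shown above; the function name is made up for the example.

```python
import re

# Matches lines like "Publication 14743849 does not contain author 78".
WARNING = re.compile(r"Publication (\d+) does not contain author (\d+)")


def unmatched_publications(log_lines):
    """Return (pmid, author_id) pairs for each warning line (hypothetical helper)."""
    pairs = []
    for line in log_lines:
        match = WARNING.search(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


log = [
    "10/18/2006 11:39:11 AM: Getting publications for ABELA (78), number 1 of 7",
    "10/18/2006 11:39:12 AM: Publication 14743849 does not contain author 78",
    "10/18/2006 11:39:14 AM: Wrote 58 publications, average write time 21.3",
]
for pmid, author_id in unmatched_publications(log):
    # Print a search string that can be pasted into PubMed.
    print(f"{pmid}[pmid] (author {author_id})")
```

Each printed search string can be pasted into the PubMed search page to see which spelling of the name the publication actually uses.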

I made all of these changes to the original sample file -- you can find the new People file in Pubtrialfinal-good.xls.

Keep in mind that if your People file is not returning results, you may not have made exactly the same mistakes as the ones in the sample file for this example. It's possible that you've found an author that simply does not have any publications in PubMed. But if you troubleshoot the query by pasting it into the PubMed search page, you should be able to figure out exactly what's going on, and fix your People file appropriately.

Good luck!


License

This software is released under the GNU General Public License (GPL). This documentation is released under the GNU Free Documentation License (FDL).

Contact Information

The Publication Harvester project is maintained by Andrew Stellman of Stellman & Greene Consulting. If you have questions, comments, patches, or bug reports, please contact pubharvester@stellman-greene.com.