FindRelated

Description

FindRelated is a companion tool to Publication Harvester that works with a database of publications previously downloaded from PubMed using Publication Harvester. FindRelated uses the Related Citations search to find and harvest all of the publications related to publications already in the database.

Screenshot


Download

Software downloads:

The Publication Harvester software runs on Windows 7, 8, and 10 (and probably runs fine on previous versions). It was written in C#, and requires .NET Framework 4.0. (This should already be installed if you're running a current version of Windows.)

The following sample file may be helpful:

Documentation

Data sources

FindRelated uses the following data file format:

setnb,pmid
X0000001,12764489
X0000001,9474027
X0000002,17130168
X0000002,12682366
X0000002,12625820

example: sample-findrelatedi-input.csv
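
As a rough illustration of the format above (not FindRelated's own C# code), the input file can be read with Python's standard csv module. The function name read_pairs is hypothetical, and the sketch assumes the header row is exactly setnb,pmid as shown.

```python
import csv
import io


def read_pairs(stream):
    """Yield (setnb, pmid) tuples from a CSV stream with a 'setnb,pmid' header."""
    reader = csv.DictReader(stream)
    for row in reader:
        setnb = row["setnb"].strip()
        pmid = int(row["pmid"])  # the PMID columns in the tables are integers
        yield setnb, pmid


# In-memory stand-in for the input file, using rows from the sample above.
sample = io.StringIO(
    "setnb,pmid\n"
    "X0000001,12764489\n"
    "X0000001,9474027\n"
    "X0000002,17130168\n"
)
pairs = list(read_pairs(sample))
print(pairs)
```

Converting the PMID to an integer early means a malformed row fails at load time rather than partway through a run.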

The Related Citations search uses the Elink query to retrieve related citation data from PubMed. The following links have additional information about this query:

Basic Operation

For each pair of setnb/PMID in the input, FindRelated uses the Elink query to retrieve the list of related articles, harvests them into the Publication Harvester database, and adds the rank and score to the related publications table specified by the user:

+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| PMID        | int(11) | NO   | PRI | NULL    |       |
| RelatedPMID | int(11) | NO   | PRI | NULL    |       |
| Rank        | int(11) | NO   |     | NULL    |       |
| Score       | int(11) | NO   |     | NULL    |       |
+-------------+---------+------+-----+---------+-------+
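
The step above can be sketched in Python. This is a minimal sketch under stated assumptions, not the tool's actual implementation: it assumes the ELink request uses cmd=neighbor_score with the pubmed_pubmed link name, and that Rank is the 1-based position of each link in the returned list. The function names are hypothetical, and the scores in the sample XML are invented placeholders, not real PubMed data.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_url(pmid):
    """Build an ELink URL asking PubMed for scored neighbors of one PMID."""
    params = {"dbfrom": "pubmed", "db": "pubmed", "id": pmid, "cmd": "neighbor_score"}
    return ELINK_URL + "?" + urlencode(params)


def related_rows(source_pmid, xml_text):
    """Turn an ELink response into (PMID, RelatedPMID, Rank, Score) rows."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for linkset_db in root.iter("LinkSetDb"):
        if linkset_db.findtext("LinkName") != "pubmed_pubmed":
            continue  # skip other link types such as reviews
        # Assumption: rank is the 1-based position in the returned list.
        for rank, link in enumerate(linkset_db.findall("Link"), start=1):
            related = int(link.findtext("Id"))
            score = int(link.findtext("Score"))
            rows.append((source_pmid, related, rank, score))
    return rows


# Hand-written response fragment; the Score values are made up for illustration.
SAMPLE = """
<eLinkResult><LinkSet>
  <DbFrom>pubmed</DbFrom>
  <IdList><Id>12764489</Id></IdList>
  <LinkSetDb>
    <DbTo>pubmed</DbTo>
    <LinkName>pubmed_pubmed</LinkName>
    <Link><Id>9474027</Id><Score>500</Score></Link>
    <Link><Id>17130168</Id><Score>400</Score></Link>
  </LinkSetDb>
</LinkSet></eLinkResult>
"""

rows = related_rows(12764489, SAMPLE)
print(build_elink_url(12764489))
print(rows)
```

Each tuple maps directly onto one row of the related publications table shown above.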

The user can specify filters using the FindRelated form:

Once the related publications are harvested, FindRelated can generate reports:

The linking report contains the list of pairs of source PMID and related PMID:

-- Linking Report
SELECT PMID AS source_pmid, RelatedPMID AS related_pmid,
Rank AS link_ranking, Score AS link_score
FROM relatedpublications

The related PMID report contains the harvested information for each related publication found:

-- Related PMID report
SELECT DISTINCT rp.RelatedPMID AS related_pmid, 
p.journal, p.authors, p.year, p.month, p.day, p.title, p.volume, p.issue, p.pages, p.pubtype, p.pubtypecategoryid
FROM relatedpublications rp, publications p
WHERE rp.RelatedPMID = p.PMID

The related MeSH report contains a list of MeSH headings for each related publication:

-- Related MeSH report
SELECT DISTINCT rp.RelatedPMID AS related_pmid, mh.Heading AS related_mesh
FROM relatedpublications rp, publicationmeshheadings pmh, meshheadings mh
WHERE rp.RelatedPMID = pmh.PMID
AND pmh.MeSHHeadingID = mh.ID

The extreme relevance report lists each source PMID along with its most relevant related PMID (i.e. the one with the highest score) and that PMID's relatedness score, and its least relevant related PMID with that PMID's relatedness score and rank:

-- Extreme Relevance report
SELECT PMID as source_pmid, MostRelevantPMID as most_rlvnt_pmid, MostRelevantScore as most_rlvnt_score, 
LeastRelevantPMID as least_rlvnt_pmid, LeastRelevantScore as least_rlvnt_score, LeastRelevantRank as least_rlvnt_rank
FROM relatedpublications_extremerelevance
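
How the extreme relevance table is populated is not shown here. As a rough sketch, assuming "most relevant" means the highest score and "least relevant" the lowest score for each source PMID, the same values could be derived from the related publications rows as follows. The function name is hypothetical, the scores are invented for illustration, and ties are broken by whichever row is seen first, which is an assumption.

```python
def extreme_relevance(rows):
    """Summarize (pmid, related_pmid, rank, score) rows per source PMID.

    Returns {pmid: (most_pmid, most_score, least_pmid, least_score, least_rank)}.
    """
    by_source = {}
    for pmid, related, rank, score in rows:
        by_source.setdefault(pmid, []).append((related, rank, score))

    report = {}
    for pmid, links in by_source.items():
        most = max(links, key=lambda link: link[2])   # highest score
        least = min(links, key=lambda link: link[2])  # lowest score
        report[pmid] = (most[0], most[2], least[0], least[2], least[1])
    return report


# Illustrative rows only; the scores are not real PubMed values.
rows = [
    (12764489, 9474027, 1, 900),
    (12764489, 17130168, 2, 700),
    (12764489, 12682366, 3, 300),
    (12625820, 12682366, 1, 50),
]
report = extreme_relevance(rows)
print(report)
```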

Note: In the above queries, relatedpublications is replaced with the name of the table generated by FindRelated. For example, if the user specified relatedxyz as the table name, the extreme relevance report would query the table relatedxyz_extremerelevance.
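
The substitution described in this note could be sketched as below. The helper name render_query is hypothetical, and the template is an abbreviated copy of the extreme relevance query. Because table names cannot be passed as bound SQL parameters, the sketch validates the name before splicing it into the query text.

```python
import re

# Abbreviated version of the extreme relevance report query shown above.
EXTREME_RELEVANCE_REPORT = (
    "SELECT PMID AS source_pmid, MostRelevantPMID AS most_rlvnt_pmid "
    "FROM relatedpublications_extremerelevance"
)


def render_query(template, table_name):
    """Replace the placeholder table name with the user-specified one."""
    # Accept only plain identifiers, since the name is inserted as raw SQL text.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    # Also rewrites derived names such as relatedpublications_extremerelevance.
    return re.sub(r"\brelatedpublications", table_name, template)


query = render_query(EXTREME_RELEVANCE_REPORT, "relatedxyz")
print(query)
```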

Interaction with SC/Gen

FindRelated can retrieve colleagues in the "idea space" by interacting with SC/Gen. It automatically creates a view by appending _peoplepublications to the related publications table name:

CREATE OR REPLACE VIEW relatedpublications_peoplepublications AS
SELECT p.Setnb, rp.RelatedPMID AS PMID, -1 AS AuthorPosition, 6 AS PositionType
FROM people p, peoplepublications pp, relatedpublications rp
WHERE p.Setnb = pp.Setnb
AND pp.PMID = rp.PMID;

This view is used in conjunction with SC/Gen, which can use it as an alternate people publications table. This causes SC/Gen to find colleagues and harvest publications in the "idea space", where a colleague is any author in the roster that coauthored a related paper.

Once the related colleagues are found, the FindRelated idea peer report is enabled. This report shows the list of peers found for each star, with a row for each peer publication including the position type (which is documented in the Publication Harvester documentation):

-- Idea peer report, with author position and position type for the colleagues based on the related publication
SELECT sc.StarSetnb AS star_setnb, sc.setnb,
rp.PMID AS source_pmid, rp.RelatedPMID AS related_pmid,
cp.AuthorPosition as author_position, cp.PositionType as position_type
FROM starcolleagues sc, peoplepublications pp,
   relatedpublications rp LEFT JOIN colleaguepublications cp ON (cp.PMID = rp.RelatedPMID)
WHERE sc.StarSetnb = pp.Setnb
AND pp.PMID = rp.PMID
AND cp.Setnb = sc.Setnb

Fault tolerance

FindRelated is built for fault tolerance, so that its runs can be interrupted at any time without losing data. This is done by reading the input file into a table (the table name is derived by appending _queue to the name of the related publications table):

+-----------+---------+------+-----+---------+-------+
| Field     | Type    | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| Setnb     | char(8) | NO   | PRI | NULL    |       |
| PMID      | int(11) | NO   | PRI | NULL    |       |
| Processed | bit(1)  | YES  |     | NULL    |       |
| Error     | bit(1)  | YES  |     | NULL    |       |
+-----------+---------+------+-----+---------+-------+

Data is loaded into this queue automatically when you specify an input filename and click the "Start" button. The program works by first reading each Setnb/PMID pair from each row in the input file, adding those pairs to the queue table, and then processing all of the pairs as usual. Each time a pair is successfully processed, its Processed column is changed from 0 to 1. If an error occurs, its Processed column is set to 0 and its Error column is set to 1. This is how FindRelated keeps track of its queue of remaining pairs to be processed.
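
The bookkeeping described above can be sketched with an in-memory SQLite database standing in for MySQL. This is an illustration under assumptions, not the tool's code: the table name related_queue, the function names, and the harvest callback are all hypothetical, and the flags default to 0 here so that the "changed from 0 to 1" behavior is visible.

```python
import sqlite3


def load_queue(conn, pairs):
    """Create the queue table and add every Setnb/PMID pair as unprocessed."""
    conn.execute(
        "CREATE TABLE related_queue ("
        " Setnb TEXT NOT NULL, PMID INTEGER NOT NULL,"
        " Processed INTEGER DEFAULT 0, Error INTEGER DEFAULT 0,"
        " PRIMARY KEY (Setnb, PMID))"
    )
    conn.executemany("INSERT INTO related_queue (Setnb, PMID) VALUES (?, ?)", pairs)
    conn.commit()


def process_queue(conn, harvest):
    """Process each unprocessed pair and record success or failure."""
    pending = conn.execute(
        "SELECT Setnb, PMID FROM related_queue WHERE Processed = 0"
    ).fetchall()
    for setnb, pmid in pending:
        try:
            harvest(setnb, pmid)
            sql = "UPDATE related_queue SET Processed = 1 WHERE Setnb = ? AND PMID = ?"
        except Exception:
            sql = ("UPDATE related_queue SET Processed = 0, Error = 1"
                   " WHERE Setnb = ? AND PMID = ?")
        conn.execute(sql, (setnb, pmid))
        conn.commit()  # commit per pair so an interruption keeps finished work


def fake_harvest(setnb, pmid):
    """Stand-in for the real harvesting step; fails for one PMID on purpose."""
    if pmid == 9474027:
        raise RuntimeError("simulated failure")


conn = sqlite3.connect(":memory:")
load_queue(conn, [("X0000001", 12764489), ("X0000001", 9474027)])
process_queue(conn, fake_harvest)
state = conn.execute(
    "SELECT PMID, Processed, Error FROM related_queue ORDER BY PMID"
).fetchall()
print(state)
```

Committing after every pair is what makes the run resumable: the queue table always reflects exactly which pairs have been completed.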

When you select a database from the dropdown and specify a related publications table name, the program queries the database to see if any unprocessed pairs are in the queue. If there are pairs remaining, it will display an error in the log indicating the number of pairs, and how many of those pairs are errors. To resume the run where it left off, click the "Resume" button. If you click the "Start" button, the existing tables (including the queue table) will be truncated and repopulated from the beginning.
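
The check for leftover work and the difference between "Start" and "Resume" might look roughly like this. Again SQLite stands in for MySQL, the function names are hypothetical, and the queue table name is assumed to have been validated already, since it is spliced into the SQL text.

```python
import sqlite3


def queue_status(conn, queue_table):
    """Return (unprocessed pair count, how many of those are errors)."""
    remaining, errors = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(Error), 0) FROM {queue_table}"
        " WHERE Processed = 0"
    ).fetchone()
    return remaining, errors


def start(conn, queue_table, pairs):
    """'Start': empty the queue and repopulate it from the input pairs."""
    conn.execute(f"DELETE FROM {queue_table}")  # SQLite has no TRUNCATE statement
    conn.executemany(
        f"INSERT INTO {queue_table} (Setnb, PMID, Processed, Error)"
        " VALUES (?, ?, 0, 0)",
        pairs,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relatedxyz_queue (Setnb TEXT, PMID INTEGER,"
    " Processed INTEGER, Error INTEGER, PRIMARY KEY (Setnb, PMID))"
)
# Simulate an interrupted run: one pair done, one failed, one untouched.
conn.executemany(
    "INSERT INTO relatedxyz_queue VALUES (?, ?, ?, ?)",
    [
        ("X0000001", 12764489, 1, 0),
        ("X0000001", 9474027, 0, 1),
        ("X0000002", 17130168, 0, 0),
    ],
)
status_before = queue_status(conn, "relatedxyz_queue")
start(conn, "relatedxyz_queue", [("X0000002", 12682366)])
status_after = queue_status(conn, "relatedxyz_queue")
print(status_before, status_after)
```

A "Resume" would skip the call to start and simply continue processing the rows where Processed is 0. Whether pairs already flagged as errors are retried on resume is not stated in this document.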

"Lite" mode

When the "lite" mode checkbox is checked, FindRelated runs in "lite" mode. This changes the behavior in the following ways:

Revision history

License

This software is released under the GNU General Public License (GPL).

Contact Information

The Publication Harvester project is maintained by Andrew Stellman of Stellman & Greene Consulting. If you have questions, comments, patches, or bug reports, please contact pubharvester@stellman-greene.com.

We gratefully acknowledge the financial support of the National Science Foundation (Award SBE-0738142).