Retrieve bibliographic data via Z39.50 with PHPYAZ
Z39.50 is a protocol common in library environments for querying and retrieving bibliographic data, for example to fetch additional metadata (author, publication date, etc.) for a book by its ISBN. Libraries use it to import title records from other libraries or data providers.
Implementing an application that uses this protocol as a non-specialist in library systems requires insight into some standards used in this field, in particular MARC, MAB (English translation of fields >= 90 here) and the Bib-1 Attribute Set.
In the following I’ll describe how to set up an environment for using Z39.50 in PHP and how to process basic queries with PHPYAZ, a free-to-use extension provided by Index Data.
Prerequisites
Assuming a running Apache2 with PHP (including the development package) on a Debian machine, we first need to install the packages yaz
and libyaz4-dev
before PHPYAZ can be installed via

    pecl install yaz
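Put together, the setup might look like this on Debian (package names as mentioned above; they can differ between releases, and the commands need root privileges):

```shell
# install the YAZ toolkit and its development headers
apt-get install yaz libyaz4-dev

# build and install the PHP extension via PECL
pecl install yaz
```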
Before the functions from PHPYAZ can be used, the extension must be configured to be loaded. For Apache2 on Debian this is achieved by creating the file /etc/php5/apache2/conf.d/yaz.ini
with the following contents and then reloading Apache2:

    ; configuration for PHP YAZ module
    extension=yaz.so
Target profiles and authentication
The details about a Z39.50 server are given as a “Target Profile”; see, for example, the one for the Austrian Library Network and another one for the catalog IDS Basel Bern. Target Profiles contain information about the server URL, port, encodings and result formats, and define the attributes supported in queries.
Several libraries (claim to) have an authentication mechanism in place. This does not necessarily mean costs; libraries may grant each other free access upon request. Besides this, servers may claim to require a user name and password but “magically” work without them – this seems to be the case for quite a few servers.
Set up a connection and perform queries
The following lines of code show how to establish a connection to a server and execute a query:
    // Server URL with port and database descriptor
    $server = "z3950.obvsg.at:9991/ACC01";
    $syntax = "mab";
    // for a USMARC example, uncomment the next $server and $syntax definitions
    // $server = "aleph.unibas.ch:9909/IDS_UTF";
    // $syntax = "usmarc";

    // mapping of CCL keywords to Bib-1 attributes (allows convenient query formulation)
    $fields = array("wti" => "1=4",
                    "ibn" => "1=7",
                    "isn" => "1=8",
                    "wja" => "1=31",
                    "wpe" => "1=1004",
                    "wve" => "1=1018");

    // establish a connection and store the session identifier;
    // credentials are an optional second parameter in the format "<user>/<passwd>"
    $session = yaz_connect($server);
    // check whether an error occurred
    if (yaz_error($session) != ""){
        die("Error: " . yaz_error($session));
    }
    // configure the desired result syntax (must be specified in the Target Profile)
    yaz_syntax($session, $syntax);
    // configure YAZ's CCL parser with the mapping from above
    yaz_ccl_conf($session, $fields);

    // define a query using CCL (allowed operators are 'and', 'or', 'not');
    // available fields are the ones in $fields (again, see the Target Profile)
    $ccl_query = "(wpe = Liggesmeyer) and (wpe = Peter)";
    // let YAZ parse the query and check for errors
    if (!yaz_ccl_parse($session, $ccl_query, $ccl_result)){
        die("The query could not be parsed.");
    } else {
        // fetch the RPN result from the parser
        $rpn = $ccl_result["rpn"];
        // run the actual query
        yaz_search($session, "rpn", $rpn);
        // yaz_wait blocks until the query is done
        yaz_wait();
        if (yaz_error($session) != ""){
            die("Error: " . yaz_error($session));
        }
        // yaz_hits returns the number of records found
        if (yaz_hits($session) > 0){
            echo "Found some hits:<br>";
            // yaz_record fetches a record from the current result set;
            // so far I've only seen servers supporting the string format
            $result = yaz_record($session, 1, "string");
            print($result);
            echo "<br><br>";
            // the parsing functions will be introduced later
            if ($syntax == "mab")
                $parsedResult = parse_mab_string($result);
            else
                $parsedResult = parse_usmarc_string($result);
            print_r($parsedResult);
        } else {
            echo "No records found.";
        }
    }
After the connection is established, a call to yaz_ccl_conf
is made, which configures the Common Command Language parser (for CCL, see here and here). CCL is a language supported by various OPACs for formulating more complex queries. Which attributes can be used is specified in the Target Profile. E.g. for the servers linked earlier, the Bib-1/AT-1 attribute “Title” is specified with use number 4 – the first entry in the $fields
variable in the snippet above declares the field identifier ‘wti’ to be mapped to this attribute in queries by the CCL parser.
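To make the mapping concrete, here is a minimal standalone illustration (no YAZ required): $fields is a plain associative array from CCL keywords to Bib-1 type/use attribute pairs, nothing more.

```php
<?php
// The mapping handed to the CCL parser is an ordinary associative array:
// "wti" ("word in title") points at Bib-1 use attribute 4 ("Title"),
// "wpe" at use attribute 1004 ("Author-name personal").
$fields = array("wti" => "1=4", "wpe" => "1=1004");

// yaz_ccl_conf() receives exactly this array; looking up a keyword
// yields the attribute specification used in the query sent to the server.
echo $fields["wti"]; // prints "1=4"
```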
Processing result records
At first glance the retrieved result looks like a mess – don’t worry, this impression remains. The good news is that this stuff is parsable once you know that in the MAB format fields are separated by the non-printable record separator character (0x1E).
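The splitting itself is plain string work. A minimal sketch with a synthetic record (the field ids and values here are made up; the separator handling matches the parser below):

```php
<?php
// synthetic MAB-like record: a 4-character field id, then the value,
// each field terminated by the record separator (0x1E)
$raw = "100b" . "Liggesmeyer, Peter" . chr(0x1E)
     . "331 " . "Software-Qualitaet" . chr(0x1E);

// splitting at the separator yields one entry per field;
// the last entry is empty because the record ends with a separator
$fields = explode(chr(0x1E), $raw);

for ($i = 0; $i <= count($fields) - 2; $i++) {
    // the first 4 characters are the field id, the rest is the value
    echo substr($fields[$i], 0, 4), " => ", substr($fields[$i], 4), "\n";
}
```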
I have written two functions: one for the MAB format, which parses all fields found in the result string (for their meaning refer to the spec), and another that parses results in MARC syntax but only extracts specific fields, due to the result’s irregular structure (field 008 is a good example here). Here they are:
    function parse_mab_string($record){
        $result = array();
        // exchange some characters that are returned malformed (collected over time)
        $record = exchange_chars($record);
        // split the returned fields at the record separator character
        $record = explode(chr(0x001E), $record);
        // cut off the record's leading meta data
        $record[0] = substr($record[0], strpos($record[0], "h001") + 1);
        // examine all fields and extract their contents; the last entry is always empty
        for ($datnum = 0; $datnum <= count($record) - 2; $datnum++){
            $data = $record[$datnum];
            // the first 4 chars are the field id
            $field = substr($data, 0, 4);
            if (!isset($result[$field]))
                $result[$field] = array();
            // the remaining substring is the field value
            array_push($result[$field], substr($record[$datnum], 4));
        }
        return $result;
    }
    // the following helper function restores a collection of malformed characters;
    // it is based on nearly 3 years of experience with several Z39.50 servers
    function exchange_chars($rep_string){
        $bad_chars = array(chr(0x00C9)."o", chr(0x00C9)."O", chr(0x00C9)."a",
                           chr(0x00C9)."A", chr(0x00C9)."u", chr(0x00C9)."U",
                           chr(137), chr(136), chr(251), chr(194)."a",
                           chr(194)."i", chr(194)."e", chr(208)."c", chr(194)."E",
                           chr(207)."c", chr(207)."s", chr(207)."S", chr(201)."i",
                           chr(200)."e", chr(193)."e", chr(193)."a", chr(193)."i",
                           chr(193)."o", chr(193)."u", chr(195)."u", chr(201)."e",
                           chr(195).chr(194), "&#263;", "ä");
        $rep_chars = array("ö", "Ö", "ä",
                           "Ä", "ü", "Ü",
                           "", "", "ß", "á",
                           "í", "é", "ç", "É",
                           "č", "š", "Š", "ï",
                           "ë", "è", "à", "ì",
                           "ò", "ù", "û", "ë",
                           "ä", "ć", "ä");
        return str_replace($bad_chars, $rep_chars, $rep_string);
    }
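The repair itself is a plain str_replace. For instance, the first pattern above turns a mangled umlaut back into ‘ö’ (synthetic input):

```php
<?php
// one pattern from the table above: chr(0xC9) followed by "o"
// stands for a broken "ö"
$bad  = array(chr(0xC9) . "o");
$good = array("ö");

echo str_replace($bad, $good, "K" . chr(0xC9) . "onigsberg"); // prints "Königsberg"
```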
MAB result records have a regular structure, which lowers the effort needed to parse them. The critical things to know are the separator character and how to restore malformed characters.
In the MARC format most fields have the same structure: they start with their id, followed by one or more subfield values. But there are exceptions like field 008 mentioned above, which contains the language indicator in characters 39 to 41, so it has to be handled separately:
    function parse_usmarc_string($record){
        $ret = array();
        // there was a case where angle brackets interfered
        $record = str_replace(array("<", ">"), array("", ""), $record);
        $record = utf8_decode($record);
        // split the returned fields at their separation character (newline)
        $record = explode("\n", $record);
        // examine each line for the wanted information (see the USMARC spec for details)
        foreach ($record as $category){
            // subfield indicators are preceded by a $ sign
            $parts = explode("$", $category);
            // remove leading and trailing spaces
            array_walk($parts, "custom_trim");
            // the first value holds the field id;
            // depending on the desired info a certain subfield value is retrieved
            switch (substr($parts[0], 0, 3)){
                case "008": $ret["language"] = substr($parts[0], 39, 3); break;
                case "020": $ret["isbn"] = get_subfield_value($parts, "a"); break;
                case "022": $ret["issn"] = get_subfield_value($parts, "a"); break;
                case "100": $ret["author"] = get_subfield_value($parts, "a"); break;
                case "245": $ret["titel"] = get_subfield_value($parts, "a");
                            $ret["subtitel"] = get_subfield_value($parts, "b"); break;
                case "250": $ret["edition"] = get_subfield_value($parts, "a"); break;
                case "260": $ret["pub_date"] = get_subfield_value($parts, "c");
                            $ret["pub_place"] = get_subfield_value($parts, "a");
                            $ret["publisher"] = get_subfield_value($parts, "b"); break;
                case "300": $ret["extent"] = get_subfield_value($parts, "a");
                            $ext_b = get_subfield_value($parts, "b");
                            $ret["extent"] .= ($ext_b != "") ? (" : " . $ext_b) : "";
                            break;
                case "490": $ret["series"] = get_subfield_value($parts, "a"); break;
                case "502": $ret["diss_note"] = get_subfield_value($parts, "a"); break;
                case "700": $ret["editor"] = get_subfield_value($parts, "a"); break;
            }
        }
        return $ret;
    }

    // fetches the value of a certain subfield given its label
    function get_subfield_value($parts, $subfield_label){
        $ret = "";
        foreach ($parts as $subfield)
            if (substr($subfield, 0, 1) == $subfield_label)
                $ret = substr($subfield, 2);
        return $ret;
    }

    // wrapper function for trim so it can be passed to array_walk
    function custom_trim(&$value, $key){
        $value = trim($value);
    }
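As a quick standalone check of the two parsing ideas (synthetic lines; offsets and labels as in the parser above): the language code sits at a fixed offset in the 008 line, while the other fields are found by their subfield label after splitting at ‘$’:

```php
<?php
// synthetic 008 line: 4 leading characters ("008 ") plus the field data,
// in which the language code occupies data positions 35-37,
// i.e. string positions 39-41
$line008 = "008 " . str_repeat(" ", 35) . "ger";
echo substr($line008, 39, 3), "\n"; // prints "ger"

// synthetic 245 line: splitting at '$' yields the field header plus one
// entry per subfield, each starting with its one-letter label
$parts = explode('$', '245 10$a Main title $b A subtitle');
foreach ($parts as $subfield) {
    if (substr($subfield, 0, 1) == "a") {
        // skip the label and the following blank, then trim
        echo trim(substr($subfield, 2)), "\n"; // prints "Main title"
    }
}
```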
Conclusion
Z39.50 and its implementations lack convenient handling of result records; they could deliver them in better-structured formats, e.g. XML with an appropriate DTD. Although Wikipedia says there is development in this direction, it does not seem to have been adopted so far. With e.g. the Google Books API you get results already structured as associative arrays, so you don’t need to parse them yourself. But if you have no choice and must work with Z39.50, then this might help with taking the first steps.
Hi Jonas,
thanks for the article! We are currently close to releasing ELTAB 2.0, and Z39.50 is one of the last things still missing. While googling for “z3950 yaz mab” I came across your blog post, which will hopefully help me now 🙂
Best regards,
Gero
Cool that ELTAB has survived and the prototype is becoming a real system now. 🙂
But then again, you have the complete source code this is done with. Although that probably doesn’t contain as many explanations… 😛
Exactly, the explanations are very helpful! Yes, it has survived, and we are constantly gaining new users 🙂