Retrieve bibliographic data via Z39.50 with PHP/YAZ

Z39.50 is a protocol common in library environments to query and retrieve bibliographic data, for example to fetch additional metadata such as author and publication date for a book by its ISBN. Libraries use it to import title records from other libraries or data providers.

Implementing an application that uses this protocol as a non-specialist in library systems requires insight into some of the standards used in this field, in particular MARC, MAB (English translation of fields >= 90 here) and the Bib-1 Attribute Set.

In the following I’ll describe how to set up an environment for using Z39.50 in PHP and how to process basic queries with PHP/YAZ, a free-to-use extension provided by Index Data.

Prerequisites

Assuming a running Apache2 with PHP (including the PHP development package) on a Debian machine, we first need to install the packages yaz and libyaz4-dev before PHP/YAZ can be installed via

  pecl install yaz

Before the functions from PHP/YAZ can be used, the extension must be configured to be loaded. For Apache2 on Debian this is achieved by creating the file /etc/php5/apache2/conf.d/yaz.ini with the following contents and then reloading Apache2:

yaz.ini
  ; configuration for the PHP YAZ module
  extension=yaz.so
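
A quick way to check that the extension was actually picked up is a minimal sketch like the following (run it through the SAPI you configured, i.e. Apache2 here):

  <?php
  // minimal sketch: verify that the yaz extension is loaded after reloading Apache2
  if (!extension_loaded("yaz")) {
      die("The yaz extension is not loaded - check yaz.ini and the Apache2 error log.");
  }
  echo "PHP/YAZ extension version: " . phpversion("yaz");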

Target profiles and authentication

The details about a Z39.50 server are given as a "Target Profile": see for example the one for the Austrian Library Network and another one for the catalog IDS Basel Bern. Target Profiles contain information about the server URL, ports, encodings and result formats, and define which attributes are supported for queries.

Several libraries (claim to) have an authentication mechanism in place. This does not necessarily mean costs; libraries may grant each other free access upon request. Besides this, a server may state that it requires a user name and password but "magically" work without them – this seems to be the case for quite a few servers.
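
If a server does require credentials, they can be passed to yaz_connect as a second parameter, either as a string in the "<user>/<passwd>" format mentioned below or as an options array. A minimal sketch (host name and credentials are placeholders, not a real server):

  // placeholder host and credentials, just to illustrate the call
  $session = yaz_connect("z3950.example.org:210/SOME_DB",
                         array("user" => "myuser", "password" => "mypassword"));
  // the plain string form works as well:
  // $session = yaz_connect("z3950.example.org:210/SOME_DB", "myuser/mypassword");
  if (yaz_error($session) != "")
      die("Error: " . yaz_error($session));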

Set up a connection and perform queries

The following lines of code show how to establish a connection to a server and execute a query:

  // Server URL with port and database descriptor
  $server = "z3950.obvsg.at:9991/ACC01";
  $syntax = "mab";
  // for a USMARC example uncomment the next $server and $syntax definitions
  // $server = "aleph.unibas.ch:9909/IDS_UTF";
  // $syntax = "usmarc";
  // Mapping of CCL keywords to Bib-1 attributes (allows convenient query formulation)
  $fields = array("wti" => "1=4",
                  "ibn" => "1=7",
                  "isn" => "1=8",
                  "wja" => "1=31",
                  "wpe" => "1=1004",
                  "wve" => "1=1018");
  // establish a connection and store the session identifier;
  // credentials are an optional second parameter in the format "<user>/<passwd>"
  $session = yaz_connect($server);
  // check whether an error occurred
  if (yaz_error($session) != ""){
      die("Error: " . yaz_error($session));
  }
  // configure the desired result syntax (must be specified in the Target Profile)
  yaz_syntax($session, $syntax);
  // configure YAZ's CCL parser with the mapping from above
  yaz_ccl_conf($session, $fields);
  // define a query using CCL (allowed operators are 'and', 'or', 'not');
  // available fields are the ones in $fields (again see the Target Profile)
  $ccl_query = "(wpe = Liggesmeyer) and (wpe = Peter)";
  // let YAZ parse the query and check for errors
  if (!yaz_ccl_parse($session, $ccl_query, $ccl_result)){
      die("The query could not be parsed.");
  } else {
      // fetch the RPN result from the parser
      $rpn = $ccl_result["rpn"];
      // do the actual query
      yaz_search($session, "rpn", $rpn);
      // yaz_wait blocks until the query is done
      yaz_wait();
      if (yaz_error($session) != ""){
          die("Error: " . yaz_error($session));
      }
      // yaz_hits returns the number of found records
      if (yaz_hits($session) > 0){
          echo "Found some hits:<br>";
          // yaz_record fetches a record from the current result set;
          // so far I've only seen servers supporting the string format
          $result = yaz_record($session, 1, "string");
          print($result);
          echo "<br><br>";
          // the parsing functions will be introduced later
          if ($syntax == "mab")
              $parsedResult = parse_mab_string($result);
          else
              $parsedResult = parse_usmarc_string($result);
          print_r($parsedResult);
      } else
          echo "No records found.";
  }

After the connection is established, a call to yaz_ccl_conf is made, which configures the Common Command Language parser (for CCL, see here and here). CCL is a language supported by various OPACs to formulate more complex queries. Which attributes can be used is specified in the Target Profile. For example, for the servers linked earlier the Bib-1/AT-1 attribute "Title" is specified with use number 4; the first entry in the $fields variable in the snippet above declares the field identifier 'wti' to be mapped to this attribute in queries by the CCL parser.
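
To make the mapping concrete, here is a small sketch of what the parser turns a CCL query into under the configuration above (the exact RPN/PQF string may vary slightly between YAZ versions):

  // with "wti" mapped to "1=4", a title search ...
  $ccl_query = "wti = 'software engineering'";
  if (yaz_ccl_parse($session, $ccl_query, $ccl_result)) {
      // ... is translated into an RPN query roughly like: @attr 1=4 "software engineering"
      echo $ccl_result["rpn"];
  }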

Processing result records

At first glance the retrieved result looks like a mess – don’t worry, this impression remains. The good news is that this stuff is parsable once you know that in the MAB format fields are separated by the non-printable record separator character.

I have written two functions: one for the MAB format, which parses all fields found in the result string (for their meaning refer to the spec), and one that parses results in MARC syntax but only extracts specific fields due to the result’s irregular structure (field 008 is a good example here). Here they are:

  function parse_mab_string($record){
      $result = array();
      // exchange some characters that are returned malformed (collected over time)
      $record = exchange_chars($record);
      // split the returned fields at the record separator character
      $record = explode(chr(0x001E), $record);
      // cut off the metadata of the record
      $record[0] = substr($record[0], strpos($record[0], "h001") + 1);
      // examine all fields and extract their contents; the last entry is always empty
      for ($datnum = 0; $datnum <= count($record) - 2; $datnum++){
          $data = $record[$datnum];
          // the first 4 chars are the field id
          $field = substr($data, 0, 4);
          if (!isset($result[$field]))
              $result[$field] = array();
          // the remaining substring is the field value
          array_push($result[$field], substr($record[$datnum], 4));
      }
      return $result;
  }

  // the following helper function restores a collection of malformed characters,
  // it is based on nearly 3 years of experience with several Z39.50 servers
  function exchange_chars($rep_string){
      $bad_chars = array(chr(0x00C9)."o", chr(0x00C9)."O", chr(0x00C9)."a",
                         chr(0x00C9)."A", chr(0x00C9)."u", chr(0x00C9)."U",
                         chr(137), chr(136), chr(251), chr(194)."a",
                         chr(194)."i", chr(194)."e", chr(208)."c", chr(194)."E",
                         chr(207)."c", chr(207)."s", chr(207)."S", chr(201)."i",
                         chr(200)."e", chr(193)."e", chr(193)."a", chr(193)."i",
                         chr(193)."o", chr(193)."u", chr(195)."u", chr(201)."e",
                         chr(195).chr(194), "&amp;#263;", "ä");
      $rep_chars = array("&ouml;",   "&Ouml;",   "&auml;",
                         "&Auml;",   "&uuml;",   "&Uuml;",
                         "", "", "&szlig;", "&aacute;",
                         "&iacute;", "&eacute;", "&ccedil;", "&Eacute;",
                         "&#269;",   "&#353;",   "&#352;",   "&iuml;",
                         "&euml;",   "&egrave;", "&agrave;", "&igrave;",
                         "&ograve;", "&ugrave;", "&ucirc;",  "&euml;",
                         "&auml;",   "&#263;",   "&auml;");
      return str_replace($bad_chars, $rep_chars, $rep_string);
  }

MAB result records have a regular structure, which lowers the effort needed to parse them. The critical things to know are the separation character and how to restore malformed characters.
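
As a usage sketch, dumping the parsed MAB record shows the field-id-to-values mapping the function produces (which field numbers actually appear depends on the record; their meaning is defined in the MAB spec):

  // $result is the raw MAB string fetched via yaz_record above
  $mab = parse_mab_string($result);
  foreach ($mab as $field => $values)
      echo $field . ": " . implode(" | ", $values) . "<br>";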

In the MARC format most fields have the same structure: they start with their id, followed by one or more subfield values. But there are exceptions like field 008 mentioned above, which contains the language indicator in characters 39 to 41, so it has to be handled separately:

  function parse_usmarc_string($record){
      $ret = array();
      // there was a case where angle brackets interfered
      $record = str_replace(array("<", ">"), array("", ""), $record);
      $record = utf8_decode($record);
      // split the returned fields at their separation character (newline)
      $record = explode("\n", $record);
      // examine each line for the wanted information (see the USMARC spec for details)
      foreach ($record as $category){
          // subfield indicators are preceded by a $ sign
          $parts = explode("$", $category);
          // remove leading and trailing spaces
          array_walk($parts, "custom_trim");
          // the first value holds the field id;
          // depending on the desired info a certain subfield value is retrieved
          switch (substr($parts[0], 0, 3)){
              case "008" : $ret["language"] = substr($parts[0], 39, 3); break;
              case "020" : $ret["isbn"] = get_subfield_value($parts, "a"); break;
              case "022" : $ret["issn"] = get_subfield_value($parts, "a"); break;
              case "100" : $ret["author"] = get_subfield_value($parts, "a"); break;
              case "245" : $ret["titel"] = get_subfield_value($parts, "a");
                           $ret["subtitel"] = get_subfield_value($parts, "b"); break;
              case "250" : $ret["edition"] = get_subfield_value($parts, "a"); break;
              case "260" : $ret["pub_date"] = get_subfield_value($parts, "c");
                           $ret["pub_place"] = get_subfield_value($parts, "a");
                           $ret["publisher"] = get_subfield_value($parts, "b"); break;
              case "300" : $ret["extent"] = get_subfield_value($parts, "a");
                           $ext_b = get_subfield_value($parts, "b");
                           $ret["extent"] .= ($ext_b != "") ? (" : " . $ext_b) : "";
                           break;
              case "490" : $ret["series"] = get_subfield_value($parts, "a"); break;
              case "502" : $ret["diss_note"] = get_subfield_value($parts, "a"); break;
              case "700" : $ret["editor"] = get_subfield_value($parts, "a"); break;
          }
      }
      return $ret;
  }

  // fetches the value of a certain subfield given its label
  function get_subfield_value($parts, $subfield_label){
      $ret = "";
      foreach ($parts as $subfield)
          if (substr($subfield, 0, 1) == $subfield_label)
              $ret = substr($subfield, 2);
      return $ret;
  }

  // wrapper function for trim so it can be passed to array_walk
  function custom_trim(&$value, $key){
      $value = trim($value);
  }
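
A corresponding usage sketch for MARC records; the array keys are exactly the ones assigned in parse_usmarc_string above:

  // $result is the raw USMARC string fetched via yaz_record above
  $marc = parse_usmarc_string($result);
  echo "Author: " . (isset($marc["author"])   ? $marc["author"]   : "n/a") . "<br>";
  echo "Title: "  . (isset($marc["titel"])    ? $marc["titel"]    : "n/a") . "<br>";
  echo "Year: "   . (isset($marc["pub_date"]) ? $marc["pub_date"] : "n/a") . "<br>";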

Conclusion

Z39.50 and its implementations lack convenient handling of result records; it would help if they delivered them in better structured formats, e.g. XML with an appropriate DTD. Although Wikipedia says there is development in this direction, it does not seem to be widely adopted so far. With the Google Books API, for example, you get results already structured as associative arrays, so you don’t need to parse them on your own. But if you have no choice and must work with Z39.50, this might help with the first steps.

Comments (3)

Gero, February 24th, 2016 at 13:19

Hi Jonas,

thanks for the article, we are just about to release ELTAB 2.0, and Z39.50 is one of the last things still missing! While googling for "z3950 yaz mab" I came across your blog entry, which will hopefully help me now 🙂

Best regards,

Gero

jonas, February 26th, 2016 at 18:28

Cool that ELTAB survived and the prototype is now turning into a real system. 🙂
But then again, you also have the complete source code in which this is done. Although it probably doesn’t contain that many explanations… 😛

Gero, March 2nd, 2016 at 11:44

Exactly, the explanations are very helpful! Yes, it survived and we are constantly getting new users 🙂
