Retrieve bibliographic data via Z39.50 with PHPYAZ
Z39.50 is a protocol common in library environments for querying and retrieving bibliographic data, for example to fetch additional metadata (author, publication date, etc.) for a book by its ISBN. Libraries use it to import title records from other libraries or data providers.
Implementing an application that uses this protocol as a non-specialist in library systems requires insight into some standards used in this field, in particular MARC, MAB (English translation of fields >= 90 here) and the Bib-1 Attribute Set.
In the following I’ll describe how to set up an environment for using Z39.50 in PHP and how to process basic queries with PHPYAZ, a free-to-use extension provided by Index Data.
Prerequisites
Assuming a running Apache2 with PHP (including the development package) on a Debian machine, we first need to install the packages yaz
and libyaz4-dev
before PHPYAZ can be installed via

    pecl install yaz
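Put together, the setup might look like this on Debian (package names as mentioned above; they can differ between releases, and the commands need root privileges):

```shell
# install the YAZ toolkit and its development headers
apt-get install yaz libyaz4-dev

# build and install the PHP extension via PECL
pecl install yaz
```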
Before the functions from PHPYAZ can be used, the extension must be configured to be loaded. For Apache2 on Debian this is achieved by creating the file /etc/php5/apache2/conf.d/yaz.ini
with the following contents and then reloading Apache2:

    ; configuration for PHP YAZ module
    extension=yaz.so
Target profiles and authentication
The details about a Z39.50 server are given as a “Target Profile”; see, for example, the one for the Austrian Library Network and another one for the catalog IDS Basel Bern. Target Profiles contain information about the server URL, port, encodings and result formats, and define the attributes supported in queries.
Several libraries (claim to) have an authentication mechanism in place. This does not necessarily mean costs; libraries may grant each other free access upon request. Besides this, servers may claim to require a user name and password but “magically” work without them – this seems to be the case for quite a few servers.
Set up a connection and perform queries
The following lines of code show how to establish a connection to a server and execute a query:
    // Server URL with port and database descriptor
    $server = "z3950.obvsg.at:9991/ACC01";
    $syntax = "mab";
    // for a USMARC example, uncomment the next $server and $syntax definitions
    // $server = "aleph.unibas.ch:9909/IDS_UTF";
    // $syntax = "usmarc";

    // mapping of CCL keywords to Bib-1 attributes (allows convenient query formulation)
    $fields = array("wti" => "1=4",
                    "ibn" => "1=7",
                    "isn" => "1=8",
                    "wja" => "1=31",
                    "wpe" => "1=1004",
                    "wve" => "1=1018");

    // establish a connection and store the session identifier;
    // credentials are an optional second parameter in the format "<user>/<passwd>"
    $session = yaz_connect($server);
    // check whether an error occurred
    if (yaz_error($session) != ""){
        die("Error: " . yaz_error($session));
    }
    // configure the desired result syntax (must be specified in the Target Profile)
    yaz_syntax($session, $syntax);
    // configure YAZ's CCL parser with the mapping from above
    yaz_ccl_conf($session, $fields);

    // define a query using CCL (allowed operators are 'and', 'or', 'not');
    // available fields are the ones in $fields (again, see the Target Profile)
    $ccl_query = "(wpe = Liggesmeyer) and (wpe = Peter)";
    // let YAZ parse the query and check for errors
    if (!yaz_ccl_parse($session, $ccl_query, $ccl_result)){
        die("The query could not be parsed.");
    } else {
        // fetch the RPN result from the parser
        $rpn = $ccl_result["rpn"];
        // run the actual query
        yaz_search($session, "rpn", $rpn);
        // yaz_wait blocks until the query is done
        yaz_wait();
        if (yaz_error($session) != ""){
            die("Error: " . yaz_error($session));
        }
        // yaz_hits returns the number of records found
        if (yaz_hits($session) > 0){
            echo "Found some hits:<br>";
            // yaz_record fetches a record from the current result set;
            // so far I've only seen servers supporting the string format
            $result = yaz_record($session, 1, "string");
            print($result);
            echo "<br><br>";
            // the parsing functions will be introduced later
            if ($syntax == "mab")
                $parsedResult = parse_mab_string($result);
            else
                $parsedResult = parse_usmarc_string($result);
            print_r($parsedResult);
        } else {
            echo "No records found.";
        }
    }
After the connection is established, a call to yaz_ccl_conf
is made, which configures the Common Command Language parser (for CCL, see here and here). CCL is a language supported by various OPACs for formulating more complex queries. Which attributes can be used is specified in the Target Profile. E.g. for the servers linked earlier, the Bib-1/AT-1 attribute “Title” is specified with use number 4 – the first entry in the $fields
variable in the snippet above declares the field identifier ‘wti’ to be mapped to this attribute in queries by the CCL parser.
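To make the mapping concrete, here is a minimal standalone illustration (no YAZ required): $fields is a plain associative array from CCL keywords to Bib-1 type/use attribute pairs, nothing more.

```php
<?php
// The mapping handed to the CCL parser is an ordinary associative array:
// "wti" ("word in title") points at Bib-1 use attribute 4 ("Title"),
// "wpe" at use attribute 1004 ("Author-name personal").
$fields = array("wti" => "1=4", "wpe" => "1=1004");

// yaz_ccl_conf() receives exactly this array; looking up a keyword
// yields the attribute specification used in the query sent to the server.
echo $fields["wti"]; // prints "1=4"
```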
Processing result records
At first glance the retrieved result looks like a mess – don’t worry, this impression remains. The good news is that this stuff is parsable once you know that in the MAB format fields are separated by the non-printable record separator character (0x1E).
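The splitting itself is plain string work. A minimal sketch with a synthetic record (the field ids and values here are made up; the separator handling matches the parser below):

```php
<?php
// synthetic MAB-like record: a 4-character field id, then the value,
// each field terminated by the record separator (0x1E)
$raw = "100b" . "Liggesmeyer, Peter" . chr(0x1E)
     . "331 " . "Software-Qualitaet" . chr(0x1E);

// splitting at the separator yields one entry per field;
// the last entry is empty because the record ends with a separator
$fields = explode(chr(0x1E), $raw);

for ($i = 0; $i <= count($fields) - 2; $i++) {
    // the first 4 characters are the field id, the rest is the value
    echo substr($fields[$i], 0, 4), " => ", substr($fields[$i], 4), "\n";
}
```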
I have written two functions: one for the MAB format, which parses all fields found in the result string (for their meaning refer to the spec), and another that parses results in MARC syntax but only extracts specific fields, due to the result’s irregular structure (field 008 is a good example here). Here they are:
    function parse_mab_string($record){
        $result = array();
        // exchange some characters that are returned malformed (collected over time)
        $record = exchange_chars($record);
        // split the returned fields at the record separator character
        $record = explode(chr(0x001E), $record);
        // cut off the record's leading meta data
        $record[0] = substr($record[0], strpos($record[0], "h001") + 1);
        // examine all fields and extract their contents; the last entry is always empty
        for ($datnum = 0; $datnum <= count($record) - 2; $datnum++){
            $data = $record[$datnum];
            // the first 4 chars are the field id
            $field = substr($data, 0, 4);
            if (!isset($result[$field]))
                $result[$field] = array();
            // the remaining substring is the field value
            array_push($result[$field], substr($record[$datnum], 4));
        }
        return $result;
    }
    // the following helper function restores a collection of malformed characters;
    // it is based on nearly 3 years of experience with several Z39.50 servers
    function exchange_chars($rep_string){
        $bad_chars = array(chr(0x00C9)."o", chr(0x00C9)."O", chr(0x00C9)."a",
                           chr(0x00C9)."A", chr(0x00C9)."u", chr(0x00C9)."U",
                           chr(137), chr(136), chr(251), chr(194)."a",
                           chr(194)."i", chr(194)."e", chr(208)."c", chr(194)."E",
                           chr(207)."c", chr(207)."s", chr(207)."S", chr(201)."i",
                           chr(200)."e", chr(193)."e", chr(193)."a", chr(193)."i",
                           chr(193)."o", chr(193)."u", chr(195)."u", chr(201)."e",
                           chr(195).chr(194), "&#263;", "ä");
        $rep_chars = array("ö", "Ö", "ä",
                           "Ä", "ü", "Ü",
                           "", "", "ß", "á",
                           "í", "é", "ç", "É",
                           "č", "š", "Š", "ï",
                           "ë", "è", "à", "ì",
                           "ò", "ù", "û", "ë",
                           "ä", "ć", "ä");
        return str_replace($bad_chars, $rep_chars, $rep_string);
    }
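The repair itself is a plain str_replace. For instance, the first pattern above turns a mangled umlaut back into ‘ö’ (synthetic input):

```php
<?php
// one pattern from the table above: chr(0xC9) followed by "o"
// stands for a broken "ö"
$bad  = array(chr(0xC9) . "o");
$good = array("ö");

echo str_replace($bad, $good, "K" . chr(0xC9) . "onigsberg"); // prints "Königsberg"
```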
MAB result records have a regular structure, which lowers the effort needed to parse them. The critical things to know are the separator character and how to restore malformed characters.
In the MARC format most fields have the same structure: they start with their id, followed by one or more subfield values. But there are exceptions like field 008 mentioned above, which contains the language indicator in characters 39 to 41, so it has to be handled separately:
    function parse_usmarc_string($record){
        $ret = array();
        // there was a case where angle brackets interfered
        $record = str_replace(array("<", ">"), array("", ""), $record);
        $record = utf8_decode($record);
        // split the returned fields at their separation character (newline)
        $record = explode("\n", $record);
        // examine each line for the wanted information (see the USMARC spec for details)
        foreach ($record as $category){
            // subfield indicators are preceded by a $ sign
            $parts = explode("$", $category);
            // remove leading and trailing spaces
            array_walk($parts, "custom_trim");
            // the first value holds the field id;
            // depending on the desired info a certain subfield value is retrieved
            switch (substr($parts[0], 0, 3)){
                case "008": $ret["language"] = substr($parts[0], 39, 3); break;
                case "020": $ret["isbn"] = get_subfield_value($parts, "a"); break;
                case "022": $ret["issn"] = get_subfield_value($parts, "a"); break;
                case "100": $ret["author"] = get_subfield_value($parts, "a"); break;
                case "245": $ret["titel"] = get_subfield_value($parts, "a");
                            $ret["subtitel"] = get_subfield_value($parts, "b"); break;
                case "250": $ret["edition"] = get_subfield_value($parts, "a"); break;
                case "260": $ret["pub_date"] = get_subfield_value($parts, "c");
                            $ret["pub_place"] = get_subfield_value($parts, "a");
                            $ret["publisher"] = get_subfield_value($parts, "b"); break;
                case "300": $ret["extent"] = get_subfield_value($parts, "a");
                            $ext_b = get_subfield_value($parts, "b");
                            $ret["extent"] .= ($ext_b != "") ? (" : " . $ext_b) : "";
                            break;
                case "490": $ret["series"] = get_subfield_value($parts, "a"); break;
                case "502": $ret["diss_note"] = get_subfield_value($parts, "a"); break;
                case "700": $ret["editor"] = get_subfield_value($parts, "a"); break;
            }
        }
        return $ret;
    }

    // fetches the value of a certain subfield given its label
    function get_subfield_value($parts, $subfield_label){
        $ret = "";
        foreach ($parts as $subfield)
            if (substr($subfield, 0, 1) == $subfield_label)
                $ret = substr($subfield, 2);
        return $ret;
    }

    // wrapper function for trim so it can be passed to array_walk
    function custom_trim(&$value, $key){
        $value = trim($value);
    }
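As a quick standalone check of the two parsing ideas (synthetic lines; offsets and labels as in the parser above): the language code sits at a fixed offset in the 008 line, while the other fields are found by their subfield label after splitting at ‘$’:

```php
<?php
// synthetic 008 line: 4 leading characters ("008 ") plus the field data,
// in which the language code occupies data positions 35-37,
// i.e. string positions 39-41
$line008 = "008 " . str_repeat(" ", 35) . "ger";
echo substr($line008, 39, 3), "\n"; // prints "ger"

// synthetic 245 line: splitting at '$' yields the field header plus one
// entry per subfield, each starting with its one-letter label
$parts = explode('$', '245 10$a Main title $b A subtitle');
foreach ($parts as $subfield) {
    if (substr($subfield, 0, 1) == "a") {
        // skip the label and the following blank, then trim
        echo trim(substr($subfield, 2)), "\n"; // prints "Main title"
    }
}
```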
Conclusion
Z39.50 and its implementations lack convenient handling of result records; they could deliver them in better-structured formats, e.g. XML with an appropriate DTD. Although Wikipedia says there is development in this direction, it does not seem to have been adopted so far. With e.g. the Google Books API you get results already structured as associative arrays, so you don’t need to parse them yourself. But if you have no choice and must work with Z39.50, then this might help with taking the first steps.
Hi Jonas,
thanks for the article! We are currently close to releasing ELTAB 2.0, and Z39.50 is one of the last things still missing. While googling for “z3950 yaz mab” I came across your blog post, which will hopefully help me now 🙂
Best regards,
Gero
Cool that ELTAB has survived and the prototype is becoming a real system now. 🙂
But then again, you have the complete source code this is done with. Although that probably doesn’t contain as many explanations… 😛
Exactly, the explanations are very helpful! Yes, it has survived, and we are constantly gaining new users 🙂