
How to fix encoding issues

Most product feeds are UTF-8 encoded.

Some CSV feeds might be Windows-1252 or ISO-8859-1 encoded.

Sometimes a CSV or XML file contains text that is double encoded.

In all of these situations the text might appear garbled on your site. Usually only special characters are affected, such as the euro sign (€) and accented characters.

The character encoding can be corrected using a callback function.

 

function iso2utf8_cb(&$item){
   foreach ($item as $k => $v ) {
      if ( ! is_array($v) ) {
        # utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.
        # Use mb_convert_encoding, iconv, intl or a polyfill instead.
        #$item[$k]=utf8_encode($v);

        $item[$k]=mb_convert_encoding($v, 'UTF-8', 'ISO-8859-1');
      }
    }
    generic_cb($item);
}
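
The same pattern can be adapted to the other two cases mentioned above (Windows-1252 feeds and double encoded text). The code below is only a sketch: the helper names cp1252_to_utf8 and fix_double_utf8 and the callback double_utf8_cb are hypothetical, and only generic_cb comes from the component. It assumes the mbstring extension is available and that the double encoding went through Windows-1252.

```php
<?php
// Hypothetical helper: convert a Windows-1252 string to UTF-8.
function cp1252_to_utf8(string $text): string {
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: undo one round of double encoding
// (UTF-8 bytes that were read as Windows-1252 and encoded to UTF-8 again).
// Text that does not look double encoded is returned unchanged.
function fix_double_utf8(string $text): string {
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text; // not UTF-8 at all, leave it alone
    }
    $decoded = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');
    // Accept the result only if it is valid UTF-8 and the conversion was lossless.
    $lossless = mb_convert_encoding($decoded, 'UTF-8', 'Windows-1252') === $text;
    if ($lossless && mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded;
    }
    return $text;
}

// Callback in the same style as iso2utf8_cb; generic_cb is provided by the component.
function double_utf8_cb(&$item) {
    foreach ($item as $k => $v) {
        if (is_string($v)) {
            $item[$k] = fix_double_utf8($v);
        }
    }
    generic_cb($item);
}
```

For a Windows-1252 feed you would call cp1252_to_utf8 inside the loop instead of fix_double_utf8. Test on a copy of the feed first, because no check can detect double encoding with complete certainty.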

 

Creating a custom parser

The component supports the major formats of CSV files. Using the mapping in the feed configuration, any CSV file (with a header row) should work. A problem might be that the 'description' field has a different name and there is no mapping for the description. This can easily be solved using a callback function.
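
A callback for such a renamed field could look like the sketch below. The source column name 'long_text', the helper copy_field and the callback name description_cb are made up for illustration, and it is an assumption that the component reads the text from $item['description']; check the field names in your own feed.

```php
<?php
// Hypothetical helper: copy $item[$from] to $item[$to] when $to is still empty.
function copy_field(array $item, string $from, string $to): array {
    if (empty($item[$to]) && isset($item[$from])) {
        $item[$to] = $item[$from];
    }
    return $item;
}

// Callback: 'long_text' is an assumed column name from the CSV header row.
// generic_cb is provided by the component.
function description_cb(&$item) {
    $item = copy_field($item, 'long_text', 'description');
    generic_cb($item);
}
```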

 

If the data is only available in XML format, the existing parsers will probably not work out of the box. Still, it is worth a try: many feeds use 'item' or 'product' as the tag, so one of the existing parsers may just work.

 

Otherwise the solution is to add your own parser. The component allows you to add two custom parsers:

 

Create a file xml_custom.inc in administrator/components/com_datafeeds/cron

In this file you can put your own parser. The most basic version simply duplicates an existing parser:

class MyXMLParser extends TradetrackerRSS {
}

$feed_parsers['MyXMLParser']='My Own Parser';

The next step is to copy the relevant functions from the TradetrackerRSS class (start element and end element) and alter the code.

On the page about 'adding share a sale' feeds you can find an example of a custom parser.

Example basic parser

The parser below can be used for simple XML structures without nested trees or attributes.

<?php
class CarRSS extends BasicRSS {
        var $item_tag = 'car';
}

$feed_parsers['CarRSS']='Car Parser';

Copy the code into the file administrator/components/com_datafeeds/cron/xml_custom.inc and modify it as desired. The item_tag is the main product container in the XML file. As you can see in the example data below, each item is wrapped in <car>...</car>.

<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car>
<regno>XXX111</regno>
<brand>Audi</brand>
<model>A4</model>
<modeldescription>1.8 T Avant, Proline</modeldescription>
<yearmodel>2004</yearmodel>
</car>
<car>
<regno>YYY222</regno>
<brand>Volvo</brand>
<model>XC4</model>
<modeldescription></modeldescription>
<yearmodel>2012</yearmodel>
</car>
</cars>
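
To get a feeling for what "one item per item_tag" means, the sketch below splits such a flat XML document into one array per <car> element. It uses PHP's SimpleXML extension and is not the component's BasicRSS code; the function name items_by_tag and the shortened sample data are made up for this illustration.

```php
<?php
// Illustration only: return one flat array per <$item_tag> element
// found directly under the root of the document.
function items_by_tag(string $xml, string $item_tag): array {
    $items = [];
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return $items; // not well-formed XML
    }
    foreach ($doc->{$item_tag} as $node) {
        $item = [];
        foreach ($node->children() as $child) {
            // The tag name becomes the field name, the text becomes the value.
            $item[$child->getName()] = trim((string) $child);
        }
        $items[] = $item;
    }
    return $items;
}

$xml = '<cars>'
     . '<car><regno>XXX111</regno><brand>Audi</brand><model>A4</model></car>'
     . '<car><regno>YYY222</regno><brand>Volvo</brand><model>XC4</model></car>'
     . '</cars>';

$items = items_by_tag($xml, 'car');
echo count($items), "\n";        // 2
echo $items[1]['brand'], "\n";   // Volvo
```

Each resulting array has the tag names (regno, brand, model) as keys, which is roughly the shape a callback function receives as $item.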

Automatic import using cron jobs

To update the feeds automatically you will need to call the feedcron periodically. On Unix-style systems this is usually done using a cron job. Most hosting providers allow you to set up cron jobs.

There are two options: calling the web page using wget, curl or lynx, or using the CLI version of PHP. The choice depends on what is available on your server. Using the CLI version is highly recommended, since this avoids max_execution_time problems.

A typical web page request looks like:

0 4 * * * wget -q -O /dev/null https://www.example.com/administrator/components/com_datafeeds/cron/feedcron.php

or

0 4 * * * curl --silent -o /dev/null https://www.example.com/administrator/components/com_datafeeds/cron/feedcron.php

A typical command line call:

0 4 * * * (cd  JOOMLA-ROOT/administrator/components/com_datafeeds/cron;php feedcron.php)


JOOMLA-ROOT is the real location of your Joomla installation on the hard disk. Finding the right path might be a bit of a hassle and depends on your server. Have a look in Site -> Global Configuration of your Joomla installation. Under 'system' or 'server' you will find the paths to JOOMLA-ROOT/tmp and JOOMLA-ROOT/logs.

Update frequency

Each feed has its own update frequency (set in the feed configuration). This is the minimal time between two updates. So if the update frequency is set to 360000 seconds (100 hours) and your cron calls the feedcron every day, the update will still only occur every 5 days: after 4 days only 96 hours have passed, so the first run that qualifies is the one on day 5. If you set your update frequency to 60 seconds and your cron runs once a month, the update will happen once a month.
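
The interplay between the two settings can be written down as a small calculation. This is a sketch under the assumption that a feed is updated on the first cron run at which at least the update frequency has elapsed since the previous update; the function name effective_update_interval is made up.

```php
<?php
// Both arguments are in seconds. Returns the real time between two updates:
// the smallest whole number of cron runs that covers the update frequency.
function effective_update_interval(int $update_frequency, int $cron_interval): int {
    $runs = max(1, (int) ceil($update_frequency / $cron_interval));
    return $runs * $cron_interval;
}

$day   = 86400;
$month = 30 * $day;

echo effective_update_interval(60, $month), "\n";      // 2592000 (once a month)
echo effective_update_interval(2 * $day, $day), "\n";  // 172800 (every second run)
```

In other words, the cron interval sets the finest possible granularity and the update frequency can only stretch it, never shorten it.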

 


Importing unique items/avoiding duplicates

Depending on the kind of feed you have there might be a lot of duplicates in your feed, for example the same travel trip with different departure dates, or widgets being identical except for the color.

Depending on your needs you may want to skip these near-duplicate items or import them anyway.

The importer detects duplicates using a hash based on the feed name, the title of each item, and the values in the select fields.

Take a simple example feed called 'Widgets feed':

#  name         type  category
1  soft widget  soft  blue
2  soft widget  soft  red
3  big widget   soft  blue
4  big widget   hard  red

Assume the name field from the feed is assigned to the title in the feed configuration and the Select1..9 fields are left blank.

 

When we import, there are now 2 unique items, since the hash that determines unique items is based on feedname+title only. Thus items 1 and 2 are identical, and so are items 3 and 4.

 

Assume the name field is assigned to the title and the type field is assigned to the Select 1.

 

Now there will be three unique elements. Number 1 and 2 are still identical, however 3 and 4 will differ since 'soft' and 'hard' are different.

Assume in addition we assign the category to Select 2. Now we will import four items into the database, since the combination of feedname+title+Select1..9 is different for each item.
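
The three scenarios above can be replayed with a few lines of PHP. This is only an illustration of the idea: item_hash is a stand-in that concatenates feed name, title and select values, and the importer's real hash may be built differently. The function names are made up.

```php
<?php
// Stand-in for the importer's hash: feed name + title + select values.
function item_hash(string $feed, string $title, array $selects = []): string {
    return md5($feed . $title . implode('', $selects));
}

// Count how many distinct hashes the rows produce when the given
// fields are assigned to the select fields.
function count_unique(array $rows, string $feed, array $select_fields): int {
    $seen = [];
    foreach ($rows as $row) {
        $selects = [];
        foreach ($select_fields as $field) {
            $selects[] = $row[$field];
        }
        $seen[item_hash($feed, $row['name'], $selects)] = true;
    }
    return count($seen);
}

$rows = [
    ['name' => 'soft widget', 'type' => 'soft', 'category' => 'blue'],
    ['name' => 'soft widget', 'type' => 'soft', 'category' => 'red'],
    ['name' => 'big widget',  'type' => 'soft', 'category' => 'blue'],
    ['name' => 'big widget',  'type' => 'hard', 'category' => 'red'],
];

echo count_unique($rows, 'Widgets feed', []), "\n";                    // 2
echo count_unique($rows, 'Widgets feed', ['type']), "\n";              // 3
echo count_unique($rows, 'Widgets feed', ['type', 'category']), "\n";  // 4
```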

You can change this behaviour using a callback function. Assume you don't want to assign the 'category' field to a Select value but still have all items in the database.

Create a callback function and use code like below to create your own hash:

function u_cb(&$item) {
      $item['md5']=md5($item['feed'].$item['name'].
      $item['type'].$item['category']);
      generic_cb($item);
}

Often different items have different deeplinks so you could use:

$item['md5']=md5($item['href']);

Unique titles

function u_cb(&$item) {
  generic_cb($item);
  $item['md5']=md5($item['feed'].$item['title']);
}