PaQu issues

or How to make PaQu-friendly queries

For a comprehensive introduction to parallel queries with PaQu, see PaQu for the user.

This page lists some known issues with PaQu. These are the situations which need more complex query reformulation, and thus users need to pay more attention to how they formulate their queries.

Some more issues, bug-related work arounds and examples can also be found in the CosmoSim Blog.

  • Duplicate data – use DISTINCT
    Currently, some small tables (Redshifts, LinkLength, TreeSnapnum, AvailHalos, AvailParticles) are copied to each node. When you query such a table, this has the effect that the same result is returned to you from each node, i.e. you get 10 times the result. This behaviour will be improved in the future, but for now you can avoid that using DISTINCT:

    SELECT DISTINCT * FROM MDR1.Redshifts
  • Aggregate functions and computations
    It may happen that PaQu does not understand an expression like MIN(x)+1 in your select statement. The MIN()-function and other aggregate functions work very well and are translated properly to the necessary statements at the slave nodes and the head node. However, there are still syntax proplems when you want to combine functions with arithmetics. You can still get what you want by wrapping your query with an outer SELECT-statement and then adding +1:

    SELECT min_value+1 as desired_value FROM 
    ( SELECT MIN(x) as min_value FROM ... ) 
    
  • Resolving “*”
    PaQu tries to resolve “*” correctly into all available columns and aliases then to unique names
    in order to avoid trouble with duplicate column names when joining tables. However, with complicated table joins this resolution could still fail, so be as explicit as possible and provide the names of the columns that you are really interested in instead of using “*”. This will avoid potential problems.

    Please note that PaQu does not support a simple “*” when joining tables (at the moment). You always need to precede the “*” with the corresponding table name or alias like this:

    SELECT b.*, r.* FROM MDR1.BDMV b, MDR1.Redshifts r
    WHERE b.snapnum = r.snapnum
    LIMIT 10
    
  • Table joins
    Take care with table joins! These are reformulated again, so check twice here, if the query execution plan has correctly translated your query.
    Do NOT use the JOIN keyword exlicitely! This is not supported yet. Use nested queries instead.
  • Nested query instead of many conditions in a row
    For queries with multiple conditions after WHERE, PaQu tries to guess which one should be executed first. It is astonishingly often right, but sometimes it fails in the sense that a not so wise choice results in a very long query execution time. If you have such more advanced queries, always check the PaQu query plan which is shown after submitting your query. Here you can see exactly which condition is applied first and then either correct it manually or rewrite the complete query and make appropriate subqueries, so that PaQu has no other choice but to execute everything in the right order.
    Here’s an example for a fast query (extracting all substructures for the given host halo):

    SELECT sub.* FROM MDR1.FOFSub host, MDR1.FOFSub sub
    WHERE host.fofSubId = 85000000000000000
    AND sub.fofSubId BETWEEN host.fofSubId AND host.lastSubId 
    

    which returns results in a few seconds. It becomes very slow (> 10 min) if you add another condition:

    SELECT sub.* FROM MDR1.FOFSub host, MDR1.FOFSub sub
    WHERE host.fofSubId = 85000000000000000
    AND sub.fofSubId BETWEEN host.fofSubId AND host.lastSubId 
    AND sub.np>50
    

    since the added condition is executed first, i.e. PaQu reformulates the query so that first all subhalos with np>50 are chosen and only then the remaining conditions are applied. Since there is no index on the np-column, this involves a full table scan, which is naturally very slow.
    You can help PaQu by reformulating the query yourself like this:

    SELECT res.* from ( 
      SELECT sub.* FROM MDR1.FOFSub host, MDR1.FOFSub sub
      WHERE host.fofSubId = 85000000000000000
      AND sub.fofSubId BETWEEN host.fofSubId AND host.lastSubId 
    ) AS res 
    WHERE res.np>50
    

    This is exactly what was intended, so if you tell PaQu this explicitly, it will be much faster.

Proudly powered by Daiquiri
©2016 The CosmoSim databaseImprint and Data Protection Statement