Introduction

Osh (Object SHell) is a tool that integrates the processing of structured data, database access, and remote access to a cluster of nodes. These capabilities are made available through a command-line interface (CLI) and a Python application programming interface (API).

Execution Model

Osh processes streams of Python objects using simple commands. Complex data processing is achieved by command sequences in which the output from one command is passed to the input of the next. This is similar to composing Unix commands using pipes. However, Unix commands pass strings from one command to the next, and the commands (grep, awk, sed, etc.) are heavily string-oriented. Osh commands process Python objects, and it is objects that are sent from one command to the next. Objects may be primitive types such as strings and numbers; composite types such as tuples, lists and maps; objects representing files, dates and times; or even user-defined objects.

The first command in an osh command sequence writes a stream of objects but does not have an input stream. Each subsequent command reads a stream of objects and writes a stream of objects. For a given command, the relationship between inputs and outputs is not necessarily one-to-one. The f command is one-to-one: it reads one object from the input stream, applies a function to it, and writes one object to its output stream. The select command copies an object from the input stream to the output stream if and only if its predicate evaluates to true for that object, so it writes zero or one output per input. The expand command generates any number of output objects for a single input object.
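
The three input/output relationships described above can be modeled with plain Python generators. This is only an illustrative sketch and does not use osh itself: the names apply_f, select_if and expand_each are hypothetical stand-ins for f, select and expand, and the real commands may differ in their details (for example, in how expand chooses what to flatten).

```python
# Plain-Python model of osh-style stream commands (not the osh API).
# apply_f, select_if and expand_each are hypothetical names.

def apply_f(function, stream):
    """Write exactly one output per input, like f."""
    for obj in stream:
        yield function(obj)

def select_if(predicate, stream):
    """Write zero or one output per input, like select."""
    for obj in stream:
        if predicate(obj):
            yield obj

def expand_each(stream):
    """Write any number of outputs per input, like expand."""
    for obj in stream:
        for item in obj:
            yield item

source = range(5)                                  # first command: no input stream
squares = apply_f(lambda x: x ** 2, source)        # 0, 1, 4, 9, 16
evens = select_if(lambda x: x % 2 == 0, squares)   # 0, 4, 16
pairs = apply_f(lambda x: [x, -x], evens)          # [0, 0], [4, -4], [16, -16]
result = list(expand_each(pairs))
print(result)
```

Because each stage is a generator, objects flow through the sequence one at a time, much as they would through a pipe.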

Commands, in addition to operating on input and output streams, may have side-effects. For example, out writes to stdout or a file; sql may update a database; and commands with function arguments (e.g. f, select) can operate on variables in the osh command sequence's namespace.

Example

Suppose you have a cluster named fred, consisting of nodes fred1, fred2, fred3. Each node has a database tracking work requests with a table named request. You can find the number of open requests in each database as follows (using the CLI):

    jao@zack$ osh @fred [ sql "select count(*) from request where state = 'open'" ] ^ out
    ('fred1', 1)
    ('fred2', 0)
    ('fred3', 5)

Now suppose you want to find the total number of open requests across the cluster. You can pipe the (node, request count) tuples into an aggregation command:

    jao@zack$ osh @fred [ sql "select count(*) from request where state = 'open'" ] ^ agg 0 'total, node, count: total + count' $
    6

Note that this example combines remote execution on cluster nodes, database access (on each cluster node), and data processing (the aggregation step) in a single framework.

The same computation can be done using the API as follows:

    #!/usr/bin/python
    
    from osh.api import *
    
    osh(remote("fred", sql("select count(*) from request where state = 'open'")),
        agg(0, lambda total, node, count: total + count),
        out())  # print the result, as the trailing $ does in the CLI version
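
The aggregation step in the examples above behaves like a left fold: it starts from the initial value 0 and combines the running total with each (node, count) tuple in turn. The following plain-Python sketch models that behavior without using osh. The helper name aggregate is hypothetical, and the input tuples are copied from the per-node output shown earlier.

```python
from functools import reduce

# Per-node results, copied from the example output above.
node_counts = [("fred1", 1), ("fred2", 0), ("fred3", 5)]

def aggregate(initial, function, stream):
    """Fold a stream of tuples into a single value.

    Each tuple is unpacked into the arguments that follow the running total.
    """
    return reduce(lambda total, row: function(total, *row), stream, initial)

total_open = aggregate(0, lambda total, node, count: total + count, node_counts)
print(total_open)  # 6
```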

Using Python Functions in Osh

A number of osh commands have function arguments. These functions are applied to objects from the input stream and determine the behavior of the command. Example (using the CLI):

    jao@zack$ osh gen 10 ^ f 'x: x**2' $
    (0,)
    (1,)
    (4,)
    (9,)
    (16,)
    (25,)
    (36,)
    (49,)
    (64,)
    (81,)

gen 10 generates the first ten non-negative integers, 0, 1, ..., 9. These integers are passed to the next command, f. The argument to f is a function specification, x: x**2. This is a lambda expression (the CLI permits the keyword lambda to be omitted). When the f command receives an input, it computes the square and writes the result to the output stream. The squared numbers are then passed to the out command, written here as the trailing $, which writes its inputs to stdout.

The streams connecting commands always contain tuples. If a command writes a single object to a stream (e.g. gen, which generates integers), the osh runtime wraps this object in a 1-tuple. This is why the output from the above command contains 1-tuples, not integers.
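
The wrapping rule can be illustrated with a small plain-Python model. This is a sketch, not the osh runtime: the names wrap and apply_to_stream are hypothetical, and the model assumes that an object that is already a tuple is passed through unchanged.

```python
# Plain-Python model of the tuple-wrapping rule (not the osh API).
# wrap and apply_to_stream are hypothetical names.

def wrap(obj):
    """Wrap a non-tuple object in a 1-tuple; leave tuples unchanged (assumption)."""
    return obj if isinstance(obj, tuple) else (obj,)

def apply_to_stream(function, stream):
    """Unpack each input tuple into the function's arguments, then wrap the result."""
    for row in stream:
        yield wrap(function(*row))

generated = (wrap(i) for i in range(10))   # models gen 10: (0,), (1,), ..., (9,)
squared = list(apply_to_stream(lambda x: x ** 2, generated))
for row in squared:
    print(row)
```

Each printed row is a 1-tuple such as (4,), matching the form of the CLI output above.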

Arbitrary-length argument lists work as usual. For example, suppose you have a file containing CSV (comma-separated values) data, in which each row contains 20 items. If you want to add the integers in columns 7 and 18 (0-based), you could invoke f with a function that takes 20 arguments and adds the arguments at positions 7 and 18. Or you could use an argument list:

    osh cat data.csv ^ f 's: s.split(",")' ^ f '*row: int(row[7]) + int(row[18])' $

cat data.csv writes the lines of data.csv to the output stream. Each such line contains values separated by commas; f 's: s.split(",")' splits each such line into a tuple of values. The next command, f '*row: int(row[7]) + int(row[18])', assigns the entire tuple to row instead of assigning each tuple value to one function argument.
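
The effect of the *row argument list can be seen in ordinary Python, independent of osh. In the sketch below, the two sample lines are made up for illustration (the contents of data.csv are not given here), and add_columns is a hypothetical name for the function passed to f.

```python
# Plain-Python model of the CSV pipeline (not the osh API).
# The sample rows are made up for illustration.

lines = [
    ",".join(str(n) for n in range(20)),        # "0,1,...,19"
    ",".join(str(n * 10) for n in range(20)),   # "0,10,...,190"
]

def add_columns(*row):
    """Collect all 20 values in row, then add columns 7 and 18 (0-based)."""
    return int(row[7]) + int(row[18])

# Splitting a line yields 20 values; * unpacks them into separate arguments,
# and *row in the function signature gathers them back into a single tuple.
sums = [add_columns(*line.split(",")) for line in lines]
print(sums)  # [25, 250]
```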

Osh Interfaces

Osh has two interfaces:

Command-line interface (CLI): The osh executable interprets command-line arguments as osh syntax. Any shell should be usable; however, some osh CLI syntax may require escaping in some shells. (The osh CLI has been tested most extensively using the bash shell.)

Python application programming interface (API): The osh CLI invokes the osh runtime, which invokes Python modules corresponding to each command. The runtime and command modules can also be invoked from a Python API.