
Exporting a TDE View

The RowBatcher interface supports efficient export of all rows from a TDE view. In the typical case, the view contains the instances of an entity, and the exported entity instances provide input for downstream data consumers such as Business Intelligence tools or legacy databases.

The RowBatcher interface gets the rows in batches, formatting each batch either as a single CSV, JSON, or XML structure or as line-delimited JSON (with one JSON structure per row).

Prerequisites

On the database server, the prerequisite is to define a TDE that matches the documents and projects the rows that populate the view.

To export different sets of rows for different purposes, construct a TDE view for each export. A single TDE can define multiple views.

Insert the TDE into the schemas database either before inserting the documents or, if the documents already exist, follow the TDE insertion by reindexing the documents.
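For example, the following sketch inserts a minimal TDE template into the schemas database with the Java API. The host, port, credentials, schema name, view name, columns, and document URI are hypothetical; the collection URI is the one MarkLogic requires for TDE templates.

// connect to the schemas database associated with the content database
DatabaseClient schemasClient = DatabaseClientFactory.newClient(
    host, port, "Schemas",
    new DatabaseClientFactory.DigestAuthContext(user, password));

// a minimal TDE template that projects an employees view (hypothetical names)
String template =
    "{\"template\": {\n" +
    "  \"context\": \"/employee\",\n" +
    "  \"rows\": [{\n" +
    "    \"schemaName\": \"main\",\n" +
    "    \"viewName\":   \"employees\",\n" +
    "    \"columns\": [\n" +
    "      {\"name\": \"empId\", \"scalarType\": \"int\",    \"val\": \"empId\"},\n" +
    "      {\"name\": \"name\",  \"scalarType\": \"string\", \"val\": \"name\"}\n" +
    "    ]\n" +
    "  }]\n" +
    "}}";

// TDE templates must belong to the http://marklogic.com/xdmp/tde collection
DocumentMetadataHandle metadata =
    new DocumentMetadataHandle().withCollections("http://marklogic.com/xdmp/tde");
schemasClient.newJSONDocumentManager().write(
    "/tde/employees.json", metadata,
    new StringHandle(template).withFormat(Format.JSON));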

Steps

On the client, the application performs the following sequence of actions to export a view:

  1. Call the DatabaseClientFactory.newClient() factory to create a DatabaseClient.

  2. Call the DatabaseClient.newDataMovementManager() factory to create a DataMovementManager.

  3. Create a sample handle from one of the implementations of ContentHandle to specify how you want to handle the row content. For instance, to consume each batch of rows as a string, use a StringHandle; to consume each batch of rows as a Jackson JsonNode, use a JacksonHandle. See the section on Providing a Sample Handle For the Rows.

  4. Call the DataMovementManager.newRowBatcher() factory with the rows batch handle to create a RowBatcher, controlling concurrency and throughput with the withThreadCount() and withBatchSize() methods.

  5. Call RowBatcher.getRowManager() to get the RowManager, which you can use to specify the data type and row structure styles for the row batches. See the section on Setting the Data Type and Row Structure Styles.

  6. Call the RowManager.newPlanBuilder() factory to create a PlanBuilder and use the PlanBuilder to create a PlanBuilder.ModifyPlan to export the view. See the section on Building a Plan For Exporting the View.

  7. Call RowBatcher.withBatchView() to initialize the RowBatcher with the plan for exporting the view.

  8. Call RowBatcher.onSuccess() with a callback (typically, a lambda function) that receives each retrieved batch of rows and call RowBatcher.onFailure() with a callback that specifies how to respond to errors. See the section on Listening For Success and Failure.

  9. Call DataMovementManager.startJob() to start receiving rows and RowBatcher.awaitCompletion() to block the application thread until the last row has been processed by the success listener.

See an Example of code that executes these steps.

Providing a Sample Handle For the Rows

In the Java API, a handle acts as an adapter, providing a common interface for heterogeneous representations of content.

To indicate how to represent the exported rows, you pass a sample handle to the DataMovementManager.newRowBatcher() factory method. For example, to get each batch of exported rows as a string, construct the RowBatcher with a StringHandle.

The handle must implement both the ContentHandle and StructuredReadHandle interfaces.

Some handles have implicit formats. For instance, JacksonHandle has an implicit format of Format.JSON. If the handle doesn't have an implicit format, you must specify the format of the sample handle before constructing the RowBatcher.

You may also specify the mime type on the handle. For instance, to export the rows as CSV, set the format to Format.TEXT and the mime type to text/csv.
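For example, a sketch of both cases:

// each batch as a Jackson JsonNode: JacksonHandle has an implicit format of JSON
JacksonHandle jsonSample = new JacksonHandle();

// each batch as a CSV string: StringHandle has no implicit format,
// so set the format and the mime type explicitly
StringHandle csvSample =
    new StringHandle().withFormat(Format.TEXT)
                      .withMimetype("text/csv");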

Setting the Data Type and Row Structure Styles

The RowBatcher.getRowManager() method gets the RowManager for the exported rows.

You can use the RowManager setter methods to control the output:

  • setDatatypeStyle() controls whether the column datatypes appear once in the header or on every row (RowManager.RowSetPart.HEADER or RowManager.RowSetPart.ROWS).
  • setRowStructureStyle() controls whether each row is structured as an object or an array (RowManager.RowStructure.OBJECT or RowManager.RowStructure.ARRAY).
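For example, a sketch that emits the datatypes once in the header and structures each row as an array, assuming the rowBatcher constructed in the steps above:

RowManager rowMgr = rowBatcher.getRowManager();

// emit the column datatypes once in the header rather than on every row
rowMgr.setDatatypeStyle(RowManager.RowSetPart.HEADER);

// structure each row as an array instead of an object
rowMgr.setRowStructureStyle(RowManager.RowStructure.ARRAY);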

Building a Plan For Exporting the View

The RowManager.newPlanBuilder() factory creates a PlanBuilder to build a plan.

The plan for exporting the view should ensure that each row indexed by the view yields exactly one row in the result.

The plan characteristics:

  • must begin with a fromView() accessor for the rows to be exported.
  • may project columns from the exported rows with a select() operation.
  • may add columns to the exported rows through one-to-one joins with other rows or through document joins.
  • may add expression columns to the exported rows either before or after any joins using as() in a select() operation.

The plan limitations:

  • shouldn't sort the exported rows with an orderBy() operation. Because the assignment of rows to batches ignores any sort order, sorting the exported rows has no purpose.

  • shouldn't increase or decrease the number of exported rows as a side effect of a join or with an except(), groupBy(), intersect(), limit(), offset(), union(), where(), whereDistinct(), or other operation. Changing the number of exported rows could result in batches that have too many rows or too few rows (or even no rows).

  • cannot apply a map(), prepare(), or reduce() operation anywhere in the plan.

  • cannot use parameter placeholders anywhere in the plan.

Joined rows can originate in any accessor including other views (such as dimension tables), triples, or lexicons and can be modified by operations (including additional join operations) prior to the join with the exported rows. The only limitation is that the join with the exported rows shouldn't change the number of exported rows. Thus, where any of the joined rows has a many-to-one relation with an exported row, the plan should group the joined rows on the join key prior to the join operation.
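For example, a sketch of a one-to-one join, assuming the rowMgr from the previous section; the schema, view, and column names are hypothetical:

PlanBuilder p = rowMgr.newPlanBuilder();

// add department columns to each exported employee row with a one-to-one join
PlanBuilder.ModifyPlan exportPlan =
    p.fromView("main", "employees")
     .joinInner(
         p.fromView("main", "departments"),
         p.on(p.viewCol("employees",   "deptId"),
              p.viewCol("departments", "deptId")));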

The export can get data from documents instead of (or in addition to) getting data from indexes by joining documents in the plan. The joined documents may be the source documents for the rows or documents with document URIs provided by the rows. The view must have exactly one row per document (though the view may have as few as a single column).
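For example, a sketch that joins documents by URI, assuming the exported view projects a (hypothetical) uri column:

// join each row's document by the URI stored in the row's uri column
PlanBuilder.ModifyPlan docPlan =
    p.fromView("main", "employees")
     .joinDoc(p.col("doc"), p.col("uri"));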

Instead of filtering the exported rows in the plan, the TDE can materialize the filter by producing a secondary view that matches the appropriate subset of documents and has only a primary key column. The plan can then export the view that materializes the filter and join on the primary key column with the view having the other columns.

Pass the built plan to the RowBatcher.withBatchView() method to initialize the RowBatcher with the plan.

Recommended: While developing, test the export plan after each change to confirm that the plan produces rows with the desired shape before exporting the entire view.

Listening For Success and Failure

The RowBatcher.onSuccess() method specifies a function (typically, a Java lambda) that's called with each batch of successfully exported rows.

The RowBatcher passes a RowBatchSuccessListener.RowBatchResponseEvent in the call to the success listener.

The event object provides the RowBatchSuccessListener.RowBatchResponseEvent.getRowsDoc() getter method to get the batch of rows in the content representation supported by the sample handle provided when constructing RowBatcher. For example, for a RowBatcher constructed with a StringHandle, the method returns the row batch as a Java String.

That is, the generic type of the ContentHandle is also the generic type of the RowBatcher and the RowBatchSuccessListener.RowBatchResponseEvent, and thus the type of the return value of the RowBatchSuccessListener.RowBatchResponseEvent.getRowsDoc() method.
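For example, a sketch with a JacksonHandle, assuming the moveMgr from the steps above:

// a RowBatcher constructed with a JacksonHandle is parameterized as JsonNode
RowBatcher<JsonNode> jsonBatcher = moveMgr.newRowBatcher(new JacksonHandle());
jsonBatcher.onSuccess(event -> {
    JsonNode rowsDoc = event.getRowsDoc();  // returns a JsonNode rather than a String
    // client processing of the JSON row batch
});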

You should also use the RowBatcher.onFailure() method to specify the disposition of any failures during the job.

The RowBatcher passes a RowBatchFailureListener.RowBatchFailureEvent and the exception for the row batch in the call to the failure listener.

The failure listener can control the disposition of the failure by calling the RowBatchFailureListener.RowBatchFailureEvent setter methods: withDisposition() to retry the batch, skip the batch, or stop the job, and withMaxRetries() to limit the number of retries for the batch.

If the retry doesn't succeed, the RowBatcher calls the failure listener again, until the maximum number of retries is reached or the failure listener specifies a disposition other than retry.
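For example, a sketch of a failure listener that retries a failed batch a bounded number of times, assuming the failures are transient and worth retrying:

rowBatcher.onFailure((event, throwable) -> {
    // retry the failed batch, up to three times
    event.withMaxRetries(3);
    event.withDisposition(BatchFailureDisposition.RETRY);
    // for unrecoverable errors, stop the whole job instead:
    // event.withDisposition(BatchFailureDisposition.STOP);
});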

Ensuring Row Set Stability

If continuous updates during the export job can affect the documents that populate the exported view, rows could have modified values or could be deleted after the batch with the rows has been exported.

For many uses, minor inconsistency is not problematic. After all, continuous updates also mean that the exported rows won't reflect the current state of the view after the export job finishes.

Where the export should reflect a consistent snapshot of the view, however, the export job should use a point-in-time query. Before starting the job, call the RowBatcher.withConsistentSnapshot() setter method to configure the RowBatcher to retrieve every batch of rows with its state at the time that the export job started.
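For example:

// configure the snapshot before starting the job
rowBatcher.withConsistentSnapshot();
moveMgr.startJob(rowBatcher);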

Example

The following example shows a simple job that exports the rows from a view in CSV format.

// get the database client (often done once per application)
DatabaseClient db = DatabaseClientFactory.newClient(...);

// get the data movement manager (often done once per application)
DataMovementManager moveMgr = db.newDataMovementManager();

// construct a handle for how to represent the retrieved rows
StringHandle sampleHandle =
    new StringHandle().withFormat(Format.TEXT)
                      .withMimetype("text/csv");

// construct the multi-threaded exporter
RowBatcher<String> rowBatcher =
    moveMgr.newRowBatcher(sampleHandle)
           .withBatchSize(30)
           .withThreadCount(threads);

// configure the export for consistent data types
RowManager rowMgr = rowBatcher.getRowManager();
rowMgr.setDatatypeStyle(RowManager.RowSetPart.HEADER);

// build the plan for the export view
PlanBuilder planBuilder = rowMgr.newPlanBuilder();
PlanBuilder.ModifyPlan exportPlan =
    planBuilder.fromView(...)
               .select(/* project index columns and add expression columns */);

// specify processing for exported rows and for request failures
rowBatcher.withBatchView(exportPlan)
          .onSuccess(event -> {
              try {
                  BufferedReader reader =
                      new BufferedReader(new StringReader(event.getRowsDoc()));
                  reader.readLine();  // consume the CSV header line
                  reader.lines().forEach(line -> {/*
                      client processing of exported rows
                      */});
              } catch (Throwable e) {
                  // logging for errors during client processing of exported rows
              }})
          .onFailure((event, throwable) -> {
              event.withDisposition(BatchFailureDisposition.SKIP);
              // logging for errors during retrieval of row batches
          });

// start the job and then wait for the export to complete
moveMgr.startJob(rowBatcher);
rowBatcher.awaitCompletion();