uniVocity-parsers 1.4.0 released with even more features!

After a few months without updates, we released another minor version of our text parsing suite, uniVocity-parsers, to introduce some useful features and minor bug fixes.

What's new

Automatic line ending detection

As of version 1.4.0, you can easily process input coming from anywhere without having to worry about which line separator sequence it uses. Some users were having trouble processing files created on different operating systems. For example, one of your clients may use Linux to create CSV files such as this:


Year,Make,Model,Description,Price\n
1997,Ford,E350,"ac, abs, moon",3000.00\n

 

Now suppose another client, using MacOS, created this:


Year,Make,Model,Description,Price\r
1997,Ford,E350,"ac, abs, moon",3000.00\r

 

Previously, you'd have to identify which line ending was in use before you could process your input correctly. Now, all you have to do is tell the parser to do this work for you:

    // The settings object provides many configuration options
    CsvParserSettings parserSettings = new CsvParserSettings();

    //You can configure the parser to automatically detect what line separator sequence is in the input
    parserSettings.setLineSeparatorDetectionEnabled(true);

    // creates a CSV parser
    CsvParser parser = new CsvParser(parserSettings);

    // parses all rows in one go.
    List<String[]> allRows = parser.parseAll(getReader("/examples/example.csv"));
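
The getReader call used throughout these examples is not part of the library; it's just a convenience method. Here's a minimal sketch, assuming the example files are classpath resources encoded in UTF-8 (the enclosing Example class is hypothetical):

    public static Reader getReader(String path) throws UnsupportedEncodingException {
        // assumption: the example file is bundled as a classpath resource, encoded in UTF-8
        return new InputStreamReader(Example.class.getResourceAsStream(path), "UTF-8");
    }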

 

Concurrent row processing

Now you can wrap any RowProcessor in a ConcurrentRowProcessor to execute your row processing in a separate thread while the parser reads the input. Let's start with the well-worn example that uses annotations and a BeanListProcessor to convert rows into a list of Java beans. Here's our CSV input:


date,			amount,		quantity,	pending	,comments
10-oct-2001,	555.999,	1,			yEs		,?
2001-10-10,		,			?,			N		,"  "" something ""  "

 

Here's our TestBean:

class TestBean {

    // if the value parsed in the quantity column is "?" or "-", it will be replaced by null.
    @NullString(nulls = { "?", "-" })
    // if a value resolves to null, it will be converted to the String "0".
    @Parsed(defaultNullRead = "0")
    private Integer quantity;   // The attribute type defines which conversion will be executed when processing the value.
    // In this case, IntegerConversion will be used.
    // The attribute name will be matched against the column header in the file automatically.

    @Trim
    @LowerCase
    // the value for the comments attribute is in the column at index 4 (0 is the first column, so this means the fifth column in the file)
    @Parsed(index = 4)
    private String comments;

    // you can also explicitly give the name of a column in the file.
    @Parsed(field = "amount")
    private BigDecimal amount;

    @Trim
    @LowerCase
    // values "no", "n" and "null" will be converted to false; values "yes" and "y" will be converted to true
    @BooleanString(falseStrings = { "no", "n", "null" }, trueStrings = { "yes", "y" })
    @Parsed
    private Boolean pending;
    
    ...

 

We'd usually write the following code to read the input file and create instances of TestBean:

    // BeanListProcessor converts each parsed row to an instance of a given class, then stores each instance into a list.
    BeanListProcessor<TestBean> rowProcessor = new BeanListProcessor<TestBean>(TestBean.class);

    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setRowProcessor(rowProcessor);
    parserSettings.setHeaderExtractionEnabled(true);

    CsvParser parser = new CsvParser(parserSettings);
    parser.parse(getReader("/examples/bean_test.csv"));

    // The BeanListProcessor provides a list of objects extracted from the input.
    List<TestBean> beans = rowProcessor.getBeans();

 

Now, to execute the annotation processing and the creation of TestBean instances in a separate thread, simply change this line:

    parserSettings.setRowProcessor(rowProcessor);

 

To:

    parserSettings.setRowProcessor(new ConcurrentRowProcessor(rowProcessor));

 

And that's all! Here is the output produced by the toString() method of each TestBean instance:


[
    TestBean [quantity=1, comments=?, amount=555.999, pending=true], 
    TestBean [quantity=0, comments=" something ", amount=null, pending=false]
]

 

A word of advice: just because you can process your input in a separate thread, it doesn't mean you should. uniVocity-parsers is highly optimized, and in many cases processing your data sequentially will still be faster than in parallel. We recommend profiling your particular scenario before deciding whether to use this feature.
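
If in doubt, even a crude wall-clock comparison of both configurations is better than guessing. A quick sketch, run once with and once without the ConcurrentRowProcessor wrapper:

    long start = System.currentTimeMillis();
    parser.parse(getReader("/examples/bean_test.csv"));
    System.out.println("Parsed in " + (System.currentTimeMillis() - start) + " ms");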

More details in our updated tutorial.

Well, that's about it for this new release. We hope you enjoy it!


Download version 1.4.0 here.

Check this and other projects on our github page.

March 10, 2015 by Jeronimo Backes

uniVocity 1.0.7 released with lots of bug fixes and improvements

We just released the 1.0.7 maintenance release of the uniVocity framework, with various bug fixes and some extra functionality: exporting data became very easy with dynamic entities and mapping auto-detection. What the hell is that? Let me demonstrate with an example.

Dumping data from a database

Ever spent time and resources isolating test data from a database to implement a particular test case? Try this:

    JdbcDataStoreConfiguration myDatabase = new JdbcDataStoreConfiguration("MyDatabase", my_javax_sql_DataSource);
    myDatabase.setSchema("dbo");

    CsvDataStoreConfiguration myCsv = new CsvDataStoreConfiguration("MyCsvDirectory");
    CsvEntityConfiguration myCsvDefaults = myCsv.getDefaultEntityConfiguration();

    myCsvDefaults.getFormat().setLineSeparator("\r\n");
    myCsvDefaults.setHeaderWritingEnabled(true);
    myCsvDefaults.setEmptyValue("\"\""); //writes empty strings found in the database as "" in CSV

    //A CSV file will be dynamically generated for each database table in this directory: 
    myCsv.setOutputDirectory(System.getProperty("user.home") + "/csv_export", "UTF-8");

    //Let's create a data integration engine with these configurations
    EngineConfiguration engineConfiguration = new EngineConfiguration("MyDataIntegrationEngine", myDatabase, myCsv);
    Univocity.registerEngine(engineConfiguration);

    DataIntegrationEngine engine = Univocity.getEngine("MyDataIntegrationEngine");

    //Now we map the database to a CSV directory.
    DataStoreMapping mapping = engine.map("MyDatabase", "MyCsvDirectory");

    //Let's delete any file in the directory and insert data to the files
    mapping.configurePersistenceDefaults().notUsingMetadata().deleteAll().insertNewRows();

    //This will create a mapping from each database table to a CSV file with the same name (and column names)
    mapping.autodetectMappings(true);
    
    //And this executes a mapping cycle... all data from the database will be mapped to CSV files as configured above.
    engine.executeCycle(/* You can specify which tables you are interested in mapping. No arguments means "map everything". */); 
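
As the comment above says, executeCycle accepts the names of the entities you want to map. For example (with hypothetical table names):

    engine.executeCycle("some_table", "images");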

Let's choose the data we are interested in

One can write queries to select the rows of interest, but that takes time. Suppose you need to extract all data associated with a couple of records in the database. The table you are after is some_table, and you need the records whose values in the id column are 10 and 11. Many other tables in the database have foreign references to some_table, and fortunately (for our example) the referencing column always has the same standard name: some_table_id.

Let's just filter the rows in code and be done with the task. We can add the following lines to the code presented above (before the call to engine.executeCycle()):

    mapping.addInputRowReader(new RowReader() {

            //index of the column we are checking
            private int INDEX;

            @Override
            public void initialize(RowMappingContext context) {
                if (context.getSourceEntity().equalsIgnoreCase("some_table")) {
                    INDEX = context.getInputIndex("id");
                } else /* any other table*/ {
                    INDEX = context.getInputIndex("some_table_id");
                }
            }

            @Override
            public void processRow(Object[] inputRow, Object[] outputRow, RowMappingContext context) {
                /* if the input table is [some_table] or if it has the [some_table_id] column */
                if (INDEX != -1) {
                    Integer id = (Integer) inputRow[INDEX];
                    /* Not 10 nor 11? Burn! */
                    if (!(id == 10 || id == 11)) {
                        context.discardRow();
                    }
                }

                /* Else rows from other tables are fully read and migrated */
            }
        }/* , we could apply this to a list of tables, but let's execute the RowReader against all tables */);

 

Now, call engine.executeCycle() to produce CSV files containing only the data we want.

If you have blobs and binary data, things can explode.

Well, we can simply convert these values to null (or do something else):

        engine.addFunction(EngineScope.STATELESS, "binaryToNull", new FunctionCall<Object, byte[]>() {
            @Override
            public Object execute(byte[] input) {
                return null;
            }
        });


        //and then
        mapping.getMapping("images", "images").transformFields("binaryToNull", "before_image", "after_image", "another_image");
        mapping.getMapping("maps", "maps").transformFields("binaryToNull", "map_file");

But we need to load this data into a database (possibly an in-memory database)

This is now extremely easy: we need to convert the original database schema into the schema of the database used for testing:

    String pathToSchemaOutput = System.getProperty("user.home") + "/csv_export/schema.sql";

    engine.exportEntities("MyDatabase")
        .asCreateTableScript(DatabaseDialect.HSQLDB)
        .noGeneratedIds() // the test schema should not have generated IDs, as we want to load existing data (and IDs).
        .toFile(pathToSchemaOutput, "UTF-8");

Cool, I have a database schema, and data in a bunch of files. How to load everything?

With a few lines of code! Most of this could be organized into nicely reusable methods, but I chose to put everything in a single place to make it easier for you to follow the code:

    // we are using Spring's JDBC template here:
    DataSource dataSource = null;
    try {
        Class.forName("org.hsqldb.jdbcDriver");
        dataSource = new SingleConnectionDataSource("jdbc:hsqldb:mem:sampledb", "sa", "", true);
        this.jdbcTemplate = new JdbcTemplate(dataSource);

        File schema = new File(System.getProperty("user.home") + "/csv_export/schema.sql");
        String script = FileUtils.readFileToString(schema, "UTF-8");
        this.jdbcTemplate.execute(script);
    } catch (Exception ex) {
        throw new IllegalStateException("Error loading scripts for sample database ", ex);
    }

    JdbcDataStoreConfiguration myTestDatabase = new JdbcDataStoreConfiguration("MyTestDatabase", dataSource);
    myTestDatabase.setSchema("public");

    //You don't usually need to convert Strings to the correct column type as expected by the database. Maybe you do, so here it goes:
    myTestDatabase.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

    CsvDataStoreConfiguration myCsvDir = new CsvDataStoreConfiguration("MyCsvDirectory");

    //Let's read "" as empty Strings instead of NULL. 
    //You can have columns with NOT NULL constraints
    myCsvDir.getDefaultEntityConfiguration().setEmptyValue("");
    myCsvDir.getDefaultEntityConfiguration().getFormat().setLineSeparator("\r\n");

    //Use all files in our CSV directory
    myCsvDir.addEntities(System.getProperty("user.home") + "/csv_export", "UTF-8");

    //The rest is business as usual: 
    Univocity.registerEngine(new EngineConfiguration("engine", myCsvDir, myTestDatabase));
    try {
        DataIntegrationEngine engine = Univocity.getEngine("engine");

        DataStoreMapping mapping = engine.map("MyCsvDirectory", "MyTestDatabase");
        mapping.configurePersistenceDefaults().notUsingMetadata().deleteAll().insertNewRows();
        mapping.autodetectMappings();

        engine.executeCycle();
    } finally {
        Univocity.shutdown("engine");
    }

You're ready to use your test database! We hope you enjoyed this little tutorial.

For more details on these and other features, check the updated uniVocity tutorial, API javadocs and github projects.

Have fun with this latest release! In the meantime, we are working hard to bring uniVocity 1.1.0 to life with great new features and out-of-the-box support for:

  • Excel files, even those with complex layouts that once made you spend weeks fighting apache-poi to read and write.
  • Database dump file parsing and data extraction. You'll be able to read generic, MySQL and PostgreSQL dump files into any other database (or CSVs, TSVs or whatever else you want).

Don't waste more time writing code to map data from A to B: get our latest release and start coding fast and powerful data integration solutions with ease!

January 07, 2015 by Jeronimo Backes

uniVocity-parsers 1.3.1 released with a fix for a critical error

We've just identified and fixed a bug in uniVocity-parsers' CsvWriter class which causes corrupt CSV output when:

  • trailing whitespace is set to be ignored, and
  • a value contains trailing whitespace, and
  • that value should be enclosed within quotes.

In the (likely) scenario where all of the above conditions are met, the closing quote won't be written, which means the produced CSV will be unreadable!
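
For reference, here's a hedged sketch of a writer configuration that combines all three conditions (the output path is just an example):

    CsvWriterSettings settings = new CsvWriterSettings();
    settings.setIgnoreTrailingWhitespaces(true); // condition 1
    CsvWriter writer = new CsvWriter(new FileWriter("/tmp/out.csv"), settings);
    // "needs, quoting   " ends with whitespace (condition 2) and contains the
    // delimiter, so it must be enclosed within quotes (condition 3)
    writer.writeRow("plain value", "needs, quoting   ");
    writer.close();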

This has been fixed in version 1.3.1 and we recommend that everyone upgrade to this new version.

Download the latest uniVocity-parsers version here.

January 04, 2015 by Jeronimo Backes

uniVocity-parsers 1.3.0 is here with some useful new features

We just released another minor version of our text parsing suite, uniVocity-parsers, to introduce some useful features and minor bug fixes.

Our CSV parser has been updated

  • Our CSV parser can handle unescaped quotes inside quoted elements. In other words: some people just want to watch the world burn. Consider this input:

    something,"text with ""escaped quotes"" here",something else
    something,"text with "unescaped quotes" here",something else

 

The first line contains what any decent CSV parser expects to find when you want to use quotes inside a value: an escape sequence. In this example, "" represents a single quote character. This line is parsed as:

  1. something
  2. text with "escaped quotes" here
  3. something else

Now, the second line is obviously non-standard, and we have yet to find another parser that can handle it without going belly up. It is controversial, but in the end the client is the boss, so we adjusted our CSV parser to handle this case. Probably not everyone will agree with us on this one, but looking at the content, the intention is obvious: the user wants unescaped quote characters inside a quoted value. We got this requirement from a couple of clients whose own clients were providing manually produced CSV files. So here we are: as of version 1.3.0, uniVocity-parsers handles this case by default instead of throwing an exception at you, and our CSV parser reads the second line as:

  1. something
  2. text with "unescaped quotes" here
  3. something else

You can disable this capability (and get exceptions when parsing such an input) by turning off the parseUnescapedQuotes property in CsvParserSettings:

    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setParseUnescapedQuotes(false);
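
With the feature disabled, the non-standard second line results in an error. A sketch of what handling it could look like (assuming TextParsingException, the exception type our parsers use to report input problems):

    try {
        new CsvParser(parserSettings).parseAll(getReader("/examples/example.csv"));
    } catch (TextParsingException ex) {
        // the detailed message points at the offending line and column
        System.out.println(ex.getMessage());
    }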

Column parsing

We introduced a few RowProcessor implementations capable of collecting the values of each column parsed from the input.

To avoid memory problems when processing large inputs, we also introduced batched column processors: these return the values collected for each column after every batch of a given number of rows.

Here's an example:

    CsvParserSettings parserSettings = new CsvParserSettings();
    // To get the values of all columns, use a column processor
    ColumnProcessor rowProcessor = new ColumnProcessor();
    parserSettings.setRowProcessor(rowProcessor);

    CsvParser parser = new CsvParser(parserSettings);

    //This will kick in our column processor
    parser.parse(getReader("/examples/example.csv"));

    //Finally, we can get the column values:
    Map<String, List<String>> columnValues = rowProcessor.getColumnValuesAsMapOfNames();
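
And a quick sketch of batched collection, assuming the BatchedColumnProcessor introduced in this release (here with 100 rows per batch):

    parserSettings.setRowProcessor(new BatchedColumnProcessor(100) {
        @Override
        public void batchProcessed(int rowsInThisBatch) {
            // invoked after each batch of (up to) 100 rows
            Map<String, List<String>> columnValues = getColumnValuesAsMapOfNames();
            // consume or discard the batch here to keep memory usage stable
        }
    });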

Use your own conversion implementations to parse annotated JavaBeans

This one is easy: to use a custom conversion, simply annotate the fields you need with @Convert. The following example uses the custom conversion class WordsToSetConversion, which extracts the words from the value parsed for the description field and adds them to a Set:

    class Car {

        @Parsed
        private Integer year;

        @Convert(conversionClass = WordsToSetConversion.class, args = { ",", "true" })
        @Parsed
        private Set<String> description;

        ...
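
WordsToSetConversion is your own code, not something shipped with the library. A minimal sketch of how it could look, assuming the args declared in the annotation are passed as Strings to the conversion's constructor, and that the second argument toggles lower-casing:

    class WordsToSetConversion implements Conversion<String, Set<String>> {

        private final String separator;
        private final boolean toLowerCase;

        public WordsToSetConversion(String... args) {
            this.separator = args.length > 0 ? args[0] : ",";
            this.toLowerCase = args.length > 1 && Boolean.parseBoolean(args[1]);
        }

        @Override
        public Set<String> execute(String input) {
            Set<String> words = new TreeSet<String>();
            if (input == null) {
                return words;
            }
            for (String word : input.split(separator)) {
                word = word.trim();
                if (word.length() > 0) {
                    words.add(toLowerCase ? word.toLowerCase() : word);
                }
            }
            return words;
        }

        @Override
        public String revert(Set<String> input) {
            StringBuilder out = new StringBuilder();
            for (String word : input) {
                if (out.length() > 0) {
                    out.append(separator);
                }
                out.append(word);
            }
            return out.toString();
        }
    }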

 

More details in our updated tutorial.

Well, that's about it for this new release. We hope you enjoy it!


Download version 1.3.0 here.

Check this and other projects on our github page.

November 24, 2014 by Jeronimo Backes

uniVocity 1.0.6 is out - recommended update

We've just released the 1.0.6 maintenance release of uniVocity, which fixes a few bugs and provides better information in case of exceptions. This is the changelog:

  • Performance improvements
  • Providing better error information when text-based data entities (CSV, TSV and Fixed-Width) can't be parsed, with suggested causes of errors
  • Logging all exceptions by default
  • Resolved issue with incorrect metadata being inserted when mappings' persistence is configured with updateNewRows
  • Removed "unexpected engine state" exception when the data integration engine can't be started due to erros in configuration.
  • Ensuring RowReader.cleanup method is invoked by uniVocity with a valid RowMappingContext object when no rows are processed
  • Resolved issue with ResultSets from queries used as source entities being closed unexpectedly

The public API remains unchanged from version 1.0.5.

Please upgrade to version 1.0.6 here.

November 14, 2014 by Jeronimo Backes

uniVocity-parsers 1.2.0 is here!

We just released another minor version of our text parsing suite, uniVocity-parsers, with the following improvements:

  • Improved parsing performance slightly
  • Parsing exceptions provide detailed error messages with possible root causes.
  • Supporting inputs with null characters in their values (the infamous '\0')
  • Parsers won't need to check for '\0' to test whether there's no more input to read. This includes any parser you may have written on top of our architecture.
  • Changing character and record counters to use long instead of int, as huge inputs will exceed Integer.MAX_VALUE.

EDIT: we released maintenance version 1.2.1 after receiving a few pull requests with very interesting contributions. Thanks to adessaigne.

  • Adding OSGi bundle information in MANIFEST.MF
  • When parsing using parseNext(), the RowProcessor is called as well

This release doesn't introduce exciting new features for most users, but we recommend upgrading due to the improved stability, error reporting and performance. If you haven't tried uniVocity-parsers yet, check out its unique features and switch to our parsers to process your CSV, TSV or Fixed-Width files.

Download version 1.2.0 here.

If you are a new user, don't forget to check our tutorial.

Check this and other projects on our github page.

November 12, 2014 by Jeronimo Backes

uniVocity 1.0.5 released with minor bug fixes

We've just released the 1.0.5 maintenance release of uniVocity, which fixes the hanging "open folder" button on the license request wizard. The hang was caused by a bug in the JDK (versions 6 and 7), so we removed the button.

Additionally, the StringWriterProvider class in our public API had a bug that caused it to incorrectly report whether or not contents had been written to the output. This caused issues when running the examples with batch operations disabled.

Please upgrade to version 1.0.5 here.

October 18, 2014 by Jeronimo Backes

uniVocity 1.0.4 released - now with built-in TSV support

We've just released another maintenance release of uniVocity. Version 1.0.4 includes built-in support for TSV data stores, plus lots of bug fixes we identified when running uniVocity to generate text-based data with batching disabled.

We recommend that everyone working with file-based data stores and using the free license upgrade to this new version.

Download version 1.0.4 here.

 

October 14, 2014 by Jeronimo Backes

uniVocity-parsers now with TSV support

We just released the first minor version of uniVocity-parsers, which includes TSV support! A few performance improvements were also included.

What was blazing fast is now even faster!

Check out our CSV parser comparison project on our github account. You won't find a faster parser with the same features you get in uniVocity-parsers. And better yet: all parsers in our suite share the same architecture, so the performance improvements we introduce benefit every one of our parsers!

Download version 1.1.0 here, and don't forget to check our updated tutorial.

October 11, 2014 by Jeronimo Backes

uniVocity 1.0.3 released - improving your ETL processes even further

The 1.0.3 maintenance release of uniVocity is out, with further performance improvements, less annoying log messages, better statistics, and minor additions to the public API.

We have also created a new tutorial on our github page: Import world cities using uniVocity. You can just clone the repository and test how uniVocity performs while loading more than 3 million rows into your preferred database (spoiler alert: it took around 1 minute to load everything into a MySQL database using an Ultrabook). We are pretty sure you can do much better on your enterprise-class servers.

This new tutorial is included in the 1.0.3 release package; all you have to do is import it as a Maven project and give it a spin.

Go ahead and download version 1.0.3 here.

 

October 03, 2014 by Jeronimo Backes