Text vs. Tag Analytics

InvestorScopes is a great site for searching S.E.C. filings for keywords and generating charts that trend those keywords over time. Once they combine this with data extracted from Twitter and the blogosphere they will have a very rich content solution for tracking and contextualising interesting stuff that is emerging from the S.E.C. filing ecosystem.

When our VP of R&D Emily Huang first alerted me to this site I assumed that it leveraged the XBRL filing data, the growing financial data ecosystem I’m always banging on about on this blog. But it doesn’t look like it does, based on this statement from the InvestorScopes website: ‘the textual content from these files is the default dataset for our trending application’. That prompts me to revisit the difference between document-centric text analytics and (XBRL) data-centric tag analytics.

My assumption is that InvestorScopes are full-text indexing the S.E.C. filing documents (not the XBRL instance files) and applying their patent-pending algorithms to help them deliver their search results and trending charts. For example if you search for ‘private jet’ one of the items returned is:

Note 16 Related Party Transactions
The Company paid John H. Sykes, the Company’s founder, former Chairman and Chief Executive Officer and current major shareholder of the Company and the father of Charles Sykes, President and Chief Executive Officer of the Company, less than $0.1 million for the use of his private jet during the three and six months ended June 30, 2010, which is based on two times fuel costs and other actual costs incurred for each trip (none in the comparable 2009 period).

This is great, and InvestorScopes trending is ideal for quickly figuring out whether the term ‘private jet’ is appearing more or less often in this recessionary economy compared to the profligate days of the past. But text-based keyword searching assumes that we all agree on what ‘private jet’ means because we take the meaning of the text term at face value. Replace ‘private jet’ with something like ‘earnings before taxes’ or some other esoteric accounting term that could mean different things in different companies or industry sectors and the possibility for confusion clearly exists. That confusion potentially makes the trending of the data (in this case S.E.C. data) less meaningful.
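
To make the point concrete, here is a minimal sketch of keyword trending in Python. It is not how InvestorScopes works (their algorithms are not public); the sample texts, the period labels and the function name keyword_trend are all made up for illustration.

```python
from collections import Counter

# Hypothetical sample data: (filing period, filing text). Not real filings.
FILINGS = [
    ("2009-Q2", "No related party transactions involved a private jet."),
    ("2010-Q2", "The Company paid for use of his private jet during the period."),
    ("2010-Q2", "Private jet costs are based on two times fuel costs."),
    ("2010-Q2", "Earnings before taxes increased."),
]

def keyword_trend(filings, phrase):
    """Count filings per period whose text contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    counts = Counter()
    for period, text in filings:
        if needle in text.lower():
            counts[period] += 1
    # Sort by period label so the result reads as a time series.
    return dict(sorted(counts.items()))

trend = keyword_trend(FILINGS, "private jet")
print(trend)
```

Note that the 2009 filing counts as a hit even though it says no jet was involved: the match is on the characters, not on what they mean.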

That’s why document-centric text analysis is different from data-centric tag analysis. If you have found data by textual keyword searching you can’t be sure that the data you have found is actually about the same thing. It probably is and you probably will be able to figure it out if it isn’t because you are a smart human – for example if the data actually referred to a deal done with US Army ‘Private Jet’ (some enterprising Bilko type), which you would know to ignore.

But that’s why text-based search results aren’t always reliable: they lack access to a metadata definition that ensures true consistency and comparability in the data that the search analyzes. Many proprietary text analytics algorithms are essentially designed to facilitate search result consistency in a world of untagged data. They are needed precisely because the source data is largely untagged with helpful metadata. Without access to metadata, the systems that consume textual data are bound to mix up data that is consistent with the search term with data that only seems to be.

However, if you have found data by searching for it by tag you can be pretty sure that it is all truly consistent and comparable (assuming it wasn’t tagged in error). That’s because it is the definition of the tag (element) that defines the meaning and context of the data the tag ‘surrounds’, not the textual data itself, if you see what I mean. So searching S.E.C. data by (XBRL) data tag is likely to be more accurate, consistent and comparable than searching it by text keyword (although I’m happy to be proved wrong).
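
A small Python sketch of the contrast, under loud assumptions: the companies, values and element names below are invented for illustration and are not taken from any real XBRL taxonomy or filing. Each fact carries both the company's own label and the tag (element) it was reported under.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    company: str
    element: str   # tag (element) name; illustrative only, not a real taxonomy name
    label: str     # the company's own wording for the line item
    value: float

# Hypothetical facts, not real filing data.
FACTS = [
    Fact("Alpha Co", "IncomeBeforeTaxes", "Earnings before taxes", 120.0),
    Fact("Beta Inc", "IncomeBeforeTaxes", "Pre-tax income", 95.0),
    Fact("Gamma Ltd", "OperatingIncome", "Earnings before taxes and interest", 140.0),
]

def search_by_text(facts, phrase):
    """Keyword search: match on the wording each company chose."""
    return [f for f in facts if phrase.lower() in f.label.lower()]

def search_by_tag(facts, element):
    """Tag search: match on the element the value was reported under."""
    return [f for f in facts if f.element == element]

text_hits = search_by_text(FACTS, "earnings before taxes")
tag_hits = search_by_tag(FACTS, "IncomeBeforeTaxes")
print([f.company for f in text_hits])
print([f.company for f in tag_hits])
```

In this toy data the text search misses Beta Inc (same concept, different wording) and picks up Gamma Ltd (similar wording, different concept), while the tag search returns the two like-for-like values.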

Keyword searching and trending is a lot of fun and undoubtedly genuinely useful for analyzing the text of S.E.C. document filings rather than the numbers. But text-based keyword searching does not ensure like-for-like comparability in the way that tag-based searching can, especially for numeric data. That becomes more important when financially significant decisions may be made based on the analysis results.


  • Persaples68

    Great article. It is very helpful to have an understanding of the differences of text vs. tag and how XBRL works. Thanks also for pointing out InvestorScopes, quite interesting information there and seems like a great resource for non-numerical trending as you mentioned.

    • Stewart Mckie

      Thanks for your comment. Of course the ideal solution is to have a combined analysis tool that does both: one that uses text analysis for finding textual items of interest and tag analysis for scrutinizing the numbers via the tag metadata. If I really want to analyze corporate private jet use I want to know who is using the jets and how, which text analysis can help with, and how much is being spent on them, which tag analysis could help with. Then I get the full picture with optimum financial comparability.

      • Kenlogen

        This is really very useful information. I really like this blog and Investorscopes.com. How will the XBRL tag metadata give the financial information? Suppose you search for 'private jet' across all the filings: then all the financial data (XBRL tag metadata) for those filings that contain the keyword 'private jet' should be visible to users. But I am not sure how you can see how much was spent on the private jet from the XBRL data. Can you explain this? It would really help users make their investment decisions.

        • Stewart Mckie

          This is a great question because it highlights an issue that relates to note tagging. When a note is tagged as a block there is likely to be financial data within the note that remains buried and not connected to tagged numbers elsewhere in a report. For example our private jet note contains the cost the company was billed for corporate use of the jet. But you can't drill up from or down to this number because it won't be tagged itself nor will it refer to some other 'operational expenses' tag. In this case you, as a human, must extrapolate the financial data from the note and make the connection manually. What would be interesting would be to extract all notes from the XBRL instance, find all numbers within those notes and tag them with selected keywords found by analyzing the text around them. Then our text search for 'jet' could not only bring up the text but also highlight numerics that have been found and associated with the keyword 'jet'.
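
The idea floated in the reply above (find the numbers inside a note and tag them with keywords found in the surrounding text) could be sketched roughly as follows in Python. This is a toy under stated assumptions, not an existing tool: the regular expressions, the 80-character window and the function name tag_amounts are all illustrative choices, and the note text is an abridged version of the private jet note quoted in the post.

```python
import re

# Dollar amounts such as "$0.1 million" or "$1,250"; a deliberately simple pattern.
AMOUNT = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)\s*(million|billion|thousand)?", re.IGNORECASE)
WORD = re.compile(r"[a-z]+")

def tag_amounts(note_text, keywords, window=80):
    """Pair each dollar amount with the keywords found within `window` characters of it."""
    results = []
    for match in AMOUNT.finditer(note_text):
        start = max(0, match.start() - window)
        end = min(len(note_text), match.end() + window)
        # Whole lowercase words near the amount.
        nearby = set(WORD.findall(note_text[start:end].lower()))
        found = sorted(k for k in keywords if k in nearby)
        results.append({"amount": match.group(0).strip(), "keywords": found})
    return results

NOTE = ("The Company paid its founder less than $0.1 million for the use of "
        "his private jet during the three and six months ended June 30, 2010.")

tagged = tag_amounts(NOTE, {"jet", "lease"})
print(tagged)
```

A text search for 'jet' could then surface the note together with the amounts associated with that keyword. The result is still only a guess at meaning, so it complements proper tagging rather than replacing it.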

  • D Stephenson

    It seems to me that text-based searching through Wordles or the site you mentioned is sometimes revelatory (in the homeland security part of my work I've analyzed several bin Laden speeches, and was surprised that the word that came up far more frequently than 'the U.S.' was 'corporations'. Who would have predicted that?) but is at best justifiable only after you've used the full range of tag-based searching, especially to allow apples-to-apples comparisons between two companies.

    • Stewart Mckie

      Thanks for your comment. As I said below and you have suggested, it's the combination of the two types of analysis that could prove very powerful. By doing the tag-based search first to find 'chunks' of data that are truly comparable and then also text searching in relation to these comparable-chunks you stand a better chance of ensuring (a) you are comparing apples-to-apples and (b) the text matches are within a well-scoped contextual domain.

      I'm assuming you are the D Stephenson of the forthcoming book 'Data Dynamite: liberate data to transform our world'. Perhaps you would like to post your chapter 1 download link here as a comment?