InvestorScopes is a great site for searching S.E.C. filings for keywords and generating charts that trend those keywords over time. Once they combine this with data extracted from Twitter and the blogosphere they will have a very rich content solution for tracking and contextualising interesting stuff that is emerging from the S.E.C. filing ecosystem.
When our VP of R&D Emily Huang first alerted me to this site I assumed that it leveraged the XBRL filing data, the growing financial data ecosystem I’m always banging on about on this blog. But it doesn’t look like it does, based on this statement from the InvestorScopes website: ‘the textual content from these files is the default dataset for our trending application’. That prompts me to revisit the difference between document-centric text analytics and (XBRL) data-centric tag analytics.
My assumption is that InvestorScopes are full-text indexing the S.E.C. filing documents (not the XBRL instance files) and applying their patent-pending algorithms to help them deliver their search results and trending charts. For example, if you search for ‘private jet’, one of the items returned is:
Note 16 Related Party Transactions
The Company paid John H. Sykes, the Company’s founder, former Chairman and Chief Executive Officer and current major shareholder of the Company and the father of Charles Sykes, President and Chief Executive Officer of the Company, less than $0.1 million for the use of his private jet during the three and six months ended June 30, 2010, which is based on two times fuel costs and other actual costs incurred for each trip (none in the comparable 2009 period).
This is great, and InvestorScopes trending is ideal for quickly figuring out if the term ‘private jet’ is appearing more or less in this recessionary economy compared to the profligate days of the past. But text-based keyword searching assumes that we all agree on what ‘private jet’ means, because we take the meaning of the text term at face value. Replace ‘private jet’ with something like ‘earnings before taxes’, or some other esoteric accounting term that could mean different things in different companies or industry sectors, and the possibility for confusion clearly exists. That confusion potentially makes the trending of the data, in this case S.E.C. data, less meaningful.
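To make the idea of keyword trending concrete, here is a minimal sketch in Python. It is not how InvestorScopes works (their algorithms are patent-pending and not described on the site); it simply counts, per year, how many documents contain a phrase. The `filings` list and the `keyword_trend` function are made up for illustration.

```python
from collections import Counter

# Hypothetical (year, text) pairs standing in for filing excerpts.
filings = [
    (2008, "The Company leased a private jet for executive travel."),
    (2009, "Use of the private jet was discontinued during the year."),
    (2009, "No related party transactions occurred in the period."),
    (2010, "The Company paid for the use of his private jet."),
]


def keyword_trend(docs, phrase):
    """Count documents per year whose text contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    counts = Counter()
    for year, text in docs:
        if needle in text.lower():
            counts[year] += 1
    # Return the counts ordered by year so they can be charted directly.
    return dict(sorted(counts.items()))


trend = keyword_trend(filings, "private jet")
print(trend)
```

Note that the count is driven purely by the characters in the text. Nothing in it knows whether two matching documents are talking about the same thing.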
That’s why document-centric text analysis is different from data-centric tag analysis. If you have found data by textual keyword searching you can’t be sure that the data you have found is actually about the same thing. It probably is, and if it isn’t you will probably be able to figure that out because you are a smart human. For example, if the data actually referred to a deal done with US Army ‘Private Jet’ (some enterprising Bilko type), you would know to ignore it.
But that’s why text-based search results aren’t always reliable: they lack access to a metadata definition that would ensure true consistency and comparability in the data that the search analyzes. Many proprietary text analytics algorithms are essentially designed to facilitate search result consistency in a world of untagged data. They are needed precisely because the source data is largely untagged with helpful metadata. Without access to metadata, the systems that consume textual data are bound to mix up data that is consistent with the search term with data that only seems to be.
However, if you have found data by searching for it by tag you can be pretty sure that it is all truly consistent and comparable (assuming it wasn’t tagged in error). That is because it’s the definition of the tag (element) that defines the meaning and context of the data the tag ‘surrounds’, not the textual data itself, if you see what I mean. So searching S.E.C. data by (XBRL) data tag is likely to be more accurate, consistent and comparable than searching it by text keyword (although I’m happy to be proved wrong).
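A small Python sketch may help show the difference between the two kinds of search. Everything in it is an assumption for illustration: the companies, the values, and the tag names (`IncomeBeforeTaxes`, `AdjustedEarnings`) are invented and are not real XBRL taxonomy elements. The point is only that one search matches on the label text while the other matches on the tag.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    company: str
    tag: str      # hypothetical element name, not a real taxonomy element
    label: str    # the wording the company chose in its document
    value: float  # made-up figure, in millions


facts = [
    Fact("A", "IncomeBeforeTaxes", "Earnings before taxes", 120.0),
    Fact("B", "IncomeBeforeTaxes", "Pre-tax income", 95.0),
    Fact("C", "AdjustedEarnings", "Earnings before taxes and one-off items", 80.0),
]


def search_by_text(items, phrase):
    """Keyword search: match on the wording only."""
    needle = phrase.lower()
    return [f for f in items if needle in f.label.lower()]


def search_by_tag(items, tag):
    """Tag search: match on the element that defines the meaning."""
    return [f for f in items if f.tag == tag]


text_hits = search_by_text(facts, "earnings before taxes")
tag_hits = search_by_tag(facts, "IncomeBeforeTaxes")

print([f.company for f in text_hits])  # companies matched by wording
print([f.company for f in tag_hits])   # companies matched by tag
```

In this toy data the keyword search picks up company C, whose figure is defined differently, and misses company B, which reports the same concept under different wording. The tag search returns A and B, which are like-for-like, provided of course that the tagging itself was done correctly.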
Keyword searching and trending is a lot of fun and genuinely useful for analyzing the text of S.E.C. document filings rather than the numbers. But text-based keyword searching does not ensure like-for-like comparability in the way that tag-based searching can, especially for numeric data. That becomes more important when financially significant decisions may be made based on the analysis results.