
Tiered Approach to Data Culling in eDisclosure


Article authored by Marina Goddard, Senior Paralegal in Dispute Resolution 

Data culling is the process of removing irrelevant content from a document review exercise. It reduces the need for a linear review, which in turn lowers costs and makes the exercise quicker and more efficient.

This blog post considers the generally adopted two-tier approach to data culling at the initial data interrogation stages of eDisclosure, prior to commencing in-depth analysis of documents. One caveat: the approach reflected within this blog post is software agnostic and applies to post-collection in-house data only.

Tier 1: Utilising built-in analytical tools

The first tier involves the application of technical filters to remove irrelevant data.

  • DeNISTing is a process of eliminating known standard system and program files that do not have any user-generated data with evidentiary value. The ‘NIST’ stands for the National Institute of Standards and Technology, a federal agency under the U.S. Department of Commerce that promotes measurement science, standards and technology in various fields.
  • Deduplication identifies duplicate documents based on a hash value, which is essentially a unique fingerprint calculated from the binary contents of a document and its intrinsic metadata (a short sketch of this follows below).
  • Email threading analyses email relationships and groups them by their original thread, allowing the bulk removal of non-inclusive emails from the reviewable dataset.

These processes are well-established industry standards in eDisclosure and are typically included by default in eDisclosure processing software.
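
To make the hash-based steps concrete, here is a minimal, software-agnostic sketch of DeNISTing and deduplication in Python. The hash set and function names are hypothetical; in practice, processing platforms compute these fingerprints (typically MD5 or SHA-1) automatically and compare them against the NIST reference hash list.

```python
import hashlib

# Hypothetical set of known system/program file hashes; in practice this would be the
# NIST National Software Reference Library (NSRL) hash list used for DeNISTing.
NIST_HASHES = {"d41d8cd98f00b204e9800998ecf8427e"}

def file_hash(path):
    """Return an MD5 fingerprint of a file's binary contents."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def denist_and_dedupe(paths):
    """Drop known system files (DeNISTing), then keep one copy per unique hash (deduplication)."""
    seen = set()
    kept = []
    for path in paths:
        h = file_hash(path)
        if h in NIST_HASHES:   # known standard file with no user-generated content
            continue
        if h in seen:          # identical binary content already retained
            continue
        seen.add(h)
        kept.append(path)
    return kept
```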

Tier 2: Refining the scope

The second tier lies at the intersection of law and technology, requiring both a sound grasp of the intricacies of data processing and an understanding of the legal issues at hand in any given matter.

  1. Keyword searches

Harnessing advanced search techniques based on a combination of, inter alia, Boolean operators, wildcards and proximity operators helps to further refine the scope of the reviewable dataset by eliminating non-responsive data in bulk.
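
By way of illustration, the sketch below mimics a wildcard term and a simple proximity condition in plain Python. The function names and the five-word window are purely illustrative assumptions; real review platforms apply their own query syntax at scale.

```python
import re

def wildcard_to_regex(term):
    """Convert a simple wildcard term such as 'terminat*' into a whole-word regex."""
    return re.compile(r"\b" + re.escape(term).replace(r"\*", r"\w*") + r"\b", re.IGNORECASE)

def within_n_words(text, term_a, term_b, n=5):
    """True if any hit for term_a falls within n words of a hit for term_b."""
    pattern_a, pattern_b = wildcard_to_regex(term_a), wildcard_to_regex(term_b)
    words = text.split()
    hits_a = [i for i, w in enumerate(words) if pattern_a.search(w)]
    hits_b = [i for i, w in enumerate(words) if pattern_b.search(w)]
    return any(abs(a - b) <= n for a in hits_a for b in hits_b)

text = "The supplier may terminate the agreement on thirty days' notice."
print(within_n_words(text, "terminat*", "agreement", n=5))  # True
```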

A word of caution: keyword searches have their limitations. Thus, it is important to allocate sufficient time to devising a search methodology to optimise search results. There are a few things to note here.

Firstly, the importance of an accurate search syntax cannot be overstated: a poorly calibrated search string risks throwing the baby out with the bathwater. Drafting a search syntax is often exploratory, involving initial input from key stakeholders followed by a round of test searches to refine the search string.

Secondly, as keyword searches are binary in nature, it is worth considering more sophisticated alternatives, e.g. concept searching (software permitting).

Lastly, when deploying keyword searches, it is crucial to segregate processing exceptions (e.g. audio/visual files) so that they can be addressed separately, as they may still fall within the scope of the review.

  2. Multimodal filtering

The overarching goal of this stage is to suppress immaterial items falling outside the mandated, defensible criteria (e.g. email domains, filetypes, custodians, date ranges), typically with the assistance of a data landscape reporting tool.
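
As a rough illustration of this kind of filtering, the sketch below applies hypothetical criteria (sender domain, filetype, custodian and date range) to a small set of documents. The field names and values are invented for the example; in practice, the equivalent filters are applied within the review platform against the fields surfaced by its reporting tools.

```python
from datetime import date

# Hypothetical, agreed criteria for the matter.
EXCLUDED_DOMAINS = {"newsletter.example.com", "noreply.example.com"}
EXCLUDED_FILETYPES = {"zip", "pst"}          # container files are typically handled separately
IN_SCOPE_CUSTODIANS = {"A. Smith", "B. Jones"}
DATE_FROM, DATE_TO = date(2021, 1, 1), date(2023, 12, 31)

def in_scope(doc):
    """Apply the agreed, defensible criteria; anything failing them is suppressed."""
    return (
        doc["sender_domain"] not in EXCLUDED_DOMAINS
        and doc["file_type"] not in EXCLUDED_FILETYPES
        and doc["custodian"] in IN_SCOPE_CUSTODIANS
        and DATE_FROM <= doc["sent_date"] <= DATE_TO
    )

docs = [
    {"sender_domain": "client.example.com", "file_type": "msg",
     "custodian": "A. Smith", "sent_date": date(2022, 6, 1)},
    {"sender_domain": "newsletter.example.com", "file_type": "msg",
     "custodian": "A. Smith", "sent_date": date(2022, 6, 1)},
]
reviewable = [d for d in docs if in_scope(d)]   # only the first document survives the cull
```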

Some of these decision points might be evident from the outset (e.g. removal of container files and spam emails), whilst others might crystallise during discussions within the legal team once initial data sampling to investigate underlying patterns is complete.

  3. Near duplicate detection

Near duplicate detection involves identifying and grouping textually similar documents. Unlike deduplication, it assesses the textual content of a document rather than matching hash values and metadata: the system parses every document containing text and compares each one against the others to determine textual similarity.

Depending on the scope of the review task and the nature of the near duplicates in question, some near duplicates may be suppressed from the reviewable dataset. By way of example, there might be multiple iterations of the same agreement, with only the final signed version falling within the scope of the review. A word of warning: it is crucial to apply careful discretion and adopt a surgical approach to discrete categories of near duplicates to ensure that no potentially responsive documents are left behind.
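
The sketch below illustrates the principle using a simple pairwise similarity score. Production platforms rely on more scalable techniques (for example, shingling), and the 0.9 threshold and grouping logic shown here are arbitrary assumptions for the example.

```python
from difflib import SequenceMatcher

def similarity(text_a, text_b):
    """Return a 0-1 textual similarity score between two extracted texts."""
    return SequenceMatcher(None, text_a, text_b).ratio()

def group_near_duplicates(docs, threshold=0.9):
    """Group documents whose text is at least `threshold` similar to a group's first member."""
    groups = []
    for doc_id, text in docs.items():
        for group in groups:
            if similarity(text, docs[group[0]]) >= threshold:
                group.append(doc_id)
                break
        else:
            groups.append([doc_id])
    return groups

docs = {
    "DOC001": "This agreement is made on 1 June 2022 between Alpha Ltd and Beta Ltd.",
    "DOC002": "This agreement is made on 2 June 2022 between Alpha Ltd and Beta Ltd.",
    "DOC003": "Minutes of the board meeting held on 14 March 2023.",
}
print(group_near_duplicates(docs))  # [['DOC001', 'DOC002'], ['DOC003']]
```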

Conclusion

Whilst identifying and eliminating out-of-scope content at the outset is instrumental in controlling review costs and maximising review efficiency, it is crucial to ensure that your methodology is legally defensible and that you are confident about what is left on the chopping block.

Firstly, consider conducting random sampling of the excluded datasets and make adjustments if required. Secondly, consider maintaining a culling log documenting all key decision points for audit purposes. Last but not least, ring-fence the processed content and keep it active so that it can easily be revisited in the event of any changes to the scope of the review.
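
For the sampling point, a minimal sketch is set out below. The sample size and seed are arbitrary assumptions, and the appropriate sampling approach in any given matter remains a question of judgment (and, where appropriate, agreement between the parties).

```python
import random

def qc_sample(excluded_doc_ids, sample_size=200, seed=42):
    """Draw a reproducible random sample of excluded documents for manual spot-checking."""
    rng = random.Random(seed)
    if len(excluded_doc_ids) <= sample_size:
        return list(excluded_doc_ids)
    return rng.sample(list(excluded_doc_ids), sample_size)

# If the spot-check surfaces responsive material, the relevant culling criterion is revisited
# and the affected documents are promoted back into the reviewable dataset.
```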
