
Linux sed remove duplicates and get unique values



Removing duplicates and getting unique values is quite easy when the input data follows a convenient format, for example when each value is on its own line or the values are separated by spaces.

A dilemma occurs when the values in the string are not separated by spaces but by dashes.

So, how do you remove the duplicates, get the unique values and still retain the format of the raw data?

Take this raw data (just a sample string):
the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit

Suppose you try to remove duplicates and get the unique values with this command:

sort duplicates.txt | uniq (this works only when each value is on its own line)

Here duplicates.txt is assumed to contain the string shown above.

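Sample output:
the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit

The output is exactly the same as the input. Why? Because sort and uniq compare whole lines, and the dashes join all the words into a single line. There is no delimiter that puts each word on its own line, so the whole string is treated as one literal value.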
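A minimal sketch of that behaviour, using a shortened sample string (an assumption made for brevity, not the post's data). It compares what sort and uniq do with one dash-joined line against the same words given one per line:

```shell
#!/bin/sh
# Shortened sample string (hypothetical, chosen only to keep the demo small)
one_line='the-lazy-dog-the-lazy-cat'

# One line in, the same line out: there is nothing to compare it with
same=$(printf '%s\n' "$one_line" | sort | uniq)
echo "$same"

# The same words, one per line: now sort and uniq can drop the repeats
deduped=$(printf '%s\n' the lazy dog the lazy cat | sort | uniq)
echo "$deduped"
```

The first command prints the string unchanged, while the second prints each word only once.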

So, if the requirement is to remove the duplicates, get the unique values and keep the dashes in the final output, how can it be done?

Removing the duplicates, getting the unique values and still retaining the same format can be done in 3 steps.
  1. Remove the dashes from the raw data
  2. Remove the duplicates and get the unique values
  3. Put the dashes back into the final output
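The three steps can also be sketched as a single pipeline with no intermediate files. This is an alternative under my own assumptions, not the script from this post: it uses tr and paste instead of sed and xargs, and the variable names are made up for the example.

```shell
#!/bin/sh
# Sample string from the post
input='the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit'

# Step 1: tr turns each dash into a newline, so every word sits on its own line
# Step 2: sort -u sorts the words and keeps one copy of each
# Step 3: paste -s joins the lines back into one, with a dash as the delimiter
unique=$(printf '%s\n' "$input" | tr '-' '\n' | sort -u | paste -s -d '-' -)
echo "$unique"    # brown-cat-dog-fox-jumps-lazy-over-quick-rabbit-the
```

Splitting on newlines rather than spaces means the data may contain spaces inside a value without being broken apart.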


Here’s the Bash Shell code:

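#!/bin/sh

# Input file: one line of dash-separated words, possibly with duplicates
xfile="file_with_dashes_and_duplicates.txt"

# Step 1: replace every dash with a space
sed 's/-/ /g' "$xfile" > "$HOME/file_with_no_dashes.txt"

# Step 2: one word per line, sort and drop duplicates, then rejoin on one line
xargs -n1 < "$HOME/file_with_no_dashes.txt" | sort -u | xargs > "$HOME/file_no_duplicate.txt"
file="$HOME/file_no_duplicate.txt"

# Step 3: replace every space with a dash
sed 's/ /-/g' "$file" > "$HOME/final_output_no_duplicates_unique_content.txt"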

=======
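sed 's/-/ /g' == find every dash “-” and substitute a space for it. The g (global) flag applies the substitution to every match on the line, not only the first one

sed 's/ /-/g' == find every space and substitute a dash for it; g (global) again means all matches on the line

xargs -n1 | sort -u | xargs == print one word per line, sort the words and keep only the unique ones, then join them back into a single line

> file_no_duplicate.txt == I/O redirection; the output of the command is written to the specified file instead of the screen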
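To see what the g flag changes, here is a small sketch on a shortened string (the variable names are only for illustration):

```shell
#!/bin/sh
sample='the-quick-brown-fox'

# Without g, only the first dash on the line is replaced
first_only=$(printf '%s\n' "$sample" | sed 's/-/ /')
echo "$first_only"    # the quick-brown-fox

# With g, every dash on the line is replaced
all_dashes=$(printf '%s\n' "$sample" | sed 's/-/ /g')
echo "$all_dashes"    # the quick brown fox

# The reverse substitution puts the dashes back
restored=$(printf '%s\n' "$all_dashes" | sed 's/ /-/g')
echo "$restored"      # the-quick-brown-fox
```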

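Output of the above script at each step, for the sample string:

Raw file or input data:
the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit

File or data with no dashes:
the quick brown fox jumps over the lazy dog jumps over the lazy cat jumps over the rabbit

File or data with no duplicates:
brown cat dog fox jumps lazy over quick rabbit the

The final output:
brown-cat-dog-fox-jumps-lazy-over-quick-rabbit-the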


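Because sort -u sorts its input, the script returns the unique words in alphabetical order, not in the order they first appeared. If the original order matters, one possible variation (an assumption on my part, not part of the script above) is to let awk keep only the first occurrence of each word:

```shell
#!/bin/sh
# Sample string from the post
input='the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit'

# awk prints a line only the first time it is seen, so the word order is kept
ordered=$(printf '%s\n' "$input" | tr '-' '\n' | awk '!seen[$0]++' | paste -s -d '-' -)
echo "$ordered"    # the-quick-brown-fox-jumps-over-lazy-dog-cat-rabbit
```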

Cheers. Enjoy Linux!!!
